Analyticsphere/qaqc_testing

QAQC Testing Repository

A repository for the Connect4Cancer Quality Assurance and Quality Control (QAQC) process.


Overview

The QAQC system validates data quality for the Connect4Cancer study by running configurable validation rules against BigQuery datasets. The system generates Excel reports highlighting data quality issues and can be executed via API, locally, or in parallel batch processing.


Architecture

The system consists of three main execution patterns:

1. API Execution (api.R)

  • Purpose: RESTful API for on-demand QAQC execution
  • Usage: Production deployments via Cloud Run
  • Parameters: dataset, rule range, data range
  • Environment: Uses environment variables for parameter passing
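As a rough illustration of parameter passing through environment variables, the sketch below reads a dataset name and a rule range with `Sys.getenv()`. The variable names (`QAQC_DATASET`, `QAQC_MIN_RULE`, `QAQC_MAX_RULE`) and the helper `read_qaqc_params()` are hypothetical and are not taken from api.R.

```r
# Hypothetical helper: read QAQC parameters from environment variables.
# The variable names are illustrative assumptions, not those used by api.R.
read_qaqc_params <- function() {
  dataset  <- Sys.getenv("QAQC_DATASET", unset = "recruitment")
  min_rule <- suppressWarnings(as.integer(Sys.getenv("QAQC_MIN_RULE", unset = "1")))
  max_rule <- suppressWarnings(as.integer(Sys.getenv("QAQC_MAX_RULE", unset = "100")))

  # Fail early on values that cannot describe a rule range.
  if (is.na(min_rule) || is.na(max_rule) || min_rule > max_rule) {
    stop("min_rule and max_rule must be integers with min_rule <= max_rule")
  }
  list(dataset = dataset, min_rule = min_rule, max_rule = max_rule)
}

# Simulate the values a deployment might inject, then read them back.
Sys.setenv(QAQC_DATASET = "recruitment", QAQC_MIN_RULE = "1", QAQC_MAX_RULE = "100")
params <- read_qaqc_params()
```

Validating the range before any query runs keeps a misconfigured deployment from failing partway through a report.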

2. Local Development (qaqc.R)

  • Purpose: Direct script execution for development and testing
  • Usage: Local development and debugging
  • Configuration: Set parameters directly in script header

3. Parallel Processing (run_qaqc_in_parallel.R)

  • Purpose: Large-scale batch processing with chunked data
  • Usage: Processing entire datasets efficiently
  • Features: Automatic chunk management, retry logic, report concatenation
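The chunking and retry ideas listed for parallel processing can be sketched in base R as follows. This is a simplified, sequential illustration under stated assumptions: `make_chunks()` and `with_retry()` are hypothetical helpers, not functions from run_qaqc_in_parallel.R, and the per-chunk work is a placeholder.

```r
# Hypothetical helpers illustrating chunked processing with retries.

# Split n_rows into consecutive [start, end] row ranges of at most rows_per_chunk rows.
make_chunks <- function(n_rows, rows_per_chunk) {
  stopifnot(n_rows >= 1, rows_per_chunk >= 1)
  starts <- seq(1, n_rows, by = rows_per_chunk)
  ends   <- pmin(starts + rows_per_chunk - 1, n_rows)
  data.frame(start = starts, end = ends)
}

# Call f() up to max_attempts times and return the first successful result.
with_retry <- function(f, max_attempts = 3) {
  for (attempt in seq_len(max_attempts)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
  }
  stop("All ", max_attempts, " attempts failed")
}

chunks <- make_chunks(n_rows = 45000, rows_per_chunk = 20000)

# Placeholder work: a real run would execute the QAQC rules on each row range.
row_counts <- vapply(seq_len(nrow(chunks)), function(i) {
  with_retry(function() chunks$end[i] - chunks$start[i] + 1)
}, numeric(1))
```

Because each chunk is independent, the per-chunk reports can be produced separately and concatenated afterwards.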

Core Components

Data Retrieval Layer (data_retrieval/)

Specialized functions for merging and retrieving different dataset types:

  • get_merged_biospecimen_and_recruitment_data.R
  • get_merged_module_[1-4]_data.R
  • get_merged_[survey_type]_data.R

Each function handles dataset-specific joins, data type conversions, and deduplication.
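The three steps named above (deduplication, type conversion, joins) might look like this in base R. The tables and column names are toy stand-ins chosen for illustration. The actual retrieval functions query BigQuery and may be structured quite differently.

```r
# Toy data standing in for two source tables; column names are illustrative.
recruitment <- data.frame(
  Connect_ID = c("1001", "1002", "1002", "1003"),
  age        = c("34", "51", "51", "47"),
  stringsAsFactors = FALSE
)
biospecimen <- data.frame(
  Connect_ID = c("1001", "1003"),
  n_samples  = c(2L, 1L),
  stringsAsFactors = FALSE
)

# 1. Deduplicate on the participant key, keeping the first row per participant.
recruitment <- recruitment[!duplicated(recruitment$Connect_ID), ]

# 2. Convert types: age arrives as character and is needed as numeric.
recruitment$age <- as.numeric(recruitment$age)

# 3. Left join so participants without biospecimen rows are kept (with NA values).
merged <- merge(recruitment, biospecimen, by = "Connect_ID", all.x = TRUE)
```

Deduplicating before the join avoids multiplying rows when the key appears more than once on either side.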

Validation Engine (qaqc.R)

The core validation system supports a wide range of QC rule types for basic validation, data types, cross-variable logic, string/format checks, and date comparisons.

👉 See the QC Types Overview for a complete reference describing each rule type.

Configuration System

  • Environment-specific configs: dev/, stg/, prod/ directories
  • Dataset-specific settings: Rules files, output folders, flags
  • Deployment configs: Dockerfiles, Cloud Build configurations

QC Types (QCTypes) Overview

The QC types cover a range of validation rules that check whether the dataset meets defined standards of quality and consistency:

  • Basic Validations
  • Population Checks
  • Data Type Checks
  • Date and Time Validations
  • String Length Checks
  • Cross-Variable Validations
  • Matching Values Checks
  • Special Validations

Each rule type is described in detail below:

Basic Validations

| QC Type | Description |
|---|---|
| valid | Ensures ConceptID values are within the specified ValidValues. |
| NA or valid | Allows ConceptID to be within ValidValues or NA. |
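A minimal sketch of how these two rule types could be expressed, assuming a rule simply tests membership in a set of allowed values. `check_valid()` is a hypothetical helper, not a function from qaqc.R, and the values shown are illustrative.

```r
# Hypothetical helper: TRUE where a value passes, FALSE where it would be flagged.
check_valid <- function(x, valid_values, allow_na = FALSE) {
  ok <- x %in% valid_values          # NA is never in valid_values, so it fails here
  if (allow_na) ok <- ok | is.na(x)  # "NA or valid" additionally accepts missing values
  ok
}

# Illustrative responses and allowed values.
responses    <- c("353358909", "104430631", NA, "999")
valid_values <- c("353358909", "104430631")

passes_valid       <- check_valid(responses, valid_values)                   # "valid"
passes_na_or_valid <- check_valid(responses, valid_values, allow_na = TRUE)  # "NA or valid"
```

The only difference between the two rule types in this sketch is how a missing value is treated.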

Population Checks

| QC Type | Description |
|---|---|
| is populated | Checks that ConceptID is not NA. |
| is not populated | Verifies that ConceptID is NA. |

Data Type Checks

| QC Type | Description |
|---|---|
| isNumeric | Confirms that ConceptID can be converted to a numeric type. |
| NA or isNumeric | Allows ConceptID to be numeric or NA. |
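One way to test "can be converted to a numeric type" in R is to attempt the conversion and see whether it yields NA. The helper name `is_numeric_like()` is an assumption made for this sketch. The real rule may treat edge cases such as empty strings differently.

```r
# Hypothetical helper: TRUE where the value converts cleanly to numeric.
is_numeric_like <- function(x, allow_na = FALSE) {
  converted <- suppressWarnings(as.numeric(x))  # non-numeric strings become NA
  ok <- !is.na(converted)
  if (allow_na) ok <- ok | is.na(x)             # "NA or isNumeric" accepts missing input
  ok
}

values <- c("42", "3.14", "abc", NA)

passes_numeric       <- is_numeric_like(values)
passes_na_or_numeric <- is_numeric_like(values, allow_na = TRUE)
```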

Date and Time Validations

| QC Type | Description |
|---|---|
| valid before date() | Ensures ConceptID date is before a specified comparison date. |
| NA or valid before date() | Allows ConceptID date to be before the comparison date or NA. |
| is 24hr time | Validates that ConceptID follows the HH:MM 24-hour time format. |
| NA or is 24hr time | Allows ConceptID to be in 24-hour time format or NA. |
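The checks in this table could be approximated as below. Two assumptions are made: "before" is read as strictly earlier than the comparison date, and HH:MM is read as zero-padded (`08:30`, not `8:30`). Both helper names are hypothetical.

```r
# Hypothetical helper: TRUE where x is a zero-padded HH:MM time from 00:00 to 23:59.
is_24hr_time <- function(x, allow_na = FALSE) {
  ok <- !is.na(x) & grepl("^([01][0-9]|2[0-3]):[0-5][0-9]$", x)
  if (allow_na) ok <- ok | is.na(x)
  ok
}

# Hypothetical helper: TRUE where x parses as YYYY-MM-DD and is strictly before comparison_date.
is_before_date <- function(x, comparison_date, allow_na = FALSE) {
  parsed <- as.Date(x, format = "%Y-%m-%d")  # unparseable strings become NA
  ok <- !is.na(parsed) & parsed < as.Date(comparison_date)
  if (allow_na) ok <- ok | is.na(x)
  ok
}

times <- c("08:30", "23:59", "24:00", "7:05", NA)
dates <- c("2023-01-15", "2030-01-01", NA)

time_ok       <- is_24hr_time(times)
date_ok       <- is_before_date(dates, "2024-01-01")
date_ok_or_na <- is_before_date(dates, "2024-01-01", allow_na = TRUE)
```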

String Length Checks

| QC Type | Description |
|---|---|
| has_n_characters | Ensures ConceptID string has an exact number of characters. |
| has_less_than_or_equal_n_characters | Checks that ConceptID string does not exceed a specified maximum length. |
| NA or has_n_characters | Allows ConceptID to have the exact length or be NA. |
| NA or has_less_than_or_equal_n_characters | Allows ConceptID to be within the maximum length or be NA. |
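The length rules map naturally onto `nchar()`. The sketch below uses hypothetical helper names that mirror the QC types. It is an illustration, not the code in qaqc.R.

```r
# Hypothetical helper: TRUE where the string has exactly n characters.
has_n_characters <- function(x, n, allow_na = FALSE) {
  ok <- !is.na(x) & nchar(as.character(x)) == n
  if (allow_na) ok <- ok | is.na(x)
  ok
}

# Hypothetical helper: TRUE where the string has at most n characters.
has_at_most_n_characters <- function(x, n, allow_na = FALSE) {
  ok <- !is.na(x) & nchar(as.character(x)) <= n
  if (allow_na) ok <- ok | is.na(x)
  ok
}

zip_codes <- c("20850", "2085", NA)

exact_ok       <- has_n_characters(zip_codes, 5)
exact_ok_or_na <- has_n_characters(zip_codes, 5, allow_na = TRUE)
max_ok         <- has_at_most_n_characters(zip_codes, 5)
```

The explicit `!is.na(x)` guard matters because `nchar()` returns NA for a missing string, which would otherwise propagate into the result.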

Cross-Variable Validations

| QC Type | Description |
|---|---|
| crossValid1 → crossValid4 | Validates ConceptID based on up to 4 related variables. |
| crossValid1 isNumeric | Combines crossValid1 with numeric validation. |
| crossValid1 is populated | Combines crossValid1 with population check. |
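A conditional rule of this kind can be sketched as "the check applies only when a related variable takes certain values". That reading is an assumption about how crossValid1 behaves, and `cross_valid1()` and the toy columns below are hypothetical.

```r
# Hypothetical helper for a single-condition cross-variable rule.
# Rows where the condition does not hold pass automatically.
cross_valid1 <- function(x, valid_values, cross_var, cross_values) {
  applies <- cross_var %in% cross_values  # is the rule in force for this row?
  !applies | x %in% valid_values          # pass if not in force, or if the value is allowed
}

# Toy data: the answer is only checked for participants who consented.
survey <- data.frame(
  consented = c("yes", "yes", "no"),
  answer    = c("A", "Z", "Z"),
  stringsAsFactors = FALSE
)

cross_ok <- cross_valid1(
  x            = survey$answer,
  valid_values = c("A", "B"),
  cross_var    = survey$consented,
  cross_values = "yes"
)
```

Extending this to crossValid2 through crossValid4 would amount to combining several `applies` conditions with `&`.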

Matching Values Checks

| QC Type | Description |
|---|---|
| match cid values | Ensures ConceptID matches the value of another ConceptID. |
| crossvalid match cid values | Applies match cid values conditionally. |
| NA or match cid values | Allows match or NA. |
| NA or crossvalid match cid values | Allows conditional match or NA. |
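A matching check compares two columns row by row. The sketch below assumes that a missing value on either side counts as a mismatch unless NA is explicitly allowed. `match_cid_values()` is a hypothetical name.

```r
# Hypothetical helper: TRUE where x equals the value of another variable in the same row.
match_cid_values <- function(x, other, allow_na = FALSE) {
  ok <- !is.na(x) & !is.na(other) & x == other
  if (allow_na) ok <- ok | is.na(x)  # "NA or match cid values" accepts a missing x
  ok
}

reported  <- c("A", "B", NA)
reference <- c("A", "C", "A")

match_ok       <- match_cid_values(reported, reference)
match_ok_or_na <- match_cid_values(reported, reference, allow_na = TRUE)
```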

Special Validations

| QC Type | Description |
|---|---|
| crossValid1Date | Ensures ConceptID is a valid date based on condition. |
| crossValid1NotNA | Requires ConceptID to be non-NA based on condition. |
| crossValid1 equal to char() | Checks string length based on condition. |
| crossValid1 equal to or less than char() | Ensures string does not exceed length based on condition. |
| crossValid1 or is 24hr time | Validates cross-variable condition or 24-hour time. |
| NA or crossValid1 is 24hr time | Allows 24-hour time or NA based on condition. |

Usage Examples

Local Development

# Edit parameters in qaqc.R header
local_drive <- "/your/working/directory"
tier        <- "dev"
module      <- "recruitment"

# Then source the script
source("qaqc.R")

Parallel Processing

# Configure parameters in run_qaqc_in_parallel.R
tier           <- "prod"
dataset        <- "recruitment"
rows_per_chunk <- 20000

# Run the script
source("run_qaqc_in_parallel.R")

Output

All execution modes generate Excel reports with three sheets:

  • report
  • exclusions
  • rules

Reports include:

  • Participant identifiers
  • Rule details
  • Site information
  • Cross-reference lookups

Deployment and Automation

Cloud Build Process

Build Trigger

  • Trigger: ccc-generic-qaqc
  • Activation: Pushed changes to environment branches (stage, prod)
  • Configuration: Uses environment-specific cloudbuild.yaml

Build Process

  1. Docker Build
  2. Container Registry
  3. Cloud Run Deployment

Cloud Run API Service

The deployed API provides:

  • GET/POST / (health check)
  • GET/POST /run-qaqc (QAQC execution)

Automated Scheduling System

Cloud Scheduler

Cloud Scheduler → Cloud Run API → QAQC Processing → GCS Bucket → Box.com

Example Payload:

{
  "dataset": "recruitment",
  "min_rule": 1,
  "max_rule": 100
}
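To make the role of `min_rule` and `max_rule` concrete, the sketch below selects a slice of a rules table from a payload like the one above. It assumes the rule range refers to a numeric rule identifier. The `rule_id` column and the toy rules table are hypothetical.

```r
# The payload as it might look after parsing into an R list.
payload <- list(dataset = "recruitment", min_rule = 1, max_rule = 100)

# Toy rules table; a real run would load the rules file for the dataset.
rules <- data.frame(
  rule_id = 1:250,
  qc_type = "valid",
  stringsAsFactors = FALSE
)

# Keep only the rules inside the requested range (inclusive on both ends).
in_range <- rules$rule_id >= payload$min_rule & rules$rule_id <= payload$max_rule
selected <- rules[in_range, ]
```

Splitting a large rule set into ranges like this lets several scheduled jobs each handle a subset of the rules.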

Automated Report Distribution

  • Reports uploaded to GCS → transferred to Box
  • Folder mapping configured via config.yml

Complete Automation Workflow

graph LR
    A[Code Push to Branch] --> B[Cloud Build Trigger]
    B --> C[Docker Build & Deploy]
    C --> D[Cloud Run Service]
    E[Cloud Scheduler] --> D
    D --> F[QAQC Processing]
    F --> G[Upload to GCS]
    G --> H[Cloud Function]
    H --> I[Transfer to Box.com]

Monitoring and Maintenance

  • Cloud Build, Run, Scheduler, and Cloud Function logs available
  • Manual intervention possible via Cloud Console

Issues and Feature Requests

Use the Issues tab to submit requests for new validation rules.


File Structure

qaqc_testing/
├── api.R
├── qaqc.R
├── run_qaqc_in_parallel.R
├── data_retrieval/
│   └── get_merged_*_data.R
├── dev/
├── stg/
├── prod/
├── qc_rules_*.xlsx
└── exclusions/

Maintainers

  • Primary: Jake Peters (since January 2023)
  • Original Author: Daniel Russ (August 2022)
  • Team: C4C Analytics team

This system processes sensitive health data. Ensure all development and deployment follows institutional data security policies.
