A repository for the Connect4Cancer Quality Assurance and Quality Control (QAQC) process.
The QAQC system validates data quality for the Connect4Cancer study by running configurable validation rules against BigQuery datasets. The system generates Excel reports highlighting data quality issues and can be executed via API, locally, or in parallel batch processing.
The system consists of three main execution patterns:
- API (`api.R`)
  - Purpose: RESTful API for on-demand QAQC execution
  - Usage: Production deployments via Cloud Run
  - Parameters: dataset, rule range, data range
  - Environment: Uses environment variables for parameter passing
- Local (`qaqc.R`)
  - Purpose: Direct script execution for development and testing
  - Usage: Local development and debugging
  - Configuration: Set parameters directly in script header
- Parallel (`run_qaqc_in_parallel.R`)
  - Purpose: Large-scale batch processing with chunked data
  - Usage: Processing entire datasets efficiently
  - Features: Automatic chunk management, retry logic, report concatenation
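
The chunk management and retry logic used for batch processing can be sketched as follows. This is a minimal illustration in base R, not the code in `run_qaqc_in_parallel.R`: the helper names `make_chunks` and `with_retry` are hypothetical, and the sketch assumes chunks are consecutive row ranges of at most `rows_per_chunk` rows.

```r
# Hypothetical helper: split `n_rows` into consecutive chunks of at most
# `rows_per_chunk` rows and return the offset and size of each chunk.
make_chunks <- function(n_rows, rows_per_chunk) {
  stopifnot(n_rows >= 0, rows_per_chunk > 0)
  if (n_rows == 0) {
    return(data.frame(chunk = integer(0), offset = numeric(0), n = numeric(0)))
  }
  offsets <- seq(0, n_rows - 1, by = rows_per_chunk)
  sizes <- pmin(rows_per_chunk, n_rows - offsets)
  data.frame(chunk = seq_along(offsets), offset = offsets, n = sizes)
}

# Hypothetical retry wrapper: call `f()` up to `max_attempts` times and
# return the first successful result; fail only if every attempt errors.
with_retry <- function(f, max_attempts = 3) {
  last_error <- NULL
  for (attempt in seq_len(max_attempts)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    last_error <- result
  }
  stop("All ", max_attempts, " attempts failed: ", conditionMessage(last_error))
}

# 45,000 rows in chunks of 20,000 gives two full chunks and one partial chunk.
chunks <- make_chunks(n_rows = 45000, rows_per_chunk = 20000)
print(chunks)
```

Each row of `chunks` could then be processed independently, with the per-chunk call wrapped in `with_retry()` so that a transient failure in one chunk does not abort the whole run.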
Specialized functions for merging and retrieving different dataset types:
- `get_merged_biospecimen_and_recruitment_data.R`
- `get_merged_module_[1-4]_data.R`
- `get_merged_[survey_type]_data.R`
Each function handles dataset-specific joins, data type conversions, and deduplication.
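
As a rough illustration of the join, type conversion, and deduplication steps, the sketch below uses base R on toy data. The table contents, the column names (`participant_id`, `age`, `sample_count`), and the helper `merge_and_clean` are all hypothetical; the real retrieval functions query BigQuery and may work quite differently.

```r
# Toy stand-ins for two source tables; all column names are hypothetical.
recruitment <- data.frame(
  participant_id = c("P1", "P2", "P2", "P3"),
  age = c("34", "51", "51", "47"),  # numbers stored as strings
  stringsAsFactors = FALSE
)
biospecimen <- data.frame(
  participant_id = c("P1", "P2"),
  sample_count = c(2L, 1L),
  stringsAsFactors = FALSE
)

merge_and_clean <- function(left, right, key, numeric_cols = character(0)) {
  # Drop exact duplicate rows first so the join does not multiply them.
  left <- unique(left)
  right <- unique(right)
  # Left join: keep every row of `left`; unmatched columns become NA.
  merged <- merge(left, right, by = key, all.x = TRUE)
  # Data type conversion: string columns that should be numeric.
  for (col in numeric_cols) {
    merged[[col]] <- as.numeric(merged[[col]])
  }
  # Keep the first row per key in case duplicates remain after the join.
  merged[!duplicated(merged[[key]]), , drop = FALSE]
}

merged <- merge_and_clean(
  recruitment, biospecimen,
  key = "participant_id", numeric_cols = "age"
)
print(merged)
```

Deduplicating before the join keeps row counts predictable, and the final `duplicated()` filter guards against keys that still repeat with differing values.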
The core validation system supports a wide range of QC rule types for basic validation, data types, cross-variable logic, string/format checks, and date comparisons.
See the QC Types Overview for a complete reference and descriptions of each rule type.
- Environment-specific configs: `dev/`, `stg/`, `prod/` directories
- Dataset-specific settings: Rules files, output folders, flags
- Deployment configs: Dockerfiles, Cloud Build configurations
The QC types cover a wide range of validation rules designed to ensure that the dataset meets defined standards of quality and consistency:
- Basic Checks
- Population Checks
- Data Type Enforcement
- String Length Constraints
- Cross-Variable Logic
- Matching Values
Each rule type is described in detail below:
| QC Type | Description |
|---|---|
| `valid` | Ensures ConceptID values are within the specified ValidValues. |
| `NA or valid` | Allows ConceptID to be within ValidValues or NA. |

| QC Type | Description |
|---|---|
| `is populated` | Checks that ConceptID is not NA. |
| `is not populated` | Verifies that ConceptID is NA. |

| QC Type | Description |
|---|---|
| `isNumeric` | Confirms that ConceptID can be converted to a numeric type. |
| `NA or isNumeric` | Allows ConceptID to be numeric or NA. |

| QC Type | Description |
|---|---|
| `valid before date()` | Ensures ConceptID date is before a specified comparison date. |
| `NA or valid before date()` | Allows ConceptID date to be before the comparison date or NA. |
| `is 24hr time` | Validates that ConceptID follows the HH:MM 24-hour time format. |
| `NA or is 24hr time` | Allows ConceptID to be in 24-hour time format or NA. |

| QC Type | Description |
|---|---|
| `has_n_characters` | Ensures ConceptID string has an exact number of characters. |
| `has_less_than_or_equal_n_characters` | Checks that ConceptID string does not exceed a specified maximum length. |
| `NA or has_n_characters` | Allows ConceptID to have the exact length or be NA. |
| `NA or has_less_than_or_equal_n_characters` | Allows ConceptID to be within the maximum length or be NA. |

| QC Type | Description |
|---|---|
| `crossValid1` to `crossValid4` | Validates ConceptID based on up to 4 related variables. |
| `crossValid1 isNumeric` | Combines crossValid1 with numeric validation. |
| `crossValid1 is populated` | Combines crossValid1 with population check. |

| QC Type | Description |
|---|---|
| `match cid values` | Ensures ConceptID matches the value of another ConceptID. |
| `crossvalid match cid values` | Applies match cid values conditionally. |
| `NA or match cid values` | Allows match or NA. |
| `NA or crossvalid match cid values` | Allows conditional match or NA. |

| QC Type | Description |
|---|---|
| `crossValid1Date` | Ensures ConceptID is a valid date based on condition. |
| `crossValid1NotNA` | Requires ConceptID to be non-NA based on condition. |
| `crossValid1 equal to char()` | Checks string length based on condition. |
| `crossValid1 equal to or less than char()` | Ensures string does not exceed length based on condition. |
| `crossValid1 or is 24hr time` | Validates cross-variable condition or 24-hour time. |
| `NA or crossValid1 is 24hr time` | Allows 24-hour time or NA based on condition. |
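
To make the rule descriptions concrete, here is a minimal base R sketch of how a few of these checks could be written as vectorized predicates. The function names are hypothetical and the semantics are assumptions read off the table descriptions, not the repository's implementation. In particular, the sketch assumes a strict two-digit `HH:MM` pattern and assumes that a `crossValid1` rule passes automatically on rows where its condition does not hold.

```r
# Each check returns TRUE where the value passes the rule.
check_valid <- function(x, valid_values) !is.na(x) & x %in% valid_values
check_na_or_valid <- function(x, valid_values) is.na(x) | x %in% valid_values
check_is_populated <- function(x) !is.na(x)

# Passes when the value can be converted to a number.
check_is_numeric <- function(x) {
  !is.na(x) & !is.na(suppressWarnings(as.numeric(x)))
}

# Strict HH:MM, hours 00-23 and minutes 00-59 (an assumed format).
check_is_24hr_time <- function(x) {
  !is.na(x) & grepl("^([01][0-9]|2[0-3]):[0-5][0-9]$", x)
}

check_has_n_characters <- function(x, n) !is.na(x) & nchar(x) == n

# crossValid1-style rule: the validity check only applies on rows where
# the related variable `cond_x` takes one of `cond_values`.
check_cross_valid1 <- function(x, valid_values, cond_x, cond_values) {
  applies <- !is.na(cond_x) & cond_x %in% cond_values
  !applies | check_valid(x, valid_values)
}

# Illustrative values only.
codes <- c("1", "2", "9", NA)
times <- c("09:30", "24:00", "7:05", NA)

print(check_valid(codes, c("1", "2")))
print(check_na_or_valid(codes, c("1", "2")))
print(check_is_24hr_time(times))
```

Writing each rule as a function that returns a logical vector makes the `NA or ...` variants simple to derive: they are the base check combined with `is.na(x)`.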
```r
# Edit parameters in qaqc.R header
local_drive <- "/your/working/directory"
tier <- "dev"
module <- "recruitment"
# Then source the script
source("qaqc.R")
```

```r
# Configure parameters in run_qaqc_in_parallel.R
tier <- "prod"
dataset <- "recruitment"
rows_per_chunk <- 20000
# Run the script
source("run_qaqc_in_parallel.R")
```

All execution modes generate Excel reports with three sheets:
- `report`
- `exclusions`
- `rules`
Reports include:
- Participant identifiers
- Rule details
- Site information
- Cross-reference lookups
- Trigger: `ccc-generic-qaqc`
- Activation: Pushed changes to environment branches (`stage`, `prod`)
- Configuration: Uses environment-specific `cloudbuild.yaml`
- Docker Build
- Container Registry
- Cloud Run Deployment
The deployed API provides:
- `GET/POST /` (health check)
- `GET/POST /run-qaqc` (QAQC execution)
Cloud Scheduler → Cloud Run API → QAQC Processing → GCS Bucket → Box.com
Example Payload:
```json
{
  "dataset": "recruitment",
  "min_rule": 1,
  "max_rule": 100
}
```
- Reports uploaded to GCS → transferred to Box
- Folder mapping configured via `config.yml`
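
The sketch below shows one way the `dataset`, `min_rule`, and `max_rule` parameters could be read from environment variables and used to pick a subset of rules. It is an illustration under stated assumptions: the environment variable names (`QAQC_DATASET`, `QAQC_MIN_RULE`, `QAQC_MAX_RULE`), the helpers `read_params` and `select_rules`, and the interpretation of the rule range as row positions in the rules table are all hypothetical.

```r
# Hypothetical environment variable names; the real ones may differ.
read_params <- function() {
  list(
    dataset  = Sys.getenv("QAQC_DATASET", unset = "recruitment"),
    min_rule = as.integer(Sys.getenv("QAQC_MIN_RULE", unset = "1")),
    max_rule = as.integer(Sys.getenv("QAQC_MAX_RULE", unset = "100"))
  )
}

# Assumes the rule range refers to row positions in the rules table.
select_rules <- function(rules, min_rule, max_rule) {
  stopifnot(min_rule >= 1, min_rule <= max_rule)
  if (min_rule > nrow(rules)) {
    return(rules[0, , drop = FALSE])  # range starts past the last rule
  }
  rules[seq(min_rule, min(max_rule, nrow(rules))), , drop = FALSE]
}

# Toy rules table standing in for a rules spreadsheet.
rules <- data.frame(
  rule_id = 1:5,
  qc_type = c("valid", "NA or valid", "is populated", "isNumeric", "is 24hr time"),
  stringsAsFactors = FALSE
)

Sys.setenv(QAQC_DATASET = "recruitment", QAQC_MIN_RULE = "2", QAQC_MAX_RULE = "3")
params <- read_params()
selected <- select_rules(rules, params$min_rule, params$max_rule)
print(selected)
```

Clamping `max_rule` to the number of available rules lets a caller pass a generous upper bound, such as the `100` in the example payload, without causing an out-of-range error.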
```mermaid
graph LR
    A[Code Push to Branch] --> B[Cloud Build Trigger]
    B --> C[Docker Build & Deploy]
    C --> D[Cloud Run Service]
    E[Cloud Scheduler] --> D
    D --> F[QAQC Processing]
    F --> G[Upload to GCS]
    G --> H[Cloud Function]
    H --> I[Transfer to Box.com]
```
- Cloud Build, Run, Scheduler, and Cloud Function logs available
- Manual intervention possible via Cloud Console
Use the Issues tab to submit requests for new validation rules:
- Assign to appropriate team member
- Tag with "QAQC"
- Use https://nih.app.box.com/file/1185137275319 as a guide
```
qaqc_testing/
├── api.R
├── qaqc.R
├── run_qaqc_in_parallel.R
├── data_retrieval/
│   └── get_merged_*_data.R
├── dev/
├── stg/
└── prod/
    ├── qc_rules_*.xlsx
    └── exclusions/
```
- Primary: Jake Peters (since January 2023)
- Original Author: Daniel Russ (August 2022)
- Team: C4C Analytics team
This system processes sensitive health data. Ensure all development and deployment follows institutional data security policies.