DeepCAST-GWAS is a GWAS post-processing pipeline that integrates Enformer-derived SAD scores and coding-variant annotations to prioritize variants and loci.
This repository provides two “production” entrypoints:
- Baseline + DeepCAST (FWER-style thresholding): `src/main.py`
- DeepCAST with stratified FDR (sFDR): `src/deepcast_sfdr.py`
The pipeline covers the complete workflow:
- Data preprocessing (reference formatting; coding SNP list generation; optional LDSC formatting utilities)
- Integration of Enformer-derived SAD scores into GWAS summary statistics
- Variant filtering / prioritization (coding SNPs and SAD outliers)
- LD-based locus definition via PLINK clumping
- Stratified FDR (binning by SAD z-score + coding stratum)
Example SLURM scripts for HPC are provided below (copy/paste templates).
- Configure paths in `src/config.py` (portable defaults are relative to the repo). Most users only need to set:
  - `SUMSTATS_DIR` (GWAS summary stats)
  - `TRACKLISTS_DIR` (phenotype-specific Enformer tracklists)
  - `REF_DIR_1KG` + `REF_DICT_PATH` (1KG reference augmented with Enformer SAD tracks)
  - `LD_REFERENCE_PATH` + `PLINK_PATH` (for clumping)
You can override most paths via environment variables (see docstring at the top of src/config.py).
- Run baseline + DeepCAST (FWER-style):

  ```bash
  python3 -u src/main.py --phen 123 --run_id demo_run
  ```

- Run DeepCAST with stratified FDR (sFDR):

  ```bash
  python3 -u src/deepcast_sfdr.py --phen 123 --run_id demo_run
  ```

Outputs are written under `results/` (see “Outputs” below).
DeepCAST-GWAS is pure Python, but requires a working PLINK installation for LD clumping.
At minimum you need:
- `python>=3.10`
- `pandas`, `numpy`, `scipy`, `statsmodels`
Optional (only for some data-prep utilities):
- `openai`, `pydantic`, `python-dotenv` (used by `src/data_preparation/curate_sad_tracks.py`)
Ensure `plink` is available on your `$PATH`, or set `DEEPCAST_GWAS_PLINK` to an explicit binary path.
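A quick sanity check before running (the fallback logic here is illustrative; `src/config.py` performs its own resolution):

```shell
# Resolve the plink binary the same way you would configure it: env override first,
# then whatever "plink" resolves to on $PATH.
PLINK_BIN="${DEEPCAST_GWAS_PLINK:-plink}"
if ! command -v "${PLINK_BIN}" >/dev/null 2>&1; then
  echo "plink not found; set DEEPCAST_GWAS_PLINK to the binary path" >&2
fi
```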
DeepCAST-GWAS expects a small set of well-defined inputs. The default layout is:
```
deepcast_gwas/
  data/
    sumstats/                    # GWAS summary stats files
    tracklists/                  # phenotype-specific Enformer track indices
    1000genomes_as_csv/          # 1KG reference files augmented with SAD columns
    reference_files_by_chr.json  # chr -> reference filename mapping
    genome_assembly/
      coding_snps.csv            # list of coding SNP rsIDs (generated)
  results/
  logs/
  src/
```
The pipeline reads a single GWAS summary-statistics file for a phenotype. The reader supports plain text and .tsv.bgz (bgzip) files.
Required columns (default names):
- `chr`: chromosome (int-like)
- `pos`: base-pair position (int-like)
- `ref`: reference allele
- `alt`: alternate allele
- `neglog10_pval_EUR`: $-\log_{10}(p)$
If your columns differ, you can map them via CLI flags:
`--chr`, `--bp`, `--ref`, `--alt`, `--neglogp`
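Conceptually, these flags amount to a rename onto the default column names before downstream processing. A minimal sketch with toy data (the raw header names and the rename map are illustrative, not the pipeline’s actual implementation):

```python
import io

import pandas as pd

# Hypothetical raw sumstats with non-default column names.
raw = io.StringIO(
    "CHR\tBP\tA1\tA2\tNEGLOG10P\n"
    "1\t12345\tA\tG\t8.2\n"
    "2\t67890\tC\tT\t3.1\n"
)

# Flags like --chr/--bp/--ref/--alt/--neglogp boil down to a mapping
# onto the default column names the pipeline expects.
rename_map = {
    "CHR": "chr",
    "BP": "pos",
    "A1": "ref",
    "A2": "alt",
    "NEGLOG10P": "neglog10_pval_EUR",
}
df = pd.read_csv(raw, sep="\t").rename(columns=rename_map)
print(list(df.columns))  # ['chr', 'pos', 'ref', 'alt', 'neglog10_pval_EUR']
```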
For non-indexed phenotypes, pass the file directly with --filename (relative to SUMSTATS_DIR).
Each phenotype needs a tracklist: a one-column CSV of Enformer track indices, one per line.
By default, the filename is:
`data/tracklists/tracks_phen<PHEN_ID>.csv`
(This naming is controlled by src/utils.py:create_filename_tracklist.)
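A tracklist is just one integer track index per line; for example, writing one by hand for `--phen 123` (the indices below are arbitrary placeholders, not curated Enformer tracks):

```python
from pathlib import Path

# Arbitrary example indices; real ones come from your Enformer track curation.
tracks = [41, 42, 5110]
out = Path("tracks_phen123.csv")  # default naming for --phen 123
out.write_text("\n".join(str(t) for t in tracks) + "\n")
```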
To integrate Enformer-derived SAD scores, DeepCAST-GWAS joins your GWAS variants onto per-chromosome reference CSVs.
Inputs:
- `data/reference_files_by_chr.json`: mapping `"1" -> "<filename_for_chr1>.csv"`, …, `"22" -> ...`
- `data/1000genomes_as_csv/<that filename>`: the reference CSVs
Each reference CSV must contain the merge keys and annotation columns:
- Merge keys: `chr`, `pos`, `ref`, `alt`
- Required: `snp` (rsID)
- SAD columns: `SAD<track_index>` for every track listed in the phenotype’s tracklist
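The per-chromosome join is a merge on those four keys. A minimal pandas sketch with toy data (`SAD42` stands in for a real track column), including how variants absent from the reference can be captured (cf. `snps_missing_from_reference.csv`):

```python
import pandas as pd

# Toy GWAS variants and a toy reference chunk for one chromosome.
gwas = pd.DataFrame({
    "chr": [1, 1], "pos": [100, 200],
    "ref": ["A", "C"], "alt": ["G", "T"],
    "neglog10_pval_EUR": [9.1, 2.4],
})
reference = pd.DataFrame({
    "chr": [1], "pos": [100], "ref": ["A"], "alt": ["G"],
    "snp": ["rs1"], "SAD42": [0.37],  # one SAD<track_index> column
})

# Left-merge keeps every GWAS variant; the indicator column marks reference misses.
merged = gwas.merge(reference, on=["chr", "pos", "ref", "alt"],
                    how="left", indicator=True)
missing = merged[merged["_merge"] == "left_only"]
print(len(merged), len(missing))  # 2 1
```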
You can download the reference genome merged with the Enformer tracks here.
DeepCAST-GWAS marks variants as “coding” by membership in data/genome_assembly/coding_snps.csv (rsIDs).
You can download coding SNPs as csv here.
Alternatively, you can generate your own list using an alternative exon list.
- Run:

  ```bash
  python3 -u src/data_preparation/coding_snps/generate_coding_snp_list.py
  ```

Run:

```bash
python3 -u src/main.py --phen 123 --run_id my_run
```

High-level steps (as implemented in `src/main.py`):
- Read GWAS sumstats
- Load phenotype tracklist
- Merge sumstats with SAD tracks (per chromosome) using `reference_files_by_chr.json`
- Identify DeepCAST-relevant SNPs:
  - coding SNPs (from `coding_snps.csv`)
  - SAD outliers (based on `SAD_DEVIATION_FACTOR`)
- Compute an adjusted p-value threshold for the DeepCAST subset
- Define loci via LD clumping (PLINK) for:
  - baseline (genome-wide significant lead SNPs; keeps DeepCAST-adjusted tagging variants)
  - DeepCAST-relevant variants (lead SNPs at the adjusted threshold)
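The “adjusted p-value threshold” is a FWER-style correction over the DeepCAST-relevant subset. A Bonferroni-style sketch of the idea (the exact rule and α used by `src/main.py` live in the code and `src/config.py`; 0.05 and the SNP count here are assumptions):

```python
import math

ALPHA = 0.05  # assumed family-wise error level; see src/config.py for the real constant

def adjusted_neglog10_threshold(n_deepcast_snps: int, alpha: float = ALPHA) -> float:
    """Bonferroni-style per-test threshold, on the -log10(p) scale used by the sumstats."""
    return -math.log10(alpha / n_deepcast_snps)

# A variant passes if its neglog10_pval_EUR exceeds this value.
print(round(adjusted_neglog10_threshold(20_000), 3))  # -log10(0.05 / 20000) ≈ 5.602
```

Testing far fewer hypotheses than genome-wide is the point: the subset threshold is much more lenient than the usual 5e-8.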
Run:

```bash
python3 -u src/deepcast_sfdr.py --phen 123 --run_id my_run
```

High-level steps (as implemented in `src/deepcast_sfdr.py`):
- Read GWAS sumstats
- Merge SAD tracks
- Compute per-variant mean SAD z-score (across the phenotype’s tracks)
- Assign each SNP to a stratum:
  - non-coding SNPs: binned by SAD z-score (`STD_BINS`/`STD_BIN_LABELS`)
  - coding SNPs: forced into a dedicated `"coding"` stratum
- Perform BH-FDR within each stratum, then clump to loci with PLINK
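The stratify-then-correct steps can be sketched with toy data as follows (bin edges, labels, and the 0.05 level are illustrative; the pipeline’s values come from `STD_BINS`, `STD_BIN_LABELS`, and `FDR_THRESHOLD` in `src/config.py`):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multitest import multipletests

snps = pd.DataFrame({
    "snp": ["rs1", "rs2", "rs3", "rs4"],
    "pval": [1e-6, 0.02, 0.8, 5e-4],
    "mean_sad_z": [2.5, 0.1, -0.2, 1.8],  # per-variant mean SAD z-score
    "is_coding": [False, False, True, False],
})

# Non-coding SNPs are binned by |z|; coding SNPs get their own stratum.
bins = pd.cut(snps["mean_sad_z"].abs(), bins=[0, 1, 2, np.inf],
              labels=["low", "mid", "high"])
snps["stratum"] = np.where(snps["is_coding"], "coding", bins.astype(str))

def bh_within(group: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
    """Benjamini-Hochberg FDR applied inside a single stratum."""
    rejected, qvals, _, _ = multipletests(group["pval"], alpha=alpha, method="fdr_bh")
    return group.assign(qval=qvals, significant=rejected)

result = snps.groupby("stratum", group_keys=False).apply(bh_within)
```

The payoff of stratification is that a p-value only competes against others in its own stratum, so high-SAD and coding variants are not penalized for the bulk of inert SNPs.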
Output directory: `results/run_<RUN_ID>/phen<PHEN_ID>/`

Files:
- `significant_snps_baseline.csv`
- `significant_snps_deepcast.csv`
- `snps_missing_from_reference.csv`
- `metadata.csv`
Output directory: `results/deepcast_sfdr/run_<RUN_ID>/phen<PHEN_ID>/`

Files:
- `significant_snps_baseline.csv`
- `fdr_significant_snps.csv`
- `snps_missing_from_reference.csv`
- `metadata.csv`
The scripts below are written as templates (placeholders are intentional). You typically only need to edit:
- conda/module activation
- resource requests (`--cpus-per-task`, `--mem`, `--time`)
- the phenotype list file used for arrays
- environment overrides for `src/config.py` paths (optional but recommended on shared filesystems)
Baseline, single phenotype:

```bash
#!/bin/bash
#SBATCH --job-name=deepcast_baseline
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=06:00:00
#SBATCH --output=/dev/null

set -euo pipefail

DEEPCAST_GWAS_DIR="${DEEPCAST_GWAS_DIR:-/path/to/deepcast_gwas}"
PHEN_ID="${PHEN_ID:-123}"
RUN_ID="${RUN_ID:-${SLURM_JOB_ID}}"

source ~/.bashrc
# conda activate deepcast_gwas

cd "${DEEPCAST_GWAS_DIR}"
log_dir="logs/baseline/run_${RUN_ID}"
mkdir -p "${log_dir}"
exec > "${log_dir}/job_${SLURM_JOB_ID}.out" 2>&1

python3 -u src/main.py \
  --phen "${PHEN_ID}" \
  --run_id "${RUN_ID}"
```

Baseline, array over a phenotype list:

```bash
#!/bin/bash
#SBATCH --job-name=deepcast_baseline_arr
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=06:00:00
#SBATCH --array=1-100
#SBATCH --output=/dev/null

set -euo pipefail

DEEPCAST_GWAS_DIR="${DEEPCAST_GWAS_DIR:-/path/to/deepcast_gwas}"
PHEN_LIST="${PHEN_LIST:-/path/to/phen_ids.txt}"  # one integer phenotype id per line
RUN_ID="${RUN_ID:-${SLURM_ARRAY_JOB_ID}}"

source ~/.bashrc
# conda activate deepcast_gwas

cd "${DEEPCAST_GWAS_DIR}"
log_dir="logs/baseline/run_${RUN_ID}"
mkdir -p "${log_dir}"
exec > "${log_dir}/job_${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}.out" 2>&1

PHEN_ID="$(sed -n "${SLURM_ARRAY_TASK_ID}p" "${PHEN_LIST}")"
echo "PHEN_ID=${PHEN_ID} RUN_ID=${RUN_ID}"

python3 -u src/main.py \
  --phen "${PHEN_ID}" \
  --run_id "${RUN_ID}"
```

sFDR, single phenotype:

```bash
#!/bin/bash
#SBATCH --job-name=deepcast_sfdr
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=06:00:00
#SBATCH --output=/dev/null

set -euo pipefail

DEEPCAST_GWAS_DIR="${DEEPCAST_GWAS_DIR:-/path/to/deepcast_gwas}"
PHEN_ID="${PHEN_ID:-123}"
RUN_ID="${RUN_ID:-${SLURM_JOB_ID}}"

source ~/.bashrc
# conda activate deepcast_gwas

cd "${DEEPCAST_GWAS_DIR}"
log_dir="logs/sfdr/run_${RUN_ID}"
mkdir -p "${log_dir}"
exec > "${log_dir}/job_${SLURM_JOB_ID}.out" 2>&1

python3 -u src/deepcast_sfdr.py \
  --phen "${PHEN_ID}" \
  --run_id "${RUN_ID}"
```

sFDR, array over a phenotype list:

```bash
#!/bin/bash
#SBATCH --job-name=deepcast_sfdr_arr
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=06:00:00
#SBATCH --array=1-100
#SBATCH --output=/dev/null

set -euo pipefail

DEEPCAST_GWAS_DIR="${DEEPCAST_GWAS_DIR:-/path/to/deepcast_gwas}"
PHEN_LIST="${PHEN_LIST:-/path/to/phen_ids.txt}"  # one integer phenotype id per line
RUN_ID="${RUN_ID:-${SLURM_ARRAY_JOB_ID}}"

source ~/.bashrc
# conda activate deepcast_gwas

cd "${DEEPCAST_GWAS_DIR}"
log_dir="logs/sfdr/run_${RUN_ID}"
mkdir -p "${log_dir}"
exec > "${log_dir}/job_${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}.out" 2>&1

PHEN_ID="$(sed -n "${SLURM_ARRAY_TASK_ID}p" "${PHEN_LIST}")"
echo "PHEN_ID=${PHEN_ID} RUN_ID=${RUN_ID}"

python3 -u src/deepcast_sfdr.py \
  --phen "${PHEN_ID}" \
  --run_id "${RUN_ID}"
```

Method thresholds:
- `SAD_DEVIATION_FACTOR`: SAD outlier cutoff (in SD units) for DeepCAST relevance
- `PVAL_THRESHOLD`: baseline genome-wide significance threshold
- `FDR_THRESHOLD`: sFDR threshold for the adjusted p-values
- `STD_BINS`/`STD_BIN_LABELS`: bins used for SAD-based strata in sFDR
LD clumping:
- `R2_THRESHOLD`, `KB_RADIUS`: PLINK clumping parameters
- `LD_REFERENCE_PATH`: `plink --bfile` prefix for the LD reference (e.g., 1KG EUR)
- `PLINK_PATH`: path to the plink binary (or `"plink"`)
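These settings map roughly onto PLINK 1.9 clumping flags as in the template below (paths, thresholds, and field names are placeholders; `--clump-snp-field`/`--clump-field` must match the headers of the association file you feed to PLINK):

```bash
# Placeholder invocation; substitute your own paths and configured thresholds.
plink \
  --bfile /path/to/ld_reference_prefix \
  --clump assoc_for_clumping.tsv \
  --clump-p1 5e-8 \
  --clump-r2 0.1 \
  --clump-kb 250 \
  --clump-snp-field SNP \
  --clump-field P \
  --out clumped_loci
```

Here `--bfile` corresponds to `LD_REFERENCE_PATH`, `--clump-r2` to `R2_THRESHOLD`, and `--clump-kb` to `KB_RADIUS`; `--clump-p1` is the lead-SNP significance threshold.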
- Merge returns empty: check that your reference CSVs contain the correct merge keys (`chr`, `pos`, `ref`, `alt`) and the SAD columns (`SAD<track>`), and that `reference_files_by_chr.json` points to the correct filenames.
- PLINK errors / missing variants: ensure `LD_REFERENCE_PATH` points to a valid PLINK binary dataset prefix and that alleles/rsIDs are compatible.
- Missing phenotype mappings: if you don’t use phenotype-id-based lookup, prefer passing `--filename` and keep `--phen` only for tracklist naming.