Repository contains code to perform Association Study on Early-Onset Pancreatic Cancer
This repository contains a genome-wide association study (GWAS) analyzing germline genetic variants on chromosome 18 for association with early-onset pancreatic cancer (EOPC) risk. This project was completed as a coding assessment for a Computational Biology position at Collins Genomics Lab (CGL).
The task was to analyze genetic data to identify potential risk factors for early-onset pancreatic cancer (diagnosed before age 50), building on findings from a CRISPR/Cas9-based screen in primary pancreatic ductal cells.
Study Data:
- 473 EOPC cases (diagnosed <50 years)
- 891 controls (cancer-free adults)
- 3,364 variants on chromosome 18
Author: Aymen Maqsood Mulbgal
Prerequisites:
- Python 3.10+
- Jupyter Notebook
# 1. Navigate to project
cd DFCI_assesment/
# 2. Launch notebook
jupyter notebook notebooks/EOPC.ipynb
# 3. Run all cells (Cell > Run All)

For a clean environment:
# Option A: Conda
conda env create -f environment.yml
conda activate eopc_analysis
# Option B: venv
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
# Run notebook
jupyter notebook notebooks/EOPC.ipynb

After running the notebook, check:
- results/association_results_full.csv - Complete results for all 1,355 tested variants with statistics, effect sizes, and multiple testing corrections
- results/top_100_variants.csv - Ranked list of top 100 associated variants
- results/top10_summary_snps.csv - Formatted summary table of top 10 SNPs
- plots/EOPC_association_plots.png - Combined Manhattan and QQ plots (shown below)
- plots/EOPC_manhattan_plot.png/pdf - High-resolution Manhattan plot
- plots/EOPC_qq_plot.png/pdf - High-resolution QQ plot
Manhattan plot (left) and QQ plot (right) showing association results for chromosome 18 variants. Two variants exceed the Bonferroni significance threshold (red dashed line).
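For readers who want to see where such a threshold line comes from, here is a minimal sketch of the two corrections, written in plain Python rather than taken from the notebook. It assumes a family-wise alpha of 0.05 and uses the 1,355 tested variants as the number of tests; the function names are hypothetical and the notebook may compute these differently (for example via statsmodels).

```python
import math


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Assumed inputs: alpha = 0.05 and 1,355 tested variants.
threshold = bonferroni_threshold(1355)
# Height of the threshold line on a -log10(p) Manhattan plot axis.
line_height = -math.log10(threshold)
print(f"Bonferroni cutoff: {threshold:.2e}, -log10: {line_height:.2f}")
```

Under these assumptions the cutoff is roughly 3.7e-5, which is far less stringent than the conventional genome-wide 5e-8 because only one chromosome's variants are tested.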
Key packages (auto-installed by notebook):
- pandas, numpy, scipy
- matplotlib, seaborn
- pysam (VCF handling)
- statsmodels
- tqdm (progress bars)
See requirements.txt for complete list with versions.
Rationale for analysis choices:
- Call rate thresholds: Standard GWAS QC thresholds (95% variant, 90% sample) balance data quality with retention of informative variants
- MAF ≥1%: Ensures adequate power for association testing with sample size (N=1,139); lower MAF variants have unstable effect estimates
- Heterozygosity filtering: ±3 SD threshold identifies potential contamination (high het) or sample issues (low het)
- Complete case analysis for smoking: Preferable to imputation for a major risk factor to avoid bias
- Additive genetic model: Standard approach for GWAS; assumes linear increase in log-odds per copy of risk allele
- Covariate adjustment: Age, sex, BMI, smoking, and ancestry PCs control for known confounders and population structure
- Multiple testing: Both Bonferroni (conservative) and FDR (practical) corrections applied
Limitations:
- Sample size (N=1,139) provides adequate power for common variants (MAF>5%) but limited power for rare variants
- Complete case analysis reduced sample size by 16% due to missing smoking data
- Single chromosome analysis; genome-wide significance thresholds are conservative for targeted region
- No replication cohort available; findings require independent validation
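The QC thresholds described above (95% variant call rate, 90% sample call rate, MAF ≥1%, heterozygosity within ±3 SD) can be sketched as follows. This is an illustrative, standard-library-only version and not the notebook's code: the function names, the use of `None` for missing genotypes, the additive 0/1/2 dosage encoding of the input, and the order in which filters are applied are all assumptions.

```python
import statistics


def call_rate(values):
    """Fraction of non-missing entries (None marks a missing genotype)."""
    return sum(v is not None for v in values) / len(values)


def minor_allele_frequency(dosages):
    """MAF from additive dosages (0, 1 or 2 copies of the alternate allele)."""
    called = [d for d in dosages if d is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def heterozygosity_outliers(genotypes, n_sd=3.0):
    """Positions of samples whose het rate is more than n_sd SDs from the mean."""
    rates = []
    for sample in genotypes:
        called = [d for d in sample if d is not None]
        rates.append(sum(d == 1 for d in called) / len(called) if called else 0.0)
    mean, sd = statistics.mean(rates), statistics.pstdev(rates)
    return [i for i, r in enumerate(rates) if sd > 0 and abs(r - mean) > n_sd * sd]


def qc_filter(genotypes, variant_cr=0.95, sample_cr=0.90, min_maf=0.01):
    """genotypes: list of samples, each a list of per-variant dosages.

    Assumed order: sample call rate, heterozygosity, then variant filters.
    Returns (kept sample indices, kept variant indices).
    """
    keep_samples = [i for i, s in enumerate(genotypes) if call_rate(s) >= sample_cr]
    het_out = set(heterozygosity_outliers([genotypes[i] for i in keep_samples]))
    keep_samples = [i for k, i in enumerate(keep_samples) if k not in het_out]
    keep_variants = []
    for j in range(len(genotypes[0])):
        column = [genotypes[i][j] for i in keep_samples]
        if call_rate(column) >= variant_cr and minor_allele_frequency(column) >= min_maf:
            keep_variants.append(j)
    return keep_samples, keep_variants


# Toy data: 4 samples x 4 variants; thresholds relaxed because the toy set is tiny.
toy = [
    [0, 1, 2, 0],
    [1, 1, 0, 0],
    [0, None, 1, 0],
    [2, 0, 1, 0],
]
samples, variants = qc_filter(toy, variant_cr=0.75, sample_cr=0.75)
print(samples, variants)
```

In the toy data the last variant is monomorphic (MAF of 0), so it is dropped, while all four samples pass. Real cohorts would use the default thresholds, and the heterozygosity filter only becomes meaningful with many samples.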
Conclusions: I identified two robust genetic risk loci for EOPC on chromosome 18, supporting the candidate region from functional screens. The lead variant shows a strong effect size consistent with moderate-penetrance risk alleles. Future directions include replication in independent cohorts, fine-mapping to identify causal variants, functional annotation of implicated genes (PSMA8, EPG5), and investigation of gene-environment interactions.
I would like to express my sincere gratitude to Dana-Farber Cancer Institute and the Collins Genomics Lab for providing this valuable learning opportunity. Special thanks to Dr. Ryan Collins for giving me the chance to work on this assessment and for helping me learn and refresh my skills in:
- Reproducibility - Best practices for documenting and organizing computational analyses
- GWAS methodology - Quality control, statistical modeling, and interpretation of genetic association studies
- Early oncogenic hits - Understanding genetic risk factors in early-onset cancer
This assessment has been an excellent opportunity to apply and strengthen my computational biology skills while contributing to important research in pancreatic cancer genetics.
