EOPC GWAS Analysis

This repository contains code for a genetic association study of early-onset pancreatic cancer.

About This Project

This repository contains a genome-wide association study (GWAS) analyzing germline genetic variants on chromosome 18 for association with early-onset pancreatic cancer (EOPC) risk. This project was completed as a coding assessment for a Computational Biology position at Collins Genomics Lab (CGL).

Assessment Context

The task was to analyze genetic data to identify potential risk factors for early-onset pancreatic cancer (diagnosed before age 50), building on findings from a CRISPR/Cas9-based screen in primary pancreatic ductal cells.

Study Data:

473 EOPC cases (diagnosed <50 years)
891 controls (cancer-free adults)
3,364 variants on chromosome 18

Author: Aymen Maqsood Mulbgal

Prerequisites

Python 3.10+
Jupyter Notebook

Installation & Usage

Quick Start

# 1. Navigate to project
cd DFCI_assesment/

# 2. Launch notebook
jupyter notebook notebooks/EOPC.ipynb

# 3. Run all cells (Cell > Run All)

Manual Environment Setup

For a clean environment:

# Option A: Conda
conda env create -f environment.yml
conda activate eopc_analysis

# Option B: venv
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

# Run notebook
jupyter notebook notebooks/EOPC.ipynb

Outputs

After running the notebook, check:

results/association_results_full.csv - Complete results for all 1,355 tested variants with statistics, effect sizes, and multiple testing corrections
results/top_100_variants.csv - Ranked list of top 100 associated variants
results/top10_summary_snps.csv - Formatted summary table of top 10 SNPs
plots/EOPC_association_plots.png - Combined Manhattan and QQ plots (shown below)
plots/EOPC_manhattan_plot.png/pdf - High-resolution Manhattan plot
plots/EOPC_qq_plot.png/pdf - High-resolution QQ plot

Visualizations

Manhattan plot (left) and QQ plot (right) showing association results for chromosome 18 variants. Two variants exceed the Bonferroni significance threshold (red dashed line).

Requirements

Key packages (auto-installed by notebook):

pandas, numpy, scipy
matplotlib, seaborn
pysam (VCF handling)
statsmodels
tqdm (progress bars)

See requirements.txt for complete list with versions.

Key Methodological Decisions

Quality Control Rationale

Call rate thresholds: Standard GWAS QC thresholds (95% per variant, 90% per sample) balance data quality against retention of informative variants
MAF ≥1%: Ensures adequate power for association testing at this sample size (N=1,139); lower-MAF variants have unstable effect estimates
Heterozygosity filtering: A ±3 SD threshold flags potential contamination (high heterozygosity) or other sample issues (low heterozygosity)
Complete case analysis for smoking: Preferable to imputation for a major risk factor to avoid bias
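
The genotype-level filters above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: it assumes genotypes are already loaded into a samples x variants NumPy array coded 0/1/2 with NaN for missing calls, applies every filter to the raw matrix in a single pass (the notebook's filter order may differ), and the function name `qc_masks` and the toy data are hypothetical.

```python
import numpy as np

def qc_masks(G, variant_call_rate=0.95, sample_call_rate=0.90,
             min_maf=0.01, het_sd=3.0):
    """Return boolean keep-masks (samples, variants) for a 0/1/2 genotype matrix."""
    called = ~np.isnan(G)

    # Call rate: fraction of non-missing genotypes per sample and per variant
    keep_samples = called.mean(axis=1) >= sample_call_rate
    keep_variants = called.mean(axis=0) >= variant_call_rate

    # Minor allele frequency from the mean alternate-allele dosage
    alt_freq = np.nanmean(G, axis=0) / 2.0
    maf = np.minimum(alt_freq, 1.0 - alt_freq)
    keep_variants &= maf >= min_maf

    # Per-sample heterozygosity rate, flagged when beyond +/- het_sd SDs of the mean
    n_called = np.maximum(called.sum(axis=1), 1)
    het = (G == 1).sum(axis=1) / n_called
    z = (het - het.mean()) / het.std()
    keep_samples &= np.abs(z) <= het_sd

    return keep_samples, keep_variants

# Toy data (hypothetical): 200 samples, 50 variants with allele frequency 0.3
rng = np.random.default_rng(0)
G = rng.binomial(2, 0.3, size=(200, 50)).astype(float)
G[:, 1] = 0.0          # variant 1: monomorphic, should fail the MAF filter
G[1, :] = 1.0          # sample 1: heterozygous everywhere, should fail the het filter
G[:40, 0] = np.nan     # variant 0: 20% missing, should fail the variant call rate
G[0, :25] = np.nan     # sample 0: 50% missing, should fail the sample call rate

keep_samples, keep_variants = qc_masks(G)
print(f"samples kept: {keep_samples.sum()} / {keep_samples.size}")
print(f"variants kept: {keep_variants.sum()} / {keep_variants.size}")
```

Returning masks rather than a filtered matrix keeps the sample and variant decisions separate, so each can be logged or applied to the covariate table independently.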

Statistical Modeling

Additive genetic model: Standard approach for GWAS; assumes linear increase in log-odds per copy of risk allele
Covariate adjustment: Age, sex, BMI, smoking, and ancestry PCs control for known confounders and population structure
Multiple testing: Both Bonferroni (conservative) and FDR (practical) corrections applied
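
As a rough sketch of the modeling steps above: each variant's allele count (0, 1, or 2) enters a logistic regression alongside covariates, and the resulting p-values are then adjusted with Bonferroni and Benjamini-Hochberg FDR. The README lists statsmodels as a dependency; the example below instead uses a small hand-written Newton-Raphson fit so that NumPy is the only requirement, so treat it as a stand-in rather than the notebook's actual call. The simulated data, effect sizes, and function names are all hypothetical.

```python
import math
import numpy as np

def fit_logistic(X, y, n_iter=50, tol=1e-8):
    """Logistic regression by Newton-Raphson; returns (coefficients, standard errors)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = p * (1.0 - p)
        info = X.T @ (X * w[:, None])            # Fisher information X' W X
        step = np.linalg.solve(info, X.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    info = X.T @ (X * (p * (1.0 - p))[:, None])
    se = np.sqrt(np.diag(np.linalg.inv(info)))
    return beta, se

def wald_p(beta, se):
    """Two-sided p-value for a Wald z-statistic under a standard normal."""
    return math.erfc(abs(beta / se) / math.sqrt(2.0))

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values), in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]   # enforce monotonicity
    q = np.empty(m)
    q[order] = np.minimum(ranked, 1.0)
    return q

# Simulated data (hypothetical): 5 variants, only variant 0 affects risk
rng = np.random.default_rng(0)
n = 2000
covars = rng.normal(size=(n, 2))                 # stand-ins for, e.g., age and one PC
dosages = rng.binomial(2, 0.3, size=(n, 5))      # additive coding: copies of the allele
logit = -0.5 + 0.5 * dosages[:, 0] + 0.3 * covars[:, 0]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

betas, pvals = [], []
for j in range(dosages.shape[1]):
    X = np.column_stack([np.ones(n), dosages[:, j], covars])
    beta, se = fit_logistic(X, y)
    betas.append(beta[1])                        # log-odds change per allele copy
    pvals.append(wald_p(beta[1], se[1]))

pvals = np.array(pvals)
bonferroni = np.minimum(pvals * pvals.size, 1.0)
fdr = benjamini_hochberg(pvals)
for j in range(pvals.size):
    print(f"variant {j}: OR={math.exp(betas[j]):.2f} p={pvals[j]:.2e} "
          f"bonf={bonferroni[j]:.2e} fdr={fdr[j]:.2e}")
```

The coefficient on the dosage column is the per-allele log-odds, so exponentiating it gives the per-allele odds ratio implied by the additive model.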

Limitations

Sample size (N=1,139) provides adequate power for common variants (MAF>5%) but limited power for rare variants
Complete case analysis reduced sample size by 16% due to missing smoking data
Single-chromosome analysis; genome-wide significance thresholds are conservative for a targeted region
No replication cohort available; findings require independent validation
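
To make the threshold point concrete, the short calculation below compares the conventional genome-wide cutoff (5e-8) with a Bonferroni cutoff based only on the variants tested here. It assumes the 1,355 tested variants reported in the Outputs section are the Bonferroni denominator and that alpha is 0.05; the notebook may have used different values.

```python
import math

n_tested = 1355        # tested variants, as reported for association_results_full.csv
alpha = 0.05           # assumed family-wise error rate
genome_wide = 5e-8     # conventional genome-wide significance threshold

regional = alpha / n_tested   # Bonferroni threshold for this region, about 3.7e-5
ratio = regional / genome_wide

print(f"regional Bonferroni threshold: {regional:.2e}")
print(f"-log10 scale: regional {-math.log10(regional):.2f} "
      f"vs genome-wide {-math.log10(genome_wide):.2f}")
print(f"the genome-wide cutoff is about {ratio:.0f}x stricter")
```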

Conclusions: I identified two robust genetic risk loci for EOPC on chromosome 18, supporting the candidate region from functional screens. The lead variant shows a strong effect size consistent with moderate-penetrance risk alleles. Future directions include replication in independent cohorts, fine-mapping to identify causal variants, functional annotation of implicated genes (PSMA8, EPG5), and investigation of gene-environment interactions.

Acknowledgements

I would like to express my sincere gratitude to Dana-Farber Cancer Institute and the Collins Genomics Lab for providing this valuable learning opportunity. Special thanks to Dr. Ryan Collins for giving me the chance to work on this assessment and for helping me learn and refresh my skills in:

Reproducibility - Best practices for documenting and organizing computational analyses
GWAS methodology - Quality control, statistical modeling, and interpretation of genetic association studies
Early oncogenic hits - Understanding genetic risk factors in early-onset cancer

This assessment has been an excellent opportunity to apply and strengthen my computational biology skills while contributing to important research in pancreatic cancer genetics.
