mulbagalamaq/GWAS-Analysis

EOPC GWAS Analysis

This repository contains code for a genetic association study of early-onset pancreatic cancer (EOPC).


About This Project

This repository contains a genome-wide association study (GWAS) analyzing germline genetic variants on chromosome 18 for association with early-onset pancreatic cancer (EOPC) risk. This project was completed as a coding assessment for a Computational Biology position at Collins Genomics Lab (CGL).

Assessment Context

The assessment was a case study: analyze genetic data to identify potential risk factors for early-onset pancreatic cancer (diagnosed before age 50), building on findings from a CRISPR/Cas9-based screen in primary pancreatic ductal cells.

Study Data:

  • 473 EOPC cases (diagnosed <50 years)
  • 891 controls (cancer-free adults)
  • 3,364 variants on chromosome 18

Author: Aymen Maqsood Mulbgal

Prerequisites

  • Python 3.10+
  • Jupyter Notebook

Installation & Usage

Quick Start

# 1. Navigate to project
cd DFCI_assesment/

# 2. Launch notebook
jupyter notebook notebooks/EOPC.ipynb

# 3. Run all cells (Cell > Run All)

Manual Environment Setup

For a clean environment:

# Option A: Conda
conda env create -f environment.yml
conda activate eopc_analysis

# Option B: venv
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

# Run notebook
jupyter notebook notebooks/EOPC.ipynb

Outputs

After running the notebook, check:

  • results/association_results_full.csv - Complete results for all 1,355 tested variants with statistics, effect sizes, and multiple testing corrections
  • results/top_100_variants.csv - Ranked list of top 100 associated variants
  • results/top10_summary_snps.csv - Formatted summary table of top 10 SNPs
  • plots/EOPC_association_plots.png - Combined Manhattan and QQ plots (shown below)
  • plots/EOPC_manhattan_plot.png/pdf - High-resolution Manhattan plot
  • plots/EOPC_qq_plot.png/pdf - High-resolution QQ plot

Visualizations

EOPC Association Plots

Manhattan plot (left) and QQ plot (right) showing association results for chromosome 18 variants. Two variants exceed the Bonferroni significance threshold (red dashed line).
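A QQ plot compares the observed p-values against what a uniform null distribution would give. The sketch below computes the two sets of coordinates with NumPy only. The function name `qq_points` and the plotting position `(i - 0.5) / n` are my assumptions for illustration; the notebook may compute them differently.

```python
import numpy as np


def qq_points(pvalues):
    """Return (expected, observed) -log10 p-values for a QQ plot.

    Assumption: expected quantiles use the plotting position (i - 0.5) / n,
    one common convention; the notebook may use another.
    """
    observed = np.sort(np.asarray(pvalues, dtype=float))  # smallest p first
    n = observed.size
    expected = (np.arange(1, n + 1) - 0.5) / n            # uniform null quantiles
    return -np.log10(expected), -np.log10(observed)
```

Both arrays come back sorted from most to least significant, so they can be passed directly to a scatter plot with a diagonal reference line. Points that rise above the diagonal at the upper end indicate p-values smaller than expected under the null.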


Requirements

Key packages (auto-installed by notebook):

  • pandas, numpy, scipy
  • matplotlib, seaborn
  • pysam (VCF handling)
  • statsmodels
  • tqdm (progress bars)

See requirements.txt for complete list with versions.


Key Methodological Decisions

Quality Control Rationale

  • Call rate thresholds: Standard GWAS QC thresholds (95% variant, 90% sample) balance data quality with retention of informative variants
  • MAF ≥1%: Removes variants too rare to test reliably at this sample size (N=1,139); lower-MAF variants have unstable effect estimates
  • Heterozygosity filtering: A ±3 SD threshold flags samples with possible contamination (excess heterozygosity) or other sample-quality problems (reduced heterozygosity)
  • Complete case analysis for smoking: Chosen over imputation because smoking is a major risk factor and imputing it could introduce bias
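The filters above can be sketched on a plain genotype matrix. This is a minimal illustration, not the notebook's code: the function name `qc_filter`, the matrix layout (samples by variants, 0/1/2 alternate-allele counts, `NaN` for missing), and the order of the steps (samples first, then variants) are all assumptions.

```python
import numpy as np


def qc_filter(genotypes, variant_call_rate=0.95, sample_call_rate=0.90,
              min_maf=0.01, het_sd=3.0):
    """Return boolean masks (keep_samples, keep_variants).

    genotypes: (n_samples, n_variants) array of 0/1/2 alt-allele counts,
    with np.nan marking missing calls (assumed layout).
    """
    g = np.asarray(genotypes, dtype=float)
    called = ~np.isnan(g)

    # Sample call rate: fraction of variants with a genotype call.
    keep_samples = called.mean(axis=1) >= sample_call_rate

    # Per-sample heterozygosity: share of called genotypes that are 0/1.
    n_called = called.sum(axis=1)
    het_rate = np.where(n_called > 0,
                        (g == 1).sum(axis=1) / np.maximum(n_called, 1),
                        np.nan)
    mean = np.nanmean(het_rate[keep_samples])
    sd = np.nanstd(het_rate[keep_samples])
    if sd > 0:
        # Drop samples more than het_sd standard deviations from the mean.
        keep_samples &= np.abs(het_rate - mean) <= het_sd * sd

    # Variant filters are computed on the retained samples only.
    gs = g[keep_samples]
    called_s = ~np.isnan(gs)
    call_rate = called_s.mean(axis=0)
    alt_freq = np.nansum(gs, axis=0) / (2 * np.maximum(called_s.sum(axis=0), 1))
    maf = np.minimum(alt_freq, 1 - alt_freq)
    keep_variants = (call_rate >= variant_call_rate) & (maf >= min_maf)
    return keep_samples, keep_variants
```

Filtering samples before variants keeps low-quality samples from distorting the per-variant call rate and allele frequency. The reverse order is also used in practice, and the two can give slightly different results.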

Statistical Modeling

  • Additive genetic model: Standard approach for GWAS; assumes a linear increase in log-odds per copy of the risk allele
  • Covariate adjustment: Age, sex, BMI, smoking, and ancestry PCs control for known confounders and population structure
  • Multiple testing: Both Bonferroni (conservative) and FDR (practical) corrections applied

Limitations

  • Sample size (N=1,139) provides adequate power for common variants (MAF>5%) but limited power for rare variants
  • Complete case analysis reduced the sample size by 16% due to missing smoking data
  • Single-chromosome analysis; genome-wide significance thresholds are conservative for a targeted region
  • No replication cohort available; findings require independent validation
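To make the point about significance thresholds concrete, the short sketch below compares a Bonferroni threshold for the 1,355 tested variants with the conventional genome-wide threshold of 5e-8. Using 1,355 as the number of tests is my assumption for illustration; the denominator the notebook actually uses is not stated here.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


# Assumption: one test per variant in association_results_full.csv.
region_threshold = bonferroni_threshold(1355)

# Conventional genome-wide threshold (roughly 0.05 / one million independent tests).
genome_wide_threshold = 5e-8

print(f"Region-level threshold: {region_threshold:.2e}")
print(f"Genome-wide threshold:  {genome_wide_threshold:.0e}")
```

The region-level threshold is several hundred times less stringent than 5e-8, which is why applying the genome-wide value to a single targeted chromosome is conservative.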

Conclusions: I identified two robust genetic risk loci for EOPC on chromosome 18, supporting the candidate region from functional screens. The lead variant shows a strong effect size consistent with moderate-penetrance risk alleles. Future directions include replication in independent cohorts, fine-mapping to identify causal variants, functional annotation of implicated genes (PSMA8, EPG5), and investigation of gene-environment interactions.


Acknowledgements

I would like to express my sincere gratitude to Dana-Farber Cancer Institute and the Collins Genomics Lab for providing this valuable learning opportunity. Special thanks to Dr. Ryan Collins for giving me the chance to work on this assessment and for helping me learn and refresh my skills in:

  • Reproducibility - Best practices for documenting and organizing computational analyses
  • GWAS methodology - Quality control, statistical modeling, and interpretation of genetic association studies
  • Early oncogenic hits - Understanding genetic risk factors in early-onset cancer

This assessment has been an excellent opportunity to apply and strengthen my computational biology skills while contributing to important research in pancreatic cancer genetics.
