This repository provides an end‑to‑end R workflow that converts raw GWAS summary statistics into gene‑level association tables and optional eQTL annotations. The pipeline is organised in three modules that can be run independently or through a single orchestration script:
-
Summary‑statistics preprocessing: filter by p‑value & minor‑allele‑frequency (MAF) and convert to BED.
-
SNP‑to‑gene assignment: annotate each sentinel SNP with its in‑gene/nearest genes.
- Gene‑level analysis & annotation
- MAGMA – aggregate SNP p‑values into SNP to Gene assignment.
- eQTL look‑up (optional) – intersect sentinel SNPs with user‑supplied eQTL catalogues.
- Reads any tab‑delimited summary file; column names are user‑configurable.
- Supports gzip‑compressed input.
- Exports a BED‑formatted version for fast interval joins.
- Overlap – report gene(s) whose bodies overlap each SNP.
- Proximity – optionally include nearest upstream & downstream genes.
- Strand‑aware and distance‑aware mapping.
- Thin wrapper around MAGMA ≥ v1.10.
- Accepts standard bed/bim/fam reference panels and gene‑annotation files.
- Returns a tidy data.table for direct plotting or enrichment analysis.
- Scans all TSV/CSV/TSV.GZ files inside a directory.
- Extracts rows whose rsid matches the sentinel SNP list.
- Strips Ensembl version suffixes (.1, .2, …) and collapses multiple hits per gene.
- For each included method, all method data are summarized to an overall summary file output.
| File | Purpose |
|---|---|
| main.R | Orchestrates the whole workflow; see parameters below. |
| file_helpers.R | I/O helpers, mainly load_input_gscan_data() & load_assembly_file(). |
| snp_to_gene.R | Interval & nearest‑gene lookup functions. |
| magma.R | Wrapper around MAGMA binary. |
| eqtl.R | Directory‑based eQTL mapper. |
- R (≥ 4.0)
- Required R packages:
bedtoolsrdata.tabledplyrstringr
Install them via:
install.packages(c("data.table", "dplyr", "stringr", "remotes"))
remotes::install_github("phillipburgess/bedtoolsr")Both MAGMA and bedtools need to be installed separately.
| Argument | Default | Description |
|---|---|---|
| run_snp_to_nearest_gene | TRUE | Toggle SNP‑gene mapping step. |
| run_magma | TRUE | Toggle MAGMA step. |
| run_eQTL_mapping | TRUE | Toggle eQTL look‑up step. |
| input_gscan_filepath | required | GWAS summary statistics file. |
| gene_loc_filepath | required | BED‑like file with gene coordinates (GRCh38). |
| gwas_pvalue_threshold | 5.00E-08 | Genome‑wide significance cut‑off. |
| minor_allele_frequency | 0.01 | MAF threshold. |
| outdir | "./output" | Folder to write all results. |
| snps_to_nearest_gene_mode | "multiple" | "single" or "multiple". |
| magma_binary_path | "./magma" | Path to MAGMA executable. |
| bed_bim_fam_filepath | NULL | Prefix to reference panel (without extension). |
| gene_annotation_filepath | NULL | MAGMA‑formatted gene annotation file. |
| count_column_N | "N" | Sample‑size column in GWAS file. |
| snp_column_name | "RSID" | Column with SNP IDs. |
| p_value_column | "P" | Column with p‑values. |
| eQTL_eGene_directory | NULL | Directory containing eQTL files. |
To run the program, users need to supply GWAS summary statics files, and a bed-like file with gene coordinates.
1 Command‑line invocation
Rscript inst/scripts/snp_to_gene.R \
--input_gscan_filepath \
--assembly_filepath \
--gwas_pvalue_threshold \
--minor_allele_frequency \
--outdir \
--snps_to_nearest_gene_mode \
--magma_binary_path \
--bed_bim_fam_filepath \
--gene_annotation_filepath \
--count_column_N \
--snp_column_name \
--p_value_column \
--eQTL_eGene_directory \
--eqtl_gene_column_name \
--eqtl_rsid_column_name \
--run_snp_to_nearest_gene \
--run_magma \
--run_eQTL_mapping 2 Programmatic call inside R
main(
input_gscan_filepath = "./data/GSCAN_SmkCes_2022_GWAS_SUMMARY_STATS_EUR.txt_lt5e-8.txt.gz",
gene_loc_filepath = "./data/snp_to_nearest_gene/ensembl.grch38.gene.loc",
gwas_pvalue_threshold = 5e-8,
minor_allele_frequency = 0.01,
outdir = "./output",
# Function Module Callers
run_snp_to_nearest_gene = TRUE,
run_magma = TRUE,
run_eQTL_mapping = TRUE,
# SNP to Nearest Gene
snps_to_nearest_gene_mode = 'multiple',
# MAGMA File Paths
magma_binary_path = "./MAGMA/magma_v1_10/magm",
bed_bim_fam_filepath = "./bed_bim_bam/grch38_rsid_dbsnp157_no_chr",
gene_annotation_filepath = "./Ensembl_GRCh38_dbSNP151_Annotation.genes.annot",
count_column_N = "N",
snp_column_name = "RSID",
p_value_column = "P",
# eQTL Function Requirements:
eQTL_eGene_directory = "./snp_to_gene_assignment/eqtl/data",
eqtl_gene_column_name = "gene_id",
eqtl_rsid_column_name = "rs_id",
)
``
Tip: set any run_* flag to FALSE to skip that module
Citation:
Please reach out to Matt Lane or Kyle Sullivan for citation.
