Skip to content

Jacobson-CompSysBio/SNPtoGene

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SNP-to-Gene Mapping Utilities (R)

This repository provides an end‑to‑end R workflow that converts raw GWAS summary statistics into gene‑level association tables and optional eQTL annotations. The pipeline is organised in three modules that can be run independently or through a single orchestration script:

  1. Summary‑statistics preprocessing: filter by p‑value & minor‑allele‑frequency (MAF) and convert to BED.

  2. SNP‑to‑gene assignment: annotate each sentinel SNP with its in‑gene/nearest genes.

  1. Gene‑level analysis & annotation
    • MAGMA – aggregate SNP p‑values into SNP to Gene assignment.
    • eQTL look‑up (optional) – intersect sentinel SNPs with user‑supplied eQTL catalogues.

Features

GWAS preprocessing

  • Reads any tab‑delimited summary file; column names are user‑configurable.
  • Supports gzip‑compressed input.
  • Exports a BED‑formatted version for fast interval joins.

SNP to Gene Assignment:

SNP‑to‑gene mapping (snp_to_gene.R)

  • Overlap – report gene(s) whose bodies overlap each SNP.
  • Proximity – optionally include nearest upstream & downstream genes.
  • Strand‑aware and distance‑aware mapping.

MAGMA wrapper (magma.R)

  • Thin wrapper around MAGMA ≥ v1.10.
  • Accepts standard bed/bim/fam reference panels and gene‑annotation files.
  • Returns a tidy data.table for direct plotting or enrichment analysis.

eQTL utilities (eqtl.R)

  • Scans all TSV/CSV/TSV.GZ files inside a directory.
  • Extracts rows whose rsid matches the sentinel SNP list.
  • Strips Ensembl version suffixes (.1, .2, …) and collapses multiple hits per gene.

Summarization:

  • For each included method, all method data are summarized to an overall summary file output.

Repository Structure:

File Purpose
main.R Orchestrates the whole workflow; see parameters below.
file_helpers.R I/O helpers, mainly load_input_gscan_data() & load_assembly_file().
snp_to_gene.R Interval & nearest‑gene lookup functions.
magma.R Wrapper around MAGMA binary.
eqtl.R Directory‑based eQTL mapper.

Installation

Dependencies

  • R (≥ 4.0)
  • Required R packages:
    • bedtoolsr
    • data.table
    • dplyr
    • stringr

Install them via:

install.packages(c("data.table", "dplyr", "stringr", "remotes"))
remotes::install_github("phillipburgess/bedtoolsr")

Both MAGMA and bedtools need to be installed separately.

How to Run

Argument Default Description
run_snp_to_nearest_gene TRUE Toggle SNP‑gene mapping step.
run_magma TRUE Toggle MAGMA step.
run_eQTL_mapping TRUE Toggle eQTL look‑up step.
input_gscan_filepath required GWAS summary statistics file.
gene_loc_filepath required BED‑like file with gene coordinates (GRCh38).
gwas_pvalue_threshold 5.00E-08 Genome‑wide significance cut‑off.
minor_allele_frequency 0.01 MAF threshold.
outdir "./output" Folder to write all results.
snps_to_nearest_gene_mode "multiple" "single" or "multiple".
magma_binary_path "./magma" Path to MAGMA executable.
bed_bim_fam_filepath NULL Prefix to reference panel (without extension).
gene_annotation_filepath NULL MAGMA‑formatted gene annotation file.
count_column_N "N" Sample‑size column in GWAS file.
snp_column_name "RSID" Column with SNP IDs.
p_value_column "P" Column with p‑values.
eQTL_eGene_directory NULL Directory containing eQTL files.

To run the program, users need to supply GWAS summary statics files, and a bed-like file with gene coordinates.


Example Usage

1  Command‑line invocation

Rscript inst/scripts/snp_to_gene.R \
   --input_gscan_filepath   \ 
    --assembly_filepath  \ 
    --gwas_pvalue_threshold  \ 
    --minor_allele_frequency  \ 
    --outdir  \ 
    --snps_to_nearest_gene_mode  \ 
    --magma_binary_path  \ 
    --bed_bim_fam_filepath  \ 
    --gene_annotation_filepath  \ 
    --count_column_N  \ 
    --snp_column_name  \ 
    --p_value_column  \ 
    --eQTL_eGene_directory  \ 
    --eqtl_gene_column_name  \ 
    --eqtl_rsid_column_name  \ 
    --run_snp_to_nearest_gene  \ 
    --run_magma  \ 
    --run_eQTL_mapping  

2  Programmatic call inside R

main(
    input_gscan_filepath = "./data/GSCAN_SmkCes_2022_GWAS_SUMMARY_STATS_EUR.txt_lt5e-8.txt.gz",
    gene_loc_filepath =  "./data/snp_to_nearest_gene/ensembl.grch38.gene.loc",
    gwas_pvalue_threshold = 5e-8,
    minor_allele_frequency = 0.01,
    outdir = "./output",

    # Function Module Callers
    run_snp_to_nearest_gene = TRUE,
    run_magma = TRUE,
    run_eQTL_mapping = TRUE,

    # SNP to Nearest Gene 
    snps_to_nearest_gene_mode = 'multiple',
    
    # MAGMA File Paths
    magma_binary_path = "./MAGMA/magma_v1_10/magm",
    bed_bim_fam_filepath = "./bed_bim_bam/grch38_rsid_dbsnp157_no_chr",
    gene_annotation_filepath = "./Ensembl_GRCh38_dbSNP151_Annotation.genes.annot",
    count_column_N = "N",
    snp_column_name = "RSID",
    p_value_column =  "P",

    # eQTL Function Requirements: 
    eQTL_eGene_directory = "./snp_to_gene_assignment/eqtl/data",
    eqtl_gene_column_name = "gene_id",
    eqtl_rsid_column_name = "rs_id",
)

``
Tip: set any run_* flag to FALSE to skip that module

Citation: 

Please reach out to Matt Lane or Kyle Sullivan for citation. 

About

This repository contains a codebase with multiple options for calculating SNP to Gene Assignments via linear proximity, MAGMA, and eQTL assignment.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages