Single Variant Association -- Mixed Models

Description

This workflow implements a single variant association analysis of genotype data with a single trait using mixed models. The primary code is written in R using the GENESIS package for model fitting and association testing. The workflow can either generate a null model from traits and covariates and a genetic related matrix or use a pre-generated null model obtained from GENESIS.

Also included is a workflow for regenerating the final summary results of the single variant association pipeline. This summary workflow takes a CSV file of association results (with the same format of the final summary output) and regenerates plots and output files.

Authors

This workflow is produced and maintained by the Manning Lab. Contributing authors include:

Tim Majarian
Jasen Jackson
Alisa Manning

Dependencies

Workflow execution

WDL
Cromwell

R packages

All code and necessary packages are available through a Docker image as well as through the Github repository.

Main Functions

preprocess_conditional.R

This script preprocesses the phenotype file for a conditional analysis on some snps, extracting the genotype information and adding to the phenotypes for each sample.

Inputs:

genotype_files : comma separated list of gds file paths, this list must contain gds files for every snp (file, GDS)
phenotype_file : phenotype file to be edited (file, CSV/TSV)
id_col : column name in the phenotype file that holds the individual sample ids (string)
sample_file : text file with list of sample ids to include, one per line (file, TXT)
snps : comma separated list with the form chr_number:position,chr_number:position (string)

Outputs :

new_phenotype_file : phenotype.data input with appended dosage columns for snps of interest (.csv)
alt_ref : text file with alternate and reference alleles for each snp of interest (.txt)
log_file : log file containing paths to inputs and memory and cpu usage

genesis_nullmodel.R

This script generates a null model to be used in association testing in Genesis

Inputs:

genotype_files : genotype data for all samples (array of VCF or GDS file)
phenotype_file : phenotype data for all samples to be included in analysis (CSV or TSV file)
outcome_name : the outcome to be tested (string)
outcome_type : the type of outcome being tested (dichotomous or continuous)
covariates_string : covariates to condition on in linear mixed modeling (comma separated string, default = '')
conditional_string : variant to condition on in string format (chr#:position)
ivars_string : comma separated list of interaction variants found in phenotype file (string)
group_var : variable for running with heterogeneous variances (string)
sample_file : a file containing a list of sample ids (matching the genotype and phenotype files) to be included, one per line (.txt)
label : prefix for output filename (string)
kinship_file : relatedness measures for all samples (CSV or TSV file)
id_col : column name of id column in phenotype file (string)

Outputs:

model : generated null model (.RDa)
log_file : log file containing paths to inputs and memory and cpu usage

association.R

This script performs an association test to generate p-values for each variant included.

Inputs:

gds_file : a genotype file containing data for all samples are variants to be tested (file, GDS)
null_file : output of the task fitNull or a pregenerated null model (file, RDa)
label : prefix for output filename (string)
test : statistical test (Score or Wald, default = Score)
ivars_string : comma separated list of interaction variants found in phenotype file (string)
mac : minimum minor allele count for variants to be included in analysis (int, default = 5)
variant_range : A tab separated file of variant ranges to restrict to, one per line (TSV)

Outputs:

assoc : an RData file of associations results (RData)
log_file : log file containing paths to inputs and memory and cpu usage

summary.R

Generate a summary of association results including quantile-quantile and manhattan plots for variants subseted by minor allele frequency (all variants, maf < 5%, maf >= 5%). Also generates CSV files of all variants and variants with P < pval_threshold.

Inputs:

pval_threshold : p-value threshold for the returning top associations, top association output will include only variants with a p-value less than the threshold (float, default = 0.0001)
label : prefix for output filename (string)
assoc : output of assocTest (Array[RData])

Outputs:

allassoccsv : association results for all variants (CSV)
topassoccsv : association results for variants passing pvalue threshold (CSV)
plots : manhatten and QQ plots (PNG)
log_file : log file containing paths to inputs and memory and cpu usage

Other workflow inputs

this_fitNull_memory : amount of memory in GB for fitNull task (int)
this_assocTest_memory : amount of memory in GB for assocTest task (int)
this_summary_memory : amount of memory in GB for summary task (int)
this_disk : amount of disk space in GB to allot for each execution of a task (int)

Workflow Outputs

The main outputs of the single variant association workflow come from summary.R. Two CSV files are produced:

All association results - analysis results for every variant tested
Top association results - analysis results for only those variants passing a pvalue threshold

Each of these files have a single row for each variant included (multi-allelics are split with a single row per alternate allele). Each file has a minimum set of columns with a few extra columns that depend on the type of association test performed. The columns that are always present in both output CSV files are:

1. MarkerName: a unique identifier of the variant with the format: `chromosome-position-reference_allele-alternate_allele`, ex: `chr1-123456-A-C` (string)
2. chr: chromosome of the variant (string)
3. pos: position of the variant (int)
4. ref: reference allele of the variant as encoded in the input GDS file (string)
5. alt: alternate allele of the variant as encoded in the input GDS file (string)
6. minor.allele: allele with lower maf between ref and alt
7. maf: frequency of `minor.allele` in sample tested for association (float)
8. pvalue: pvalue generated in association analysis (float)
9. n: number of samples used in generated association statistics (int)
10. mac: minor allele count

If the trait of interest is dichotomous:

11. homref.case: number of homozygous reference cases (int)
12. homref.control: number of homozygous reference controls (int)
13. het.case: number of heterozygous reference cases (int)
14. het.control: number of heterozygous reference controls (int)
15. homalt.case: number of homozygous alternate cases (int)
16. homalt.control: number of homozygous alternate controls (int)

If the statistical test is the Score test:

16. Score.stat: score statistic of the variant (float)

If the statistical test is the Wald test:

9. Wald.Stat: Wald statistic of the variant (float)

Below is an example output file using 1000 genomes data and simulated phenotypes. Note that case/control genotype counts is only calculated for variants with minor allele frequency < 5% and pvalue < 0.01. (maf, pvalue, and Score.Stat have been truncated for display only)

MarkerName	chr	pos	ref	alt	minor.allele	maf	pvalue	n	Score.Stat	homref.case	homref.control	het.case	het.control	homalt.case	homalt.control
chr22-16052271-G-A	22	16052271	G	A	alt	0.0075878	0.0598677	2504	3.541045
chr22-16052986-C-A	22	16052986	C	A	alt	0.0740814	0.0013632	2504	10.25490
chr22-16911246-C-T	22	16911246	C	T	alt	0.0229632	0.0049379	2504	7.902014	485	1909	11	94	0	5
chr22-17008128-C-T	22	17008128	C	T	alt	0.0021964	0.5074979	2504	0.439222
chr22-17008980-C-T	22	17008980	C	T	alt	0.0658945	0.7537240	2504	0.098428
chr22-17009923-G-C	22	17009923	G	C	alt	0.0119808	0.0024070	2504	9.209963	475	1969	21	39	0	0

summary.R also produces a PNG file with Manhattan and quantile-quantile plots. Three plots are shown in this image (from top-left):

QQ plot of all variants
QQ plot of variants divided into Common variants with MAF >= 5% and Low frequency and rare variants with MAF < 5%. Genomic control is shown at top left of each plot.
Manhattan plot of all variants with nominal (1e-5) and genomewide (5e-8) significance thresholds in blue and red lines respectively.

Name		Name	Last commit message	Last commit date
Latest commit History 130 Commits
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
association.R		association.R
genesis_nullmodel.R		genesis_nullmodel.R
preprocess_conditional.R		preprocess_conditional.R
singleVariantAssociation.wdl		singleVariantAssociation.wdl
summary.R		summary.R
summaryCSV.R		summaryCSV.R
summaryCSV.wdl		summaryCSV.wdl
test_inputs.json		test_inputs.json
test_inputs_chr1.json		test_inputs_chr1.json
test_inputs_variant_range.json		test_inputs_variant_range.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Single Variant Association -- Mixed Models

Description

Authors

Dependencies

Workflow execution

R packages

Main Functions

preprocess_conditional.R

genesis_nullmodel.R

association.R

summary.R

Other workflow inputs

Workflow Outputs

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Single Variant Association -- Mixed Models

Description

Authors

Dependencies

Workflow execution

R packages

Main Functions

preprocess_conditional.R

genesis_nullmodel.R

association.R

summary.R

Other workflow inputs

Workflow Outputs

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages