This repository reorganizes and restructures scripts from Genetically regulated gene expression underlies lipid traits in Hispanic cohorts to be more user-friendly. All data input (PLINK .bed/.bim/.fam, covariates w/ and w/o IIDs, phenotypes w/ and w/o IIDs, and phenotype names) must be exactly in the same format as the example data, including file names, and all missing data must be replaced with "NA". This pipeline is optimized for genotypes built in hg19 and linear phenotypes, and may not work with case/control (logistic). All paths to softwares are defaulted to those on wheelerlab3, and it is expected that all scripts are run from the same directory. For exact details on the inner workings of each script, use the --help flag or see the wiki. We will be using genotypes from 1000 Genomes American superpopulation (AMR.bed/.bim/.fam) and simulating phenotypes and covariances in R. The input population does not need to be admixed, but local ancestry steps (12-15) are for use mainly in admixed (African-American, Hispanic) populations. You should be able to copy and paste all the commands from this README and everything should run... but will you understand it?
For much more detail on the process of everything in here, please see the wiki, I spent a lot of time writing it :).
- Produce phenotypes and covariates (ex. medicines) in R (for test data only)
Rscript 00_simulate_pheno_covar.R --bfile AMR
-
Perform quality control in PLINK using gwasqc_pipeline
- Test data has already been filtered
-
Calculate principal components and a relationship matrix in KING
- If your data runs into a segmentation fault, try using only a subset of SNPs (ex. restrict to chr. 22)
Rscript 02_relate_matrix_PCs.R --bfile AMR
-
Impute data with the Michigan Imputation Server
- Test data is already imputed
- Follow instructions in 03_Michigan_Imputation_Server.pptx
-
Convert imputed data to PrediXcan dosages using this script.
- Since test data is already imputed in PLINK format, run this (for example data only)
mkdir dosages/; cd dosages/; wget https://raw.githubusercontent.com/hakyimlab/PrediXcan/master/Software/convert_plink_to_dosage.py; awk '{ print $1"\t"$2 }' ../AMR.fam > samples.txt; python convert_plink_to_dosage.py -b ../AMR -p /usr/local/bin/plink; cd ..; echo "Completed converting dosages."
- Since test data is already imputed in PLINK format, run this (for example data only)
-
Perform a genome wide association study in GEMMA
- a. Convert from PrediXcan dosage format to BIMBAM format (GEMMA genotype input)
python 05a_PrediXcan_dosages_to_GEMMA.py --dosage_suffix .txt.gz
- b. Make covariance file from known covariates and KING PCs
Rscript 05b_make_GEMMA_covars.R --pcs_num 5
- c. Run all chrs. in a loop
python 05c_GEMMA_wrapper.py --GWAS
- a. Convert from PrediXcan dosage format to BIMBAM format (GEMMA genotype input)
-
Calculate predicted gene expressions in PrediXcan using GTEx and MESA models
python 06_make_pred_exp.py
-
Perform an imputed transcriptome-based association study in GEMMA
- a. Convert predicted expression to GEMMA-style pseudo-genotypes
python 07a_convert_PrediXcan_to_GEMMA.py
- b. Run all pops. and tissues in a loop
python 05c_GEMMA_wrapper.py --pred_exp
- a. Convert predicted expression to GEMMA-style pseudo-genotypes
-
Find significant SNPs from GWAS and significant genes from PrediXcan in R
python 08_sig_SNP_sig_gene.py --SNP_sig 5e-6 --gene_sig 5e-3- NOTE: I set these thresholds arbitrarilty low so we have significant genes to with with later. Usually, use the defaults, 5e-8 and 9.654e-6, respectively.
-
Calculate independent significant SNPs in a joint analysis in GCTA-COJO
- a. Make GWAS output into GCTA-COJO format
python 09a_GEMMA_to_GCTA-COJO.py --fam AMR.fam
- b. Run GCTA (see manual)
gcta64 --cojo-file pheno1.ma --cojo-slct --cojo-p 5e-6 --bfile AMR --cojo-actual-geno --out pheno1; gcta64 --cojo-file pheno2.ma --cojo-slct --cojo-p 5e-6 --bfile AMR --cojo-actual-geno --out pheno2- Again, set P arbitrarily low for example; set to 5e-8 for real analyses
- a. Make GWAS output into GCTA-COJO format
-
Perform colocalization between GWAS results and eQTL data using COLOC in a COLOC wrapper
- a. Put GWAS and eQTL data into COLOC input format (takes a while due to size of eQTL files)
Rscript 10a_make_COLOC_input.R --sample_size 320
- b. Run COLOC wrapper (takes a while due to size of eQTL files)
python 10b_COLOC_wrapper.py --sample_size 320
- a. Put GWAS and eQTL data into COLOC input format (takes a while due to size of eQTL files)
-
Perform backward elimination modeling of all significant genes to find independent signals in R
Rscript 11_back_elim.R --sig_gene output/pheno1_sig_genes.txt --pheno pheno_wIID.txt --pheno_name pheno1
-
Prepare genotype data with PLINK
- a. Merge genotypes with a reference populations (options are African-American (AFA) and Hispanic (HIS)), restrict to only SNPs in the study population
plink --bfile AMR --bmerge /home/angela/Ad_PX_pipe_data/RFMix/RefPop/HIS --make-bed --out AMR_w_ref --extract AMR.bim; plink --bfile AMR --exclude AMR_w_ref-merge.missnp --make-bed --out AMR_for_merge; plink --bfile /home/angela/Ad_PX_pipe_data/RFMix/RefPop/HIS --exclude AMR_w_ref-merge.missnp --make-bed --out HIS_for_merge --extract AMR.bim; plink --bfile AMR_for_merge --bmerge HIS_for_merge --make-bed --out AMR_w_ref --allow-no-sex
- b. Order genotype file by population and add cM positions
cat HIS_for_merge.fam AMR_for_merge.fam > merged.fam; plink --bfile AMR_w_ref --indiv-sort file merged.fam --cm-map /home/angela/Ad_PX_pipe_data/cm_map/genetic_map_chr@_combined_b37.txt --make-bed --out AMR_w_ref_ordered
- c. Split input by chr
mkdir -p merged_w_ref/; for i in {1..22}; do plink --bfile AMR_w_ref_ordered --chr ${i} --make-bed --out merged_w_ref/chr${i}; done
- a. Merge genotypes with a reference populations (options are African-American (AFA) and Hispanic (HIS)), restrict to only SNPs in the study population
-
Phase haplotypes with HAPI-UR. Change value of
-wdepending on parameters described in the manualmkdir -p haplotypes/; for i in {1..22}; do /home/angela/Ad_PX_pipe_data/HAPI-UR/hapi-ur -p merged_w_ref/chr${i} -w 64 -o haplotypes/chr${i}; done
-
Calculate local ancestry inference in RFMix
- a. Make additional files for RFMix input. When making classes, choose a reference population (options are African-American (AFA) and Hispanic (HIS)).
for i in {1..22}; do awk '{print $3}' haplotypes/chr${i}.phsnp > haplotypes/chr${i}.snp_locations; done; Rscript 14a_make_classes.R haplotypes/chr22.phind HIS- Note: the sample data is rather small, so some chromosomes may not output
- b. Run RFMix to estimate local ancestry for all individuals. This process takes a very long time, so we will only run chr. 22 for this example. Push chrs. to multiple cores when running real data.
mkdir -p RFMix/; cd /home/angela/Ad_PX_pipe_data/RFMix/; python RunRFMix.py -e 2 -w 0.2 --num-threads 10 --use-reference-panels-in-EM --forward-backward PopPhased /home/angela/Ad_PX_pipe/haplotypes/chr22.phgeno /home/angela/Ad_PX_pipe/RFMix.classes /home/angela/Ad_PX_pipe/haplotypes/chr22.snp_locations -o /home/angela/Ad_PX_pipe/RFMix/chr22.rfmix
- a. Make additional files for RFMix input. When making classes, choose a reference population (options are African-American (AFA) and Hispanic (HIS)).
-
Perform admixture mapping in GEMMA
- a. Convert RFMix output into an intermediate for GEMMA input downstream
python 15a_RFMix_to_BIMBAM.py --Viterbi RFMix/chr22.rfmix.2.Viterbi.txt --phind haplotypes/chr22.phind --fam AMR.fam --phsnp haplotypes/chr22.phsnp --chr 22
- b. Run phenotype associations with each local ancestry in GEMMA
python 15b_loc_anc_wrapper.py --snptable loc_anc_input/chr22.csv --refpop HIS --chr 22
- a. Convert RFMix output into an intermediate for GEMMA input downstream