Skip to content

aandaleon/Ad_PX_pipe

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

205 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Ad_PX_pipe

This repository reorganizes and restructures scripts from Genetically regulated gene expression underlies lipid traits in Hispanic cohorts to be more user-friendly. All data input (PLINK .bed/.bim/.fam, covariates w/ and w/o IIDs, phenotypes w/ and w/o IIDs, and phenotype names) must be exactly in the same format as the example data, including file names, and all missing data must be replaced with "NA". This pipeline is optimized for genotypes built in hg19 and linear phenotypes, and may not work with case/control (logistic). All paths to softwares are defaulted to those on wheelerlab3, and it is expected that all scripts are run from the same directory. For exact details on the inner workings of each script, use the --help flag or see the wiki. We will be using genotypes from 1000 Genomes American superpopulation (AMR.bed/.bim/.fam) and simulating phenotypes and covariances in R. The input population does not need to be admixed, but local ancestry steps (12-15) are for use mainly in admixed (African-American, Hispanic) populations. You should be able to copy and paste all the commands from this README and everything should run... but will you understand it?

For much more detail on the process of everything in here, please see the wiki, I spent a lot of time writing it :).

  1. Produce phenotypes and covariates (ex. medicines) in R (for test data only)
    • Rscript 00_simulate_pheno_covar.R --bfile AMR

For use in all populations:

  1. Perform quality control in PLINK using gwasqc_pipeline

    • Test data has already been filtered
  2. Calculate principal components and a relationship matrix in KING

    • If your data runs into a segmentation fault, try using only a subset of SNPs (ex. restrict to chr. 22)
    • Rscript 02_relate_matrix_PCs.R --bfile AMR
  3. Impute data with the Michigan Imputation Server

    • Test data is already imputed
    • Follow instructions in 03_Michigan_Imputation_Server.pptx
  4. Convert imputed data to PrediXcan dosages using this script.

    • Since test data is already imputed in PLINK format, run this (for example data only)
      • mkdir dosages/; cd dosages/; wget https://raw.githubusercontent.com/hakyimlab/PrediXcan/master/Software/convert_plink_to_dosage.py; awk '{ print $1"\t"$2 }' ../AMR.fam > samples.txt; python convert_plink_to_dosage.py -b ../AMR -p /usr/local/bin/plink; cd ..; echo "Completed converting dosages."
  5. Perform a genome wide association study in GEMMA

    • a. Convert from PrediXcan dosage format to BIMBAM format (GEMMA genotype input)
      • python 05a_PrediXcan_dosages_to_GEMMA.py --dosage_suffix .txt.gz
    • b. Make covariance file from known covariates and KING PCs
      • Rscript 05b_make_GEMMA_covars.R --pcs_num 5
    • c. Run all chrs. in a loop
      • python 05c_GEMMA_wrapper.py --GWAS
  6. Calculate predicted gene expressions in PrediXcan using GTEx and MESA models

    • python 06_make_pred_exp.py
  7. Perform an imputed transcriptome-based association study in GEMMA

    • a. Convert predicted expression to GEMMA-style pseudo-genotypes
      • python 07a_convert_PrediXcan_to_GEMMA.py
    • b. Run all pops. and tissues in a loop
      • python 05c_GEMMA_wrapper.py --pred_exp
  8. Find significant SNPs from GWAS and significant genes from PrediXcan in R

    • python 08_sig_SNP_sig_gene.py --SNP_sig 5e-6 --gene_sig 5e-3
      • NOTE: I set these thresholds arbitrarilty low so we have significant genes to with with later. Usually, use the defaults, 5e-8 and 9.654e-6, respectively.
  9. Calculate independent significant SNPs in a joint analysis in GCTA-COJO

    • a. Make GWAS output into GCTA-COJO format
      • python 09a_GEMMA_to_GCTA-COJO.py --fam AMR.fam
    • b. Run GCTA (see manual)
      • gcta64 --cojo-file pheno1.ma --cojo-slct --cojo-p 5e-6 --bfile AMR --cojo-actual-geno --out pheno1; gcta64 --cojo-file pheno2.ma --cojo-slct --cojo-p 5e-6 --bfile AMR --cojo-actual-geno --out pheno2
        • Again, set P arbitrarily low for example; set to 5e-8 for real analyses
  10. Perform colocalization between GWAS results and eQTL data using COLOC in a COLOC wrapper

    • a. Put GWAS and eQTL data into COLOC input format (takes a while due to size of eQTL files)
      • Rscript 10a_make_COLOC_input.R --sample_size 320
    • b. Run COLOC wrapper (takes a while due to size of eQTL files)
      • python 10b_COLOC_wrapper.py --sample_size 320
  11. Perform backward elimination modeling of all significant genes to find independent signals in R

    • Rscript 11_back_elim.R --sig_gene output/pheno1_sig_genes.txt --pheno pheno_wIID.txt --pheno_name pheno1

For use in admixed populations:

  1. Prepare genotype data with PLINK

    • a. Merge genotypes with a reference populations (options are African-American (AFA) and Hispanic (HIS)), restrict to only SNPs in the study population
      • plink --bfile AMR --bmerge /home/angela/Ad_PX_pipe_data/RFMix/RefPop/HIS --make-bed --out AMR_w_ref --extract AMR.bim; plink --bfile AMR --exclude AMR_w_ref-merge.missnp --make-bed --out AMR_for_merge; plink --bfile /home/angela/Ad_PX_pipe_data/RFMix/RefPop/HIS --exclude AMR_w_ref-merge.missnp --make-bed --out HIS_for_merge --extract AMR.bim; plink --bfile AMR_for_merge --bmerge HIS_for_merge --make-bed --out AMR_w_ref --allow-no-sex
    • b. Order genotype file by population and add cM positions
      • cat HIS_for_merge.fam AMR_for_merge.fam > merged.fam; plink --bfile AMR_w_ref --indiv-sort file merged.fam --cm-map /home/angela/Ad_PX_pipe_data/cm_map/genetic_map_chr@_combined_b37.txt --make-bed --out AMR_w_ref_ordered
    • c. Split input by chr
      • mkdir -p merged_w_ref/; for i in {1..22}; do plink --bfile AMR_w_ref_ordered --chr ${i} --make-bed --out merged_w_ref/chr${i}; done
  2. Phase haplotypes with HAPI-UR. Change value of -w depending on parameters described in the manual

    • mkdir -p haplotypes/; for i in {1..22}; do /home/angela/Ad_PX_pipe_data/HAPI-UR/hapi-ur -p merged_w_ref/chr${i} -w 64 -o haplotypes/chr${i}; done
  3. Calculate local ancestry inference in RFMix

    • a. Make additional files for RFMix input. When making classes, choose a reference population (options are African-American (AFA) and Hispanic (HIS)).
      • for i in {1..22}; do awk '{print $3}' haplotypes/chr${i}.phsnp > haplotypes/chr${i}.snp_locations; done; Rscript 14a_make_classes.R haplotypes/chr22.phind HIS
      • Note: the sample data is rather small, so some chromosomes may not output
    • b. Run RFMix to estimate local ancestry for all individuals. This process takes a very long time, so we will only run chr. 22 for this example. Push chrs. to multiple cores when running real data.
      • mkdir -p RFMix/; cd /home/angela/Ad_PX_pipe_data/RFMix/; python RunRFMix.py -e 2 -w 0.2 --num-threads 10 --use-reference-panels-in-EM --forward-backward PopPhased /home/angela/Ad_PX_pipe/haplotypes/chr22.phgeno /home/angela/Ad_PX_pipe/RFMix.classes /home/angela/Ad_PX_pipe/haplotypes/chr22.snp_locations -o /home/angela/Ad_PX_pipe/RFMix/chr22.rfmix
  4. Perform admixture mapping in GEMMA

    • a. Convert RFMix output into an intermediate for GEMMA input downstream
      • python 15a_RFMix_to_BIMBAM.py --Viterbi RFMix/chr22.rfmix.2.Viterbi.txt --phind haplotypes/chr22.phind --fam AMR.fam --phsnp haplotypes/chr22.phsnp --chr 22
    • b. Run phenotype associations with each local ancestry in GEMMA
      • python 15b_loc_anc_wrapper.py --snptable loc_anc_input/chr22.csv --refpop HIS --chr 22

About

Pipeline for performing GWAS and PrediXcan in admixed populations, finding independently associated and eQTL colocalized signals, and mapping admixture using local ancestry estimations

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors