This repository contains a Python wrapper for SQLite3 to take in parameters from the user and automate the queries to PrediXcan database files, producing .csv files ready for further analysis.
- Linux
- Python 3.6.7 with the libraries:
Open a terminal session and enter: git clone https://github.com/aandaleon/SQL-Simplifier.git
- One database file or one folder containing database files
- This program is calibrated for GTEx V7 and MESA .dbs
- List of genes (Ensembl ids), one per row
- List of gene names, one per row
- Flags containing the information you want queried (see Program options)
When the program is run without any parameters or gene lists, it will query every gene in a model and output db, gene, genename, cv_R2_avg, rsid, and weight, the most common metrics utilized by the Wheeler lab. Default output will be in SQL_Simplifier_output.csv
- Query a list of genes in a folder of .db files without any query flags (will output into
SQL_Simplifier_output.csv)python3 master.py --dbs example_data/ --genenames example_data/genenames.txt
| db | gene | genename | cv_R2_avg | rsid | weight |
|---|---|---|---|---|---|
| gtex_v7_Whole_Blood_imputed_europeans_tw_0.5_signif.db | ENSG00000130203.5 | APOE | 0.0135600336620361 | rs2356537 | -0.151237337241466 |
| gtex_v7_Whole_Blood_imputed_europeans_tw_0.5_signif.db | ENSG00000130203.5 | APOE | 0.0135600336620361 | rs11668687 | -0.00241744847729031 |
| gtex_v7_Whole_Blood_imputed_europeans_tw_0.5_signif.db | ENSG00000130203.5 | APOE | 0.0135600336620361 | rs11673170 | -0.00232453795322572 |
- Query all genes in a single .db file and their
genename,cv_R2_avg,n.snps.in.model, andpred.perf.R2, outputting intogene_info.csvpython3 master.py --dbs example_data/AFA_imputed_10_peer_3_pcs_v2.db --genename_col --cv_R2_avg --n.snps.in.model --pred.perf.R2 --out_prefix gene_info
| genename | n.snps.in.model | cv_R2_avg | pred.perf.R2 |
|---|---|---|---|
| FUCA2 | 21 | 0.219505222129763 | 0.239011989288677 |
| ENPP4 | 82 | 0.396811312007829 | 0.411548308924569 |
| ANKIB1 | 26 | 0.0961397809054118 | 0.0890519595368423 |
- Query all genes in a single .db file with cv_R2_avg > 0.1 and their genenames, cv_R2_avg, rsids, weights, outputting into
cv_R2_avg_0.1.csvpython3 master.py --dbs example_data/gtex_v7_Whole_Blood_imputed_europeans_tw_0.5_signif.db --genename_col --cv_R2_avg --rsid --weight --cv_R2_avg_thres 0.1 --out_prefix cv_R2_avg_0.1
| genename | cv_R2_avg | rsid | weight |
|---|---|---|---|
| ISG15 | 0.154111838799616 | rs1058161 | -0.11337283758081 |
| ISG15 | 0.154111838799616 | rs11804831 | -0.0126092887627783 |
| ISG15 | 0.154111838799616 | rs2477782 | -0.0644525079361206 |
-
Input/output files
--db: path to .db file or folder path you want to query--genes: file containing gene (Ensembl IDs) separated by line--genenames: file containing gene names separated by line--out_prefix: output file prefix; will end in .csv
-
Inclusion parameters
--db_col: Output the column of .db file of origin.--gene_col: Output the column of genes (Ensembl IDs).--genename_col: Output the column of gene names.--n.snps.in.model: Output the number of SNPs within the cis window that have non-zero weights, as found by elastic net.--test_R2_avg: Output the average coefficient of determination when predicting values of the hold out fold during nested cross validation.--cv_R2_avg: Output the average coefficient of determination for each of the hold out folds when cross-validation was performed on the entire data set.--rho_avg: Output the average correlation between predicted and observed on the hold out folds when doing nested cross-validation.--rho_zscore: Output the transformation of rho_avg into a z-score using Stouffer's Method.--pred.perf.R2: Output the rho_avg squared.--pred.perf.pval: Output the p-value for rho_zscore.--rsid: Output the rsids in the models of queried genes.--varID: Output the variant IDs in the models of queried genes. These are string labels of the format chromosome_position_allele1_allele2_build. All varIDs are from build 37 of the HUman Reference Genome.--ref_allele: Output the reference alleles of the SNPs in the models of the queried genes.--eff_allele: Output the effect alleles of the SNPs in the models of the queried genes.--weight: Output the effect alleles of the SNPs in the models of the queried genes.--n_samples: Output the number of samples used the make the .db file.--population: Output the population studied.--tissue: Output the tissue or MESA population from which RNA was sequenced.
-
Filtering parameters
--test_R2_avg_thres(default = 0): Restrict the test_R2_avg to values above this threshold.--cv_R2_avg(default = 0): Restrict the cv_R2_avg to values above this threshold.--rho_avg_thres(default = 0): Restrict the rho_avg to values above this threshold.--pred.perf.R2_thres(default = 0): Restrict the test_R2_avg to values above this threshold.--pred.perf.pval_thres(default = 1): Restrict the pred_perf_pval to values below this threshold.
Our project queries information from database files used by the program PrediXcan. PrediXcan predicts gene expression by aggregate precalculated weights based on an individual's genotype that are stored in database files. These weights are calculated in various tissues and cohorts, such as the Genotype-Tissue Expression Project (GTEx) and the Multi-Ethnic Study of Atherosclerosis, and all public database files are available at predictdb.org. A general layout of database files and descriptions for all information stored is available here. We give the users the ability to query information from these models without prior knowledge of SQL and in a simple command line format. For more detail on the context, goals, and milestones of the project, please consult the design document.
This program and documentation were created by BS and MS Bioinformatics students Angela Andaleon, Carlee Bettler, and Sherya Wadhwa for Computational Biology (COMP 383/483) Spring 2019 with Dr. Catherine Putonti at Loyola University Chicago. The original project idea was proposed by Angela Andaleon, Peter Fiorica, Ryan Schubert, and Dr. Heather Wheeler for use by the Wheeler Lab.
- Gamazon ER‡, Wheeler HE‡, Shah KP‡, Mozaffari SV, Aquino-Michaels K, Carroll RJ, Eyler AE, Denny JC, GTEx Consortium, Nicolae DL, Cox NJ, Im HK. (2015) A gene-based association method for mapping traits using reference transcriptome data. Nature Genetics 47(9):1091-8. ‡Contributed equally.
- GTEx Consortium. (2013) The Genotype-Tissue Expression (GTEx) project. Nature Genetics 45, 580–585.
- Mogil LS, Andaleon A, Badalamenti A, Dickinson SP, Guo X, Rotter JI, Johnson WC, Im HK, Liu Y, Wheeler HE. (2018) Genetic architecture of gene expression traits across diverse populations. PLOS Genetics 14(8):e1007586.