aandaleon/SQL-Simplifier

SQL Simplifier

Overview

This repository contains a Python wrapper for SQLite3 to take in parameters from the user and automate the queries to PrediXcan database files, producing .csv files ready for further analysis.

Software Requirements

Downloading the project

Open a terminal session and enter: git clone https://github.com/aandaleon/SQL-Simplifier.git

Input files

Required

  • One database file or one folder containing database files
    • This program is calibrated for GTEx V7 and MESA .dbs

Optional

  • List of genes (Ensembl ids), one per row
  • List of gene names, one per row
  • Flags containing the information you want queried (see Program options)

When the program is run without any query flags or gene lists, it queries every gene in a model and outputs db, gene, genename, cv_R2_avg, rsid, and weight, the metrics most commonly used by the Wheeler Lab. The default output file is SQL_Simplifier_output.csv.
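The default output can be pictured as a join between per-gene metrics and per-SNP weights. The following is a minimal sketch, not the code in master.py: it assumes a PredictDB-style layout with a table named `extra` for gene-level metrics and a table named `weights` for SNP-level weights, and the helper `default_query` and the in-memory toy database are hypothetical names introduced for illustration.

```python
import csv
import io
import sqlite3

DEFAULT_COLUMNS = ["db", "gene", "genename", "cv_R2_avg", "rsid", "weight"]

def default_query(conn, db_name):
    """Return one row per (gene, SNP) pair with the default output columns."""
    sql = (
        "SELECT ? AS db, e.gene, e.genename, e.cv_R2_avg, w.rsid, w.weight "
        "FROM extra AS e JOIN weights AS w ON e.gene = w.gene "
        "ORDER BY e.gene, w.rsid"
    )
    return conn.execute(sql, (db_name,)).fetchall()

# Toy in-memory database standing in for a real .db file (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE extra (gene TEXT, genename TEXT, cv_R2_avg REAL)")
conn.execute("CREATE TABLE weights (gene TEXT, rsid TEXT, weight REAL)")
conn.execute(
    "INSERT INTO extra VALUES ('ENSG00000130203.5', 'APOE', 0.0135600336620361)"
)
conn.executemany(
    "INSERT INTO weights VALUES (?, ?, ?)",
    [
        ("ENSG00000130203.5", "rs2356537", -0.151237337241466),
        ("ENSG00000130203.5", "rs11668687", -0.00241744847729031),
    ],
)

rows = default_query(conn, "toy.db")

# Serialize the rows as CSV text with a header line.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(DEFAULT_COLUMNS)
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Passing the database name as a bound parameter keeps the file of origin attached to every row, which matters once results from several .db files are concatenated into one .csv.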

Example

  • Query a list of genes in a folder of .db files without any query flags (will output into SQL_Simplifier_output.csv)
    • python3 master.py --dbs example_data/ --genenames example_data/genenames.txt
db gene genename cv_R2_avg rsid weight
gtex_v7_Whole_Blood_imputed_europeans_tw_0.5_signif.db ENSG00000130203.5 APOE 0.0135600336620361 rs2356537 -0.151237337241466
gtex_v7_Whole_Blood_imputed_europeans_tw_0.5_signif.db ENSG00000130203.5 APOE 0.0135600336620361 rs11668687 -0.00241744847729031
gtex_v7_Whole_Blood_imputed_europeans_tw_0.5_signif.db ENSG00000130203.5 APOE 0.0135600336620361 rs11673170 -0.00232453795322572
  • Query all genes in a single .db file and their genename, cv_R2_avg, n.snps.in.model, and pred.perf.R2, outputting into gene_info.csv
    • python3 master.py --dbs example_data/AFA_imputed_10_peer_3_pcs_v2.db --genename_col --cv_R2_avg --n.snps.in.model --pred.perf.R2 --out_prefix gene_info
genename n.snps.in.model cv_R2_avg pred.perf.R2
FUCA2 21 0.219505222129763 0.239011989288677
ENPP4 82 0.396811312007829 0.411548308924569
ANKIB1 26 0.0961397809054118 0.0890519595368423
  • Query all genes in a single .db file with cv_R2_avg > 0.1 and their genenames, cv_R2_avg, rsids, weights, outputting into cv_R2_avg_0.1.csv
    • python3 master.py --dbs example_data/gtex_v7_Whole_Blood_imputed_europeans_tw_0.5_signif.db --genename_col --cv_R2_avg --rsid --weight --cv_R2_avg_thres 0.1 --out_prefix cv_R2_avg_0.1
genename cv_R2_avg rsid weight
ISG15 0.154111838799616 rs1058161 -0.11337283758081
ISG15 0.154111838799616 rs11804831 -0.0126092887627783
ISG15 0.154111838799616 rs2477782 -0.0644525079361206
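Restricting a query to a gene list, as in the first example, roughly amounts to reading one name per row and matching those names with an SQL `IN` clause. This is a hedged sketch under assumptions: the table name `extra`, the helpers `read_gene_list` and `query_by_genename`, and the toy rows (values borrowed from the example outputs above and mixed into one table) are illustrative and not taken from master.py.

```python
import sqlite3

def read_gene_list(lines):
    """Strip whitespace and drop blank lines from a one-per-row gene list."""
    return [line.strip() for line in lines if line.strip()]

def query_by_genename(conn, genenames):
    """Return (genename, cv_R2_avg) for the requested gene names."""
    if not genenames:
        return []
    # One "?" placeholder per name keeps the values out of the SQL string.
    placeholders = ", ".join("?" for _ in genenames)
    sql = (
        "SELECT genename, cv_R2_avg FROM extra "
        f"WHERE genename IN ({placeholders}) ORDER BY genename"
    )
    return conn.execute(sql, genenames).fetchall()

# Toy table standing in for the gene-level table of a real .db file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE extra (genename TEXT, cv_R2_avg REAL)")
conn.executemany(
    "INSERT INTO extra VALUES (?, ?)",
    [
        ("APOE", 0.0135600336620361),
        ("ISG15", 0.154111838799616),
        ("FUCA2", 0.219505222129763),
    ],
)

# In practice the lines would come from a file such as genenames.txt.
wanted = read_gene_list(["APOE\n", "\n", " ISG15 \n"])
result = query_by_genename(conn, wanted)
```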

Program options

  • Input/output files

    • --db: path to .db file or folder path you want to query
    • --genes: file containing genes (Ensembl IDs), one per line
    • --genenames: file containing gene names, one per line
    • --out_prefix: output file prefix; will end in .csv
  • Inclusion parameters

    • --db_col: Output the column of .db file of origin.
    • --gene_col: Output the column of genes (Ensembl IDs).
    • --genename_col: Output the column of gene names.
    • --n.snps.in.model: Output the number of SNPs within the cis window that have non-zero weights, as found by elastic net.
    • --test_R2_avg: Output the average coefficient of determination when predicting values of the hold out fold during nested cross validation.
    • --cv_R2_avg: Output the average coefficient of determination for each of the hold out folds when cross-validation was performed on the entire data set.
    • --rho_avg: Output the average correlation between predicted and observed on the hold out folds when doing nested cross-validation.
    • --rho_zscore: Output the transformation of rho_avg into a z-score using Stouffer's Method.
    • --pred.perf.R2: Output the rho_avg squared.
    • --pred.perf.pval: Output the p-value for rho_zscore.
    • --rsid: Output the rsids in the models of queried genes.
    • --varID: Output the variant IDs in the models of queried genes. These are string labels of the format chromosome_position_allele1_allele2_build. All varIDs are from build 37 of the Human Reference Genome.
    • --ref_allele: Output the reference alleles of the SNPs in the models of the queried genes.
    • --eff_allele: Output the effect alleles of the SNPs in the models of the queried genes.
    • --weight: Output the weights of the SNPs in the models of the queried genes.
    • --n_samples: Output the number of samples used to make the .db file.
    • --population: Output the population studied.
    • --tissue: Output the tissue or MESA population from which RNA was sequenced.
  • Filtering parameters

    • --test_R2_avg_thres (default = 0): Restrict the test_R2_avg to values above this threshold.
    • --cv_R2_avg_thres (default = 0): Restrict the cv_R2_avg to values above this threshold.
    • --rho_avg_thres (default = 0): Restrict the rho_avg to values above this threshold.
    • --pred.perf.R2_thres (default = 0): Restrict the pred.perf.R2 to values above this threshold.
    • --pred.perf.pval_thres (default = 1): Restrict the pred.perf.pval to values below this threshold.
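The inclusion and filtering options map naturally onto the SELECT list and WHERE clause of a query. The sketch below is one way this could look, assuming a gene-level table named `extra`; `THRESHOLDS`, `quote`, and `build_query` are hypothetical names, not functions from master.py. Column names containing dots, such as n.snps.in.model, need to be quoted in SQLite.

```python
import sqlite3

# Threshold option -> (column, comparison), mirroring the filtering parameters:
# "above this threshold" becomes ">", "below this threshold" becomes "<".
THRESHOLDS = {
    "test_R2_avg_thres": ("test_R2_avg", ">"),
    "cv_R2_avg_thres": ("cv_R2_avg", ">"),
    "rho_avg_thres": ("rho_avg", ">"),
    "pred.perf.R2_thres": ("pred.perf.R2", ">"),
    "pred.perf.pval_thres": ("pred.perf.pval", "<"),
}

def quote(column):
    """Double-quote an identifier so names with dots are valid in SQLite."""
    return '"' + column.replace('"', '""') + '"'

def build_query(columns, thresholds):
    """Build a parameterized SELECT from requested columns and thresholds."""
    select = ", ".join(quote(c) for c in columns)
    clauses, params = [], []
    for option, value in thresholds.items():
        column, op = THRESHOLDS[option]
        clauses.append(f"{quote(column)} {op} ?")
        params.append(value)
    sql = f"SELECT {select} FROM extra"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params

# Toy table using values from the gene_info example above (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE extra (genename TEXT, "n.snps.in.model" INTEGER, '
    'cv_R2_avg REAL, "pred.perf.R2" REAL)'
)
conn.executemany(
    "INSERT INTO extra VALUES (?, ?, ?, ?)",
    [
        ("FUCA2", 21, 0.219505222129763, 0.239011989288677),
        ("ENPP4", 82, 0.396811312007829, 0.411548308924569),
        ("ANKIB1", 26, 0.0961397809054118, 0.0890519595368423),
    ],
)

sql, params = build_query(["genename", "n.snps.in.model"], {"cv_R2_avg_thres": 0.1})
rows = sorted(conn.execute(sql, params).fetchall())
```

Threshold values are bound as parameters, while column names come only from a fixed mapping or are quoted, so user input never has to be spliced into the SQL text unescaped.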

Project summaries

Quick background and resources

Our project queries information from database files used by the program PrediXcan. PrediXcan predicts gene expression by aggregating precalculated weights, stored in database files, according to an individual's genotype. These weights are calculated in various tissues and cohorts, such as the Genotype-Tissue Expression Project (GTEx) and the Multi-Ethnic Study of Atherosclerosis (MESA), and all public database files are available at predictdb.org. A general layout of database files and descriptions for all information stored is available here. We give users the ability to query information from these models in a simple command line format, without prior knowledge of SQL. For more detail on the context, goals, and milestones of the project, please consult the design document.
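To make the role of the weights concrete, predicted expression for a gene can be sketched as a weighted sum over the SNPs in its model. This is a simplified illustration rather than PrediXcan's actual code: the function `predict_expression` is hypothetical, the weights are the APOE values from the example output above, and the dosages (effect-allele counts between 0 and 2) are made up for the example.

```python
def predict_expression(weights, dosages):
    """Sum weight * dosage over the SNPs in a gene's model.

    SNPs in the model that are missing from the genotype are skipped.
    """
    return sum(w * dosages[rsid] for rsid, w in weights.items() if rsid in dosages)

# Weights for APOE taken from the example output in this README.
apoe_weights = {
    "rs2356537": -0.151237337241466,
    "rs11668687": -0.00241744847729031,
    "rs11673170": -0.00232453795322572,
}

# Hypothetical effect-allele dosages for one individual (illustrative only).
dosages = {"rs2356537": 1.0, "rs11668687": 2.0, "rs11673170": 0.0}

predicted = predict_expression(apoe_weights, dosages)
```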

Authors

This program and documentation were created by BS and MS Bioinformatics students Angela Andaleon, Carlee Bettler, and Sherya Wadhwa for Computational Biology (COMP 383/483) Spring 2019 with Dr. Catherine Putonti at Loyola University Chicago. The original project idea was proposed by Angela Andaleon, Peter Fiorica, Ryan Schubert, and Dr. Heather Wheeler for use by the Wheeler Lab.

References

