
ML-PredictDB


In this study we sought to optimize protein-coding gene expression imputation performance across global populations in comparison to the popular transcriptome tool PrediXcan. Specifically, we used two non-linear machine learning (ML) models, random forest (RF) and K-nearest neighbors (KNN), and a model combining linear and non-linear components, support vector regression (SVR), to build transcriptome imputation models, and evaluated their performance against a linear ML model, elastic net (EN), the algorithm used in PrediXcan. We trained gene expression prediction models using genotype and blood monocyte transcriptome data from the Multi-Ethnic Study of Atherosclerosis (MESA), comprising individuals of African (AFA), Hispanic (HIS), and European (CAU) ancestries, and tested them using genotype and whole blood transcriptome data from the Modeling the Epidemiologic Transition Study (METS), comprising individuals of African ancestry. We show that prediction performance is highest when the training and testing populations share similar ancestries, regardless of the prediction algorithm used. While EN generally outperformed RF, SVR, and KNN, we found that RF outperforms EN for some genes, particularly between disparate ancestries, suggesting potential robustness and reduced variability of RF imputation performance across global populations. When applied to a high-density lipoprotein (HDL) phenotype, we show that including RF prediction models in PrediXcan reveals potential gene associations missed by EN models. Therefore, by integrating other machine learning models into PrediXcan and diversifying our training populations to include more global ancestries, we may uncover new genes associated with complex traits.

Model Training and Testing


Model training and testing were done in two separate parts. First, the elastic net (EN) models were trained and tested as described in Mogil et al., 2018. Second, the non-linear machine learning (ML) models do not produce weight/beta values the way a traditional regularized linear regression model like EN does. Those weights would have allowed us to build a DB file containing the weight values of the eQTLs in each gene model and run model testing at any later time; without them, we were limited to training and testing at the same time, in the same script. Accordingly, the ML models other than EN were built as described below:

  1. We conducted the first round of training and testing in the training population via five-fold cross-validation. Specifically, we used the 00_gridsearch_model.py script to run a grid search over each algorithm's hyperparameters and determine the hyperparameter value, or combination of values, that yields the optimal five-fold cross-validated imputation performance (R2) for each gene. The files containing the optimal hyperparameter values and R2 for each gene in each training population are KNN_optimum_hyperparameter.txt, SVR_optimum_hyperparameter.txt, and RF_optimum_hyperparameter.txt.
  2. We then tested the trained models in the testing populations (MESA and METS). Because non-linear ML models do not generate weight/beta values that could be stored as DB files and used to predict expression in a new test set, a "trained model" in this context refers in practice to the optimum hyperparameter files from step 1, where the optimal hyperparameter values are stored (the linear model component of SVR is included here as well). Therefore, to use these other ML models to predict expression in a new test set, we first fit each algorithm, for each gene, with the training data and the optimal hyperparameter values learned for that gene in step 1, and then use the fitted model to predict expression in the new test set. This second step is carried out by the 01_model_testing_in_METS.py and 02_model_imputation_in_MESA.py scripts.
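The per-gene grid search in step 1 can be sketched with scikit-learn's GridSearchCV. This is a minimal illustration, not the study's actual code: the simulated genotype/expression data and the hyperparameter grid shown here are assumptions; the real grid lives in 00_gridsearch_model.py.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Stand-in data: rows are individuals, columns are cis-SNP dosages for one gene
X_train = rng.integers(0, 3, size=(100, 20)).astype(float)
y_train = X_train[:, 0] * 0.5 + rng.normal(size=100)  # simulated expression

# Hypothetical hyperparameter grid; the study's actual grid may differ
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,          # five-fold cross-validation, as in step 1
    scoring="r2",  # imputation performance metric (R2)
)
search.fit(X_train, y_train)

# The best hyperparameters and cross-validated R2 are what get written
# to the per-gene rows of RF_optimum_hyperparameter.txt
print(search.best_params_, search.best_score_)
```

In the study this loop runs once per gene and per algorithm (RF, KNN, SVR), with the winning hyperparameters recorded to the corresponding optimum hyperparameter file.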

Model Usage and Caveats


Two categories of models were generated in this study that can be used to predict expression in a new test set: a linear ML model, elastic net (EN), and the other ML models combining linear and non-linear approaches (RF, KNN, and SVR).

  • If you are interested in using the models built with EN, you can access them at PredictDB and run prediction following the instructions stated there.
  • If you want to use the other ML models with the MESA training data as used in this study, use the optimum hyperparameter files KNN_optimum_hyperparameter.txt, SVR_optimum_hyperparameter.txt, and RF_optimum_hyperparameter.txt together with the 01_model_testing_in_METS.py script.
  • If you want to use your own training and testing data, use both the 00_gridsearch_model.py and 01_model_testing_in_METS.py scripts.
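The refit-and-predict step performed by 01_model_testing_in_METS.py can be sketched as follows. This is an illustrative sketch only: the data here are simulated, and the hyperparameter values are stand-ins for what would be read from a row of RF_optimum_hyperparameter.txt (whose actual column layout may differ).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X_train = rng.integers(0, 3, size=(100, 20)).astype(float)  # training genotypes (e.g. MESA)
y_train = X_train[:, 0] * 0.5 + rng.normal(size=100)        # training expression
X_test = rng.integers(0, 3, size=(40, 20)).astype(float)    # test genotypes (e.g. METS)

# Hypothetical optimal hyperparameters for this gene, as would be read
# from RF_optimum_hyperparameter.txt
best_params = {"n_estimators": 100, "max_depth": 3}

model = RandomForestRegressor(random_state=0, **best_params)
model.fit(X_train, y_train)                    # refit on the full training data
predicted_expression = model.predict(X_test)   # impute expression in the test set
print(predicted_expression.shape)              # one prediction per test individual
```

Because the fitted model is rebuilt from the training data plus the stored hyperparameters, both the training genotype/expression data and the hyperparameter file are required at prediction time.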

Notes for the Machine Learning Models other than EN

Software

Prerequisite packages and libraries

  • Python 3
  • scikit-learn
  • pandas
  • NumPy
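A quick way to confirm the prerequisites are installed is to import each package and print its version (no particular minimum versions are stated in this repository):

```python
# Check that the prerequisite libraries are importable and report their versions
import sklearn
import pandas
import numpy

print("scikit-learn:", sklearn.__version__)
print("pandas:", pandas.__version__)
print("numpy:", numpy.__version__)
```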