Venkatesh-99/HP_ML

Project Overview

Helicobacter pylori (H. pylori) is a globally prevalent gastric pathogen implicated in a spectrum of clinical outcomes, from chronic gastritis to gastric cancer. While most infected individuals develop gastritis, only a minority progress to severe diseases like gastric cancer, influenced by bacterial, host, and environmental factors. This study presents a supervised machine learning framework to predict whether an H. pylori infection will result in gastric cancer or a non-malignant outcome, using a combination of clinical and genome-derived features.

A curated dataset of 1,363 H. pylori genomes with host metadata and annotated genomic features was used. Feature extraction included gene presence/absence profiles, sequence descriptors from iFeatureOmega and MathFeature, and aggregate variant annotation features. The workflow trains a white-box logistic regression model and black-box eXtreme Gradient Boosting (XGBoost) and Random Forest models, utilizing SMOTE-NC to address class imbalance. SHAP values are used for model interpretability.
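
To make the interpretability step concrete: for a linear model such as logistic regression, and under the assumption that features are treated as independent, the SHAP value of feature i in log-odds space reduces to w_i * (x_i - mean(x_i)). The sketch below illustrates that special case and the additivity property of SHAP values. It is not taken from this repository; the weights, bias, and feature values are hypothetical, and tree models such as XGBoost and Random Forest need a dedicated explainer instead.

```python
def log_odds(weights, bias, x):
    """Linear model output before the sigmoid."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


def linear_shap(weights, x, background):
    """Per-feature contributions w_i * (x_i - mean_i), in log-odds space.

    Assumes features are independent; `background` is the reference sample.
    """
    n_rows = len(background)
    means = [sum(row[i] for row in background) / n_rows
             for i in range(len(weights))]
    return [w * (xi - m) for w, xi, m in zip(weights, x, means)]


# Hypothetical model and data: two binary features and one numeric feature.
weights, bias = [1.5, -0.8, 0.3], -0.2
background = [[1, 0, 2.0], [0, 1, 1.0], [1, 1, 3.0], [0, 0, 2.0]]
x = [1, 0, 4.0]

contributions = linear_shap(weights, x, background)
baseline = sum(log_odds(weights, bias, row) for row in background) / len(background)

# Additivity: the contributions sum to f(x) minus the average prediction.
print(contributions)
print(sum(contributions), log_odds(weights, bias, x) - baseline)
```

A positive contribution pushes the prediction for that sample towards the positive class, and a negative one pushes it away.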

Main Workflow

The main script, main.py, orchestrates the following steps:

  1. Load Dataset: Reads the dataset from an Excel file.
  2. Data Cleaning: (Optional, placeholder for future cleaning steps).
  3. Train/Test Split: Stratified splitting to maintain class balance.
  4. Preprocessing: Encodes categorical features and labels.
  5. Feature Selection: Selects important features using SHAP and Bayesian optimization.
  6. Baseline Model: Trains and evaluates a logistic regression model on the reduced feature set.
  7. XGBoost Model: Trains, calibrates, evaluates, and explains an XGBoost model on the reduced feature set.
  8. Random Forest Model: Trains, calibrates, evaluates, and explains a Random Forest model on the reduced feature set.
  9. Model Explanation: Uses SHAP to interpret models.
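
As an illustration of step 3, the sketch below shows what a stratified split does: each class is shuffled and split separately, so the test set keeps the class proportions of the full dataset. It uses only the standard library, and the function name and labels are hypothetical rather than the code in split_and_preprocess.py. In practice a library routine such as scikit-learn's `train_test_split(..., stratify=y)` does the same job.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction from every class (at least one sample).
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical imbalanced labels: 1 = gastric cancer, 0 = non-malignant.
labels = [1] * 20 + [0] * 80
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)

print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx) / len(test_idx))
```

With a plain random split on a small, imbalanced dataset, the minority class can end up over- or under-represented in the test set, which makes evaluation metrics unstable.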

Scripts Directory

Each script in the scripts/ folder is responsible for a specific part of the workflow:

  • load_dataset.py: Loads the dataset from an Excel file.
  • split_and_preprocess.py: Splits the data and preprocesses features/labels.
  • feature_selection.py: Performs feature selection using SHAP and Bayesian optimization.
  • train_baseline_lr_model.py: Trains a baseline logistic regression model on selected features.
  • train_xgb_with_bayesopt.py: Trains an XGBoost model with Bayesian optimization.
  • calibrate_model.py: Calibrates classifiers for improved probability estimates.
  • train_rf_with_bayesopt.py: Trains a Random Forest model with Bayesian optimization.
  • evaluate_model.py: Evaluates and plots results for trained models.
  • explain_models.py: Generates SHAP plots for trained models.
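
To show what "improved probability estimates" means for the calibration step, the sketch below computes two common calibration diagnostics: the Brier score (mean squared error of the predicted probabilities) and a reliability table that compares the mean predicted probability with the observed positive rate in each bin. The function names and data are hypothetical and are not taken from calibrate_model.py or evaluate_model.py.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def reliability_bins(y_true, y_prob, n_bins=5):
    """Per non-empty bin: (mean predicted probability, positive rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))

    rows = []
    for members in bins:
        if members:
            mean_p = sum(p for _, p in members) / len(members)
            pos_rate = sum(y for y, _ in members) / len(members)
            rows.append((mean_p, pos_rate, len(members)))
    return rows


# Hypothetical held-out labels and predicted probabilities.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_prob = [0.02, 0.05, 0.30, 0.40, 0.70, 0.60, 0.97, 0.99]

print(brier_score(y_true, y_prob))
for mean_p, pos_rate, count in reliability_bins(y_true, y_prob, n_bins=4):
    print(f"predicted={mean_p:.2f} observed={pos_rate:.2f} n={count}")
```

A well-calibrated model has a low Brier score and, in each bin, an observed positive rate close to the mean predicted probability.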

Installation

To set up the required dependencies, create a new conda environment using the provided HP_ML.yml file:

conda env create -f HP_ML.yml
conda activate HP_ML

Installing iFeatureOmega and MathFeature

These tools are used for feature extraction and must be installed separately:

  • iFeatureOmega

  • MathFeature

    • Visit the MathFeature GitHub page for full instructions.
    • Basic installation:
      git clone https://github.com/Bonidia/MathFeature.git MathFeature
      cd MathFeature 
      conda env create -f mathfeature-terminal.yml -n mathfeature-terminal

This will install all necessary packages and dependencies for running the workflow and notebooks.

Installing Snippy and BLAST+

These tools are used to extract additional genomic features:

  • Snippy: Used for extracting variant annotation counts from VCF files.
  • BLAST+: Used for generating virulence gene presence/absence profiles.
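
As a rough sketch of how a presence/absence profile can be derived from BLAST+ output, the example below parses the standard 12-column tabular format (`-outfmt 6`) and marks a gene as present when at least one hit passes an identity and a coverage threshold. Several things here are assumptions and not this repository's settings: that the genes are the queries and the genome is the subject, the threshold values, the gene lengths, and the example hits.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def presence_absence(blast_tsv, gene_lengths, min_identity=90.0, min_coverage=0.8):
    """Map each gene to 1 if any hit passes both thresholds, else 0."""
    profile = {gene: 0 for gene in gene_lengths}
    reader = csv.DictReader(blast_tsv, fieldnames=BLAST_COLUMNS, delimiter="\t")
    for row in reader:
        gene = row["qseqid"]
        if gene not in profile:
            continue
        # Alignment length includes gaps, so this is an approximate coverage.
        coverage = int(row["length"]) / gene_lengths[gene]
        if float(row["pident"]) >= min_identity and coverage >= min_coverage:
            profile[gene] = 1
    return profile


# Hypothetical gene lengths (bp) and hits for one genome.
gene_lengths = {"cagA": 3500, "vacA": 3900, "babA": 2200}
hits = "\n".join("\t".join(fields) for fields in [
    ["cagA", "contig_1", "95.20", "3400", "160", "3", "1", "3400", "1000", "4399", "0.0", "5600"],
    ["vacA", "contig_7", "97.00", "1200", "36", "0", "1", "1200", "500", "1699", "0.0", "2000"],
])

profile = presence_absence(io.StringIO(hits), gene_lengths)
print(profile)
```

Here the vacA hit covers only part of the gene and babA has no hit at all, so both are recorded as absent. Stacking one such profile per genome gives a binary feature matrix.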

Refer to each tool's official documentation for installation and usage instructions.

Usage

To run the workflow, execute:

python main.py

You will be prompted to enter the path to your dataset (e.g., data/dataset.xlsx).

Results

Model outputs, evaluation reports, and SHAP plots are saved in the results/ directory.

Output Directory Structure

The results/ directory is organized as follows:

results/
├── LR/
│   ├── [evaluation plots and metrics for Logistic Regression]
│   └── [SHAP plots for Logistic Regression model]
├── XGB/
│   ├── [evaluation plots and metrics for XGBoost]
│   └── [SHAP plots for calibrated XGBoost model]
└── RF/
    ├── [evaluation plots and metrics for Random Forest]
    └── [SHAP plots for calibrated Random Forest model]

  • results/ contains classification reports for each model.
  • results/figures/ contains SHAP summary plots, AUROC and AUPRC curves generated during model evaluation and interpretation.
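
For readers unfamiliar with the two summary metrics behind these curves, the sketch below computes them from scratch. AUROC is the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. Average precision is a step-wise summary of the precision-recall curve. This illustration uses only the standard library, with hypothetical function names and scores, and assumes the scores have no ties for average precision. In practice scikit-learn's `roc_auc_score` and `average_precision_score` are the usual choice.

```python
def auroc(y_true, y_score):
    """Fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, y_score):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits


# Hypothetical labels (1 = gastric cancer) and predicted probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

print(auroc(y_true, y_score))
print(average_precision(y_true, y_score))
```

On imbalanced data, average precision is often the more informative of the two, because it is not inflated by the large number of true negatives.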

Jupyter Notebook

A notebook version of the workflow is available in notebooks/HP_ML.ipynb. This notebook provides step-by-step code cells and explanations for each stage of the analysis, making it easy to interactively explore the data, train models, and visualize results.

References

  • Chen, Z., et al. (2022). iFeatureOmega: an integrative platform for engineering, visualization and analysis of features from molecular sequences, structural and ligand data sets. GitHub Repository
  • Bonidia, R. P., et al. (2021). MathFeature: feature extraction package for DNA, RNA and protein sequences based on mathematical descriptors. GitHub Repository
  • Seemann, T. (2015). Snippy: fast bacterial variant calling from NGS reads. GitHub Repository
  • Camacho, C., et al. (2009). BLAST+: architecture and applications. BMC Bioinformatics, 10:421. NCBI BLAST+ Documentation

If you use this repository for your research, please also cite the original tools:

  • iFeatureOmega: Chen Z., et al., Nucleic Acids Research, 2022. DOI.
  • MathFeature: Bonidia R. P., et al., Briefings in Bioinformatics, 2022. DOI.
  • Snippy: Seemann T., Snippy: fast bacterial variant calling from NGS reads, 2015. GitHub Repository.
  • BLAST+: Camacho C., et al., BMC Bioinformatics, 2009. DOI.

Contact

For questions or support, contact:

