Helicobacter pylori (H. pylori) is a globally prevalent gastric pathogen implicated in a spectrum of clinical outcomes, from chronic gastritis to gastric cancer. While most infected individuals develop gastritis, only a minority progress to severe diseases like gastric cancer, influenced by bacterial, host, and environmental factors. This study presents a supervised machine learning framework to predict whether an H. pylori infection will result in gastric cancer or a non-malignant outcome, using a combination of clinical and genome-derived features.
A curated dataset of 1,363 H. pylori genomes with host metadata and annotated genomic features was used. Feature extraction included gene presence/absence profiles, sequence descriptors from iFeatureOmega and MathFeature, and aggregate variant annotation features. The workflow trains a white-box logistic regression model and black-box eXtreme Gradient Boosting (XGBoost) and Random Forest models, utilizing SMOTE-NC to address class imbalance. SHAP values are used for model interpretability.
The main script, main.py, orchestrates the following steps:
- Load Dataset: Reads the dataset from an Excel file.
- Data Cleaning: (Optional, placeholder for future cleaning steps).
- Train/Test Split: Stratified splitting to maintain class balance.
- Preprocessing: Encodes categorical features and labels.
- Feature Selection: Selects important features using SHAP and Bayesian optimization.
- Baseline Model: Trains and evaluates a logistic regression model on the reduced feature set.
- XGBoost Model: Trains, calibrates, evaluates, and explains an XGBoost model on the reduced feature set.
- Random Forest Model: Trains, calibrates, evaluates, and explains a Random Forest model on the reduced feature set.
- Model Explanation: Uses SHAP to interpret models.
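The core of these steps can be sketched as follows. This is an illustrative, simplified version only (the actual logic lives in `main.py` and the `scripts/` folder); the column names and the `run_baseline` helper are hypothetical, and SMOTE-NC, Bayesian optimization, and calibration are omitted for brevity.

```python
# Minimal sketch of the split -> preprocess -> baseline-LR steps.
# Column names ("outcome") and this helper are hypothetical examples.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

def run_baseline(df: pd.DataFrame, label_col: str = "outcome") -> str:
    X = pd.get_dummies(df.drop(columns=[label_col]))  # encode categorical features
    y = LabelEncoder().fit_transform(df[label_col])   # encode labels
    # Stratified split preserves the cancer / non-cancer class ratio
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return classification_report(y_te, model.predict(X_te))
```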
Each script in the scripts/ folder is responsible for a specific part of the workflow:
- load_dataset.py: Loads the dataset from an Excel file.
- split_and_preprocess.py: Splits the data and preprocesses features/labels.
- feature_selection.py: Performs feature selection using SHAP and Bayesian optimization.
- train_baseline_lr_model.py: Trains a baseline logistic regression model on selected features.
- train_xgb_with_bayesopt.py: Trains an XGBoost model with Bayesian optimization.
- calibrate_model.py: Calibrates classifiers for improved probability estimates.
- train_rf_with_bayesopt.py: Trains a Random Forest model with Bayesian optimization.
- evaluate_model.py: Evaluates and plots results for trained models.
- explain_models.py: Generates SHAP plots for trained models.
To set up the required dependencies, create a new conda environment using the provided HP_ML.yml file:
conda env create -f HP_ML.yml
conda activate HP_ML

These tools are used for feature extraction and must be installed separately:
- iFeatureOmega
  - Visit the iFeatureOmega GitHub page for full instructions.
  - Basic installation:

    pip install iFeatureOmegaCLI
- MathFeature
  - Visit the MathFeature GitHub page for full instructions.
  - Basic installation:

    git clone https://github.com/Bonidia/MathFeature.git MathFeature
    cd MathFeature
    conda env create -f mathfeature-terminal.yml -n mathfeature-terminal
This will install all necessary packages and dependencies for running the workflow and notebooks.
These tools are used to extract additional genomic features:
- Snippy: Used for extracting variant annotation counts from VCF files.
- BLAST+: Used for generating virulence gene presence/absence profiles.
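As a rough illustration of the BLAST+ step, the tabular output (`-outfmt 6`) can be reduced to a gene presence/absence profile. The `presence_absence` helper and the identity/length thresholds below are illustrative assumptions, not the repository's exact settings.

```python
# Sketch: turn BLAST+ tabular output (-outfmt 6) into a virulence gene
# presence/absence profile. Thresholds here are hypothetical examples.
import csv
from io import StringIO

def presence_absence(blast_tsv: str, genes, min_identity=90.0, min_length=100):
    present = set()
    reader = csv.reader(StringIO(blast_tsv), delimiter="\t")
    for row in reader:
        if not row:
            continue
        # outfmt 6 columns start: qseqid, sseqid, pident, length, ...
        query, identity, length = row[0], float(row[2]), int(row[3])
        if identity >= min_identity and length >= min_length:
            present.add(query)
    # 1 = gene hit above thresholds, 0 = absent
    return {gene: int(gene in present) for gene in genes}
```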
Refer to the official documentation for installation and usage instructions:
To run the workflow, execute:
python main.py

You will be prompted to enter the path to your dataset (e.g., data/dataset.xlsx).
Model outputs, evaluation reports, and SHAP plots are saved in the results/ directory, organized as follows:
results/
├── LR/
│ ├── [evaluation plots and metrics for Logistic Regression]
│ └── [SHAP plots for Logistic Regression model]
├── XGB/
│ ├── [evaluation plots and metrics for XGBoost]
│ └── [SHAP plots for calibrated XGBoost model]
└── RF/
├── [evaluation plots and metrics for Random Forest]
└── [SHAP plots for calibrated Random Forest model]
- results/ contains classification reports for each model.
- results/figures/ contains SHAP summary plots and the AUROC and AUPRC curves generated during model evaluation and interpretation.
A notebook version of the workflow is available in notebooks/HP_ML.ipynb. This notebook provides step-by-step code cells and explanations for each stage of the analysis, making it easy to interactively explore the data, train models, and visualize results.
- Chen, Z., et al. (2022). iFeatureOmega: an integrative platform for engineering, visualization and analysis of features from molecular sequences, structural and ligand data sets. GitHub Repository
- Bonidia, R. P., et al. (2021). MathFeature: feature extraction package for DNA, RNA and protein sequences based on mathematical descriptors. GitHub Repository
- Seemann, T. (2015). Snippy: fast bacterial variant calling from NGS reads. GitHub Repository
- Camacho, C., et al. (2009). BLAST+: architecture and applications. BMC Bioinformatics, 10:421. NCBI BLAST+ Documentation
If you use this repository for your research, please also cite the original tools:
- iFeatureOmega: Chen Z., et al., Nucleic Acids Research, 2022. DOI.
- MathFeature: Bonidia R. P., et al., Briefings in Bioinformatics, 2022. DOI.
- Snippy: Seemann T., Snippy: fast bacterial variant calling from NGS reads, 2015. GitHub Repository.
- BLAST+: Camacho C., et al., BMC Bioinformatics, 2009. DOI.
For questions or support, contact: