Helicobacter pylori (H. pylori) is a globally prevalent gastric pathogen implicated in a spectrum of clinical outcomes, from chronic gastritis to gastric cancer. While most infected individuals develop gastritis, only a minority progress to severe diseases like gastric cancer, influenced by bacterial, host, and environmental factors. This study presents a supervised machine learning framework to predict whether an H. pylori infection will result in gastric cancer or a non-malignant outcome, using a combination of clinical and genome-derived features.
A curated dataset of 1,363 H. pylori genomes with host metadata and annotated genomic features was used. Feature extraction included gene presence/absence profiles, sequence descriptors from iFeatureOmega and MathFeature, and aggregate variant annotation features. The workflow trains a white-box logistic regression model and black-box eXtreme Gradient Boosting (XGBoost) and Random Forest models, utilizing SMOTE-NC to address class imbalance. SHAP values are used for model interpretability.
The main script, main.py, orchestrates the following steps:
- Load Dataset: Reads the dataset from an Excel file.
- Data Cleaning: (Optional, placeholder for future cleaning steps).
- Train/Test Split: Stratified splitting to maintain class balance.
- Preprocessing: Encodes categorical features and labels.
- Feature Selection: Selects important features using SHAP and Bayesian optimization.
- Baseline Model: Trains and evaluates a logistic regression model on the reduced feature set.
- XGBoost Model: Trains, calibrates, evaluates, and explains an XGBoost model on the reduced feature set.
- Random Forest Model: Trains, calibrates, evaluates, and explains a Random Forest model on the reduced feature set.
- Model Explanation: Uses SHAP to interpret models.
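The core of these steps can be sketched as follows. This is an illustrative, simplified version only (the actual logic lives in `main.py` and the `scripts/` folder); the column names and the `run_baseline` helper are hypothetical, and SMOTE-NC, Bayesian optimization, and calibration are omitted for brevity.

```python
# Minimal sketch of the split -> preprocess -> baseline-LR steps.
# Column names ("outcome") and this helper are hypothetical examples.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

def run_baseline(df: pd.DataFrame, label_col: str = "outcome") -> str:
    X = pd.get_dummies(df.drop(columns=[label_col]))  # encode categorical features
    y = LabelEncoder().fit_transform(df[label_col])   # encode labels
    # Stratified split preserves the cancer / non-cancer class ratio
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return classification_report(y_te, model.predict(X_te))
```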
Each script in the scripts/ folder is responsible for a specific part of the workflow:
- load_dataset.py: Loads the dataset from an Excel file.
- split_and_preprocess.py: Splits the data and preprocesses features/labels.
- feature_selection.py: Performs feature selection using SHAP and Bayesian optimization.
- train_baseline_lr_model.py: Trains a baseline logistic regression model on selected features.
- train_xgb_with_bayesopt.py: Trains an XGBoost model with Bayesian optimization.
- calibrate_model.py: Calibrates classifiers for improved probability estimates.
- train_rf_with_bayesopt.py: Trains a Random Forest model with Bayesian optimization.
- evaluate_model.py: Evaluates and plots results for trained models.
- explain_models.py: Generates SHAP plots for trained models.
To set up the required dependencies, create a new conda environment using the provided HP_ML.yml file:
conda env create -f HP_ML.yml
conda activate HP_ML

These tools are used for feature extraction and must be installed separately:
- iFeatureOmega
  - Visit the iFeatureOmega GitHub page for full instructions.
  - Basic installation:

    pip install iFeatureOmegaCLI
- MathFeature
  - Visit the MathFeature GitHub page for full instructions.
  - Basic installation:

    git clone https://github.com/Bonidia/MathFeature.git MathFeature
    cd MathFeature
    conda env create -f mathfeature-terminal.yml -n mathfeature-terminal
This will install all necessary packages and dependencies for running the workflow and notebooks.
These tools are used to extract additional genomic features:
- Snippy: Used for extracting variant annotation counts from VCF files.
- BLAST+: Used for generating virulence gene presence/absence profiles.
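As a rough illustration of the BLAST+ step, the tabular output (`-outfmt 6`) can be reduced to a gene presence/absence profile. The `presence_absence` helper and the identity/length thresholds below are illustrative assumptions, not the repository's exact settings.

```python
# Sketch: turn BLAST+ tabular output (-outfmt 6) into a virulence gene
# presence/absence profile. Thresholds here are hypothetical examples.
import csv
from io import StringIO

def presence_absence(blast_tsv: str, genes, min_identity=90.0, min_length=100):
    present = set()
    reader = csv.reader(StringIO(blast_tsv), delimiter="\t")
    for row in reader:
        if not row:
            continue
        # outfmt 6 columns start: qseqid, sseqid, pident, length, ...
        query, identity, length = row[0], float(row[2]), int(row[3])
        if identity >= min_identity and length >= min_length:
            present.add(query)
    # 1 = gene hit above thresholds, 0 = absent
    return {gene: int(gene in present) for gene in genes}
```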
Refer to the official documentation for installation and usage instructions:
To run the workflow, execute:
python main.py

You will be prompted to enter the path to your dataset (e.g., data/dataset.xlsx).
Model outputs, evaluation reports, and SHAP plots are saved in the results/ directory, organized as follows:
results/
├── LR/
│ ├── [evaluation plots and metrics for Logistic Regression]
│ └── [SHAP plots for Logistic Regression model]
├── XGB/
│ ├── [evaluation plots and metrics for XGBoost]
│ └── [SHAP plots for calibrated XGBoost model]
└── RF/
├── [evaluation plots and metrics for Random Forest]
└── [SHAP plots for calibrated Random Forest model]
- results/ contains classification reports for each model.
- results/figures/ contains SHAP summary plots and the AUROC and AUPRC curves generated during model evaluation and interpretation.
A notebook version of the workflow is available in notebooks/HP_ML.ipynb. This notebook provides step-by-step code cells and explanations for each stage of the analysis, making it easy to interactively explore the data, train models, and visualize results.
- Chen, Z., et al. (2022). iFeatureOmega: an integrative platform for engineering, visualization and analysis of features from molecular sequences, structural and ligand data sets. GitHub Repository
- Bonidia, R. P., et al. (2021). MathFeature: feature extraction package for DNA, RNA and protein sequences based on mathematical descriptors. GitHub Repository
- Seemann, T. (2015). Snippy: fast bacterial variant calling from NGS reads. GitHub Repository
- Camacho, C., et al. (2009). BLAST+: architecture and applications. BMC Bioinformatics, 10:421. NCBI BLAST+ Documentation
If you use this repository for your research, please also cite the original tools:
- iFeatureOmega: Chen Z., et al., Nucleic Acids Research, 2022. DOI.
- MathFeature: Bonidia R. P., et al., Briefings in Bioinformatics, 2022. DOI.
- Snippy: Seemann T., Snippy: fast bacterial variant calling from NGS reads, 2015. GitHub Repository.
- BLAST+: Camacho C., et al., BMC Bioinformatics, 2009. DOI.
For questions or support, contact: