MALDI-Kleb-AI

Rocchi, E., Nicitra, E., Calvo, M. et al. Combining mass spectrometry and machine learning models for predicting Klebsiella pneumoniae antimicrobial resistance: a multicenter experience from clinical isolates in Italy. BMC Microbiol (2026). https://doi.org/10.1186/s12866-025-04657-2

Overview

This repository contains a machine learning pipeline for predicting antimicrobial resistance (AMR) phenotypes from MALDI-TOF mass spectrometry data collected at multiple sites. The data are MALDI-TOF spectra of Klebsiella pneumoniae collected in three Italian clinical centres. The framework implements cross-dataset evaluation strategies and batch effect correction methods to assess model generalizability and site-specific performance.

The MALDI-TOF spectra are processed using MaldiAMRKit. The batch-effect correction is performed with combatlearn to avoid data leakage in the machine learning pipeline.

Structure

.
├── src/
│   ├── prepare_maldiset.py               # Data preparation and pseudogel visualization
│   ├── cross_datasets_framework.py       # Cross-dataset generalization analysis
│   ├── aggregated_datasets_framework.py  # ComBat-corrected cross-validation
│   └── batch_visualization.py            # Batch effect visualization (PCA/UMAP)
├── results/                              # Analysis outputs (generated)
├── requirements.txt
└── README.md

Installation

Clone the repository:

git clone https://github.com/EttoreRocchi/MALDI-Kleb-AI.git
cd MALDI-Kleb-AI/

and install required packages using pip:

pip install -r requirements.txt

Usage

Data preparation

Automated processing of MALDI-TOF spectra with metadata integration using MaldiAMRKit.

Input requirements:

  • spectra_dir: Directory containing individual spectrum files (.txt, .csv)
  • metadata.csv: CSV file with columns:
    • Sample identifiers (matching spectrum filenames)
    • Antibiotic susceptibility results (S/I/R format)
    • City column indicating collection site

Command:

python src/prepare_maldiset.py \
    --spectra_dir </path/to/data/> \
    --metadata </path/to/metadata> \
    --antibiotics Meropenem Amikacin \
    --other City \
    --output_dir ./data/dfs/

Outputs:

  • data_bin_3.csv: Feature matrix (samples × mass-to-charge bins) with target labels
  • metadata_bin_3.csv: Associated metadata including batch information
  • pseudogel_Meropenem.png: Visualization of spectral intensities by resistance to Meropenem
  • pseudogel_Amikacin.png: Visualization of spectral intensities by resistance to Amikacin

Cross-dataset generalization analysis

Evaluate model generalization by training on individual sites and testing on all sites.

Input requirements:

  • data_bin_3.csv: Feature matrix from preparation step
  • metadata_bin_3.csv: Batch metadata with City column

Command:

python src/cross_datasets_framework.py \
    --data ./data/dfs/data_bin_3.csv \
    --batches ./data/dfs/metadata_bin_3.csv \
    --targets Meropenem Amikacin \
    --out_dir ./results/cross_datasets

Outputs (per target antibiotic):

results/cross_datasets/
└── Meropenem/
    ├── logistic_regression_f1.png          # Heatmap: F1-weighted scores
    ├── logistic_regression_f1.csv          # Numeric matrix
    ├── logistic_regression_auroc.png       # Heatmap: AUROC scores
    ├── logistic_regression_auroc.csv
    ├── logistic_regression_balacc.png      # Heatmap: Balanced accuracy
    ├── logistic_regression_balacc.csv
    ├── logistic_regression_mcc.png         # Heatmap: Matthews correlation coefficient
    ├── logistic_regression_mcc.csv
    ├── [similar files for random_forest, xgboost, mlp]
    └── ...

Each heatmap shows train city (rows) × test city (columns) performance.
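One way to read such a matrix is to compare the diagonal (train and test on the same site) with the off-diagonal cells (train on one site, test on another). The sketch below assumes the numeric CSV stores train cities in the first column and test cities in the header row; that layout, the site names, and the scores are illustrative assumptions, so adjust the parsing to the actual files.

```python
import csv
import io

def summarize_matrix(csv_text):
    """Split a train-by-test score matrix into within-site and cross-site means."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    test_sites = header[1:]  # first header cell is the (empty) index label
    within, cross = [], []
    for row in reader:
        train_site = row[0]
        scores = [float(v) for v in row[1:]]
        for test_site, score in zip(test_sites, scores):
            (within if test_site == train_site else cross).append(score)
    return sum(within) / len(within), sum(cross) / len(cross)

# Toy matrix with invented site names and scores.
toy = (
    ",SiteA,SiteB,SiteC\n"
    "SiteA,0.90,0.60,0.70\n"
    "SiteB,0.50,0.80,0.60\n"
    "SiteC,0.60,0.60,0.70\n"
)
within_mean, cross_mean = summarize_matrix(toy)
print(f"within-site: {within_mean:.2f}, cross-site: {cross_mean:.2f}")
```

A large gap between the two means suggests that a model trained at one site does not transfer well to the others.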

Batch-corrected cross-validation

Perform stratified cross-validation with ComBat batch effect correction using combatlearn.
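The leakage concern is that correction parameters estimated on all samples would let test-fold information influence training. The sketch below illustrates the fit-on-train, apply-to-test pattern with a deliberately simplified stand-in: per-batch mean centering. It is not ComBat and does not use the combatlearn API; the `BatchCenterer` class and the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import mean

class BatchCenterer:
    """Location-only per-batch centering; a simplified stand-in, NOT ComBat."""

    def fit(self, X, batches):
        n_features = len(X[0])
        # Grand mean of each feature over the training samples only.
        grand = [mean(row[j] for row in X) for j in range(n_features)]
        by_batch = defaultdict(list)
        for row, b in zip(X, batches):
            by_batch[b].append(row)
        # Offset of each batch mean from the grand mean, per feature.
        self.offsets_ = {
            b: [mean(r[j] for r in rows) - grand[j] for j in range(n_features)]
            for b, rows in by_batch.items()
        }
        return self

    def transform(self, X, batches):
        # Apply the training-fold offsets; nothing is re-estimated here.
        return [
            [x - off for x, off in zip(row, self.offsets_[b])]
            for row, b in zip(X, batches)
        ]

# Toy fold: one feature, two sites with shifted intensities.
X_train = [[1.0], [3.0], [11.0], [13.0]]
b_train = ["SiteA", "SiteA", "SiteB", "SiteB"]
X_test = [[2.0], [12.0]]
b_test = ["SiteA", "SiteB"]

centerer = BatchCenterer().fit(X_train, b_train)   # parameters from train only
X_test_corrected = centerer.transform(X_test, b_test)
print(X_test_corrected)
```

Inside cross-validation, the same `fit` and `transform` calls would be repeated per fold, so that each test fold is corrected with parameters it did not contribute to.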

Input requirements:

  • Same as cross-dataset analysis

Trained models are saved and reloaded on later runs unless --force-retrain is specified.

Command:

python src/aggregated_datasets_framework.py \
    --data ./data/dfs/data_bin_3.csv \
    --batches ./data/dfs/metadata_bin_3.csv \
    --targets Meropenem Amikacin \
    --out ./results/aggregated_datasets \
    [--force-retrain]

Outputs (per target):

results/aggregated_datasets/
├── metrics_Meropenem.csv                                         # Aggregate metrics (all models)
├── metrics_detailed_logistic_regression_Meropenem.csv            # Per-fold metrics
├── confmat_Logistic_Regression_Meropenem.png                     # Confusion matrix
├── confmat_Logistic_Regression_Meropenem.pdf
├── shap_beeswarm_Logistic_Regression_Meropenem.png               # Feature importance
├── shap_beeswarm_Logistic_Regression_Meropenem.pdf
├── perf_by_center_logistic_regression_Meropenem.csv              # Center-specific metrics
├── perf_boxplot_Logistic_Regression_Meropenem_Balanced_Acc.png
├── perf_boxplot_Logistic_Regression_Meropenem_MCC.png
├── perf_boxplot_Logistic_Regression_Meropenem_AUROC.png
├── saved_models/
│   └── Meropenem/
│       ├── Logistic_Regression_fold0.pkl
│       ├── Logistic_Regression_fold1.pkl
│       └── ...
└── [similar files for other models and targets]

Batch effect visualization

Visualize and quantify batch effects before and after ComBat correction using dimensionality reduction and statistical metrics.

Command:

python src/batch_visualization.py \
    --data ./data/dfs/data_bin_3.csv \
    --batches ./data/dfs/metadata_bin_3.csv \
    --out ./results/batches \
    [--per-batch]

Options:

  • --per-batch: Enable per-batch diagnostics (computes metrics for each batch individually)

Outputs:

results/batches/
├── pca_comparison.png                    # Side-by-side PCA (before/after)
├── pca_comparison.pdf
├── umap_comparison.png                   # Side-by-side UMAP (before/after)
├── umap_comparison.pdf
└── batch_correction_report.md            # Quantitative assessment

Metrics computed:

  • Batch Effect Removal: Centroid distances, variance alignment (Levene's test), batch separability (Silhouette score)
  • Structure Preservation: k-NN neighborhood preservation, distance correlation
  • Distribution Preservation: Skewness changes, Kolmogorov-Smirnov statistics
  • Per-Batch Diagnostics (optional): Alignment, structure preservation, and distribution similarity within each batch
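To give a feel for the centroid-distance idea, the following sketch computes the mean pairwise Euclidean distance between batch centroids on toy data. The exact definition used in the generated report is not stated here, so this formulation, the helper names, and the data are assumptions for illustration only.

```python
from collections import defaultdict
from itertools import combinations
from math import dist
from statistics import mean

def batch_centroids(X, batches):
    """Mean feature vector of each batch."""
    groups = defaultdict(list)
    for row, b in zip(X, batches):
        groups[b].append(row)
    return {b: [mean(col) for col in zip(*rows)] for b, rows in groups.items()}

def mean_centroid_distance(X, batches):
    """Average Euclidean distance over all pairs of batch centroids."""
    centroids = batch_centroids(X, batches)
    pairs = combinations(sorted(centroids), 2)
    return mean(dist(centroids[a], centroids[b]) for a, b in pairs)

batches = ["SiteA", "SiteA", "SiteB", "SiteB"]
# Toy data: the two sites are shifted along the first feature before correction.
X_before = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]]
X_after = [[5.0, 0.0], [7.0, 0.0], [5.0, 0.0], [7.0, 0.0]]

before = mean_centroid_distance(X_before, batches)
after = mean_centroid_distance(X_after, batches)
print(f"centroid distance before: {before}, after: {after}")
```

A smaller value after correction indicates that the batch means have moved closer together; it says nothing on its own about whether biological structure was preserved, which is what the structure and distribution metrics address.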

Citation

The dataset is publicly available on Zenodo at: MALDI-Kleb-AI.

If you use the dataset and/or the pipeline, please consider citing:

@article{Rocchi2026,
  author    = {Rocchi, Ettore and Nicitra, Emanuele and Calvo, Maddalena and Cento, Valeria and Peiretti, Laura and Asif, Zian and Menchinelli, Giulia and Posteraro, Brunella and Sala, Claudia and Colosimo, Claudia and Cricca, Monica and Sambri, Vittorio and Sanguinetti, Maurizio and Castellani, Gastone and Stefani, Stefania},
  title     = {Combining mass spectrometry and machine learning models for predicting Klebsiella pneumoniae antimicrobial resistance: a multicenter experience from clinical isolates in Italy},
  journal   = {BMC Microbiology},
  year      = {2026},
  doi       = {10.1186/s12866-025-04657-2},
  url       = {https://doi.org/10.1186/s12866-025-04657-2}
}
