Rocchi, E., Nicitra, E., Calvo, M. et al. Combining mass spectrometry and machine learning models for predicting Klebsiella pneumoniae antimicrobial resistance: a multicenter experience from clinical isolates in Italy. BMC Microbiol (2026). https://doi.org/10.1186/s12866-025-04657-2
This repository contains a machine learning pipeline for predicting antimicrobial resistance (AMR) phenotypes from MALDI-TOF mass spectrometry data across multiple collection sites. The data are MALDI-TOF spectra of Klebsiella pneumoniae collected at three Italian clinical centres. The framework implements cross-dataset evaluation strategies and batch effect correction methods to assess model generalizability and site-specific performance.
The MALDI-TOF spectra are processed using MaldiAMRKit. The batch-effect correction is performed with combatlearn to avoid data leakage in the machine learning pipeline.
.
├── src/
│ ├── prepare_maldiset.py # Data preparation and pseudogel visualization
│ ├── cross_datasets_framework.py # Cross-dataset generalization analysis
│ ├── aggregated_datasets_framework.py # ComBat-corrected cross-validation
│ └── batch_visualization.py # Batch effect visualization (PCA/UMAP)
├── results/ # Analysis outputs (generated)
├── requirements.txt
└── README.md
Clone the repository:
git clone https://github.com/EttoreRocchi/MALDI-Kleb-AI.git
cd MALDI-Kleb-AI/
and install the required packages using pip:
pip install -r requirements.txt
Automated processing of MALDI-TOF spectra with metadata integration using MaldiAMRKit.
Input requirements:
- spectra_dir: Directory containing individual spectrum files (.txt, .csv)
- metadata.csv: CSV file with columns:
  - Sample identifiers (matching spectrum filenames)
  - Antibiotic susceptibility results (S/I/R format)
  - City column indicating collection site
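Before running the preparation step, the metadata file can be sanity-checked against the requirements above. The sketch below is illustrative and not part of the repository: the helper `validate_metadata`, the `ID` column name, and the placeholder site name `site_A` are assumptions, and empty susceptibility values are treated as errors here, which may be stricter than the actual pipeline.

```python
import csv
import io

ALLOWED_LABELS = {"S", "I", "R"}

def validate_metadata(handle, antibiotics, site_column="City"):
    """Return a list of human-readable problems found in a metadata CSV."""
    reader = csv.DictReader(handle)
    required = list(antibiotics) + [site_column]
    missing = [c for c in required if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing column: {c}" for c in missing]
    problems = []
    for line_number, row in enumerate(reader, start=2):  # header is line 1
        for drug in antibiotics:
            if row[drug] not in ALLOWED_LABELS:
                problems.append(f"line {line_number}: {drug}={row[drug]!r} is not S/I/R")
        if not (row[site_column] or "").strip():
            problems.append(f"line {line_number}: empty {site_column}")
    return problems

# Toy in-memory file; the "ID" column name and site name are hypothetical
example = io.StringIO(
    "ID,Meropenem,Amikacin,City\n"
    "s001,S,R,site_A\n"
    "s002,X,S,\n"
)
issues = validate_metadata(example, ["Meropenem", "Amikacin"])
for issue in issues:
    print(issue)
```

For a real file, pass `open("metadata.csv", newline="")` instead of the in-memory example.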
Command:
python src/prepare_maldiset.py \
--spectra_dir </path/to/data/> \
--metadata </path/to/metadata> \
--antibiotics Meropenem Amikacin \
--other City \
--output_dir ./data/dfs/
Outputs:
- data_bin_3.csv: Feature matrix (samples × mass-to-charge bins) with target labels
- metadata_bin_3.csv: Associated metadata including batch information
- pseudogel_Meropenem.png: Visualization of spectral intensities by resistance to Meropenem
- pseudogel_Amikacin.png: Visualization of spectral intensities by resistance to Amikacin
Evaluate model generalization by training on individual sites and testing on all sites.
Input requirements:
- data_bin_3.csv: Feature matrix from preparation step
- metadata_bin_3.csv: Batch metadata with City column
Command:
python src/cross_datasets_framework.py \
--data ./data/dfs/data_bin_3.csv \
--batches ./data/dfs/metadata_bin_3.csv \
--targets Meropenem Amikacin \
--out_dir ./results/cross_datasets
Outputs (per target antibiotic):
results/cross_datasets/
└── Meropenem/
├── logistic_regression_f1.png # Heatmap: F1-weighted scores
├── logistic_regression_f1.csv # Numeric matrix
├── logistic_regression_auroc.png # Heatmap: AUROC scores
├── logistic_regression_auroc.csv
├── logistic_regression_balacc.png # Heatmap: Balanced accuracy
├── logistic_regression_balacc.csv
├── logistic_regression_mcc.png # Heatmap: Matthews correlation coefficient
├── logistic_regression_mcc.csv
├── [similar files for random_forest, xgboost, mlp]
└── ...
Each heatmap shows train city (rows) × test city (columns) performance.
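The train-site by test-site matrix behind each heatmap can be sketched as follows. This is a minimal illustration on synthetic data, not the repository's implementation: the site names, the toy features, and the choice of logistic regression with balanced accuracy are assumptions. In this sketch the diagonal cells are scored on the same samples used for training, so they are optimistic; the actual script may handle same-site evaluation differently.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
sites = np.repeat(["site_A", "site_B", "site_C"], 60)  # placeholder site names
y = rng.integers(0, 2, size=sites.size)                 # toy labels, 1 = resistant
X = rng.normal(size=(sites.size, 20))                   # toy binned intensities
X[:, 0] += 2.0 * y                                      # one informative feature

site_names = np.unique(sites)
scores = pd.DataFrame(index=site_names, columns=site_names, dtype=float)

for train_site in site_names:
    train_mask = sites == train_site
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X[train_mask], y[train_mask])
    for test_site in site_names:
        test_mask = sites == test_site
        predictions = model.predict(X[test_mask])
        # rows = training site, columns = test site
        scores.loc[train_site, test_site] = balanced_accuracy_score(y[test_mask], predictions)

print(scores.round(3))
```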
Perform stratified cross-validation with ComBat batch effect correction using combatlearn.
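To illustrate why the correction has to sit inside the cross-validation loop, the sketch below estimates batch parameters on each training fold only and then applies them to the held-out fold. It deliberately does not use the combatlearn API: a per-batch mean subtraction stands in for ComBat (a location-only simplification), and the data, function names, and the choice to stratify jointly on batch and label are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
n_samples = 180
batch = np.repeat(np.array([0, 1, 2]), n_samples // 3)  # three toy sites
y = rng.integers(0, 2, size=n_samples)
X = rng.normal(size=(n_samples, 15))
X[:, 0] += 2.0 * y              # informative feature
X += batch[:, None] * 1.5       # additive site shift (toy batch effect)

def fit_batch_means(X_train, batch_train):
    """Per-batch feature means, estimated on the training fold only."""
    return {b: X_train[batch_train == b].mean(axis=0) for b in np.unique(batch_train)}

def apply_batch_means(X_part, batch_part, means):
    """Subtract each sample's batch mean (location-only stand-in for ComBat)."""
    corrected = X_part.copy()
    for b, mu in means.items():
        corrected[batch_part == b] -= mu
    return corrected

# Stratify on batch and label jointly so every fold contains every batch
strata = np.array([f"{b}_{t}" for b, t in zip(batch, y)])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_mcc = []
for train_idx, test_idx in cv.split(X, strata):
    means = fit_batch_means(X[train_idx], batch[train_idx])        # no test data used
    X_train = apply_batch_means(X[train_idx], batch[train_idx], means)
    X_test = apply_batch_means(X[test_idx], batch[test_idx], means)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y[train_idx])
    fold_mcc.append(matthews_corrcoef(y[test_idx], clf.predict(X_test)))

print(np.round(fold_mcc, 3))
```

Fitting the correction on the full dataset before splitting would let test-fold statistics influence the training data, which is the leakage the pipeline is designed to avoid.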
Input requirements:
- Same as cross-dataset analysis
- Models are saved and can be reloaded unless --force-retrain is specified
Command:
python src/aggregated_datasets_framework.py \
--data ./data/dfs/data_bin_3.csv \
--batches ./data/dfs/metadata_bin_3.csv \
--targets Meropenem Amikacin \
--out ./results/aggregated_datasets \
[--force-retrain]
Outputs (per target):
results/aggregated_datasets/
├── metrics_Meropenem.csv # Aggregate metrics (all models)
├── metrics_detailed_logistic_regression_Meropenem.csv # Per-fold metrics
├── confmat_Logistic_Regression_Meropenem.png # Confusion matrix
├── confmat_Logistic_Regression_Meropenem.pdf
├── shap_beeswarm_Logistic_Regression_Meropenem.png # Feature importance
├── shap_beeswarm_Logistic_Regression_Meropenem.pdf
├── perf_by_center_logistic_regression_Meropenem.csv # Center-specific metrics
├── perf_boxplot_Logistic_Regression_Meropenem_Balanced_Acc.png
├── perf_boxplot_Logistic_Regression_Meropenem_MCC.png
├── perf_boxplot_Logistic_Regression_Meropenem_AUROC.png
├── saved_models/
│ └── Meropenem/
│ ├── Logistic_Regression_fold0.pkl
│ ├── Logistic_Regression_fold1.pkl
│ └── ...
└── [similar files for other models and targets]
Visualize and quantify batch effects before and after ComBat correction using dimensionality reduction and statistical metrics.
Command:
python src/batch_analyses.py \
--data ./data/dfs/data_bin_3.csv \
--batches ./data/dfs/metadata_bin_3.csv \
--out ./results/batches \
[--per-batch]
Options:
- --per-batch: Enable per-batch diagnostics (computes metrics for each batch individually)
Outputs:
results/batches/
├── pca_comparison.png # Side-by-side PCA (before/after)
├── pca_comparison.pdf
├── umap_comparison.png # Side-by-side UMAP (before/after)
├── umap_comparison.pdf
└── batch_correction_report.md # Quantitative assessment
Metrics computed:
- Batch Effect Removal: Centroid distances, variance alignment (Levene's test), batch separability (Silhouette score)
- Structure Preservation: k-NN neighborhood preservation, distance correlation
- Distribution Preservation: Skewness changes, Kolmogorov-Smirnov statistics
- Per-Batch Diagnostics (optional): Alignment, structure preservation, and distribution similarity within each batch
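As an example of the batch separability metric listed above, the silhouette score can be computed with batch labels as the cluster assignment, before and after correction. The sketch below uses synthetic data and a simple per-batch mean subtraction as a stand-in for ComBat; it is an assumption-laden illustration and not the code that produces batch_correction_report.md.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
batch = np.repeat(np.array([0, 1, 2]), 50)                     # three toy sites
X_raw = rng.normal(size=(150, 10)) + batch[:, None] * 3.0      # strong site shift

# Location-only correction used here as a stand-in for ComBat
X_corrected = X_raw.copy()
for b in np.unique(batch):
    X_corrected[batch == b] -= X_raw[batch == b].mean(axis=0)

# High silhouette = batches are well separated; near zero = batches overlap
sil_before = silhouette_score(X_raw, batch)
sil_after = silhouette_score(X_corrected, batch)
print(f"silhouette before: {sil_before:.3f}, after: {sil_after:.3f}")
```

A drop towards zero after correction suggests the batches are no longer separable in feature space; it says nothing by itself about whether biological structure was preserved, which is what the structure and distribution metrics address.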
The dataset is publicly available on Zenodo at: MALDI-Kleb-AI.
If you use the dataset and/or the pipeline, please consider citing:
@article{Rocchi2026,
author = {Rocchi, Ettore and Nicitra, Emanuele and Calvo, Maddalena and Cento, Valeria and Peiretti, Laura and Asif, Zian and Menchinelli, Giulia and Posteraro, Brunella and Sala, Claudia and Colosimo, Claudia and Cricca, Monica and Sambri, Vittorio and Sanguinetti, Maurizio and Castellani, Gastone and Stefani, Stefania},
title = {Combining mass spectrometry and machine learning models for predicting Klebsiella pneumoniae antimicrobial resistance: a multicenter experience from clinical isolates in Italy},
journal = {BMC Microbiology},
year = {2026},
doi = {10.1186/s12866-025-04657-2},
url = {https://doi.org/10.1186/s12866-025-04657-2}
}