MALDI-Kleb-AI

Rocchi, E., Nicitra, E., Calvo, M. et al. Combining mass spectrometry and machine learning models for predicting Klebsiella pneumoniae antimicrobial resistance: a multicenter experience from clinical isolates in Italy. BMC Microbiol (2026). https://doi.org/10.1186/s12866-025-04657-2

Overview

This repository contains a machine learning pipeline for predicting antimicrobial resistance (AMR) phenotypes from MALDI-TOF mass spectrometry data collected at multiple sites. The data are MALDI-TOF spectra of Klebsiella pneumoniae isolates collected at three Italian clinical centres. The framework implements cross-dataset evaluation strategies and batch effect correction methods to assess model generalizability and site-specific performance.

The MALDI-TOF spectra are processed using MaldiAMRKit. The batch-effect correction is performed with combatlearn to avoid data leakage in the machine learning pipeline.

Structure

.
├── src/
│   ├── prepare_maldiset.py               # Data preparation and pseudogel visualization
│   ├── cross_datasets_framework.py       # Cross-dataset generalization analysis
│   ├── aggregated_datasets_framework.py  # ComBat-corrected cross-validation
│   └── batch_visualization.py            # Batch effect visualization (PCA/UMAP)
├── results/                              # Analysis outputs (generated)
├── requirements.txt
└── README.md

Installation

Clone the repository:

git clone https://github.com/EttoreRocchi/MALDI-Kleb-AI.git
cd MALDI-Kleb-AI/

and install required packages using pip:

pip install -r requirements.txt

Usage

Data preparation

Automated processing of MALDI-TOF spectra with metadata integration using MaldiAMRKit.

Input requirements:

spectra_dir: Directory containing individual spectrum files (.txt, .csv)
metadata.csv: CSV file with columns:
- Sample identifiers (matching spectrum filenames)
- Antibiotic susceptibility results (S/I/R format)
- City column indicating collection site
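Before running the preparation step, it can help to check the metadata file against the requirements above. The following is a minimal sketch using only the standard library, not part of the repository: the function name `check_metadata`, the identifier column name `ID`, and the site names are hypothetical, and treating blank susceptibility cells as "not tested" is an assumption.

```python
import csv
import io

VALID_LABELS = {"S", "I", "R"}


def check_metadata(csv_text, id_column, antibiotics, site_column="City"):
    """Return a list of human-readable problems found in the metadata."""
    # restval="" makes short rows yield empty strings instead of None.
    reader = csv.DictReader(io.StringIO(csv_text), restval="")
    required = [id_column, site_column, *antibiotics]
    missing = [c for c in required if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing column: {c}" for c in missing]

    problems = []
    seen_ids = set()
    for line_no, row in enumerate(reader, start=2):  # header is line 1
        sample_id = row[id_column].strip()
        if not sample_id:
            problems.append(f"line {line_no}: empty sample identifier")
        elif sample_id in seen_ids:
            problems.append(f"line {line_no}: duplicate identifier {sample_id!r}")
        seen_ids.add(sample_id)
        if not row[site_column].strip():
            problems.append(f"line {line_no}: empty {site_column}")
        for drug in antibiotics:
            label = row[drug].strip().upper()
            if label and label not in VALID_LABELS:  # blank = not tested (assumption)
                problems.append(f"line {line_no}: {drug}={row[drug]!r} is not S/I/R")
    return problems


# Made-up example rows; "ID" and the site names are placeholders.
example = (
    "ID,City,Meropenem,Amikacin\n"
    "K001,SiteA,S,R\n"
    "K002,SiteB,X,S\n"
    "K002,SiteC,R,\n"
)
problems = check_metadata(example, "ID", ["Meropenem", "Amikacin"])
for p in problems:
    print(p)
```

The check does not verify that each identifier has a matching spectrum file in spectra_dir; that comparison depends on the file naming used at each site.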

Command:

python src/prepare_maldiset.py \
    --spectra_dir </path/to/data/> \
    --metadata </path/to/metadata> \
    --antibiotics Meropenem Amikacin \
    --other City \
    --output_dir ./data/dfs/

Outputs:

data_bin_3.csv: Feature matrix (samples × mass-to-charge bins) with target labels
metadata_bin_3.csv: Associated metadata including batch information
pseudogel_Meropenem.png: Visualization of spectral intensities by resistance to Meropenem
pseudogel_Amikacin.png: Visualization of spectral intensities by resistance to Amikacin
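To illustrate what a "samples × mass-to-charge bins" feature matrix row looks like, here is a rough sketch of fixed-width binning in plain Python. It is not the MaldiAMRKit implementation. The 3 Da bin width is an assumption inferred from the `bin_3` file names, the 2000-20000 Da range is an assumed typical acquisition window, and `bin_spectrum` is a hypothetical helper.

```python
def bin_spectrum(mz, intensity, mz_min=2000.0, mz_max=20000.0, bin_width=3.0):
    """Sum intensities falling into fixed-width m/z bins over [mz_min, mz_max)."""
    n_bins = int((mz_max - mz_min) // bin_width)
    upper = mz_min + n_bins * bin_width
    binned = [0.0] * n_bins
    for m, i in zip(mz, intensity):
        if mz_min <= m < upper:
            idx = min(int((m - mz_min) // bin_width), n_bins - 1)
            binned[idx] += i
    return binned


# Made-up peaks: the last one lies outside the range and is dropped.
mz = [2000.0, 2001.5, 2003.0, 19999.9, 25000.0]
intensity = [1.0, 2.0, 4.0, 8.0, 16.0]
features = bin_spectrum(mz, intensity)
print(len(features), features[0], features[1], features[-1])
```

Each spectrum then contributes one row of equal length, which is what allows spectra from different instruments and sites to be stacked into a single matrix.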

Cross-dataset generalization analysis

Evaluate model generalization by training on individual sites and testing on all sites.
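The evaluation scheme can be sketched as a loop over train and test sites. The example below is a simplified illustration, not the code in `cross_datasets_framework.py`: it swaps in a nearest-centroid classifier for the repository's models, uses toy two-feature data with placeholder site names, and scores the diagonal on the training data itself, which the actual script may handle differently (for example with a held-out split).

```python
from math import dist


def fit_centroids(X, y):
    """Nearest-centroid stand-in: one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, X):
    """Assign each row to the class with the closest centroid."""
    return [min(centroids, key=lambda c: dist(centroids[c], x)) for x in X]


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall."""
    recalls = []
    for label in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == label]
        recalls.append(sum(y_pred[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def cross_site_matrix(sites):
    """sites: {name: (X, y)} -> {train_site: {test_site: score}}."""
    matrix = {}
    for train_name, (X_train, y_train) in sites.items():
        model = fit_centroids(X_train, y_train)
        matrix[train_name] = {
            test_name: balanced_accuracy(y_test, predict(model, X_test))
            for test_name, (X_test, y_test) in sites.items()
        }
    return matrix


# Toy data: site "B" is site "A" shifted by +2 on the first feature,
# mimicking a batch effect between instruments.
sites = {
    "A": ([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]], ["S", "S", "R", "R"]),
    "B": ([[2.0, 0.0], [2.1, 0.2], [3.0, 1.0], [3.1, 0.9]], ["S", "S", "R", "R"]),
}
matrix = cross_site_matrix(sites)
for train_name, row in matrix.items():
    print(train_name, row)
```

In this toy case each model is perfect on its own site and at chance level (balanced accuracy 0.5) on the other one, which is the pattern a strong site-specific shift tends to produce.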

Input requirements:

data_bin_3.csv: Feature matrix from preparation step
metadata_bin_3.csv: Batch metadata with City column

Command:

python src/cross_datasets_framework.py \
    --data ./data/dfs/data_bin_3.csv \
    --batches ./data/dfs/metadata_bin_3.csv \
    --targets Meropenem Amikacin \
    --out_dir ./results/cross_datasets

Outputs (per target antibiotic):

results/cross_datasets/
└── Meropenem/
    ├── logistic_regression_f1.png          # Heatmap: F1-weighted scores
    ├── logistic_regression_f1.csv          # Numeric matrix
    ├── logistic_regression_auroc.png       # Heatmap: AUROC scores
    ├── logistic_regression_auroc.csv
    ├── logistic_regression_balacc.png      # Heatmap: Balanced accuracy
    ├── logistic_regression_balacc.csv
    ├── logistic_regression_mcc.png         # Heatmap: Matthews correlation coefficient
    ├── logistic_regression_mcc.csv
    ├── [similar files for random_forest, xgboost, mlp]
    └── ...

Each heatmap shows train city (rows) × test city (columns) performance.
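One way to summarise such a matrix is to compare the diagonal (same train and test city) with the off-diagonal cells. The sketch below assumes the matrix has already been loaded into a nested dictionary; `generalization_gap` is a hypothetical helper and the numbers are made up for illustration, not results from the study.

```python
def generalization_gap(matrix):
    """matrix: {train_site: {test_site: score}} -> (in_site, cross_site, gap)."""
    in_site = [matrix[s][s] for s in matrix]
    cross_site = [
        score
        for train, row in matrix.items()
        for test, score in row.items()
        if test != train
    ]
    mean_in = sum(in_site) / len(in_site)
    mean_cross = sum(cross_site) / len(cross_site)
    return mean_in, mean_cross, mean_in - mean_cross


# Made-up scores with placeholder site names.
scores = {
    "A": {"A": 1.0, "B": 0.5, "C": 0.75},
    "B": {"A": 0.5, "B": 1.0, "C": 0.5},
    "C": {"A": 0.75, "B": 0.5, "C": 1.0},
}
mean_in, mean_cross, gap = generalization_gap(scores)
print(f"in-site={mean_in:.3f} cross-site={mean_cross:.3f} gap={gap:.3f}")
```

A large gap suggests that models rely on site-specific signal, which is the motivation for the batch-corrected analysis.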

Batch-corrected cross-validation

Perform stratified cross-validation with ComBat batch effect correction using combatlearn.
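The key point for avoiding data leakage is that the correction parameters are estimated on the training fold only and then applied unchanged to the test fold. The sketch below illustrates that principle with per-batch mean centering, a much simpler stand-in for ComBat (which additionally adjusts scale and shrinks the estimates with empirical Bayes). It does not use the combatlearn API, and the function names are hypothetical.

```python
def fit_batch_means(X, batches):
    """Learn the grand mean and per-batch feature means from TRAINING rows only."""
    n_features = len(X[0])
    grand = [sum(row[j] for row in X) / len(X) for j in range(n_features)]
    means = {}
    for b in set(batches):
        rows = [row for row, rb in zip(X, batches) if rb == b]
        means[b] = [sum(r[j] for r in rows) / len(rows) for j in range(n_features)]
    return grand, means


def apply_batch_means(X, batches, grand, means):
    """Shift each row so its batch mean coincides with the training grand mean."""
    return [
        [x - means[b][j] + grand[j] for j, x in enumerate(row)]
        for row, b in zip(X, batches)
    ]


# Toy single-feature data: batch "B" is offset by +10 relative to batch "A".
X = [[1.0], [3.0], [11.0], [13.0], [2.0], [12.0]]
batches = ["A", "A", "B", "B", "A", "B"]
train_idx, test_idx = [0, 1, 2, 3], [4, 5]

X_train = [X[i] for i in train_idx]
b_train = [batches[i] for i in train_idx]
X_test = [X[i] for i in test_idx]
b_test = [batches[i] for i in test_idx]

# Fit on the training fold, then reuse the same parameters for the test fold.
grand, means = fit_batch_means(X_train, b_train)
X_train_corr = apply_batch_means(X_train, b_train, grand, means)
X_test_corr = apply_batch_means(X_test, b_test, grand, means)
print(X_train_corr, X_test_corr)
```

A batch that appears only in the test fold would raise a `KeyError` here, since no parameters were learned for it; stratifying folds so that every site is represented in training avoids that case.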

Input requirements:

Same as for the cross-dataset analysis.
Trained models are saved to disk and reloaded on later runs unless --force-retrain is specified.

Command:

python src/aggregated_datasets_framework.py \
    --data ./data/dfs/data_bin_3.csv \
    --batches ./data/dfs/metadata_bin_3.csv \
    --targets Meropenem Amikacin \
    --out ./results/aggregated_datasets \
    [--force-retrain]

Outputs (per target):

results/aggregated_datasets/
├── metrics_Meropenem.csv                                         # Aggregate metrics (all models)
├── metrics_detailed_logistic_regression_Meropenem.csv            # Per-fold metrics
├── confmat_Logistic_Regression_Meropenem.png                     # Confusion matrix
├── confmat_Logistic_Regression_Meropenem.pdf
├── shap_beeswarm_Logistic_Regression_Meropenem.png               # Feature importance
├── shap_beeswarm_Logistic_Regression_Meropenem.pdf
├── perf_by_center_logistic_regression_Meropenem.csv              # Center-specific metrics
├── perf_boxplot_Logistic_Regression_Meropenem_Balanced_Acc.png
├── perf_boxplot_Logistic_Regression_Meropenem_MCC.png
├── perf_boxplot_Logistic_Regression_Meropenem_AUROC.png
├── saved_models/
│   └── Meropenem/
│       ├── Logistic_Regression_fold0.pkl
│       ├── Logistic_Regression_fold1.pkl
│       └── ...
└── [similar files for other models and targets]
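The save-and-reload behaviour controlled by --force-retrain can be sketched as follows. This is an assumed implementation pattern, not the repository's code: `load_or_train` and `train_fn` are hypothetical names, and a plain dictionary stands in for a fitted estimator.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_train(path, train_fn, force_retrain=False):
    """Return a cached model from `path`, or train, save and return a new one."""
    path = Path(path)
    if path.exists() and not force_retrain:
        with path.open("rb") as fh:
            return pickle.load(fh)
    model = train_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(model, fh)
    return model


calls = []


def train_fn():
    calls.append(1)  # count how often training actually runs
    return {"weights": [0.1, 0.2]}  # stand-in for a fitted estimator


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "saved_models" / "Meropenem" / "Logistic_Regression_fold0.pkl"
    first = load_or_train(path, train_fn)  # trains and saves
    second = load_or_train(path, train_fn)  # loads from disk
    third = load_or_train(path, train_fn, force_retrain=True)  # trains again

print(len(calls))
```

Pickle files can execute arbitrary code when loaded, so saved models should only be reloaded from a trusted source.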

Batch effect visualization

Visualize and quantify batch effects before and after ComBat correction using dimensionality reduction and statistical metrics.

Command:

python src/batch_analyses.py \
    --data ./data/dfs/data_bin_3.csv \
    --batches ./data/dfs/metadata_bin_3.csv \
    --out ./results/batches \
    [--per-batch]

Options:

--per-batch: Enable per-batch diagnostics (computes metrics for each batch individually)

Outputs:

results/batches/
├── pca_comparison.png                    # Side-by-side PCA (before/after)
├── pca_comparison.pdf
├── umap_comparison.png                   # Side-by-side UMAP (before/after)
├── umap_comparison.pdf
└── batch_correction_report.md            # Quantitative assessment

Metrics computed:

Batch Effect Removal: Centroid distances, variance alignment (Levene's test), batch separability (Silhouette score)
Structure Preservation: k-NN neighborhood preservation, distance correlation
Distribution Preservation: Skewness changes, Kolmogorov-Smirnov statistics
Per-Batch Diagnostics (optional): Alignment, structure preservation, and distribution similarity within each batch
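Two of the metrics listed above are simple enough to sketch in plain Python. The functions below are illustrative only and may differ from what the report script computes: `mean_centroid_distance` averages the Euclidean distance over all pairs of batch centroids, and `ks_statistic` is the two-sample Kolmogorov-Smirnov statistic, which could be applied, for example, to one feature's values before and after correction.

```python
from bisect import bisect_right
from itertools import combinations
from math import dist


def centroid(rows):
    """Feature-wise mean of a list of rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def mean_centroid_distance(X, batches):
    """Average Euclidean distance between all pairs of batch centroids."""
    groups = {}
    for row, b in zip(X, batches):
        groups.setdefault(b, []).append(row)
    cents = [centroid(rows) for rows in groups.values()]
    pairs = list(combinations(cents, 2))  # requires at least two batches
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
        for v in a + b
    )


# Toy data: two batches whose centroids are (1, 0) and (5, 3).
X = [[0.0, 0.0], [2.0, 0.0], [4.0, 3.0], [6.0, 3.0]]
batches = ["A", "A", "B", "B"]
centroid_gap = mean_centroid_distance(X, batches)
ks_disjoint = ks_statistic([1, 2, 3], [4, 5, 6])
ks_overlap = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
print(centroid_gap, ks_disjoint, ks_overlap)
```

A centroid distance that shrinks after correction indicates reduced batch separation, while a small KS statistic between pre- and post-correction values indicates that the shape of the distribution was largely preserved.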

Citation

The dataset is publicly available on Zenodo at: MALDI-Kleb-AI.

If you use the dataset and/or the pipeline, please consider citing:

@article{Rocchi2026,
  author    = {Rocchi, Ettore and Nicitra, Emanuele and Calvo, Maddalena and Cento, Valeria and Peiretti, Laura and Asif, Zian and Menchinelli, Giulia and Posteraro, Brunella and Sala, Claudia and Colosimo, Claudia and Cricca, Monica and Sambri, Vittorio and Sanguinetti, Maurizio and Castellani, Gastone and Stefani, Stefania},
  title     = {Combining mass spectrometry and machine learning models for predicting Klebsiella pneumoniae antimicrobial resistance: a multicenter experience from clinical isolates in Italy},
  journal   = {BMC Microbiology},
  year      = {2026},
  doi       = {10.1186/s12866-025-04657-2},
  url       = {https://doi.org/10.1186/s12866-025-04657-2}
}
