scBoost-ensemble for cell-type classification from multiple scRNA-seq embeddings.
scBoost-ensemble trains per-method XGBoost base learners on each embedding view stored in an AnnData object (adata.obsm), then combines their out-of-fold probability predictions through a meta-learner.
This stacking strategy lets the ensemble leverage the complementary strengths of different integration and embedding methods (e.g., Harmony, Scanorama, scVI, scGPT, scANVI).
┌─────────────────────────────────────────────────────────────┐
│ AnnData (.h5ad) │
│ obsm: X_harmony, X_scanorama, X_scvi, ... │
│ obs: cell_type, batch, [donor_id, ...] │
└──────────────┬──────────────────────────────────────────────┘
│
┌──────▼──────┐
│ Split data │ stratified or user-defined holdout set (e.g. donor)
└──────┬──────┘
│
┌──────────▼──────────┐
│ Base learners (CV) │ one XGBoost per embedding method,
│ out-of-fold probs │ trained in parallel across GPUs
└──────────┬──────────┘
│
┌───────▼───────┐
│ Meta-learner │ XGBoost on stacked probabilities
└───────┬───────┘
│
┌───────▼───────┐
│ Predictions │ cell-type labels + probabilities
│ + evaluation │ per-cell-type report, method importance
└───────────────┘
- Python >= 3.9
- anndata / scanpy
- xgboost (>= 2.0 recommended for
device="cuda:N"support) - scikit-learn
- numpy, pandas, joblib
- matplotlib, seaborn (plotting only)
pip install anndata scanpy xgboost scikit-learn numpy pandas joblib matplotlib seabornscBoost operates on AnnData .h5ad files with:
adata.obsm— one or more precomputed embedding matrices (e.g.X_harmony,X_scanorama,X_scvi)adata.obs["cell_type"]— cell-type labelsadata.obs["batch"]— batch identifiers
python scBoost-ensemble/example_combined_calls.py \
--input data/SketchProcessed_10k_geosketch.h5ad \
--sample_name TabulaSapiens_10k \
--methods X_harmony X_scanorama X_scvi \
--split_type stratified --test_size 0.1 --seed 42 \
--gpu 0 --n_gpus 2 \
--out_dir outputs/python scBoost-ensemble/example_single_method.py \
--input data/SketchProcessed_10k_geosketch.h5ad \
--sample_name my_dataset \
--method X_harmony \
--split_type stratified --test_size 0.1 --seed 42 \
--gpu 0 \
--out_dir outputs/python scBoost-ensemble/test_only.py \
--input data/new_data.h5ad \
--model outputs/stacking_ensemble.pkl \
--sample_name my_dataset \
--split_type donor_id --holdout_value TSP7 \
--out_dir outputs/The above example evaluates on cells with donor_id labeled as "TSP7".
Use --split_type with an adata.obs column name instead of stratified:
python scBoost-ensemble/example_combined_calls.py \
--input data.h5ad \
--sample_name holdout_experiment \
--methods X_harmony X_scanorama X_scvi \
--split_type donor_id --test_size 0.1 \
--no_batch_feature \
--out_dir outputs/| Argument | Description | Default |
|---|---|---|
--input |
Path to .h5ad file |
required |
--sample_name |
Tag for output filenames | required |
--methods |
obsm keys to use as views |
required |
--split_type |
stratified or an obs column name (e.g. donor_id) |
stratified |
--holdout_value |
Specific value to hold out as test set | required if split_type is an obs column name |
--test_size |
Test fraction (stratified splits only) | 0.1 |
--seed |
Random seed | 42 |
--gpu |
Starting GPU device ID; CPU if omitted | None |
--n_gpus |
Number of GPUs for parallel base-learner training | 1 |
--no_batch_feature |
Disable batch one-hot in meta-learner | False |
--out_dir |
Output directory | outputs/ |
Each run produces:
| File | Contents |
|---|---|
*_classification_report.tsv |
Per-cell-type precision, recall, F1, top-5 accuracy, and support |
*_method_importance.tsv |
Relative importance of each embedding method in the meta-learner |
*_split.tsv |
Train/test assignment for each cell barcode |
*_stacking_ensemble.pkl |
Serialised model (loadable with joblib.load) |
Run 3-method combinations from a set of 7 embeddings, distributing across GPUs:
METHODS=(Geoksketch Harmony NMF Python_PCA Scanorama scANVI scVI)
for ((i=0; i<${#METHODS[@]}-2; i++)); do
for ((j=i+1; j<${#METHODS[@]}-1; j++)); do
for ((k=j+1; k<${#METHODS[@]}; k++)); do
python scBoost-ensemble/example_combined_calls.py \
--input data.h5ad \
--sample_name "combo_${METHODS[$i]}_${METHODS[$j]}_${METHODS[$k]}" \
--methods ${METHODS[$i]} ${METHODS[$j]} ${METHODS[$k]} \
--split_type donor_id --gpu $((job++ % 8)) --no_batch_feature &
done
done
done
waitAggregate results afterward:
python scBoost-ensemble/extract_summary_metrics.py --outputs_dir outputs/
python scBoost-ensemble/plot_barplot.py --input outputs/summary_metrics_slim.tsv --out figures/barplot.svg