GitHub - gdewael/bento-sc: BENchmarking Transformer-Obtained Single-Cell embeddings

BENchmarking Transformer-Obtained Single-Cell representations.

Single-cell language modeling

This repository accompanies the study "A systematic assessment of single-cell language model configurations" (preprint paper link).

The package contains routines and definitions for pre-training single-cell (transcriptomic) language models.

Package features:

Memory-efficient scRNA-seq dataloading from h5torch-compatible HDF5 files.
YAML-configurable language model training scripts.
Modular and extendable data preprocessing pipelines.
A diverse set of downstream tasks to evaluate scLM performance.
Full instructions for reproducing our study results via bento-sc-reproducibility.

Install

bento-sc is distributed on PyPI.

pip install bento-sc

Note: The package has been tested with torch==2.2.2 and pytorch-lightning==2.2.5. If you encounter errors with bento-sc when using more recent versions of these two packages, consider downgrading.

You may need to install PyTorch before running this command in order to ensure the right CUDA kernels for your system are installed.
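To see whether an environment differs from the tested versions noted above, a small check along the following lines may help. This is only a sketch and not part of bento-sc: the helper names `parse_release` and `check_versions` are hypothetical, and the comparison only looks at the leading numeric components of each version string (so a local build tag such as `+cu121` is ignored).

```python
from importlib import metadata

# Versions the README reports as tested (taken from the note above).
TESTED = {"torch": "2.2.2", "pytorch-lightning": "2.2.5"}


def parse_release(version):
    """Return the leading numeric release components as a tuple of ints.

    Local build tags ('+cu121') and pre-release suffixes ('rc1') are ignored.
    """
    public = version.split("+")[0]
    parts = []
    for piece in public.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_versions(tested=TESTED, get_version=metadata.version):
    """Compare installed package versions against the tested ones.

    `get_version` is injectable so the check can be exercised without
    the packages being installed (hypothetical helper, not a bento-sc API).
    """
    report = {}
    for name, wanted in tested.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        have, want = parse_release(installed), parse_release(wanted)
        if have > want:
            report[name] = f"{installed} is newer than tested {wanted}"
        elif have < want:
            report[name] = f"{installed} is older than tested {wanted}"
        else:
            report[name] = f"{installed} matches the tested version"
    return report


if __name__ == "__main__":
    for name, status in check_versions().items():
        print(f"{name}: {status}")
```

If either package is reported as newer and bento-sc raises errors, pinning to the tested versions (for example `pip install torch==2.2.2 pytorch-lightning==2.2.5`) is the downgrade the note suggests.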

Package usage and structure

Please refer to our documentation page.

Academic reproducibility

All config files and scripts that were used to pre-train models and fine-tune them towards downstream tasks are included in a separate GitHub repository: bento-sc-reproducibility.

In addition, all scripts to reproduce the "baselines" in our study are located in the bento-sc-reproducibility repository.

Citation

If you end up using this code in your research, please cite:

@article {dewaele2025systematic,
	author = {De Waele, Gaetan and Menschaert, Gerben and Waegeman, Willem},
	title = {A systematic assessment of single-cell language model configurations},
	year = {2025},
	doi = {10.1101/2025.04.02.646825},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2025/04/08/2025.04.02.646825},
	journal = {bioRxiv}
}

Accreditation original dataset authors

The following datasets were used:

scTab (derived from CELLxGENE census) - CC-BY-4.0 License
NeurIPS 2023 Perturbation data - CC-BY-4.0 License
Replogle perturb-seq - CC-BY-4.0 License
NeurIPS 2021 CITE-seq - CC-BY-4.0 License
Circulating immune cells (CELLxGENE derived) - CC-BY-4.0 License
Embryonic limb cell atlas - CC-BY-4.0 License
Middle temporal gyrus among great apes - CC-BY-4.0 License

If you use bento-sc, please also credit the original dataset authors:

scTab:

@article{fischer2024sctab,
	title={scTab: scaling cross-tissue single-cell annotation models},
	author={Fischer, Felix and Fischer, David S and Mukhin, Roman and Isaev, Andrey and Biederstedt, Evan and Villani, Alexandra-Chlo{\'e} and Theis, Fabian J},
	journal={Nature Communications},
	volume={15},
	number={1},
	pages={6611},
	year={2024},
	publisher={Nature Publishing Group UK London}
}

GRN Inference:

@article{szalata2025benchmark,
	title={A benchmark for prediction of transcriptomic responses to chemical perturbations across cell types},
	author={Sza{\l}ata, Artur and Benz, Andrew and Cannoodt, Robrecht and Cortes, Mauricio and Fong, Jason and Kuppasani, Sunil and Lieberman, Richard and Liu, Tianyu and Mas-Rosario, Javier and Meinl, Rico and others},
	journal={Advances in Neural Information Processing Systems},
	volume={37},
	pages={20566--20616},
	year={2025}
}

Post-perturbation prediction:

@article{replogle2022mapping,
	title={Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq},
	author={Replogle, Joseph M and Saunders, Reuben A and Pogson, Angela N and Hussmann, Jeffrey A and Lenail, Alexander and Guna, Alina and Mascibroda, Lauren and Wagner, Eric J and Adelman, Karen and Lithwick-Yanai, Gila and others},
	journal={Cell},
	volume={185},
	number={14},
	pages={2559--2575},
	year={2022},
	publisher={Elsevier}
}

Surface protein abundance prediction:

@inproceedings{luecken2021sandbox,
	title={A sandbox for prediction and integration of DNA, RNA, and proteins in single cells},
	author={Luecken, Malte D and Burkhardt, Daniel Bernard and Cannoodt, Robrecht and Lance, Christopher and Agrawal, Aditi and Aliee, Hananeh and Chen, Ann T and Deconinck, Louise and Detweiler, Angela M and Granados, Alejandro A and others},
	booktitle={Thirty-fifth conference on neural information processing systems datasets and benchmarks track (Round 2)},
	year={2021}
}

Batch correction:

@article{zhang2023human,
	title={A human embryonic limb cell atlas resolved in space and time},
	author={Zhang, Bao and He, Peng and Lawrence, John EG and Wang, Shuaiyu and Tuck, Elizabeth and Williams, Brian A and Roberts, Kenny and Kleshchevnikov, Vitalii and Mamanova, Lira and Bolt, Liam and others},
	journal={Nature},
	pages={1--11},
	year={2023},
	publisher={Nature Publishing Group UK London}
}

@article{jorstad2023comparative,
	title={Comparative transcriptomics reveals human-specific cortical features},
	author={Jorstad, Nikolas L and Song, Janet HT and Exposito-Alonso, David and Suresh, Hamsini and Castro-Pacheco, Nathan and Krienen, Fenna M and Yanny, Anna Marie and Close, Jennie and Gelfand, Emily and Long, Brian and others},
	journal={Science},
	volume={382},
	number={6667},
	pages={eade9516},
	year={2023},
	publisher={American Association for the Advancement of Science}
}

@article{ivanova2023mrna,
	title={mRNA COVID-19 vaccine elicits potent adaptive immune response without the acute inflammation of SARS-CoV-2 infection},
	author={Ivanova, Ellie N and Shwetar, Jasmine and Devlin, Joseph C and Buus, Terkild B and Gray-Gaillard, Sophie and Koide, Akiko and Cornelius, Amber and Samanovic, Marie I and Herrera, Alberto and Mimitou, Eleni P and others},
	journal={Iscience},
	volume={26},
	number={12},
	year={2023},
	publisher={Elsevier}
}

