Understanding gene perturbation effects across diverse cellular contexts is a central challenge in functional genomics, with significant implications for therapeutic discovery and precision medicine. While single-cell technologies enable high-resolution measurement of transcriptional responses, collecting such data remains expensive and time-intensive, especially when repeated for each cell type. Existing computational methods attempt to predict these responses but typically require separate models per cell type, limiting scalability and generalization.
CFM-GP (Conditional Flow Matching for Gene Perturbation) is a deep learning framework that models perturbation as a continuous transformation between control and perturbed gene expression distributions, conditioned on cell type. A single model generalizes across all cell types, eliminating the need for cell type–specific training.
- Cell Type–Agnostic Prediction: Single model across all cell types
- Continuous Trajectory Modeling: Learns time-dependent perturbation dynamics
- Generalization Across Contexts: Works across datasets and species
- Biological Fidelity: Recovers pathway-level signals
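The continuous transformation between control and perturbed distributions can be illustrated with one common conditional flow matching formulation. This is only a sketch: it assumes a straight-line path between a paired control profile $x_0$ and perturbed profile $x_1$, and the symbols $v_\theta$ (learned velocity field), $c$ (cell type label), and $t$ (time) are illustrative. The path and loss actually used by CFM-GP may differ.

```latex
% Interpolated state at time t (assumed linear path)
x_t = (1 - t)\,x_0 + t\,x_1, \qquad t \sim \mathcal{U}[0, 1]

% Regress the velocity field onto the constant target x_1 - x_0
\mathcal{L}(\theta) = \mathbb{E}_{t,\,(x_0, x_1, c)}
  \left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|^2

% Prediction: integrate the learned field starting from a control cell
\hat{x}_1 = x_0 + \int_0^1 v_\theta(x_t, t, c)\,dt
```

Under this formulation, conditioning on $c$ is what would let a single velocity field serve every cell type.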
git clone https://github.com/abrarrahmanabir/CFM-GP.git
cd CFM-GP
pip install -r requirements.txt
Download processed datasets:
👉 https://drive.google.com/file/d/1sJxHM4te1CNShBLUrLVEGPrkEbOjM7mk/view?usp=sharing
Place them in:
./data/
We use five public single-cell RNA-seq datasets:
- COVID-19 (GSE145926)
- PBMC IFN-β (GSE96583)
- Glioblastoma drug response (GSE148842)
- Lupus IFN-β (GSE96583)
- Statefate cytokine stimulation (GSE140802)
All datasets contain paired control and perturbed expression profiles.
Upstream preprocessing includes:
- Normalization
- Log-transformation
- Highly variable gene (HVG) selection
- Batch harmonization (when applicable)
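The first three preprocessing steps can be sketched with NumPy alone. This is a simplified illustration, not the pipeline used to build the released datasets: the function name `preprocess_counts`, the `target_sum` of 1e4, the `n_top_genes` default, and the variance-based gene ranking are all assumptions, and batch harmonization is left out.

```python
import numpy as np

def preprocess_counts(counts, target_sum=1e4, n_top_genes=2000):
    """Normalize, log-transform, and select high-variance genes.

    counts: (cells x genes) array of raw counts (hypothetical input).
    Returns the filtered matrix and the sorted column indices kept.
    """
    counts = np.asarray(counts, dtype=np.float64)

    # Library-size normalization: scale every cell to the same total count.
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty cells
    normalized = counts / totals * target_sum

    # Log-transform with a pseudocount of 1.
    logged = np.log1p(normalized)

    # Simplified HVG selection: keep the genes with the largest variance.
    # (Dedicated tools typically use dispersion-based rankings instead.)
    n_keep = min(n_top_genes, logged.shape[1])
    variances = logged.var(axis=0)
    hvg_idx = np.sort(np.argsort(variances)[::-1][:n_keep])
    return logged[:, hvg_idx], hvg_idx
```

Returning `hvg_idx` in sorted order keeps the original gene ordering, so the same columns can be selected from control and perturbed matrices.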
For each cell type:
- Extract control cells
- Extract perturbed cells
- Match by minimum size to ensure strict pairing
min_n = min(adata_ctrl.shape[0], adata_pert.shape[0])
X_ctrl = adata_ctrl.X[:min_n]
X_pert = adata_pert.X[:min_n]
This ensures 1:1 control–perturbation pairing per cell type.
Only cell types with both control and perturbed samples are retained.
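The per-cell-type pairing and filtering described above can be sketched as one loop. The function name `pair_by_cell_type`, the `conditions` array, and the `"control"` / `"perturbed"` labels are hypothetical, and `X` is assumed to be a dense NumPy array. As in the snippet above, pairing is positional: rows are matched by order after truncation.

```python
import numpy as np

def pair_by_cell_type(X, cell_types, conditions,
                      ctrl_label="control", pert_label="perturbed"):
    """Build row-aligned control/perturbed matrices, one block per cell type."""
    cell_types = np.asarray(cell_types)
    conditions = np.asarray(conditions)
    ctrl_blocks, pert_blocks, labels = [], [], []

    for ct in np.unique(cell_types):
        in_type = cell_types == ct
        ctrl = X[in_type & (conditions == ctrl_label)]
        pert = X[in_type & (conditions == pert_label)]

        # Truncate both groups to the smaller one for 1:1 pairing.
        min_n = min(ctrl.shape[0], pert.shape[0])
        if min_n == 0:
            continue  # drop cell types missing either condition

        ctrl_blocks.append(ctrl[:min_n])
        pert_blocks.append(pert[:min_n])
        labels.extend([ct] * min_n)

    if not ctrl_blocks:
        raise ValueError("no cell type has both control and perturbed cells")
    return np.vstack(ctrl_blocks), np.vstack(pert_blocks), np.array(labels)
```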
- x_ctrl ∈ ℝ^{N × G} → control expression
- x_pert ∈ ℝ^{N × G} → perturbed expression
Gene ordering is consistent across both.
Cell types are encoded as integers:
from sklearn.preprocessing import LabelEncoder
cell_type_encoded = LabelEncoder().fit_transform(cell_types)
Each dataset is saved as .pt containing:
- x_ctrl
- x_pert
- cell_type
- cell_type_mapping
- gene_names
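A quick sanity check on a loaded dataset can catch shape mismatches early. The sketch below assumes the saved object is a dict keyed by the five names listed above, that `cell_type` has one entry per cell, and that `gene_names` has one entry per gene; the helper name `check_bundle` is hypothetical.

```python
import numpy as np

REQUIRED_KEYS = ("x_ctrl", "x_pert", "cell_type",
                 "cell_type_mapping", "gene_names")

def check_bundle(bundle):
    """Validate the assumed layout of a dataset dict; return (N, G)."""
    missing = [k for k in REQUIRED_KEYS if k not in bundle]
    if missing:
        raise KeyError(f"missing keys: {missing}")

    x_ctrl = np.asarray(bundle["x_ctrl"])
    x_pert = np.asarray(bundle["x_pert"])
    if x_ctrl.ndim != 2 or x_ctrl.shape != x_pert.shape:
        raise ValueError("x_ctrl and x_pert must share one N x G shape")

    n_cells, n_genes = x_ctrl.shape
    if len(bundle["cell_type"]) != n_cells:
        raise ValueError("cell_type needs one label per paired cell")
    if len(bundle["gene_names"]) != n_genes:
        raise ValueError("gene_names needs one name per gene column")
    return n_cells, n_genes
```

With PyTorch installed, the dict would typically come from `torch.load` on the dataset's `.pt` file before being passed to `check_bundle`.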
Train/test splits:
- Stratified by cell type
- Maintain paired structure
- Provided splits per dataset
- Donor-level separation applied when available
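A stratified split that keeps pairs intact can be sketched by splitting pair indices rather than individual cells. This does not reproduce the provided splits: the function name `stratified_pair_split`, the `test_fraction`, and the `seed` are assumptions, and donor-level separation is not handled.

```python
import random
from collections import defaultdict

def stratified_pair_split(cell_type_labels, test_fraction=0.2, seed=0):
    """Split pair indices into train/test, stratified by cell type.

    Each index refers to one (control, perturbed) pair, so applying the
    same indices to x_ctrl and x_pert preserves the paired structure.
    """
    by_type = defaultdict(list)
    for idx, label in enumerate(cell_type_labels):
        by_type[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_type):
        indices = by_type[label]
        rng.shuffle(indices)
        # Hold out roughly the same fraction of every cell type.
        n_test = int(round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)
```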
Implementation details:
- Uses scanpy for .h5ad loading
- Converts sparse matrices to dense
- Uses PyTorch for tensor storage
bash test_script.sh
Outputs:
- <dataset>_results/r2.csv
- <dataset>_results/spearman.csv
- <dataset>_results/mmd.csv
Pretrained models are available in:
./model/
Run:
bash test_script.sh
Train from scratch:
bash train_script.sh
- GPU recommended (V100 / A100)
- RAM ≥ 16GB
Evaluation metrics:
- R² → prediction accuracy
- Spearman correlation → rank consistency
- MMD → distribution similarity
Citation:
@article{abir2025cfm,
title={CFM-GP: Unified Conditional Flow Matching to Learn Gene Perturbation Across Cell Types},
author={Abir, Abrar Rahman and Dip, Sajib Acharjee and Zhang, Liqing},
journal={arXiv preprint arXiv:2508.08312},
year={2025}
}
Contact:
- Abrar Rahman Abir (abrarrahmanabir156@gmail.com)
- Sajib Acharjee Dip (sajibacharjeedip@vt.edu)
- Liqing Zhang (lqzhang@cs.vt.edu)
