DeSCENT

Deconvolutional Single-Cell RNA-seq ENhances Transcriptome-based Cancer Survival Analysis

Quick Start · Pipeline · Notebook Demo · Configuration · Project Structure

Quick Start

git clone https://github.com/YonghaoZhao722/DeSCENT.git
cd DeSCENT
conda env create -f environment.yml
conda activate descent
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH

The LD_LIBRARY_PATH export puts the conda environment's libraries first on the loader path, which fixes CXXABI_1.3.15 errors on older systems. All shell scripts set it automatically.
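
If you want a rough check that a given libstdc++ build carries the required version tag, a byte search over the library file is usually enough as a hint. The sketch below is illustrative and not part of DeSCENT: the helper name `provides_symbol_version` is hypothetical, and the location `$CONDA_PREFIX/lib/libstdc++.so.6` is an assumption about how the conda environment is laid out.

```python
import os
from pathlib import Path


def provides_symbol_version(lib_path, version="CXXABI_1.3.15"):
    """Return True if the raw bytes of lib_path contain the version tag.

    This is a plain byte search, not an ELF parser, so a match is only a hint.
    """
    data = Path(lib_path).read_bytes()
    return version.encode("ascii") in data


if __name__ == "__main__":
    # Assumption: the active conda env provides lib/libstdc++.so.6.
    prefix = os.environ.get("CONDA_PREFIX")
    if prefix is None:
        print("CONDA_PREFIX is not set; activate the environment first")
    else:
        candidate = Path(prefix) / "lib" / "libstdc++.so.6"
        if candidate.exists():
            print(candidate, provides_symbol_version(candidate))
        else:
            print("not found:", candidate)
```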

The repository ships a BRCA demo config at config/path_local.json. The template config/path.json.example is also set up for BRCA with repo-relative paths, so a direct copy runs the BRCA demo as long as the expected data files are present under data/BRCA/. To create or replace the local config from the template, run:

cp config/path.json.example config/path_local.json

Then edit config/path_local.json only if you want to change the demo defaults or add another cancer type.

Run: notebooks/descent_pipeline_demo.ipynb

Onboarding

For the BRCA demo, the repo expects the following layout:

config/path_local.json: active BRCA config used by the shell scripts and notebook
data/BRCA/refs/deg_cv/degs_fold{1..5}.csv: per-fold DEG inputs for survival CV
data/BRCA/bulk/: bulk expression splits plus filtered_tpm_BRCA.tsv, downloaded from the same Zenodo record as the BRCA single-cell demo data
data/BRCA/single_cell/BRCA_train.symbol_mapped.h5ad and data/BRCA/single_cell/BRCA_test.symbol_mapped.h5ad: single-cell train/test .h5ad downloaded from the same Zenodo record
data/pretrained/annotation_model_v1/: downloaded scimilarity pretrained checkpoint
data/BRCA/redeconv_ref/Meta_data_new.tsv and data/BRCA/redeconv_ref/scRNA_seq_new_noShift.tsv: ReDeconv reference files required by part 1
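
Before launching a long run, it can help to confirm that the layout above is in place. The following is a minimal sketch, not a script shipped with the repo: the names `EXPECTED_BRCA_PATHS` and `missing_brca_demo_paths` are hypothetical, and the path list is copied from the layout above.

```python
from pathlib import Path

# Paths copied from the layout list above, relative to the repository root.
EXPECTED_BRCA_PATHS = [
    "config/path_local.json",
    "data/BRCA/bulk",
    "data/BRCA/bulk/filtered_tpm_BRCA.tsv",
    "data/BRCA/single_cell/BRCA_train.symbol_mapped.h5ad",
    "data/BRCA/single_cell/BRCA_test.symbol_mapped.h5ad",
    "data/pretrained/annotation_model_v1",
    "data/BRCA/redeconv_ref/Meta_data_new.tsv",
    "data/BRCA/redeconv_ref/scRNA_seq_new_noShift.tsv",
] + [f"data/BRCA/refs/deg_cv/degs_fold{i}.csv" for i in range(1, 6)]


def missing_brca_demo_paths(repo_root="."):
    """Return the expected paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED_BRCA_PATHS if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_brca_demo_paths(".")
    if missing:
        print("Missing:")
        for rel in missing:
            print("  " + rel)
    else:
        print("BRCA demo layout looks complete")
```

Run it from the repository root. It only tests for existence, so an empty or truncated download still passes.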

For a new cancer type, you can start from the BRCA repo-local template with:

cp config/path.json.example config/path_local.json

Then edit config/path_local.json and keep the same repo-relative directory layout under data/<CANCER>/....

Training Data

The BRCA demo data is available on Zenodo. That record now contains both:

the BRCA single-cell train/test .h5ad files
the BRCA bulk input files used by ReDeconv and survival CV

Bulk files are not stored in Git because several files exceed GitHub's 100 MB file size limit.

After downloading, place the files at:

data/BRCA/single_cell/BRCA_train.symbol_mapped.h5ad
data/BRCA/single_cell/BRCA_test.symbol_mapped.h5ad
data/BRCA/bulk/filtered_tpm_BRCA.tsv
data/BRCA/bulk/train_data_1.csv ... data/BRCA/bulk/train_data_5.csv
data/BRCA/bulk/val_data_1.csv ... data/BRCA/bulk/val_data_5.csv

External Checkpoint Requirement

The VAE fine-tuning step requires the downloaded scimilarity pretrained checkpoint annotation_model_v1. This checkpoint is not bundled with DeSCENT.

Download the scimilarity pretrained weights from Zenodo and place the extracted annotation_model_v1 directory at data/pretrained/annotation_model_v1/.
DeSCENT locates that directory through the VAE_pretrained entry in config/path_local.json.

Training policy in this repo is:

VAE: fine-tune from annotation_model_v1
diffusion_backbone: train from scratch
classifier: train from scratch

./scripts/run_part1_deg_redeconv_condgen.sh now supports SCDIFFUSION_TRAIN_MODE:

auto (default): train only when stable checkpoints are missing
force: always rerun VAE fine-tuning + backbone training + classifier training
skip: never train, only reuse the configured checkpoints
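
The three modes amount to a small decision rule. The sketch below restates it in Python under stated assumptions: `should_train` is a hypothetical helper, not the script's actual code, and it treats "stable checkpoints are missing" as "at least one of the listed paths does not exist", which the README does not spell out.

```python
from pathlib import Path

VALID_MODES = ("auto", "force", "skip")


def should_train(mode, checkpoint_paths):
    """Decide whether to (re)train, following the three documented modes."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown SCDIFFUSION_TRAIN_MODE: {mode!r}")
    if mode == "force":
        return True   # always rerun all three training steps
    if mode == "skip":
        return False  # never train, reuse configured checkpoints
    # auto: train only when at least one checkpoint path is missing (assumption)
    return not all(Path(p).exists() for p in checkpoint_paths)
```

For example, `should_train("auto", paths)` returns False once every path in `paths` exists, so a second run reuses the exported checkpoints instead of retraining.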

Pipeline

DeSCENT has four stages: ReDeconv → scDiffusion training/fine-tuning → Condgen → Survival CV.

bulk RNA-seq + scRNA-seq reference
        │
        ▼
  ReDeconv ──► cell fractions
        │
        ▼
  scDiffusion training/fine-tuning ──► VAE / diffusion / classifier checkpoints
        │
        ▼
  scDiffusion condgen ──► synthetic scGEP (.npz)
        │
        ▼
  Multimodal survival CV (bulk + scGEP + DEG) ──► C-index

Option A: Shell Scripts

# Part 1: ReDeconv + scDiffusion training + conditional scGEP generation
./scripts/run_part1_deg_redeconv_condgen.sh BRCA

# Part 2: Multimodal survival prediction (5-fold CV)
./scripts/run_part2_survival.sh BRCA

# Or run the full pipeline in one go
./scripts/run_full_pipeline_test.sh BRCA
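
If you prefer to drive the two parts from Python (for example from a job scheduler), a thin wrapper around the same scripts is enough. This is a sketch rather than a shipped utility: `build_commands` and `run_pipeline` are hypothetical names, and passing SCDIFFUSION_TRAIN_MODE through the environment is an assumption about how the script reads that setting.

```python
import os
import subprocess

SCRIPTS = {
    "part1": "./scripts/run_part1_deg_redeconv_condgen.sh",
    "part2": "./scripts/run_part2_survival.sh",
}


def build_commands(cancer="BRCA"):
    """Return the two stage commands in the order the pipeline runs them."""
    return [[SCRIPTS["part1"], cancer], [SCRIPTS["part2"], cancer]]


def run_pipeline(cancer="BRCA", train_mode=None, repo_root="."):
    """Run part 1 then part 2 from repo_root, stopping at the first failure."""
    env = dict(os.environ)
    if train_mode is not None:
        # Assumption: the script reads SCDIFFUSION_TRAIN_MODE from the environment.
        env["SCDIFFUSION_TRAIN_MODE"] = train_mode
    for cmd in build_commands(cancer):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(cmd, cwd=repo_root, env=env, check=True)
```

Calling `run_pipeline("BRCA", train_mode="skip")` would then run both parts while reusing existing checkpoints, provided the assumption about the environment variable holds.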

Option B: Jupyter Notebook

An interactive notebook covers the same pipeline with inline visualizations:

notebooks/descent_pipeline_demo.ipynb

The notebook runs each step via !python shell commands (faithful to the shell scripts) and produces plots for cell fractions, training progress, generated scGEP stats, C-index per fold, and training curves. Change CANCER = "BRCA" in the first code cell to switch cancer types, and adjust CONDGEN_SAMPLE_LIMIT = 4 there to control how many samples the condgen step uses (None or 0 runs the full fraction file).

Note: The full pipeline is compute-intensive. scDiffusion training and condgen require GPU time, and survival CV trains for 250 epochs × 5 folds. For a quick demo, the notebook defaults to SCDIFFUSION_TRAIN_MODE = "auto" and CONDGEN_SAMPLE_LIMIT = 4, and you can reduce EPOCHS to 10.

Notebook Demo

| Step | What it does | Output |
| --- | --- | --- |
| Step 1: ReDeconv | Estimate cell-type fractions from bulk RNA-seq | `output/redeconv_fraction/{CANCER}/` |
| Step 2: scDiffusion Training | Fine-tune VAE and train the diffusion backbone + classifier | `output/scdiffusion_training/{CANCER}/` and `output/scdiffusion_models/{CANCER}/` |
| Step 3: Condgen | Generate synthetic scGEP via scDiffusion | `output/scgep_condgen/{CANCER}/redeconv/` |
| Step 4: Survival CV | Multimodal 5-fold CV (bulk + scGEP + DEG) | `output/survival_cv/{CANCER}/cv_summary.json` |

Each step includes a visualization cell: stacked bar / box plots for cell fractions, .npz inspection for condgen, and C-index bar chart + training curves for survival.
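
Outside the notebook, a quick way to see what a condgen run produced is to list the arrays inside each .npz file. An .npz file is a zip archive holding one .npy member per array, so the standard library can do this without loading any data. The sketch below is illustrative: both function names are hypothetical, and it assumes the .npz files sit directly in the condgen output directory.

```python
import zipfile
from pathlib import Path


def list_npz_arrays(npz_path):
    """Return the array names stored in an .npz file, without loading the data."""
    with zipfile.ZipFile(npz_path) as archive:
        return [name[:-4] for name in archive.namelist() if name.endswith(".npy")]


def summarize_condgen_dir(condgen_dir):
    """Map each .npz file name in condgen_dir to the array names it contains."""
    files = sorted(Path(condgen_dir).glob("*.npz"))
    return {p.name: list_npz_arrays(p) for p in files}


if __name__ == "__main__":
    # Directory taken from the table above, with CANCER = "BRCA".
    for name, arrays in summarize_condgen_dir("output/scgep_condgen/BRCA/redeconv").items():
        print(name, arrays)
```

To inspect shapes and values, load a file with `numpy.load` and index it by one of the listed names.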

Configuration

config/path_local.json is the active repo-local config for the BRCA demo. config/path.json.example is a runnable BRCA template that can be copied directly and then edited for other cancer entries while keeping all paths relative to the repository root.

| Key | Purpose |
| --- | --- |
| `single_cell_data` | Downloaded train/test `.h5ad` files used by scDiffusion training |
| `VAE_pretrained` | Downloaded scimilarity pretrained directory (`annotation_model_v1`) used for VAE fine-tuning |
| `VAE`, `diffusion_backbone`, `classifier` | Stable trained checkpoints exported by DeSCENT under `output/scdiffusion_models/` |
| `sc_npz` | Generated scGEP directory consumed by part 2 |
| `bulk`, `surv_label` | Bulk expression and survival label directories (5-fold splits) |
| `deg_dir` | Per-fold DEG directory containing `degs_fold{1..5}.csv` for leakage-free survival CV |
| `gene_list` | Gene order CSV for generation |
| `redeconv_ref` | Reference scRNA-seq for ReDeconv |
| `bulk_tpm` | TPM-normalized bulk for ReDeconv input |
| `celltypes` | Cell fraction output from ReDeconv |
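
After editing the config for a new cancer type, a quick sanity check is to confirm that every key in the table still appears somewhere in the file. The sketch below makes no assumption about how the JSON is nested (the README does not show the schema) and simply walks the whole structure. `DOCUMENTED_KEYS`, `collect_keys`, and `missing_documented_keys` are hypothetical names.

```python
import json
from pathlib import Path

# Keys copied from the table above.
DOCUMENTED_KEYS = {
    "single_cell_data", "VAE_pretrained", "VAE", "diffusion_backbone",
    "classifier", "sc_npz", "bulk", "surv_label", "deg_dir", "gene_list",
    "redeconv_ref", "bulk_tpm", "celltypes",
}


def collect_keys(node):
    """Collect every dict key found anywhere in a nested JSON structure."""
    found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            found.add(key)
            found |= collect_keys(value)
    elif isinstance(node, list):
        for item in node:
            found |= collect_keys(item)
    return found


def missing_documented_keys(config):
    """Return the documented keys that never appear in the config."""
    return sorted(DOCUMENTED_KEYS - collect_keys(config))


if __name__ == "__main__":
    config_path = Path("config/path_local.json")
    if config_path.exists():
        print(missing_documented_keys(json.loads(config_path.read_text())))
    else:
        print("not found:", config_path)
```

Because the check ignores nesting, it cannot tell whether a key is present for one cancer entry but absent for another. Treat an empty result as necessary rather than sufficient.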

Verified cancers: BRCA, COAD, HNSC, KIRC, LGG, LIHC, LUAD, STAD.

Project Structure

DeSCENT/
├── scgep_generation/           # Chapter 1: scGEP Generation
│   ├── redeconv/               # Bundled ReDeconv (patched fork — do NOT pip install)
│   ├── VAE/                    # VAE encoder/decoder for gene expression latent space
│   ├── guided_diffusion/       # Diffusion backbone + classifier
│   ├── pipeline_scripts/       # ReDeconv + scDiffusion orchestration helpers
│   ├── cell_train.py           # Diffusion backbone training entrypoint
│   ├── classifier_train.py     # Classifier training entrypoint
│   └── generate_bulk_from_diffusion.py
├── survival_prediction/        # Chapter 2: Survival Prediction
│   ├── scrna_bulk_sc_survival_cv.py   # Main entry: 5-fold CV
│   ├── mil_survival_model.py          # MIL-based multimodal model
│   ├── mil_survival_training.py       # Training utilities
│   ├── survival_data.py               # Data loading
│   └── bulk_sample.py                 # Bulk prep
├── notebooks/
│   └── descent_pipeline_demo.ipynb    # Interactive pipeline demo
├── config/
│   ├── path.json               # legacy local config (ignored)
│   ├── path.json.example       # template for repo-relative path config
│   └── path_local.json         # active BRCA demo config
├── scripts/
│   ├── run_part1_deg_redeconv_condgen.sh
│   ├── run_part2_survival.sh
│   ├── run_full_pipeline_test.sh
│   └── run_survival_test.sh
├── output/                     # All outputs (not in git)
│   ├── redeconv_fraction/
│   ├── scdiffusion_models/
│   ├── scdiffusion_training/
│   ├── scgep_condgen/
│   └── survival_cv/
├── data/                       # Input data, DEG folds, and model prerequisites
└── environment.yml

Important Notes

Bundled ReDeconv: scgep_generation/redeconv/ is a patched fork. Do NOT run pip install redeconv: it will silently replace the bundled fork with the vanilla version and break the pipeline.
GPU memory: Pipeline scripts call gpu_cleanup() between steps to avoid OOM when running sequentially.
Per-fold DEG only: part 2 reads degs_fold{1..5}.csv from data/{CANCER}/refs/deg_cv/. There is no fallback to a single global DEG file.
Stable checkpoint paths: part 1 exports trained checkpoints to output/scdiffusion_models/{CANCER}/, and both condgen and survival read those paths from config/path_local.json.

Citation

If you use DeSCENT in your research, please cite:

@article{zhao2026descent,
  title={DeSCENT: Deconvolutional Single-Cell RNA-seq Enhances Transcriptome-based Cancer Survival Analysis},
  author={Zhao, Yonghao and You, Zeyu and Shen, Yu and Chu, Jielei and Gong, Xun and Li, Tianrui and Wang, Ziqiang and Xu, Chuan and Luo, Zhipeng and He, Yazhou},
  journal={bioRxiv},
  year={2026},
  doi={10.64898/2026.03.15.711877},
  url={https://doi.org/10.64898/2026.03.15.711877}
}
