Genome language model + Molecular graph transformer for drug combination discovery

JunboShen/DeepSAD

Drug Combination Discovery

This repository contains the training and inference code for the accompanying paper (PDF).

The implementation focuses on predicting drug-pair synergy vs antagonism across multiple bacterial strains.

Most large datasets, embeddings, and trained checkpoints are intentionally not committed to git (see .gitignore and data/README.md).

Model Overview

DeepSAD architecture

Repository layout

  • Transformer_evo/: main model + training scripts
    • Transformer_evo/model.py: PyTorch model definitions
    • Transformer_evo/train_main.py: train/evaluate on predefined splits
    • Transformer_evo/train_ind_test.py: train on the independent training set and run inference on an external test set
    • Transformer_evo/process_data.py: utilities to build training CSVs from the supplementary Excel table
  • analyze_and_label.py: add human-readable label names and compute per-bacteria statistics
  • data/README.md: expected input files (not versioned)

Setup

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Optional utilities (Excel export + attribution scripts):

pip install -r requirements-optional.txt

Data preparation

Place your datasets and split CSVs under data/ as described in data/README.md.

If you have the supplementary Excel table and want to generate data/train_data.csv and data/unannotated.csv, run:

python Transformer_evo/process_data.py \
  --train-xlsx data/train_data.xlsx \
  --kmer-dir kmer_result \
  --out-train-csv data/train_data.csv \
  --out-unannotated-csv data/unannotated.csv

Training (Transformer)

The main entry point is Transformer_evo/train_main.py. It expects a train/test pair of split files, for example data/random_split_train.csv and data/random_split_test.csv.

python Transformer_evo/train_main.py --split random_split --data-aug --epochs 100
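Before launching a long run, it can help to confirm that the split files are in place. The sketch below is not part of the repository: the helper names are hypothetical, and the `<split>_train.csv` / `<split>_test.csv` naming pattern is an assumption inferred from the random_split example above.

```python
from pathlib import Path


def expected_split_files(split, data_dir="data"):
    """Return the train/test CSV paths assumed for a given --split value.

    Assumption: files are named <split>_train.csv and <split>_test.csv,
    as suggested by the random_split example in this README.
    """
    base = Path(data_dir)
    return [base / f"{split}_train.csv", base / f"{split}_test.csv"]


def missing_split_files(split, data_dir="data"):
    """List the expected split files that are not present on disk."""
    return [p for p in expected_split_files(split, data_dir) if not p.is_file()]


if __name__ == "__main__":
    missing = missing_split_files("random_split")
    if missing:
        print("Missing split files:", ", ".join(str(p) for p in missing))
    else:
        print("All split files found.")
```

Run it from the repository root so that the relative data/ path resolves correctly.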

Outputs:

  • confidence scores: conf_scores_best/
  • (other scripts may write checkpoints under models_best/)

Independent training + external inference

python Transformer_evo/train_ind_test.py --data-aug --epochs 100

By default, this script looks for:

  • data/independent_train.csv
  • data/cleaned_positive_adjuvant_model_controls_04_25_2022.csv (expects column actual_strain)
  • ind_data1/drug1_base.npy and ind_data2/drug2_base.npy
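A small pre-flight check can catch a missing input or a missing actual_strain column before training starts. The following is a minimal sketch using only the standard library; the function names are hypothetical, and it assumes the default paths listed above are relative to the repository root and that the external test CSV has a header row.

```python
import csv
from pathlib import Path

# Default inputs listed in this README; assumed relative to the repo root.
DEFAULT_INPUTS = [
    "data/independent_train.csv",
    "data/cleaned_positive_adjuvant_model_controls_04_25_2022.csv",
    "ind_data1/drug1_base.npy",
    "ind_data2/drug2_base.npy",
]
EXTERNAL_TEST_CSV = DEFAULT_INPUTS[1]
REQUIRED_COLUMN = "actual_strain"


def csv_has_column(csv_path, column):
    """Return True if the CSV header row contains the given column name."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    return column in header


def check_inputs(root="."):
    """Collect human-readable problems; an empty list means all checks passed."""
    root = Path(root)
    problems = [
        f"missing file: {rel}" for rel in DEFAULT_INPUTS if not (root / rel).is_file()
    ]
    test_csv = root / EXTERNAL_TEST_CSV
    # Only inspect the header when the file exists; absence is reported above.
    if test_csv.is_file() and not csv_has_column(test_csv, REQUIRED_COLUMN):
        problems.append(f"{EXTERNAL_TEST_CSV} lacks column '{REQUIRED_COLUMN}'")
    return problems


if __name__ == "__main__":
    for problem in check_inputs():
        print(problem)
```

The check only verifies that files exist and that the header contains the column; it does not validate array shapes or CSV contents.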

Dataset inspection: labels + statistics

python analyze_and_label.py \
  --input data/independent_train.csv \
  --output overall_train_labeled.csv \
  --stats-csv bacteria_statistics.csv \
  --stats-xlsx bacteria_statistics.xlsx

Citation

If you use this code, please cite the paper:

TBD

If you need access to the data or to the pretrained/fine-tuned checkpoints used in our experiments, please contact the repository owner by email.
