Pretrained model with data fine-tuning; flags enable swapping out either one

YuemingLong/EnzRanker_Kit

Enzyme Ranking Repo

Standalone folder for:

  • training a ranking model on your own data
  • optionally training a reaction-context model
  • optionally few-shot fine-tuning on in-house measurements

Folder Layout

  • data/
    • all_experiment.csv
    • inhouse-experiment.csv
    • experiments_4-methoxystyrene_eda_clean.csv
    • 01Q_seq-candidates.csv
    • 01Q_seq-candidates_merged_4-methoxystyrene-eda.csv
  • db_structures/
  • train_general_ranker.py
  • train_general_with_reaction_ranker.py
  • finetune_01q_hybrid_ranker.py
  • rdkit_morgan_featurize.py
  • models/ (outputs)
  • outputs/ (predictions/eval outputs)

Install

pip install -r requirements.txt

For reaction featurization, RDKit is required. If needed, install RDKit via conda and pass --rdkit-env <env_name>.

Input Schema

The general training CSV requires these columns:

  • amino_acid_sequence (or sequence)
  • normalized_fitness (or your chosen target column)
  • parent_experiment (group column for grouped split)
  • db_structure_path (or structure_path)

Reaction-context training additionally needs:

  • smiles_reaction

Fine-tuning needs:

  • a candidates CSV with id and sequence columns
  • a base-prediction CSV with id and predicted_score columns
  • a measured CSV with id and the measured target column
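As a concrete illustration, a minimal general-training CSV with the required columns could be built like this (the sequences, fitness values, and structure paths below are placeholders, not real data):

```python
import csv

# Hypothetical example rows: each record carries the four columns
# train_general_ranker.py expects. Values are placeholders.
rows = [
    {
        "amino_acid_sequence": "MKTAYIAKQR",
        "normalized_fitness": 0.82,
        "parent_experiment": "exp_A",
        "db_structure_path": "db_structures/variant_001.pdb",
    },
    {
        "amino_acid_sequence": "MKTAYIAKQW",
        "normalized_fitness": 0.35,
        "parent_experiment": "exp_A",
        "db_structure_path": "db_structures/variant_002.pdb",
    },
]

# Write the CSV with a header row matching the schema above.
with open("example_training.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
```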

1) Train General Model (Seq + Structure)

python train_general_ranker.py train \
  --dataset data/all_experiment.csv \
  --target-col normalized_fitness \
  --group-col parent_experiment \
  --model-class extratrees \
  --out-model models/general_ranker.joblib \
  --out-metrics models/general_ranker_metrics.json

Score candidates:

python train_general_ranker.py score \
  --model models/general_ranker.joblib \
  --candidates data/01Q_seq-candidates.csv \
  --out outputs/general_candidate_scores.csv
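The scored CSV can then be ranked downstream. A minimal stdlib-only sketch, assuming the output has id and predicted_score columns (the names the fine-tuning step expects), picks the top-k candidates:

```python
import csv

def top_k_candidates(path, k=5):
    """Return the k candidate ids with the highest predicted_score.

    Assumes the scores CSV has 'id' and 'predicted_score' columns,
    matching the base-prediction input of finetune_01q_hybrid_ranker.py.
    """
    with open(path, newline="") as fh:
        rows = list(csv.DictReader(fh))
    # Sort descending by score; higher predicted_score = better candidate.
    rows.sort(key=lambda r: float(r["predicted_score"]), reverse=True)
    return [r["id"] for r in rows[:k]]
```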

2) Train General + Reaction Model (Optional)

python train_general_with_reaction_ranker.py train \
  --dataset data/all_experiment.csv \
  --target-col normalized_fitness \
  --group-col parent_experiment \
  --reaction-col smiles_reaction \
  --model-class extratrees \
  --out-model models/general_with_reaction_ranker.joblib \
  --out-metrics models/general_with_reaction_ranker_metrics.json

If RDKit is in a conda env:

python train_general_with_reaction_ranker.py train \
  --dataset data/all_experiment.csv \
  --reaction-col smiles_reaction \
  --rdkit-env debase \
  --model-class extratrees

Score candidates with optional fixed reaction:

python train_general_with_reaction_ranker.py score \
  --model models/general_with_reaction_ranker.joblib \
  --candidates data/01Q_seq-candidates.csv \
  --reaction-smiles "C=CC1=CC=C(OC)C=C1.O=C(OCC)C=[N+]=[N-]>>COC2=CC=C(C=C2)[C@@H]3[C@@H](C(OCC)=O)C3" \
  --out outputs/general_with_reaction_candidate_scores.csv
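Long reaction SMILES passed on the command line are easy to mangle with shell quoting. A stdlib-only sanity check (this is only a structural check, not a substitute for RDKit parsing) can verify the reactants>>products shape before running the score step:

```python
def looks_like_reaction_smiles(s):
    """Cheap structural check for a reaction SMILES string.

    Verifies there is exactly one '>>' separator with non-empty
    reactant and product sides. Chemical validity is not checked;
    use RDKit for real parsing.
    """
    parts = s.split(">>")
    if len(parts) != 2:
        return False
    reactants, products = parts
    return bool(reactants.strip()) and bool(products.strip())
```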

3) Few-Shot Fine-Tune on In-House Measurements (Optional)

python finetune_01q_hybrid_ranker.py \
  --candidates data/01Q_seq-candidates.csv \
  --actual data/01Q_seq-candidates_merged_4-methoxystyrene-eda.csv \
  --base-pred outputs/general_with_reaction_candidate_scores.csv \
  --support-size 24 \
  --seed 0 \
  --out-full outputs/hybrid_full.csv \
  --out-eval outputs/hybrid_eval.csv \
  --out-support outputs/hybrid_support.csv \
  --out-metrics outputs/hybrid_metrics.json
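One way to sanity-check a fine-tuned ranking against held-out measurements is Spearman rank correlation between predicted and measured values. A minimal stdlib sketch (assuming no tied values; for ties, use scipy.stats.spearmanr instead):

```python
def rank(values):
    """Map each value to its rank (0 = smallest). Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks

def spearman(pred, meas):
    """Spearman rank correlation between two equal-length lists.

    Uses the closed form 1 - 6*sum(d^2) / (n*(n^2 - 1)),
    valid when neither list has ties.
    """
    n = len(pred)
    rp, rm = rank(pred), rank(meas)
    d2 = sum((a - b) ** 2 for a, b in zip(rp, rm))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

A correlation near 1 means the model's ordering agrees with the measurements; near 0 means the ranking carries little signal.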

All CLI Flags

Use built-in help for complete flags:

python train_general_ranker.py --help
python train_general_ranker.py train --help
python train_general_ranker.py score --help

python train_general_with_reaction_ranker.py --help
python train_general_with_reaction_ranker.py train --help
python train_general_with_reaction_ranker.py score --help

python finetune_01q_hybrid_ranker.py --help
