YuemingLong/DEBase

DEBase

DEBase is a Python toolkit (v2.1.0) that turns chemistry papers (the manuscript plus its supporting information, SI) into plate-ready tables of enzyme lineage, sequence, reaction-metric, and substrate-scope data using Google Gemini 2.5 Flash (text + vision).

What it does

  • Extract lineage trees, variant IDs, mutations, and sequences from PDFs (enzyme_lineage_extractor).
  • Generate and validate full protein sequences from mutations (cleanup_sequence).
  • Pull reaction metrics from text, tables, and figure images (reaction_info_extractor).
  • Extract substrate scope tables and merge lineage context (substrate_scope_extractor).
  • Normalize everything into a flat 96-well style CSV with IUPAC-to-SMILES conversion and optional Gemini matching (lineage_format).
  • Track Gemini token usage and estimated cost for every run.
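
To make the "sequences from mutations" step concrete, here is a minimal sketch of applying point substitutions to a parent sequence. It is not the cleanup_sequence implementation: the function name apply_mutations and the "A123V" notation (wild-type residue, 1-based position, new residue) are assumptions for illustration, and real lineages may also involve insertions, deletions, or other notations that this sketch does not handle.

```python
import re
from typing import Iterable

# Point substitution such as "A123V": wild-type residue, 1-based position, new residue.
_MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def apply_mutations(parent: str, mutations: Iterable[str]) -> str:
    """Return a copy of `parent` with each point substitution applied.

    Hypothetical helper. Raises ValueError when a mutation is malformed,
    out of range, or names a wild-type residue that the parent lacks.
    """
    parent = parent.upper()
    residues = list(parent)
    for mutation in mutations:
        match = _MUTATION.match(mutation.strip().upper())
        if match is None:
            raise ValueError(f"unrecognized mutation: {mutation!r}")
        wild_type, new = match.group(1), match.group(3)
        position = int(match.group(2))
        if not 1 <= position <= len(residues):
            raise ValueError(f"{mutation}: position outside the parent sequence")
        # Validate against the original parent, not the partially mutated copy.
        if parent[position - 1] != wild_type:
            raise ValueError(
                f"{mutation}: parent has {parent[position - 1]} at position {position}"
            )
        residues[position - 1] = new
    return "".join(residues)


if __name__ == "__main__":
    print(apply_mutations("MKTAY", ["K2R", "Y5F"]))
```

Checking the wild-type residue before substituting is the validation half of the step: a mismatch usually means the mutation list was numbered against a different parent.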

Inputs and outputs

  • Inputs: manuscript PDF (required) and optional SI PDF.
  • Final output: <manuscript>_debase.csv or the path passed to --output (plate-friendly flat table).
  • Intermediate CSVs (written next to the output): enzyme_lineage_data.csv, 2_enzyme_sequences.csv, 3a_reaction_info.csv, 3b_substrate_scope.csv.
  • Logs and telemetry: debase_pipeline_<timestamp>.log plus price_<manuscript>.csv with per-module token/cost totals.
  • Debug artifacts (when --debug-dir is set): Gemini prompts/responses, captions, and extracted text for each stage (can be large). lineage_format uses -v/-vv for verbosity instead of --debug-dir.
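
The naming rules above can be sketched as a small helper that predicts where a run will leave its files. The function expected_outputs is hypothetical and not part of DEBase. It assumes that <manuscript> means the PDF file name without its extension and that price_<manuscript>.csv is written to the same directory as the final output; the log file is left out because its timestamp format is not documented here.

```python
from pathlib import Path
from typing import Dict, Optional

# Intermediate file names as listed in this README.
INTERMEDIATE_NAMES = (
    "enzyme_lineage_data.csv",
    "2_enzyme_sequences.csv",
    "3a_reaction_info.csv",
    "3b_substrate_scope.csv",
)


def expected_outputs(manuscript: str, output: Optional[str] = None) -> Dict[str, Path]:
    """Map a label to the path where each output file is expected (hypothetical helper)."""
    stem = Path(manuscript).stem
    # Without --output, the final table defaults to <manuscript>_debase.csv.
    final = Path(output) if output else Path(f"{stem}_debase.csv")
    out_dir = final.parent
    paths = {
        "final": final,
        # Assumption: the price file sits next to the final output.
        "price": out_dir / f"price_{stem}.csv",
    }
    for name in INTERMEDIATE_NAMES:
        paths[name] = out_dir / name
    return paths


if __name__ == "__main__":
    for label, path in expected_outputs("paper.pdf", "outputs/paper_plate.csv").items():
        print(f"{label}: {path}")
```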

Requirements

  • Python 3.8-3.12.
  • GEMINI_API_KEY environment variable for Google Gemini 2.5 Flash (text and vision). Example: export GEMINI_API_KEY=....
  • OPSIN CLI on your PATH for IUPAC validation/SMILES lookups (used by reaction_info_extractor and lineage_format): brew install opsin or conda install -c conda-forge opsin.
  • Network access for Gemini calls and for NCI/PubChem fallbacks during IUPAC->SMILES conversion unless OPSIN/RDKit/local cache already resolves the names.
  • Optional: RDKit for better SMILES canonicalization (pip install ".[rdkit]" or conda install -c conda-forge rdkit). Quoting the extras keeps shells such as zsh from treating the brackets as a glob pattern.
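
A quick preflight check for the requirements above might look like the following sketch. The preflight and has_rdkit functions are hypothetical helpers, not DEBase commands, and the sketch assumes the OPSIN executable is named opsin on your PATH, which may differ depending on how it was installed.

```python
import importlib.util
import os
import shutil
import sys
from typing import List, Mapping, Optional, Tuple


def preflight(
    env: Optional[Mapping[str, str]] = None,
    version: Optional[Tuple[int, int]] = None,
) -> List[str]:
    """Return human-readable problems; an empty list means the basics are in place."""
    env = os.environ if env is None else env
    version = tuple(sys.version_info[:2]) if version is None else version
    problems = []
    if not (3, 8) <= version <= (3, 12):
        problems.append(f"Python {version[0]}.{version[1]} is outside 3.8-3.12")
    if not env.get("GEMINI_API_KEY"):
        problems.append("GEMINI_API_KEY is not set")
    # Assumption: the OPSIN command-line wrapper is installed as `opsin`.
    if shutil.which("opsin") is None:
        problems.append("opsin executable not found on PATH")
    return problems


def has_rdkit() -> bool:
    """Report whether the optional RDKit package can be imported."""
    return importlib.util.find_spec("rdkit") is not None


if __name__ == "__main__":
    for problem in preflight():
        print("missing:", problem)
    print("RDKit available:", has_rdkit())
```

This only checks that the pieces are present; it does not verify that the API key is valid or that the network fallbacks are reachable.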

Installation

From source (recommended for this repository):

python -m pip install -e .
# with RDKit extras
python -m pip install -e ".[rdkit]"

Conda environment:

conda env create -f environment.yml
conda activate debase

Quick start (full pipeline)

export GEMINI_API_KEY=your_key

# Runs all steps and writes outputs alongside the chosen --output path
debase \
  --manuscript paper.pdf \
  --si supplementary.pdf \
  --output outputs/paper_plate.csv \
  --keep-intermediates \
  --debug-dir outputs/debug

If --output is omitted, the pipeline writes paper_debase.csv in the current directory.
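
For processing several papers, the quick-start command can be wrapped in a short script. This is a sketch under stated assumptions: build_debase_command and run_batch are hypothetical names, only the flags shown in the quick start are used, and the <name>_plate.csv output naming is an arbitrary choice for the example. SI files are not paired automatically here because no pairing convention is documented.

```python
import subprocess
from pathlib import Path
from typing import Iterable, List, Optional


def build_debase_command(
    manuscript: str,
    si: Optional[str] = None,
    output: Optional[str] = None,
    keep_intermediates: bool = False,
    debug_dir: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for one `debase` run (hypothetical helper)."""
    cmd = ["debase", "--manuscript", str(manuscript)]
    if si:
        cmd += ["--si", str(si)]
    if output:
        cmd += ["--output", str(output)]
    if keep_intermediates:
        cmd.append("--keep-intermediates")
    if debug_dir:
        cmd += ["--debug-dir", str(debug_dir)]
    return cmd


def run_batch(manuscripts: Iterable[str], out_dir: str) -> None:
    """Run the pipeline once per manuscript, stopping at the first failure."""
    for pdf in manuscripts:
        # One subdirectory per paper so intermediate CSVs do not overwrite each other.
        paper_dir = Path(out_dir) / Path(pdf).stem
        paper_dir.mkdir(parents=True, exist_ok=True)
        output = paper_dir / f"{Path(pdf).stem}_plate.csv"
        subprocess.run(build_debase_command(pdf, output=str(output)), check=True)


if __name__ == "__main__":
    print(" ".join(build_debase_command("paper.pdf", si="supplementary.pdf")))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.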

Run components individually

# 1) Lineage + sequence extraction
python -m debase.enzyme_lineage_extractor \
  --manuscript paper.pdf \
  --si supplementary.pdf \
  -o outputs/enzyme_lineage_data.csv \
  --debug-dir outputs/debug/lineage

# 2) Generate full protein sequences from mutations
python -m debase.cleanup_sequence \
  outputs/enzyme_lineage_data.csv \
  outputs/2_enzyme_sequences.csv \
  --debug-dir outputs/debug/sequence

# 3a) Reaction metrics (text + vision)
python -m debase.reaction_info_extractor \
  --manuscript paper.pdf \
  --si supplementary.pdf \
  --lineage-csv outputs/2_enzyme_sequences.csv \
  --output outputs/3a_reaction_info.csv \
  --debug-dir outputs/debug/reaction

# 3b) Substrate scope
python -m debase.substrate_scope_extractor \
  --manuscript paper.pdf \
  --si supplementary.pdf \
  --lineage-csv outputs/2_enzyme_sequences.csv \
  --output outputs/3b_substrate_scope.csv \
  --debug-dir outputs/debug/scope

# 4) Flatten/plate formatting and SMILES conversion
python -m debase.lineage_format \
  -r outputs/3a_reaction_info.csv \
  -s outputs/3b_substrate_scope.csv \
  -o outputs/final_plate.csv \
  -v
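
Because each component reads the previous component's CSV, the steps above can be chained so that an interrupted run resumes where it stopped. The sketch below is one way to do that, not a DEBase feature: Stage, build_stages, and run_missing are hypothetical names, the arguments mirror the commands above minus --debug-dir, and a stage is skipped purely because its output file already exists.

```python
import subprocess
import sys
from pathlib import Path
from typing import List, NamedTuple


class Stage(NamedTuple):
    name: str
    output: Path
    argv: List[str]


def build_stages(manuscript: str, si: str, out_dir: str) -> List[Stage]:
    """List the component commands in dependency order (hypothetical helper)."""
    out = Path(out_dir)
    lineage = out / "enzyme_lineage_data.csv"
    sequences = out / "2_enzyme_sequences.csv"
    reactions = out / "3a_reaction_info.csv"
    scope = out / "3b_substrate_scope.csv"
    final = out / "final_plate.csv"
    py = [sys.executable, "-m"]
    papers = ["--manuscript", manuscript, "--si", si]
    from_sequences = ["--lineage-csv", str(sequences)]
    return [
        Stage("lineage", lineage,
              py + ["debase.enzyme_lineage_extractor"] + papers + ["-o", str(lineage)]),
        Stage("sequences", sequences,
              py + ["debase.cleanup_sequence", str(lineage), str(sequences)]),
        Stage("reactions", reactions,
              py + ["debase.reaction_info_extractor"] + papers + from_sequences
              + ["--output", str(reactions)]),
        Stage("scope", scope,
              py + ["debase.substrate_scope_extractor"] + papers + from_sequences
              + ["--output", str(scope)]),
        Stage("format", final,
              py + ["debase.lineage_format", "-r", str(reactions), "-s", str(scope),
                    "-o", str(final), "-v"]),
    ]


def run_missing(stages: List[Stage]) -> List[str]:
    """Run every stage whose output file is absent; return the names that ran."""
    ran = []
    for stage in stages:
        if stage.output.exists():
            continue
        subprocess.run(stage.argv, check=True)  # raises on a non-zero exit code
        ran.append(stage.name)
    return ran
```

The existence check is deliberately simple: if you rerun an early stage, delete the downstream CSVs as well, otherwise later stages will be skipped and keep stale results.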

Debugging and logging

  • --debug-dir collects prompts, raw Gemini replies, captions, and extracted text per stage; lineage_format uses -v/-vv for verbosity instead.
  • Each pipeline run logs to debase_pipeline_<timestamp>.log in the output directory.
  • Token usage and estimated Gemini cost are written to price_<manuscript>.csv.
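
After a run, the newest log and the price file can be located with a few lines of Python. This is a sketch: latest_pipeline_log and read_price_rows are hypothetical helpers, the log is matched only by the debase_pipeline_*.log pattern because the timestamp format is not specified here, and the price file's column names are not assumed, so rows are returned as plain dictionaries keyed by whatever header the file contains.

```python
import csv
from pathlib import Path
from typing import Dict, List, Optional


def latest_pipeline_log(out_dir: str) -> Optional[Path]:
    """Return the most recently modified pipeline log in `out_dir`, if any."""
    logs = list(Path(out_dir).glob("debase_pipeline_*.log"))
    if not logs:
        return None
    return max(logs, key=lambda path: path.stat().st_mtime)


def read_price_rows(path: str) -> List[Dict[str, str]]:
    """Load a price_<manuscript>.csv file without assuming its column names."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    log = latest_pipeline_log("outputs")
    print("latest log:", log if log else "none found")
```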

Version

2.1.0

License

MIT License

Authors

DEBase Team - Caltech

Contact

ylong@caltech.edu

About

Directed Evolution Database
