DEBase is a Python toolkit (v2.1.0) for turning chemistry papers (manuscript + SI) into plate-ready enzyme lineage, sequence, reaction metric, and substrate scope tables using Google Gemini 2.5 Flash (text + vision).
- Extract lineage trees, variant IDs, mutations, and sequences from PDFs (`enzyme_lineage_extractor`).
- Generate and validate full protein sequences from mutations (`cleanup_sequence`).
- Pull reaction metrics from text, tables, and figure images (`reaction_info_extractor`).
- Extract substrate scope tables and merge lineage context (`substrate_scope_extractor`).
- Normalize everything into a flat 96-well style CSV with IUPAC-to-SMILES conversion and optional Gemini matching (`lineage_format`).
- Track Gemini token usage and estimated cost for every run.
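
To make "generate and validate full protein sequences from mutations" concrete, here is a minimal sketch of the general idea. It is not the `cleanup_sequence` implementation: the function name `apply_mutations` and the `A123V`-style label format (wild-type residue, 1-based position, new residue) are assumptions made for illustration.

```python
import re
from typing import Iterable

# Assumed label format: wild-type residue, 1-based position, new residue (e.g. "T3S").
_MUTATION = re.compile(r"([A-Z])(\d+)([A-Z])")


def apply_mutations(parent: str, mutations: Iterable[str]) -> str:
    """Return the variant sequence obtained by applying point mutations to `parent`."""
    residues = list(parent)
    for label in mutations:
        match = _MUTATION.fullmatch(label.strip())
        if match is None:
            raise ValueError(f"unrecognized mutation label: {label!r}")
        old, pos, new = match.group(1), int(match.group(2)), match.group(3)
        if not 1 <= pos <= len(residues):
            raise ValueError(f"{label}: position is outside the parent sequence")
        # Validation step: the stated wild-type residue must match the sequence.
        if residues[pos - 1] != old:
            raise ValueError(f"{label}: sequence has {residues[pos - 1]} at position {pos}")
        residues[pos - 1] = new
    return "".join(residues)


# Toy 8-residue parent, purely illustrative.
print(apply_mutations("MKTAYIAK", ["T3S", "K8R"]))
```

Checking the wild-type residue before substituting catches off-by-one numbering and wrong-parent errors early, which is the kind of validation a sequence cleanup step needs.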
- Inputs: manuscript PDF (required) and optional SI PDF.
- Final output: `<manuscript>_debase.csv` or the path passed to `--output` (plate-friendly flat table).
- Intermediate CSVs (written next to the output): `enzyme_lineage_data.csv`, `2_enzyme_sequences.csv`, `3a_reaction_info.csv`, `3b_substrate_scope.csv`.
- Logs and telemetry: `debase_pipeline_<timestamp>.log` plus `price_<manuscript>.csv` with per-module token/cost totals.
- Debug artifacts (when `--debug-dir` is set): Gemini prompts/responses, captions, and extracted text for each stage (can be large). `lineage_format` uses `-v`/`-vv` for verbosity instead of `--debug-dir`.
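
The naming rules above can be sketched as a small helper that predicts where a run's files should appear. The helper name `expected_artifacts` is hypothetical, and two points are assumptions rather than documented behavior: that `<manuscript>` means the PDF file name without its extension, and that the price CSV lands in the same directory as the final output.

```python
from pathlib import Path
from typing import Dict, Optional

# Intermediate file names as listed above.
INTERMEDIATES = (
    "enzyme_lineage_data.csv",
    "2_enzyme_sequences.csv",
    "3a_reaction_info.csv",
    "3b_substrate_scope.csv",
)


def expected_artifacts(manuscript: str, output: Optional[str] = None) -> Dict[str, Path]:
    """Map a label to the path where each output file is expected."""
    stem = Path(manuscript).stem  # assumption: <manuscript> is the PDF stem
    final = Path(output) if output else Path(f"{stem}_debase.csv")
    out_dir = final.parent
    # Assumption: the price CSV is written next to the final output.
    paths = {"final": final, "price": out_dir / f"price_{stem}.csv"}
    for name in INTERMEDIATES:
        paths[name] = out_dir / name
    return paths


for label, path in expected_artifacts("paper.pdf", "outputs/paper_plate.csv").items():
    status = "found" if path.exists() else "missing"
    print(f"{label}: {path} ({status})")
```

The timestamped log is left out because its exact name cannot be predicted before the run.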
- Python 3.8-3.12.
- `GEMINI_API_KEY` environment variable for Google Gemini 2.5 Flash (text and vision). Example: `export GEMINI_API_KEY=...`.
- OPSIN CLI on your PATH for IUPAC validation/SMILES lookups (used by `reaction_info_extractor` and `lineage_format`): `brew install opsin` or `conda install -c conda-forge opsin`.
- Network access for Gemini calls and for NCI/PubChem fallbacks during IUPAC->SMILES conversion unless OPSIN/RDKit/local cache already resolves the names.
- Optional: RDKit for better SMILES canonicalization (`pip install .[rdkit]` or `conda install -c conda-forge rdkit`).
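
A quick preflight check along these lines can catch a missing requirement before a run starts. This is a sketch, not part of DEBase: the function name `preflight` is hypothetical, and it assumes the OPSIN CLI is installed as an executable named `opsin`.

```python
import os
import shutil
import sys
from typing import List, Mapping, Optional

MISSING_KEY = "GEMINI_API_KEY is not set"
MISSING_OPSIN = "no `opsin` executable found on PATH"  # assumed executable name
BAD_PYTHON = "Python version is outside the supported 3.8-3.12 range"


def preflight(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    env = os.environ if env is None else env
    problems = []
    if not (3, 8) <= sys.version_info[:2] <= (3, 12):
        problems.append(BAD_PYTHON)
    if not env.get("GEMINI_API_KEY"):
        problems.append(MISSING_KEY)
    if shutil.which("opsin") is None:
        problems.append(MISSING_OPSIN)
    return problems


for problem in preflight():
    print("warning:", problem)
```

Network access and the optional RDKit install are not checked here; both only matter for specific fallback paths.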
From source (recommended for this repository):
python -m pip install -e .
# with RDKit extras
python -m pip install -e .[rdkit]

Conda environment:
conda env create -f environment.yml
conda activate debase

export GEMINI_API_KEY=your_key
# Runs all steps and writes outputs alongside the chosen --output path
debase \
--manuscript paper.pdf \
--si supplementary.pdf \
--output outputs/paper_plate.csv \
--keep-intermediates \
--debug-dir outputs/debug

If `--output` is omitted, the pipeline writes `paper_debase.csv` in the current directory.
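
When the pipeline is driven from a script, for example to process a folder of papers, the same command can be assembled as an argument list. The sketch below uses only the flags shown in the command above; the helper name `build_debase_command` is hypothetical, and the actual launch is left as a comment so the snippet runs without DEBase installed.

```python
import shlex
from typing import List, Optional


def build_debase_command(
    manuscript: str,
    si: Optional[str] = None,
    output: Optional[str] = None,
    keep_intermediates: bool = False,
    debug_dir: Optional[str] = None,
) -> List[str]:
    """Assemble the argv list for one full `debase` pipeline run."""
    cmd = ["debase", "--manuscript", manuscript]
    if si:
        cmd += ["--si", si]
    if output:
        cmd += ["--output", output]
    if keep_intermediates:
        cmd.append("--keep-intermediates")
    if debug_dir:
        cmd += ["--debug-dir", debug_dir]
    return cmd


cmd = build_debase_command(
    "paper.pdf",
    si="supplementary.pdf",
    output="outputs/paper_plate.csv",
    keep_intermediates=True,
    debug_dir="outputs/debug",
)
print(" ".join(shlex.quote(part) for part in cmd))
# To launch the run: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` instead of a shell string avoids quoting problems with file names that contain spaces.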
# 1) Lineage + sequence extraction
python -m debase.enzyme_lineage_extractor --manuscript paper.pdf --si supplementary.pdf -o outputs/enzyme_lineage_data.csv --debug-dir outputs/debug/lineage
# 2) Generate full protein sequences from mutations
python -m debase.cleanup_sequence outputs/enzyme_lineage_data.csv outputs/2_enzyme_sequences.csv --debug-dir outputs/debug/sequence
# 3a) Reaction metrics (text + vision)
python -m debase.reaction_info_extractor --manuscript paper.pdf --si supplementary.pdf --lineage-csv outputs/2_enzyme_sequences.csv --output outputs/3a_reaction_info.csv --debug-dir outputs/debug/reaction
# 3b) Substrate scope
python -m debase.substrate_scope_extractor --manuscript paper.pdf --si supplementary.pdf --lineage-csv outputs/2_enzyme_sequences.csv --output outputs/3b_substrate_scope.csv --debug-dir outputs/debug/scope
# 4) Flatten/plate formatting and SMILES conversion
python -m debase.lineage_format -r outputs/3a_reaction_info.csv -s outputs/3b_substrate_scope.csv -o outputs/final_plate.csv -v

- `--debug-dir` collects prompts, raw Gemini replies, captions, and extracted text per stage; `lineage_format` uses `-v`/`-vv` for verbosity instead.
- Each pipeline run logs to `debase_pipeline_<timestamp>.log` in the output directory.
- Token usage and estimated Gemini cost are written to `price_<manuscript>.csv`.
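
The price CSV can be summarized with a few lines of standard-library Python. Its column layout is not documented here, so this sketch avoids assuming column names: it totals every column whose values all parse as numbers. The function name `numeric_totals` and the sample data (including its headers) are invented for illustration.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, TextIO


def numeric_totals(handle: TextIO) -> Dict[str, float]:
    """Sum each CSV column whose values are all numeric; skip the rest."""
    totals = defaultdict(float)
    non_numeric = set()
    for row in csv.DictReader(handle):
        for column, value in row.items():
            if column is None or column in non_numeric:
                continue
            try:
                totals[column] += float(value)
            except (TypeError, ValueError):
                # A non-numeric or missing cell disqualifies the whole column.
                non_numeric.add(column)
                totals.pop(column, None)
    return dict(totals)


# Invented sample; a real price CSV may use different headers.
sample = io.StringIO(
    "module,input_tokens,output_tokens,cost_usd\n"
    "lineage,1000,200,0.25\n"
    "reaction,3000,500,0.5\n"
)
for column, total in numeric_totals(sample).items():
    print(f"{column}: {total}")
```

For a real run, open the file with `open("price_paper.csv", newline="")` and pass the handle to `numeric_totals`.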
Version: 2.1.0
License: MIT License
Authors: DEBase Team - Caltech