BGC Protein Role Classifier — ESM-2 Embeddings + Genomic Knowledge Graphs

Classifying core, regulatory, accessory, and transport proteins within Biosynthetic Gene Clusters (BGCs) using protein language model embeddings, and visualizing their genomic organization through interactive knowledge graphs.

Built as a direct contribution to the problem addressed by the iGEM USP S(H)ARP 2026 project: current BGC prediction tools focus almost exclusively on core biosynthetic enzymes, ignoring the regulatory and accessory proteins that govern whether a silent pathway gets activated. This repository attacks that problem from two complementary angles — sequence-level classification and genomic context visualization.

Motivation

Tools like antiSMASH identify BGC boundaries and classify clusters by type, but treat all proteins within a cluster with limited functional granularity. The S(H)ARP approach proposes to incorporate regulatory and accessory proteins into BGC analysis — a biologically motivated improvement that can reveal silent pathways in Streptomyces and other Actinobacteria.

This repository explores two questions:

Can a protein language model distinguish core biosynthetic enzymes from regulatory and accessory proteins using sequence information alone? (Project 1 — classifier)
How are functional roles spatially organized on the chromosome, and are those patterns conserved across Streptomyces species? (Project 2 — knowledge graphs)

Project 1 — ESM-2 Protein Role Classifier

Dataset

Source: MIBiG 4.0 — curated repository of experimentally characterized BGCs
Filter: Actinobacteria only (focus on Streptomyces and related genera)
Labels: protein functional role — core, regulatory, accessory, transport, other
Label source: MIBiG JSON annotations (curated) + heuristic keyword fallback
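
The keyword fallback can be pictured with a short sketch. The keyword lists and the `label_protein` name below are illustrative assumptions, not the ones used in 01_download_mibig.py; the only point is that a curated MIBiG role, when present, takes precedence over a match on the product name.

```python
# Hypothetical keyword lists; the actual script may use different terms.
ROLE_KEYWORDS = {
    "regulatory": ("regulator", "repressor", "activator", "sigma factor"),
    "transport": ("transporter", "permease", "efflux"),
    "core": ("polyketide synthase", "nonribosomal peptide synthetase", "terpene synthase"),
    "accessory": ("methyltransferase", "oxidoreductase", "glycosyltransferase", "p450"),
}


def label_protein(curated_role, product):
    """Prefer the curated role; otherwise match keywords in the product name."""
    if curated_role:
        return curated_role
    text = (product or "").lower()
    # Dict order sets the precedence when several roles could match.
    for role, keywords in ROLE_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return role
    return "other"
```

Proteins with no curated role and no keyword hit fall into `other`, so the size of that class is a rough indicator of how much the heuristic misses.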

Pipeline

FASTA + JSON (MIBiG 4.0)
        │
        ▼
01_download_mibig.py      Parse proteins + roles → CSV
        │
        ▼
02_generate_embeddings.py ESM-2 (8M) mean-pooled embeddings → (N, 320) array
        │
        ▼
03_train_classifier.py    Random Forest + Logistic Regression, 5-fold CV
        │
        ▼
04_visualize.py           t-SNE · confusion matrix · role × BGC type

Embedding model: ESM-2 esm2_t6_8M_UR50D (Meta AI) — 8M parameters, 320-dimensional representations, runs on CPU.
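
Mean pooling turns the per-residue vectors into one fixed-length vector per protein, which is what lets proteins of different lengths share a single (N, 320) array. The sketch below uses plain Python lists and a hypothetical `mean_pool` helper; it assumes the first and last rows are the start and end tokens added by the ESM-2 tokenizer and drops them before averaging. Whether 02_generate_embeddings.py excludes those tokens is not stated here.

```python
def mean_pool(token_vectors, has_special_tokens=True):
    """Average per-token vectors into one per-protein vector.

    token_vectors: list of equal-length lists of floats, one per token.
    has_special_tokens: drop the first and last rows (start/end tokens).
    """
    rows = token_vectors[1:-1] if has_special_tokens else token_vectors
    if not rows:
        raise ValueError("no residue vectors to pool")
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


# Toy example with 2-dimensional vectors (the real model emits 320 dimensions).
tokens = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
pooled = mean_pool(tokens)
```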

Why ESM-2? Unlike classical sequence features (amino acid composition, k-mers), ESM-2 embeddings encode evolutionary and structural context learned by masked language modeling on millions of UniRef protein sequences. Proteins with similar function tend to lie close together in embedding space even when their sequence identity is low, which makes the model suitable for detecting regulatory proteins that share function, but not necessarily sequence similarity, with known examples.

Results

Model	Macro F1
Random Forest	see results/metrics_summary.json
Logistic Regression	see results/metrics_summary.json
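
Macro F1 averages the per-class F1 scores without weighting by class size, so a rare role counts as much as a common one. A minimal pure-Python sketch of that standard definition follows, assuming the value stored in results/metrics_summary.json is computed the same way (for example with scikit-learn's `f1_score(..., average="macro")`); the `macro_f1` name is hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over classes seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted members contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because every class has equal weight, a model that ignores the smallest role is penalized even if its overall accuracy looks high.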

t-SNE of ESM-2 embeddings by protein role

Each point is one protein, colored by functional role. Visible clustering suggests that ESM-2 representations carry functional signal: core enzymes (green) occupy regions distinct from regulatory proteins (purple).

Confusion matrix

Role distribution by BGC type

Reproducing

# 1. Install dependencies
pip install -r requirements.txt

# 2. Download and parse MIBiG 4.0 (Actinobacteria only)
python 01_download_mibig.py --output data/raw/

# 3. Generate ESM-2 embeddings
python 02_generate_embeddings.py --input data/raw/mibig_proteins.csv \
                                  --output data/processed/

# 4. Train classifier
python 03_train_classifier.py --input data/processed/ --output results/

# 5. Generate figures
python 04_visualize.py --embeddings data/processed/embeddings.npy \
                        --metadata data/processed/metadata.csv \
                        --predictions results/predictions.csv \
                        --output figures/

Reproducing with Nextflow (Recommended)

To run the entire pipeline with a single command (ensuring full reproducibility and generating execution reports):

nextflow run main.nf

The pipeline will automatically:

Download and parse MIBiG data.
Generate ESM-2 embeddings.
Train and evaluate the classifier.
Visualize results and output an HTML report of the execution.

Tested on Python 3.14, CPU only. The full pipeline took about 120 minutes on a laptop with an Intel i7-5500U CPU.

Project 2 — Genomic Neighborhood Interactive Knowledge Graphs

Two interactive graphs exploring how functional roles are spatially organized within BGCs. Both are HTML files hosted on GitHub Pages, so you can open them in any browser, drag nodes, and hover for details.

Graph A — MIBiG genomic neighborhood

Open interactive graph

Built from MIBiG 4.0 experimentally validated BGCs. Genomic coordinates are parsed directly from the FASTA headers (start–end field), so genes are connected by physical adjacency on the chromosome — not co-occurrence.
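
As a rough illustration of the coordinate parsing, the sketch below pulls the BGC accession and the start and end positions out of a header. It assumes a pipe-delimited header whose first field is the accession and which contains one `start-end` field; the exact field order in mibig_prot_seqs_4.0.fasta is not shown here, so the function searches for the field rather than relying on its position. The `parse_header` name and the example header in the docstring are hypothetical.

```python
import re

COORD_FIELD = re.compile(r"^(\d+)-(\d+)$")


def parse_header(header):
    """Return (bgc_id, start, end) from a pipe-delimited FASTA header.

    Example (hypothetical): '>BGC0000001|c1|1-1083|+|AEK75490.1|methyltransferase'
    """
    fields = header.lstrip(">").strip().split("|")
    for field in fields[1:]:
        match = COORD_FIELD.match(field)
        if match:
            return fields[0], int(match.group(1)), int(match.group(2))
    raise ValueError(f"no start-end field in header: {header!r}")
```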

Node types:

Colored circles = individual genes, colored by functional role (core / regulatory / accessory / transport / other)

Edge types:

Solid black = genomic adjacency (consecutive genes sorted by chromosomal position within a BGC)
Dashed purple = cross-BGC functional similarity (same functional keyword, different BGCs)
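
The two edge types can be sketched as follows. This is a minimal version under stated assumptions: each gene is a dict with hypothetical keys `id`, `bgc`, `start`, and `keyword`, and the functions return plain tuples rather than the graph objects 05_knowledge_graph.py actually builds.

```python
from collections import defaultdict
from itertools import combinations


def adjacency_edges(genes):
    """Link consecutive genes within each BGC after sorting by start position."""
    by_bgc = defaultdict(list)
    for gene in genes:
        by_bgc[gene["bgc"]].append(gene)
    edges = []
    for members in by_bgc.values():
        ordered = sorted(members, key=lambda g: g["start"])
        edges += [(a["id"], b["id"]) for a, b in zip(ordered, ordered[1:])]
    return edges


def similarity_edges(genes):
    """Link genes that share a functional keyword but sit in different BGCs."""
    return [
        (a["id"], b["id"])
        for a, b in combinations(genes, 2)
        if a["keyword"] and a["keyword"] == b["keyword"] and a["bgc"] != b["bgc"]
    ]


genes = [
    {"id": "g2", "bgc": "BGC1", "start": 500, "keyword": "regulator"},
    {"id": "g1", "bgc": "BGC1", "start": 10, "keyword": "pks"},
    {"id": "g3", "bgc": "BGC2", "start": 40, "keyword": "regulator"},
]
solid = adjacency_edges(genes)
dashed = similarity_edges(genes)
```

The pairwise comparison in `similarity_edges` is quadratic in the number of genes, which is acceptable for a graph limited to a few dozen BGCs but would need grouping by keyword for larger inputs.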

What it reveals: the spatial distribution of regulatory and accessory genes relative to core biosynthetic genes across all experimentally validated Actinobacteria BGCs in MIBiG — the baseline against which silent cluster organization can be compared.

python 05_knowledge_graph.py \
    --fasta  data/raw/mibig_prot_seqs_4.0.fasta \
    --csv    data/raw/mibig_proteins.csv \
    --output figures/ --results results/ --n-bgcs 30

Graph B — antiSMASH multi-genome BGC graph

🔗 Open interactive graph

Built from antiSMASH BGC predictions across three Streptomyces genomes:

Genome	Organism	Accession
AL645882	Streptomyces coelicolor A3(2)	AL645882
CP009124	Streptomyces lividans TK24	CP009124
CP029197	Streptomyces venezuelae ATCC 10712	CP029197

Node types:

Colored circles = BGC genes, colored by functional role
Orange squares = genome anchors (one per organism)

Edge types:

Solid black = genomic adjacency (consecutive genes within a predicted BGC)
Dashed purple = cross-genome functional similarity (same functional keyword, different species)

Role assignment priority:

CDS overlapping proto_core boundary → core (antiSMASH ground truth)
gene_kind qualifier from antiSMASH annotation
gene_functions qualifier
Heuristic keyword match on product name
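
The priority order above amounts to a cascade in which the first rule that yields a role wins. The sketch below is one way to write it; the dict keys, the `KIND_TO_ROLE` mapping from antiSMASH kinds to this project's roles, and the convention of reading the leading word of each gene_functions entry are all assumptions made for illustration, not a description of 05b_antismash_graph.py.

```python
# Assumed mapping from antiSMASH kinds to this project's roles.
KIND_TO_ROLE = {
    "biosynthetic": "core",
    "biosynthetic-additional": "accessory",
    "regulatory": "regulatory",
    "transport": "transport",
}


def overlaps(a_start, a_end, b_start, b_end):
    """True when two half-open intervals share at least one position."""
    return a_start < b_end and b_start < a_end


def assign_role(cds, proto_cores, keyword_role):
    """Return a role for one CDS; keyword_role is the product-name fallback."""
    # 1. Overlap with any proto_core region wins outright.
    if any(overlaps(cds["start"], cds["end"], s, e) for s, e in proto_cores):
        return "core"
    # 2. gene_kind qualifier.
    role = KIND_TO_ROLE.get(cds.get("gene_kind", ""))
    if role:
        return role
    # 3. gene_functions qualifier: use the leading word of each entry.
    for entry in cds.get("gene_functions", []):
        role = KIND_TO_ROLE.get(entry.split(" ", 1)[0])
        if role:
            return role
    # 4. Keyword heuristic on the product name.
    return keyword_role(cds.get("product", ""))
```

Passing the keyword heuristic in as a callable keeps the cascade independent of any particular keyword list.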

What it reveals: functional organization of predicted BGCs, including silent/cryptic clusters, across three species. Cross-genome dashed edges link genes that share a functional keyword in different species, highlighting candidate conserved functional modules that recur in otherwise unrelated biosynthetic pathways. These are the most promising candidates for the S(H)ARP activation strategy.

python 05b_antismash_graph.py \
    --json-dir data/raw/antismash \
    --output figures/ --results results/

Next steps

Repository structure

sharp-bgc-classifier/
├── 01_download_mibig.py          # Download + parse MIBiG 4.0
├── 02_generate_embeddings.py     # ESM-2 embeddings
├── 03_train_classifier.py        # Random Forest + Logistic Regression
├── 04_visualize.py               # t-SNE, confusion matrix, role distribution
├── 05_knowledge_graph.py         # MIBiG genomic neighborhood graph
├── 05b_antismash_graph.py        # antiSMASH multi-genome graph
├── main.nf                       # Nextflow pipeline (in progress)
├── nextflow.config               # Nextflow config
├── data/
│   ├── raw/                      # MIBiG FASTA + CSV; antiSMASH JSON
│   └── processed/                # Embeddings + metadata
├── figures/                      # All output figures + interactive HTMLs
├── results/                      # Metrics, stats JSONs
└── requirements.txt

References

A BibTeX file with these references is available in references.bib.

Lin et al. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science.
Terlouw et al. (2023). MIBiG 3.0: a community-driven effort to annotate experimentally validated biosynthetic gene clusters. Nucleic Acids Research.
Blin et al. (2023). antiSMASH 7.0: new and improved predictions for detection, regulation, and visualisation. Nucleic Acids Research.

Limitations

This is a toy project made in less than a week with the help of Anthropic's Claude.ai. The goal was to get a feeling for the work involved in this iGEM project. I know there are many things to improve, and I'd be happy with any contribution.
