End-to-end deep learning pipeline for predicting biophysical properties of antibody CDR-H3 sequences using sequence-based models and large pre-trained protein language models.
Therapeutic antibodies are one of the fastest-growing classes of drugs, with applications in oncology, immunology, and infectious disease. A key bottleneck in antibody drug discovery is screening large sequence libraries for candidates with favourable biophysical properties — high specificity, good solubility, low aggregation risk — properties that are expensive and slow to measure experimentally.
This project trains deep learning models to predict two such properties directly from the CDR-H3 amino acid sequence (the primary determinant of antigen binding specificity):
| Task | Output | Model head |
|---|---|---|
| Length classification | short (≤9 AA) / medium (10–14 AA) / long (≥15 AA) | 3-class softmax |
| Hydrophobicity regression | Kyte-Doolittle GRAVY score | Scalar regression |
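Both targets are cheap to derive from the sequence itself. A minimal sketch using the published Kyte-Doolittle hydropathy scale (function names are illustrative, not part of the codebase):

```python
# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982)
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def gravy(seq: str) -> float:
    """Grand average of hydropathy: mean KD value over all residues."""
    return sum(KD[aa] for aa in seq) / len(seq)

def length_class(seq: str) -> str:
    """Length bucket used as the 3-class target."""
    n = len(seq)
    return "short" if n <= 9 else "medium" if n <= 14 else "long"
```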
Training data comes from the Observed Antibody Space (OAS) — a database of ~1 billion naturally occurring human and animal antibody sequences — providing broad coverage of CDR-H3 diversity.
Three model architectures are compared:
- BiLSTM — bidirectional LSTM with mean pooling, fast and interpretable baseline
- Transformer — encoder-only with sinusoidal positional encoding and CLS-token pooling
- ESM-2 — 8M-parameter protein language model (Meta AI) fine-tuned on CDR sequences (GPU only)
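To give a sense of the baseline's shape, a BiLSTM with masked mean pooling and the two task heads might look like this (a hypothetical sketch, not the project's actual `models/lstm.py`):

```python
import torch
import torch.nn as nn

class BiLSTMBaseline(nn.Module):
    """Illustrative BiLSTM baseline: embed, encode, mean-pool, two heads."""

    def __init__(self, vocab_size: int = 25, embed_dim: int = 64,
                 hidden_dim: int = 128, n_classes: int = 3) -> None:
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.cls_head = nn.Linear(2 * hidden_dim, n_classes)  # length-class logits
        self.reg_head = nn.Linear(2 * hidden_dim, 1)          # GRAVY prediction

    def forward(self, tokens: torch.Tensor, mask: torch.Tensor):
        h, _ = self.lstm(self.embed(tokens))                  # (B, T, 2H)
        # Mean-pool over non-padding positions only
        pooled = (h * mask.unsqueeze(-1)).sum(1) / mask.sum(1, keepdim=True)
        return self.cls_head(pooled), self.reg_head(pooled).squeeze(-1)
```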
In drug discovery pipelines:
- Rapid in-silico pre-screening of CDR-H3 libraries before costly wet-lab assays
- Prioritising candidates by predicted hydrophobicity (a proxy for aggregation and manufacturability risk)
- Filtering out sequences with unusually short or long CDR-H3 loops that may indicate structural instability
As a research foundation:
- Extend the regression head to predict additional properties: thermal stability (Tm), polyspecificity, viscosity, or expression yield — all measurable and available in public datasets
- Replace CDR-H3 with full variable-domain sequences (VH/VL) for broader coverage
- Add a generative head (e.g. VAE or diffusion decoder) to condition sequence generation on desired biophysical targets
- Plug ESM-2 embeddings into docking score predictors or structure-guided property models (AlphaFold-compatible)
As an MLOps reference:
- Demonstrates dual-environment training (CPU dev / GPU production) with shared codebase
- Full W&B experiment tracking, sweep optimisation, and model registry integration
- Schema-validated data pipeline with reproducible smoke data and Zenodo-sourced full data
W&B project: dima806-team/antibody-seq-ml
| Environment | Purpose | Dataset | Hardware |
|---|---|---|---|
| GitHub Codespaces | Dev, tests, EDA, smoke runs | ~2K sequences (data/smoke/) | 2-core CPU |
| Kaggle Notebook | Full training, sweeps, ESM-2 | ~500K OAS sequences | T4 / P100 GPU |
```text
antibody_sequence_modeling_example/
├── .devcontainer/
│   └── devcontainer.json        # Codespaces: Python 3.11, uv
├── .pre-commit-config.yaml      # ruff, trailing-whitespace, yaml/toml checks, etc.
├── configs/
│   ├── default.yaml             # Smoke-scale defaults
│   ├── schema.yaml              # Dataset schema (validated on every generate/download)
│   ├── sweep_cpu.yaml           # Random sweep, CPU
│   └── sweep_gpu.yaml           # Bayesian sweep, GPU
├── data/
│   ├── smoke/
│   │   └── sequences_smoke.csv  # ~2K synthetic sequences (committed)
│   ├── download.py              # Download OAS data from Zenodo / OAS API
│   ├── generate_smoke.py        # Generate synthetic smoke dataset
│   └── validate.py              # Schema validation (reads configs/schema.yaml)
├── notebooks/
│   ├── 01_eda.ipynb             # EDA on smoke data (Codespaces)
│   └── 02_kaggle_training.ipynb # Full GPU training (Kaggle)
├── src/
│   └── antibody_seq_ml/
│       ├── dataset.py           # Tokeniser, CDRDataset, DataLoaders
│       ├── train.py             # Training loop + W&B logging
│       ├── evaluate.py          # Metrics, confusion matrix, attention viz
│       ├── sweep.py             # W&B sweep agent entry point
│       └── models/
│           ├── base.py          # SequenceModel base class + factory
│           ├── lstm.py          # BiLSTM with mean pooling
│           ├── transformer.py   # TransformerEncoder + sinusoidal PE
│           └── esm2.py          # ESM-2 fine-tuning (GPU only)
├── tests/
│   ├── conftest.py
│   ├── test_dataset.py
│   ├── test_models.py
│   └── test_train.py
├── Makefile
├── pyproject.toml
└── requirements_kaggle.txt
```
- `make setup`: Installs uv, the project package in editable mode, and pre-commit hooks.
- `make data-smoke`: Regenerates `data/smoke/sequences_smoke.csv` (~2K synthetic sequences, seed=42) and validates the output against `configs/schema.yaml`. The file is already committed — only run this to regenerate.
- `make train-smoke`: Trains a small Transformer on the smoke dataset for 10 epochs with W&B disabled. Checkpoint saved to `checkpoints/best_model.pt`.
- `make test`: Runs pytest with coverage across `tests/`. All tests use smoke data and run on CPU.
- `make lint`: Runs `pre-commit run --all-files` — ruff (lint + format), trailing whitespace, YAML/TOML checks, merge conflict detection, large file guard, and more.
- `make sweep-cpu`: Launches a W&B random sweep (5 runs) over `lstm` and `transformer` using smoke data.
Two sources are supported; neither requires authentication:
| Source | Command | Notes |
|---|---|---|
| Zenodo | `make data-full` | p-IgGen pre-processed OAS snapshot (default) |
| OAS REST API | `python data/download.py --source oas-api` | Direct from opig.stats.ox.ac.uk, up to `--max-units` units |
```bash
# Zenodo (default) — single file, no auth
make data-full

# OAS REST API — downloads up to 200 data units
python data/download.py --output data/full/ --source oas-api
```

Both sources produce `data/full/sequences_full.csv` with an identical schema (validated automatically):
`sequence_id, heavy_chain_sequence, cdr_h3, cdr_h3_length, length_class, hydrophobicity`.
When no CDR-H3 column is present in the source data, it is extracted from the heavy chain sequence using the conserved C...WGxG flanking motif.
For Codespaces smoke testing, the committed dataset at data/smoke/sequences_smoke.csv is used instead.
The schema is defined in configs/schema.yaml and enforced by data/validate.py after every make data-smoke and make data-full run. It checks:
- All 6 required columns are present and non-null
- String length bounds (`cdr_h3`: 5–30 AA, `heavy_chain_sequence`: ≥30 AA)
- `length_class` ∈ {short, medium, long}
- `hydrophobicity` ∈ [-5.0, 5.0]
- `cdr_h3_length` matches `len(cdr_h3)` exactly
- Length class boundaries are consistent with `cdr_h3_length`
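These checks are straightforward to express over a DataFrame. A hypothetical re-implementation for illustration (the real `data/validate.py` reads its bounds from `configs/schema.yaml` rather than hardcoding them):

```python
import pandas as pd

REQUIRED = ["sequence_id", "heavy_chain_sequence", "cdr_h3",
            "cdr_h3_length", "length_class", "hydrophobicity"]

def validate(df: pd.DataFrame) -> list[str]:
    """Return a list of schema violations; empty list means the data is valid."""
    missing = [c for c in REQUIRED if c not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]
    errors: list[str] = []
    if df[REQUIRED].isna().any().any():
        errors.append("null values present")
    if not df["cdr_h3"].str.len().between(5, 30).all():
        errors.append("cdr_h3 length out of [5, 30]")
    if not (df["heavy_chain_sequence"].str.len() >= 30).all():
        errors.append("heavy_chain_sequence shorter than 30 AA")
    if not df["length_class"].isin(["short", "medium", "long"]).all():
        errors.append("invalid length_class")
    if not df["hydrophobicity"].between(-5.0, 5.0).all():
        errors.append("hydrophobicity out of [-5.0, 5.0]")
    if not (df["cdr_h3_length"] == df["cdr_h3"].str.len()).all():
        errors.append("cdr_h3_length mismatch")
    return errors
```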
| Model | Architecture | Smoke params | Full params |
|---|---|---|---|
| `lstm` | Embedding → BiLSTM → mean pool → heads | ~200K | ~3M |
| `transformer` | [CLS] + Embedding + sinusoidal PE → TransformerEncoder → heads | ~150K | ~4M |
| `esm2` | ESM-2 (8M) frozen + 2 trainable layers + MLP heads | ~1M trainable | GPU only |
Select model via configs/default.yaml:
```yaml
model:
  type: transformer   # lstm | transformer | esm2
```

Multi-task loss: `loss = cls_loss_weight × CrossEntropy + reg_loss_weight × MSE`
Default weights: cls=1.0, reg=0.1 (MSE is orders of magnitude smaller than CE).
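The weighted multi-task loss can be sketched as follows (argument names are illustrative; the actual implementation lives in `src/antibody_seq_ml/train.py`):

```python
import torch
import torch.nn.functional as F

def multitask_loss(cls_logits: torch.Tensor, reg_pred: torch.Tensor,
                   length_labels: torch.Tensor, gravy_targets: torch.Tensor,
                   cls_loss_weight: float = 1.0,
                   reg_loss_weight: float = 0.1) -> torch.Tensor:
    """loss = cls_w * CrossEntropy + reg_w * MSE (weights from the config)."""
    cls_loss = F.cross_entropy(cls_logits, length_labels)   # 3-class length head
    reg_loss = F.mse_loss(reg_pred, gravy_targets)          # GRAVY regression head
    return cls_loss_weight * cls_loss + reg_loss_weight * reg_loss
```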
CosineAnnealingLR or ReduceLROnPlateau (configurable). Early stopping on val/loss with configurable patience.
WANDB_API_KEY is read from the environment (injected automatically from Codespaces / Kaggle secrets). train.py and sweep.py call wandb.login() at startup when the key is present.
Every epoch logs:
```
train/loss   train/cls_loss   train/reg_loss   train/cls_acc   train/hydro_r2
val/loss     val/cls_loss     val/reg_loss     val/cls_acc     val/hydro_r2
lr   grad_norm   epoch
```
| Target | Description |
|---|---|
| `make setup` | Install uv, package, pre-commit hooks |
| `make data-smoke` | Generate ~2K synthetic smoke sequences |
| `make data-full` | Download full OAS dataset from Zenodo |
| `make train-smoke` | Smoke training run, W&B disabled |
| `make train-full` | Full training run on `data/full/sequences_full.csv` |
| `make sweep-cpu` | Random sweep (5 runs) on smoke data |
| `make sweep-gpu` | Bayesian sweep (50 runs) on full data |
| `make test` | pytest + coverage |
| `make lint` | pre-commit (ruff, format, hooks) |
| `make notebook` | Open EDA notebook |
All hyperparameters live in configs/default.yaml. The --smoke flag overrides to small-scale settings:
| Hyperparameter | Smoke (CPU) | Full (GPU) |
|---|---|---|
| `d_model` / `embedding_dim` | 64 | 256 |
| `hidden_dim` (LSTM) | 128 | 512 |
| `num_layers` | 2 | 3–6 |
| `dim_feedforward` | 256 | 1024 |
| `nhead` | 4 | 8 |
| `batch_size` | 32–64 | 128–512 |
| `epochs` | 10 | 30–50 |
| `max_seq_length` | 30 | 150 |
Open notebooks/02_kaggle_training.ipynb on Kaggle with GPU enabled.
The notebook self-bootstraps:
- Clones this repo from GitHub
- Installs all dependencies (`pip install -e .` + `fair-esm`)
- Logs in to W&B and HuggingFace via Kaggle Secrets (`WANDB_API_KEY`, `HF_API_KEY`)
- Downloads full OAS data from Zenodo (fallback: OAS API)
- Runs full-scale Transformer training
- Launches 50-run Bayesian W&B sweep
- Fine-tunes ESM-2 (stretch goal)
- Evaluates best model and logs to W&B Model Registry
Prerequisites:
- Kaggle Settings → Accelerator → GPU T4 x2 or P100
- Kaggle Settings → Internet → On
- Kaggle Secrets → `WANDB_API_KEY` and `HF_API_KEY` set
| Metric | Target |
|---|---|
| CDR-H3 length classification accuracy | >85% (full GPU data) |
| Hydrophobicity R² | >0.75 (full GPU data) |
| ESM-2 fine-tuned accuracy | >90% (stretch) |
| CPU smoke train time | <3 min |
| Test coverage | >80% |
- Device: never hardcode `"cpu"`/`"cuda"` — always branch on `torch.cuda.is_available()`
- Config: `omegaconf.DictConfig` throughout; load with `OmegaConf.load("configs/default.yaml")`
- W&B in tests: `WANDB_MODE=disabled` set by `conftest.py` and `make test`
- Schema: both `data/generate_smoke.py` and `data/download.py` call `data/validate.py` before writing CSV
- Type hints: all functions use full type hints with `from __future__ import annotations`
- Imports: stdlib → third-party → local, enforced by ruff