End-to-end deep learning pipeline for predicting biophysical properties of antibody CDR-H3 sequences using sequence-based models and large pre-trained protein language models.
Therapeutic antibodies are one of the fastest-growing classes of drugs, with applications in oncology, immunology, and infectious disease. A key bottleneck in antibody drug discovery is screening large sequence libraries for candidates with favourable biophysical properties — high specificity, good solubility, low aggregation risk — properties that are expensive and slow to measure experimentally.
This project trains deep learning models to predict two such properties directly from the CDR-H3 amino acid sequence (the primary determinant of antigen binding specificity):
| Task | Output | Model head |
|---|---|---|
| Length classification | short (≤9 AA) / medium (10–14 AA) / long (≥15 AA) | 3-class softmax |
| Hydrophobicity regression | Kyte-Doolittle GRAVY score | Scalar regression |
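Both targets are cheap to derive from the sequence itself. A minimal sketch using the published Kyte-Doolittle hydropathy scale (function names are illustrative, not part of the codebase):

```python
# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982)
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def gravy(seq: str) -> float:
    """Grand average of hydropathy: mean KD value over all residues."""
    return sum(KD[aa] for aa in seq) / len(seq)

def length_class(seq: str) -> str:
    """Length bucket used as the 3-class target."""
    n = len(seq)
    return "short" if n <= 9 else "medium" if n <= 14 else "long"
```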
Training data comes from the Observed Antibody Space (OAS) — a database of ~1 billion naturally occurring human and animal antibody sequences — providing broad coverage of CDR-H3 diversity.
Three model architectures are compared:
- BiLSTM — bidirectional LSTM with mean pooling, fast and interpretable baseline
- Transformer — encoder-only with sinusoidal positional encoding and CLS-token pooling
- ESM-2 — 8M-parameter protein language model (Meta AI) fine-tuned on CDR sequences (GPU only)
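To give a sense of the baseline's shape, a BiLSTM with masked mean pooling and the two task heads might look like this (a hypothetical sketch, not the project's actual `models/lstm.py`):

```python
import torch
import torch.nn as nn

class BiLSTMBaseline(nn.Module):
    """Illustrative BiLSTM baseline: embed, encode, mean-pool, two heads."""

    def __init__(self, vocab_size: int = 25, embed_dim: int = 64,
                 hidden_dim: int = 128, n_classes: int = 3) -> None:
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.cls_head = nn.Linear(2 * hidden_dim, n_classes)  # length-class logits
        self.reg_head = nn.Linear(2 * hidden_dim, 1)          # GRAVY prediction

    def forward(self, tokens: torch.Tensor, mask: torch.Tensor):
        h, _ = self.lstm(self.embed(tokens))                  # (B, T, 2H)
        # Mean-pool over non-padding positions only
        pooled = (h * mask.unsqueeze(-1)).sum(1) / mask.sum(1, keepdim=True)
        return self.cls_head(pooled), self.reg_head(pooled).squeeze(-1)
```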
In drug discovery pipelines:
- Rapid in-silico pre-screening of CDR-H3 libraries before costly wet-lab assays
- Prioritising candidates by predicted hydrophobicity (a proxy for aggregation and manufacturability risk)
- Filtering out sequences with unusually short or long CDR-H3 loops that may indicate structural instability
As a research foundation:
- Extend the regression head to predict additional properties: thermal stability (Tm), polyspecificity, viscosity, or expression yield — all measurable and available in public datasets
- Replace CDR-H3 with full variable-domain sequences (VH/VL) for broader coverage
- Add a generative head (e.g. VAE or diffusion decoder) to condition sequence generation on desired biophysical targets
- Plug ESM-2 embeddings into docking score predictors or structure-guided property models (AlphaFold-compatible)
As an MLOps reference:
- Demonstrates dual-environment training (CPU dev / GPU production) with shared codebase
- Full W&B experiment tracking, sweep optimisation, and model registry integration
- Schema-validated data pipeline with reproducible smoke data and Zenodo-sourced full data
W&B project: dima806-team/antibody-seq-ml
| Environment | Purpose | Dataset | Hardware |
|---|---|---|---|
| GitHub Codespaces | Dev, tests, EDA, smoke runs | ~2K sequences (data/smoke/) | 2-core CPU |
| Kaggle Notebook | Full training, sweeps, ESM-2 | ~500K OAS sequences | T4 / P100 GPU |
```text
antibody_sequence_modeling_example/
├── .devcontainer/
│   └── devcontainer.json        # Codespaces: Python 3.11, uv
├── .pre-commit-config.yaml      # ruff, trailing-whitespace, yaml/toml checks, etc.
├── configs/
│   ├── default.yaml             # Smoke-scale defaults
│   ├── schema.yaml              # Dataset schema (validated on every generate/download)
│   ├── sweep_cpu.yaml           # Random sweep, CPU
│   └── sweep_gpu.yaml           # Bayesian sweep, GPU
├── data/
│   ├── smoke/
│   │   └── sequences_smoke.csv  # ~2K synthetic sequences (committed)
│   ├── download.py              # Download OAS data from Zenodo / OAS API
│   ├── generate_smoke.py        # Generate synthetic smoke dataset
│   └── validate.py              # Schema validation (reads configs/schema.yaml)
├── notebooks/
│   ├── 01_eda.ipynb             # EDA on smoke data (Codespaces)
│   └── 02_kaggle_training.ipynb # Full GPU training (Kaggle)
├── src/
│   └── antibody_seq_ml/
│       ├── dataset.py           # Tokeniser, CDRDataset, DataLoaders
│       ├── train.py             # Training loop + W&B logging
│       ├── evaluate.py          # Metrics, confusion matrix, attention viz
│       ├── sweep.py             # W&B sweep agent entry point
│       └── models/
│           ├── base.py          # SequenceModel base class + factory
│           ├── lstm.py          # BiLSTM with mean pooling
│           ├── transformer.py   # TransformerEncoder + sinusoidal PE
│           └── esm2.py          # ESM-2 fine-tuning (GPU only)
├── tests/
│   ├── conftest.py
│   ├── test_dataset.py
│   ├── test_models.py
│   └── test_train.py
├── Makefile
├── pyproject.toml
└── requirements_kaggle.txt
```
- `make setup`: Installs uv, the project package in editable mode, and pre-commit hooks.
- `make data-smoke`: Regenerates `data/smoke/sequences_smoke.csv` (~2K synthetic sequences, seed=42) and validates the output against `configs/schema.yaml`. The file is already committed — only run this to regenerate.
- `make train-smoke`: Trains a small Transformer on the smoke dataset for 10 epochs with W&B disabled. Checkpoint saved to `checkpoints/best_model.pt`.
- `make test`: Runs pytest with coverage across `tests/`. All tests use smoke data and run on CPU.
- `make lint`: Runs `pre-commit run --all-files` — ruff (lint + format), trailing whitespace, YAML/TOML checks, merge conflict detection, large file guard, and more.
- `make sweep-cpu`: Launches a W&B random sweep (5 runs) over `lstm` and `transformer` using smoke data.
Two sources are supported; neither requires authentication:
| Source | Command | Notes |
|---|---|---|
| Zenodo | `make data-full` | p-IgGen pre-processed OAS snapshot (default) |
| OAS REST API | `python data/download.py --source oas-api` | Direct from opig.stats.ox.ac.uk, up to `--max-units` units |
```bash
# Zenodo (default) — single file, no auth
make data-full

# OAS REST API — downloads up to 200 data units
python data/download.py --output data/full/ --source oas-api
```

Both sources produce `data/full/sequences_full.csv` with an identical schema (validated automatically):
`sequence_id, heavy_chain_sequence, cdr_h3, cdr_h3_length, length_class, hydrophobicity`.
When no CDR-H3 column is present in the source data, it is extracted from the heavy chain sequence using the conserved C...WGxG flanking motif.
For Codespaces smoke testing, the committed dataset at data/smoke/sequences_smoke.csv is used instead.
The schema is defined in configs/schema.yaml and enforced by data/validate.py after every make data-smoke and make data-full run. It checks:
- All 6 required columns are present and non-null
- String length bounds (`cdr_h3`: 5–30 AA, `heavy_chain_sequence`: ≥30 AA)
- `length_class` ∈ {short, medium, long}
- `hydrophobicity` ∈ [-5.0, 5.0]
- `cdr_h3_length` matches `len(cdr_h3)` exactly
- Length class boundaries are consistent with `cdr_h3_length`
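These checks are straightforward to express over a DataFrame. A hypothetical re-implementation for illustration (the real `data/validate.py` reads its bounds from `configs/schema.yaml` rather than hardcoding them):

```python
import pandas as pd

REQUIRED = ["sequence_id", "heavy_chain_sequence", "cdr_h3",
            "cdr_h3_length", "length_class", "hydrophobicity"]

def validate(df: pd.DataFrame) -> list[str]:
    """Return a list of schema violations; empty list means the data is valid."""
    missing = [c for c in REQUIRED if c not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]
    errors: list[str] = []
    if df[REQUIRED].isna().any().any():
        errors.append("null values present")
    if not df["cdr_h3"].str.len().between(5, 30).all():
        errors.append("cdr_h3 length out of [5, 30]")
    if not (df["heavy_chain_sequence"].str.len() >= 30).all():
        errors.append("heavy_chain_sequence shorter than 30 AA")
    if not df["length_class"].isin(["short", "medium", "long"]).all():
        errors.append("invalid length_class")
    if not df["hydrophobicity"].between(-5.0, 5.0).all():
        errors.append("hydrophobicity out of [-5.0, 5.0]")
    if not (df["cdr_h3_length"] == df["cdr_h3"].str.len()).all():
        errors.append("cdr_h3_length mismatch")
    return errors
```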
| Model | Architecture | Smoke params | Full params |
|---|---|---|---|
| `lstm` | Embedding → BiLSTM → mean pool → heads | ~200K | ~3M |
| `transformer` | [CLS] + Embedding + sinusoidal PE → TransformerEncoder → heads | ~150K | ~4M |
| `esm2` | ESM-2 (8M) frozen + 2 trainable layers + MLP heads | ~1M trainable | GPU only |
Select model via configs/default.yaml:
```yaml
model:
  type: transformer   # lstm | transformer | esm2
```

Multi-task loss: `loss = cls_loss_weight × CrossEntropy + reg_loss_weight × MSE`
Default weights: cls=1.0, reg=0.1 (MSE is orders of magnitude smaller than CE).
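The weighted multi-task loss can be sketched as follows (argument names are illustrative; the actual implementation lives in `src/antibody_seq_ml/train.py`):

```python
import torch
import torch.nn.functional as F

def multitask_loss(cls_logits: torch.Tensor, reg_pred: torch.Tensor,
                   length_labels: torch.Tensor, gravy_targets: torch.Tensor,
                   cls_loss_weight: float = 1.0,
                   reg_loss_weight: float = 0.1) -> torch.Tensor:
    """loss = cls_w * CrossEntropy + reg_w * MSE (weights from the config)."""
    cls_loss = F.cross_entropy(cls_logits, length_labels)   # 3-class length head
    reg_loss = F.mse_loss(reg_pred, gravy_targets)          # GRAVY regression head
    return cls_loss_weight * cls_loss + reg_loss_weight * reg_loss
```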
CosineAnnealingLR or ReduceLROnPlateau (configurable). Early stopping on val/loss with configurable patience.
WANDB_API_KEY is read from the environment (injected automatically from Codespaces / Kaggle secrets). train.py and sweep.py call wandb.login() at startup when the key is present.
Every epoch logs:
```
train/loss   train/cls_loss   train/reg_loss   train/cls_acc   train/hydro_r2
val/loss     val/cls_loss     val/reg_loss     val/cls_acc     val/hydro_r2
lr   grad_norm   epoch
```
| Target | Description |
|---|---|
| `make setup` | Install uv, package, pre-commit hooks |
| `make data-smoke` | Generate ~2K synthetic smoke sequences |
| `make data-full` | Download full OAS dataset from Zenodo |
| `make train-smoke` | Smoke training run, W&B disabled |
| `make train-full` | Full training run on `data/full/sequences_full.csv` |
| `make sweep-cpu` | Random sweep (5 runs) on smoke data |
| `make sweep-gpu` | Bayesian sweep (50 runs) on full data |
| `make test` | pytest + coverage |
| `make lint` | pre-commit (ruff, format, hooks) |
| `make notebook` | Open EDA notebook |
All hyperparameters live in configs/default.yaml. The --smoke flag overrides to small-scale settings:
| Hyperparameter | Smoke (CPU) | Full (GPU) |
|---|---|---|
| `d_model` / `embedding_dim` | 64 | 256 |
| `hidden_dim` (LSTM) | 128 | 512 |
| `num_layers` | 2 | 3–6 |
| `dim_feedforward` | 256 | 1024 |
| `nhead` | 4 | 8 |
| `batch_size` | 32–64 | 128–512 |
| `epochs` | 10 | 30–50 |
| `max_seq_length` | 30 | 150 |
Open notebooks/02_kaggle_training.ipynb on Kaggle with GPU enabled.
The notebook self-bootstraps:
- Clones this repo from GitHub
- Installs all dependencies (`pip install -e .` + `fair-esm`)
- Logs in to W&B and HuggingFace via Kaggle Secrets (`WANDB_API_KEY`, `HF_API_KEY`)
- Downloads full OAS data from Zenodo (fallback: OAS API)
- Runs full-scale Transformer training
- Launches 50-run Bayesian W&B sweep
- Fine-tunes ESM-2 (stretch goal)
- Evaluates best model and logs to W&B Model Registry
Prerequisites:
- Kaggle Settings → Accelerator → GPU T4 x2 or P100
- Kaggle Settings → Internet → On
- Kaggle Secrets → `WANDB_API_KEY` and `HF_API_KEY` set
| Metric | Target |
|---|---|
| CDR-H3 length classification accuracy | >85% (full GPU data) |
| Hydrophobicity R² | >0.75 (full GPU data) |
| ESM-2 fine-tuned accuracy | >90% (stretch) |
| CPU smoke train time | <3 min |
| Test coverage | >80% |
- Device: never hardcode `"cpu"`/`"cuda"` — always branch on `torch.cuda.is_available()`
- Config: `omegaconf.DictConfig` throughout; load with `OmegaConf.load("configs/default.yaml")`
- W&B in tests: `WANDB_MODE=disabled` set by `conftest.py` and `make test`
- Schema: both `data/generate_smoke.py` and `data/download.py` call `data/validate.py` before writing CSV
- Type hints: all functions use full type hints with `from __future__ import annotations`
- Imports: stdlib → third-party → local, enforced by ruff