Reproduction of: "Repurposing Protein Language Models for Latent Flow–Based Fitness Optimization" (Caceres Arroyo et al., 2026, arXiv:2602.02425)
CHASE is a framework for protein fitness optimization that:
- Encodes protein sequences using a pretrained ESM2 protein language model
- Compresses embeddings into a compact latent manifold via a β-VAE (compressor/decompressor)
- Trains a conditional flow matching model with classifier-free guidance
- Generates high-fitness variants without predictor-based guidance during ODE sampling
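
To make the flow matching bullet concrete, here is a minimal sketch of how a single training example could be built. It assumes a straight-line interpolation path between Gaussian noise and a data latent, and it assumes the fitness condition is randomly dropped so that one network learns both a conditional and an unconditional velocity (the usual recipe for classifier-free guidance). The names `cfm_training_example`, `cond_dropout`, and the use of `None` as the null condition are illustrative and are not taken from this repository.

```python
import random


def cfm_training_example(z1, fitness, cond_dropout=0.1, rng=random):
    """Build one conditional flow matching training example (illustrative).

    z1           -- latent vector of a real sequence (list of floats)
    fitness      -- scalar fitness used as the condition
    cond_dropout -- assumed probability of replacing the condition with None
    Returns (z_t, t, cond, target_velocity).
    """
    # Noise endpoint of the path, sampled from a standard normal.
    z0 = [rng.gauss(0.0, 1.0) for _ in z1]
    # Uniform time in [0, 1).
    t = rng.random()
    # Point on the straight line from noise (t=0) to data (t=1).
    z_t = [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]
    # Velocity of that straight line; this is the regression target.
    target = [b - a for a, b in zip(z0, z1)]
    # Drop the condition at random so the model also sees unconditional examples.
    cond = None if rng.random() < cond_dropout else fitness
    return z_t, t, cond, target


def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)
```

A network would be trained to minimise `mse(model(z_t, t, cond), target)` over many such examples; the actual path, noise schedule, and dropout rate used by CHASE may differ from this sketch.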
Sequence → ESM2 Encoder → Compressor → Latent z
↕ (Flow Matching with fitness conditioning)
Sequence ← ESM2 Decoder ← Decompressor ← Latent z'
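
The middle arrow of the diagram (noise to latent z' under fitness conditioning) can be sketched as a guided Euler integration of the learned velocity field. This is a toy illustration under stated assumptions: the guidance rule `v_uncond + w * (v_cond - v_uncond)` is one common classifier-free guidance convention and may not match the sign or scaling convention behind this repository's `--guidance_scale` flag, and `toy_velocity` is a hypothetical stand-in for the trained flow network. Decoding z' back to a sequence through the decompressor and ESM2 decoder is omitted.

```python
def guided_velocity(v_cond, v_uncond, w):
    """Combine conditional and unconditional velocities (assumed convention)."""
    return [u + w * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(velocity_fn, z0, target_fitness, w, n_steps=50):
    """Integrate dz/dt = v(z, t, cond) from t=0 to t=1 with fixed Euler steps."""
    z = list(z0)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        v_c = velocity_fn(z, t, target_fitness)  # conditioned on target fitness
        v_u = velocity_fn(z, t, None)            # null condition
        v = guided_velocity(v_c, v_u, w)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z


def toy_velocity(z, t, cond):
    """Hypothetical velocity field: moves every coordinate toward `cond`
    (or toward 0.0 when unconditional) so that it arrives at t = 1."""
    goal = 0.0 if cond is None else cond
    return [(goal - zi) / (1.0 - t) for zi in z]


if __name__ == "__main__":
    start = [0.3, -1.0]
    print(euler_sample(toy_velocity, start, target_fitness=0.8, w=1.0, n_steps=20))
```

In this toy field, `w = 1.0` follows the conditional velocity alone, while `w > 1.0` extrapolates past it. No separate fitness predictor is queried during integration, which is the property the overview describes.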
pip install -r requirements.txt

Download the GFP/AAV benchmark datasets from:
- Kirjner et al. (2023): https://github.com/kirjner/GGS
python scripts/prepare_data.py --dataset gfp --split medium --output_dir data/

python scripts/train_vae.py \
--dataset gfp_medium \
--data_dir data/ \
--output_dir checkpoints/vae/ \
--stage 1

python scripts/train_vae.py \
--dataset gfp_medium \
--data_dir data/ \
--output_dir checkpoints/vae/ \
--stage 2 \
--vae_checkpoint checkpoints/vae/stage1/best.pt

python scripts/train_flow.py \
--dataset gfp_medium \
--data_dir data/ \
--vae_checkpoint checkpoints/vae/stage2/best.pt \
--output_dir checkpoints/flow/ \
--score_dropout 0.0 \
--train_steps 600000

python scripts/sample.py \
--dataset gfp_medium \
--vae_checkpoint checkpoints/vae/stage2/best.pt \
--flow_checkpoint checkpoints/flow/best.pt \
--target_fitness 0.8 \
--guidance_scale -0.08 \
--n_samples 512 \
--output sequences.fasta

python scripts/bootstrap.py \
--dataset gfp_medium \
--vae_checkpoint checkpoints/vae/stage2/best.pt \
--flow_checkpoint checkpoints/flow/best.pt \
--output_dir checkpoints/flow_bootstrapped/

Preset configs for all four benchmarks are in configs/.
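
To inspect the sequences written by the sampling step, a small FASTA reader such as the following can be used. This is a sketch that assumes `sequences.fasta` is a standard FASTA file (header lines starting with `>`, sequence possibly wrapped over several lines); the header layout produced by `scripts/sample.py` is not documented here, so headers are returned verbatim. The helper names `parse_fasta` and `count_unique` are illustrative.

```python
def parse_fasta(lines):
    """Parse FASTA text from an iterable of lines.

    Returns a list of (header, sequence) tuples in file order.
    """
    records = []
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # wrapped sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def count_unique(records):
    """Number of distinct sequences among the parsed records."""
    return len({seq for _, seq in records})
```

For example, `with open("sequences.fasta") as fh: records = parse_fasta(fh)` followed by `count_unique(records)` gives a quick check of how many of the sampled variants are distinct.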
| Dataset | CHASE Fitness | CHASE Bootstrapped |
|---|---|---|
| AAV Medium | 0.62 | 0.65 |
| AAV Hard | 0.61 | 0.63 |
| GFP Medium | 0.91 | 0.93 |
| GFP Hard | 0.92 | 0.87 |
@article{caceresarroyo2026chase,
title={Repurposing Protein Language Models for Latent Flow-Based Fitness Optimization},
author={Caceres Arroyo, Amaru and Bogensperger, Lea and Allam, Ahmed and Krauthammer, Michael and Schindler, Konrad and Narnhofer, Dominik},
journal={arXiv preprint arXiv:2602.02425},
year={2026}
}