This repository implements Sparse Autoencoders (SAEs) for mechanistic interpretability of the Borzoi genomic sequence model. SAEs decompose Borzoi's learned representations into interpretable features that correspond to regulatory motifs and genomic patterns. Results are available at motifscout.com.
- Overview
- Framework Schematic
- Installation
- Quick Start
- Pipeline Stages
- Saving Activations
- Training Parameters
- Analysis Workflow
- Example Usage
- Visualization
- Citation
- License
Deep learning models for regulatory genomics have achieved remarkable predictive performance, yet their internal representations remain a black box. We apply sparse autoencoders (SAEs) to decompose the learned representations of Borzoi, a state-of-the-art CNN-transformer that predicts genome-wide molecular phenotypes from DNA sequence. We train TopK-SAEs on activations from Borzoi's early convolutional layers and discover monosemantic regulatory features corresponding to transcription factor (TF) binding motifs and transposable element sequences. We validate these features through motif discovery with the MEME Suite against known TF databases, identifying hundreds of significant position weight matrices that map SAE-discovered features to established TF binding sites. This establishes SAEs as valuable tools for mechanistic interpretability in computational biology.
SAE-Borzoi uses the Top-K sparsity approach (Gao et al.) to reconstruct activations from Borzoi's convolutional layers, enabling identification of monosemantic genomic features.
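To make the Top-K approach concrete, here is a minimal sketch of a TopK sparse autoencoder in PyTorch. This is illustrative only and not the repository's implementation; the class name, ReLU choice, and default hyperparameters (expansion factor 4, top 5%, matching the training parameters below) are assumptions.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal Top-K sparse autoencoder sketch (hypothetical, for illustration)."""

    def __init__(self, d_in, expansion_factor=4, topk_pct=0.05):
        super().__init__()
        d_hidden = d_in * expansion_factor
        # Number of hidden units allowed to stay active per example.
        self.k = max(1, int(topk_pct * d_hidden))
        self.encoder = nn.Linear(d_in, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_in)

    def forward(self, x):
        # Encode, then zero out everything except the k largest activations.
        h = torch.relu(self.encoder(x))
        topk = torch.topk(h, self.k, dim=-1)
        sparse = torch.zeros_like(h).scatter_(-1, topk.indices, topk.values)
        return self.decoder(sparse), sparse

# Example: 608 input channels (as in Borzoi's conv1d_1 layer)
sae = TopKSAE(608)
recon, codes = sae(torch.randn(2, 608))
```

Because sparsity is enforced structurally by the Top-K mask, no L1 penalty is needed in the loss.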
- Interpretable genomic features: Maps node activations to known/unknown regulatory motifs
- Automated motif discovery: Integrates MEME Suite and TomTom for motif analysis
Clone the repository and install dependencies:
```bash
# Clone the repo
git clone https://github.com/calico/sae-borzoi.git
cd sae-borzoi

# (Recommended) Create a virtual environment
python3 -m venv venv
source venv/bin/activate

# Install Python dependencies
pip install -r requirements.txt
```

Or with conda:

```bash
conda create -n sae-borzoi python=3.8
conda activate sae-borzoi
pip install -r requirements.txt
```

Install MEME Suite separately for motif analysis.
- Python 3.8+ with PyTorch
- MEME Suite for motif analysis
- Borzoi model activations in HDF5 format
1. Configure paths in `config/config.json`:

```json
{
  "activations_path": "/path/to/borzoi/activations",
  "models_save_path": "/path/to/save/models",
  "expansion_factor": 4,
  "topk_pct": 0.05,
  "learning_rate": 1e-5
}
```
2. Run the complete pipeline:

```bash
./pipeline.sh
```
3. Run specific stages:

```bash
# Run stages 0-3 (preprocessing through sequence extraction)
./pipeline.sh --stage 0 --stop_stage 3

# Run only motif discovery (stages 4-6)
./pipeline.sh --stage 4 --stop_stage 6
```
| Stage | Script | Description |
|---|---|---|
| 0 | `find_global_max.py` | Compute activation normalization values |
| 1 | `batch_train.py` | Train SAE models (distributed) |
| 2 | `batch_infer.py` | Run inference to extract top activations |
| 3 | `save_seqlets.py` | Extract genomic sequences around activations |
| 4 | `run_meme_multi.py` | Discover motifs with MEME |
| 5 | `run_meme_post.py` | Post-process MEME results |
| 6 | `run_tomtom_multi.py` | Match motifs against known databases |
| 7 | `umap_analysis.py` | Generate UMAP visualizations |
| 8 | `seqlet_overlaps_analysis.py` | Analyze feature overlaps |
| 9 | `jaccard.py` | Compute Jaccard similarities |
The script `save_truncated.py` saves pretrained Borzoi models truncated at a particular layer for more memory-efficient use. `save_activations.py` runs forward passes on selected sequences and saves the resulting activations. These activation pre-saving scripts depend on TensorFlow and on the baskerville repository, which should be installed first; other scripts in this directory do not require baskerville. Usage:
```bash
python scripts/save_truncated.py -c config/config_save.json -o models_trunc
python scripts/save_activations.py -c config/config_save.json -o l2_activations -m models_trunc --chunk_size 8
```

- Expansion factor: 4 (hidden dim = 4 × input channels)
- Sparsity: top 5% of activations per sequence
- Learning rate: 1e-5
- Input channels: 608 (conv1d_1 layer)
- Sequence length: 524,288 bp, divided into 4 chunks for memory efficiency and further broken into receptive-field-window seqlets that comprise the batch
- Loss function: MSE + Top-K sparsity (no L1 penalty)
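The parameters above can be tied together in a single illustrative training step. This is a sketch under the stated hyperparameters, not the repository's `batch_train.py` code; the variable names and the use of Adam are assumptions, but it shows the key point that the loss is plain reconstruction MSE with sparsity coming from the Top-K mask rather than an L1 term.

```python
import torch
import torch.nn.functional as F

def topk_mask(h, k):
    # Keep only the k largest activations per example; zero the rest.
    vals, idx = torch.topk(h, k, dim=-1)
    return torch.zeros_like(h).scatter_(-1, idx, vals)

# Hyperparameters from the Training Parameters section
d_in, expansion_factor, topk_pct, lr = 608, 4, 0.05, 1e-5
d_hidden = d_in * expansion_factor          # 2432
k = int(topk_pct * d_hidden)                # 121 active units

enc = torch.nn.Linear(d_in, d_hidden)
dec = torch.nn.Linear(d_hidden, d_in)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=lr)

x = torch.randn(32, d_in)                   # stand-in batch of normalized activations
codes = topk_mask(torch.relu(enc(x)), k)
loss = F.mse_loss(dec(codes), x)            # MSE only; Top-K enforces sparsity
opt.zero_grad()
loss.backward()
opt.step()
```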
1. Feature Selection: identify SAE nodes with ≥1000 seqlets showing non-zero activation
2. Motif Discovery:
   - MEME: discovers up to 2 motifs per node (E-value < 0.05)
   - TomTom: matches against the Vierstra motif database (p-, q-, and E-values < 0.05)
3. Visualization: SAE-vis server for interactive exploration
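The feature-selection criterion above can be sketched as a simple filter. The function name and the shape of the activation matrix (`num_seqlets × num_nodes`) are hypothetical; only the ≥1000-seqlet threshold comes from the workflow description.

```python
import numpy as np

def select_nodes(acts, min_seqlets=1000):
    """Keep SAE nodes that fire (non-zero activation) on >= min_seqlets seqlets.

    acts: hypothetical (num_seqlets, num_nodes) activation matrix.
    Returns the indices of nodes passing the threshold.
    """
    firing_counts = (acts != 0).sum(axis=0)
    return np.flatnonzero(firing_counts >= min_seqlets)

# Toy example: three nodes, only node 0 fires often enough
acts = np.zeros((5000, 3))
acts[:2000, 0] = 1.0   # node 0 fires on 2000 seqlets -> kept
acts[:500, 1] = 1.0    # node 1 fires on 500 seqlets  -> dropped
keep = select_nodes(acts)  # -> array([0])
```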
```bash
python scripts/train_one_instance.py --topk 0.05 --exp_factor 4 --lr 1e-5
python scripts/infer_one_instance.py --topk 0.05 --exp_factor 4 --top_acts 16
python scripts/save_seqlets.py
```

The features are visualized using the SAE-vis server: motifscout.com.
If you use this code, please cite:
- Borzoi paper for the base model
MIT License.
