This repository implements Sparse Autoencoders (SAEs) for mechanistic interpretability of the Borzoi genomic sequence model. SAEs decompose Borzoi's learned representations into interpretable features that correspond to regulatory motifs and genomic patterns. Results are available at motifscout.com.
- Overview
- Framework Schematic
- Installation
- Quick Start
- Pipeline Stages
- Saving Activations
- Training Parameters
- Analysis Workflow
- Example Usage
- Visualization
- Citation
- License
Deep learning models for regulatory genomics have achieved remarkable predictive performance, yet their internal representations remain a black box. We apply sparse autoencoders (SAEs) to decompose the learned representations of Borzoi, a state-of-the-art CNN-transformer that predicts genome-wide molecular phenotypes from DNA sequence. We train TopK-SAEs on activations from Borzoi's early convolutional layers and discover monosemantic regulatory features corresponding to transcription factor (TF) binding motifs and transposable element sequences. We validate these features through motif discovery with the MEME Suite against known TF databases, identifying hundreds of significant position weight matrices that map SAE-discovered features to established TF binding sites. This establishes SAEs as valuable tools for mechanistic interpretability in computational biology.
SAE-Borzoi uses the Top-K sparsity approach (Gao et al.) to reconstruct activations from Borzoi's convolutional layers, enabling identification of monosemantic genomic features.
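To make the Top-K approach concrete, here is a minimal sketch of a TopK sparse autoencoder in PyTorch. This is illustrative only and not the repository's implementation; the class name, ReLU choice, and default hyperparameters (expansion factor 4, top 5%, matching the training parameters below) are assumptions.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal Top-K sparse autoencoder sketch (hypothetical, for illustration)."""

    def __init__(self, d_in, expansion_factor=4, topk_pct=0.05):
        super().__init__()
        d_hidden = d_in * expansion_factor
        # Number of hidden units allowed to stay active per example.
        self.k = max(1, int(topk_pct * d_hidden))
        self.encoder = nn.Linear(d_in, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_in)

    def forward(self, x):
        # Encode, then zero out everything except the k largest activations.
        h = torch.relu(self.encoder(x))
        topk = torch.topk(h, self.k, dim=-1)
        sparse = torch.zeros_like(h).scatter_(-1, topk.indices, topk.values)
        return self.decoder(sparse), sparse

# Example: 608 input channels (as in Borzoi's conv1d_1 layer)
sae = TopKSAE(608)
recon, codes = sae(torch.randn(2, 608))
```

Because sparsity is enforced structurally by the Top-K mask, no L1 penalty is needed in the loss.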
- Interpretable genomic features: Maps node activations to known/unknown regulatory motifs
- Automated motif discovery: Integrates MEME Suite and TomTom for motif analysis
Clone the repository and install dependencies:
```bash
# Clone the repo
git clone https://github.com/calico/sae-borzoi.git
cd sae-borzoi

# (Recommended) Create a virtual environment
python3 -m venv venv
source venv/bin/activate

# Install Python dependencies
pip install -r requirements.txt
```

Or with conda:

```bash
conda create -n sae-borzoi python=3.8
conda activate sae-borzoi
pip install -r requirements.txt
```

Install MEME Suite separately for motif analysis.
- Python 3.8+ with PyTorch
- MEME Suite for motif analysis
- Borzoi model activations in HDF5 format
1. Configure paths in `config/config.json`:

```json
{
  "activations_path": "/path/to/borzoi/activations",
  "models_save_path": "/path/to/save/models",
  "expansion_factor": 4,
  "topk_pct": 0.05,
  "learning_rate": 1e-5
}
```
2. Run the complete pipeline:

```bash
./pipeline.sh
```
3. Run specific stages:

```bash
# Run stages 0-3 (preprocessing through sequence extraction)
./pipeline.sh --stage 0 --stop_stage 3

# Run only motif discovery (stages 4-6)
./pipeline.sh --stage 4 --stop_stage 6
```
| Stage | Script | Description |
|---|---|---|
| 0 | `find_global_max.py` | Compute activation normalization values |
| 1 | `batch_train.py` | Train SAE models (distributed) |
| 2 | `batch_infer.py` | Run inference to extract top activations |
| 3 | `save_seqlets.py` | Extract genomic sequences around activations |
| 4 | `run_meme_multi.py` | Discover motifs with MEME |
| 5 | `run_meme_post.py` | Post-process MEME results |
| 6 | `run_tomtom_multi.py` | Match motifs against known databases |
| 7 | `umap_analysis.py` | Generate UMAP visualizations |
| 8 | `seqlet_overlaps_analysis.py` | Analyze feature overlaps |
| 9 | `jaccard.py` | Compute Jaccard similarities |
The script `save_truncated.py` saves pretrained Borzoi models truncated at a particular layer for more memory-efficient use. `save_activations.py` runs forward passes on selected sequences and saves the resulting activations. These activation pre-saving scripts depend on TensorFlow and on the baskerville repository, which should be installed first; other scripts in this directory do not require baskerville. Usage:
```bash
python scripts/save_truncated.py -c config/config_save.json -o models_trunc
python scripts/save_activations.py -c config/config_save.json -o l2_activations -m models_trunc --chunk_size 8
```

- Expansion factor: 4 (hidden dim = 4 × input channels)
- Sparsity: top 5% of activations per sequence
- Learning rate: 1e-5
- Input channels: 608 (conv1d_1 layer)
- Sequence length: 524,288 bp, divided into 4 chunks for memory efficiency and further broken into receptive-field-window seqlets that comprise the batch
- Loss function: MSE + Top-K sparsity (no L1 penalty)
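The parameters above can be tied together in a single illustrative training step. This is a sketch under the stated hyperparameters, not the repository's `batch_train.py` code; the variable names and the use of Adam are assumptions, but it shows the key point that the loss is plain reconstruction MSE with sparsity coming from the Top-K mask rather than an L1 term.

```python
import torch
import torch.nn.functional as F

def topk_mask(h, k):
    # Keep only the k largest activations per example; zero the rest.
    vals, idx = torch.topk(h, k, dim=-1)
    return torch.zeros_like(h).scatter_(-1, idx, vals)

# Hyperparameters from the Training Parameters section
d_in, expansion_factor, topk_pct, lr = 608, 4, 0.05, 1e-5
d_hidden = d_in * expansion_factor          # 2432
k = int(topk_pct * d_hidden)                # 121 active units

enc = torch.nn.Linear(d_in, d_hidden)
dec = torch.nn.Linear(d_hidden, d_in)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=lr)

x = torch.randn(32, d_in)                   # stand-in batch of normalized activations
codes = topk_mask(torch.relu(enc(x)), k)
loss = F.mse_loss(dec(codes), x)            # MSE only; Top-K enforces sparsity
opt.zero_grad()
loss.backward()
opt.step()
```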
1. Feature Selection: identify SAE nodes with ≥1000 seqlets showing non-zero activation
2. Motif Discovery:
   - MEME: discovers up to 2 motifs per node (E-value < 0.05)
   - TomTom: matches against the Vierstra motif database (p-, q-, and E-values < 0.05)
3. Visualization: SAE-vis server for interactive exploration
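The feature-selection criterion above can be sketched as a simple filter. The function name and the shape of the activation matrix (`num_seqlets × num_nodes`) are hypothetical; only the ≥1000-seqlet threshold comes from the workflow description.

```python
import numpy as np

def select_nodes(acts, min_seqlets=1000):
    """Keep SAE nodes that fire (non-zero activation) on >= min_seqlets seqlets.

    acts: hypothetical (num_seqlets, num_nodes) activation matrix.
    Returns the indices of nodes passing the threshold.
    """
    firing_counts = (acts != 0).sum(axis=0)
    return np.flatnonzero(firing_counts >= min_seqlets)

# Toy example: three nodes, only node 0 fires often enough
acts = np.zeros((5000, 3))
acts[:2000, 0] = 1.0   # node 0 fires on 2000 seqlets -> kept
acts[:500, 1] = 1.0    # node 1 fires on 500 seqlets  -> dropped
keep = select_nodes(acts)  # -> array([0])
```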
```bash
python scripts/train_one_instance.py --topk 0.05 --exp_factor 4 --lr 1e-5
python scripts/infer_one_instance.py --topk 0.05 --exp_factor 4 --top_acts 16
python scripts/save_seqlets.py
```

The features are visualized using the SAE-vis server: motifscout.com.
If you use this code, please cite:
- Borzoi paper for the base model
MIT License.
