Neuroscope

SAE-instrumented LLM inference server. Run a local LLM with real-time Sparse Autoencoder (SAE) feature extraction — see what concepts the model is "thinking about" as it generates each token.

Neuroscope wraps mistral.rs for inference and hooks into the model's forward pass to extract activations, run them through a pre-trained SAE encoder, and stream the top activated features over SSE.

How it works

 Chat API (:8080)              Features API (:8081)
 POST /v1/chat/completions     GET /v1/features/stream → SSE
 GET  /v1/models               GET /v1/features/labels → JSON
        │                              ▲
        ▼                              │
 ┌─────────────────────────────────────┤
 │  Inference Engine (mistral.rs)      │
 │                                     │
 │  Transformer layer 20 ──hook──► SAE Encoder
 │                                  │
 │  Token output ───────────────► broadcast channel
 └─────────────────────────────────────┘

The chat API is fully OpenAI-compatible — use it with any existing client (Continue, Open WebUI, curl, etc.). The features API runs on a separate port and streams SAE feature activations as the model generates tokens.

Quickstart

Prerequisites

  • Rust 1.88+
  • ~5 GB disk space for model weights (downloaded automatically on first run)
  • A HuggingFace account with access to google/gemma-2-2b-it
  • macOS Metal builds: The Metal Toolchain must be installed. If you get cannot execute tool 'metal' errors, run:
    xcodebuild -downloadComponent MetalToolchain

Build

# macOS (Metal GPU)
cargo build --release -p neuroscope-cli --features metal

# NVIDIA GPU
cargo build --release -p neuroscope-cli --features cuda

# CPU only
cargo build --release -p neuroscope-cli

Getting started

All examples below use neuroscope as shorthand for ./target/release/neuroscope.

On first run, model and SAE weights are downloaded from HuggingFace and cached locally. Before serving, you need calibration data (for filtering noisy features) and feature labels (for human-readable descriptions).

Option A: Pull pre-generated data (~1 min):

neuroscope pull

Option B: Generate from scratch (~2+ hours, requires GPU + API key):

neuroscope calibrate run          # ~10 min
neuroscope labels generate        # ~2 hours, requires OPENROUTER_API_KEY or ANTHROPIC_API_KEY

Both commands auto-download a WikiText corpus and resume if interrupted. See Environment variables for API key setup.

Then serve:

neuroscope serve

You can also skip the data setup entirely and go straight to serve — it will fall back to Neuronpedia labels and disable filtering, but the results will be noisier.

Use

In one terminal, listen for feature activations:

curl -N http://localhost:8081/v1/features/stream

In another, send a chat request:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-2-2b-it",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }'
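The same request can be issued from Python using only the standard library. A minimal sketch (building the request needs no server; sending it requires neuroscope serve to be running):

```python
import json
import urllib.request

def chat_request(messages, model="gemma-2-2b-it", base="http://localhost:8080"):
    """Build an OpenAI-style chat completion request for the local server."""
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = chat_request([{"role": "user", "content": "What is the capital of France?"}])
# With the server running: urllib.request.urlopen(req).read()
```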

The features stream will emit SSE events like:

event: feature_activation
data: {"token_index":0,"token":"Paris","layer":20,"top_features":[{"index":4521,"label":"geography or place names","activation":3.82},{"index":12033,"label":"European countries and capitals","activation":2.14}]}
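Consuming this stream from code mostly means splitting the response on blank lines and JSON-decoding each `data:` payload. A minimal Python sketch of that parsing step (`parse_sse_event` is an illustrative helper of ours, not part of Neuroscope):

```python
import json

def parse_sse_event(block: str):
    """Split one SSE event block into its event name and decoded JSON payload."""
    event, data = None, None
    for line in block.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data = json.loads(line[len("data:"):].strip())
    return event, data

raw = (
    "event: feature_activation\n"
    'data: {"token_index":0,"token":"Paris","layer":20,'
    '"top_features":[{"index":4521,"label":"geography or place names","activation":3.82}]}'
)
event, payload = parse_sse_event(raw)
top = payload["top_features"][0]  # strongest feature for this token
```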

Environment variables

Neuroscope loads a .env file from the working directory if present.

Variable             Purpose
OPENROUTER_API_KEY   API key for OpenRouter (used for label generation)
ANTHROPIC_API_KEY    API key for Anthropic (fallback for label generation)
LABELER_MODEL        LLM model for label generation [default: deepseek/deepseek-v3.2]
LABELER_BACKEND      API backend: auto, openrouter, anthropic, or a custom base URL [default: auto]

With auto backend, Neuroscope checks OPENROUTER_API_KEY first, then falls back to ANTHROPIC_API_KEY.
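That selection order is easy to mirror in surrounding tooling. An illustrative sketch of the documented behavior (`resolve_backend` is a hypothetical helper, not Neuroscope code):

```python
def resolve_backend(env):
    """Mirror the documented 'auto' order: OpenRouter first, then Anthropic."""
    if env.get("OPENROUTER_API_KEY"):
        return "openrouter"
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    raise RuntimeError("set OPENROUTER_API_KEY or ANTHROPIC_API_KEY")
```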

CLI reference

neuroscope serve

Start the inference server with SAE instrumentation.

neuroscope serve [OPTIONS]

Options:
  --model <ID>              HuggingFace model ID [default: google/gemma-2-2b-it]
  --sae <ID>                HuggingFace SAE repo ID [default: google/gemma-scope-2b-pt-res]
  --sae-path <PATH>         Path within SAE repo [default: layer_20/width_16k/average_l0_71/params.npz]
  --sae-local-path <PATH>   Local path to SAE npz file (bypasses HF download)
  --sae-layer <N>           Transformer layer to hook [default: 20]
  --top-k <N>               Number of top features per token [default: 10]
  --port <PORT>             Chat API port [default: 8080]
  --features-port <PORT>    Features SSE port [default: 8081]
  --feature-filter <MODE>   Filter mode: none, frequency, surprise, combined [default: combined]
  --filter-threshold <F>    Filter threshold (e.g. max firing rate) [default: 0.5]
  --calibration-path <PATH> Path to calibration stats JSON (auto-detected from cache if omitted)
  --labeler <MODEL>         Labeler model whose labels to use [default: deepseek/deepseek-v3.2] [env: LABELER_MODEL]

Feature filtering is enabled by default. To disable it:

neuroscope serve --feature-filter none

Filter modes:

  • frequency — exclude features that fire more than --filter-threshold fraction of the time (removes "always-on" features)
  • surprise — rank features by how surprising their activation is relative to calibration statistics
  • combined — exclude high-frequency features, then rank the rest by surprise (default)
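The way the modes compose can be sketched in a few lines of Python. This is an illustrative model of the behavior described above, not Neuroscope's internals; `apply_filter` and its argument names are our own:

```python
def apply_filter(features, firing_rate, surprise, mode="combined", threshold=0.5):
    """Illustrative composition of the three filter modes.

    features:    list of (feature_index, activation) pairs
    firing_rate: feature_index -> fraction of calibration tokens it fired on
    surprise:    feature_index -> surprise score from calibration stats
    """
    if mode in ("frequency", "combined"):
        # Drop "always-on" features that fire too often.
        features = [f for f in features if firing_rate.get(f[0], 0.0) <= threshold]
    if mode in ("surprise", "combined"):
        # Rank the remainder by how surprising the activation is.
        features = sorted(features, key=lambda f: surprise.get(f[0], 0.0), reverse=True)
    return features

features = [(1, 3.0), (2, 2.0)]
rates = {1: 0.9, 2: 0.1}   # feature 1 fires on 90% of tokens ("always on")
scores = {1: 0.2, 2: 5.0}
kept = apply_filter(features, rates, scores, mode="combined")
```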

neuroscope corpus

Build or inspect the text corpus used for calibration and label generation.

neuroscope corpus build [OPTIONS]

Options:
  --samples <N>             Number of samples to download [default: 2000]
  --output <DIR>            Output directory [default: ~/.cache/neuroscope/corpus/]
  --dataset <NAME>          HuggingFace dataset [default: wikitext]
  --config <NAME>           Dataset config/subset [default: wikitext-103-raw-v1]

neuroscope corpus show [--path <DIR>]

Downloads WikiText-103 parquet files from HuggingFace Hub and extracts text samples locally. No Python dependency needed.

You don't need to run this manually — calibrate run and labels generate will auto-download the corpus if --corpus-path is not provided.

neuroscope calibrate

Collect per-feature firing statistics from a text corpus. This data is used by the feature filter.

neuroscope calibrate run [OPTIONS]

Options:
  --model <ID>              HuggingFace model ID [default: google/gemma-2-2b-it]
  --sae <ID>                HuggingFace SAE repo ID [default: google/gemma-scope-2b-pt-res]
  --sae-path <PATH>         Path within SAE repo [default: layer_20/width_16k/average_l0_71/params.npz]
  --sae-local-path <PATH>   Local path to SAE npz file
  --sae-layer <N>           Transformer layer [default: 20]
  --corpus-path <DIR>       Directory of .txt files (auto-downloads WikiText if omitted)
  --samples <N>             Max samples to process [default: 1000]
  --output <PATH>           Output path (defaults to ~/.cache/neuroscope/calibration/)

neuroscope calibrate show --stats <PATH> [--top <N>]

neuroscope labels

Generate, score, and inspect feature labels.

labels generate

Run a corpus through the model+SAE to collect max-activating examples, then call an LLM API to auto-generate human-readable labels for each feature.

neuroscope labels generate [OPTIONS]

Options:
  --model <ID>                   HuggingFace model ID [default: google/gemma-2-2b-it]
  --sae <ID>                     HuggingFace SAE repo ID [default: google/gemma-scope-2b-pt-res]
  --sae-path <PATH>              Path within SAE repo
  --sae-local-path <PATH>        Local path to SAE npz file
  --sae-layer <N>                Transformer layer [default: 20]
  --corpus-path <DIR>            Directory of .txt files (auto-downloads WikiText if omitted)
  --samples <N>                  Max samples to process [default: 5000]
  --examples-per-feature <N>     Max-activating examples per feature [default: 20]
  --labeler <MODEL>              LLM model for labeling [default: deepseek/deepseek-v3.2] [env: LABELER_MODEL]
  --labeler-backend <BACKEND>    API backend: auto, openrouter, anthropic, or custom URL [default: auto] [env: LABELER_BACKEND]
  --concurrency <N>              Max concurrent API requests [default: 50]
  --score                        Also run detection scoring after generation
  --output <PATH>                Output path (defaults to ~/.cache/neuroscope/autointerp_labels/<labeler>/)

Labels are namespaced by labeler model, so you can generate and compare labels from different LLMs:

# Generate with DeepSeek V3.2 (default, via OpenRouter)
neuroscope labels generate --samples 5000 --concurrency 50

# Generate with Claude Haiku for comparison
neuroscope labels generate --samples 5000 --concurrency 50 \
  --labeler anthropic/claude-haiku-4-5

# Generate with GPT-4o
neuroscope labels generate --samples 5000 --concurrency 50 \
  --labeler openai/gpt-4o

The pipeline has two phases, both with checkpointing:

  1. Corpus pass (~75 min for 5K samples) — runs text through the model+SAE, saves checkpoint every 50 samples. Shared across all labelers.
  2. Label generation (~30 min via OpenRouter) — calls the LLM API for each feature, saves each label as it completes.

Both phases resume automatically if interrupted.

Uses structured output (JSON schema) when the backend supports it (OpenRouter, OpenAI-compatible) for reliable label parsing.

labels score

Score existing labels using detection accuracy.

neuroscope labels score --labels <PATH> [--labeler <MODEL>] [--threshold <F>]

labels show

Show labels for specific features.

neuroscope labels show --features 4521,3022,11612

neuroscope clean

Manage cached data. All clean operations move data to a trash folder and are reversible.

neuroscope clean corpus       # Trash downloaded corpus text files
neuroscope clean checkpoints  # Trash corpus pass checkpoints
neuroscope clean labels       # Trash all generated feature labels
neuroscope clean calibration  # Trash calibration statistics
neuroscope clean all          # Trash everything

neuroscope clean show         # Show what's in the trash
neuroscope clean undo         # Restore the most recently trashed item
neuroscope clean empty        # Permanently delete everything in the trash

neuroscope models / neuroscope sae

neuroscope models list      # Show HuggingFace cache info
neuroscope models pull <ID> # Pre-download a model
neuroscope sae list         # Show cached SAEs
neuroscope sae pull <ID>    # Pre-download an SAE

neuroscope push / neuroscope pull

Sync labels and calibration data with HuggingFace, so new users can skip the hours of compute needed to generate them.

neuroscope push [OPTIONS]

Options:
  --repo <ID>       HuggingFace dataset repo [default: cjroth/neuroscope]
  --message <MSG>   Commit message

neuroscope pull [OPTIONS]

Options:
  --repo <ID>       HuggingFace dataset repo [default: cjroth/neuroscope]

Pull downloads labels and calibration data directly into the local cache — no further setup needed, just neuroscope serve after pulling.

Push requires a HuggingFace write token (huggingface-cli login).

Label priority

When loading feature labels, Neuroscope checks sources in this order:

  1. Auto-interp labels (namespaced by --labeler model) — highest quality
  2. Auto-interp labels (legacy non-namespaced path)
  3. Neuronpedia cache (fetched from neuronpedia.org)
  4. Neuronpedia API (live fetch, cached on first use)
  5. Numeric fallback (feature_0, feature_1, ...)
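The fallback chain amounts to a simple ordered lookup. An illustrative Python sketch (all names are hypothetical; `fetch_live` stands in for the Neuronpedia API call):

```python
def resolve_label(index, autointerp, legacy, neuronpedia_cache, fetch_live):
    """Try each label source in the documented priority order."""
    for source in (autointerp, legacy, neuronpedia_cache):
        if index in source:
            return source[index]
    live = fetch_live(index)
    if live is not None:
        neuronpedia_cache[index] = live  # cached on first use
        return live
    return f"feature_{index}"  # numeric fallback
```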

Input normalization

Gemma Scope SAEs were trained on RMS-normalized hidden states. Neuroscope applies the same normalization before encoding: each hidden state vector is divided by its root mean square, scaling it to unit RMS before the SAE matmul. This prevents distorted activations from magnitude differences between raw and training-time inputs.
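The normalization itself is small. A pure-Python sketch of unit-RMS scaling (real implementations typically operate on tensors and add a small epsilon to guard against zero vectors):

```python
import math

def rms_normalize(x):
    """Scale a vector to unit RMS: divide each element by sqrt(mean(x_i^2))."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / rms for v in x]

y = rms_normalize([3.0, 4.0])  # the result has RMS exactly 1
```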

API

POST /v1/chat/completions (port 8080)

Standard OpenAI chat completions API. Supports "stream": true.

GET /v1/models (port 8080)

Lists the loaded model.

GET /v1/features/stream (port 8081)

SSE stream of feature activations. Each feature_activation event corresponds to one generated token. A generation_complete event is sent when generation finishes.

When feature filtering is enabled, events include a filtered_count field showing how many features were suppressed. Features may also include a surprise score when using surprise or combined filter modes.

GET /v1/features/labels (port 8081)

Returns the full label map as JSON — a mapping from feature index to human-readable description.

Architecture

The project is split into four crates:

Crate              Purpose
neuroscope-core    SAE encoder, types, labels, calibration, filtering, auto-interp, corpus download, scoring
neuroscope-engine  Inference engine wrapping mistral.rs with SAE hook wiring, corpus runner
neuroscope-server  Axum HTTP servers for chat API and features SSE
neuroscope-cli     CLI binary tying everything together

Default model and SAE

Model: Gemma 2 2B IT — small enough to run on a laptop, with the best SAE coverage of any open model.

SAE: Gemma Scope layer 20, 16K width — 16,384 learned features extracted from the residual stream at layer 20 of 26. Deep enough to capture semantic concepts (not just syntax). Uses JumpReLU activation with ~71 features firing per token on average.

Why two ports?

The chat API on :8080 is strictly OpenAI-compatible so it works as a drop-in replacement with existing tools. The features stream on :8081 is a separate concern — this keeps the chat API clean and lets you consume features independently (terminal logger, web UI, or both at once via the broadcast channel).

Tests

# Unit tests (no model download needed)
cargo test

# Integration tests (requires model weights)
NEUROSCOPE_INTEGRATION_TESTS=1 cargo test

License

MIT
