Add attention head characterization scripts and ablation experiments#403

Open
lee-goodfire wants to merge 95 commits into dev from feature/attn_plots
Conversation


lee-goodfire commented on Feb 19, 2026

Description

Add a suite of attention head detection scripts, component-level induction analysis, new QK attention contribution plot types, and a position-specific ablation experiment with previous-token redundancy testing.

New detection scripts (each in spd/scripts/detect_*/):

  • Previous-token heads (+ random-token control variant)
  • Induction heads (synthetic repeated sequences)
  • Duplicate-token heads
  • Positional heads (offset profiles + BOS attention)
  • Delimiter heads
  • Successor heads (ordinal sequences with random-word controls)
  • S-inhibition heads (IOI prompts, OV copy scores)
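As a sketch of what the previous-token detector measures, the score is each head's mean post-softmax attention from query position i to key position i-1. The function name and tensor layout below are illustrative, not the script's actual API:

```python
import torch

def prev_token_score(attn: torch.Tensor) -> torch.Tensor:
    """Mean attention each head pays to the immediately preceding position.

    attn: [batch, n_heads, seq, seq] post-softmax attention weights.
    Returns a per-head score of shape [n_heads].
    """
    seq = attn.shape[-1]
    rows = torch.arange(1, seq)  # query positions 1..seq-1
    cols = rows - 1              # key position i-1 for each query i
    # Advanced indexing over the last two dims picks the sub-diagonal,
    # giving shape [batch, n_heads, seq-1]; average over batch and positions.
    return attn[..., rows, cols].mean(dim=(0, -1))
```

A perfect previous-token head scores 1.0; the random-token control variant reruns the same measurement on shuffled tokens to separate positional from content-driven attention.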

Component-level analysis (spd/scripts/characterize_induction_components/):

  • Weight concentration, ablation, cross-head interaction, and "why not perfect" analysis for L2H4

Attention ablation experiment (spd/scripts/attention_ablation_experiment/):

  • Position-specific ablation: ablate heads or SPD components at a single randomly chosen position per sample, measure effect on attention outputs via normalized inner product (NIP) and cosine similarity
  • Head ablation: zero a head's attention output at one position
  • Component ablation: zero q/k component masks at position-specific locations (q at t, k at t-1)
  • Previous-token redundancy test (--prev_token_test): tests whether ablating a head/component makes value ablation at t-1 redundant, quantifying how much of the prev-token information channel the ablation captures
    • Six forward passes per sample: baseline, A, B(all), B(specific), A+B(all), A+B(specific)
    • Seven pairwise comparisons including B-alone controls for interaction analysis
    • Key finding: SPD components (q:279, k:177) capture ~83% of prev-token value flow vs ~35% for full head ablation, providing evidence that SPD finds cross-head structure
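The PR does not spell out the exact NIP formula; a minimal sketch under one common convention (projection of the ablated output onto the baseline, normalized by the baseline's squared norm) might look like:

```python
import torch
import torch.nn.functional as F

def nip(ablated: torch.Tensor, baseline: torch.Tensor) -> torch.Tensor:
    # Normalized inner product along the last dim: 1.0 means the ablation
    # left the baseline direction untouched, 0.0 means it was fully removed.
    dot = (ablated * baseline).sum(dim=-1)
    return dot / (baseline * baseline).sum(dim=-1)

def cos_sim(ablated: torch.Tensor, baseline: torch.Tensor) -> torch.Tensor:
    # Direction-only comparison, insensitive to magnitude changes.
    return F.cosine_similarity(ablated, baseline, dim=-1)
```

Under this convention, NIP is sensitive to magnitude shrinkage while cosine similarity is not, which is why reporting both is informative.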

Plot enhancements (plot_qk_c_attention_contributions):

  • Per-head heatmaps across offsets
  • Head-vs-sum scatter plots
  • Pair contribution line plots (summed and per-head)

Shared utility (spd/scripts/collect_attention_patterns.py):

  • Extracted duplicated _collect_attention_patterns from all 7 detect scripts into a shared module

Bug fix:

  • Fixed GQA bug in detect_s_inhibition_heads: OV copy scores indexed W_V with Q-head indices instead of KV-head indices
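For context, the fix amounts to mapping Q-head indices to their shared KV-head indices before indexing W_V. A minimal sketch, assuming standard Llama-style GQA grouping where consecutive blocks of Q heads share one KV head:

```python
def kv_head_index(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    # Under grouped-query attention, n_q_heads // n_kv_heads consecutive
    # Q heads share each KV head, so W_V (and W_K) must be indexed with
    # q_head // group_size rather than q_head itself.
    assert n_q_heads % n_kv_heads == 0
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size
```

Indexing W_V with the raw Q-head index silently reads the wrong value head whenever n_kv_heads < n_q_heads, which is exactly the bug the PR fixes.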

Harvest schema migration:

  • Updated all 10 plotting scripts for new harvest schema

Motivation and Context

Characterize attention head behavior in decomposed LlamaSimpleMLP models to understand component-level circuits. The ablation experiments provide quantitative evidence that SPD components correspond to specific attention mechanisms (previous-token behavior) in a way that cuts across individual attention heads.

How Has This Been Tested?

  • All scripts run against wandb:goodfire/spd/runs/s-275c8f21
  • Ablation experiments verified with 1024 samples for both head (L1H1) and component (q:279, k:177) modes
  • Previous-token redundancy test verified with B-alone controls and interaction analysis
  • All pass basedpyright and ruff checks (make check)

Does this PR introduce a breaking change?

No. All changes are additive (new scripts and plot functions). The harvest schema update aligns with the migration already merged on dev.

🤖 Generated with Claude Code

claude-spd1 and others added 30 commits February 19, 2026 13:57
Scatter plots of mean CI values per component, arranged in a grid by module.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Per-head Frobenius norm heatmaps for each attention projection (q/k/v/o),
showing how each component's weight is distributed across heads.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bar charts of head-spread entropy per component for each attention projection,
showing whether components are concentrated on one head or spread across many.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Target vs reconstructed weight matrices per attention projection, with
individual component weight visualizations in paginated 4x4 grids.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Per-head activation magnitude heatmaps combining U-norm head structure
with actual activation magnitudes from harvest data.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Weight-only q·k attention contribution heatmaps between q and k subcomponents.
Single grid per layer with summed (all heads) and per-head breakdowns. Uses
V-norm-scaled U dot products to account for unnormalized magnitude split.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
K/V component co-activation heatmaps using pre-computed harvest co-occurrence
data. Three metrics per layer: CI co-occurrence counts, phi coefficient
(binary correlation), and Jaccard similarity of firing sets.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Generates per-layer PDF reports and companion markdown files tracing
attention component interactions: Q->K weight-only attention contributions
and K->V CI co-occurrence associations, with autointerp labels/reasoning.

Also excludes detect_* scripts from basedpyright (pre-existing type errors).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The loader was hardcoding is_tokenized=False, causing failures on
pre-tokenized datasets like danbraunai/pile-uncopyrighted-tok.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Non-streaming was impractical for large pre-tokenized datasets like
pile-uncopyrighted-tok (491M rows shuffled/mapped just for 512 samples).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The ~40k component model OOMs at batch_size=512 on 140GB H200.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add four new plot types: per-head heatmaps across offsets, head-vs-sum
scatter, and pair contribution line plots (summed and per-head). Also
add top_n_pairs parameter and trim default offsets.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Compute mean attention to position i-1 for each head across real text
data. Includes a random-tokens control variant to distinguish positional
from content-driven attention patterns.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use synthetic repeated token sequences [A B C | A B C] to measure
induction attention (from second occurrence to token following first
occurrence) for each head.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
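A sketch of the induction score this commit describes, assuming sequences of the exact form [x_0..x_{n-1} x_0..x_{n-1}] (function name and tensor layout are illustrative):

```python
import torch

def induction_score(attn: torch.Tensor, half_len: int) -> torch.Tensor:
    """Per-head induction score on a repeated sequence of length 2*half_len.

    For each query position i in the second half, the current token's first
    occurrence is at i - half_len, so an induction head attends to the token
    *after* it, at i - half_len + 1. attn: [batch, n_heads, seq, seq].
    """
    seq = attn.shape[-1]
    queries = torch.arange(half_len, seq)  # second-half query positions
    keys = queries - half_len + 1          # token following first occurrence
    # Shape [batch, n_heads, half_len]; average over batch and positions.
    return attn[..., queries, keys].mean(dim=(0, -1))
```

A perfect induction head scores 1.0; random synthetic tokens keep content-based confounds (e.g. frequent bigrams in real text) out of the measurement.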
Measure mean attention weight landing on prior positions holding the
same token as the current query position, on real text data.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Compute positional offset profiles (attention vs relative offset) and
BOS attention scores for each head on real text data. Produces three
plots: max-offset heatmap, BOS heatmap, and per-head profile lines.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Measure fraction of each head's attention landing on delimiter tokens
(periods, commas, etc.) on real text, compared to baseline delimiter
frequency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use ordinal sequences (digits, letters, days, months) with random-word
controls to isolate successor-specific attention patterns for each head.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use IOI prompts to measure S2-position attention and OV copy scores,
identifying heads that attend to repeated subject names and inhibit
copying via negative OV contributions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Deep-dive into L2H4 at the SPD component level: weight concentration
analysis, per-component ablation effects on induction score, cross-head
interactions, and analysis of what prevents perfect induction behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summarize findings from all 7 detection scripts and the component-level
analysis: multi-functional early heads, L1H1->L2H4 induction circuit,
layer-2 BOS sink pattern, and cross-cutting observations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document the mean_ci -> firing_density + mean_activations schema change,
the workaround for incompatible harvest sub-runs, and the list of files
affected by the migration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
claude-spd1 and others added 7 commits February 25, 2026 15:17
Replace _generate_greedy's ablate_first_only boolean with two explicit
functions:
- _generate_greedy_one_shot: ablation on first step only, then clean
  generation (for position-specific interventions)
- _generate_greedy_persistent: ablation re-applied every step, since
  no KV cache means the model recomputes from scratch each step
  (for "model modification" conditions like zeroing all values)

Both share _forward_once for the actual forward pass logic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All ablations now apply on the first generated token only. Deleted
_generate_greedy_persistent since we never want ablations to persist
across generation steps.

Renamed "[persist]" conditions to "Vals @ALL prev" — these zero values
at all prompt positions but only for one prediction, same as other
value ablation conditions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace _generate_greedy_one_shot loop with _predict_next_token that
does exactly one forward pass and returns one token ID. Remove gen_len
parameter entirely. HTML tables now have one column.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each ablation condition now shows:
- Top-k tokens whose logits increased most vs baseline
- Top-k tokens whose logits decreased most vs baseline
- Change in the baseline's predicted token's logit

Baseline is target model for non-SPD conditions, SPD baseline for
component conditions.

Also: Prediction class replaces bare int return, ConditionResult
tuple tracks which baseline each condition uses.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Token-per-line layout with flexbox alignment, color-coded values
(green=increase, red=decrease), wider page, proper vertical alignment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Inline token(value) format with nbsp separators instead of one-per-line.
Each row is now one line tall. Values still color-coded.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
One cell per token instead of all crammed into one cell. Header shows
inc 1..5 and dec 1..5 columns. Each cell has token + colored value.
Compact single-line rows.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
claude-spd1 and others added 22 commits February 25, 2026 16:57
Also relax t >= 2 assertion to t >= 1, skip t-1,t-2 condition when t < 2.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Short prompts are better for studying previous-token effects since the
ablated position is close to the prediction. The t >= 1 assertion and
t >= 2 guard on the t-1,t-2 condition handle 2-token prompts correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each logit cell shows the token on the first line and the colored
change value on the second. Increased default top_k from 5 to 20.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Expanded crafted prompts to 40 with focus on prev-token behavior:
bigrams, fixed expressions, sequences, code syntax, repetition.
Change values in logit cells now 9px for less visual noise.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace separate prompt div and token info line with inline prompt
text in a highlighted code box directly in the h2 heading.
Before: "Crafted: HTML | ablation at t=4" + separate prompt div
After:  "Crafted: HTML | <html><body> | t=4"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tokens separated by | delimiters. Blue = ablated position (t),
yellow = previous position (t-1), grey = other tokens. Makes
tokenization boundaries and ablation target immediately visible.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds individual sample plots (up to 10) showing raw attention distributions
and diffs with actual token text on x-axis. Also reverses x-axis so query
position is on the right, and increases legend size.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…arability

Each head's QK dot product contributions are now divided by T^h (the total
QK dot product at offset 0), so pair contributions within a head sum to 1
at offset 0 and are directly interpretable as fractions. Removes the
1/sqrt(d_head) scaling and averages across heads instead of summing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
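A sketch of the T^h normalization described above (tensor layout and function name are assumptions; T^h is taken to be nonzero):

```python
import torch

def normalize_pair_contribs(contribs: torch.Tensor) -> torch.Tensor:
    # contribs: [n_heads, n_pairs, n_offsets] raw per-pair QK dot-product
    # contributions, with index 0 on the offset axis holding offset 0.
    # T^h is each head's total QK dot product at offset 0; dividing by it
    # makes the pair contributions within a head sum to 1 at offset 0, so
    # they read directly as fractions of the head's attention logit.
    t_h = contribs[:, :, 0].sum(dim=1, keepdim=True).unsqueeze(-1)  # [n_heads,1,1]
    return contribs / t_h
```

Averaging (rather than summing) the normalized values across heads then keeps the scale comparable regardless of how many heads contribute.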
…aseline

Divides (ablated - baseline) by the mean baseline attention at each offset
across all heads, giving a stable measure of relative attention change that
doesn't overweight offsets with small absolute attention.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
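A sketch of this normalization (tensor layout assumed):

```python
import torch

def fractional_attention_change(
    ablated: torch.Tensor, baseline: torch.Tensor
) -> torch.Tensor:
    # ablated, baseline: [n_heads, n_offsets] mean attention per head/offset.
    # Dividing (ablated - baseline) by the cross-head mean baseline at each
    # offset gives a relative change that does not overweight offsets where
    # the absolute attention is tiny.
    denom = baseline.mean(dim=0, keepdim=True)  # [1, n_offsets]
    return (ablated - baseline) / denom
```

Using the cross-head mean in the denominator (rather than each head's own baseline) keeps the measure stable for heads whose baseline attention at an offset is near zero.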
Ablates each of the top N QK component pairs (ranked by attention
contribution at a given offset) and plots the fractional attention change
per pair, normalized by cross-head mean baseline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ablates individual Q and K components (ranked by mean CI) and produces
three plot types per projection: raw attention with baseline, attention
diff, and fractional change. Supports configurable k_offset for ablating
K components at different positions relative to the query.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolve conflict in spd/clustering/dataset.py by taking dev's cleaner
version (removed old pretrained_model_name lookup and config_kwargs).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…n logic

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Includes five overlap metric variants: unweighted cosine, strength-weighted,
data-weighted, variance-weighted, and data+strength combined. Also includes
psi reading strength profiles, t-SNE/PCA bubble plots, and grid visualizations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Describes motivation and detailed equations for all five overlap variants:
unweighted, strength-weighted, data-weighted, variance-weighted, and combined.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…verlap

Removes t-SNE, PCA bubble plots, psi correlation/scatter grids, singular
value histogram, grid squares, and all related infrastructure (reading
strengths, attention offset loading, unit-basis variants). Renames main
entry point to plot_wv_subspace_overlap.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Directory and script renamed to reflect current scope. Old non-overlap
output files moved to out/<run_id>/archive/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds three new plotting functions to the W_V subspace overlap script:
- Combined side-by-side heatmap (unweighted + variance-weighted overlap)
- Component-head amplification heatmap showing ||W_V^h v_c|| per SPD component
- Shared helpers for Gram-matrix overlap and lower-triangular heatmap rendering

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Conflicts:
#	spd/scripts/detect_prev_token_heads/detect_prev_token_heads.py
#	spd/scripts/detect_prev_token_heads/detect_prev_token_heads_random_tokens.py