Add attention head characterization scripts and ablation experiments#403

Open
lee-goodfire wants to merge 95 commits into dev from feature/attn_plots
Conversation


lee-goodfire commented on Feb 19, 2026

Description

Add a suite of attention head detection scripts, component-level induction analysis, new QK attention contribution plot types, and a position-specific ablation experiment with previous-token redundancy testing.

New detection scripts (each in spd/scripts/detect_*/):

  • Previous-token heads (+ random-token control variant)
  • Induction heads (synthetic repeated sequences)
  • Duplicate-token heads
  • Positional heads (offset profiles + BOS attention)
  • Delimiter heads
  • Successor heads (ordinal sequences with random-word controls)
  • S-inhibition heads (IOI prompts, OV copy scores)
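As a sketch of what the previous-token detector measures, the score is each head's mean post-softmax attention from query position i to key position i-1. The function name and tensor layout below are illustrative, not the script's actual API:

```python
import torch

def prev_token_score(attn: torch.Tensor) -> torch.Tensor:
    """Mean attention each head pays to the immediately preceding position.

    attn: [batch, n_heads, seq, seq] post-softmax attention weights.
    Returns a per-head score of shape [n_heads].
    """
    seq = attn.shape[-1]
    rows = torch.arange(1, seq)  # query positions 1..seq-1
    cols = rows - 1              # key position i-1 for each query i
    # Advanced indexing over the last two dims picks the sub-diagonal,
    # giving shape [batch, n_heads, seq-1]; average over batch and positions.
    return attn[..., rows, cols].mean(dim=(0, -1))
```

A perfect previous-token head scores 1.0; the random-token control variant reruns the same measurement on shuffled tokens to separate positional from content-driven attention.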

Component-level analysis (spd/scripts/characterize_induction_components/):

  • Weight concentration, ablation, cross-head interaction, and "why not perfect" analysis for L2H4

Attention ablation experiment (spd/scripts/attention_ablation_experiment/):

  • Position-specific ablation: ablate heads or SPD components at a single randomly chosen position per sample, measure effect on attention outputs via normalized inner product (NIP) and cosine similarity
  • Head ablation: zero a head's attention output at one position
  • Component ablation: zero q/k component masks at position-specific locations (q at t, k at t-1)
  • Previous-token redundancy test (--prev_token_test): tests whether ablating a head/component makes value ablation at t-1 redundant, quantifying how much of the prev-token information channel the ablation captures
    • Six forward passes per sample: baseline, A, B(all), B(specific), A+B(all), A+B(specific)
    • Seven pairwise comparisons including B-alone controls for interaction analysis
    • Key finding: SPD components (q:279, k:177) capture ~83% of prev-token value flow vs ~35% for full head ablation, providing evidence that SPD finds cross-head structure
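The PR does not spell out the exact NIP formula; a minimal sketch under one common convention (projection of the ablated output onto the baseline, normalized by the baseline's squared norm) might look like:

```python
import torch
import torch.nn.functional as F

def nip(ablated: torch.Tensor, baseline: torch.Tensor) -> torch.Tensor:
    # Normalized inner product along the last dim: 1.0 means the ablation
    # left the baseline direction untouched, 0.0 means it was fully removed.
    dot = (ablated * baseline).sum(dim=-1)
    return dot / (baseline * baseline).sum(dim=-1)

def cos_sim(ablated: torch.Tensor, baseline: torch.Tensor) -> torch.Tensor:
    # Direction-only comparison, insensitive to magnitude changes.
    return F.cosine_similarity(ablated, baseline, dim=-1)
```

Under this convention, NIP is sensitive to magnitude shrinkage while cosine similarity is not, which is why reporting both is informative.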

Plot enhancements (plot_qk_c_attention_contributions):

  • Per-head heatmaps across offsets
  • Head-vs-sum scatter plots
  • Pair contribution line plots (summed and per-head)

Shared utility (spd/scripts/collect_attention_patterns.py):

  • Extracted duplicated _collect_attention_patterns from all 7 detect scripts into a shared module

Bug fix:

  • Fixed GQA bug in detect_s_inhibition_heads: OV copy scores indexed W_V with Q-head indices instead of KV-head indices
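For context, the fix amounts to mapping Q-head indices to their shared KV-head indices before indexing W_V. A minimal sketch, assuming standard Llama-style GQA grouping where consecutive blocks of Q heads share one KV head:

```python
def kv_head_index(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    # Under grouped-query attention, n_q_heads // n_kv_heads consecutive
    # Q heads share each KV head, so W_V (and W_K) must be indexed with
    # q_head // group_size rather than q_head itself.
    assert n_q_heads % n_kv_heads == 0
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size
```

Indexing W_V with the raw Q-head index silently reads the wrong value head whenever n_kv_heads < n_q_heads, which is exactly the bug the PR fixes.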

Harvest schema migration:

  • Updated all 10 plotting scripts for new harvest schema

Motivation and Context

Characterize attention head behavior in decomposed LlamaSimpleMLP models to understand component-level circuits. The ablation experiments provide quantitative evidence that SPD components correspond to specific attention mechanisms (previous-token behavior) in a way that cuts across individual attention heads.

How Has This Been Tested?

  • All scripts run against wandb:goodfire/spd/runs/s-275c8f21
  • Ablation experiments verified with 1024 samples for both head (L1H1) and component (q:279, k:177) modes
  • Previous-token redundancy test verified with B-alone controls and interaction analysis
  • All pass basedpyright and ruff checks (make check)

Does this PR introduce a breaking change?

No. All changes are additive (new scripts and plot functions). The harvest schema update aligns with the migration already merged on dev.

🤖 Generated with Claude Code

claude-spd1 and others added 30 commits February 19, 2026 13:57
Scatter plots of mean CI values per component, arranged in a grid by module.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Per-head Frobenius norm heatmaps for each attention projection (q/k/v/o),
showing how each component's weight is distributed across heads.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bar charts of head-spread entropy per component for each attention projection,
showing whether components are concentrated on one head or spread across many.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Target vs reconstructed weight matrices per attention projection, with
individual component weight visualizations in paginated 4x4 grids.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Per-head activation magnitude heatmaps combining U-norm head structure
with actual activation magnitudes from harvest data.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Weight-only q·k attention contribution heatmaps between q and k subcomponents.
Single grid per layer with summed (all heads) and per-head breakdowns. Uses
V-norm-scaled U dot products to account for unnormalized magnitude split.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
K/V component co-activation heatmaps using pre-computed harvest co-occurrence
data. Three metrics per layer: CI co-occurrence counts, phi coefficient
(binary correlation), and Jaccard similarity of firing sets.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Generates per-layer PDF reports and companion markdown files tracing
attention component interactions: Q->K weight-only attention contributions
and K->V CI co-occurrence associations, with autointerp labels/reasoning.

Also excludes detect_* scripts from basedpyright (pre-existing type errors).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The loader was hardcoding is_tokenized=False, causing failures on
pre-tokenized datasets like danbraunai/pile-uncopyrighted-tok.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Non-streaming was impractical for large pre-tokenized datasets like
pile-uncopyrighted-tok (491M rows shuffled/mapped just for 512 samples).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The ~40k component model OOMs at batch_size=512 on 140GB H200.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add four new plot types: per-head heatmaps across offsets, head-vs-sum
scatter, and pair contribution line plots (summed and per-head). Also
add top_n_pairs parameter and trim default offsets.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Compute mean attention to position i-1 for each head across real text
data. Includes a random-tokens control variant to distinguish positional
from content-driven attention patterns.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use synthetic repeated token sequences [A B C | A B C] to measure
induction attention (from second occurrence to token following first
occurrence) for each head.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
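A sketch of the induction score this commit describes, assuming sequences of the exact form [x_0..x_{n-1} x_0..x_{n-1}] (function name and tensor layout are illustrative):

```python
import torch

def induction_score(attn: torch.Tensor, half_len: int) -> torch.Tensor:
    """Per-head induction score on a repeated sequence of length 2*half_len.

    For each query position i in the second half, the current token's first
    occurrence is at i - half_len, so an induction head attends to the token
    *after* it, at i - half_len + 1. attn: [batch, n_heads, seq, seq].
    """
    seq = attn.shape[-1]
    queries = torch.arange(half_len, seq)  # second-half query positions
    keys = queries - half_len + 1          # token following first occurrence
    # Shape [batch, n_heads, half_len]; average over batch and positions.
    return attn[..., queries, keys].mean(dim=(0, -1))
```

A perfect induction head scores 1.0; random synthetic tokens keep content-based confounds (e.g. frequent bigrams in real text) out of the measurement.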
Measure mean attention weight landing on prior positions holding the
same token as the current query position, on real text data.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Compute positional offset profiles (attention vs relative offset) and
BOS attention scores for each head on real text data. Produces three
plots: max-offset heatmap, BOS heatmap, and per-head profile lines.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Measure fraction of each head's attention landing on delimiter tokens
(periods, commas, etc.) on real text, compared to baseline delimiter
frequency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use ordinal sequences (digits, letters, days, months) with random-word
controls to isolate successor-specific attention patterns for each head.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use IOI prompts to measure S2-position attention and OV copy scores,
identifying heads that attend to repeated subject names and inhibit
copying via negative OV contributions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Deep-dive into L2H4 at the SPD component level: weight concentration
analysis, per-component ablation effects on induction score, cross-head
interactions, and analysis of what prevents perfect induction behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summarize findings from all 7 detection scripts and the component-level
analysis: multi-functional early heads, L1H1->L2H4 induction circuit,
layer-2 BOS sink pattern, and cross-cutting observations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document the mean_ci -> firing_density + mean_activations schema change,
the workaround for incompatible harvest sub-runs, and the list of files
affected by the migration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
claude-spd1 and others added 7 commits February 25, 2026 15:17
Replace _generate_greedy's ablate_first_only boolean with two explicit
functions:
- _generate_greedy_one_shot: ablation on first step only, then clean
  generation (for position-specific interventions)
- _generate_greedy_persistent: ablation re-applied every step, since
  no KV cache means the model recomputes from scratch each step
  (for "model modification" conditions like zeroing all values)

Both share _forward_once for the actual forward pass logic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All ablations now apply on the first generated token only. Deleted
_generate_greedy_persistent since we never want ablations to persist
across generation steps.

Renamed "[persist]" conditions to "Vals @ALL prev" — these zero values
at all prompt positions but only for one prediction, same as other
value ablation conditions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace _generate_greedy_one_shot loop with _predict_next_token that
does exactly one forward pass and returns one token ID. Remove gen_len
parameter entirely. HTML tables now have one column.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each ablation condition now shows:
- Top-k tokens whose logits increased most vs baseline
- Top-k tokens whose logits decreased most vs baseline
- Change in the baseline's predicted token's logit

Baseline is target model for non-SPD conditions, SPD baseline for
component conditions.

Also: Prediction class replaces bare int return, ConditionResult
tuple tracks which baseline each condition uses.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Token-per-line layout with flexbox alignment, color-coded values
(green=increase, red=decrease), wider page, proper vertical alignment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Inline token(value) format with nbsp separators instead of one-per-line.
Each row is now one line tall. Values still color-coded.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
One cell per token instead of all crammed into one cell. Header shows
inc 1..5 and dec 1..5 columns. Each cell has token + colored value.
Compact single-line rows.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
claude-spd1 and others added 22 commits February 25, 2026 16:57
Also relax t >= 2 assertion to t >= 1, skip t-1,t-2 condition when t < 2.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Short prompts are better for studying previous-token effects since the
ablated position is close to the prediction. The t >= 1 assertion and
t >= 2 guard on the t-1,t-2 condition handle 2-token prompts correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each logit cell shows the token on the first line and the colored
change value on the second. Increased default top_k from 5 to 20.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Expanded crafted prompts to 40 with focus on prev-token behavior:
bigrams, fixed expressions, sequences, code syntax, repetition.
Change values in logit cells now 9px for less visual noise.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace separate prompt div and token info line with inline prompt
text in a highlighted code box directly in the h2 heading.
Before: "Crafted: HTML | ablation at t=4" + separate prompt div
After:  "Crafted: HTML | <html><body> | t=4"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tokens separated by | delimiters. Blue = ablated position (t),
yellow = previous position (t-1), grey = other tokens. Makes
tokenization boundaries and ablation target immediately visible.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds individual sample plots (up to 10) showing raw attention distributions
and diffs with actual token text on x-axis. Also reverses x-axis so query
position is on the right, and increases legend size.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…arability

Each head's QK dot product contributions are now divided by T^h (the total
QK dot product at offset 0), so pair contributions within a head sum to 1
at offset 0 and are directly interpretable as fractions. Removes the
1/sqrt(d_head) scaling and averages across heads instead of summing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
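A sketch of the T^h normalization described above (tensor layout and function name are assumptions; T^h is taken to be nonzero):

```python
import torch

def normalize_pair_contribs(contribs: torch.Tensor) -> torch.Tensor:
    # contribs: [n_heads, n_pairs, n_offsets] raw per-pair QK dot-product
    # contributions, with index 0 on the offset axis holding offset 0.
    # T^h is each head's total QK dot product at offset 0; dividing by it
    # makes the pair contributions within a head sum to 1 at offset 0, so
    # they read directly as fractions of the head's attention logit.
    t_h = contribs[:, :, 0].sum(dim=1, keepdim=True).unsqueeze(-1)  # [n_heads,1,1]
    return contribs / t_h
```

Averaging (rather than summing) the normalized values across heads then keeps the scale comparable regardless of how many heads contribute.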
…aseline

Divides (ablated - baseline) by the mean baseline attention at each offset
across all heads, giving a stable measure of relative attention change that
doesn't overweight offsets with small absolute attention.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
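A sketch of this normalization (tensor layout assumed):

```python
import torch

def fractional_attention_change(
    ablated: torch.Tensor, baseline: torch.Tensor
) -> torch.Tensor:
    # ablated, baseline: [n_heads, n_offsets] mean attention per head/offset.
    # Dividing (ablated - baseline) by the cross-head mean baseline at each
    # offset gives a relative change that does not overweight offsets where
    # the absolute attention is tiny.
    denom = baseline.mean(dim=0, keepdim=True)  # [1, n_offsets]
    return (ablated - baseline) / denom
```

Using the cross-head mean in the denominator (rather than each head's own baseline) keeps the measure stable for heads whose baseline attention at an offset is near zero.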
Ablates each of the top N QK component pairs (ranked by attention
contribution at a given offset) and plots the fractional attention change
per pair, normalized by cross-head mean baseline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ablates individual Q and K components (ranked by mean CI) and produces
three plot types per projection: raw attention with baseline, attention
diff, and fractional change. Supports configurable k_offset for ablating
K components at different positions relative to the query.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolve conflict in spd/clustering/dataset.py by taking dev's cleaner
version (removed old pretrained_model_name lookup and config_kwargs).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…n logic

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Includes five overlap metric variants: unweighted cosine, strength-weighted,
data-weighted, variance-weighted, and data+strength combined. Also includes
psi reading strength profiles, t-SNE/PCA bubble plots, and grid visualizations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Describes motivation and detailed equations for all five overlap variants:
unweighted, strength-weighted, data-weighted, variance-weighted, and combined.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…verlap

Removes t-SNE, PCA bubble plots, psi correlation/scatter grids, singular
value histogram, grid squares, and all related infrastructure (reading
strengths, attention offset loading, unit-basis variants). Renames main
entry point to plot_wv_subspace_overlap.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Directory and script renamed to reflect current scope. Old non-overlap
output files moved to out/<run_id>/archive/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds three new plotting functions to the W_V subspace overlap script:
- Combined side-by-side heatmap (unweighted + variance-weighted overlap)
- Component-head amplification heatmap showing ||W_V^h v_c|| per SPD component
- Shared helpers for Gram-matrix overlap and lower-triangular heatmap rendering

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Conflicts:
#	spd/scripts/detect_prev_token_heads/detect_prev_token_heads.py
#	spd/scripts/detect_prev_token_heads/detect_prev_token_heads_random_tokens.py