Add attention head characterization scripts and ablation experiments #403
Open
lee-goodfire wants to merge 95 commits into dev from
Conversation
Scatter plots of mean CI values per component, arranged in a grid by module. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Per-head Frobenius norm heatmaps for each attention projection (q/k/v/o), showing how each component's weight is distributed across heads.
Bar charts of head-spread entropy per component for each attention projection, showing whether components are concentrated on one head or spread across many.
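The PR does not spell out how head-spread entropy is computed; a minimal sketch of one plausible formulation is Shannon entropy over a component's normalized per-head weight norms. The function name `head_spread_entropy` and the example arrays are illustrative, not taken from the PR's scripts:

```python
import numpy as np

def head_spread_entropy(per_head_norms: np.ndarray) -> float:
    """Shannon entropy (in nats) of a component's weight distribution
    over heads. 0.0 means all weight sits on a single head; log(n_heads)
    means the weight is spread evenly across every head."""
    p = per_head_norms / per_head_norms.sum()
    p = p[p > 0]  # drop zero-mass heads to avoid log(0)
    return float(-(p * np.log(p)).sum())

concentrated = np.array([5.0, 0.0, 0.0, 0.0])  # one head only -> entropy 0
spread = np.array([1.0, 1.0, 1.0, 1.0])        # even over 4 heads -> log(4)
```

Under this formulation, the bar charts distinguish single-head components (entropy near 0) from head-spanning ones (entropy near log of the head count).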
Target vs reconstructed weight matrices per attention projection, with individual component weight visualizations in paginated 4x4 grids.
Per-head activation magnitude heatmaps combining U-norm head structure with actual activation magnitudes from harvest data.
Weight-only q·k attention contribution heatmaps between q and k subcomponents. Single grid per layer with summed (all heads) and per-head breakdowns. Uses V-norm-scaled U dot products to account for the unnormalized magnitude split.
K/V component co-activation heatmaps using pre-computed harvest co-occurrence data. Three metrics per layer: CI co-occurrence counts, phi coefficient (binary correlation), and Jaccard similarity of firing sets.
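The phi coefficient and Jaccard similarity named above are standard binary-association measures; a self-contained sketch of how they could be computed from two components' binary firing vectors (the function name and inputs are illustrative, not the PR's actual code, which works from pre-computed co-occurrence counts):

```python
import numpy as np

def cooccurrence_metrics(a: np.ndarray, b: np.ndarray) -> tuple[float, float]:
    """Phi coefficient and Jaccard similarity for two binary firing vectors.

    Phi is the Pearson correlation of two binary variables, built from the
    2x2 contingency table; Jaccard is |both fire| / |either fires|."""
    a, b = a.astype(bool), b.astype(bool)
    n11 = np.sum(a & b)    # both fire
    n10 = np.sum(a & ~b)   # only a fires
    n01 = np.sum(~a & b)   # only b fires
    n00 = np.sum(~a & ~b)  # neither fires
    denom = np.sqrt(float((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)))
    phi = float((n11 * n00 - n10 * n01) / denom) if denom else 0.0
    union = n11 + n10 + n01
    jaccard = float(n11 / union) if union else 0.0
    return phi, jaccard
```

Identical firing sets give phi = 1 and Jaccard = 1; disjoint firing sets that partition the data give phi = -1 and Jaccard = 0.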
Generates per-layer PDF reports and companion markdown files tracing attention component interactions: Q->K weight-only attention contributions and K->V CI co-occurrence associations, with autointerp labels/reasoning. Also excludes detect_* scripts from basedpyright (pre-existing type errors).
The loader was hardcoding is_tokenized=False, causing failures on pre-tokenized datasets like danbraunai/pile-uncopyrighted-tok.
Non-streaming was impractical for large pre-tokenized datasets like pile-uncopyrighted-tok (491M rows shuffled/mapped just for 512 samples).
The ~40k component model OOMs at batch_size=512 on a 140GB H200.
Add four new plot types: per-head heatmaps across offsets, head-vs-sum scatter, and pair contribution line plots (summed and per-head). Also add a top_n_pairs parameter and trim the default offsets.
Compute mean attention to position i-1 for each head across real text data. Includes a random-tokens control variant to distinguish positional from content-driven attention patterns.
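The previous-token metric above reduces to averaging one subdiagonal of each head's attention matrix. A minimal sketch under that reading (the function name `prev_token_score` is illustrative; the PR's script additionally averages over many real-text samples):

```python
import numpy as np

def prev_token_score(attn: np.ndarray) -> float:
    """Mean attention each query position places on the immediately
    preceding position. attn is a [seq, seq] row-stochastic attention
    matrix for one head; position 0 has no previous token and is skipped."""
    seq = attn.shape[0]
    return float(np.mean([attn[i, i - 1] for i in range(1, seq)]))

# A "perfect" previous-token head puts all attention mass on i-1:
perfect = np.eye(5, k=-1)
perfect[0, 0] = 1.0  # position 0 can only attend to itself
```

The random-tokens control uses the same score; a head that keeps a high score on shuffled tokens is positional rather than content-driven.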
Use synthetic repeated token sequences [A B C | A B C] to measure induction attention (from second occurrence to token following first occurrence) for each head.
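On a doubled sequence of length 2*half, a query at position t in the second half has its first occurrence at t - half, so the induction target is position t - half + 1. A sketch of the score under that indexing (names are illustrative, not the PR's code):

```python
import numpy as np

def induction_score(attn: np.ndarray, half: int) -> float:
    """Induction attention on a repeated sequence [A B C | A B C].

    For each query t in the second half, the token following the first
    occurrence of t's token sits at t - half + 1. attn is a
    [2*half, 2*half] attention matrix for one head."""
    return float(np.mean([attn[t, t - half + 1] for t in range(half, 2 * half)]))

# A perfect induction head puts all mass on t - half + 1:
half = 3
perfect = np.zeros((2 * half, 2 * half))
for t in range(half, 2 * half):
    perfect[t, t - half + 1] = 1.0
```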
Measure mean attention weight landing on prior positions holding the same token as the current query position, on real text data.
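A sketch of how this duplicate-token score could be computed per head, summing attention over all earlier duplicates and averaging over queries that have at least one (illustrative names, not the PR's implementation):

```python
import numpy as np

def duplicate_token_score(attn: np.ndarray, tokens: list[int]) -> float:
    """Mean attention weight landing on earlier positions that hold the
    same token as the current query position. Queries with no earlier
    duplicate are skipped."""
    scores = []
    for t in range(len(tokens)):
        prior = [j for j in range(t) if tokens[j] == tokens[t]]
        if prior:
            scores.append(attn[t, prior].sum())
    return float(np.mean(scores)) if scores else 0.0
```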
Compute positional offset profiles (attention vs relative offset) and BOS attention scores for each head on real text data. Produces three plots: max-offset heatmap, BOS heatmap, and per-head profile lines.
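An offset profile averages each diagonal of the attention matrix: offset d collects attn[q, q-d] over all valid queries. A minimal sketch (illustrative, not the PR's code, which also handles BOS separately and averages over samples):

```python
import numpy as np

def offset_profile(attn: np.ndarray, max_offset: int) -> list[float]:
    """Mean attention at each relative offset d (query q attending to
    position q - d), averaged over all queries where the offset is valid."""
    seq = attn.shape[0]
    return [float(np.mean([attn[q, q - d] for q in range(d, seq)]))
            for d in range(max_offset + 1)]
```

The max-offset heatmap then reduces each head's profile to its argmax offset.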
Measure the fraction of each head's attention landing on delimiter tokens (periods, commas, etc.) on real text, compared to the baseline delimiter frequency.
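Comparing against baseline frequency matters because under uniform attention a head would land on delimiters exactly as often as they occur. A sketch of both quantities given a delimiter mask (illustrative names, not the PR's code):

```python
import numpy as np

def delimiter_attention_fraction(attn: np.ndarray,
                                 is_delim: np.ndarray) -> tuple[float, float]:
    """Fraction of each query's attention landing on delimiter positions,
    averaged over queries, plus the baseline delimiter frequency.
    attn: [seq, seq] row-stochastic; is_delim: [seq] bool mask."""
    frac = float(attn[:, is_delim].sum(axis=1).mean())
    baseline = float(is_delim.mean())
    return frac, baseline
```

A delimiter head is one whose fraction substantially exceeds the baseline.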
Use ordinal sequences (digits, letters, days, months) with random-word controls to isolate successor-specific attention patterns for each head.
Use IOI prompts to measure S2-position attention and OV copy scores, identifying heads that attend to repeated subject names and inhibit copying via negative OV contributions.
Deep-dive into L2H4 at the SPD component level: weight concentration analysis, per-component ablation effects on induction score, cross-head interactions, and analysis of what prevents perfect induction behavior.
Summarize findings from all 7 detection scripts and the component-level analysis: multi-functional early heads, the L1H1->L2H4 induction circuit, the layer-2 BOS sink pattern, and cross-cutting observations.
Document the mean_ci -> firing_density + mean_activations schema change, the workaround for incompatible harvest sub-runs, and the list of files affected by the migration.
Replace _generate_greedy's ablate_first_only boolean with two explicit functions:
- _generate_greedy_one_shot: ablation on the first step only, then clean generation (for position-specific interventions)
- _generate_greedy_persistent: ablation re-applied every step, since with no KV cache the model recomputes from scratch each step (for "model modification" conditions like zeroing all values)
Both share _forward_once for the actual forward pass logic.
All ablations now apply on the first generated token only. Deleted _generate_greedy_persistent since we never want ablations to persist across generation steps. Renamed "[persist]" conditions to "Vals @ALL prev": these zero values at all prompt positions but only for one prediction, the same as the other value ablation conditions.
Replace the _generate_greedy_one_shot loop with _predict_next_token, which does exactly one forward pass and returns one token ID. Remove the gen_len parameter entirely. HTML tables now have one column.
Each ablation condition now shows:
- Top-k tokens whose logits increased most vs baseline
- Top-k tokens whose logits decreased most vs baseline
- Change in the baseline's predicted token's logit
The baseline is the target model for non-SPD conditions and the SPD baseline for component conditions. Also: a Prediction class replaces the bare int return, and a ConditionResult tuple tracks which baseline each condition uses.
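The three quantities in that summary can all be derived from one logit-difference vector. A sketch under the obvious reading (the function name `logit_change_summary` is illustrative; the PR's Prediction/ConditionResult types are not reproduced here):

```python
import numpy as np

def logit_change_summary(base_logits: np.ndarray, abl_logits: np.ndarray,
                         k: int = 5):
    """Top-k token ids whose logits increased/decreased most under
    ablation, plus the change in the logit of the baseline's predicted
    token."""
    diff = abl_logits - base_logits
    inc = np.argsort(-diff)[:k]  # largest increases first
    dec = np.argsort(diff)[:k]   # largest decreases first
    base_pred_delta = float(diff[int(base_logits.argmax())])
    return inc, dec, base_pred_delta
```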
Token-per-line layout with flexbox alignment, color-coded values (green = increase, red = decrease), a wider page, and proper vertical alignment.
Inline token(value) format with nbsp separators instead of one-per-line. Each row is now one line tall. Values are still color-coded.
One cell per token instead of all tokens crammed into one cell. The header shows inc 1..5 and dec 1..5 columns. Each cell has the token plus a colored value. Compact single-line rows.
Also relax the t >= 2 assertion to t >= 1, and skip the t-1,t-2 condition when t < 2.
Short prompts are better for studying previous-token effects since the ablated position is close to the prediction. The t >= 1 assertion and the t >= 2 guard on the t-1,t-2 condition handle 2-token prompts correctly.
Each logit cell shows the token on the first line and the colored change value on the second. Increased the default top_k from 5 to 20.
Expanded crafted prompts to 40 with a focus on prev-token behavior: bigrams, fixed expressions, sequences, code syntax, repetition. Change values in logit cells are now 9px for less visual noise.
Replace the separate prompt div and token info line with inline prompt text in a highlighted code box directly in the h2 heading.
Before: "Crafted: HTML | ablation at t=4" + separate prompt div
After: "Crafted: HTML | <html><body> | t=4"
Tokens are separated by | delimiters. Blue = ablated position (t), yellow = previous position (t-1), grey = other tokens. Makes tokenization boundaries and the ablation target immediately visible.
Adds individual sample plots (up to 10) showing raw attention distributions and diffs with actual token text on the x-axis. Also reverses the x-axis so the query position is on the right, and increases the legend size.
…arability Each head's QK dot product contributions are now divided by T^h (the total QK dot product at offset 0), so pair contributions within a head sum to 1 at offset 0 and are directly interpretable as fractions. Removes the 1/sqrt(d_head) scaling and averages across heads instead of summing.
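The normalization described above has a simple shape: divide each head's pair-by-offset contribution tensor by that head's summed contribution at offset 0. A sketch under that reading (the function name and tensor layout are assumptions, not the PR's code):

```python
import numpy as np

def normalize_by_offset0_total(contrib: np.ndarray) -> np.ndarray:
    """contrib: [n_heads, n_pairs, n_offsets] per-pair QK dot-product
    contributions, with index 0 on the last axis corresponding to
    relative offset 0.

    Divide each head by T^h (its summed contribution at offset 0) so
    pair contributions within a head sum to 1 at offset 0."""
    T = contrib[:, :, 0].sum(axis=1)[:, None, None]  # [n_heads, 1, 1]
    return contrib / T

rng = np.random.default_rng(0)
contrib = rng.uniform(0.1, 1.0, size=(2, 4, 3))
normed = normalize_by_offset0_total(contrib)
```

After this, averaging the normalized tensors across heads (rather than summing raw dot products) keeps heads with large absolute QK magnitudes from dominating.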
…aseline Divides (ablated - baseline) by the mean baseline attention at each offset across all heads, giving a stable measure of relative attention change that doesn't overweight offsets with small absolute attention.
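A minimal sketch of this fractional-change normalization, assuming per-head, per-offset mean attention arrays (illustrative names, not the PR's code):

```python
import numpy as np

def fractional_attention_change(ablated: np.ndarray,
                                baseline: np.ndarray) -> np.ndarray:
    """ablated, baseline: [n_heads, n_offsets] mean attention per head
    and offset. Dividing the diff by the cross-head mean baseline at each
    offset avoids overweighting offsets with small absolute attention."""
    mean_baseline = baseline.mean(axis=0, keepdims=True)  # [1, n_offsets]
    return (ablated - baseline) / mean_baseline
```

Using the cross-head mean (rather than each head's own baseline) as the denominator keeps the measure stable for heads whose baseline attention at some offset is near zero.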
Ablates each of the top N QK component pairs (ranked by attention contribution at a given offset) and plots the fractional attention change per pair, normalized by the cross-head mean baseline.
Ablates individual Q and K components (ranked by mean CI) and produces three plot types per projection: raw attention with baseline, attention diff, and fractional change. Supports a configurable k_offset for ablating K components at different positions relative to the query.
Resolve conflict in spd/clustering/dataset.py by taking dev's cleaner version (removed the old pretrained_model_name lookup and config_kwargs).
…n logic
Includes five overlap metric variants: unweighted cosine, strength-weighted, data-weighted, variance-weighted, and data+strength combined. Also includes psi reading strength profiles, t-SNE/PCA bubble plots, and grid visualizations.
Describes the motivation and detailed equations for all five overlap variants: unweighted, strength-weighted, data-weighted, variance-weighted, and combined.
…verlap Removes t-SNE, PCA bubble plots, psi correlation/scatter grids, the singular value histogram, grid squares, and all related infrastructure (reading strengths, attention offset loading, unit-basis variants). Renames the main entry point to plot_wv_subspace_overlap.
Directory and script renamed to reflect the current scope. Old non-overlap output files moved to out/<run_id>/archive/.
Adds three new plotting functions to the W_V subspace overlap script:
- Combined side-by-side heatmap (unweighted + variance-weighted overlap)
- Component-head amplification heatmap showing ||W_V^h v_c|| per SPD component
- Shared helpers for Gram-matrix overlap and lower-triangular heatmap rendering
# Conflicts:
#	spd/scripts/detect_prev_token_heads/detect_prev_token_heads.py
#	spd/scripts/detect_prev_token_heads/detect_prev_token_heads_random_tokens.py
Description
Add a suite of attention head detection scripts, component-level induction analysis, new QK attention contribution plot types, and a position-specific ablation experiment with previous-token redundancy testing.
- New detection scripts (each in spd/scripts/detect_*/)
- Component-level analysis (spd/scripts/characterize_induction_components/)
- Attention ablation experiment (spd/scripts/attention_ablation_experiment/), with a previous-token redundancy test (--prev_token_test): tests whether ablating a head/component makes value ablation at t-1 redundant, quantifying how much of the prev-token information channel the ablation captures
- Plot enhancements (plot_qk_c_attention_contributions)
- Shared utility (spd/scripts/collect_attention_patterns.py): consolidates _collect_attention_patterns from all 7 detect scripts into a shared module
- Bug fix: detect_s_inhibition_heads OV copy scores indexed W_V with Q-head indices instead of KV-head indices
- Harvest schema migration
Motivation and Context
Characterize attention head behavior in decomposed LlamaSimpleMLP models to understand component-level circuits. The ablation experiments provide quantitative evidence that SPD components correspond to specific attention mechanisms (previous-token behavior) in a way that cuts across individual attention heads.
How Has This Been Tested?
- wandb: goodfire/spd/runs/s-275c8f21
- Lint/type checks pass (make check)
Does this PR introduce a breaking change?
No. All changes are additive (new scripts and plot functions). The harvest schema update aligns with the migration already merged on dev.

🤖 Generated with Claude Code