(dev -> main) App updates and backwards transformer ci functions #355
danbraunai-goodfire wants to merge 572 commits into main from
Conversation
@claude can you review this PR? Note that your review should cover the scope of this PR only. If you spot things unrelated to this PR, feel free to bring them up and we'll consider them for new issues. I want you to think about possible ways the implementation could be improved, including ways it could be rewritten completely using different abstractions. I want critical comments; feel comfortable giving me the hard truths.
Claude finished @danbraunai-goodfire's task

PR Review: App updates and backwards transformer ci functions

Review Scope: 288 commits, 100 files changed, focusing on app updates and transformer CI functions
## Critical Review & Hard Truths

### 🔥 Major Architectural Problems

1. Backend Compute Layer is a God Object
2. Database Schema Fighting the Type System
3. Frontend State Management is Chaotic

### 🚨 Code Quality Issues

4. Layer Aliasing Global State

```ts
let cachedArchitecture: Architecture | null = null;
```

5. Bulk API Endpoint Design
6. Configuration Explosion
### 💡 Missed Opportunities

7. TypeScript Discriminated Unions Done Halfway

```ts
// Current: Can still represent invalid states
type OptimizeConfigDraft = {
  loss: LossConfigDraft; // Could be invalid
  // ...
};

// Better: Make invalid states unrepresentable
type OptimizeState =
  | { status: "draft"; config: Partial<OptimizeConfig> }
  | { status: "valid"; config: ValidOptimizeConfig }
  | { status: "computing"; config: ValidOptimizeConfig };
```

8. Database Caching Strategy
### 🎯 Specific Technical Issues

9. Performance Anti-patterns
10. Error Handling Inconsistency
### 🔨 Complete Rewrite Suggestions

Alternative 1: Event-Driven Architecture

```python
# Commands
@dataclass
class ComputeAttributionsCommand:
    prompt_id: int
    optimization_params: OptimParams | None

# Events
@dataclass
class AttributionsComputed:
    prompt_id: int
    graph_data: GraphData
```

Alternative 2: Functional Core + Imperative Shell

```python
def compute_attributions(model: Model, tokens: Tensor, config: Config) -> Graph:
    # Pure function - no side effects, easy to test
    pass

class AttributionService:
    # Imperative shell - handles caching, DB, progress callbacks
    pass
```

### 🚧 Breaking Changes Needed

The current architecture is trying to be too general. Consider:

**Bottom Line**: This codebase is evolving from research prototype to production-like complexity without the discipline. Either embrace the complexity with proper abstractions, or ruthlessly simplify back to research-code simplicity. The current middle ground is the worst of both worlds.
…diagram paths Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fire auto-parses JSON strings into Python dicts. Accept both str and dict in worker scripts' config_json parameter. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… scope Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… paths Fixes GPT2LMHeadModel compatibility: wte is at .transformer.wte not .wte. Pass embedding_module and unembed_module from the adapter instead of reaching into model internals. Also: harvest defaults batch_size 128, n_batches 2000 (OOM at 256). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unembed_module is needed for output attributions even when lm_head is not among the decomposed targets. Make unembed_path and unembed_module non-optional on ModelAdapter. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace role-name-based ArchConfig (kv_roles, o_roles, qkv_group) with full glob path patterns (kv_patterns, o_patterns) and per-model configs. This fixes LlamaSimple not getting SwiGLU role groups and eliminates the "attn" in path heuristic in _resolve_cross_seq_paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use h_torch.* patterns for custom GPT2 (matches registered ModuleList name)
- Rename test_gpt2_simple_noln -> test_gpt2_simple_partial_targets
- Clean up stale noln comments in tests
- Add swiglu group assertion for LlamaSimple test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…lity Fire parses JSON null as the string "null", breaking Pydantic validation. Exclude None values from serialized JSON — Pydantic fills in None defaults during validation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Required for exclude_none JSON serialization — field must have a default so Pydantic can fill it in when the key is absent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
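The two commits above rely on one invariant: a key serialized with exclude-None semantics is simply absent, and validation restores the None default. Pydantic v2's `model_dump_json(exclude_none=True)` does this; here is a stdlib-only sketch of the same round-trip idea (the `dump_exclude_none` helper is illustrative, not from the codebase).

```python
import json
from typing import Any


def dump_exclude_none(cfg: dict[str, Any]) -> str:
    # Drop None values before serializing. Fire would otherwise turn a
    # JSON null into the string "null"; with the key absent, validation
    # (e.g. Pydantic) fills in the None default instead.
    return json.dumps({k: v for k, v in cfg.items() if v is not None})
```

The corollary, noted in the second commit, is that every field dropped this way must have a default on the model, or validation fails when the key is absent.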
Find block index dynamically instead of assuming paths start with "h.". Add sublayer descriptions for fused c_attn, SwiGLU gate/up projections. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Proposes TransformerTopology as a unified model structure abstraction, replacing scattered path parsing and role detection across 6+ consumers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Core types (SeparateAttention/FusedAttention, StandardFFN/SwiGLUFFN, BlockInfo, LayerInfo) and TransformerTopology class that maps concrete module paths onto canonical abstract roles. ArchConfigs declare role_mapping (glob pattern -> abstract role) for each supported architecture. TransformerTopology resolves all modules at init, builds blocks, and provides describe(), is_cross_seq_pair(), etc. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ModelAdapter is now a thin re-export alias. All topology, embedding/unembed resolution, cross-seq detection, role ordering, role groups, and display names live on TransformerTopology. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove get_model_n_blocks() from compute.py (use topology.n_blocks)
- Remove _parse_layer_description() from compact_skeptical.py (use topology.describe() via ArchitectureInfo.layer_descriptions)
- Add convenience properties to topology: embedding_path, embedding_module, unembed_path, unembed_module, target_module_paths
- model_adapter.py is now a thin re-export alias

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rename adapter -> topology throughout:
- RunState.adapter -> RunState.topology
- compute.py parameter names
- All router references
- Test files

Delete spd/app/backend/model_adapter.py (no remaining imports).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tightens _extract_block_index to assert exactly one digit segment exists, rather than silently returning the first match. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
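The tightened contract can be sketched as follows; the real helper's signature and error type may differ, so treat this as an assumed shape rather than the actual implementation.

```python
def extract_block_index(path: str) -> int:
    """Return the block index from a dotted module path like 'transformer.h.3.mlp'.

    Asserts exactly one all-digit segment exists, rather than silently
    returning the first match when a path has zero or multiple candidates.
    """
    digit_segments = [seg for seg in path.split(".") if seg.isdigit()]
    assert len(digit_segments) == 1, (
        f"expected exactly one digit segment in {path!r}, got {digit_segments}"
    )
    return int(digit_segments[0])
```

Failing loudly here surfaces unexpected architectures immediately instead of mis-attributing layers.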
LayerInfo is now just path + module. Roles are internal to __init__,
used only to sort paths into struct fields. kv_paths, o_paths, describe(),
and is_cross_seq_pair() all derive from block structs.
SwiGLU layers now get distinct descriptions ("SwiGLU gate/up/down")
rather than generic "MLP" labels.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Fix duplicate defaults: context_length and max_turns

* Deduplicate MAX_OUTPUT_NODES_PER_POS constant

* Simplify investigate module: single inv_id arg, fail-fast patterns

  - run_agent reads all config from metadata.json instead of duplicating as CLI args (wandb_path, context_length, max_turns)
  - wait_for_backend raises directly instead of returning bool
  - _format_model_info accesses keys directly instead of .get() fallbacks

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…, investigate

- graph_interp/db.py: Extract parameterized _save_label/_get_label/_get_all_labels from 3x3 duplicated CRUD methods
- graph_interp/interpret.py: Unify process_output_layer/process_input_layer via _make_process_layer factory
- autointerp/prompt_helpers.py: Deduplicate build_fires_on_examples/build_says_examples into _build_examples
- graph_interp/prompts.py: Simplify _format_related string building with f-string
- investigate/agent_prompt.py: Replace repetitive config blocks with data-driven loop
- investigate/scripts/run_agent.py: Remove obvious docstrings, simplify fetch_model_info

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tale docs

Backend:
- graphs.py: Extract _build_loss_config, _build_loss_result, _maybe_pgd_config, _maybe_adv_pgd helpers
- server.py: Move deferred stdlib imports to module-level
- __init__.py: Fix __all__ ordering
- CLAUDE.md: Remove duplicate router entries
- sqlite.py: Fix stale docstring referencing old DB location

Frontend components:
- Deduplicate getTopEdgeAttributions into shared topEdgeAttributions() in promptAttributionsTypes.ts
- Extract generic parseSSEStream<T>() in graphs.ts, eliminating ~50 lines of duplicated SSE parsing
- Extract AVAILABILITY_COLUMNS in RunSelector, reducing ~60 lines of duplicated template
- Eliminate redundant computeMaxAbsComponentAct in ActivationContextsViewer + ClusterComponentCard
- Fix unreachable null check in ClusterComponentCard
- Fix mid-file import in ComponentNodeCard
- Remove dead fork handler stubs in PromptAttributionsTab
- Remove unused isRunEditable export, 5 unused CSS selectors, 12+ unnecessary comments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rt both-or-neither Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…n arg It's a runtime value produced by the harvest step, not user config. Thread it as a plain str arg through the call chain, matching how autointerp, graph_interp, and intruder already do it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…s, add dataset name

- Use md.bullets() and md.numbered() instead of manual \n- lists
- Inline token_pmi_pairs (one-liner, used 3 times — not worth a helper)
- Add 'danbraunai/pile-uncopyrighted-tok' to DATASET_DESCRIPTIONS

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Frontend already called api.deletePrompt() but the endpoint was missing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…directly Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The feature landed but was renamed sans→ablated during the merge. Test was using the old name. Now asserts on ablated/ablated_loss correctly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add data presentation context to autointerp prompts

  The interpreter LLM had no context about how activation examples were constructed or where output correlations were measured. This led to:
  - Missing positional patterns (e.g. sequence-start components)
  - Confusion about whether output tokens are measured at the component's layer or the model's final logits

  Add a "Data presentation" section to both prompt strategies explaining:
  - Model sequence length
  - Window size and truncation at sequence boundaries
  - That all token correlations are measured at the model's final output

  Also adds seq_len to ModelMetadata and threads context_tokens_per_side from harvest config through to the prompt builders.

* Clarify output correlation explanation in data presentation

  Only output correlations need explaining — they measure the model's final predicted logits, not the component's direct output. Input correlations are straightforward.

* Remove hardcoded example labels from dual_view prompt

* Replace prescriptive 'say unclear' with epistemic honesty guidance

* Note that activation examples are uniformly sampled from all firings

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add centralized metric definitions to data presentation section

  Move metric explanations (recall, precision, PMI) out of inline labels and into a shared definitions block. Key improvement: precision now explains what low-precision-high-recall means (context-dependent firing), which was a major blind spot for the interpreter. Also clarifies the input vs output distinction: input = token at firing position, output = model's predicted logits at that position. Inline metric labels simplified to just the metric name since definitions are now explained upfront.

* Add decomposition method descriptions, remove recall from prompts

  - Add decomposition_method field to ModelMetadata with descriptions for SPD, CLT, and MOLT. Replaces hardcoded SPD context in both strategies.
  - Remove recall metric from compact_skeptical (redundant with PMI + examples, and confusing alongside precision).
  - Remove include_spd_context config option (now covered by decomposition method description in data presentation section).

* Fix Md() usage: separate paragraphs into separate .p() calls

  Md.p() already adds paragraph breaks — embedding \n\n inside a single .p() call produces double spacing. Also unchain .bullets() from .h() for clarity.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
calc_kl_divergence_lm used F.kl_div(reduction="none"), which materializes a full [batch, seq, vocab] intermediate tensor (~13GB with eval_batch_size=128, seq=512, vocab=50K). Called 6x per eval step in CEandKLLosses, this was the memory high-water-mark causing OOMs on runs with many components. Fix: use reduction="sum", which fuses the reduction into the kernel, avoiding the intermediate. Divide by n_positions to match the original mean-over-positions semantics.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
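A minimal sketch of the fix described above, assuming logits of shape [batch, seq, vocab]; the real calc_kl_divergence_lm's exact signature is not shown in this PR, so argument names here are assumptions.

```python
import torch
import torch.nn.functional as F


def calc_kl_divergence_lm(target_logits: torch.Tensor, pred_logits: torch.Tensor) -> torch.Tensor:
    """Mean-over-positions KL(P || Q) between two LM logit tensors.

    reduction="sum" fuses the reduction into the kernel, so no full
    [batch, seq, vocab] intermediate is materialized (the memory
    high-water-mark with reduction="none"). Dividing by n_positions
    recovers the original mean-over-positions semantics.
    """
    n_positions = target_logits.shape[0] * target_logits.shape[1]
    log_p = F.log_softmax(target_logits, dim=-1)  # target distribution P, log space
    log_q = F.log_softmax(pred_logits, dim=-1)    # predicted distribution Q, log space
    # F.kl_div(input=log_q, target=log_p, log_target=True) computes
    # sum over all elements of p * (log_p - log_q), i.e. KL(P || Q).
    return F.kl_div(log_q, log_p, log_target=True, reduction="sum") / n_positions
```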
* Add multi-provider LLM support for autointerp (Anthropic, OpenAI, OpenRouter)
Replace hardcoded OpenRouter SDK usage with a provider abstraction that routes
to the right API based on model string:
- "/" in name → OpenRouter (google/gemini-3.1-pro-preview, etc.)
- "claude-*" → first-party Anthropic API (tool_use for structured output)
- "gpt-*"/"o*-*" → first-party OpenAI API (json_schema response format)
This enables using our corporate Anthropic/OpenAI keys directly, avoiding
OpenRouter's rate limits which were bottlenecking autointerp runs.
Key changes:
- New spd/autointerp/providers.py with LLMProvider ABC + 3 implementations
- llm_api.py now provider-agnostic (uses providers.py internally)
- Own ReasoningEffort type replaces openrouter.components.Effort everywhere
- All callers updated: api_key + model → auto-resolved provider
- get_api_key_for_model() reads the right env var per provider
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
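The routing rule in the commit above can be sketched as a simple dispatch on the model string. This is an illustrative sketch only: the real create_provider() returns an LLMProvider instance and resolves API keys from env, and the helper name `resolve_provider_name` is assumed.

```python
def resolve_provider_name(model: str) -> str:
    """Infer the provider from a model name string.

    Mirrors the dispatch described above:
    - "/" in the name  -> OpenRouter (e.g. "google/gemini-3.1-pro-preview")
    - "claude-*"       -> first-party Anthropic API
    - "gpt-*" / "o*-*" -> first-party OpenAI API
    """
    if "/" in model:
        return "openrouter"
    if model.startswith("claude-"):
        return "anthropic"
    if model.startswith("gpt-") or (model.startswith("o") and "-" in model):
        return "openai"
    raise ValueError(f"cannot infer provider for model {model!r}")
```

One consequence of string-based dispatch worth noting: any future model family that matches none of these patterns fails loudly rather than silently routing to the wrong API.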
* Encapsulate provider config: callers pass LLMProvider instead of raw strings
Move reasoning_effort into the provider (set at construction), so callers
pass a single LLMProvider object instead of (api_key, model, reasoning_effort).
Entry points call create_provider(model, reasoning_effort) which auto-resolves
the API key from env. Library functions just accept provider: LLMProvider.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Fix review findings: OpenAI o-series tokens, provider leak, dead field
- OpenAI o-series models need max_completion_tokens (not max_tokens)
- Close provider in app's on-demand interpretation endpoint
- Remove dead LLMJob.schema field (set everywhere, read nowhere)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Replace model+reasoning_effort with discriminated LLMConfig union
LLMConfig = OpenRouterLLMConfig | AnthropicLLMConfig | OpenAILLMConfig
Each variant carries only the fields that apply:
- OpenRouter: model + reasoning_effort
- Anthropic: model (no reasoning_effort — not supported)
- OpenAI: model + reasoning_effort (for o-series)
All configs (AutointerpConfig, AutointerpEvalConfig, IntruderEvalConfig,
GraphInterpConfig) now have `llm: LLMConfig` instead of separate
`model: str` + `reasoning_effort` fields.
YAML format changes from:
model: google/gemini-3.1-pro-preview
reasoning_effort: low
to:
llm:
type: openrouter
model: google/gemini-3.1-pro-preview
reasoning_effort: low
Also removes stale OpenAI pricing (GPT-5 pricing TBD).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add thinking_budget support for Anthropic provider
AnthropicLLMConfig gains optional thinking_budget: int | None field.
When set, enables extended thinking with that token budget and bumps
max_tokens to cover both thinking + output.
Usage:
llm:
type: anthropic
model: claude-sonnet-4-20250514
thinking_budget: 8000
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Remove jose_autointerp config from PR
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add transcoder integration for the harvest pipeline

  Extends the generic harvest pipeline (from #398) to support transcoders from nn_decompositions. Adds TranscoderAdapter, TranscoderHarvestFn, and TranscoderHarvestConfig so that trained transcoders (loaded from wandb artifacts) can be harvested for activation statistics using the same pipeline as SPD. Includes an example script demonstrating end-to-end harvesting of BatchTopK k=32 transcoders across all 4 LlamaSimpleMLP layers.

* Move tokenizer_name and dataset_name to TranscoderHarvestConfig

  These were incorrectly hardcoded as "gpt2" and "danbraunai/pile-uncopyrighted-tok" in the adapter. The transcoders are actually trained with the EleutherAI/gpt-neox-20b tokenizer.

* Extract tokenizer and dataset from base model run info

  Instead of requiring tokenizer_name and dataset_name in the harvest config, extract them from the base model's PretrainRunInfo. The base model's wandb run already stores the full training config, including hf_tokenizer_path and train_dataset_config.

* Read dataloader config from base model run info

  Use the base model's train_dataset_config directly instead of hardcoding dataset fields. Only override streaming=True (for harvest) and n_ctx=block_size (strip the extra label token).

* Simplify dataloader: construct DatasetConfig from pretrain run config

* Derive model_class from actual model type instead of hardcoding

* Fix prerequisite in example script to use optional dependency

* Rename optional dependency from 'transcoder' to 'nn_decompositions'

* Make transcoder harvest launchable from CLI config

  - Add adapter_from_config() that takes the full method_config, so TranscoderAdapter can be constructed in the harvest worker
  - Keep adapter_from_id() for downstream consumers (autointerp, intruder) that only have a decomposition ID
  - Replace Python example script with YAML config for spd-harvest
  - Exclude transcoder files from basedpyright (optional nn_decompositions dep)

* Vendor nn_decompositions transcoder code into spd/adapters/

  Copies EncoderConfig and SharedTranscoder + subclasses (474 lines) from bartbussmann/nn_decompositions (MIT) into spd/adapters/, eliminating the optional dependency. Only torch + stdlib needed, both already deps.
  - spd/adapters/encoder_config.py: EncoderConfig dataclass
  - spd/adapters/transcoders.py: SharedTranscoder, Vanilla/TopK/BatchTopK/JumpReLU
  - Remove nn_decompositions optional dep from pyproject.toml

* Fix type errors in vendored transcoder code

  - Split encode() into encode() and encode_dense() to avoid union return type
  - Add type annotations to autograd.Function forward/backward methods
  - Type _build_loss_dict return as dict[str, Any]
  - Assert std is not None in postprocess_output, .grad in weight norm
  - Use int() for dead_features.sum() passed to min()

* Remove pyright ignores from vendored transcoder code

  - Use *grad_outputs signature for autograd.Function.backward
  - Replace @torch.no_grad() decorator with context manager
  - Credit Bart Bussmann by name in vendored file docstrings

* Make adapter_from_id work for transcoders via harvest DB lookup

  For non-SPD decomposition IDs (e.g. tc-*), recover the full method config from the harvest DB. This means spd-autointerp, intruder eval, graph-interp, and label scoring all work with transcoders — no config passing needed, just the decomposition ID.

* Remove EncoderConfig defaults, add transcoder to DecompositionMethod, add paper_vis module

  - EncoderConfig: all fields now required (values come from checkpoint config.json)
  - Add "transcoder" to DecompositionMethod literal + description
  - TranscoderAdapter.model_metadata: add seq_len and decomposition_method fields
  - paper_vis/: dashboard generation, research post template, comparison visualizations

* Add CLT adapter/harvest, jose e2e TC support, rename vendored files

  - CLTAdapter + CLTHarvestFn + CrossLayerTranscoder model for loading CLTs from wandb artifacts (single checkpoint covering all layers)
  - CLTHarvestConfig with deterministic clt-{hash} IDs
  - Filter extra e2e fields in TC checkpoint config.json so jose sweep transcoders load correctly via existing TranscoderAdapter
  - Rename transcoders.py → transcoder_model.py; encoder_config.py merged into transcoder_model.py (matches clt.py / clt_model.py pattern)
  - Download artifacts to SPD_OUT_DIR/checkpoints/ instead of CWD
  - Remove MOLT (unused placeholder)
  - Example YAMLs and smoke test scripts for all 3 decomposition types

* Remove one-off test/example scripts

  These were useful during development but don't need to live in the repo.

* Remove paper_vis from this branch (moved to stacked PR)

* Remove pile_4L_fs_C_2x and pile_4L_fs_C_4x from registry

* Remove non-BatchTopK encoder types and activation_threshold

  All transcoders and CLTs in our sweeps are BatchTopK. Remove Vanilla, TopK, JumpReLU transcoder classes and the dispatch machinery. Simplify CLT encode_layer to just BatchTopK. Remove activation_threshold from TC/CLT harvest configs (no-op for BatchTopK, which produces exact zeros).

* Add busy_timeout to NFS SQLite writes to handle concurrent access

  Without this, concurrent detection + fuzzing jobs writing to the same interp.db would immediately fail with "database is locked". Now SQLite retries for up to 30s, which is more than enough for the ~ms writes.

* Save detection/fuzzing scores incrementally as trials complete

  Previously, all LLM API calls ran first, then scores were saved in a separate loop. A crash during saves (e.g. SQLite lock contention) would lose all unsaved results despite the API calls having completed. Now each component's score is saved immediately when its last trial arrives. Components with any errored trials are skipped (unreliable).

* Extract shared pretrain_dataloader helper, add type hint to TC hooks

  Addresses PR review: deduplicate identical dataloader logic in CLT and Transcoder adapters into pretrain_dataloader() in base.py.

* Get block_size from model_config_dict instead of passing it through

Co-authored-by: bartbussmann <bartbussmann@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* Add alpha sweep script: CE loss vs fixed source value

  Sweeps alpha in [0, 1] where mask = CI + (1 - CI) * alpha for all components. At alpha=0 masks equal CI (CI-masked); at alpha=1 all components are unmasked. Supports multiple models for comparison.
  Usage: python spd/scripts/alpha_sweep/alpha_sweep.py <run_ids...>

* Fix batch extraction from data loader dict

* Add --labels, --plot-only, JSON data saving, log scale plot, use r notation

  - Save sweep data as JSON alongside plot for re-plotting without recompute
  - --plot-only flag to regenerate plots from saved data
  - --labels flag for custom legend labels
  - Both linear and log scale plots generated
  - Use r (not alpha) for source variable in titles, axes, annotations
  - Annotations use rightarrow and are slightly inset from axes

* Rename alpha -> r consistently in code, CLI args, and function names

Co-authored-by: Claude SPD1 <claude_spd1@proton.me>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
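The sweep's masking rule is elementwise interpolation between the CI values and all-ones. A stdlib sketch (the script presumably operates on tensors; `sweep_mask` is an illustrative name, and `r` follows the rename in the last commit above):

```python
def sweep_mask(ci: list[float], r: float) -> list[float]:
    # mask = CI + (1 - CI) * r, applied per component:
    # at r=0 the masks equal CI (CI-masked);
    # at r=1 every mask is 1.0 (all components unmasked).
    return [c + (1.0 - c) * r for c in ci]
```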
* Add sweep summary stats script for generating markdown reports from WandB runs

  Takes a list of WandB run IDs and produces a markdown report with raw values and summary statistics (mean, std) for all key metrics: CE/KL output quality, eval/train losses, per-module hidden acts reconstruction, and CI-L0 sparsity.

* Use decimal formatting consistently and add plain-text summary list

  Replace scientific notation with decimal places throughout. Add a final "All Summary Statistics" section with one line per metric in the format "<name>: <mean> (std: <std>)".

* Fix missing newlines in summary statistics list

* Increase decimal precision for very small values

  Values below 1e-6 and 1e-10 now get 10 and 14 decimal places respectively, so that the std of metrics like FaithfulnessLoss doesn't round to zero.

* Improve summary list: add section headers, disambiguate metric names

  L0 metrics now prefixed with "L0", train losses prefixed with "train", and grouped under bold section headers.

* Add blank line before section headings in summary list

* Add target model info section to sweep summary report

  Fetches the pretrained target model run from WandB and reports its architecture (layers, hidden dim, heads, MLP width, context, vocab), training dataset, train/val loss, and training steps.

* Fix newlines in target model section, label loss as CE

* Add Training Compute Recovered metric

  For each masking mode (unmasked, stochastic, CI, rounded), computes the SPD model's effective CE and finds what percentage through target model training had the same val loss. Interpolates linearly on the target val loss curve.

* Use isotonic regression for Training Compute Recovered interpolation

  Fits a monotone decreasing curve to the target model's noisy val loss history using sklearn's IsotonicRegression, then interpolates on the smoothed curve. This uses all data points rather than just the two nearest, giving more robust estimates.

* Plot raw and isotonic-fitted target val loss curve

  Saves a PNG alongside the report showing the raw val loss points and the monotone isotonic regression fit used for compute-recovered interpolation.

* Replace isotonic regression with bidirectional EMA for val loss smoothing

  Uses a forward + backward EMA (alpha=0.15) averaged together, giving a smooth curve without lag or staircase artifacts.

* Use forward-only EMA to avoid flattening early steep drop

  Bidirectional EMA was pulling early points down toward later values. Forward-only with alpha=0.3 preserves the steep initial descent.

* Add LaTeX summary table with CE loss and compute recovered by masking mode

  Includes unmasked, stochastic, CI, and rounded (CI > 0) modes with mean +/- std, plus target model baseline row.

* Update rounded masks label in LaTeX table

* Remove stds from LaTeX summary table

* Fix unmasked label: "All masks=1" not "All CI=1"

* Add LaTeX tables for eval recon losses and sparsity with n_alive

  Adds three LaTeX tables to the report:
  1. Masking mode quality (CE + compute recovered) — already existed
  2. Eval reconstruction losses (StochRecon, PGD, HiddenActs)
  3. Sparsity per layer (C, Alive, Mean L0, L0/C %)
  n_alive is sourced from the harvest DB via --harvest-run flag, since the sweep runs themselves may not be harvested.

* Add LaTeX table for training losses

* Move ImportanceMinimalityLoss to last in training losses table

* Add alpha sweep script: CE loss vs fixed source value

  Sweeps alpha in [0, 1] where mask = CI + (1 - CI) * alpha for all components. At alpha=0 masks equal CI (CI-masked); at alpha=1 all components are unmasked. Supports multiple models for comparison.
  Usage: python spd/scripts/alpha_sweep/alpha_sweep.py <run_ids...>

* Fix batch extraction from data loader dict

* Add --labels, --plot-only, JSON data saving, log scale plot, use r notation

  - Save sweep data as JSON alongside plot for re-plotting without recompute
  - --plot-only flag to regenerate plots from saved data
  - --labels flag for custom legend labels
  - Both linear and log scale plots generated
  - Use r (not alpha) for source variable in titles, axes, annotations
  - Annotations use rightarrow and are slightly inset from axes

* Remove alpha sweep script (moved to feature/alpha-sweep branch)

Co-authored-by: Claude SPD1 <claude_spd1@proton.me>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add QK attention contribution plots and shared attention utilities. Adds:
- spd/scripts/collect_attention_patterns.py — shared utility for collecting per-head attention patterns from SPD component models
- spd/scripts/rope_aware_qk.py — RoPE-aware QK inner product computation with multi-offset support
- spd/scripts/plot_qk_c_attention_contributions/ — per-head grid plots of QK pair attention contributions with caching and selective plot generation

Co-authored-by: Claude SPD1 <claude_spd1@proton.me>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add previous-token head detection script: detects SPD components that implement previous-token attention patterns. Includes both crafted-prompt and random-token evaluation modes. Depends on spd/scripts/collect_attention_patterns.py from the QK attention contributions PR.

Co-authored-by: Claude SPD1 <claude_spd1@proton.me>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add attention ablation experiment suite: component-level attention ablation analysis, including single-component and multi-pair fractional attention change plots; attention pattern difference visualization; generation with ablated components for output comparison; and prev-token head redundancy testing. Depends on detect_prev_token_heads and collect_attention_patterns from earlier PRs.

Co-authored-by: Claude SPD1 <claude_spd1@proton.me>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Multiple harvest workers downloading the same artifact concurrently could read a partially-written file. Use an O_CREAT|O_EXCL lockfile (atomic on NFS) so one process downloads while the others poll for a .complete sentinel, with a 5-minute timeout on the poll. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
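The locking scheme described above can be sketched as follows; `acquire_or_wait`, the lock/sentinel filenames, and the `download` callback are illustrative, not the harvest workers' actual API:

```python
import os
import pathlib
import time


def acquire_or_wait(dest: pathlib.Path, download, timeout_s: float = 300.0, poll_s: float = 1.0):
    """Download `dest` exactly once across concurrent processes.

    One process wins the lockfile and downloads; the rest poll for the
    .complete sentinel, which is published only after the file is fully written.
    """
    sentinel = dest.parent / (dest.name + ".complete")
    lock = dest.parent / (dest.name + ".lock")
    if sentinel.exists():
        return dest  # already downloaded by someone else
    try:
        # O_CREAT | O_EXCL is an atomic create-if-absent, even on NFS:
        # exactly one process succeeds here.
        fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
    except FileExistsError:
        # Another process holds the lock; wait for it to finish.
        deadline = time.monotonic() + timeout_s
        while not sentinel.exists():
            if time.monotonic() > deadline:
                raise TimeoutError(f"timed out waiting for {sentinel}")
            time.sleep(poll_s)
        return dest
    try:
        download(dest)
        sentinel.touch()  # publish completion only after a full write
    finally:
        lock.unlink(missing_ok=True)
    return dest
```

A crashed downloader would leave a stale lockfile behind; the poll timeout bounds how long waiters block in that case.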
Visualizes the L2 norms of SPD component U and V matrices projected into each attention head's subspace, showing how components distribute across heads. Produces per-layer heatmaps and bar charts. Co-authored-by: Claude SPD1 <claude_spd1@proton.me> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
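As a toy sketch of the per-head projection (assuming heads occupy contiguous slices of the concatenated head dimension; `head_norms` is an illustrative name, not the script's API):

```python
def head_norms(vec, n_heads, d_head):
    """L2 norm of a component read/write vector restricted to each head's
    contiguous slice of the concatenated head dimension."""
    return [
        sum(x * x for x in vec[h * d_head : (h + 1) * d_head]) ** 0.5
        for h in range(n_heads)
    ]

# A vector living entirely in head 0's slice:
print(head_norms([3.0, 4.0, 0.0, 0.0], n_heads=2, d_head=2))  # [5.0, 0.0]
```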
Plots mean attention weight and pre-softmax QK logit by relative offset for each head. Produces per-layer grids with dual y-axes showing how attention distributes across token distances (offset tau = query_pos - key_pos). Co-authored-by: Claude SPD1 <claude_spd1@proton.me> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
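The grouping by offset tau = query_pos - key_pos can be sketched like this (plain-Python sketch over one head's causal attention matrix; `mean_by_offset` is an illustrative name):

```python
def mean_by_offset(attn, max_offset):
    """Average attention weight attn[q][k], grouped by tau = q - k.

    attn is a lower-triangular list-of-lists (causal attention for one head);
    returns {tau: mean weight} for offsets up to max_offset.
    """
    sums = {t: 0.0 for t in range(max_offset + 1)}
    counts = {t: 0 for t in range(max_offset + 1)}
    for q in range(len(attn)):
        for k in range(q + 1):
            tau = q - k
            if tau <= max_offset:
                sums[tau] += attn[q][k]
                counts[tau] += 1
    return {t: sums[t] / counts[t] for t in sums if counts[t]}
```

The same grouping applied to pre-softmax QK logits gives the second y-axis of the dual-axis grids.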
Analyzes how attention head W_V weight matrices share subspaces via Gram matrix cosine similarity. Includes: - Unweighted and data-variance-weighted subspace overlap heatmaps - Combined paper figure (side-by-side) - Component-head amplification heatmap (||W_V^h @ v_c||) - LaTeX writeup of overlap metrics Co-authored-by: Claude SPD1 <claude_spd1@proton.me> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
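The unweighted overlap metric can be sketched as a Frobenius cosine between Gram matrices (a minimal sketch, assuming the script compares W_V subspaces via G = W^T W; function names are illustrative):

```python
def gram(W):
    """Gram matrix G = W^T W of a weight matrix stored as a list of rows."""
    cols = list(zip(*W))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


def frob_cosine(A, B):
    """Cosine similarity under the Frobenius inner product <A, B>_F."""
    dot = sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))
    na = sum(a * a for ra in A for a in ra) ** 0.5
    nb = sum(b * b for rb in B for b in rb) ** 0.5
    return dot / (na * nb)
```

Comparing Gram matrices rather than the raw weights makes the metric invariant to which orthonormal basis each head's subspace is expressed in; the data-variance-weighted variant would scale the Gram matrices by activation covariance first.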
* Add rich_examples autointerp strategy and compare tab. New autointerp strategy (rich_examples) that shows per-token CI and activation values inline, letting the LLM judge evidence quality directly. Also adds an Autointerp Compare tab to the app for side-by-side comparison of interpretation results across different strategies/models/subruns. Backend: 3 new endpoints for listing subruns, bulk headlines, and detail. Frontend: SubrunSelector (multiselect chips), stacked SubrunInterpCard, two-panel AutointerpComparer with full component data on the right panel.
* Restrict Anthropic autointerp models and use structured outputs

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Fix rich_examples prompt: explain signed component activations. Adds an explanation to the SPD decomposition description that component activation sign is arbitrary (inner product with the read direction) and does not indicate suppression. Trims redundant legend text. Also adds render_prompt.py for iterating on prompt templates.
* Expose snapshot_branch in spd-autointerp CLI
* Improve rich_examples prompt clarity: show raw text before the annotated version in examples (helps with dense token sequences like code/LaTeX); add an explicit explanation of the <<<token (ci:X, act:Y)>>> format; add the "consider evidence critically" paragraph from dual_view.
* Use XML blocks with raw + highlighted text in rich_examples examples. Replaces the sanitized single-line format with: <example> <raw>...unmodified text...</raw> <highlighted>...<<<token (ci:X, act:Y)>>>...</highlighted> </example>. Adds AppTokenizer.get_raw_spans for LLM prompt rendering where actual whitespace (newlines, indentation) is meaningful.
* Show all subruns in autointerp comparer, not just .done ones

Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
* Add autointerp_subrun_id to scoring CLI and InterpRepo.open_subrun
* Remove confidence field from autointerp + improve act legend. Drops the confidence field entirely from InterpretationResult, all DB schemas, JSON output schemas, prompts, API responses, and frontend UI. Expands the act legend in rich_examples to explain that sign is meaningful within a component's examples even though the global convention is arbitrary — polarity may indicate distinct input patterns.

Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
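The <<<token (ci:X, act:Y)>>> annotation format used in the rich_examples prompts can be sketched as follows (`highlight` and the CI threshold are illustrative, not the strategy's actual rendering code):

```python
def highlight(tokens, cis, acts, threshold=0.1):
    """Render tokens, annotating those whose CI meets the threshold
    in the <<<token (ci:X, act:Y)>>> style, leaving the rest raw."""
    out = []
    for tok, ci, act in zip(tokens, cis, acts):
        if ci >= threshold:
            out.append(f"<<<{tok} (ci:{ci:.2f}, act:{act:.2f})>>>")
        else:
            out.append(tok)
    return "".join(out)

print(highlight(["The", " cat"], [0.0, 0.9], [0.0, 1.5]))
# The<<< cat (ci:0.90, act:1.50)>>>
```

Pairing this highlighted view with the raw text in an <example> block keeps whitespace-sensitive inputs (code, LaTeX) legible to the interpreting LLM.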