
(dev -> main) App updates and backwards transformer ci functions#355

Draft
danbraunai-goodfire wants to merge 572 commits into main from dev

Conversation

@danbraunai-goodfire
Collaborator

Description

Related Issue

Motivation and Context

How Has This Been Tested?

Does this PR introduce a breaking change?

@danbraunai-goodfire
Collaborator Author

@claude can you review this PR? Note that your review should cover the scope of this PR only. If you spot things unrelated to this PR, feel free to bring them up and we'll consider them for new issues. I want you to think about possible ways that the implementation could be improved. Including ways it could be re-written completely using different abstractions. I want critical comments, do feel comfortable to give me the hard truths.

@claude
Contributor

claude bot commented Feb 4, 2026

Claude finished @danbraunai-goodfire's task


PR Review: App updates and backwards transformer ci functions

Review Scope: 288 commits, 100 files changed, focusing on app updates and transformer CI functions

Todo List

  • Activate virtual environment and setup
  • Analyze major architectural changes in the app
  • Review new abstractions and components
  • Examine backend API changes
  • Review frontend component refactoring
  • Analyze config changes across experiments
  • Provide critical feedback on implementation approach
  • Suggest alternative architectures and improvements

Critical Review & Hard Truths

🔥 Major Architectural Problems

1. Backend Compute Layer is a God Object (spd/app/backend/compute.py:1-827)

  • Problem: 827 lines doing gradient computation, edge calculation, intervention logic, AND result formatting
  • Anti-pattern: Multiple responsibilities violating SRP
  • Rewrite suggestion:
    # Split into focused modules:
    spd/app/backend/compute/
      ├── gradient_engine.py     # Core autograd operations
      ├── attribution_builder.py # Edge construction logic  
      ├── intervention_runner.py # Forward passes with masks
      └── result_formatters.py   # Output transformation

2. Database Schema Fighting the Type System (spd/app/backend/database.py)

  • Problem: StoredGraph uses discriminated unions in Python, but database still has nullable columns for all possible fields
  • Result: Invalid states are representable at DB level despite type safety in Python
  • Solution: Either use proper table inheritance OR embrace the union and validate constraints in application code, don't half-ass both approaches
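One way to make the invalid half-null states unrepresentable without table inheritance is a table-level CHECK constraint tied to a discriminator column. A minimal sketch (column names are illustrative, not the actual StoredGraph schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE graphs (
        id INTEGER PRIMARY KEY,
        kind TEXT NOT NULL CHECK (kind IN ('display', 'optimized')),
        optim_params TEXT,
        optim_loss REAL,
        -- optimized graphs carry both optimization fields; display graphs neither
        CHECK (
            (kind = 'optimized' AND optim_params IS NOT NULL AND optim_loss IS NOT NULL)
            OR (kind = 'display' AND optim_params IS NULL AND optim_loss IS NULL)
        )
    )
""")

# Valid state: accepted.
conn.execute(
    "INSERT INTO graphs (kind, optim_params, optim_loss) VALUES ('optimized', '{}', 0.5)"
)

# Invalid half-null state: rejected at the DB level, mirroring the Python union.
rejected = False
try:
    conn.execute("INSERT INTO graphs (kind, optim_loss) VALUES ('display', 0.5)")
except sqlite3.IntegrityError:
    rejected = True
```

This keeps the single-table schema but moves the union invariant into the database instead of hoping application code validates it.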

3. Frontend State Management is Chaotic (PromptAttributionsTab.svelte:565+ lines)

  • Problem: No state machine - implicit transitions everywhere
  • Examples:
    • TabViewState + GraphComputeState + ComposerState can be in invalid combinations
    • UI can be in "loading" state while also in "draft" mode
  • Rewrite suggestion: Use XState or explicit state machine pattern. Current approach doesn't scale.

🚨 Code Quality Issues

4. Layer Aliasing Global State (spd/app/frontend/src/lib/layerAliasing.ts:57)

let cachedArchitecture: Architecture | null = null;
  • Problem: Global mutable cache that can become stale, not reset between contexts
  • Solution: Pass architecture context down explicitly or use a proper store

5. Bulk API Endpoint Design (spd/app/backend/routers/component_data.py)

  • Good: Eliminated GIL contention by batching requests
  • Bad: Still sequential execution inside the handler - missed opportunity for true parallelism
  • Alternative: Use asyncio.gather() to run the three data fetches concurrently
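A minimal sketch of the asyncio.gather() alternative; the three fetch functions are stand-ins for whatever the bulk handler actually loads sequentially:

```python
import asyncio

async def fetch_activations(component_id: int) -> str:
    await asyncio.sleep(0.01)  # stands in for real I/O or a thread offload
    return f"acts-{component_id}"

async def fetch_correlations(component_id: int) -> str:
    await asyncio.sleep(0.01)
    return f"corrs-{component_id}"

async def fetch_examples(component_id: int) -> str:
    await asyncio.sleep(0.01)
    return f"examples-{component_id}"

async def bulk_component_data(component_id: int) -> dict[str, str]:
    # The three independent fetches run concurrently instead of back-to-back.
    acts, corrs, examples = await asyncio.gather(
        fetch_activations(component_id),
        fetch_correlations(component_id),
        fetch_examples(component_id),
    )
    return {"activations": acts, "correlations": corrs, "examples": examples}

result = asyncio.run(bulk_component_data(7))
```

If the underlying work is blocking (e.g. tensor slicing under the GIL), each fetch would wrap its work in asyncio.to_thread() rather than awaiting it directly.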

6. Configuration Explosion (spd/configs.py)

  • Added a global_reverse_residual CI config with 7+ required fields and elaborate validation
  • Problem: Config complexity is growing faster than the value it provides
  • Question: Why not use dataclasses with defaults instead of elaborate Pydantic validation?
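A sketch of what the dataclass alternative could look like; field names here are invented for illustration, not the real global_reverse_residual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReverseResidualCIConfig:
    n_layers: int
    loss_coeff: float = 1.0
    normalize: bool = True
    layer_patterns: tuple[str, ...] = ("h.*.mlp",)

    def __post_init__(self) -> None:
        # Keep validation minimal and local instead of an elaborate validator tree.
        assert self.n_layers > 0, "n_layers must be positive"
        assert self.loss_coeff >= 0, "loss_coeff must be non-negative"

cfg = ReverseResidualCIConfig(n_layers=4)
```

Sensible defaults mean most experiment configs only override one or two fields, and the fail-fast asserts replace per-field Pydantic validators.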

💡 Missed Opportunities

7. TypeScript Discriminated Unions Done Halfway

// Current: Can still represent invalid states
type OptimizeConfigDraft = {
    loss: LossConfigDraft;  // Could be invalid
    // ...
}

// Better: Make invalid states unrepresentable  
type OptimizeState = 
  | { status: "draft"; config: Partial<OptimizeConfig> }
  | { status: "valid"; config: ValidOptimizeConfig }
  | { status: "computing"; config: ValidOptimizeConfig }

8. Database Caching Strategy

  • Current: Ad-hoc SQLite caching with complex uniqueness constraints
  • Alternative: Consider Redis/in-memory cache for computed graphs + persistent storage for configuration only
  • Benefit: Simpler schema, better performance for large graphs

🎯 Specific Technical Issues

9. Performance Anti-patterns:

  • compute_prompt_attributions() does full gradient computation even for display-only graphs
  • useComponentData and useComponentDataExpectCached are nearly identical - code duplication
  • Frontend doesn't virtualize large lists (could be 1000+ components)

10. Error Handling Inconsistency:

  • Some places use assert for validation (good, fail-fast)
  • Others use try/catch with graceful degradation (bad for research code)
  • Pick one approach: Research code should fail fast, not silently degrade
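The difference in miniature, with hypothetical helper names: the "graceful" variant produces a plausible-looking wrong number from a shape bug, while the fail-fast variant surfaces the bug immediately.

```python
def edge_strength_graceful(weights: list[float], acts: list[float]) -> float:
    # zip silently truncates to the shorter list: a length mismatch yields
    # a wrong partial sum instead of an error.
    return sum(w * a for w, a in zip(weights, acts))

def edge_strength_fail_fast(weights: list[float], acts: list[float]) -> float:
    assert len(weights) == len(acts), f"length mismatch: {len(weights)} vs {len(acts)}"
    return sum(w * a for w, a in zip(weights, acts))

# Mismatched inputs: graceful silently drops the third weight.
partial = edge_strength_graceful([1.0, 2.0, 3.0], [1.0, 1.0])
```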

🔨 Complete Rewrite Suggestions

Alternative 1: Event-Driven Architecture
Replace current request/response with events:

# Commands
from dataclasses import dataclass

@dataclass
class ComputeAttributionsCommand:
    prompt_id: int
    optimization_params: OptimParams | None

# Events
@dataclass
class AttributionsComputed:
    prompt_id: int
    graph_data: GraphData
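Wiring the command/event pair together needs only a small dispatcher. A self-contained sketch (repeating the two types so it runs standalone, with the attribution computation stubbed out):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ComputeAttributionsCommand:
    prompt_id: int

@dataclass
class AttributionsComputed:
    prompt_id: int
    graph_data: dict

class EventBus:
    """Minimal pub/sub: handlers consume a command and publish events."""

    def __init__(self) -> None:
        self._subscribers: list[Callable[[object], None]] = []

    def subscribe(self, fn: Callable[[object], None]) -> None:
        self._subscribers.append(fn)

    def publish(self, event: object) -> None:
        for fn in self._subscribers:
            fn(event)

def handle_compute(cmd: ComputeAttributionsCommand, bus: EventBus) -> None:
    graph = {"edges": []}  # stands in for the real attribution computation
    bus.publish(AttributionsComputed(prompt_id=cmd.prompt_id, graph_data=graph))

bus = EventBus()
received: list = []
bus.subscribe(received.append)
handle_compute(ComputeAttributionsCommand(prompt_id=1), bus)
```

The payoff is that caching, DB persistence, and progress streaming become subscribers rather than tangled into the compute path.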

Alternative 2: Functional Core + Imperative Shell
Move all computation to pure functions, keep I/O at edges:

def compute_attributions(model: Model, tokens: Tensor, config: Config) -> Graph:
    # Pure function - no side effects, easy to test
    pass

class AttributionService:
    # Imperative shell - handles caching, DB, progress callbacks
    pass

🚧 Breaking Changes Needed

The current architecture is trying to be too general. Consider:

  1. Remove the three graph types - just have "computed graphs" with optional optimization metadata
  2. Eliminate the database - store graphs as JSON files, much simpler
  3. Split the mega-components - PromptAttributionsTab should be 4-5 smaller components

Bottom Line: This codebase is evolving from research prototype to production-like complexity without the discipline. Either embrace the complexity with proper abstractions, or ruthlessly simplify back to research-code simplicity.

The current middle ground is the worst of both worlds.


@danbraunai-goodfire danbraunai-goodfire changed the title App updates and backwards transformer ci functions (dev -> main) App updates and backwards transformer ci functions Feb 6, 2026
claude-spd1 and others added 27 commits February 10, 2026 15:03
…diagram paths

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fire auto-parses JSON strings into Python dicts. Accept both str and dict
in worker scripts' config_json parameter.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… scope

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… paths

Fixes GPT2LMHeadModel compatibility: wte is at .transformer.wte not .wte.
Pass embedding_module and unembed_module from the adapter instead of
reaching into model internals.

Also: harvest defaults batch_size 128, n_batches 2000 (OOM at 256).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unembed_module is needed for output attributions even when lm_head is not
among the decomposed targets. Make unembed_path and unembed_module
non-optional on ModelAdapter.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace role-name-based ArchConfig (kv_roles, o_roles, qkv_group) with
full glob path patterns (kv_patterns, o_patterns) and per-model configs.
This fixes LlamaSimple not getting SwiGLU role groups and eliminates the
"attn" in path heuristic in _resolve_cross_seq_paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use h_torch.* patterns for custom GPT2 (matches registered ModuleList name)
- Rename test_gpt2_simple_noln -> test_gpt2_simple_partial_targets
- Clean up stale noln comments in tests
- Add swiglu group assertion for LlamaSimple test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…lity

Fire parses JSON null as the string "null", breaking Pydantic validation.
Exclude None values from serialized JSON — Pydantic fills in None defaults
during validation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Required for exclude_none JSON serialization — field must have a default
so Pydantic can fill it in when the key is absent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Find block index dynamically instead of assuming paths start with "h.".
Add sublayer descriptions for fused c_attn, SwiGLU gate/up projections.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Proposes TransformerTopology as a unified model structure abstraction,
replacing scattered path parsing and role detection across 6+ consumers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Core types (SeparateAttention/FusedAttention, StandardFFN/SwiGLUFFN,
BlockInfo, LayerInfo) and TransformerTopology class that maps concrete
module paths onto canonical abstract roles.

ArchConfigs declare role_mapping (glob pattern -> abstract role) for each
supported architecture. TransformerTopology resolves all modules at init,
builds blocks, and provides describe(), is_cross_seq_pair(), etc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ModelAdapter is now a thin re-export alias. All topology, embedding/unembed
resolution, cross-seq detection, role ordering, role groups, and display
names live on TransformerTopology.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove get_model_n_blocks() from compute.py (use topology.n_blocks)
- Remove _parse_layer_description() from compact_skeptical.py
  (use topology.describe() via ArchitectureInfo.layer_descriptions)
- Add convenience properties to topology: embedding_path, embedding_module,
  unembed_path, unembed_module, target_module_paths
- model_adapter.py is now a thin re-export alias

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rename adapter -> topology throughout:
- RunState.adapter -> RunState.topology
- compute.py parameter names
- All router references
- Test files

Delete spd/app/backend/model_adapter.py (no remaining imports).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tightens _extract_block_index to assert exactly one digit segment
exists, rather than silently returning the first match.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
LayerInfo is now just path + module. Roles are internal to __init__,
used only to sort paths into struct fields. kv_paths, o_paths, describe(),
and is_cross_seq_pair() all derive from block structs.

SwiGLU layers now get distinct descriptions ("SwiGLU gate/up/down")
rather than generic "MLP" labels.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ocg-goodfire and others added 30 commits March 6, 2026 17:19
* Fix duplicate defaults: context_length and max_turns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Deduplicate MAX_OUTPUT_NODES_PER_POS constant

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Simplify investigate module: single inv_id arg, fail-fast patterns

- run_agent reads all config from metadata.json instead of duplicating
  as CLI args (wandb_path, context_length, max_turns)
- wait_for_backend raises directly instead of returning bool
- _format_model_info accesses keys directly instead of .get() fallbacks

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…, investigate

- graph_interp/db.py: Extract parameterized _save_label/_get_label/_get_all_labels from 3x3 duplicated CRUD methods
- graph_interp/interpret.py: Unify process_output_layer/process_input_layer via _make_process_layer factory
- autointerp/prompt_helpers.py: Deduplicate build_fires_on_examples/build_says_examples into _build_examples
- graph_interp/prompts.py: Simplify _format_related string building with f-string
- investigate/agent_prompt.py: Replace repetitive config blocks with data-driven loop
- investigate/scripts/run_agent.py: Remove obvious docstrings, simplify fetch_model_info

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tale docs

Backend:
- graphs.py: Extract _build_loss_config, _build_loss_result, _maybe_pgd_config, _maybe_adv_pgd helpers
- server.py: Move deferred stdlib imports to module-level
- __init__.py: Fix __all__ ordering
- CLAUDE.md: Remove duplicate router entries
- sqlite.py: Fix stale docstring referencing old DB location

Frontend components:
- Deduplicate getTopEdgeAttributions into shared topEdgeAttributions() in promptAttributionsTypes.ts
- Extract generic parseSSEStream<T>() in graphs.ts, eliminating ~50 lines of duplicated SSE parsing
- Extract AVAILABILITY_COLUMNS in RunSelector, reducing ~60 lines of duplicated template
- Eliminate redundant computeMaxAbsComponentAct in ActivationContextsViewer + ClusterComponentCard
- Fix unreachable null check in ClusterComponentCard
- Fix mid-file import in ComponentNodeCard
- Remove dead fork handler stubs in PromptAttributionsTab
- Remove unused isRunEditable export, 5 unused CSS selectors, 12+ unnecessary comments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rt both-or-neither

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…n arg

It's a runtime value produced by the harvest step, not user config. Thread
it as a plain str arg through the call chain, matching how autointerp,
graph_interp, and intruder already do it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…s, add dataset name

- Use md.bullets() and md.numbered() instead of manual \n- lists
- Inline token_pmi_pairs (one-liner, used 3 times — not worth a helper)
- Add 'danbraunai/pile-uncopyrighted-tok' to DATASET_DESCRIPTIONS

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Frontend already called api.deletePrompt() but the endpoint was missing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…directly

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The feature landed but was renamed sans→ablated during the merge. Test was
using the old name. Now asserts on ablated/ablated_loss correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add data presentation context to autointerp prompts

The interpreter LLM had no context about how activation examples were
constructed or where output correlations were measured. This led to:
- Missing positional patterns (e.g. sequence-start components)
- Confusion about whether output tokens are measured at the component's
  layer or the model's final logits

Add a "Data presentation" section to both prompt strategies explaining:
- Model sequence length
- Window size and truncation at sequence boundaries
- That all token correlations are measured at the model's final output

Also adds seq_len to ModelMetadata and threads context_tokens_per_side
from harvest config through to the prompt builders.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Clarify output correlation explanation in data presentation

Only output correlations need explaining — they measure the model's
final predicted logits, not the component's direct output. Input
correlations are straightforward.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove hardcoded example labels from dual_view prompt

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Replace prescriptive 'say unclear' with epistemic honesty guidance

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Note that activation examples are uniformly sampled from all firings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add centralized metric definitions to data presentation section

Move metric explanations (recall, precision, PMI) out of inline labels
and into a shared definitions block. Key improvement: precision now
explains what low-precision-high-recall means (context-dependent firing),
which was a major blind spot for the interpreter.

Also clarifies the input vs output distinction: input = token at firing
position, output = model's predicted logits at that position.

Inline metric labels simplified to just the metric name since definitions
are now explained upfront.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add decomposition method descriptions, remove recall from prompts

- Add decomposition_method field to ModelMetadata with descriptions
  for SPD, CLT, and MOLT. Replaces hardcoded SPD context in both
  strategies.
- Remove recall metric from compact_skeptical (redundant with PMI +
  examples, and confusing alongside precision).
- Remove include_spd_context config option (now covered by
  decomposition method description in data presentation section).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix Md() usage: separate paragraphs into separate .p() calls

Md.p() already adds paragraph breaks — embedding \n\n inside a single
.p() call produces double spacing. Also unchain .bullets() from .h()
for clarity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
calc_kl_divergence_lm used F.kl_div(reduction="none") which materializes
a full [batch, seq, vocab] intermediate tensor (~13GB with eval_batch_size
=128, seq=512, vocab=50K). Called 6x per eval step in CEandKLLosses, this
was the memory high-water-mark causing OOMs on runs with many components.

Fix: use reduction="sum" which fuses the reduction into the kernel,
avoiding the intermediate. Divide by n_positions to match the original
mean-over-positions semantics.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
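The equivalence claimed above can be checked on toy shapes (illustrative sizes, not the real eval batch):

```python
import torch
import torch.nn.functional as F

batch, seq, vocab = 2, 4, 10
log_q = F.log_softmax(torch.randn(batch, seq, vocab), dim=-1)
p = F.softmax(torch.randn(batch, seq, vocab), dim=-1)

# Before: reduction="none" materializes the full [batch, seq, vocab]
# intermediate, then sums over vocab and averages over positions.
kl_none = F.kl_div(log_q, p, reduction="none").sum(-1).mean()

# After: reduction="sum" fuses the reduction into the kernel; dividing by
# n_positions recovers the same mean-over-positions semantics.
n_positions = batch * seq
kl_sum = F.kl_div(log_q, p, reduction="sum") / n_positions

assert torch.allclose(kl_none, kl_sum)
```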
* Add multi-provider LLM support for autointerp (Anthropic, OpenAI, OpenRouter)

Replace hardcoded OpenRouter SDK usage with a provider abstraction that routes
to the right API based on model string:
  - "/" in name → OpenRouter (google/gemini-3.1-pro-preview, etc.)
  - "claude-*" → first-party Anthropic API (tool_use for structured output)
  - "gpt-*"/"o*-*" → first-party OpenAI API (json_schema response format)

This enables using our corporate Anthropic/OpenAI keys directly, avoiding
OpenRouter's rate limits which were bottlenecking autointerp runs.

Key changes:
- New spd/autointerp/providers.py with LLMProvider ABC + 3 implementations
- llm_api.py now provider-agnostic (uses providers.py internally)
- Own ReasoningEffort type replaces openrouter.components.Effort everywhere
- All callers updated: api_key + model → auto-resolved provider
- get_api_key_for_model() reads the right env var per provider

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Encapsulate provider config: callers pass LLMProvider instead of raw strings

Move reasoning_effort into the provider (set at construction), so callers
pass a single LLMProvider object instead of (api_key, model, reasoning_effort).

Entry points call create_provider(model, reasoning_effort) which auto-resolves
the API key from env. Library functions just accept provider: LLMProvider.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix review findings: OpenAI o-series tokens, provider leak, dead field

- OpenAI o-series models need max_completion_tokens (not max_tokens)
- Close provider in app's on-demand interpretation endpoint
- Remove dead LLMJob.schema field (set everywhere, read nowhere)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Replace model+reasoning_effort with discriminated LLMConfig union

LLMConfig = OpenRouterLLMConfig | AnthropicLLMConfig | OpenAILLMConfig

Each variant carries only the fields that apply:
  - OpenRouter: model + reasoning_effort
  - Anthropic: model (no reasoning_effort — not supported)
  - OpenAI: model + reasoning_effort (for o-series)

All configs (AutointerpConfig, AutointerpEvalConfig, IntruderEvalConfig,
GraphInterpConfig) now have `llm: LLMConfig` instead of separate
`model: str` + `reasoning_effort` fields.

YAML format changes from:
  model: google/gemini-3.1-pro-preview
  reasoning_effort: low
to:
  llm:
    type: openrouter
    model: google/gemini-3.1-pro-preview
    reasoning_effort: low

Also removes stale OpenAI pricing (GPT-5 pricing TBD).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add thinking_budget support for Anthropic provider

AnthropicLLMConfig gains optional thinking_budget: int | None field.
When set, enables extended thinking with that token budget and bumps
max_tokens to cover both thinking + output.

Usage:
  llm:
    type: anthropic
    model: claude-sonnet-4-20250514
    thinking_budget: 8000

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove jose_autointerp config from PR

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add transcoder integration for the harvest pipeline

Extends the generic harvest pipeline (from #398) to support transcoders
from nn_decompositions. Adds TranscoderAdapter, TranscoderHarvestFn, and
TranscoderHarvestConfig so that trained transcoders (loaded from wandb
artifacts) can be harvested for activation statistics using the same
pipeline as SPD.

Includes an example script demonstrating end-to-end harvesting of
BatchTopK k=32 transcoders across all 4 LlamaSimpleMLP layers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Move tokenizer_name and dataset_name to TranscoderHarvestConfig

These were incorrectly hardcoded as "gpt2" and
"danbraunai/pile-uncopyrighted-tok" in the adapter. The transcoders
are actually trained with the EleutherAI/gpt-neox-20b tokenizer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Extract tokenizer and dataset from base model run info

Instead of requiring tokenizer_name and dataset_name in the harvest
config, extract them from the base model's PretrainRunInfo. The base
model's wandb run already stores the full training config including
hf_tokenizer_path and train_dataset_config.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Read dataloader config from base model run info

Use the base model's train_dataset_config directly instead of
hardcoding dataset fields. Only override streaming=True (for harvest)
and n_ctx=block_size (strip the extra label token).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Simplify dataloader: construct DatasetConfig from pretrain run config

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Derive model_class from actual model type instead of hardcoding

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix prerequisite in example script to use optional dependency

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Rename optional dependency from 'transcoder' to 'nn_decompositions'

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Make transcoder harvest launchable from CLI config

- Add adapter_from_config() that takes the full method_config, so
  TranscoderAdapter can be constructed in the harvest worker
- Keep adapter_from_id() for downstream consumers (autointerp, intruder)
  that only have a decomposition ID
- Replace Python example script with YAML config for spd-harvest
- Exclude transcoder files from basedpyright (optional nn_decompositions dep)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Vendor nn_decompositions transcoder code into spd/adapters/

Copies EncoderConfig and SharedTranscoder + subclasses (474 lines) from
bartbussmann/nn_decompositions (MIT) into spd/adapters/, eliminating the
optional dependency. Only torch + stdlib needed, both already deps.

- spd/adapters/encoder_config.py: EncoderConfig dataclass
- spd/adapters/transcoders.py: SharedTranscoder, Vanilla/TopK/BatchTopK/JumpReLU
- Remove nn_decompositions optional dep from pyproject.toml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix type errors in vendored transcoder code

- Split encode() into encode() and encode_dense() to avoid union return type
- Add type annotations to autograd.Function forward/backward methods
- Type _build_loss_dict return as dict[str, Any]
- Assert std is not None in postprocess_output, .grad in weight norm
- Use int() for dead_features.sum() passed to min()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove pyright ignores from vendored transcoder code

- Use *grad_outputs signature for autograd.Function.backward
- Replace @torch.no_grad() decorator with context manager
- Credit Bart Bussmann by name in vendored file docstrings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Make adapter_from_id work for transcoders via harvest DB lookup

For non-SPD decomposition IDs (e.g. tc-*), recover the full method
config from the harvest DB. This means spd-autointerp, intruder eval,
graph-interp, and label scoring all work with transcoders — no config
passing needed, just the decomposition ID.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove EncoderConfig defaults, add transcoder to DecompositionMethod, add paper_vis module

- EncoderConfig: all fields now required (values come from checkpoint config.json)
- Add "transcoder" to DecompositionMethod literal + description
- TranscoderAdapter.model_metadata: add seq_len and decomposition_method fields
- paper_vis/: dashboard generation, research post template, comparison visualizations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add CLT adapter/harvest, jose e2e TC support, rename vendored files

- CLTAdapter + CLTHarvestFn + CrossLayerTranscoder model for loading
  CLTs from wandb artifacts (single checkpoint covering all layers)
- CLTHarvestConfig with deterministic clt-{hash} IDs
- Filter extra e2e fields in TC checkpoint config.json so jose sweep
  transcoders load correctly via existing TranscoderAdapter
- Rename transcoders.py → transcoder_model.py, encoder_config.py merged
  into transcoder_model.py (matches clt.py / clt_model.py pattern)
- Download artifacts to SPD_OUT_DIR/checkpoints/ instead of CWD
- Remove MOLT (unused placeholder)
- Example YAMLs and smoke test scripts for all 3 decomposition types

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove one-off test/example scripts

These were useful during development but don't need to live in the repo.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove paper_vis from this branch (moved to stacked PR)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove pile_4L_fs_C_2x and pile_4L_fs_C_4x from registry

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove non-BatchTopK encoder types and activation_threshold

All transcoders and CLTs in our sweeps are BatchTopK. Remove Vanilla,
TopK, JumpReLU transcoder classes and the dispatch machinery. Simplify
CLT encode_layer to just BatchTopK. Remove activation_threshold from
TC/CLT harvest configs (no-op for BatchTopK which produces exact zeros).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add busy_timeout to NFS SQLite writes to handle concurrent access

Without this, concurrent detection + fuzzing jobs writing to the same
interp.db would immediately fail with "database is locked". Now SQLite
retries for up to 30s, which is more than enough for the ~ms writes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
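A minimal sketch of the busy_timeout setup described above (the connection path is illustrative):

```python
import sqlite3

def connect_with_retry(db_path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    # SQLite retries the lock for up to 30s instead of failing immediately
    # with "database is locked" when another writer holds the file.
    conn.execute("PRAGMA busy_timeout = 30000")
    return conn

conn = connect_with_retry(":memory:")
timeout_ms = conn.execute("PRAGMA busy_timeout").fetchone()[0]
```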

* Save detection/fuzzing scores incrementally as trials complete

Previously, all LLM API calls ran first, then scores were saved in a
separate loop. A crash during saves (e.g. SQLite lock contention) would
lose all unsaved results despite the API calls having completed.

Now each component's score is saved immediately when its last trial
arrives. Components with any errored trials are skipped (unreliable).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Extract shared pretrain_dataloader helper, add type hint to TC hooks

Addresses PR review: deduplicate identical dataloader logic in CLT and
Transcoder adapters into pretrain_dataloader() in base.py.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Get block_size from model_config_dict instead of passing it through

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: bartbussmann <bartbussmann@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* Add alpha sweep script: CE loss vs fixed source value

Sweeps alpha in [0, 1] where mask = CI + (1 - CI) * alpha for all
components. At alpha=0 masks equal CI (CI-masked), at alpha=1 all
components are unmasked. Supports multiple models for comparison.

Usage: python spd/scripts/alpha_sweep/alpha_sweep.py <run_ids...>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix batch extraction from data loader dict

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add --labels, --plot-only, JSON data saving, log scale plot, use r notation

- Save sweep data as JSON alongside plot for re-plotting without recompute
- --plot-only flag to regenerate plots from saved data
- --labels flag for custom legend labels
- Both linear and log scale plots generated
- Use r (not alpha) for source variable in titles, axes, annotations
- Annotations use rightarrow and are slightly inset from axes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Rename alpha -> r consistently in code, CLI args, and function names

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude SPD1 <claude_spd1@proton.me>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add sweep summary stats script for generating markdown reports from WandB runs

Takes a list of WandB run IDs and produces a markdown report with raw values
and summary statistics (mean, std) for all key metrics: CE/KL output quality,
eval/train losses, per-module hidden acts reconstruction, and CI-L0 sparsity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Use decimal formatting consistently and add plain-text summary list

Replace scientific notation with decimal places throughout. Add a final
"All Summary Statistics" section with one line per metric in the format
"<name>: <mean> (std: <std>)".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix missing newlines in summary statistics list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Increase decimal precision for very small values

Values below 1e-6 and 1e-10 now get 10 and 14 decimal places respectively,
so that std of metrics like FaithfulnessLoss doesn't round to zero.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
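A minimal sketch of the precision rule described above. The 1e-6 and 1e-10 thresholds with 10 and 14 decimal places are from the commit; the `fmt` name and the 6-place default are illustrative assumptions:

```python
def fmt(value: float) -> str:
    # Tiered decimal formatting so very small stds don't round to zero:
    # <1e-10 -> 14 places, <1e-6 -> 10 places, otherwise 6 (assumed default).
    mag = abs(value)
    if mag < 1e-10:
        return f"{value:.14f}"
    if mag < 1e-6:
        return f"{value:.10f}"
    return f"{value:.6f}"
```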

* Improve summary list: add section headers, disambiguate metric names

L0 metrics now prefixed with "L0", train losses prefixed with "train",
and grouped under bold section headers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add blank line before section headings in summary list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add target model info section to sweep summary report

Fetches the pretrained target model run from WandB and reports its
architecture (layers, hidden dim, heads, MLP width, context, vocab),
training dataset, train/val loss, and training steps.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix newlines in target model section, label loss as CE

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add Training Compute Recovered metric

For each masking mode (unmasked, stochastic, CI, rounded), computes the SPD
model's effective CE and finds how far through target model training (as a
percentage of steps) the val loss matched it. Interpolates linearly on the
target val loss curve.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
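A minimal sketch of the linear interpolation described, with illustrative names; it assumes the val-loss history is already monotone non-increasing (np.interp needs ascending x, so we interpolate step as a function of loss on the reversed arrays):

```python
import numpy as np

def compute_recovered_pct(spd_ce, steps, val_losses):
    """Percent of target training at which val loss matched spd_ce.

    steps and val_losses trace the target model's training curve;
    val_losses is assumed monotone non-increasing.
    """
    losses = np.asarray(val_losses, dtype=float)
    steps = np.asarray(steps, dtype=float)
    # Reverse so the x-axis (loss) is ascending for np.interp.
    step_at_loss = np.interp(spd_ce, losses[::-1], steps[::-1])
    return 100.0 * step_at_loss / steps[-1]
```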

* Use isotonic regression for Training Compute Recovered interpolation

Fits a monotone decreasing curve to the target model's noisy val loss
history using sklearn's IsotonicRegression, then interpolates on the
smoothed curve. This uses all data points rather than just the two
nearest, giving more robust estimates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
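The commit uses sklearn's IsotonicRegression; to keep the sketch dependency-free, here is the same monotone-decreasing fit via pool-adjacent-violators with unit weights, which is the algorithm isotonic regression computes:

```python
def isotonic_decreasing(y):
    """Pool-adjacent-violators fit of a non-increasing sequence.

    Fits non-decreasing on the negated values, then negates back.
    Adjacent blocks that violate monotonicity are merged and replaced
    by their mean, using all data points rather than the two nearest.
    """
    neg = [-v for v in y]
    blocks = []  # (sum, count) per merged block
    for v in neg:
        blocks.append((v, 1))
        # Merge while the last block's mean is below the previous one's.
        while len(blocks) > 1 and blocks[-1][0] * blocks[-2][1] < blocks[-2][0] * blocks[-1][1]:
            s2, c2 = blocks.pop()
            s1, c1 = blocks.pop()
            blocks.append((s1 + s2, c1 + c2))
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return [-v for v in fitted]
```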

* Plot raw and isotonic-fitted target val loss curve

Saves a PNG alongside the report showing the raw val loss points and the
monotone isotonic regression fit used for compute-recovered interpolation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Replace isotonic regression with bidirectional EMA for val loss smoothing

Uses a forward + backward EMA (alpha=0.15) averaged together, giving a
smooth curve without lag or staircase artifacts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Use forward-only EMA to avoid flattening early steep drop

Bidirectional EMA was pulling early points down toward later values.
Forward-only with alpha=0.3 preserves the steep initial descent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
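The forward-only EMA the commit settles on is a one-liner per point; `forward_ema` is an illustrative name, with alpha=0.3 as stated above:

```python
def forward_ema(values, alpha=0.3):
    """Forward-only exponential moving average.

    Unlike a bidirectional EMA, this does not pull early points toward
    later values, so a steep initial descent is preserved.
    """
    out = []
    ema = values[0]  # seed with the first observation
    for v in values:
        ema = alpha * v + (1 - alpha) * ema
        out.append(ema)
    return out
```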

* Add LaTeX summary table with CE loss and compute recovered by masking mode

Includes unmasked, stochastic, CI, and rounded (CI > 0) modes with
mean +/- std, plus target model baseline row.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update rounded masks label in LaTeX table

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove stds from LaTeX summary table

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix unmasked label: "All masks=1" not "All CI=1"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add LaTeX tables for eval recon losses and sparsity with n_alive

Adds three LaTeX tables to the report:
1. Masking mode quality (CE + compute recovered) — already existed
2. Eval reconstruction losses (StochRecon, PGD, HiddenActs)
3. Sparsity per layer (C, Alive, Mean L0, L0/C %)

n_alive is sourced from the harvest DB via --harvest-run flag, since
the sweep runs themselves may not be harvested.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add LaTeX table for training losses

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Move ImportanceMinimalityLoss to last in training losses table

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove alpha sweep script (moved to feature/alpha-sweep branch)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude SPD1 <claude_spd1@proton.me>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
)

Adds:
- spd/scripts/collect_attention_patterns.py — shared utility for collecting
  per-head attention patterns from SPD component models
- spd/scripts/rope_aware_qk.py — RoPE-aware QK inner product computation
  with multi-offset support
- spd/scripts/plot_qk_c_attention_contributions/ — per-head grid plots of
  QK pair attention contributions with caching and selective plot generation

Co-authored-by: Claude SPD1 <claude_spd1@proton.me>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
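A sketch of what a RoPE-aware QK inner product can look like. The pairing layout, base 10000, and function names are assumptions rather than what rope_aware_qk.py actually does; the point illustrated is that, because RoPE rotations form a group per coordinate pair, the logit depends only on the query-key offset:

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply rotary position embedding to a vector at position pos.

    Rotates each (x[i], x[i + d/2]) pair by angle pos * base**(-2i/d),
    the usual half-split RoPE layout (an assumption here).
    """
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)
    angles = pos * freqs
    x1, x2 = x[..., :half], x[..., half:]
    cos, sin = np.cos(angles), np.sin(angles)
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def qk_logit_at_offset(q, k, offset):
    # RoPE is relative: rotating q by the offset and leaving k fixed
    # gives the same logit as rotating both by their absolute positions.
    return float(rope_rotate(q, offset) @ k)
```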
* Add QK attention contribution plots and shared attention utilities

Adds:
- spd/scripts/collect_attention_patterns.py — shared utility for collecting
  per-head attention patterns from SPD component models
- spd/scripts/rope_aware_qk.py — RoPE-aware QK inner product computation
  with multi-offset support
- spd/scripts/plot_qk_c_attention_contributions/ — per-head grid plots of
  QK pair attention contributions with caching and selective plot generation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add previous-token head detection script

Detects SPD components that implement previous-token attention patterns.
Includes both crafted-prompt and random-token evaluation modes.

Depends on spd/scripts/collect_attention_patterns.py from the QK attention
contributions PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
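One plausible scoring rule for previous-token detection, sketched under assumptions (the actual script's metric and thresholds may differ):

```python
import numpy as np

def prev_token_score(attn):
    """Mean attention mass on the immediately preceding token.

    attn: (seq, seq) attention pattern with rows as query positions.
    A score near 1 suggests a previous-token head; this scoring rule
    is an illustrative guess, not the script's exact measure.
    """
    seq = attn.shape[0]
    return float(np.mean([attn[q, q - 1] for q in range(1, seq)]))
```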

---------

Co-authored-by: Claude SPD1 <claude_spd1@proton.me>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add attention ablation experiment suite

Component-level attention ablation analysis including:
- Single-component and multi-pair fractional attention change plots
- Attention pattern difference visualization
- Generation with ablated components for output comparison
- Prev-token head redundancy testing

Depends on detect_prev_token_heads and collect_attention_patterns from
earlier PRs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude SPD1 <claude_spd1@proton.me>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Multiple harvest workers downloading the same artifact concurrently
could read a partially-written file. Use O_CREAT|O_EXCL lockfile
(atomic on NFS) so one process downloads while others poll for a
.complete sentinel. 5 minute timeout on the poll.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
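The lock-then-poll protocol above can be sketched as follows (names and the 1-second poll interval are illustrative; the O_CREAT|O_EXCL atomicity claim is the commit's):

```python
import os
import time

def acquire_or_wait(lock_path, done_path, timeout=300.0):
    """Return True if this process should download, False if another did.

    O_CREAT|O_EXCL makes lockfile creation atomic, so exactly one
    worker wins; the rest poll for the .complete sentinel file.
    """
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True  # we hold the lock: download, then create done_path
    except FileExistsError:
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if os.path.exists(done_path):
                return False  # another worker finished the download
            time.sleep(1.0)
        raise TimeoutError(f"timed out waiting for {done_path}")
```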
Visualizes the L2 norms of SPD component U and V matrices projected into
each attention head's subspace, showing how components distribute across
heads. Produces per-layer heatmaps and bar charts.

Co-authored-by: Claude SPD1 <claude_spd1@proton.me>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Plots mean attention weight and pre-softmax QK logit by relative offset
for each head. Produces per-layer grids with dual y-axes showing how
attention distributes across token distances (offset tau = query_pos - key_pos).

Co-authored-by: Claude SPD1 <claude_spd1@proton.me>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Analyzes how attention head W_V weight matrices share subspaces via Gram
matrix cosine similarity. Includes:
- Unweighted and data-variance-weighted subspace overlap heatmaps
- Combined paper figure (side-by-side)
- Component-head amplification heatmap (||W_V^h @ v_c||)
- LaTeX writeup of overlap metrics

Co-authored-by: Claude SPD1 <claude_spd1@proton.me>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
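One plausible reading of the Gram-matrix cosine similarity named above, sketched with illustrative names (the actual script's weighting and conventions may differ):

```python
import numpy as np

def gram_cosine_overlap(weights):
    """Pairwise subspace overlap between weight matrices.

    For each head's W (d_head x d_model), form the Gram matrix
    G = W.T @ W (d_model x d_model) and compare heads via the
    Frobenius cosine <G_i, G_j> / (||G_i|| ||G_j||).
    """
    grams = [w.T @ w for w in weights]
    n = len(grams)
    out = np.eye(n)  # self-overlap is 1 by construction
    for i in range(n):
        for j in range(i + 1, n):
            num = np.sum(grams[i] * grams[j])
            den = np.linalg.norm(grams[i]) * np.linalg.norm(grams[j])
            out[i, j] = out[j, i] = num / den
    return out
```

Heads reading from orthogonal subspaces score 0; identical subspaces score 1.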
* Add rich_examples autointerp strategy and compare tab

New autointerp strategy (rich_examples) that shows per-token CI and activation
values inline, letting the LLM judge evidence quality directly. Also adds an
Autointerp Compare tab to the app for side-by-side comparison of interpretation
results across different strategies/models/subruns.

Backend: 3 new endpoints for listing subruns, bulk headlines, and detail.
Frontend: SubrunSelector (multiselect chips), stacked SubrunInterpCard, two-panel
AutointerpComparer with full component data on the right panel.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Restrict Anthropic autointerp models and use structured outputs

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Fix rich_examples prompt: explain signed component activations

Adds explanation to the SPD decomposition description that component
activation sign is arbitrary (inner product with read direction) and
does not indicate suppression. Trims redundant legend text.

Also adds render_prompt.py script for iterating on prompt templates.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Expose snapshot_branch in spd-autointerp CLI

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Improve rich_examples prompt clarity

- Show raw text before annotated version in examples (helps with dense
  token sequences like code/LaTeX)
- Add explicit explanation of <<<token (ci:X, act:Y)>>> format
- Add "consider evidence critically" paragraph from dual_view

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Use XML blocks with raw + highlighted text in rich_examples examples

Replaces sanitized single-line format with:
  <example>
  <raw>...unmodified text...</raw>
  <highlighted>...<<<token (ci:X, act:Y)>>>...</highlighted>
  </example>

Adds AppTokenizer.get_raw_spans for LLM prompt rendering where actual
whitespace (newlines, indentation) is meaningful.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
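A sketch of rendering the highlighted view described above. The function name, the 0.1 CI threshold, and the 2-decimal formatting are illustrative assumptions; only the `<<<token (ci:X, act:Y)>>>` format comes from the commits:

```python
def render_highlighted(tokens, cis, acts, threshold=0.1):
    """Annotate tokens whose CI exceeds a threshold.

    Emits the <<<token (ci:X, act:Y)>>> inline format; unannotated
    tokens pass through unchanged so raw whitespace is preserved.
    """
    parts = []
    for tok, ci, act in zip(tokens, cis, acts):
        if ci > threshold:
            parts.append(f"<<<{tok} (ci:{ci:.2f}, act:{act:.2f})>>>")
        else:
            parts.append(tok)
    return "".join(parts)
```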

* Show all subruns in autointerp comparer, not just .done ones

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
* Add autointerp_subrun_id to scoring CLI and InterpRepo.open_subrun

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Remove confidence field from autointerp + improve act legend

Drops the confidence field entirely from InterpretationResult, all DB
schemas, JSON output schemas, prompts, API responses, and frontend UI.

Expands the act legend in rich_examples to explain that sign is
meaningful within a component's examples even though the global
convention is arbitrary — polarity may indicate distinct input patterns.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>