Improve rich_examples autointerp prompt + remove confidence field by ocg-goodfire · Pull Request #458 · goodfire-ai/spd

ocg-goodfire · 2026-03-18T18:05:08Z

Summary

rich_examples prompt improvements:

Fix signed-activation misinterpretation: a local _DECOMPOSITION_DESCRIPTIONS now explains that the sign of component_activation is arbitrary (it is the inner product with the read direction v_i) and does not indicate suppression
Expand the act legend to explain that polarity is meaningful within a component: examples may cluster by sign, with each sign representing a distinct input pattern
Show each example in both raw and highlighted form (XML format) so dense token sequences (code, LaTeX, multilingual text) stay readable alongside the annotations
Add a "consider evidence critically" paragraph and an explicit explanation of the <<<token (ci:X, act:Y)>>> format
Add AppTokenizer.get_raw_spans for LLM prompt rendering with literal whitespace (no control-char escaping)
Add render_prompt.py script for iterating on prompt templates without loading a full run
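A minimal sketch of why the activation sign carries no global meaning. It assumes, for illustration only, a rank-one component that reads with direction v and writes with direction u, so its contribution is (v · x) u. Negating both v and u describes the same component: the output is unchanged while the activation flips sign. The helpers (dot, scale, component) are hypothetical and not code from this PR.

```python
# Hypothetical illustration, not the spd implementation.
# Assumes a rank-one component: act = v . x, output = act * u.


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def scale(c: float, a: list[float]) -> list[float]:
    return [c * x for x in a]


def component(x: list[float], v: list[float], u: list[float]) -> tuple[float, list[float]]:
    """Return (activation, output) for a rank-one component."""
    act = dot(v, x)  # inner product with the read direction
    return act, scale(act, u)  # contribution along the write direction


x = [1.0, -2.0, 0.5]
v = [0.5, 0.25, 2.0]
u = [1.0, 0.0, -1.0]

act, out = component(x, v, u)
# Same component under the opposite sign convention: negate both directions.
act_flipped, out_flipped = component(x, scale(-1.0, v), scale(-1.0, u))

print(act, out)  # 1.0 [1.0, 0.0, -1.0]
print(act_flipped, out_flipped)  # -1.0 [1.0, 0.0, -1.0]
```

A prompt that reads "negative activation means suppression" would therefore hinge on an arbitrary choice of convention, which is what the new decomposition description warns the interpreter about.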

Remove confidence field:

Drops confidence from InterpretationResult, all DB schemas, JSON output schemas, prompts, API responses, and frontend UI (27 files, 229 deletions)
Removes confidence badges/CSS from InterpretationBadge, GraphInterpBadge, SubrunInterpCard, EdgeAttributionList, ModelGraph

Autointerp tooling:

Expose --snapshot_branch on the spd-autointerp CLI so SLURM jobs run from a specific git branch
Add InterpRepo.open_subrun(run_id, subrun_id) to open a specific subrun by ID
Add --autointerp_subrun_id to scoring CLI to target a specific subrun
The autointerp compare tab now lists all subruns, regardless of whether they have a .done marker
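A sketch of the filtering change, assuming (hypothetically) that each subrun is a directory under its run directory and that a finished subrun contains a .done marker file. list_subruns and the directory names are illustrative, not the actual comparer code. The old behavior corresponds to only_done=True; the compare tab now lists every subrun, as with the default here.

```python
import tempfile
from pathlib import Path


def list_subruns(run_dir: Path, only_done: bool = False) -> list[tuple[str, bool]]:
    """Return (subrun_id, is_done) for every subrun directory under run_dir."""
    subruns = []
    for d in sorted(p for p in run_dir.iterdir() if p.is_dir()):
        done = (d / ".done").exists()
        if only_done and not done:
            continue  # old behavior: hide subruns without a .done marker
        subruns.append((d.name, done))
    return subruns


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "subrun_a").mkdir()
    (root / "subrun_a" / ".done").touch()
    (root / "subrun_b").mkdir()  # no .done marker (still running, or interrupted)
    all_subruns = list_subruns(root)
    done_only = list_subruns(root, only_done=True)

print(all_subruns)  # [('subrun_a', True), ('subrun_b', False)]
print(done_only)  # [('subrun_a', True)]
```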

Test plan

python -m spd.autointerp.scripts.render_prompt renders correctly
make check passes (basedpyright + ruff)
make check-app passes (svelte-check + eslint + prettier)

🤖 Generated with Claude Code

Adds an explanation to the SPD decomposition description that the sign of the component activation is arbitrary (it is an inner product with the read direction) and does not indicate suppression. Trims redundant legend text. Also adds a render_prompt.py script for iterating on prompt templates.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>


- Show raw text before the annotated version in examples (helps with dense token sequences like code/LaTeX)
- Add an explicit explanation of the <<<token (ci:X, act:Y)>>> format
- Add the "consider evidence critically" paragraph from dual_view

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
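Why literal whitespace matters for code-like examples, as a sketch. It assumes token spans are available as (start, end) character offsets into the original text; that interface is hypothetical and not necessarily how AppTokenizer.get_raw_spans works. Slicing the source text keeps newlines and tabs intact, whereas a display-oriented escape (here a made-up table mapping newline and tab to visible \n and \t) forces the LLM to mentally decode the layout.

```python
def raw_spans(text: str, offsets: list[tuple[int, int]]) -> list[str]:
    """Slice token spans straight from the source text, keeping whitespace literal."""
    return [text[start:end] for start, end in offsets]


# Display-style escaping (hypothetical): make control characters visible.
_ESCAPES = str.maketrans({"\n": "\\n", "\t": "\\t"})


def escaped_spans(spans: list[str]) -> list[str]:
    return [s.translate(_ESCAPES) for s in spans]


text = "def f():\n\treturn 1"
offsets = [(0, 3), (3, 5), (5, 8), (8, 9), (9, 10), (10, 16), (16, 18)]

spans = raw_spans(text, offsets)
escaped = escaped_spans(spans)

print("".join(spans))  # prints on two lines, indented like the source
print("".join(escaped))  # def f():\n\treturn 1  (one line, escapes visible)
```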

Replaces the sanitized single-line format with:

    <example>
      <raw>...unmodified text...</raw>
      <highlighted>...<<<token (ci:X, act:Y)>>>...</highlighted>
    </example>

Adds AppTokenizer.get_raw_spans for LLM prompt rendering, where actual whitespace (newlines, indentation) is meaningful.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
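A sketch of rendering one example in this format. The TokenStat record, the render_example name, and the rule that a token is highlighted when its ci value reaches a hypothetical ci_threshold are all assumptions for illustration; the PR does not specify how highlighted tokens are selected.

```python
from dataclasses import dataclass


@dataclass
class TokenStat:
    text: str  # raw token text, whitespace kept literally
    ci: float  # causal importance of the component on this token
    act: float  # component activation (global sign convention is arbitrary)


def render_example(tokens: list[TokenStat], ci_threshold: float = 0.5) -> str:
    """Render one example as <raw> text followed by a <highlighted> copy."""
    raw = "".join(t.text for t in tokens)
    highlighted = "".join(
        f"<<<{t.text} (ci:{t.ci:.2f}, act:{t.act:.2f})>>>" if t.ci >= ci_threshold else t.text
        for t in tokens
    )
    return (
        "<example>\n"
        f"<raw>{raw}</raw>\n"
        f"<highlighted>{highlighted}</highlighted>\n"
        "</example>"
    )


tokens = [
    TokenStat("def", 0.1, 0.0),
    TokenStat(" f", 0.9, 1.3),
    TokenStat("(x):\n", 0.2, 0.1),
    TokenStat("    return", 0.7, -0.8),
    TokenStat(" x", 0.05, 0.0),
]
rendered = render_example(tokens)
print(rendered)
```

The raw block lets the reader see the code exactly as written; the highlighted block carries the per-token ci and act annotations without having to be parsed for layout.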


Drops the confidence field entirely from InterpretationResult, all DB schemas, JSON output schemas, prompts, API responses, and frontend UI. Expands the act legend in rich_examples to explain that sign is meaningful within a component's examples even though the global convention is arbitrary: polarity may indicate distinct input patterns.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
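Why polarity can matter within a component even though the global convention is arbitrary: flipping the convention negates every activation of that component at once, so the split of its examples into two sign groups survives and only the labels swap. A sketch, where split_by_polarity and the per-example peak activation are hypothetical:

```python
def split_by_polarity(examples: list[tuple[str, float]]) -> dict[str, list[str]]:
    """Group (example_text, peak_act) pairs by the sign of the activation."""
    groups: dict[str, list[str]] = {"positive": [], "negative": []}
    for text, act in examples:
        # Zero counts as positive here; an arbitrary tie-break for the sketch.
        groups["positive" if act >= 0 else "negative"].append(text)
    return groups


examples = [("if x > 0:", 2.1), ("for i in", -1.7), ("while y:", 1.4), ("def g(", -0.9)]
groups = split_by_polarity(examples)

# Opposite global sign convention: every activation of this component is negated.
flipped = split_by_polarity([(text, -act) for text, act in examples])

print(groups)  # {'positive': ['if x > 0:', 'while y:'], 'negative': ['for i in', 'def g(']}
print(flipped)  # {'positive': ['for i in', 'def g('], 'negative': ['if x > 0:', 'while y:']}
```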


ocg-goodfire and others added 7 commits March 18, 2026 15:23

Expose snapshot_branch in spd-autointerp CLI

8744104

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Show all subruns in autointerp comparer, not just .done ones

d456f50

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Add autointerp_subrun_id to scoring CLI and InterpRepo.open_subrun

91aa2f9

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

ocg-goodfire changed the base branch from main to dev March 18, 2026 18:05

Resolve merge conflict in rich_examples.py (keep expanded act legend)

673adb9

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

ocg-goodfire marked this pull request as ready for review March 18, 2026 18:11

ocg-goodfire merged commit 16b583f into dev Mar 18, 2026
2 checks passed

ocg-goodfire changed the title from "Fix/autointerp activations explanation" to "Improve rich_examples autointerp prompt + remove confidence field" Mar 18, 2026

ocg-goodfire merged 8 commits into dev from fix/autointerp-activations-explanation
