Improve rich_examples autointerp prompt + remove confidence field #458

Merged
ocg-goodfire merged 8 commits into dev from fix/autointerp-activations-explanation
Mar 18, 2026

@ocg-goodfire (Collaborator) commented Mar 18, 2026

Summary

rich_examples prompt improvements:

  • Fix signed activation misinterpretation: local _DECOMPOSITION_DESCRIPTIONS explains that component_activation sign is arbitrary (inner product with read direction v_i) and does not indicate suppression
  • Expand act legend to explain polarity is meaningful within a component — examples may cluster by sign, representing distinct input patterns
  • Show raw + highlighted XML example format so dense token sequences (code, LaTeX, multilingual) are readable alongside annotations
  • Add "consider evidence critically" paragraph and explicit <<<token (ci:X, act:Y)>>> format explanation
  • Add AppTokenizer.get_raw_spans for LLM prompt rendering with literal whitespace (no control-char escaping)
  • Add render_prompt.py script for iterating on prompt templates without loading a full run
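
To illustrate the sign-arbitrariness point in the first bullet, here is a minimal sketch. It takes from the bullet that a component's activation is the inner product of the input with a read direction v_i, and it additionally assumes a rank-one component with a write direction u_i, which this PR does not state. Under that assumption, negating both directions leaves the component's output unchanged while flipping the sign of the activation, so a negative activation on its own cannot mean suppression. All names are hypothetical and not taken from the SPD codebase.

```python
# Hypothetical illustration; names are not taken from the SPD codebase.

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def component_activation(x, v):
    """Activation = inner product of the input x with the read direction v."""
    return dot(x, v)


def component_output(x, v, u):
    """Assumed rank-one component: the write direction u scaled by the activation."""
    act = component_activation(x, v)
    return [act * u_j for u_j in u]


x = [1.0, -2.0, 0.5]       # toy input
v = [0.5, 0.25, -1.0]      # read direction
u = [2.0, -1.0]            # write direction (assumed, see lead-in)

# Flip the sign convention of both directions.
v_neg = [-c for c in v]
u_neg = [-c for c in u]

act = component_activation(x, v)
act_flipped = component_activation(x, v_neg)
out = component_output(x, v, u)
out_flipped = component_output(x, v_neg, u_neg)

# The activation changes sign, but the component's contribution does not.
assert act == -act_flipped
assert out == out_flipped
```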

Remove confidence field:

  • Drops confidence from InterpretationResult, all DB schemas, JSON output schemas, prompts, API responses, and frontend UI (27 files, 229 deletions)
  • Removes confidence badges/CSS from InterpretationBadge, GraphInterpBadge, SubrunInterpCard, EdgeAttributionList, ModelGraph

Autointerp tooling:

  • Expose --snapshot_branch on spd-autointerp CLI so SLURM jobs run from a specific git branch
  • Add InterpRepo.open_subrun(run_id, subrun_id) to open a specific subrun by ID
  • Add --autointerp_subrun_id to scoring CLI to target a specific subrun
  • Autointerp compare tab now lists all subruns regardless of .done marker

Test plan

  • python -m spd.autointerp.scripts.render_prompt renders correctly
  • make check passes (basedpyright + ruff)
  • make check-app passes (svelte-check + eslint + prettier)

🤖 Generated with Claude Code

ocg-goodfire and others added 7 commits March 18, 2026 15:23
Adds explanation to the SPD decomposition description that component
activation sign is arbitrary (inner product with read direction) and
does not indicate suppression. Trims redundant legend text.

Also adds render_prompt.py script for iterating on prompt templates.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Show raw text before annotated version in examples (helps with dense
  token sequences like code/LaTeX)
- Add explicit explanation of <<<token (ci:X, act:Y)>>> format
- Add "consider evidence critically" paragraph from dual_view

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Replaces sanitized single-line format with:
  <example>
  <raw>...unmodified text...</raw>
  <highlighted>...<<<token (ci:X, act:Y)>>>...</highlighted>
  </example>

Adds AppTokenizer.get_raw_spans for LLM prompt rendering where actual
whitespace (newlines, indentation) is meaningful.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
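
The raw + highlighted layout described in the commit message above can be sketched roughly as follows. This is not the PR's rendering code: `TokenSpan`, `render_example`, the `ci_threshold` cutoff, and the number formatting are all assumptions made for illustration. The sketch only shows the idea of emitting the unmodified text (with literal whitespace) next to the `<<<token (ci:X, act:Y)>>>` annotated version.

```python
# Hypothetical sketch; not the PR's actual rendering code.
from dataclasses import dataclass


@dataclass
class TokenSpan:
    text: str   # literal token text, whitespace preserved (no escaping)
    ci: float   # the "ci" value shown in the annotation
    act: float  # signed component activation


def render_example(spans, ci_threshold=0.5):
    """Render one example as a <raw> block plus a <highlighted> block.

    Tokens whose ci is at or above the (assumed) threshold are wrapped
    as <<<token (ci:X, act:Y)>>>; all other tokens are emitted unchanged.
    """
    raw = "".join(s.text for s in spans)
    parts = []
    for s in spans:
        if s.ci >= ci_threshold:
            parts.append(f"<<<{s.text} (ci:{s.ci:.2f}, act:{s.act:+.2f})>>>")
        else:
            parts.append(s.text)
    highlighted = "".join(parts)
    return (
        "<example>\n"
        f"<raw>{raw}</raw>\n"
        f"<highlighted>{highlighted}</highlighted>\n"
        "</example>"
    )


# Toy spans, made up for illustration.
spans = [
    TokenSpan("def", ci=0.0, act=0.0),
    TokenSpan(" foo", ci=0.9, act=-1.25),
    TokenSpan("():\n", ci=0.1, act=0.2),
]
rendered = render_example(spans)
```

Keeping the raw text separate means a newline or indentation inside a token stays readable even when the highlighted line is crowded with annotations.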
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Drops the confidence field entirely from InterpretationResult, all DB
schemas, JSON output schemas, prompts, API responses, and frontend UI.

Expands the act legend in rich_examples to explain that sign is
meaningful within a component's examples even though the global
convention is arbitrary — polarity may indicate distinct input patterns.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
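
The expanded legend says a component's examples may cluster by sign. One way to inspect that, sketched under assumptions, is to split a component's examples by the sign of their largest-magnitude activation. The function name, the input shape (a list of `(label, activations)` pairs), and the toy values are hypothetical; nothing here is taken from the PR's code.

```python
# Hypothetical sketch; the data below is made up for illustration.

def group_by_polarity(examples):
    """Split examples by the sign of their largest-magnitude activation.

    `examples` is a list of (label, activations) pairs, where activations
    is a non-empty list of signed per-token activation values.
    """
    groups = {"positive": [], "negative": []}
    for label, acts in examples:
        peak = max(acts, key=abs)  # activation with the largest magnitude
        key = "positive" if peak >= 0 else "negative"
        groups[key].append(label)
    return groups


examples = [
    ("example A", [0.1, 2.0, -0.3]),
    ("example B", [-1.5, 0.2]),
    ("example C", [0.4, 0.9]),
]
groups = group_by_polarity(examples)
```

If the two groups look qualitatively different, that suggests the two polarities correspond to distinct input patterns, which is the case the legend now asks the interpreter to consider.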
@ocg-goodfire changed the base branch from main to dev March 18, 2026 18:05
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@ocg-goodfire marked this pull request as ready for review March 18, 2026 18:11
@ocg-goodfire merged commit 16b583f into dev Mar 18, 2026
2 checks passed
@ocg-goodfire changed the title from "Fix/autointerp activations explanation" to "Improve rich_examples autointerp prompt + remove confidence field" Mar 18, 2026