Parent: #5 (H-1 — structural fix shipped)
Description
H-1 shipped with unit tests that prove the structural invariant: untrusted text cannot escape its <evidence> / <profile> delimiters, control chars and zero-width chars are stripped, angle brackets are defanged, claims >4KB are truncated. What's missing is the behavioral check — confirming that a capable LLM, when given the sanitized input, is not steered by embedded injection attempts.
Current State
- Structural tests passing at
packages/core/src/__tests__/sanitize.test.ts and packages/scoring/src/__tests__/signal-extractor.test.ts (5 injection-defense tests).
- Eval harness ready at
apps/cli/scripts/h1-adversarial-eval.mjs — 5 adversarial fixtures (direct instruction, authority impersonation, tag forgery, markdown hijack, social engineering) paired with baseline-junior evidence.
- Blocked on:
ANTHROPIC_API_KEY was not available in the Claude Code session (OAuth token only). Eval was not run.
Suggested Fix
Verification
Automation Hints
scope: apps/cli/scripts, docs/evals
do-not-touch: packages
approach: add-tests
risk: low
max-files-changed: 2
blocked-by: none
bail-if: fewer than 5/5 defended (then needs human review, not auto-close)
Priority
Medium — structural fix is in place; this upgrades confidence from "provably can't escape" to "empirically doesn't steer."
Cost
~$0.05 on Sonnet (5 fixtures × 2 LLM calls each = 10 calls, ~30–60s wall time).
Parent: #5 (H-1 — structural fix shipped)
Description
H-1 shipped with unit tests that prove the structural invariant: untrusted text cannot escape its
<evidence>/<profile>delimiters, control chars and zero-width chars are stripped, angle brackets are defanged, claims >4KB are truncated. What's missing is the behavioral check — confirming that a capable LLM, when given the sanitized input, is not steered by embedded injection attempts.Current State
packages/core/src/__tests__/sanitize.test.tsandpackages/scoring/src/__tests__/signal-extractor.test.ts(5 injection-defense tests).apps/cli/scripts/h1-adversarial-eval.mjs— 5 adversarial fixtures (direct instruction, authority impersonation, tag forgery, markdown hijack, social engineering) paired with baseline-junior evidence.ANTHROPIC_API_KEYwas not available in the Claude Code session (OAuth token only). Eval was not run.Suggested Fix
ANTHROPIC_API_KEYand run:node apps/cli/scripts/h1-adversarial-eval.mjsdocs/evals/h1-adversarial-2026-04-XX.mdfor historical reference.Verification
docs/evals/.Automation Hints
scope: apps/cli/scripts, docs/evals
do-not-touch: packages
approach: add-tests
risk: low
max-files-changed: 2
blocked-by: none
bail-if: fewer than 5/5 defended (then needs human review, not auto-close)
Priority
Medium — structural fix is in place; this upgrades confidence from "provably can't escape" to "empirically doesn't steer."
Cost
~$0.05 on Sonnet (5 fixtures × 2 LLM calls each = 10 calls, ~30–60s wall time).