A pluggable framework for using a cheap aux-LLM as a judge over agent traces. The same primitive feeds risk scoring (#619), leak-detection enrichment (#465), brand-voice scoring (#551), trajectory rating for fine-tuning (#553), and periodic agent performance reviews.
Priority: High Priority (P1) — foundational | Manifesto: Principles VII (audit + reversibility) and VIII (doesn't break overnight).
Why foundational: without this, every "LLM-as-judge" use case in the roadmap (risk, leak, brand, trajectory selection, performance review) reinvents the same primitive — model dispatch, trace windowing, redaction, and an eval harness for the judge itself. Centralizing it is a 2-3x leverage win.
Children:
DoD: any feature that needs an LLM rating of a trace plugs in with one subscriber registration; judge model + prompt are versioned; judge accuracy is measurable.
Depends on: F2 (#415) ✅, F4 (#417) ✅, F5 (#418) ✅.
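To make the "one subscriber registration" DoD concrete, here is a minimal sketch of what the pluggable primitive could look like. Everything here is hypothetical — `JudgeSpec`, `JudgeRegistry`, the windowing and redaction steps, and the stand-in `score` callable are illustrative names, not the actual implementation; the real judge would call the cheap aux-LLM instead of a lambda.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class JudgeSpec:
    """A versioned judge: model id + prompt version, per the DoD (hypothetical shape)."""
    name: str
    model: str
    prompt_version: str
    prompt_template: str
    score: Callable[[str], float]  # stand-in for the aux-LLM call

class JudgeRegistry:
    """Central primitive: each use case (risk, leak, brand, ...) plugs in one subscriber."""
    def __init__(self) -> None:
        self._judges: dict[str, JudgeSpec] = {}

    def register(self, spec: JudgeSpec) -> None:
        self._judges[spec.name] = spec

    def judge(self, name: str, trace: str, window: int = 2000) -> dict:
        spec = self._judges[name]
        windowed = trace[-window:]  # naive trace windowing: keep the tail
        redacted = windowed.replace("sk-", "sk-[REDACTED]")  # toy redaction pass
        return {
            "judge": spec.name,
            "model": spec.model,
            "prompt_version": spec.prompt_version,  # versioned, so accuracy is measurable per version
            "score": spec.score(redacted),
        }

# One registration is all a new use case needs:
registry = JudgeRegistry()
registry.register(JudgeSpec(
    name="risk",
    model="cheap-aux-model",
    prompt_version="v1",
    prompt_template="Rate the risk of this trace 0-1:\n{trace}",
    score=lambda trace: 0.9 if "rm -rf" in trace else 0.0,
))

result = registry.judge("risk", "agent ran: rm -rf /tmp/scratch")
print(result["score"], result["prompt_version"])  # → 0.9 v1
```

Keeping model dispatch, windowing, and redaction inside the registry (rather than in each subscriber) is what makes the judge's own accuracy measurable: every consumer goes through the same versioned path.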