[Foundation F11] Aux-LLM trace-judge framework #628

@furukama

A pluggable framework for using a cheap aux-LLM as a judge over agent traces. The same primitive feeds risk scoring (#619), leak-detection enrichment (#465), brand-voice scoring (#551), trajectory rating for fine-tuning (#553), and periodic agent performance reviews.

Priority: High Priority (P1) — foundational | Manifesto: Principles VII (audit + reversibility) and VIII (doesn't break overnight).

Why foundation: without this, every "LLM-as-judge" use case in the roadmap (risk, leak, brand, trajectory selection, performance review) reinvents the same primitive — model dispatch, trace windowing, redaction, eval harness for the judge itself. Centralizing it is a 2-3x leverage win.
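As a rough illustration of two of the shared pieces named above (trace windowing and redaction), here is a minimal sketch. Everything in it is an assumption made for illustration: the `TraceEvent` shape, the email-only redaction rule, and the `windows` helper are hypothetical and do not come from this issue or its dependencies.

```python
import re
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class TraceEvent:
    """Hypothetical trace record; the real trace schema is not specified here."""
    step: int
    role: str      # e.g. "user", "agent", "tool"
    content: str


# Assumed redaction rule for illustration only: mask email-like strings.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text: str) -> str:
    """Replace email-like substrings before the text reaches the judge model."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


def windows(
    events: list[TraceEvent], size: int, overlap: int = 0
) -> Iterator[list[TraceEvent]]:
    """Yield redacted, fixed-size windows over a trace, with optional overlap."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    stride = size - overlap
    for start in range(0, len(events), stride):
        chunk = events[start:start + size]
        yield [TraceEvent(e.step, e.role, redact(e.content)) for e in chunk]
        if start + size >= len(events):
            break  # the last window already reached the end of the trace
```

Each consumer (risk, leak, brand, and so on) would otherwise have to write its own version of these two helpers, which is the duplication this issue is meant to avoid.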

Children:

DoD: any feature that needs an LLM rating of a trace plugs in with one subscriber registration; judge model + prompt are versioned; judge accuracy is measurable.
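One way the DoD could look in practice is sketched below, purely as an assumption about the shape of the API: `JudgeSpec`, `Verdict`, `JudgeRegistry`, `judge_accuracy`, and `stub_model` are hypothetical names, and the stub stands in for a real aux-LLM call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class JudgeSpec:
    """Hypothetical versioned judge definition: model and prompt travel together."""
    name: str
    model: str
    prompt_version: str
    prompt: str


@dataclass(frozen=True)
class Verdict:
    judge: str
    model: str
    prompt_version: str
    score: float


# (model, prompt, trace_text) -> score in [0, 1]; assumed signature.
ModelFn = Callable[[str, str, str], float]
Subscriber = Callable[[Verdict], None]


class JudgeRegistry:
    """Dispatches each trace to every registered judge and notifies its subscriber."""

    def __init__(self, call_model: ModelFn) -> None:
        self._call_model = call_model
        self._subs: list[tuple[JudgeSpec, Subscriber]] = []

    def subscribe(self, spec: JudgeSpec, on_verdict: Subscriber) -> None:
        # The single registration call a feature needs in order to plug in.
        self._subs.append((spec, on_verdict))

    def judge(self, trace_text: str) -> list[Verdict]:
        verdicts = []
        for spec, on_verdict in self._subs:
            score = self._call_model(spec.model, spec.prompt, trace_text)
            verdict = Verdict(spec.name, spec.model, spec.prompt_version, score)
            on_verdict(verdict)
            verdicts.append(verdict)
        return verdicts


def judge_accuracy(scores: list[float], labels: list[bool], threshold: float = 0.5) -> float:
    """Fraction of human-labeled traces where the thresholded score matches the label."""
    if not labels or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and the same length")
    hits = sum((s >= threshold) == y for s, y in zip(scores, labels))
    return hits / len(labels)


def stub_model(model: str, prompt: str, trace_text: str) -> float:
    """Stand-in for the aux-LLM call; NOT a real model client."""
    return 1.0 if "password" in trace_text.lower() else 0.0
```

A feature such as leak detection would then register once, for example `registry.subscribe(JudgeSpec("leak", "aux-small", "v1", "Rate leak risk."), handler)`, and every verdict it receives carries the model name and prompt version that produced it. The model name and prompt here are placeholders.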

Depends on: F2 (#415) ✅, F4 (#417) ✅, F5 (#418) ✅.

Labels: foundation (Foundational primitive blocking other roadmap work), priority/P1 (High Priority, depth work), roadmap (Trusted Coworker roadmap)
