[P1] Define advanced execution policy for eval engines and judges #31
Open
Labels
documentation, enhancement
Description
Objective
Define how one logical eval run expands into engine executions and judge passes, and how those results are reported.
Priority
P1 — Should Fix
Details
The current eval flow assumes a single engine and a single judge pass per case. The design backlog raises two related questions: whether the same case can run across multiple engines, and whether LLM-as-judge results need stabilization via repeated judging or majority vote. Both questions affect the same execution and reporting model, so they should be resolved together.
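To make the question concrete, here is a minimal sketch of how one logical run might expand into engine executions and repeated judge passes, with a majority vote over the judge verdicts. Everything in it is an assumption for discussion: `EngineExecution`, `expand_run`, `majority_verdict`, the `judge_passes` parameter, and the tie-handling rule are hypothetical names and choices, not part of the current eval flow.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class EngineExecution:
    """One (case, engine) pair; hypothetical shape, not the current model."""
    case_id: str
    engine: str
    judge_passes: int  # how many times the judge scores this execution


def expand_run(case_ids, engines, judge_passes=1):
    """Expand one logical run into one engine execution per (case, engine)."""
    if judge_passes < 1:
        raise ValueError("judge_passes must be at least 1")
    return [
        EngineExecution(case_id, engine, judge_passes)
        for case_id, engine in product(case_ids, engines)
    ]


def majority_verdict(verdicts):
    """Return the most common verdict, or None if empty or tied for first."""
    if not verdicts:
        return None
    counts = Counter(verdicts).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # assumption: a tie is reported as undecided
    return counts[0][0]
```

With one engine and `judge_passes=1` this reduces to the current single-engine, single-pass behavior. How a tie should be reported (undecided, fail, or rerun) is one of the open decisions this issue would need to settle.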
Acceptance Criteria
- Matrix mode is either specified or explicitly deferred
- The default judge execution policy is defined
- If advanced execution modes are supported, the CLI and report format are defined clearly
- Report semantics remain interpretable and comparable across runs
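One possible way to keep reports comparable is to record the execution policy in the report itself, so that two runs are compared only when their policies match. The sketch below illustrates that idea only; `ExecutionPolicy`, its fields, the `"single"` and `"majority"` aggregation values, and the `execution_policy` report key are all hypothetical and not an agreed format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExecutionPolicy:
    """Hypothetical record of how a run was executed and judged."""
    engines: tuple
    judge_passes: int = 1
    aggregation: str = "single"  # assumed values: "single" or "majority"


def report_header(policy):
    """Serialize the policy so it can be stored alongside the results."""
    return json.dumps({"execution_policy": asdict(policy)}, sort_keys=True)


def comparable(a, b):
    """Treat two runs as directly comparable only if their policies match."""
    return a == b
```

Under this assumption, a default run and a three-pass majority-vote run would carry different headers, and a report tool could refuse or flag a direct comparison between them.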
Notes
Source: work/eval-design-discussion.md