[P1] Define advanced execution policy for eval engines and judges #31

@spazyCZ

Objective

Define how one logical eval run expands into engine executions and judge passes, and how those results are reported.

Priority

P1 — Should Fix

Details

The current eval flow assumes one engine and one judge pass per case. The design backlog raises two further questions: whether the same case can run across multiple engines (matrix mode), and whether LLM-as-judge results need stabilizing through repeated judging or majority vote. Both questions affect the same execution and reporting model, so they should be resolved together.
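To make the two questions concrete, here is a minimal sketch of how one logical run could expand into engine executions and judge passes, and how repeated verdicts could be reduced by majority vote. Everything in it is hypothetical and only for discussion: the names `Execution`, `expand_run`, `majority_verdict`, and `judge_passes`, the verdict strings, and the choice to report a tie as `inconclusive` are assumptions, not part of the current eval flow or a decided policy.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Execution:
    """One unit of work: a case run on an engine, judged in one pass (hypothetical)."""
    case_id: str
    engine: str
    judge_pass: int


def expand_run(case_ids, engines, judge_passes=1):
    """Expand one logical run into every (case, engine, judge pass) combination."""
    if judge_passes < 1:
        raise ValueError("judge_passes must be at least 1")
    passes = range(1, judge_passes + 1)
    return [
        Execution(case_id, engine, judge_pass)
        for case_id, engine, judge_pass in product(case_ids, engines, passes)
    ]


def majority_verdict(verdicts):
    """Return the most common verdict, or 'inconclusive' when first place is tied."""
    if not verdicts:
        raise ValueError("at least one verdict is required")
    ranked = Counter(verdicts).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "inconclusive"
    return ranked[0][0]


# Assumed example: 2 cases x 2 engines x 3 judge passes = 12 executions.
executions = expand_run(["case-1", "case-2"], ["engine-a", "engine-b"], judge_passes=3)
print(len(executions))

# Three judge passes for one (case, engine) pair reduced to a single verdict.
print(majority_verdict(["pass", "fail", "pass"]))
```

With the defaults (`judge_passes=1` and a single engine) the expansion reduces to the current one-engine, one-pass behavior, which is one way to keep existing reports comparable with runs that use an advanced mode.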

Acceptance Criteria

  • Matrix mode is either specified or explicitly deferred
  • The default judge execution policy is defined
  • If advanced execution modes are supported, the CLI and report format are defined clearly
  • Report semantics remain interpretable and comparable across runs

Notes

Source: work/eval-design-discussion.md

Metadata


Labels: documentation, enhancement
