
HOK-1391: Add explicit plan critique to eval judge #378

Merged. timogilvie merged 12 commits into main from task/tighten-the-planning-judge on Apr 23, 2026.

@timogilvie (Owner)

Summary

  • Adds a planCritique rubric to the eval judge that scores plans directly across five dimensions: component_boundaries, invariant_coverage, approach_soundness, missed_patches, and overall
  • Introduces a standalone tools/prompts/plan-critique.md prompt for evaluating plans in isolation (useful when you have the plan and diff but want a focused critique without running the full eval)
  • Bumps the eval schema to v1.9.0 with PlanCritique / PlanCritiqueDimension TypeScript interfaces and attaches the critique to the plan stage score when available
  • Enables direct model comparison on planning quality without inferring it purely from downstream diff cleanliness
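For reviewers who want a concrete picture of the rubric, here is a minimal sketch of what the critique types might look like. Only the names PlanCritique and PlanCritiqueDimension and the five dimension names come from this PR; the `score` and `rationale` fields, the 0 to 1 scale, and the `isPlanCritique` guard are assumptions for illustration and may not match shared/lib/eval-schema.ts.

```typescript
// Hypothetical shapes: the interface names and the five dimension names are
// from the PR description; every field and helper below is an assumption.
type PlanCritiqueDimensionName =
  | "component_boundaries"
  | "invariant_coverage"
  | "approach_soundness"
  | "missed_patches"
  | "overall";

interface PlanCritiqueDimension {
  score: number; // assumed to be on a 0..1 scale
  rationale: string; // assumed free-text justification from the judge
}

type PlanCritique = Record<PlanCritiqueDimensionName, PlanCritiqueDimension>;

const DIMENSIONS: readonly PlanCritiqueDimensionName[] = [
  "component_boundaries",
  "invariant_coverage",
  "approach_soundness",
  "missed_patches",
  "overall",
];

// Runtime guard (hypothetical helper): true only when all five dimensions are
// present, each with an in-range numeric score and a string rationale.
function isPlanCritique(value: unknown): value is PlanCritique {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return DIMENSIONS.every((name) => {
    const dim = record[name];
    if (typeof dim !== "object" || dim === null) return false;
    const { score, rationale } = dim as Record<string, unknown>;
    return (
      typeof score === "number" &&
      score >= 0 &&
      score <= 1 &&
      typeof rationale === "string"
    );
  });
}
```

A guard like this would let a caller drop a malformed judge response rather than attach a partial critique to the record.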

Changes

  • tools/prompts/eval-judge.md — new Plan Critique section instructs the judge to emit planCritique when an implementation plan artifact is present
  • tools/prompts/plan-critique.md — new standalone prompt for targeted plan evaluation
  • shared/lib/eval-schema.ts — v1.9.0: PlanCritique, PlanCritiqueDimension types; planCritique field on StageScore
  • shared/lib/eval-record-builder.ts — attachStageOutcomes now accepts and attaches planCritique to the plan stage
  • shared/lib/eval-schema.test.ts, eval-record-builder.test.ts, eval.test.js, llm-cli.ts/test.ts — schema version bump and related test updates
  • tests/lifecycle-harness.test.sh, tests/lifecycle-scenarios.test.sh — stub _restore_inflight_task_window_if_missing to fix lifecycle harness tests
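The attach behavior described for the record builder can be sketched roughly as follows. This is not the real attachStageOutcomes: the function name `attachPlanCritiqueSketch`, the `StageScoreSketch` shape, and the `"plan"` stage identifier are stand-ins assumed for illustration. It only shows the intended rule that the critique lands on the plan stage when provided and is omitted otherwise.

```typescript
// Hypothetical stand-ins for the real schema types in shared/lib/eval-schema.ts.
type PlanCritiqueSketch = Record<string, { score: number; rationale: string }>;

interface StageScoreSketch {
  stage: string; // assumed stage identifier, e.g. "plan" or "implement"
  score: number;
  planCritique?: PlanCritiqueSketch;
}

// Returns new stage objects; only the plan stage receives the critique, and
// the key is left off entirely when no critique was produced.
function attachPlanCritiqueSketch(
  stages: readonly StageScoreSketch[],
  planCritique?: PlanCritiqueSketch,
): StageScoreSketch[] {
  return stages.map((stage) =>
    stage.stage === "plan" && planCritique !== undefined
      ? { ...stage, planCritique }
      : { ...stage },
  );
}
```

Leaving the key off, rather than setting it to `undefined` or `null`, keeps records without an implementation plan the same shape as before the change.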

Test plan

  • Existing eval schema tests pass with the v1.9.0 schema version
  • eval-record-builder tests confirm planCritique is attached to the plan stage when provided and omitted otherwise
  • Manually verify that the eval judge prompt produces a planCritique object for workflows that include an implementation plan and omits it when the plan is unavailable
  • Standalone plan-critique.md prompt can be used directly with a task prompt + plan content + optional diff
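For the standalone prompt, one way to assemble its inputs is sketched below. The PR only states that the prompt takes a task prompt, plan content, and an optional diff; the `buildPlanCritiqueInput` helper, the `PlanCritiqueInput` shape, and the section headings are assumptions, not something tools/prompts/plan-critique.md is known to require.

```typescript
// Hypothetical input shape for the standalone plan critique prompt.
interface PlanCritiqueInput {
  taskPrompt: string;
  plan: string;
  diff?: string; // optional, per the test plan
}

// Joins the inputs into labeled sections; the diff section is skipped when the
// diff is absent or blank. The heading text is an assumption.
function buildPlanCritiqueInput({ taskPrompt, plan, diff }: PlanCritiqueInput): string {
  const sections = [
    `## Task\n${taskPrompt.trim()}`,
    `## Plan\n${plan.trim()}`,
  ];
  if (diff !== undefined && diff.trim() !== "") {
    sections.push(`## Diff\n${diff.trim()}`);
  }
  return sections.join("\n\n");
}
```

The resulting string would be appended to the prompt text before it is sent to the model.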

Self-review

  • Review tool verdict: ready (exit code 0, no findings)
  • Iterations run: 1

Closes HOK-1391

timogilvie merged commit 923dae0 into main on Apr 23, 2026