HOK-1391_c: Add explicit plan critique scoring to eval pipeline by timogilvie · Pull Request #379 · timogilvie/wavemill

timogilvie · 2026-04-23T01:39:39Z

Summary

Adds a standalone plan-critique.md prompt that judges plan quality across five dimensions (component boundaries, invariants, interface contracts, missed patches, overall) independently of diff cleanliness
Extends the eval judge prompt (eval-judge.md) with a new ## Plan Critique section so plan quality is captured in every eval that has a plan artifact
Wires parsing and storage of PlanCritique results through eval.ts and eval-schema.ts, surfacing them at EvalRecord.metadata.planCritique for GEPA training attribution
Covers new code paths with unit tests in eval.test.js for both presence and absence of plan artifacts

Changes

tools/prompts/plan-critique.md — new standalone plan critique prompt (5 scored dimensions, structured JSON output)
tools/prompts/eval-judge.md — new ## Plan Critique section appended to existing judge prompt
shared/lib/eval.ts — parse planCritique from judge response and store in EvalRecord.metadata
shared/lib/eval-schema.ts — add PlanCritique and PlanCritiqueDimension interfaces; bump schema to 1.9.0
shared/lib/eval.test.js — unit tests for plan critique parsing (valid, missing, malformed)
tests/lifecycle-harness.test.sh / tests/lifecycle-scenarios.test.sh — test fixture stubs
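
The schema and parser changes listed above might look roughly like the following. This is a minimal sketch, not the code in this PR: the field names (`dimensions`, `score`, `rationale`), the 0 to 1 score range, the choice to drop a critique with an out-of-range score, and the helper names `parsePlanCritique` and `attachPlanCritique` are all assumptions made for illustration. Only the interface names and the `metadata.planCritique` location come from the description.

```typescript
// Hypothetical shapes; the real interfaces in eval-schema.ts may differ.
interface PlanCritiqueDimension {
  score: number; // assumed range: 0..1
  rationale: string;
}

interface PlanCritique {
  dimensions: { [name: string]: PlanCritiqueDimension };
}

// Validate an untrusted value and return a PlanCritique, or undefined
// when the value is missing or malformed.
function parsePlanCritique(raw: unknown): PlanCritique | undefined {
  if (typeof raw !== "object" || raw === null) return undefined;
  const dims = (raw as { dimensions?: unknown }).dimensions;
  if (typeof dims !== "object" || dims === null || Array.isArray(dims)) {
    return undefined;
  }
  const source = dims as { [name: string]: unknown };
  const out: { [name: string]: PlanCritiqueDimension } = {};
  for (const name of Object.keys(source)) {
    const value = source[name];
    if (typeof value !== "object" || value === null) return undefined;
    const { score, rationale } = value as { score?: unknown; rationale?: unknown };
    // Assumption: an out-of-range or non-numeric score rejects the whole critique.
    if (typeof score !== "number" || !isFinite(score) || score < 0 || score > 1) {
      return undefined;
    }
    out[name] = { score, rationale: typeof rationale === "string" ? rationale : "" };
  }
  return Object.keys(out).length > 0 ? { dimensions: out } : undefined;
}

// Attach the critique to record metadata only when it parses cleanly,
// so records without a plan artifact keep their existing shape.
function attachPlanCritique(
  metadata: { [key: string]: unknown },
  judgeJson: string
): { [key: string]: unknown } {
  let parsed: unknown;
  try {
    parsed = JSON.parse(judgeJson);
  } catch {
    return metadata; // judge response was not valid JSON
  }
  if (typeof parsed !== "object" || parsed === null) return metadata;
  const critique = parsePlanCritique((parsed as { planCritique?: unknown }).planCritique);
  return critique ? { ...metadata, planCritique: critique } : metadata;
}
```

Returning the metadata unchanged on any failure keeps the field optional, which matches the stated goal of covering both presence and absence of plan artifacts.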

Test plan

Unit tests in eval.test.js cover plan critique extraction when present, absent, and malformed
Lifecycle harness stubs prevent test interference from the new _restore_inflight_task_window_if_missing call

Self-review

Ran self-review tool (iteration 1): one warning — dead planCritique field on StageScore interface (field was never populated; actual storage is in EvalRecord.metadata)
Fixed: removed the dead field from StageScore, corrected the version changelog comment
Ran self-review tool (iteration 2): verdict ready, zero findings

Closes HOK-1391_c

Add planCritique field to eval records to directly measure plan quality across 4 dimensions (component boundaries, invariant coverage, approach soundness, missed patches) rather than inferring it from diff cleanliness.

Changes:
- Schema (eval-schema.ts): Add PlanCritique + PlanCritiqueDimension interfaces, add optional planCritique to StageScore, bump schema to 1.9.0
- Prompts: Add plan-critique.md standalone prompt, extend eval-judge.md with plan critique section when plan artifact is available
- Parser (eval.ts): Parse and validate planCritique from judge response, include in record.metadata, bump SCHEMA_VERSION to 1.9.0
- Tests (eval.test.js): Add coverage for valid planCritique parsing, graceful omission when plan unavailable, and out-of-range score handling

The planCritique field is optional at every layer to maintain backward compatibility. Enables head-to-head plan quality comparisons for model selection without requiring full workflow execution.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
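
A head-to-head comparison of the kind this commit message mentions could be sketched as below. This is an illustration under stated assumptions, not code from the PR: the `model` field, the `dimensions` layout, and the function name `meanPlanScoreByModel` are hypothetical. Only `metadata.planCritique` and its optionality are taken from the text.

```typescript
// Hypothetical record shape; only metadata.planCritique is named in the PR.
interface DimensionLike {
  score: number;
}

interface EvalRecordLike {
  model: string; // assumed field, for illustration only
  metadata?: {
    planCritique?: { dimensions: { [name: string]: DimensionLike } };
  };
}

// Average the per-record mean dimension score for each model.
// Records without a planCritique (for example, older schema versions)
// are skipped rather than counted as zero.
function meanPlanScoreByModel(records: EvalRecordLike[]): Map<string, number> {
  const totals = new Map<string, { sum: number; count: number }>();
  for (const record of records) {
    const critique = record.metadata?.planCritique;
    if (!critique) continue;
    const names = Object.keys(critique.dimensions);
    if (names.length === 0) continue;
    let sum = 0;
    for (const name of names) sum += critique.dimensions[name].score;
    const entry = totals.get(record.model) ?? { sum: 0, count: 0 };
    entry.sum += sum / names.length;
    entry.count += 1;
    totals.set(record.model, entry);
  }
  const means = new Map<string, number>();
  totals.forEach((entry, model) => means.set(model, entry.sum / entry.count));
  return means;
}
```

Because the field is optional, a model whose records all predate the schema bump simply does not appear in the result instead of receiving a misleading score.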


timogilvie and others added 4 commits April 22, 2026 21:34

fix: remove dead planCritique field from StageScore interface

7715570

planCritique is stored in EvalRecord.metadata, not inside StageScore objects. Correct the version comment accordingly. Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main' into task/tighten-the-plan…

4462c1e

…ning-judge-challenger

fix: Resolve ready-check failure (attempt 2/3)

92d2f82

timogilvie wants to merge 4 commits into main from task/tighten-the-planning-judge-challenger
