HOK-1391_c: Add explicit plan critique scoring to eval pipeline #379

Open
timogilvie wants to merge 4 commits into main from task/tighten-the-planning-judge-challenger
Summary

  • Adds a standalone plan-critique.md prompt that judges plan quality across five dimensions (component boundaries, invariants, interface contracts, missed patches, overall) independently of diff cleanliness
  • Extends the eval judge prompt (eval-judge.md) with a new ## Plan Critique section so plan quality is captured in every eval that has a plan artifact
  • Wires parsing and storage of PlanCritique results through eval.ts and eval-schema.ts, surfacing them at EvalRecord.metadata.planCritique for GEPA training attribution
  • Covers new code paths with unit tests in eval.test.js for both presence and absence of plan artifacts
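
As a rough illustration of where the critique ends up, the sketch below models the five dimensions listed above and reads the overall score from `EvalRecord.metadata.planCritique`. The field names (`score`, `rationale`, `componentBoundaries`, and so on) and the helper `overallPlanScore` are assumptions made for this example; the actual definitions are in `shared/lib/eval-schema.ts` and may differ.

```typescript
// Hypothetical shapes for illustration; not copied from eval-schema.ts.
interface PlanCritiqueDimension {
  score: number;     // assumed: numeric score assigned by the judge
  rationale: string; // assumed: short free-text justification
}

interface PlanCritique {
  componentBoundaries: PlanCritiqueDimension;
  invariants: PlanCritiqueDimension;
  interfaceContracts: PlanCritiqueDimension;
  missedPatches: PlanCritiqueDimension;
  overall: PlanCritiqueDimension;
}

interface EvalRecord {
  schemaVersion: string;
  // planCritique is optional: it is only present when a plan artifact exists.
  metadata: { planCritique?: PlanCritique; [key: string]: unknown };
}

// Hypothetical helper: returns the overall plan score, or undefined when
// the record carries no plan critique.
function overallPlanScore(record: EvalRecord): number | undefined {
  return record.metadata.planCritique?.overall.score;
}
```

Because the field is optional, consumers such as training attribution can treat a missing critique as "no plan artifact" rather than as an error.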

Changes

  • tools/prompts/plan-critique.md — new standalone plan critique prompt (5 scored dimensions, structured JSON output)
  • tools/prompts/eval-judge.md — new ## Plan Critique section appended to existing judge prompt
  • shared/lib/eval.ts — parse planCritique from judge response and store in EvalRecord.metadata
  • shared/lib/eval-schema.ts — add PlanCritique and PlanCritiqueDimension interfaces; bump schema to 1.9.0
  • shared/lib/eval.test.js — unit tests for plan critique parsing (valid, missing, malformed)
  • tests/lifecycle-harness.test.sh / tests/lifecycle-scenarios.test.sh — test fixture stubs
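
The parsing step in `eval.ts` could look roughly like the sketch below. It is not the code from this PR: it assumes the judge response is a JSON string with a top-level `planCritique` key, that scores fall in a 1 to 5 range, and that any invalid dimension causes the whole critique to be dropped. The function name `parsePlanCritique`, the dimension keys, and the range constants are all hypothetical.

```typescript
type Dimension = { score: number; rationale: string };
type Critique = Record<string, Dimension>;

// Assumed dimension keys and score range; the real prompt may use others.
const DIMENSIONS = [
  "componentBoundaries",
  "invariants",
  "interfaceContracts",
  "missedPatches",
  "overall",
] as const;
const MIN_SCORE = 1;
const MAX_SCORE = 5;

// Returns the validated critique, or undefined when it is missing or malformed.
function parsePlanCritique(judgeResponse: string): Critique | undefined {
  let parsed: unknown;
  try {
    parsed = JSON.parse(judgeResponse);
  } catch {
    return undefined; // response is not valid JSON
  }
  if (typeof parsed !== "object" || parsed === null) return undefined;

  const raw = (parsed as Record<string, unknown>).planCritique;
  if (typeof raw !== "object" || raw === null) return undefined; // no plan artifact

  const result: Critique = {};
  for (const name of DIMENSIONS) {
    const dim = (raw as Record<string, unknown>)[name];
    if (typeof dim !== "object" || dim === null) return undefined;
    const { score, rationale } = dim as Record<string, unknown>;
    if (typeof score !== "number" || !Number.isFinite(score)) return undefined;
    if (score < MIN_SCORE || score > MAX_SCORE) return undefined; // out of range
    if (typeof rationale !== "string") return undefined;
    result[name] = { score, rationale };
  }
  return result;
}
```

Returning `undefined` instead of throwing keeps the field optional end to end, so an eval without a plan artifact still produces a valid record.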

Test plan

  • Unit tests in eval.test.js cover plan critique extraction when present, absent, and malformed
  • Lifecycle harness stubs prevent test interference from the new _restore_inflight_task_window_if_missing call

Self-review

  • Ran self-review tool (iteration 1): one warning — dead planCritique field on StageScore interface (field was never populated; actual storage is in EvalRecord.metadata)
  • Fixed: removed the dead field from StageScore, corrected the version changelog comment
  • Ran self-review tool (iteration 2): verdict ready, zero findings

Closes HOK-1391_c

timogilvie and others added 4 commits April 22, 2026 21:34
Add planCritique field to eval records to directly measure plan quality
across 4 dimensions (component boundaries, invariant coverage, approach
soundness, missed patches) rather than inferring it from diff cleanliness.

Changes:
- Schema (eval-schema.ts): Add PlanCritique + PlanCritiqueDimension interfaces,
  add optional planCritique to StageScore, bump schema to 1.9.0
- Prompts: Add plan-critique.md standalone prompt, extend eval-judge.md
  with plan critique section when plan artifact is available
- Parser (eval.ts): Parse and validate planCritique from judge response,
  include in record.metadata, bump SCHEMA_VERSION to 1.9.0
- Tests (eval.test.js): Add coverage for valid planCritique parsing,
  graceful omission when plan unavailable, and out-of-range score handling

The planCritique field is optional at every layer to maintain backward
compatibility. Enables head-to-head plan quality comparisons for model
selection without requiring full workflow execution.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

planCritique is stored in EvalRecord.metadata, not inside StageScore
objects. Correct the version comment accordingly.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
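
The first commit message mentions head-to-head plan quality comparisons for model selection. One way a consumer might do that, given that `planCritique` is optional, is sketched here. The `model` field, the `ScoredRecord` shape, and `meanOverallByModel` are hypothetical names for this example and are not part of the PR.

```typescript
// Hypothetical, trimmed-down record shape used only for this example.
interface ScoredRecord {
  model: string;
  metadata: { planCritique?: { overall: { score: number } } };
}

// Averages the overall plan score per model, skipping records that have
// no plan critique (for example, evals without a plan artifact).
function meanOverallByModel(records: ScoredRecord[]): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>();
  for (const record of records) {
    const score = record.metadata.planCritique?.overall.score;
    if (score === undefined) continue;
    const entry = sums.get(record.model) ?? { total: 0, count: 0 };
    entry.total += score;
    entry.count += 1;
    sums.set(record.model, entry);
  }
  const means = new Map<string, number>();
  sums.forEach(({ total, count }, model) => means.set(model, total / count));
  return means;
}
```

Records without a critique are excluded from the average rather than counted as zero, so a missing plan does not drag a model's score down.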