HOK-1391_c: Add explicit plan critique scoring to eval pipeline #379
Open
timogilvie wants to merge 4 commits into main
Add planCritique field to eval records to directly measure plan quality across 4 dimensions (component boundaries, invariant coverage, approach soundness, missed patches) rather than inferring it from diff cleanliness.

Changes:
- Schema (eval-schema.ts): Add PlanCritique + PlanCritiqueDimension interfaces, add optional planCritique to StageScore, bump schema to 1.9.0
- Prompts: Add plan-critique.md standalone prompt, extend eval-judge.md with plan critique section when plan artifact is available
- Parser (eval.ts): Parse and validate planCritique from judge response, include in record.metadata, bump SCHEMA_VERSION to 1.9.0
- Tests (eval.test.js): Add coverage for valid planCritique parsing, graceful omission when plan unavailable, and out-of-range score handling

The planCritique field is optional at every layer to maintain backward compatibility. Enables head-to-head plan quality comparisons for model selection without requiring full workflow execution.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
planCritique is stored in EvalRecord.metadata, not inside StageScore objects. Correct the version comment accordingly.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…ning-judge-challenger
Summary
- Adds a standalone `plan-critique.md` prompt that judges plan quality across five dimensions (component boundaries, invariants, interface contracts, missed patches, overall) independently of diff cleanliness
- Extends the existing judge prompt (`eval-judge.md`) with a new `## Plan Critique` section so plan quality is captured in every eval that has a plan artifact
- Passes `PlanCritique` results through `eval.ts` and `eval-schema.ts`, surfacing them at `EvalRecord.metadata.planCritique` for GEPA training attribution
- Adds tests in `eval.test.js` for both presence and absence of plan artifacts

Changes
- `tools/prompts/plan-critique.md`: new standalone plan critique prompt (5 scored dimensions, structured JSON output)
- `tools/prompts/eval-judge.md`: new `## Plan Critique` section appended to existing judge prompt
- `shared/lib/eval.ts`: parse `planCritique` from judge response and store in `EvalRecord.metadata`
- `shared/lib/eval-schema.ts`: add `PlanCritique` and `PlanCritiqueDimension` interfaces; bump schema to 1.9.0
- `shared/lib/eval.test.js`: unit tests for plan critique parsing (valid, missing, malformed)
- `tests/lifecycle-harness.test.sh` / `tests/lifecycle-scenarios.test.sh`: test fixture stubs

Test plan
- Tests in `eval.test.js` cover plan critique extraction when present, absent, and malformed
- … `_restore_inflight_task_window_if_missing` call

Self-review
- `planCritique` field on `StageScore` interface (field was never populated; actual storage is in `EvalRecord.metadata`)
- … `StageScore`, corrected the version changelog comment

Closes HOK-1391_c
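
For reviewers who want a concrete picture of the parse-and-store path described above, here is a minimal sketch. It is not the code in `shared/lib/eval.ts`: the field names (`score`, `rationale`), the 1 to 5 score range, the policy of dropping the whole critique on a malformed or out-of-range value, and the names `parsePlanCritique`, `attachPlanCritique`, and `EvalRecordSketch` are all assumptions made for illustration. Only `PlanCritique`, `PlanCritiqueDimension`, and the `EvalRecord.metadata.planCritique` location come from this PR.

```typescript
// Hypothetical shapes: the PR names these interfaces but does not show
// their fields, so `score` and `rationale` are assumptions.
interface PlanCritiqueDimension {
  score: number; // assumed to lie in [MIN_SCORE, MAX_SCORE]
  rationale: string;
}

// Keyed by dimension name so the sketch does not hard-code the dimension list.
type PlanCritique = Record<string, PlanCritiqueDimension>;

// Assumed score bounds; the real range is not stated in the PR.
const MIN_SCORE = 1;
const MAX_SCORE = 5;

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns a validated critique, or undefined when the field is absent
// (no plan artifact) or malformed. Rejecting the whole critique on an
// out-of-range score is an assumed policy, not the PR's confirmed behavior.
function parsePlanCritique(judgeResponse: unknown): PlanCritique | undefined {
  if (!isRecord(judgeResponse)) return undefined;
  const raw = judgeResponse["planCritique"];
  if (!isRecord(raw)) return undefined;

  const critique: PlanCritique = {};
  for (const [name, dim] of Object.entries(raw)) {
    if (!isRecord(dim)) return undefined;
    const score = dim["score"];
    const rationale = dim["rationale"];
    if (typeof score !== "number" || !Number.isFinite(score)) return undefined;
    if (score < MIN_SCORE || score > MAX_SCORE) return undefined;
    critique[name] = {
      score,
      rationale: typeof rationale === "string" ? rationale : "",
    };
  }
  return Object.keys(critique).length > 0 ? critique : undefined;
}

// Hypothetical stand-in for EvalRecord: the critique lives in metadata,
// not on StageScore, and stays optional for backward compatibility.
interface EvalRecordSketch {
  metadata: { planCritique?: PlanCritique };
}

// Returns a copy of the record with planCritique set only when parsing succeeds.
function attachPlanCritique(
  record: EvalRecordSketch,
  judgeResponse: unknown,
): EvalRecordSketch {
  const planCritique = parsePlanCritique(judgeResponse);
  return planCritique === undefined
    ? record
    : { ...record, metadata: { ...record.metadata, planCritique } };
}
```

Under these assumptions, a record evaluated without a plan artifact passes through `attachPlanCritique` unchanged, so consumers written against earlier schema versions keep working.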