Feat/harness retry stabilization #5

Open
romanstark wants to merge 5 commits into coleam00:main from romanstark:feat/harness-retry-stabilization

Summary

This PR introduces a new stabilized retry strategy for harness resume flows and improves evaluator reliability for long-running sprint evaluations.
Key outcomes:

  • Prevents already-verified criteria from flapping to fail on transient/inconclusive re-checks.
  • Focuses retry work on still-failing criteria to reduce accidental regressions.
  • Hardens evaluator JSON parsing with an automatic recovery pass when output is non-parseable.

What changed

1) Stabilized retry strategy (new feature)

  • Added retryStrategy config with:
    • stabilized (default)
    • strict (legacy behavior)
  • Added hardFailUnlockStreak config (default: 2):
    • Previously passed criteria are "locked".
    • Locked criteria are only unlocked after repeated hard failures.
  • Added persistent sprint stability snapshots:
    • feedback/sprint-{n}-stability.json
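The lock/unlock rule above can be sketched roughly as follows. This is an illustrative sketch, not the PR's code: the `Verdict` and `CriterionState` types and the `applyVerdict` function are hypothetical names, and the choice to leave the streak unchanged on an inconclusive result is an assumption.

```typescript
// Hypothetical types for illustration; the PR's real definitions live in
// shared/types.ts and shared/evaluation.ts and may differ.
type Verdict = "pass" | "hard_fail" | "inconclusive";

interface CriterionState {
  locked: boolean;        // true once the criterion has passed
  hardFailStreak: number; // consecutive hard failures seen while locked
}

// Returns the next state and whether the criterion counts as passing this round.
function applyVerdict(
  prev: CriterionState,
  verdict: Verdict,
  hardFailUnlockStreak = 2,
): { state: CriterionState; passing: boolean } {
  if (verdict === "pass") {
    // A pass locks the criterion and clears any failure streak.
    return { state: { locked: true, hardFailStreak: 0 }, passing: true };
  }
  if (!prev.locked) {
    // Never verified: any non-pass result counts as failing.
    return { state: { locked: false, hardFailStreak: 0 }, passing: false };
  }
  if (verdict === "inconclusive") {
    // Assumption: an inconclusive re-check neither fails nor resets the streak.
    return { state: prev, passing: true };
  }
  const streak = prev.hardFailStreak + 1;
  if (streak >= hardFailUnlockStreak) {
    // Enough repeated hard failures: unlock and report the regression.
    return { state: { locked: false, hardFailStreak: 0 }, passing: false };
  }
  return { state: { locked: true, hardFailStreak: streak }, passing: true };
}
```

With the default streak of 2, a locked criterion survives one hard failure and is unlocked on the second.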

2) Focused retries for generator

  • On retry, generator now receives explicit failed-criteria focus.
  • Prompt includes scope-control guidance to minimize changes outside failing criteria.
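As a rough illustration of the focused-retry idea, a prompt fragment could be assembled like this. The `FailedCriterion` shape, the `buildRetryFocus` name, and the prompt wording are all assumptions made for this sketch; the actual prompt text lives in shared/prompts.ts.

```typescript
// Hypothetical shape of a failing criterion; field names are assumptions.
interface FailedCriterion {
  id: string;
  description: string;
  feedback: string;
}

// Builds a prompt section that lists only the still-failing criteria and
// adds scope-control guidance. Returns "" when nothing is failing.
function buildRetryFocus(failed: FailedCriterion[]): string {
  if (failed.length === 0) return "";
  const items = failed.map(
    (c) => `- [${c.id}] ${c.description}\n  Evaluator feedback: ${c.feedback}`,
  );
  return [
    "Focus this retry on the criteria that are still failing:",
    ...items,
    "Keep changes scoped to these criteria and avoid touching code that passing criteria depend on.",
  ].join("\n");
}
```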

3) Threshold semantics hardening

  • The contract criterion threshold is treated as an integer score gate on a 1-10 scale.
  • Invalid threshold values (e.g., metric targets like 90/250/48) fall back to global pass threshold.
  • Prompt guidance updated so raw metrics belong in criterion descriptions, not in score threshold fields.
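A minimal sketch of the fallback rule, assuming the 1-10 integer range stated above (the function name `resolveThreshold` is hypothetical, and the real helper in shared/evaluation.ts may use different bounds):

```typescript
// Returns the per-criterion threshold when it is a valid integer score on
// the assumed 1-10 scale; otherwise falls back to the global pass threshold.
function resolveThreshold(raw: unknown, globalPassThreshold: number): number {
  if (typeof raw === "number" && Number.isInteger(raw) && raw >= 1 && raw <= 10) {
    return raw;
  }
  // Metric-style values such as 90, 250, or 48 end up here.
  return globalPassThreshold;
}
```

Under these assumptions, `resolveThreshold(8, 7)` keeps the contract value, while `resolveThreshold(90, 7)` falls back to the global threshold.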

4) Evaluator JSON parse reliability hardening

  • If evaluator output is not parseable JSON on the first pass, the harness retries the evaluation once with a strict JSON-only instruction.
  • Claude evaluator response collection now captures additional result-event text to improve recoverability.
  • Increased CLAUDE_MAX_TURNS from 50 to 80 to reduce truncation/non-final responses in complex evaluations.
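The single-retry recovery pass might look roughly like the sketch below. The `Evaluate` callback type, the function names, and the retry instruction text are assumptions for illustration; the PR's evaluators call the Claude and Codex SDKs directly.

```typescript
// Hypothetical evaluator callback: runs one evaluation and returns raw text.
// An optional extra instruction is appended on the recovery pass.
type Evaluate = (extraInstruction?: string) => Promise<string>;

// Returns the parsed value, or undefined when the text is not valid JSON.
// (JSON.parse itself never yields undefined, so the sentinel is unambiguous.)
function tryParseJson(text: string): unknown {
  try {
    return JSON.parse(text);
  } catch {
    return undefined;
  }
}

// Evaluates once; if the output is not parseable, retries exactly once with
// a strict JSON-only instruction, then gives up with an error.
async function evaluateWithJsonRecovery(evaluate: Evaluate): Promise<unknown> {
  const first = tryParseJson(await evaluate());
  if (first !== undefined) return first;

  const second = tryParseJson(
    await evaluate("Respond with a single valid JSON object and nothing else."),
  );
  if (second === undefined) {
    throw new Error("Evaluator output was not parseable JSON after one retry");
  }
  return second;
}
```

Capping the recovery at one retry keeps the cost of a formatting failure bounded while still rescuing the common case where only the final message was malformed.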

Why

In long sprints, fully re-evaluating all criteria each retry can cause pass/fail oscillation due to:

  • environment limitations (e.g., Lighthouse/Chrome unavailable),
  • partial/inconclusive checks,
  • model output formatting failures (non-JSON final responses).

This PR keeps strong quality gates while reducing flaky regressions and improving run completion reliability.

CLI / behavior notes

  • New flags:
    • --retry-strategy=strict|stabilized
    • --hard-fail-unlock-streak=<int>
  • Default behavior now uses stabilized retries.
  • Strict mode remains available for full immediate regression enforcement.
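For illustration, the two flags could be parsed along these lines. This is a sketch only: `parseRetryFlags` and `RetryOptions` are hypothetical names, and the validation rules (rejecting unknown strategies and streaks below 1) are assumptions rather than behavior confirmed by the PR.

```typescript
type RetryStrategy = "strict" | "stabilized";

interface RetryOptions {
  retryStrategy: RetryStrategy;
  hardFailUnlockStreak: number;
}

// Parses --retry-strategy=... and --hard-fail-unlock-streak=... from an
// argument list, starting from the defaults described in the PR.
function parseRetryFlags(argv: string[]): RetryOptions {
  const opts: RetryOptions = { retryStrategy: "stabilized", hardFailUnlockStreak: 2 };
  for (const arg of argv) {
    const [flag, value] = arg.split("=", 2);
    if (flag === "--retry-strategy") {
      if (value !== "strict" && value !== "stabilized") {
        throw new Error(`Invalid --retry-strategy: ${value}`);
      }
      opts.retryStrategy = value;
    } else if (flag === "--hard-fail-unlock-streak") {
      const n = Number(value);
      // Assumption: the streak must be a positive integer.
      if (!Number.isInteger(n) || n < 1) {
        throw new Error(`Invalid --hard-fail-unlock-streak: ${value}`);
      }
      opts.hardFailUnlockStreak = n;
    }
  }
  return opts;
}
```

For example, `parseRetryFlags(["--retry-strategy=strict"])` would select the legacy behavior while keeping the default streak of 2.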

Files of interest

  • shared/evaluation.ts (new): stabilization + threshold helpers
  • shared/types.ts
  • shared/files.ts
  • shared/config.ts
  • shared/prompts.ts
  • claude-harness/{index.ts,harness.ts,generator.ts,evaluator.ts}
  • codex-harness/{index.ts,harness.ts,generator.ts,evaluator.ts}
  • README.md

Guard per-criterion thresholds to the 0-10 score range and fall back to the default threshold when contracts contain metric-style values. Clarify negotiation/evaluation prompts so threshold remains a score gate, preventing false sprint failures on resume.
Introduce criterion lock state, inconclusive handling, and focused retry prompts so flaky re-evaluations do not regress already verified criteria. Add CLI/config controls and persisted stability snapshots while keeping strict mode available for full regression checks.
Retry evaluation once when the first response is not parseable JSON, including stricter retry instructions and better Claude result-event text capture. Raise Claude max turns and document the new evaluator reliability behavior in the README.
Clear sprint stability snapshots when using --resume=reset-contract so locked criterion state cannot leak into a newly negotiated contract.