Feat/harness retry stabilization by romanstark · Pull Request #5 · coleam00/adversarial-dev

romanstark · 2026-04-09T15:09:04Z

Summary

This PR introduces a new stabilized retry strategy for harness resume flows and improves evaluator reliability for long-running sprint evaluations.
Key outcomes:

Prevents already-verified criteria from flapping to fail on transient/inconclusive re-checks.
Focuses retry work on still-failing criteria to reduce accidental regressions.
Hardens evaluator JSON parsing with an automatic recovery pass when output is non-parseable.

What changed

1) Stabilized retry strategy (new feature)

Added retryStrategy config with:
- stabilized (default)
- strict (legacy behavior)
Added hardFailUnlockStreak config (default: 2):
- Previously passed criteria are "locked".
- Locked criteria are only unlocked after repeated hard failures.
Added persistent sprint stability snapshots:
- feedback/sprint-{n}-stability.json
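
To illustrate the locking rule described above, here is a minimal sketch of how a single criterion's lock state could evolve across retries. The type and function names (`CheckOutcome`, `CriterionStability`, `applyOutcome`) are hypothetical and not taken from `shared/evaluation.ts`. Leaving the streak unchanged on an inconclusive check is also an assumption.

```typescript
// Hypothetical names for illustration only; the real helpers may differ.
type CheckOutcome = "pass" | "hard-fail" | "inconclusive";

interface CriterionStability {
  locked: boolean; // true once the criterion has been verified as passing
  hardFailStreak: number; // consecutive hard failures observed while locked
}

interface StabilityUpdate {
  state: CriterionStability;
  effectivePass: boolean; // the verdict the harness would act on
}

function applyOutcome(
  prev: CriterionStability,
  outcome: CheckOutcome,
  hardFailUnlockStreak = 2,
): StabilityUpdate {
  if (outcome === "pass") {
    // A pass locks the criterion and clears any failure streak.
    return { state: { locked: true, hardFailStreak: 0 }, effectivePass: true };
  }
  if (!prev.locked) {
    // Never verified: any non-pass result counts as failing.
    return { state: { locked: false, hardFailStreak: 0 }, effectivePass: false };
  }
  if (outcome === "inconclusive") {
    // Assumption: an inconclusive re-check neither fails nor resets the streak.
    return { state: { ...prev }, effectivePass: true };
  }
  const streak = prev.hardFailStreak + 1;
  if (streak >= hardFailUnlockStreak) {
    // Repeated hard failures unlock the criterion and it fails again.
    return { state: { locked: false, hardFailStreak: 0 }, effectivePass: false };
  }
  // A single hard failure is recorded, but the locked verdict is kept.
  return { state: { locked: true, hardFailStreak: streak }, effectivePass: true };
}
```

Under these assumptions, a locked criterion survives one hard failure and only flips to fail on the second consecutive one when `hardFailUnlockStreak` is 2. Strict mode would correspond to taking every outcome at face value instead.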

2) Focused retries for generator

On retry, generator now receives explicit failed-criteria focus.
Prompt includes scope-control guidance to minimize changes outside failing criteria.
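
As a rough sketch of the focused-retry idea, the generator prompt could be extended with a section built from the criteria that are still failing. The `CriterionResult` shape, the `buildRetryFocus` name, and the prompt wording are all assumptions made for illustration; the actual text lives in `shared/prompts.ts`.

```typescript
// Hypothetical result shape; not the type defined in shared/types.ts.
interface CriterionResult {
  id: string;
  description: string;
  passed: boolean;
}

// Builds a prompt section that lists only the failing criteria and
// asks the generator to keep its changes scoped to them.
function buildRetryFocus(results: CriterionResult[]): string {
  const failing = results.filter((r) => !r.passed);
  if (failing.length === 0) {
    return ""; // nothing to focus on
  }
  const lines = failing.map((r) => `- ${r.id}: ${r.description}`);
  return [
    "Focus on these failing criteria:",
    ...lines,
    "Minimize changes outside the failing criteria.",
  ].join("\n");
}
```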

3) Threshold semantics hardening

Contract criterion threshold is treated as an integer score gate on 1-10 scale.
Invalid threshold values (e.g., metric targets like 90/250/48) fall back to global pass threshold.
Prompt guidance updated so raw metrics belong in criterion descriptions, not in score threshold fields.
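
The fallback rule can be sketched as a small guard. This is an illustration rather than the helper in `shared/evaluation.ts`: the name `resolveThreshold` is hypothetical, and the accepted range follows the 1-10 scale stated above.

```typescript
// Hypothetical helper: returns the per-criterion threshold when it is a
// valid integer score gate, otherwise the global pass threshold.
function resolveThreshold(raw: unknown, globalPassThreshold: number): number {
  if (
    typeof raw === "number" &&
    Number.isInteger(raw) &&
    raw >= 1 && // assumption: lower bound of the 1-10 scale
    raw <= 10
  ) {
    return raw;
  }
  // Metric-style values such as 90, 250, or 48 land here.
  return globalPassThreshold;
}
```

With a global threshold of 7, a contract value of 8 is kept, while 90 falls back to 7.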

4) Evaluator JSON parse reliability hardening

If evaluator output is not parseable JSON on first pass, harness retries evaluation once with strict JSON-only instruction.
Claude evaluator response collection now captures additional result-event text to improve recoverability.
Increased CLAUDE_MAX_TURNS from 50 to 80 to reduce truncation/non-final responses in complex evaluations.
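
The single recovery pass could look roughly like the sketch below. The evaluator is modeled as an injected async callback so the example stays self-contained. `Evaluate`, `evaluateWithJsonRecovery`, and the strict instruction wording are assumptions, not the harness's actual API.

```typescript
// Hypothetical evaluator signature: takes a prompt, returns raw model text.
type Evaluate = (prompt: string) => Promise<string>;

type ParseResult = { ok: true; value: unknown } | { ok: false };

function tryParse(text: string): ParseResult {
  try {
    return { ok: true, value: JSON.parse(text) };
  } catch {
    return { ok: false };
  }
}

// Runs the evaluator once, then retries exactly once with a JSON-only
// instruction if the first response cannot be parsed.
async function evaluateWithJsonRecovery(
  evaluate: Evaluate,
  prompt: string,
): Promise<unknown> {
  const first = tryParse(await evaluate(prompt));
  if (first.ok) {
    return first.value;
  }
  const strictPrompt =
    `${prompt}\n\nRespond with a single valid JSON object only, ` +
    "with no prose or code fences.";
  const second = tryParse(await evaluate(strictPrompt));
  if (second.ok) {
    return second.value;
  }
  throw new Error("Evaluator output was not parseable JSON after one recovery pass");
}
```

Capping the recovery at one extra call keeps the cost of a malformed response bounded while still rescuing the common case of a non-JSON final message.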

Why

In long sprints, fully re-evaluating all criteria on each retry can cause pass/fail oscillation due to:

environment limitations (e.g., Lighthouse/Chrome unavailable),
partial/inconclusive checks,
model output formatting failures (non-JSON final responses).

This PR keeps strong quality gates while reducing flaky regressions and improving run completion reliability.

CLI / behavior notes

New flags:
- --retry-strategy=strict|stabilized
- --hard-fail-unlock-streak=<int>
Default behavior now uses stabilized retries.
Strict mode remains available for full immediate regression enforcement.
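
For reference, the two flags could be parsed along these lines. This is a hedged sketch, not the code in `shared/config.ts`: `parseRetryFlags` and `RetryOptions` are hypothetical names, and rejecting streak values below 1 is an assumption.

```typescript
type RetryStrategy = "stabilized" | "strict";

// Hypothetical options shape mirroring the two new flags.
interface RetryOptions {
  retryStrategy: RetryStrategy;
  hardFailUnlockStreak: number;
}

function parseRetryFlags(argv: string[]): RetryOptions {
  // Defaults as stated in this PR: stabilized retries, unlock streak of 2.
  const opts: RetryOptions = { retryStrategy: "stabilized", hardFailUnlockStreak: 2 };
  for (const arg of argv) {
    const [flag, value] = arg.split("=", 2);
    if (flag === "--retry-strategy") {
      if (value === "strict" || value === "stabilized") {
        opts.retryStrategy = value;
      } else {
        throw new Error(`Invalid --retry-strategy: ${value}`);
      }
    } else if (flag === "--hard-fail-unlock-streak") {
      const n = Number(value);
      // Assumption: the streak must be a positive integer.
      if (!Number.isInteger(n) || n < 1) {
        throw new Error(`Invalid --hard-fail-unlock-streak: ${value}`);
      }
      opts.hardFailUnlockStreak = n;
    }
  }
  return opts;
}
```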

Files of interest

shared/evaluation.ts (new): stabilization + threshold helpers
shared/types.ts
shared/files.ts
shared/config.ts
shared/prompts.ts
claude-harness/{index.ts,harness.ts,generator.ts,evaluator.ts}
codex-harness/{index.ts,harness.ts,generator.ts,evaluator.ts}
README.md

Guard per-criterion thresholds to the 0-10 score range and fall back to the default threshold when contracts contain metric-style values. Clarify negotiation/evaluation prompts so threshold remains a score gate, preventing false sprint failures on resume.

Introduce criterion lock state, inconclusive handling, and focused retry prompts so flaky re-evaluations do not regress already verified criteria. Add CLI/config controls and persisted stability snapshots while keeping strict mode available for full regression checks.

Retry evaluation once when the first response is not parseable JSON, including stricter retry instructions and better Claude result-event text capture. Raise Claude max turns and document the new evaluator reliability behavior in the README.

Clear sprint stability snapshots when using --resume=reset-contract so locked criterion state cannot leak into a newly negotiated contract.

romanstark added 5 commits April 3, 2026 22:33

add resume modes and per-criterion evaluator thresholds

e677a3a

reset stability state when resume contract resets

94843ba


romanstark wants to merge 5 commits into coleam00:main from romanstark:feat/harness-retry-stabilization
