Feat/contract negotiation hardening by romanstark · Pull Request #6 · coleam00/adversarial-dev

romanstark · 2026-04-09T16:53:46Z

Summary

This PR adds a focused contract negotiation hardening feature to the existing harness architecture.
Inspired by ideas from @lliWcWill in coleam00/adversarial-dev PR #3, this implementation is adapted to my current codebase and integrated with my existing resume + stabilized retry logic.

What’s included

Iterative contract negotiation (up to 3 rounds)
- Generator and evaluator can iterate before finalizing a sprint contract.
Fail-closed contract parsing
- Invalid contract JSON no longer silently falls back to a generic default.
- Negotiation now retries on malformed output instead.
Robust approval detection
- Approval checks now accept case-insensitive APPROVED... responses.
Mid-sprint contract renegotiation trigger
- During retries, renegotiation is triggered when contract/implementation fit is clearly poor:
  - all criteria failing, or
  - average score < 4
Empty-feedback safety guard
- Evaluator parsing now requires non-empty feedback arrays before treating output as valid.
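
For reviewers, the approval check and the renegotiation trigger described above can be sketched roughly as follows. This is a minimal illustration, not the code in `harness.ts`: the `CriterionResult` shape and the function names `isApproved` and `shouldRenegotiate` are hypothetical, and matching `APPROVED` only at the start of the response is an assumption.

```typescript
// Hypothetical result shape; the real types in harness.ts may differ.
interface CriterionResult {
  criterion: string;
  score: number; // 0-10 score assigned by the evaluator
  passed: boolean;
}

// Case-insensitive approval check. Assumes the verdict leads the response,
// so "approved", "APPROVED." and "Approved: looks fine" all count.
function isApproved(response: string): boolean {
  return /^\s*approved\b/i.test(response);
}

// Renegotiate when every criterion fails or the average score is below 4.
function shouldRenegotiate(feedback: CriterionResult[]): boolean {
  if (feedback.length === 0) return false; // empty feedback is treated as invalid output, not as a signal
  const allFailing = feedback.every((f) => !f.passed);
  const average = feedback.reduce((sum, f) => sum + f.score, 0) / feedback.length;
  return allFailing || average < 4;
}
```

Keeping both conditions in one pure function makes the trigger easy to unit test separately from the retry loop.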

Why

In difficult sprints, contracts can become misaligned with real implementation constraints.
This feature reduces wasted retry loops and improves harness resilience by renegotiating when quality signals indicate the current contract is no longer actionable.

Scope

Updated files:

claude-harness/harness.ts
codex-harness/harness.ts
claude-harness/evaluator.ts
codex-harness/evaluator.ts
README.md

Credit

Design inspiration and several hardening concepts came from:

@lliWcWill
Upstream PR context: https://github.com/coleam00/adversarial-dev/pull/3

Guard per-criterion thresholds to the 0-10 score range and fall back to the default threshold when contracts contain metric-style values. Clarify negotiation/evaluation prompts so threshold remains a score gate, preventing false sprint failures on resume.
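
A minimal sketch of the threshold guard, assuming a hypothetical helper named `resolveThreshold` and an assumed default of 7 (the commit message does not state the actual default):

```typescript
// Assumed default; the real value lives in the harness config.
const DEFAULT_THRESHOLD = 7;

// Accept a per-criterion threshold only if it is a finite number in the
// 0-10 score range. Metric-style values such as 200 (ms) or "95%" fall
// back to the default so they cannot cause false sprint failures.
function resolveThreshold(raw: unknown, fallback: number = DEFAULT_THRESHOLD): number {
  if (typeof raw !== "number" || !Number.isFinite(raw)) return fallback;
  if (raw < 0 || raw > 10) return fallback;
  return raw;
}
```

For example, `resolveThreshold(8)` keeps 8, while `resolveThreshold(200)` falls back to the default.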

Introduce criterion lock state, inconclusive handling, and focused retry prompts so flaky re-evaluations do not regress already verified criteria. Add CLI/config controls and persisted stability snapshots while keeping strict mode available for full regression checks.
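
One way to picture the lock behavior is the sketch below. It is an illustration under assumptions: the `Verdict` type, the `mergeVerdicts` name, and the use of a plain set of criterion ids for the lock state are all hypothetical and may not match the persisted snapshot format.

```typescript
type Verdict = "pass" | "fail" | "inconclusive";

// Merge a new evaluation into the lock state.
// Non-strict mode: a criterion that is already locked stays locked even if
// a later evaluation reports "fail" or "inconclusive" for it.
// Strict mode: previous locks are ignored, so every criterion is re-checked.
function mergeVerdicts(
  lockedBefore: ReadonlySet<string>,
  verdicts: Record<string, Verdict>,
  strict = false,
): { locked: Set<string>; pending: string[] } {
  const locked = new Set<string>(strict ? [] : lockedBefore);
  const pending: string[] = []; // criteria the focused retry prompt should target
  for (const [id, verdict] of Object.entries(verdicts)) {
    if (verdict === "pass") {
      locked.add(id);
    } else if (!locked.has(id)) {
      pending.push(id);
    }
  }
  return { locked, pending };
}
```

The `pending` list is what a focused retry prompt would be built from, so a flaky re-evaluation of a locked criterion never reaches the generator in non-strict mode.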

Retry evaluation once when the first response is not parseable JSON, including stricter retry instructions and better Claude result-event text capture. Raise Claude max turns and document the new evaluator reliability behavior in the README.
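
The retry-once behavior, combined with the non-empty feedback guard from the PR description, could look roughly like this. The `ask` callback stands in for whatever function sends a prompt to the evaluator; it, the `Evaluation` shape, and the retry wording are assumptions made for illustration.

```typescript
interface Evaluation {
  feedback: unknown[];
}

// Returns null for anything that is not JSON with a non-empty feedback array.
function parseEvaluation(text: string): Evaluation | null {
  try {
    const parsed: unknown = JSON.parse(text);
    if (typeof parsed !== "object" || parsed === null) return null;
    const feedback = (parsed as { feedback?: unknown }).feedback;
    if (!Array.isArray(feedback) || feedback.length === 0) return null;
    return { feedback };
  } catch {
    return null;
  }
}

// Ask once; if the output is unusable, ask exactly once more with stricter
// instructions, then give up with an error instead of guessing a result.
async function evaluateWithRetry(
  ask: (prompt: string) => Promise<string>,
  prompt: string,
): Promise<Evaluation> {
  const first = parseEvaluation(await ask(prompt));
  if (first) return first;
  const retryPrompt = `${prompt}\n\nRespond with a single valid JSON object and nothing else.`;
  const second = parseEvaluation(await ask(retryPrompt));
  if (second) return second;
  throw new Error("Evaluator output was not parseable after one retry");
}
```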

Clear sprint stability snapshots when using --resume=reset-contract so locked criterion state cannot leak into a newly negotiated contract.

Add iterative multi-round contract negotiation with fail-closed parsing and retry-on-invalid output, plus case-insensitive APPROVED detection. Trigger mid-sprint renegotiation on clearly misaligned evaluations and guard evaluators against empty feedback arrays.
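
As a rough sketch of the negotiation loop: the `propose` and `review` callbacks, the `SprintContract` shape, and the choice to throw after the last round are assumptions for illustration. The PR text only states the three-round cap, fail-closed parsing, retry on malformed output, and case-insensitive approval. Whether a malformed proposal consumes a round is also assumed here.

```typescript
interface SprintContract {
  criteria: { name: string; threshold: number }[];
}

const MAX_ROUNDS = 3;

// Fail closed: return null instead of substituting a generic default contract.
function parseContract(text: string): SprintContract | null {
  try {
    const parsed = JSON.parse(text) as Partial<SprintContract> | null;
    if (!parsed || !Array.isArray(parsed.criteria) || parsed.criteria.length === 0) return null;
    return { criteria: parsed.criteria };
  } catch {
    return null;
  }
}

async function negotiateContract(
  propose: (feedback: string | null) => Promise<string>, // generator side
  review: (contract: SprintContract) => Promise<string>, // evaluator side
): Promise<SprintContract> {
  let feedback: string | null = null;
  for (let round = 1; round <= MAX_ROUNDS; round++) {
    const contract = parseContract(await propose(feedback));
    if (!contract) {
      // Malformed output: retry rather than fall back to a default.
      feedback = "Previous output was not valid contract JSON.";
      continue;
    }
    const verdict = await review(contract);
    if (/^\s*approved\b/i.test(verdict)) return contract;
    feedback = verdict; // pass the evaluator's objections to the next round
  }
  throw new Error(`No approved contract after ${MAX_ROUNDS} rounds`);
}
```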

This reverts commit 92729e3.

romanstark added 9 commits April 3, 2026 22:33

add resume modes and per-criterion evaluator thresholds

e677a3a

merge stabilized retry and resume improvements

d08dce1

reset stability state when resume contract resets

010ae74

Revert "harden contract negotiation and renegotiation logic"

119f6fc

romanstark wants to merge 9 commits into coleam00:main from romanstark:feat/contract-negotiation-hardening