Feat/harness resume modes and criterion-specific evaluator thresholds #4

Open
romanstark wants to merge 2 commits into coleam00:main from romanstark:feat/harness-resume

@romanstark

Summary

This PR improves harness reliability for long runs by adding resumable execution modes to both harness implementations and by aligning evaluator pass/fail logic with the per-criterion thresholds defined in sprint contracts.

What changed

  • Added --resume support to both CLIs:
    • claude-harness/index.ts
    • codex-harness/index.ts
  • Added resume modes:
    • strict (default when --resume is used without a value)
    • reset-retries
    • reset-contract
  • Updated harness orchestration to resume from existing workspace state (progress.json, existing contracts/feedback) without cleaning artifacts on resume.
  • Extended shared file utilities:
    • initWorkspace(..., { clean })
    • findLatestFeedbackRound(...)
  • Updated evaluators (Claude + Codex) to enforce criterion-specific thresholds from sprint contracts, with fallback to global passThreshold.
  • Updated README with resume usage examples and behavior.
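
As a rough illustration of the flag handling listed above, the following is a minimal sketch of how `--resume` and `--resume=<mode>` might be parsed. The names `ResumeMode`, `isResumeMode`, and `parseResumeFlag` are hypothetical and are not taken from the PR's code; only the three mode names and the "strict by default" rule come from the description.

```typescript
// Hypothetical sketch: identifiers here are illustrative, not the PR's actual code.
type ResumeMode = "strict" | "reset-retries" | "reset-contract";

const RESUME_MODES: readonly ResumeMode[] = ["strict", "reset-retries", "reset-contract"];

// Narrow an arbitrary string to one of the known resume modes.
function isResumeMode(value: string): value is ResumeMode {
  return RESUME_MODES.some((mode) => mode === value);
}

// Returns the requested resume mode, or null when no --resume flag is present
// (assumed here to mean a fresh run with a clean workspace).
function parseResumeFlag(argv: readonly string[]): ResumeMode | null {
  for (const arg of argv) {
    // A bare --resume defaults to strict, as stated in the PR description.
    if (arg === "--resume") return "strict";
    if (arg.startsWith("--resume=")) {
      const value = arg.slice("--resume=".length);
      if (!isResumeMode(value)) {
        throw new Error(
          `Unknown resume mode "${value}"; expected one of: ${RESUME_MODES.join(", ")}`
        );
      }
      return value;
    }
  }
  return null;
}
```

Rejecting unknown values up front, rather than silently falling back to strict, is a design choice of this sketch; the PR may handle invalid values differently.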

Why

In real runs, long sprint sequences can fail or stop late in the process. Restarting from scratch is expensive and often unnecessary. This adds explicit operational controls for continuation strategy and makes pass/fail decisions consistent with contract intent when thresholds differ per criterion.

Resume behavior

  • --resume / --resume=strict
    • Continues from current progress state.
    • Reuses existing contract.
    • Preserves retry progression.
    • If retry budget is already exhausted, exits with a clear error.
  • --resume=reset-retries
    • Resumes current sprint with retry counter reset to 0.
    • Reuses existing contract.
  • --resume=reset-contract
    • Resumes current sprint with retry counter reset to 0.
    • Re-negotiates sprint contract before rebuilding.
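
The three modes above can be summarized as a small decision function. This is a sketch under stated assumptions, not the PR's implementation: `SprintProgress`, `ResumePlan`, and `planResume` are hypothetical names, and "retry budget exhausted" is assumed to mean `retryCount >= maxRetries`.

```typescript
// Hypothetical sketch: types and field names are illustrative assumptions.
type ResumeMode = "strict" | "reset-retries" | "reset-contract";

interface SprintProgress {
  sprint: number;      // index of the sprint being resumed
  retryCount: number;  // retries already used for that sprint
}

interface ResumePlan {
  sprint: number;
  retryCount: number;
  renegotiateContract: boolean;
}

function planResume(mode: ResumeMode, progress: SprintProgress, maxRetries: number): ResumePlan {
  switch (mode) {
    case "strict":
      // Preserve retry progression; refuse to continue if the budget is spent.
      if (progress.retryCount >= maxRetries) {
        throw new Error(
          `Retry budget exhausted for sprint ${progress.sprint} ` +
            `(${progress.retryCount}/${maxRetries}); use --resume=reset-retries to continue.`
        );
      }
      return { sprint: progress.sprint, retryCount: progress.retryCount, renegotiateContract: false };
    case "reset-retries":
      // Same sprint and contract, fresh retry counter.
      return { sprint: progress.sprint, retryCount: 0, renegotiateContract: false };
    case "reset-contract":
      // Fresh retry counter, and the contract is negotiated again before rebuilding.
      return { sprint: progress.sprint, retryCount: 0, renegotiateContract: true };
  }
}
```

For example, with `maxRetries = 3` and a saved state of `{ sprint: 4, retryCount: 3 }`, strict mode would throw, while the other two modes would restart sprint 4 with the counter at 0.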

Scope

Included:

  • README updates
  • Resume functionality in Claude/Codex harnesses
  • Shared type/file utility changes needed for resume flow
  • Evaluator threshold logic update (per criterion)

Intentionally excluded:

  • Unrelated local config changes (shared/config.ts is not part of this PR)

Notes for review

  • Resume logic is implemented symmetrically across Claude and Codex harness paths.
  • Existing non-resume behavior is preserved (clean workspace init on fresh runs).
  • Evaluator logging now includes effective threshold used per criterion, which helps debug contract/evaluation mismatches.

Restrict per-criterion thresholds to the 0-10 score range and fall back to the default threshold when a contract contains metric-style values. Clarify the negotiation and evaluation prompts so that a threshold remains a score gate, preventing false sprint failures on resume.
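
The threshold guard described above could look roughly like the sketch below. It assumes scores on a 0-10 scale, treats any threshold that is not a finite number in that range as "metric-style" and replaces it with the global `passThreshold`, and assumes a criterion passes when its score is greater than or equal to the effective threshold. The names `Criterion`, `effectiveThreshold`, and `evaluateSprint` are hypothetical and do not come from the PR.

```typescript
// Hypothetical sketch: shapes and names are assumptions, not the PR's actual types.
interface Criterion {
  name: string;
  threshold?: unknown; // contracts may carry missing or out-of-range values
}

interface CriterionResult {
  name: string;
  score: number;
  threshold: number; // the effective threshold that was actually applied
  passed: boolean;
}

// Use the contract's threshold only if it is a valid 0-10 score gate.
function effectiveThreshold(raw: unknown, passThreshold: number): number {
  if (typeof raw === "number" && Number.isFinite(raw) && raw >= 0 && raw <= 10) {
    return raw;
  }
  return passThreshold;
}

function evaluateSprint(
  scores: Record<string, number>,
  criteria: readonly Criterion[],
  passThreshold: number
): { passed: boolean; results: CriterionResult[] } {
  const results = criteria.map((criterion) => {
    const threshold = effectiveThreshold(criterion.threshold, passThreshold);
    // Assumption: a criterion with no recorded score counts as 0.
    const score = scores[criterion.name] ?? 0;
    return { name: criterion.name, score, threshold, passed: score >= threshold };
  });
  // The sprint passes only if every criterion meets its own threshold.
  return { passed: results.every((r) => r.passed), results };
}
```

Returning the effective threshold alongside each result makes it straightforward to log which gate was applied per criterion.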
@romanstark romanstark force-pushed the feat/harness-resume branch from d0e3823 to 571464d Compare April 9, 2026 14:13
@romanstark romanstark changed the title Add harness resume modes and criterion-specific evaluator thresholds Feat/harness resume modes and criterion-specific evaluator thresholds Apr 9, 2026