Harness-evolution sweep: April 2026 model class transitions (Claude Opus 4.7 + GPT-5.5 + Gemini 3.1 Pro) #17

@EsatanGW

Trigger

Material releases detected in three of the methodology's bridged runtimes within the 2026-03-31 → 2026-04-30 window:

| Runtime | Release | Date |
| --- | --- | --- |
| Claude Code (Anthropic) | Claude Opus 4.7 | 2026-04-16 |
| Codex (OpenAI) | GPT-5.5 | 2026-04-23 (Codex), 2026-04-24 (API) |
| Gemini CLI (Google) | Gemini 3.1 Pro | April 2026 |

Cursor (3.1 / 3.2 / SDK / /multitask / /debug) and Windsurf 2.0 + Devin integration were classified as non-material — IDE / partnership feature work, no underlying model class transition.

Previous sweep fire: none. This is the first execution of the harness-evolution sweep, run interactively immediately after v1.35.0 shipped the discipline.

This issue executes Step 1 (Map) of the harness-evolution discipline (docs/harness-evolution-discipline.md). Steps 2 (empirical re-test), 3 (classification), and 4 (record) remain with the maintainer.


Step 1 — Capability shifts × existing methodology components

A. Claude Opus 4.7 (2026-04-16)

Anthropic's release notes explicitly direct users to "re-tune their prompts and harnesses accordingly" — this is the trigger criterion the discipline was authored against. Per-axis mapping:

| Capability axis claimed | Existing methodology component | SoT location |
| --- | --- | --- |
| "Substantially better at following instructions" + "works coherently for hours, pushes through hard problems rather than giving up" (direct anti-§12 claim) | §12 Context anxiety | docs/ai-operating-contract.md §12 |
| "Most consistent long-context performance" + "better at using file system-based memory… across long, multi-session work" | Compaction algorithm + resume-mode 30% rule + §Pre-compression protection list | docs/change-manifest-spec.md §Compaction algorithm, skills/engineering-workflow/references/resumption-protocol.md §Step 2b, docs/ai-project-memory.md §Pre-compression protection list |
| "Double-digit jump in accuracy of tool calls and planning" + "loop resistance most critical improvement" | §11 Verbal-completion illusion, review-loop-pattern iteration cap (5), fix-retest-loop three-failure rule | docs/ai-operating-contract.md §11, skills/engineering-workflow/references/review-loop-pattern.md, skills/engineering-workflow/phases/fix-retest-loop.md |
| "Correctly reports when data is missing" + "resists dissonant-data traps that even Opus 4.6 falls for" (direct anti-fabrication claim) | Pre-handoff self-check Q2 (reference-existence verification), §Anti-rationalization Rule 1 (confidence without substantiation), §9 Non-fabrication list | docs/multi-agent-handoff.md §Pre-handoff self-check Q2, §Anti-rationalization rules Rule 1, docs/ai-operating-contract.md §9 |
| "More opinionated… rather than simply agreeing with the user" | §Anti-rationalization rules (potential new failure-shape symmetry: Reviewer over-pushback) | docs/multi-agent-handoff.md §Anti-rationalization rules |
| Introduces xhigh effort tier between high and max | Already integrated at bridge layer in v1.34.0 (no methodology shift required) | agents/implementer-deep.md, agents/README.md |

B. GPT-5.5 (2026-04-23) and Gemini 3.1 Pro (April 2026)

Less granular per-axis claims; mapping at axis level:

| Axis | Affected methodology components |
| --- | --- |
| Agentic / Computer Use (Gemini's new tool, GPT-5.5's positioning for "computer use") | docs/cross-cutting-concerns.md §Application-driven verification, skills/engineering-workflow/references/application-driven-loop.md, §11 Verbal-completion illusion |
| Multi-step coherence (both releases) | docs/ai-operating-contract.md §12 Context anxiety, skills/engineering-workflow/references/long-running-delegation.md D1/D2 |
| Token efficiency ("significantly fewer tokens" in GPT-5.5) | Manifest size ceiling at 2000 lines; output shape may shift |
| MCP for Deep Research (Gemini) | docs/ai-project-memory.md §Tier 2 / CCKN, docs/repo-as-context-discipline.md Rule 3 (transcoding boundaries) |

Step 1 — Components newly at risk (require Step 2 to confirm)

🔴 High priority — direct claim against the exact failure mode the component was authored against:

  • §12 Context anxiety — Opus 4.7's "works coherently for hours, pushes through hard problems rather than giving up" names this exact failure. The discipline was just shipped today (v1.35.0). The ironic case: this may be the first component to surface as retire-eligible. ⚠️ Test under realistic long-task workload before any retirement is considered. Empirical evidence is the load-bearing step (per docs/harness-evolution-discipline.md §Anti-patterns "the new model is much better, drop the scaffolding").
  • Pre-handoff self-check Q2 (reference-existence verification): "correctly reports when data is missing" + "resists dissonant-data traps" directly target Q2's failure shape. Test the fabrication rate on representative changes.

🟡 Medium priority — overlapping claim, indirect mapping:

  • §Anti-rationalization Rule 1 (confidence-without-substantiation) — overlapping claim with Q2 above.
  • review-loop-pattern iteration cap (5): "loop resistance most critical" may mean fewer iterations are needed, but capability ≠ tail-failure rate, so do not retire on the claim alone.

🟢 Watch but lower priority:

  • §11 Verbal-completion illusion — tool-use precision claims are general, not aimed at action-transition turn-end specifically.
  • fix-retest-loop three-failure rule — same axis as review-loop-pattern.

Step 1 — Possible new failure modes (require Step 3 to confirm)

  • More-opinionated-base behavior: Opus 4.7's "takes a more opinionated perspective, rather than simply agreeing with the user" could surface as a Reviewer that is too aggressive with send-backs (overriding Planner intent without escalation). This is symmetric to the existing §Anti-rationalization rules, but on the other side of the boundary. Watch in Step 2.
  • Computer-use-tool integration drift: GPT-5.5 + Gemini 3.1 Pro both expose Computer Use as a first-class tool. runtime-hook-contract.md Categories may not have anticipated this surface explicitly. Likely a new hook category is NOT needed (existing categories cover the I/O shape), but verify in Step 2.
  • Token accounting shift: GPT-5.5 uses significantly fewer tokens; Opus 4.7's tokenizer changed (1.0–1.35× ratio). The 30% context-budget rule may need numeric recalibration to preserve effective headroom — not a discipline change, but a tuning.
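
The token-accounting point can be made concrete with a little arithmetic. The sketch below is illustrative only and rests on two assumptions that are not confirmed here: that the 30% rule is expressed as a share of the context window measured in tokens, and that the 1.0–1.35× figure means identical text produces up to 1.35× as many tokens under the new tokenizer. The function names and the 200,000-token window are hypothetical.

```python
def recalibrated_fraction(base_fraction: float, tokenizer_ratio: float) -> float:
    """Share of the window needed to hold the same text after a tokenizer change.

    tokenizer_ratio is new-tokens / old-tokens for identical text
    (assumed reading of the 1.0-1.35x figure).
    """
    if not 0.0 < base_fraction <= 1.0:
        raise ValueError("base_fraction must be in (0, 1]")
    if tokenizer_ratio <= 0.0:
        raise ValueError("tokenizer_ratio must be positive")
    # Cap at 1.0: a budget cannot exceed the whole window.
    return min(1.0, base_fraction * tokenizer_ratio)


def old_token_equivalent(budget_tokens: float, tokenizer_ratio: float) -> float:
    """Amount of text, in old-tokenizer tokens, that fits in a new-token budget."""
    if tokenizer_ratio <= 0.0:
        raise ValueError("tokenizer_ratio must be positive")
    return budget_tokens / tokenizer_ratio


BASE_FRACTION = 0.30      # the 30% rule named in the text
WINDOW_TOKENS = 200_000   # hypothetical window size, for illustration only

for ratio in (1.0, 1.35):  # endpoints of the quoted tokenizer range
    unchanged = old_token_equivalent(BASE_FRACTION * WINDOW_TOKENS, ratio)
    adjusted = recalibrated_fraction(BASE_FRACTION, ratio)
    print(
        f"ratio={ratio:.2f}: an unchanged 30% budget holds about "
        f"{unchanged:,.0f} old-token equivalents; "
        f"equivalent fraction = {adjusted:.3f}"
    )
```

Under these assumptions, the upper end of the range would move an equivalent budget from 30% to roughly 40.5% of the window (0.30 × 1.35). Whether any such change is warranted is a Step 2 question, not something this arithmetic settles.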

Discipline guard (do not skip)

Per docs/harness-evolution-discipline.md: "a claimed shift does not by itself motivate a methodology edit — only an empirical re-test showing the failure mode no longer surfaces (or moved) does."

Everything above is Step 1 only. Three components are flagged as candidates for retire-eligibility, but none can be retired without Step 2 evidence. Skipping Step 2 and acting on Step 1 alone is the "the new model is much better, drop the scaffolding" anti-pattern.


Next actions for the maintainer

  • Step 2 (empirical re-test) — run representative changes through the methodology under each of the three new model classes:

    • One Lean-mode change (single surface, ≤5-min verification)
    • One Full-mode change (multi-surface scope)
    • One capability-frontier task

    For each component flagged above, record: did the component fire? did it add value, fire-but-find-nothing, or did a new failure surface that no component caught?

  • Step 3 (classification) — for each evaluated component, classify into:

    • Re-justified — still load-bearing; update rationale paragraph to cite this sweep
    • Retire-eligible — open a sweep-backed retirement Lean-mode change per mode-decision-tree.md §Scenarios that force Lean (the harness-evolution-sweep-backed canonical retirement row)
    • New-failure-surfaced — open a Full-mode L1+ change to add a new component
  • Step 4 (record) — single CHANGELOG entry naming the sweep, components evaluated, per-component outcome, and pointer to empirical re-test artifacts. Per-component edits cite the sweep entry.
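
One way to keep the Step 2 observations and Step 3 outcomes consistent is to record each re-test run in a small structure and derive the classification from it. The following is a minimal sketch, not the discipline's defined procedure: the `RetestRecord` type, the field names, and the decision rule in `classify` are all assumptions made for illustration, and the sample records are invented placeholders rather than results.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Observation(Enum):
    DID_NOT_FIRE = "did-not-fire"
    ADDED_VALUE = "added-value"
    FIRED_FOUND_NOTHING = "fire-but-find-nothing"


class Classification(Enum):
    RE_JUSTIFIED = "Re-justified"
    RETIRE_ELIGIBLE = "Retire-eligible"
    NEW_FAILURE_SURFACED = "New-failure-surfaced"


@dataclass(frozen=True)
class RetestRecord:
    component: str            # e.g. "§12 Context anxiety"
    model: str                # model class under test
    task: str                 # "lean", "full", or "frontier"
    observation: Observation
    uncaught_failure: bool = False  # a failure surfaced that no component caught


def classify(records: Sequence[RetestRecord]) -> Classification:
    """Assumed rule: uncaught failures dominate, then any added value."""
    if not records:
        # No Step 2 evidence means no classification at all.
        raise ValueError("classification requires at least one re-test record")
    if any(r.uncaught_failure for r in records):
        return Classification.NEW_FAILURE_SURFACED
    if any(r.observation is Observation.ADDED_VALUE for r in records):
        return Classification.RE_JUSTIFIED
    return Classification.RETIRE_ELIGIBLE


# Placeholder records, invented purely to exercise the function.
sample = [
    RetestRecord("§12 Context anxiety", "model-under-test", "lean",
                 Observation.DID_NOT_FIRE),
    RetestRecord("§12 Context anxiety", "model-under-test", "full",
                 Observation.FIRED_FOUND_NOTHING),
    RetestRecord("§12 Context anxiety", "model-under-test", "frontier",
                 Observation.ADDED_VALUE),
]
print(classify(sample).value)
```

Raising on an empty record list mirrors the discipline guard: without Step 2 evidence there is nothing to classify. A Retire-eligible result from a rule like this would still only be a candidate for the maintainer's judgment.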

Window: ~30 days from the latest material release (Gemini 3.1 Pro, late April 2026) → soft deadline late May 2026.


🤖 First execution of the harness-evolution sweep. Run interactively, not via the scheduled routine (which was created and disabled in the same session, per the maintainer's preference for direct execution). Subsequent sweeps remain ad hoc unless a routine is re-enabled.

Note: label harness-evolution-sweep does not yet exist in this repo; consider creating it for routing of future sweep issues.
