Trigger
Material releases detected in three of the methodology's bridged runtimes within the 2026-03-31 → 2026-04-30 window:
| Runtime | Release | Date |
|---|---|---|
| Claude Code (Anthropic) | Claude Opus 4.7 | 2026-04-16 |
| Codex (OpenAI) | GPT-5.5 | 2026-04-23 (Codex), 2026-04-24 (API) |
| Gemini CLI (Google) | Gemini 3.1 Pro | April 2026 |
Cursor (3.1 / 3.2 / SDK / /multitask / /debug) and Windsurf 2.0 + Devin integration were classified as non-material — IDE / partnership feature work, no underlying model class transition.
Previous sweep fire: none. This is the first execution of the harness-evolution sweep, run interactively immediately after v1.35.0 shipped the discipline.
This issue executes Step 1 (Map) of the harness-evolution discipline (docs/harness-evolution-discipline.md). Steps 2 (empirical re-test), 3 (classification), and 4 (record) remain with the maintainer.
Step 1 — Capability shifts × existing methodology components
A. Claude Opus 4.7 (2026-04-16)
Anthropic's release notes explicitly direct users to "re-tune their prompts and harnesses accordingly" — this is the trigger criterion the discipline was authored against. Per-axis mapping:
| Capability axis claimed | Existing methodology component | SoT location |
|---|---|---|
| "Substantially better at following instructions" + "works coherently for hours, pushes through hard problems rather than giving up" (direct anti-§12 claim) | §12 Context anxiety | docs/ai-operating-contract.md §12 |
| "Most consistent long-context performance" + "better at using file system-based memory… across long, multi-session work" | Compaction algorithm + resume-mode 30% rule + §Pre-compression protection list | docs/change-manifest-spec.md §Compaction algorithm, skills/engineering-workflow/references/resumption-protocol.md §Step 2b, docs/ai-project-memory.md §Pre-compression protection list |
| "Double-digit jump in accuracy of tool calls and planning" + "loop resistance most critical improvement" | §11 Verbal-completion illusion, review-loop-pattern iteration cap (5), fix-retest-loop three-failure rule | docs/ai-operating-contract.md §11, skills/engineering-workflow/references/review-loop-pattern.md, skills/engineering-workflow/phases/fix-retest-loop.md |
| "Correctly reports when data is missing" + "resists dissonant-data traps that even Opus 4.6 falls for" (direct anti-fabrication claim) | Pre-handoff self-check Q2 (reference-existence verification), §Anti-rationalization Rule 1 (confidence without substantiation), §9 Non-fabrication list | docs/multi-agent-handoff.md §Pre-handoff self-check Q2, §Anti-rationalization rules Rule 1, docs/ai-operating-contract.md §9 |
| "More opinionated… rather than simply agreeing with the user" | §Anti-rationalization rules (potential new failure-shape symmetry: Reviewer over-pushback) | docs/multi-agent-handoff.md §Anti-rationalization rules |
| Introduces xhigh effort tier between high and max | Already integrated at bridge layer in v1.34.0 (no methodology shift required) | agents/implementer-deep.md, agents/README.md |
B. GPT-5.5 (2026-04-23) and Gemini 3.1 Pro (April 2026)
Less granular per-axis claims; mapping at axis level:
| Axis | Affected methodology components |
|---|---|
| Agentic / Computer Use (Gemini's new tool, GPT-5.5's positioning for "computer use") | docs/cross-cutting-concerns.md §Application-driven verification, skills/engineering-workflow/references/application-driven-loop.md, §11 Verbal-completion illusion |
| Multi-step coherence (both releases) | docs/ai-operating-contract.md §12 Context anxiety, skills/engineering-workflow/references/long-running-delegation.md D1/D2 |
| Token efficiency ("significantly fewer tokens" in GPT-5.5) | Manifest size ceiling at 2000 lines (output shape may shift) |
| MCP for Deep Research (Gemini) | docs/ai-project-memory.md §Tier 2 / CCKN, docs/repo-as-context-discipline.md Rule 3 (transcoding boundaries) |
Step 1 — Components newly at risk (require Step 2 to confirm)
🔴 High priority — direct claim against the exact failure mode the component was authored against:
- §12 Context anxiety: Opus 4.7's "works coherently for hours, pushes through hard problems rather than giving up" names this exact failure. The discipline was just shipped today (v1.35.0). The ironic case: this may be the first component to surface as retire-eligible. ⚠️ Test under realistic long-task workload before any retirement is considered. Empirical evidence is the load-bearing step (per docs/harness-evolution-discipline.md §Anti-patterns "the new model is much better, drop the scaffolding").
- Pre-handoff self-check Q2 (reference-existence verification) — "correctly reports when data is missing" + "resists dissonant-data traps" directly target Q2's failure shape. Test fabrication rate on representative changes.
🟡 Medium priority — overlapping claim, indirect mapping:
- §Anti-rationalization Rule 1 (confidence-without-substantiation) — overlapping claim with Q2 above.
- review-loop-pattern iteration cap (5) — "loop resistance most critical" may mean fewer iterations are needed; capability ≠ tail-failure rate, so do not retire on claim alone.
🟢 Watch but lower priority:
- §11 Verbal-completion illusion — tool-use precision claims are general, not aimed at action-transition turn-end specifically.
- fix-retest-loop three-failure rule — same axis as review-loop-pattern.
Step 1 — Possible new failure modes (require Step 3 to confirm)
- More-opinionated-base behavior: Opus 4.7's "takes a more opinionated perspective, rather than simply agreeing with the user" could surface as Reviewer too aggressive with send-backs (overriding Planner intent without escalation). Symmetric to existing §Anti-rationalization rules but on the other side of the boundary. Watch in Step 2.
- Computer-use-tool integration drift: GPT-5.5 + Gemini 3.1 Pro both expose Computer Use as a first-class tool. runtime-hook-contract.md Categories may not have anticipated this surface explicitly. Likely a new hook category is NOT needed (existing categories cover the I/O shape), but verify in Step 2.
- Token accounting shift: GPT-5.5 uses significantly fewer tokens; Opus 4.7's tokenizer changed (1.0–1.35× ratio). The 30% context-budget rule may need numeric recalibration to preserve effective headroom — not a discipline change, but a tuning.
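The recalibration arithmetic behind the token accounting point can be sketched as follows. This is a minimal illustration, assuming the 30% rule is a fraction of a fixed-size context window measured in tokens and that it is meant to cover a fixed amount of text; the actual semantics live in the resumption protocol and may differ. `recalibrated_fraction` is a hypothetical helper, not part of the methodology.

```python
def recalibrated_fraction(old_fraction: float, token_ratio: float) -> float:
    """Scale a context-budget fraction so it covers the same amount of text.

    Assumptions (not taken from the methodology docs): the budget is a
    fraction of a fixed-size context window, and `token_ratio` is
    new tokens / old tokens for the same text.
    """
    if old_fraction <= 0 or token_ratio <= 0:
        raise ValueError("fraction and ratio must be positive")
    # Cap at 1.0: a budget cannot exceed the whole window.
    return min(1.0, old_fraction * token_ratio)


# The stated Opus 4.7 tokenizer range is 1.0 to 1.35x; evaluate both ends.
for ratio in (1.0, 1.35):
    fraction = recalibrated_fraction(0.30, ratio)
    print(f"ratio {ratio:.2f} -> budget fraction {fraction:.3f}")
```

Under these assumptions the upper end of the range moves a 30% budget to roughly 40% of the window. Whether that is the right correction is a Step 2 question, since the rule may be protecting headroom rather than text capacity.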
Discipline guard (do not skip)
Per docs/harness-evolution-discipline.md: "a claimed shift does not by itself motivate a methodology edit — only an empirical re-test showing the failure mode no longer surfaces (or moved) does."
Everything above is Step 1 only. Three components are flagged as candidates for retire-eligibility, but none can be retired without Step 2 evidence. Skipping Step 2 and acting on Step 1 alone is the "the new model is much better, drop the scaffolding" anti-pattern.
Next actions for the maintainer
Step 2 (empirical re-test) — run representative changes through the methodology under each of the three new model classes:
For each component flagged above, record: did the component fire? did it add value, fire-but-find-nothing, or did a new failure surface that no component caught?
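One way to capture these per-component observations is a small record type. The sketch below is illustrative only: the `Outcome` members mirror the questions above, while `RetestRecord`, its fields, and `summarize` are hypothetical names that the methodology does not define.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    ADDED_VALUE = "fired and added value"
    FIRED_FOUND_NOTHING = "fired but found nothing"
    DID_NOT_FIRE = "did not fire"
    UNCAUGHT_FAILURE = "new failure surfaced that no component caught"


@dataclass(frozen=True)
class RetestRecord:
    component: str   # e.g. "§12 Context anxiety"
    model: str       # model class under test
    outcome: Outcome
    artifact: str    # pointer to the re-test evidence (hypothetical field)

    @property
    def fired(self) -> bool:
        # A component "fired" in both the useful and the empty-handed case.
        return self.outcome in (Outcome.ADDED_VALUE, Outcome.FIRED_FOUND_NOTHING)


def summarize(records):
    """Count outcomes per component so classification starts from evidence."""
    summary = {}
    for record in records:
        per_component = summary.setdefault(record.component, {})
        per_component[record.outcome] = per_component.get(record.outcome, 0) + 1
    return summary
```

A record would then be built as `RetestRecord("§12 Context anxiety", "<model>", Outcome.FIRED_FOUND_NOTHING, "<path to re-test notes>")`, with the placeholders replaced by real run data. Keeping the artifact pointer on every record makes it straightforward for the Step 4 CHANGELOG entry to cite the evidence.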
Step 3 (classification). For each evaluated component, classify into:
mode-decision-tree.md §Scenarios that force Lean (the harness-evolution-sweep-backed canonical retirement row)
Step 4 (record): single CHANGELOG entry naming the sweep, components evaluated, per-component outcome, and pointer to empirical re-test artifacts. Per-component edits cite the sweep entry.
Window: ~30 days from the latest material release (Gemini 3.1 Pro, late April 2026) → soft deadline late May 2026.
Sources
🤖 First execution of the harness-evolution sweep. Run interactively (not via the scheduled routine — that was created and disabled in the same session per maintainer's preference for direct execution). Subsequent sweeps remain ad-hoc until / unless a routine is re-enabled.
Note: label harness-evolution-sweep does not yet exist in this repo; consider creating it for routing of future sweep issues.