Harness-evolution sweep: April 2026 model class transitions (Claude Opus 4.7 + GPT-5.5 + Gemini 3.1 Pro) #17

## Trigger

**Material releases detected** in three of the methodology's bridged runtimes within the 2026-03-31 → 2026-04-30 window:

| Runtime | Release | Date |
|---|---|---|
| Claude Code (Anthropic) | Claude Opus 4.7 | 2026-04-16 |
| Codex (OpenAI) | GPT-5.5 | 2026-04-23 (Codex), 2026-04-24 (API) |
| Gemini CLI (Google) | Gemini 3.1 Pro | April 2026 |

Cursor (3.1 / 3.2 / SDK / `/multitask` / `/debug`) and Windsurf 2.0 + Devin integration were classified as **non-material** — IDE / partnership feature work, no underlying model class transition.

**Previous sweep fire**: none. This is the first execution of the harness-evolution sweep, run interactively immediately after v1.35.0 shipped the discipline.

This issue executes **Step 1 (Map)** of the harness-evolution discipline (`docs/harness-evolution-discipline.md`). Steps 2 (empirical re-test), 3 (classification), and 4 (record) remain with the maintainer.

---

## Step 1 — Capability shifts × existing methodology components

### A. Claude Opus 4.7 (2026-04-16)

Anthropic's release notes explicitly direct users to *"re-tune their prompts and harnesses accordingly"* — this is the trigger criterion the discipline was authored against. Per-axis mapping:

| Capability axis claimed | Existing methodology component | SoT location |
|---|---|---|
| *"Substantially better at following instructions"* + *"works coherently for hours, pushes through hard problems rather than giving up"* — direct anti-§12 claim | **§12 Context anxiety** | `docs/ai-operating-contract.md §12` |
| *"Most consistent long-context performance"* + *"better at using file system-based memory… across long, multi-session work"* | **Compaction algorithm** + **resume-mode 30% rule** + **§Pre-compression protection list** | `docs/change-manifest-spec.md §Compaction algorithm`, `skills/engineering-workflow/references/resumption-protocol.md §Step 2b`, `docs/ai-project-memory.md §Pre-compression protection list` |
| *"Double-digit jump in accuracy of tool calls and planning"* + *"loop resistance most critical improvement"* | **§11 Verbal-completion illusion**, **review-loop-pattern iteration cap (5)**, **fix-retest-loop three-failure rule** | `docs/ai-operating-contract.md §11`, `skills/engineering-workflow/references/review-loop-pattern.md`, `skills/engineering-workflow/phases/fix-retest-loop.md` |
| *"Correctly reports when data is missing"* + *"resists dissonant-data traps that even Opus 4.6 falls for"* — direct anti-fabrication claim | **Pre-handoff self-check Q2 (reference-existence verification)**, **§Anti-rationalization Rule 1 (confidence without substantiation)**, **§9 Non-fabrication list** | `docs/multi-agent-handoff.md §Pre-handoff self-check Q2`, `§Anti-rationalization rules` Rule 1, `docs/ai-operating-contract.md §9` |
| *"More opinionated… rather than simply agreeing with the user"* | **§Anti-rationalization rules** (potential new failure-shape symmetry — Reviewer over-pushback) | `docs/multi-agent-handoff.md §Anti-rationalization rules` |
| Introduces `xhigh` effort tier between `high` and `max` | Already integrated at bridge layer in v1.34.0 (no methodology shift required) | `agents/implementer-deep.md`, `agents/README.md` |

### B. GPT-5.5 (2026-04-23) and Gemini 3.1 Pro (April 2026)

Both releases make less granular per-axis claims than Opus 4.7, so the mapping here is at axis level:

| Axis | Affected methodology components |
|---|---|
| Agentic / Computer Use (Gemini's new tool, GPT-5.5's positioning for "computer use") | `docs/cross-cutting-concerns.md §Application-driven verification`, `skills/engineering-workflow/references/application-driven-loop.md`, §11 Verbal-completion illusion |
| Multi-step coherence (both releases) | `docs/ai-operating-contract.md §12 Context anxiety`, `skills/engineering-workflow/references/long-running-delegation.md` D1/D2 |
| Token efficiency (*"significantly fewer tokens"* in GPT-5.5) | Manifest size ceiling at 2000 lines — output shape may shift |
| MCP for Deep Research (Gemini) | `docs/ai-project-memory.md §Tier 2 / CCKN`, `docs/repo-as-context-discipline.md` Rule 3 (transcoding boundaries) |

---

## Step 1 — Components newly at risk (require Step 2 to confirm)

🔴 **High priority** — direct claim against the exact failure mode the component was authored against:

- **§12 Context anxiety** — Opus 4.7's *"works coherently for hours, pushes through hard problems rather than giving up"* names this exact failure. **The discipline shipped today (v1.35.0)**, so ironically this may be the first component to surface as retire-eligible. ⚠️ Test under a realistic long-task workload before considering any retirement. Empirical evidence is the load-bearing step; acting without it is the *"the new model is much better, drop the scaffolding"* anti-pattern in `docs/harness-evolution-discipline.md §Anti-patterns`.
- **Pre-handoff self-check Q2 (reference-existence verification)** — *"correctly reports when data is missing"* + *"resists dissonant-data traps"* directly target Q2's failure shape. Test fabrication rate on representative changes.

🟡 **Medium priority** — overlapping claim, indirect mapping:

- **§Anti-rationalization Rule 1** (confidence-without-substantiation) — overlapping claim with Q2 above.
- **review-loop-pattern iteration cap (5)** — *"loop resistance most critical"* may mean fewer iterations are needed; capability ≠ tail-failure rate, so do not retire on claim alone.

🟢 **Watch but lower priority**:

- **§11 Verbal-completion illusion** — tool-use precision claims are general, not aimed at action-transition turn-end specifically.
- **fix-retest-loop three-failure rule** — same axis as review-loop-pattern.

---

## Step 1 — Possible new failure modes (require Step 3 to confirm)

- **More-opinionated-base behavior**: Opus 4.7's *"takes a more opinionated perspective, rather than simply agreeing with the user"* could surface as a Reviewer that is too aggressive with send-backs (overriding Planner intent without escalation). This is symmetric to the existing §Anti-rationalization rules, but on the *other* side of the boundary. Watch in Step 2.
- **Computer-use-tool integration drift**: GPT-5.5 + Gemini 3.1 Pro both expose Computer Use as a first-class tool. `runtime-hook-contract.md` Categories may not have anticipated this surface explicitly. Likely a new hook category is NOT needed (existing categories cover the I/O shape), but verify in Step 2.
- **Token accounting shift**: GPT-5.5 uses significantly fewer tokens; Opus 4.7's tokenizer changed (1.0–1.35× ratio). The 30% context-budget rule may need numeric recalibration to preserve effective headroom. That would be a tuning adjustment, not a discipline change.
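
The recalibration in the last bullet is simple arithmetic, sketched below under one hypothetical reading of the rule: the 30% figure is a threshold expressed as a fraction of the context window in tokens, and the window size itself is unchanged. Neither assumption is confirmed by the methodology docs cited here, and `recalibrated_threshold` is an illustrative name rather than an existing helper.

```python
def recalibrated_threshold(old_threshold: float, token_ratio: float) -> float:
    """Scale a context-budget threshold so it still covers the same amount of text.

    Assumptions (not taken from the methodology docs): the threshold is a
    fraction of the context window measured in tokens, and the window size
    is unchanged. token_ratio is the number of new-tokenizer tokens per
    old-tokenizer token for the same text.
    """
    if not 0.0 < old_threshold <= 1.0:
        raise ValueError("old_threshold must be in (0, 1]")
    if token_ratio <= 0.0:
        raise ValueError("token_ratio must be positive")
    # The same text now costs token_ratio times as many tokens, so the
    # threshold grows by the same factor, capped at the full window.
    return min(1.0, old_threshold * token_ratio)


# The two ends of the 1.0-1.35x range quoted above.
for ratio in (1.0, 1.35):
    print(f"ratio {ratio:.2f}: threshold {recalibrated_threshold(0.30, ratio):.3f}")
```

At the upper end of the quoted range the equivalent threshold would be roughly 40% rather than 30%. If the rule instead describes headroom that must stay free, the adjustment runs the other way, so Step 2 measurements rather than this arithmetic should set the final number.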

---

## Discipline guard (do not skip)

Per `docs/harness-evolution-discipline.md`: *"a claimed shift does not by itself motivate a methodology edit — only an empirical re-test showing the failure mode no longer surfaces (or moved) does."*

Everything above is **Step 1 only**. The components flagged above are candidates for retire-eligibility, but **none can be retired without Step 2 evidence**. Skipping Step 2 and acting on Step 1 alone is the *"the new model is much better, drop the scaffolding"* anti-pattern.

---

## Next actions for the maintainer

- [ ] **Step 2 (empirical re-test)** — run representative changes through the methodology under each of the three new model classes:
  - One Lean-mode change (single surface, ≤5-min verification)
  - One Full-mode change (multi-surface scope)
  - One capability-frontier task
  
  For each component flagged above, record: *did the component fire? If it fired, did it add value or find nothing? Did a new failure surface that no component caught?*
  
- [ ] **Step 3 (classification)** — for each evaluated component, classify into:
  - **Re-justified** — still load-bearing; update rationale paragraph to cite this sweep
  - **Retire-eligible** — open a sweep-backed retirement Lean-mode change per `mode-decision-tree.md §Scenarios that force Lean` (the harness-evolution-sweep-backed canonical retirement row)
  - **New-failure-surfaced** — open a Full-mode L1+ change to add a new component

- [ ] **Step 4 (record)** — single CHANGELOG entry naming the sweep, components evaluated, per-component outcome, and pointer to empirical re-test artifacts. Per-component edits cite the sweep entry.
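
One way to keep Steps 2 and 3 honest is to enumerate the run matrix and record each observation in a fixed shape. The sketch below is illustrative only: the field names, the `classify` rule, and its full-coverage requirement are assumptions made for this example, not part of `docs/harness-evolution-discipline.md`, and the real thresholds remain the maintainer's call.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from itertools import product

MODEL_CLASSES = ("Claude Opus 4.7", "GPT-5.5", "Gemini 3.1 Pro")
CHANGE_TYPES = ("lean-mode", "full-mode", "capability-frontier")
# Step 2: every model class runs every representative change.
RUN_MATRIX = tuple(product(MODEL_CLASSES, CHANGE_TYPES))


class Outcome(Enum):
    RE_JUSTIFIED = "re-justified"
    RETIRE_ELIGIBLE = "retire-eligible"
    NEW_FAILURE_SURFACED = "new-failure-surfaced"


@dataclass(frozen=True)
class Observation:
    """One Step 2 record for one component in one run (hypothetical fields)."""
    component: str
    model_class: str
    change_type: str
    fired: bool
    added_value: bool
    uncaught_failure: bool


def classify(observations: list[Observation]) -> Outcome | None:
    """Illustrative Step 3 rule; returns None when evidence is insufficient."""
    if any(o.uncaught_failure for o in observations):
        return Outcome.NEW_FAILURE_SURFACED
    if any(o.fired and o.added_value for o in observations):
        return Outcome.RE_JUSTIFIED
    covered = {(o.model_class, o.change_type) for o in observations}
    if covered != set(RUN_MATRIX):
        # Discipline guard: no retirement verdict on a partial matrix.
        return None
    return Outcome.RETIRE_ELIGIBLE


# Synthetic placeholder records, NOT sweep results.
placeholder = [
    Observation("§12 Context anxiety", m, c,
                fired=False, added_value=False, uncaught_failure=False)
    for m, c in RUN_MATRIX
]
print(len(RUN_MATRIX))  # 9
print(classify(placeholder[:3]))
print(classify(placeholder))
```

Returning `None` for a partial matrix mirrors the discipline guard: a component that merely stayed quiet in a few runs is not yet retire-eligible. A nine-run matrix also says little about tail-failure rates, so a real rule would likely need more runs per cell.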

**Window**: ~30 days from the latest material release (Gemini 3.1 Pro, late April 2026) → soft deadline late May 2026.
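
The soft deadline can be bounded with date arithmetic. Because the exact Gemini 3.1 Pro release date is not given above, this sketch assumes two bounds: the GPT-5.5 API date (2026-04-24) as the earliest plausible "latest release", and the end of the detection window (2026-04-30) as the latest. `soft_deadline` is a hypothetical helper name.

```python
from datetime import date, timedelta

# Assumption: the "~30 days" window is treated as exactly 30 calendar days.
SWEEP_WINDOW = timedelta(days=30)


def soft_deadline(latest_release: date) -> date:
    """Return the soft deadline for a sweep triggered by latest_release."""
    return latest_release + SWEEP_WINDOW


print(soft_deadline(date(2026, 4, 24)))  # 2026-05-24
print(soft_deadline(date(2026, 4, 30)))  # 2026-05-30
```

Both bounds fall in late May 2026, consistent with the soft deadline stated above.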

---

## Sources

- [Anthropic — Introducing Claude Opus 4.7](https://www.anthropic.com/news/claude-opus-4-7)
- [OpenAI — Introducing GPT-5.5](https://openai.com/index/introducing-gpt-5-5/)
- [Google Gemini API release notes](https://ai.google.dev/gemini-api/docs/changelog)
- [Cursor changelog](https://www.cursor.com/changelog) (non-material; classified out)
- [Windsurf blog](https://windsurf.com/blog) (non-material; classified out)

---

🤖 First execution of the harness-evolution sweep. Run interactively, not via the scheduled routine, which was created and disabled in the same session per the maintainer's preference for direct execution. Subsequent sweeps remain ad hoc unless and until a routine is re-enabled.

*Note: label `harness-evolution-sweep` does not yet exist in this repo; consider creating it for routing of future sweep issues.*
