diff --git a/CHANGELOG.json b/CHANGELOG.json index aabbe5a..8d15cd2 100644 --- a/CHANGELOG.json +++ b/CHANGELOG.json @@ -28,6 +28,30 @@ { "title": null, "body": "**`schemas/change-manifest.schema.yaml` (and generated `.json`) — new optional `implementation_clusters[*].isolation_root` field.** Declarative companion to the new §Runtime isolation section: a string field naming the cluster's exclusive write region (worktree path, container mount, scoped filesystem view). **Optional**, additive (existing manifests without the field validate unchanged); not part of the cluster's `required` list because not every runtime can produce a stable identifier (ephemeral container mounts, in-memory filesystems). When declared, gives the Reviewer a verifiable target (\"did this cluster's Implementer in fact write only within `isolation_root`?\"); when absent, the runtime-isolation property still binds but cannot be audited from the manifest alone. Field description cross-references `cluster-parallelism.md §Runtime isolation §Per-cluster exclusive write region` so a runtime bridge author landing in the schema can navigate to the binding rule. Backwards compatibility: existing 1.x manifests using `implementation_clusters` without `isolation_root` continue to validate; pre-1.30 manifests (without `implementation_clusters` at all) continue to validate." + }, + { + "title": "`docs/ai-operating-contract.md` — new `§12 Context anxiety` (premature task-shrink under perceived context pressure) section + `docs/glossary.md §Anti-pattern vocabulary` term + cross-references in `resumption-protocol.md` and `ai-project-memory.md`.", + "body": "Adds a new failure-mode section parallel in structure to §11 (verbal-completion illusion). 
§12 names the **intra-session** failure mode where a long-running agent **perceives** it is approaching its context window limit and **prematurely truncates remaining work** — task scope shrinks, planned evidence rows disappear from `evidence_plan.artifacts`, the session ends with a wrap-up narrative instead of executing the rest of the plan. The perception may or may not be accurate; the failure is the agent **acting on the perception without verifying** and **without declaring the truncation as a scope change or escalation**. Distinct from §4 context hygiene (cross-session fact loss to compression), §11 verbal-completion illusion (single tool call missed at action transition), and §3 evidence-before-completion failure (self-claim mistaken for verification): the four share the surface symptom *\"work declared done that is not done\"* but their upstream causes and corrective actions differ, and misdiagnosing one as another routes the agent into the wrong remedy (e.g. §11's couple-narration-with-tool-call rule does not lift §12's perceived pressure, and §12's prefer-Context-Reset rule does not address §11's missing tool call). The new section follows §11's structural template (Why this rule exists / Symptoms / Must do / Must not do / Distinguishing from adjacent failures / Risk-point inventory / Relation to other sections) and explicitly **prefers Context Reset over Compaction** as the remedy — compaction shrinks the conversation but preserves the same session's pressure perception, so the anxiety re-fires on the next stretch of work; a fresh session reading a Manifest-backed handoff does not carry the perception forward. The corresponding glossary entry lands in `§Anti-pattern vocabulary` after `False completion` with a three-row distinguishing table mapping each adjacent failure to its strike-point, cause, and remedy. 
Cross-references propagate: `skills/engineering-workflow/references/resumption-protocol.md §Step 2b` gains an \"Outgoing-session counterpart\" callout naming §12 as the symmetric outgoing rule (incoming sessions estimate before reading; outgoing sessions estimate before declaring done); `docs/ai-project-memory.md §Pre-compression protection list` gains a callout distinguishing the rescue protocol (what to save when compression is imminent) from §12 (what *not to do* when the agent merely *anticipates* compression). Closes the gap where the methodology had names for the cross-session form of context loss (§4) and the action-transition form (§11) but not the long-tail intra-session form most often observed at the wrap-up boundary of multi-task plans." + }, + { + "title": "`docs/multi-agent-handoff.md` — new `§Acceptance criteria as a Sprint Contract` subsection (Planner-side, Phase 2).", + "body": "Inserted after `§Task Prompt structure §Mode application` and before `### Implementer`. The §Task Prompt structure six-column table already requires AC to be \"Numbered, individually-checkable statements verifiable against a `file:line` and an evidence artifact.\" The new subsection adds the **time-axis discipline**: it is the Planner's obligation to clear AC quality at the time of writing, not just to leave the verification check to the Implementer's egress (`§Pre-handoff self-check Q1`). Two named rules: **(1) Reviewer-anticipation rule (Planner → Reviewer direction)** — before handing the Task Prompt to the Implementer, the Planner asks \"if the Reviewer audited this AC, what specifically would they look for?\" Each AC must answer with a concrete `file:line` + evidence-type pair *at the time of writing*. §Pre-handoff self-check Q1 is the last line of defence; this rule catches the same failure at the source, before the Implementer has spent cycles on AC that could not have been verified anyway. 
Crucially, this is **not** the Reviewer participating in Phase 2 — the Planner imagines the Reviewer's audit; the Reviewer themselves remains a separate identity entering at Phase 5 (bringing a real Reviewer into Phase 2 would collapse §Single-agent anti-collusion rule). **(2) Reverse-shape rule (AC text → Implementer direction)** — the wording of an AC steers what the Implementer optimises for, often in ways the Planner did not intend. An AC stated as \"the endpoint returns within 50ms\" pulls toward latency tuning; an AC stated as \"the endpoint returns the correct shape under malformed input\" pulls toward error handling. Writing only one **silently de-prioritises the other** — the Implementer reads AC text as the contract; what is not in the AC is not in the contract. This is the Task Prompt analogue of `docs/agent-persona-discipline.md`'s observation that the medium of the output reverse-shapes the persona that produces it. Three pre-handoff self-check questions follow: pre-verifiability (can the Planner state the verification path *before* code is written?), dimension coverage (does the AC set cover every dimension the change cares about explicitly?), verifiability symmetry (would a takeover session reading only AC + manifest + diff know how to audit?). Mode application: Lean collapses the three questions into \"can I cite where to look and what counts as proof?\"; Zero-ceremony has the same agent run the discipline against itself. Closes the gap where the methodology bound the *evidence-side* of AC quality (verification paths must exist) but not the *time-axis* — Planners writing AC in Phase 2 had no explicit obligation to imagine the Reviewer's audit before handing off, so unverifiable AC surfaced only at the Implementer's egress-time self-check, after work had been done against them." 
+ }, + { + "title": "`docs/harness-evolution-discipline.md` — new file (Tier-3 discipline) + force-Lean retirement row in `mode-decision-tree.md` + index registrations + back-pointers.", + "body": "New canonical methodology document landing in Tier-3 of `docs/README.md`. Names a sweep distinct from the existing `anti-entropy-discipline.md` Rule 3 sweep classes: anti-entropy targets *project-local* discipline-provenance drift on a calendar cadence and explicitly excludes canonical methodology components (\"origins by definition\"); this discipline targets *canonical-methodology* components whose load-bearing-ness depends on a specific model-class failure mode, on a per-material-model-release cadence. The discipline operationalises the principle that **every methodology component encodes an assumption about what the model cannot do reliably without scaffolding** — §Single-agent anti-collusion rule encodes the self-evaluation-bias failure, §Pre-handoff self-check encodes the \"done without evidence\" failure, §11 encodes the verbal-completion illusion, §12 encodes the context-anxiety failure, §Anti-rationalization rules encode the Reviewer-rationalising-approval failure, the Compaction algorithm encodes the manifest-overflow failure. Some assumptions tighten as models improve, some become outdated, some new failure modes emerge. Without periodic re-evaluation, the methodology accumulates ceremony monotonically (every observed incident motivates a new safeguard, no symmetric pressure removes obsolete ones). Four-step procedure: (Step 1) map the release's measured capability shifts against the methodology's existing failure-mode list; (Step 2) **empirically re-test** the targeted components on representative changes (Lean / Full / capability-frontier); (Step 3) classify each component into re-justified / retire-eligible / new-failure-surfaced; (Step 4) record the sweep as a single CHANGELOG entry with per-component rationale updates citing the sweep. 
Anti-patterns explicitly enumerated: "the methodology has worked for years," "the new model is much better, drop the scaffolding," "add new components for every interesting capability claim," "run on every patch release," "treat the sweep as additive only," "bundle findings into the next big methodology change." Cadence is per material model release with a ~30-day window. Owner is repo maintainer (not adopter teams). Companion mode-decision-tree change: `skills/engineering-workflow/references/mode-decision-tree.md §Scenarios that force Lean` gains a row covering harness-evolution-sweep-backed canonical-component retirements — symmetric to the existing Discipline-provenance-sweep-backed project-local retirement row, both rows applying the asymmetric-cost lever (sweep-backed retirement Lean-eligible; addition still Full L1+) so the methodology can shed canonical weight as well as project-local weight. Companion edits: `docs/anti-entropy-discipline.md §Relationship to other documents` gains a back-pointer naming harness-evolution as the canonical-methodology counterpart; `docs/file-role-map.md` registers the new file in the topic-specific SoT list; `docs/README.md` Tier-3 disciplines section gains a row. No new schema fields, no new role definitions, no new manifest enums." + }, + { + "title": "`docs/multi-agent-handoff.md §Single-agent anti-collusion rule` — new `### Why this rule exists` preamble naming self-evaluation bias as the underlying behavioural failure.", + "body": "Adds a preamble before the structural rule explaining the **behavioural failure** the structural rule enforces against. 
The pattern named: *AI agents asked to evaluate work they have produced systematically over-report quality.* The preamble traces the pattern across multiple surfaces (Reviewer's anti-rationalisation failures, Implementer's self-supervising loop in `ai-operating-contract.md §Rejected patterns`, autonomous self-terminating loop, broader self-praising of mediocre output) and identifies the shared mechanism: *the same identity that produced the work cannot reliably hold an adversarial stance toward it.* The preamble explicitly names that this is **not prompt-engineerable away** — instructing an agent to \"be critical of your own work\" makes the surface text more critical without making the evaluation more accurate; the reliable fix is **structural separation**, with the auditor's tool envelope mechanically prevented from touching the work. The §Tool-permission matrix's *Reviewer has no write tools* row is identified as the load-bearing form. The preamble also cross-references `harness-evolution-discipline.md` for re-evaluation of whether the failure still binds on a given model class — the structural rule stays as long as the failure does, but the discipline's boundary is sweep-evaluable rather than perpetually fixed. Closes the gap where the rule was structurally complete (every consumer cited it) but lacked a stated rationale at the SoT — readers landing on the rule from a runtime bridge or thin-bridge file would see the prohibition without seeing the underlying behavioural pattern, so the rule's strictness read as procedural preference rather than counter-pressure to a behavioural failure." + }, + { + "title": "`docs/multi-agent-handoff.md §Capability gating by risk level` — new \"Risk is one axis; capability frontier is another\" callout.", + "body": "The existing matrix scales the role envelope on `breaking_change.level × rollback_mode` — a *blast-radius* axis. 
The callout names a second axis the matrix deliberately does not encode: how far the task sits from what the current model class does reliably solo (the *capability-frontier* axis). At equivalent blast-radius, a low-L mode-1 change at the capability frontier (unfamiliar SoT pattern, multi-step task at the edge of long-context coherence, model-class-new task domain) earns more Reviewer attention and benefits more from a registered specialist than the same risk profile applied to a well-trodden change shape. Conversely, a change shape the model has executed reliably across many prior changes can be reviewed with the baseline envelope even when the matrix's risk-axis row would technically permit more. The matrix's *additional gating* column is a **floor, not a ceiling** — capability-frontier signals (Discovery-loop frequency on similar prior changes, novel SoT pattern, model-class-new task domain) can motivate raising envelope strictness above the floor, recorded as an `escalations[*]` entry naming the capability-frontier rationale rather than a risk-axis trigger. The risk-axis is the encoded mechanical enforcement boundary; the capability-frontier axis is the human / Planner judgement signal that lives alongside it. Re-evaluation of where the capability frontier sits is a `harness-evolution-discipline.md` concern; per-change sensitivity to it is a Planner concern. Closes the gap where adopters reading the matrix in isolation might infer the role envelope's strictness was a function of risk only — not so; both axes apply." + }, + { + "title": "`docs/mechanical-enforcement-discipline.md` — new `§Boundary with non-mechanical evaluation` section + Planner-side allocation rule + cross-references from `multi-agent-handoff.md §Reviewer`.", + "body": "Inserted after `§How much enforcement is right` (which already discusses the boundary in the negative — \"Reviewer effort going to mechanical issues\" as an under-enforcement signal). 
The new section names the boundary in the **positive**: what each evaluator type is *for*, and how the three layer together. Names the **three evaluator types** the methodology relies on as a single coherent stack — **Mechanical** (this doc; pass/fail predicates over source / structure / artifact shape; cheapest, fires on every event, uniform coverage), **Application-driven** (`cross-cutting-concerns.md §Application-driven verification` + `application-driven-loop.md`; runtime behaviour evidence — page actually renders, API actually returns the shape, log entry actually appears in deployed stack; per-check cost; mandatory on user / operational surfaces above L2 per `autonomy-ladder-discipline.md`), **Agentic Reviewer audit** (`multi-agent-handoff.md §Reviewer`; cross-cutting concerns / breaking-change classification / rollback-mode appropriateness / surface coverage / claim substantiation; per-change cost, bounded by §Anti-rationalization rules and `review-loop-pattern.md` iteration cap). Layering: mechanical is the floor, application-driven is the bridge (mechanical execution + agentic-or-mechanical interpretation — the locus where the two meet), agentic Reviewer audit is the ceiling. **Allocation rule (Planner-side, Phase 3):** each acceptance criterion is allocated to the cheapest evaluator that catches its failure shape — pass/fail predicates → mechanical; runtime-behaviour evidence → application-driven; judgment-heavy / cross-cutting → agentic Reviewer audit. The allocation is per-AC, not per-change; a change with all-mechanical evidence is under-evaluated if it touches cross-cutting concerns, a change with all-agentic evidence is over-evaluated if checks were achievable mechanically. Parallel to the new AC-as-Sprint-Contract discipline (above): same time-axis, complementary axis (Sprint Contract asks \"is this AC pre-verifiable?\"; allocation rule asks \"by which evaluator?\"). 
**New anti-pattern:** *routing by familiarity rather than by failure shape* — mechanical-heavy team writes lint rules for everything (high-noise, gets bypassed); agentic-heavy team relies on Reviewer for every dimension (Reviewer attention saturates on lint-level noise). Detection: `review_notes` repeatedly surfacing \"lint should have caught this\" or \"every iteration of this audit catches the same shape.\" Companion edit: `multi-agent-handoff.md §Reviewer §Must not do` gains a row \"Spend audit attention on what a mechanical check should have caught\" pointing back at the new section. `§Relationship to other documents` of `mechanical-enforcement-discipline.md` gains two rows naming `multi-agent-handoff.md §Reviewer` (agentic-evaluator counterpart) and `cross-cutting-concerns.md §Application-driven verification` (bridge-evaluator counterpart). Closes the gap where the methodology had three evaluator surfaces — mechanical / application-driven / agentic — operating in parallel without a single canonical comparison or allocation rule, leaving Planners to allocate AC rows by familiarity and Reviewers to absorb mechanical-shaped findings the methodology had no positive guidance against." } ], "changed": [ diff --git a/CHANGELOG.md b/CHANGELOG.md index fa01a21..01c16ad 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -20,6 +20,18 @@ Format inspired by Keep a Changelog; versioning policy in `VERSIONING.md`. - **`schemas/change-manifest.schema.yaml` (and generated `.json`) — new optional `implementation_clusters[*].isolation_root` field.** Declarative companion to the new §Runtime isolation section: a string field naming the cluster's exclusive write region (worktree path, container mount, scoped filesystem view). **Optional**, additive (existing manifests without the field validate unchanged); not part of the cluster's `required` list because not every runtime can produce a stable identifier (ephemeral container mounts, in-memory filesystems). 
When declared, gives the Reviewer a verifiable target ("did this cluster's Implementer in fact write only within `isolation_root`?"); when absent, the runtime-isolation property still binds but cannot be audited from the manifest alone. Field description cross-references `cluster-parallelism.md §Runtime isolation §Per-cluster exclusive write region` so a runtime bridge author landing in the schema can navigate to the binding rule. Backwards compatibility: existing 1.x manifests using `implementation_clusters` without `isolation_root` continue to validate; pre-1.30 manifests (without `implementation_clusters` at all) continue to validate. +- **`docs/ai-operating-contract.md` — new `§12 Context anxiety` (premature task-shrink under perceived context pressure) section + `docs/glossary.md §Anti-pattern vocabulary` term + cross-references in `resumption-protocol.md` and `ai-project-memory.md`.** Adds a new failure-mode section parallel in structure to §11 (verbal-completion illusion). §12 names the **intra-session** failure mode where a long-running agent **perceives** it is approaching its context window limit and **prematurely truncates remaining work** — task scope shrinks, planned evidence rows disappear from `evidence_plan.artifacts`, the session ends with a wrap-up narrative instead of executing the rest of the plan. The perception may or may not be accurate; the failure is the agent **acting on the perception without verifying** and **without declaring the truncation as a scope change or escalation**. Distinct from §4 context hygiene (cross-session fact loss to compression), §11 verbal-completion illusion (single tool call missed at action transition), and §3 evidence-before-completion failure (self-claim mistaken for verification): the four share the surface symptom *"work declared done that is not done"* but their upstream causes and corrective actions differ, and misdiagnosing one as another routes the agent into the wrong remedy (e.g. 
§11's couple-narration-with-tool-call rule does not lift §12's perceived pressure, and §12's prefer-Context-Reset rule does not address §11's missing tool call). The new section follows §11's structural template (Why this rule exists / Symptoms / Must do / Must not do / Distinguishing from adjacent failures / Risk-point inventory / Relation to other sections) and explicitly **prefers Context Reset over Compaction** as the remedy — compaction shrinks the conversation but preserves the same session's pressure perception, so the anxiety re-fires on the next stretch of work; a fresh session reading a Manifest-backed handoff does not carry the perception forward. The corresponding glossary entry lands in `§Anti-pattern vocabulary` after `False completion` with a three-row distinguishing table mapping each adjacent failure to its strike-point, cause, and remedy. Cross-references propagate: `skills/engineering-workflow/references/resumption-protocol.md §Step 2b` gains an "Outgoing-session counterpart" callout naming §12 as the symmetric outgoing rule (incoming sessions estimate before reading; outgoing sessions estimate before declaring done); `docs/ai-project-memory.md §Pre-compression protection list` gains a callout distinguishing the rescue protocol (what to save when compression is imminent) from §12 (what *not to do* when the agent merely *anticipates* compression). Closes the gap where the methodology had names for the cross-session form of context loss (§4) and the action-transition form (§11) but not the long-tail intra-session form most often observed at the wrap-up boundary of multi-task plans. + +- **`docs/multi-agent-handoff.md` — new `§Acceptance criteria as a Sprint Contract` subsection (Planner-side, Phase 2).** Inserted after `§Task Prompt structure §Mode application` and before `### Implementer`. 
The §Task Prompt structure six-column table already requires AC to be "Numbered, individually-checkable statements verifiable against a `file:line` and an evidence artifact." The new subsection adds the **time-axis discipline**: it is the Planner's obligation to clear AC quality at the time of writing, not just to leave the verification check to the Implementer's egress (`§Pre-handoff self-check Q1`). Two named rules: **(1) Reviewer-anticipation rule (Planner → Reviewer direction)** — before handing the Task Prompt to the Implementer, the Planner asks "if the Reviewer audited this AC, what specifically would they look for?" Each AC must answer with a concrete `file:line` + evidence-type pair *at the time of writing*. §Pre-handoff self-check Q1 is the last line of defence; this rule catches the same failure at the source, before the Implementer has spent cycles on AC that could not have been verified anyway. Crucially, this is **not** the Reviewer participating in Phase 2 — the Planner imagines the Reviewer's audit; the Reviewer themselves remains a separate identity entering at Phase 5 (bringing a real Reviewer into Phase 2 would collapse §Single-agent anti-collusion rule). **(2) Reverse-shape rule (AC text → Implementer direction)** — the wording of an AC steers what the Implementer optimises for, often in ways the Planner did not intend. An AC stated as "the endpoint returns within 50ms" pulls toward latency tuning; an AC stated as "the endpoint returns the correct shape under malformed input" pulls toward error handling. Writing only one **silently de-prioritises the other** — the Implementer reads AC text as the contract; what is not in the AC is not in the contract. This is the Task Prompt analogue of `docs/agent-persona-discipline.md`'s observation that the medium of the output reverse-shapes the persona that produces it. 
Three pre-handoff self-check questions follow: pre-verifiability (can the Planner state the verification path *before* code is written?), dimension coverage (does the AC set cover every dimension the change cares about explicitly?), verifiability symmetry (would a takeover session reading only AC + manifest + diff know how to audit?). Mode application: Lean collapses the three questions into "can I cite where to look and what counts as proof?"; Zero-ceremony has the same agent run the discipline against itself. Closes the gap where the methodology bound the *evidence-side* of AC quality (verification paths must exist) but not the *time-axis* — Planners writing AC in Phase 2 had no explicit obligation to imagine the Reviewer's audit before handing off, so unverifiable AC surfaced only at the Implementer's egress-time self-check, after work had been done against them. + +- **`docs/harness-evolution-discipline.md` — new file (Tier-3 discipline) + force-Lean retirement row in `mode-decision-tree.md` + index registrations + back-pointers.** New canonical methodology document landing in Tier-3 of `docs/README.md`. Names a sweep distinct from the existing `anti-entropy-discipline.md` Rule 3 sweep classes: anti-entropy targets *project-local* discipline-provenance drift on a calendar cadence and explicitly excludes canonical methodology components ("origins by definition"); this discipline targets *canonical-methodology* components whose load-bearing-ness depends on a specific model-class failure mode, on a per-material-model-release cadence. 
The discipline operationalises the principle that **every methodology component encodes an assumption about what the model cannot do reliably without scaffolding** — §Single-agent anti-collusion rule encodes the self-evaluation-bias failure, §Pre-handoff self-check encodes the "done without evidence" failure, §11 encodes the verbal-completion illusion, §12 encodes the context-anxiety failure, §Anti-rationalization rules encode the Reviewer-rationalising-approval failure, the Compaction algorithm encodes the manifest-overflow failure. Some assumptions tighten as models improve, some become outdated, some new failure modes emerge. Without periodic re-evaluation, the methodology accumulates ceremony monotonically (every observed incident motivates a new safeguard, no symmetric pressure removes obsolete ones). Four-step procedure: (Step 1) map the release's measured capability shifts against the methodology's existing failure-mode list; (Step 2) **empirically re-test** the targeted components on representative changes (Lean / Full / capability-frontier); (Step 3) classify each component into re-justified / retire-eligible / new-failure-surfaced; (Step 4) record the sweep as a single CHANGELOG entry with per-component rationale updates citing the sweep. Anti-patterns explicitly enumerated: "the methodology has worked for years," "the new model is much better, drop the scaffolding," "add new components for every interesting capability claim," "run on every patch release," "treat the sweep as additive only," "bundle findings into the next big methodology change." Cadence is per material model release with a ~30-day window. Owner is repo maintainer (not adopter teams). 
Companion mode-decision-tree change: `skills/engineering-workflow/references/mode-decision-tree.md §Scenarios that force Lean` gains a row covering harness-evolution-sweep-backed canonical-component retirements — symmetric to the existing Discipline-provenance-sweep-backed project-local retirement row, both rows applying the asymmetric-cost lever (sweep-backed retirement Lean-eligible; addition still Full L1+) so the methodology can shed canonical weight as well as project-local weight. Companion edits: `docs/anti-entropy-discipline.md §Relationship to other documents` gains a back-pointer naming harness-evolution as the canonical-methodology counterpart; `docs/file-role-map.md` registers the new file in the topic-specific SoT list; `docs/README.md` Tier-3 disciplines section gains a row. No new schema fields, no new role definitions, no new manifest enums. + +- **`docs/multi-agent-handoff.md §Single-agent anti-collusion rule` — new `### Why this rule exists` preamble naming self-evaluation bias as the underlying behavioural failure.** Adds a preamble before the structural rule explaining the **behavioural failure** the structural rule enforces against. 
The pattern named: *AI agents asked to evaluate work they have produced systematically over-report quality.* The preamble traces the pattern across multiple surfaces (Reviewer's anti-rationalisation failures, Implementer's self-supervising loop in `ai-operating-contract.md §Rejected patterns`, autonomous self-terminating loop, broader self-praising of mediocre output) and identifies the shared mechanism: *the same identity that produced the work cannot reliably hold an adversarial stance toward it.* The preamble explicitly names that this is **not prompt-engineerable away** — instructing an agent to "be critical of your own work" makes the surface text more critical without making the evaluation more accurate; the reliable fix is **structural separation**, with the auditor's tool envelope mechanically prevented from touching the work. The §Tool-permission matrix's *Reviewer has no write tools* row is identified as the load-bearing form. The preamble also cross-references `harness-evolution-discipline.md` for re-evaluation of whether the failure still binds on a given model class — the structural rule stays as long as the failure does, but the discipline's boundary is sweep-evaluable rather than perpetually fixed. Closes the gap where the rule was structurally complete (every consumer cited it) but lacked a stated rationale at the SoT — readers landing on the rule from a runtime bridge or thin-bridge file would see the prohibition without seeing the underlying behavioural pattern, so the rule's strictness read as procedural preference rather than counter-pressure to a behavioural failure. + +- **`docs/multi-agent-handoff.md §Capability gating by risk level` — new "Risk is one axis; capability frontier is another" callout.** The existing matrix scales the role envelope on `breaking_change.level × rollback_mode` — a *blast-radius* axis. 
The callout names a second axis the matrix deliberately does not encode: how far the task sits from what the current model class does reliably solo (the *capability-frontier* axis). At equivalent blast-radius, a low-L mode-1 change at the capability frontier (unfamiliar SoT pattern, multi-step task at the edge of long-context coherence, model-class-new task domain) earns more Reviewer attention and benefits more from a registered specialist than the same risk profile applied to a well-trodden change shape. Conversely, a change shape the model has executed reliably across many prior changes can be reviewed with the baseline envelope even when the matrix's risk-axis row would technically permit more. The matrix's *additional gating* column is a **floor, not a ceiling** — capability-frontier signals (Discovery-loop frequency on similar prior changes, novel SoT pattern, model-class-new task domain) can motivate raising envelope strictness above the floor, recorded as an `escalations[*]` entry naming the capability-frontier rationale rather than a risk-axis trigger. The risk-axis is the encoded mechanical enforcement boundary; the capability-frontier axis is the human / Planner judgement signal that lives alongside it. Re-evaluation of where the capability frontier sits is a `harness-evolution-discipline.md` concern; per-change sensitivity to it is a Planner concern. Closes the gap where adopters reading the matrix in isolation might infer the role envelope's strictness was a function of risk only — not so; both axes apply. + +- **`docs/mechanical-enforcement-discipline.md` — new `§Boundary with non-mechanical evaluation` section + Planner-side allocation rule + cross-references from `multi-agent-handoff.md §Reviewer`.** Inserted after `§How much enforcement is right` (which already discusses the boundary in the negative — "Reviewer effort going to mechanical issues" as an under-enforcement signal). 
The new section names the boundary in the **positive**: what each evaluator type is *for*, and how the three layer together. Names the **three evaluator types** the methodology relies on as a single coherent stack — **Mechanical** (this doc; pass/fail predicates over source / structure / artifact shape; cheapest, fires on every event, uniform coverage), **Application-driven** (`cross-cutting-concerns.md §Application-driven verification` + `application-driven-loop.md`; runtime behaviour evidence — page actually renders, API actually returns the shape, log entry actually appears in the deployed stack; per-check cost; mandatory on user / operational surfaces above L2 per `autonomy-ladder-discipline.md`), **Agentic Reviewer audit** (`multi-agent-handoff.md §Reviewer`; cross-cutting concerns / breaking-change classification / rollback-mode appropriateness / surface coverage / claim substantiation; per-change cost, bounded by §Anti-rationalization rules and `review-loop-pattern.md` iteration cap). Layering: mechanical is the floor, application-driven is the bridge (mechanical execution + agentic-or-mechanical interpretation — the locus where the two meet), agentic Reviewer audit is the ceiling. **Allocation rule (Planner-side, Phase 3):** each acceptance criterion is allocated to the cheapest evaluator that catches its failure shape — pass/fail predicates → mechanical; runtime-behaviour evidence → application-driven; judgement-heavy / cross-cutting → agentic Reviewer audit. The allocation is per-AC, not per-change; a change with all-mechanical evidence is under-evaluated if it touches cross-cutting concerns; a change with all-agentic evidence is over-evaluated if checks were achievable mechanically. Parallel to the new AC-as-Sprint-Contract discipline (above): same time axis, complementary question (Sprint Contract asks "is this AC pre-verifiable?"; allocation rule asks "by which evaluator?"). 
**New anti-pattern:** *routing by familiarity rather than by failure shape* — mechanical-heavy team writes lint rules for everything (high-noise, gets bypassed); agentic-heavy team relies on Reviewer for every dimension (Reviewer attention saturates on lint-level noise). Detection: `review_notes` repeatedly surfacing "lint should have caught this" or "every iteration of this audit catches the same shape." Companion edit: `multi-agent-handoff.md §Reviewer §Must not do` gains a row "Spend audit attention on what a mechanical check should have caught" pointing back at the new section. `§Relationship to other documents` of `mechanical-enforcement-discipline.md` gains two rows naming `multi-agent-handoff.md §Reviewer` (agentic-evaluator counterpart) and `cross-cutting-concerns.md §Application-driven verification` (bridge-evaluator counterpart). Closes the gap where the methodology had three evaluator surfaces — mechanical / application-driven / agentic — operating in parallel without a single canonical comparison or allocation rule, leaving Planners to allocate AC rows by familiarity and Reviewers to absorb mechanical-shaped findings the methodology had no positive guidance against. + ### Changed - **`skills/engineering-workflow/references/cluster-parallelism.md §The core rule §1` — extended to name runtime-enforcement scope.** Previously only declared "no two clusters declare `scope_files` patterns that could resolve to the same file." Now adds: "Declaration alone does not prevent races on files outside `scope_files` (lockfiles, build caches, `.git/index`) or on shared manifest array fields; runtime enforcement of the declaration lives in §Runtime isolation below." This is a normative scope expansion — invariant 1 now binds on both declaration and runtime enforcement, surfacing the operational gap the new section closes. The §Why each invariant is needed table also gains a row distinguishing the two enforcement layers. 
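The two enforcement layers that invariant 1 now names can be sketched as a manifest fragment. Only `implementation_clusters`, `scope_files`, and `isolation_root` are fields named by these changes; every concrete value below is invented for illustration:

```yaml
# Illustrative two-cluster declaration (values hypothetical, not normative).
implementation_clusters:
  - id: cluster-api                           # hypothetical cluster name
    scope_files: ["services/api/**"]          # declaration layer: patterns must not overlap
    isolation_root: ".worktrees/cluster-api"  # runtime layer: exclusive write region
  - id: cluster-docs
    scope_files: ["docs/site/**"]
    isolation_root: ".worktrees/cluster-docs"
# Declaration alone does not prevent races on files outside scope_files
# (lockfiles, build caches, .git/index); runtime enforcement, auditable via
# isolation_root when declared, covers that gap.
```

When `isolation_root` is absent, the runtime-isolation property still binds but cannot be audited from the manifest alone.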
diff --git a/docs/README.md b/docs/README.md index 2603909..c7bcd56 100644 --- a/docs/README.md +++ b/docs/README.md @@ -75,6 +75,7 @@ These activate when the change touches a specific dimension. Skip the ones that | [`playtest-discipline.md`](playtest-discipline.md) | Game / interactive / experience-driven changes need playtest evidence. | | [`post-delivery-observation.md`](post-delivery-observation.md) | Phase 8 observation: production findings, metrics, continuous-evidence. | | [`anti-entropy-discipline.md`](anti-entropy-discipline.md) | Time-driven garbage-collection sweeps for accumulated drift (stale CCKNs, expired deprecations, doc-reference rot). Complements the edit-boundary mechanical enforcement and the delivery-event Phase 8 observation. | +| [`harness-evolution-discipline.md`](harness-evolution-discipline.md) | Per-material-model-release re-evaluation of canonical methodology components whose load-bearing-ness depends on a specific model-class failure mode (§11 verbal-completion illusion, §12 Context anxiety, §Pre-handoff self-check, §Anti-rationalization rules, Compaction algorithm, etc.). Three outcomes per component: re-justify / retire-eligible / new-failure-surfaced. Sibling to `anti-entropy-discipline.md` (which excludes canonical methodology); together they let the methodology shed weight as well as accumulate it. | | [`adoption-strategy.md`](adoption-strategy.md) | Rolling out the methodology to a team that isn't using it yet. | | [`adoption-anti-metrics.md`](adoption-anti-metrics.md) | How to recognize fake adoption (compliance theatre). | | [`ci-cd-integration-hooks.md`](ci-cd-integration-hooks.md) | Wiring methodology gates into CI/CD pipelines. | diff --git a/docs/ai-operating-contract.md b/docs/ai-operating-contract.md index 228ff42..ddcd0e4 100644 --- a/docs/ai-operating-contract.md +++ b/docs/ai-operating-contract.md @@ -365,6 +365,68 @@ At each of these, couple the narration (if any) with the tool call in the same t --- +## 12. 
Context anxiety — premature task-shrink under perceived context pressure + +### Why this rule exists + +A long-running session can drift into a failure shape distinct from §4 (context hygiene) and §11 (verbal-completion illusion): the model **perceives** that it is approaching its context window limit and begins wrapping up the change early — truncating remaining tasks, skipping planned evidence collection, summarising instead of executing, declaring "done" when the work is structurally incomplete. The perception may or may not be accurate; sometimes the session has plenty of context left. From the user's side this looks like the agent gave up halfway through a planned task without raising an escalation. The agent did not crash, did not refuse, did not run out of permission — it **prematurely closed the work** because it expected to run out of room. + +This is **context anxiety**: premature task closure under unverified pressure perception. Naming it separately matters because the remedy differs from the adjacent failures — §4 is about facts not written to files; §11 is about a single tool call missed at an action transition; §12 is about the tail of a multi-task plan collapsing into a wrap-up narrative. Misdiagnosing one as another routes the agent into the wrong corrective. + +### Symptoms + +- The plan had **N** tasks; only the first **M** ≪ **N** were attempted, with the remainder narrated as "follow-ups" without a follow-up artifact. +- Output tone shifts from execution mode ("running tests…") to wrap-up mode ("the change is complete; remaining items can be addressed later") without an explicit escalation marker. +- Evidence rows that were planned to be collected disappear from `evidence_plan.artifacts` rather than acquiring `status: collected`. +- A multi-phase change collapses Phase 4 / 5 / 6 narration into one self-declared "delivered" claim. +- Recovery via Discovery Loop or §5 active escalation is **bypassed in favour of self-truncation**. 
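The evidence-row symptom can be made concrete with a hypothetical `evidence_plan` fragment; field names other than `evidence_plan.artifacts` and `status` are illustrative:

```yaml
# Healthy tail of a session: planned rows acquire a status; they do not vanish.
evidence_plan:
  artifacts:
    - id: api-shape-check      # hypothetical row
      status: collected        # executed as planned
    - id: log-entry-check      # hypothetical row
      status: planned          # still owed: execute it or escalate via §5;
                               # silently deleting this row is the symptom
```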
+ +### Must do + +- When you sense context pressure, **verify before truncating**: estimate remaining tokens against the task list (the `resumption-protocol.md §Step 2b` 30% rule applies symmetrically to outgoing sessions). If pressure is real, route through §5 escalation or produce an outgoing handoff prompt per `skills/engineering-workflow/templates/handoff-prompt-template.md`; if pressure is perceptual, continue executing. +- If context will not fit the remaining work, **declare the truncation explicitly** as a §5 escalation (or, when a methodology-specific path applies — Discovery Loop, Phase Re-entry — route through that). The truncation must be a named act, not a silent omission. +- Prefer **Context Reset over Compaction** when a long-running session is approaching its window limit: a fresh session reading a Manifest-backed handoff carries the work forward without the perception that triggered the anxiety; in-place compaction shrinks the conversation but preserves the same session's pressure perception, so the anxiety re-fires on the next stretch of work. + +### Must not do + +- Reduce remaining task scope **without naming the change** in `scope_deltas` or §5 escalation. +- Substitute "the rest is straightforward, leaving as a follow-up" for actual execution when the plan committed to executing — a real follow-up needs an artifact (a new manifest, a tracked task, a `next_action` pointer), not a narrative gesture. +- Treat the conversation's tail as an external gate ("the user will run out of patience, so I'll wrap up here"). Conversation length is not a phase boundary. +- Use **compaction inside an anxious session as the remedy** — compaction summarises the past but does not lift the perception that caused the anxiety; the same session continues with the same perceived pressure and re-fires the anxiety on the next stretch. The remedy that addresses the cause is Context Reset. 
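A minimal sketch of what a declared truncation could look like, assuming illustrative values (only `scope_deltas` and `next_action` are field names the methodology already uses; the rest is hypothetical):

```yaml
# Hypothetical declared truncation: pressure was verified as real first.
scope_deltas:
  - change: "Tasks 4-6 deferred to a follow-up session"    # named, not silent
    reason: "Remaining work exceeds verified context budget (Step 2b check)"
    routed_via: "§5 escalation + outgoing handoff prompt"
next_action: "Resume from handoff prompt; tasks 4-6 pending"  # the follow-up artifact
```

The point of the sketch is that the follow-up has an artifact (a `next_action` pointer and a named scope delta), not a narrative gesture.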
+ +### Distinguishing from adjacent failures + +| Failure mode | Where it strikes | Visible cause | Remedy | +|---|---|---|---| +| **§4 Context hygiene failure** | Across sessions; compression discards facts not written to files | A compression event | Write key decisions to files proactively | +| **§11 Verbal-completion illusion** | Action-transition points (after task-list updates, plan approval, phase opens) | None — model chose `end_turn` instead of the tool call | Couple narration with the tool call in the same turn | +| **§12 Context anxiety (this rule)** | Tail of long single-session work; remaining tasks shrink prematurely | Perception of approaching context limit (may or may not be real) | Verify before truncating; if real, declare truncation as escalation; prefer Context Reset over Compaction | +| **§3 Evidence-before-completion failure** | Pre-handoff "I'm done" claim without evidence | Confusion of self-claim with verification | Collect evidence; cite paths; pass §10 self-check | + +The four failure modes share a surface symptom (work declared done that is not done), but the **upstream cause and the corrective action differ**. Misdiagnosing context anxiety as verbal-completion illusion (or vice versa) routes the agent into the wrong remedy: §11's couple-narration-with-tool-call rule does not lift §12's perceived pressure, and §12's prefer-Context-Reset rule does not address §11's missing tool call. + +### Risk-point inventory + +Agents observing themselves should treat the following moments as high-risk for context anxiety: + +- The narrative shift from "executing task **N** of **M**" to "the major work is done" while **N** < **M**. +- The tail of an intensive tool-use stretch (many code searches, long file reads) just before transitioning into a wrap-up sentence. +- Any moment when the agent considers "I'll leave this as a follow-up" without naming a follow-up artifact (manifest, task, `next_action`). 
+- Mid-execution moments when the agent counts remaining tokens or estimates window fill — the act of counting is a risk point because the count's interpretation drives the next action. +- After a long stretch of evidence collection, when the next planned task requires another long stretch — the temptation is to declare the first stretch sufficient. +- The tail of a long-running delegation D2-progress write — the canonical role's instinct may be to synthesise rather than continue iterating. + +### Relation to other sections + +- §4 (Context hygiene) is the **cross-session** form: facts not written to files are lost. §12 is the **intra-session** form: tasks not executed are lost. The two are independent — a session can fail at §4 (lose facts to compression) while passing §12 (still execute every planned task), and vice versa. +- §11 (Verbal-completion illusion) is the sibling intra-session failure: both are silent task-shrink, but §11 strikes at *action-transition points* (one tool call short) while §12 strikes at *the tail of long work* (many tasks short). +- §5 (Active escalation) is the legitimate exit valve. Context anxiety routes around §5 by declaring done; the corrective is to route through §5 instead. +- `skills/engineering-workflow/references/resumption-protocol.md §Step 2b context-budget rule` is the **incoming-session counterpart**: incoming sessions estimate before reading; §12 is the **outgoing-session symmetric rule** — outgoing sessions estimate before declaring done. +- `skills/engineering-workflow/references/long-running-delegation.md` D1 (checkpoint-bounded) and D2 (artifact-grounded progress) prevent §12 from surfacing in the first place: a delegation that writes progress at every iteration cannot collapse into a final wrap-up because the iterations themselves are the audit trail. 
+- `docs/ai-project-memory.md §Pre-compression protection list` covers what to *rescue* on imminent compression; §12 covers what *not to do* when the agent merely *anticipates* compression. The protection list is the first-aid kit; §12 is the rule against premature triage. + +--- + ## Rejected patterns This methodology has positive rules (what to do) and negative rules (what not to do, embedded in §1–§11 above). A small set of patterns recur from adjacent AI-assistance frameworks but are **explicitly rejected** in this contract — they appear plausible at first glance but conflict with one or more of the core rules above. Listing them here makes the rejection auditable and prevents accidental adoption when patterns drift in from other tools. diff --git a/docs/ai-project-memory.md b/docs/ai-project-memory.md index 44df7ee..0e76e97 100644 --- a/docs/ai-project-memory.md +++ b/docs/ai-project-memory.md @@ -231,6 +231,8 @@ When the AI detects an imminent context compression (nearing window limits), res Protection method: **write to files immediately** (if not already written), or **restate near the tail of the conversation** so it escapes the compression zone. +**This list is the rescue protocol, not the trigger condition.** It tells you *what to save* when compression is imminent; it does not authorise *premature task closure* on the perception that compression is approaching. The latter is `docs/ai-operating-contract.md §12 Context anxiety` — declaring work done early because the agent expects to run out of room is a distinct failure mode from losing facts to actual compression. Run this list when compression is real; route through §5 active escalation when remaining work no longer fits. 
+ --- ## Cross-session resumption diff --git a/docs/anti-entropy-discipline.md b/docs/anti-entropy-discipline.md index ea2cc04..5c0f934 100644 --- a/docs/anti-entropy-discipline.md +++ b/docs/anti-entropy-discipline.md @@ -161,3 +161,4 @@ Anti-entropy sweeps are themselves changes — they go through Phase 0 → Phase - [`docs/autonomy-ladder-discipline.md`](autonomy-ladder-discipline.md) §Anti-patterns — defines the rung-claim and rung-skipping anti-patterns; the Rung-claim-evidence sweep above is the time-driven detector for the former (the latter is caught at the change boundary, not on calendar). - [`skills/engineering-workflow/references/mode-decision-tree.md`](../skills/engineering-workflow/references/mode-decision-tree.md) §Scenarios that force Lean — the asymmetric retirement-cost row that lets sweep-backed retirements drop to Lean mode while additions stay Full per L60. This is what makes the methodology able to *shed* weight rather than only accumulate; the Discipline-provenance sweep above produces the finding that the row consumes. - [`docs/glossary.md §Provenance drift`](glossary.md) — the term defined for the failure mode the Discipline-provenance sweep targets. +- [`docs/harness-evolution-discipline.md`](harness-evolution-discipline.md) — the **canonical-methodology counterpart**. Anti-entropy targets project-local discipline-provenance drift, calendar-driven, and explicitly excludes canonical methodology components ("origins by definition"). Harness-evolution targets canonical methodology components whose load-bearing-ness depends on a specific model-class failure mode, model-release-driven. Together the two cover both axes — project-local time-decay and canonical-methodology model-capability-decay — that would otherwise let the methodology accumulate ceremony monotonically. 
diff --git a/docs/file-role-map.md b/docs/file-role-map.md index 78c10d5..0af62c0 100644 --- a/docs/file-role-map.md +++ b/docs/file-role-map.md @@ -21,7 +21,7 @@ Without the map, the same rule drifts across `AGENTS.md`, `CLAUDE.md`, `GEMINI.m |---|---|---| | [`AGENTS.md`](../AGENTS.md) | SoT — operating contract (the 10 core rules) | Canonical; all runtimes inherit from here | | [`multi-agent-handoff.md`](multi-agent-handoff.md) | SoT — role contract (Planner / Implementer / Reviewer definitions, field-ownership matrix, tool-permission matrix, anti-collusion, handoff minima, Task Prompt structure) | Canonical for multi-agent discipline; `agents/`, `.cursor/rules/`, `reference-implementations/roles/` all point back here | -| `docs/*.md` (other) | SoT — topic-specific definitions ([`surfaces.md`](surfaces.md), [`source-of-truth-patterns.md`](source-of-truth-patterns.md), [`breaking-change-framework.md`](breaking-change-framework.md), [`rollback-asymmetry.md`](rollback-asymmetry.md), [`phase-gate-discipline.md`](phase-gate-discipline.md), [`ai-operating-contract.md`](ai-operating-contract.md), [`agent-persona-discipline.md`](agent-persona-discipline.md), [`output-craft-discipline.md`](output-craft-discipline.md), [`glossary.md`](glossary.md), [`phase-command-vocabulary.md`](phase-command-vocabulary.md), [`repo-as-context-discipline.md`](repo-as-context-discipline.md), [`mechanical-enforcement-discipline.md`](mechanical-enforcement-discipline.md), [`tool-design-principles.md`](tool-design-principles.md), [`anti-entropy-discipline.md`](anti-entropy-discipline.md), [`autonomy-ladder-discipline.md`](autonomy-ladder-discipline.md), [`observability-legibility-discipline.md`](observability-legibility-discipline.md), [`throughput-first-merge-philosophy.md`](throughput-first-merge-philosophy.md), …) | Canonical per topic; referenced from the contracts above | +| `docs/*.md` (other) | SoT — topic-specific definitions ([`surfaces.md`](surfaces.md), 
[`source-of-truth-patterns.md`](source-of-truth-patterns.md), [`breaking-change-framework.md`](breaking-change-framework.md), [`rollback-asymmetry.md`](rollback-asymmetry.md), [`phase-gate-discipline.md`](phase-gate-discipline.md), [`ai-operating-contract.md`](ai-operating-contract.md), [`agent-persona-discipline.md`](agent-persona-discipline.md), [`output-craft-discipline.md`](output-craft-discipline.md), [`glossary.md`](glossary.md), [`phase-command-vocabulary.md`](phase-command-vocabulary.md), [`repo-as-context-discipline.md`](repo-as-context-discipline.md), [`mechanical-enforcement-discipline.md`](mechanical-enforcement-discipline.md), [`tool-design-principles.md`](tool-design-principles.md), [`anti-entropy-discipline.md`](anti-entropy-discipline.md), [`harness-evolution-discipline.md`](harness-evolution-discipline.md), [`autonomy-ladder-discipline.md`](autonomy-ladder-discipline.md), [`observability-legibility-discipline.md`](observability-legibility-discipline.md), [`throughput-first-merge-philosophy.md`](throughput-first-merge-philosophy.md), …) | Canonical per topic; referenced from the contracts above | | [`skills/engineering-workflow/SKILL.md`](../skills/engineering-workflow/SKILL.md) + `skills/**` | SoT — execution layer (modes, phases, templates, references) | Canonical for workflow execution | | [`schemas/`](../schemas/) + [`skills/engineering-workflow/templates/manifests/`](../skills/engineering-workflow/templates/manifests/) | SoT — machine-readable Change Manifest contract + worked examples | Canonical structural output | | [`CLAUDE.md`](../CLAUDE.md) | Thin-bridge — Claude Code entry; points at [`AGENTS.md`](../AGENTS.md) + `skills/` | Onboarding only, no new normative content | diff --git a/docs/glossary.md b/docs/glossary.md index 7fde899..cac40b5 100644 --- a/docs/glossary.md +++ b/docs/glossary.md @@ -394,6 +394,22 @@ Code has merged but: From the user's or operator's perspective, the change is not complete. 
+### Context anxiety + +The intra-session failure mode in which a long-running agent **perceives** it is approaching its context window limit and prematurely truncates the remaining work — task scope shrinks, planned evidence is dropped, the session ends with a wrap-up narrative instead of executing the rest of the plan. The perception may or may not be accurate; the failure is the agent acting on the perception **without verifying** and **without declaring the truncation as a scope change or escalation**. + +Distinguished from adjacent failures by where it strikes and what causes it: + +| | Strike point | Cause | Remedy | +|---|---|---|---| +| `False completion` (above) | Pre-handoff "done" claim | Self-claim mistaken for verified result | Collect evidence; pass §10 self-check | +| `Context anxiety` (this entry) | Tail of long single-session work | Perception of approaching context limit | Verify before truncating; declare as escalation if real; prefer Context Reset over Compaction | +| Verbal-completion illusion (`ai-operating-contract.md` §11) | Action-transition points | Model chose `end_turn` instead of tool call | Couple narration with the tool call in same turn | + +The remedy explicitly **prefers Context Reset (fresh session reading a Manifest-backed handoff) over Compaction (in-place summarisation)** when a long session is approaching its window limit: compaction shrinks the conversation but preserves the same session's pressure perception, so the anxiety re-fires on the next stretch; a fresh session does not carry the perception forward. + +Canonical reference: `docs/ai-operating-contract.md §12`. Incoming-session counterpart (estimate before reading): `skills/engineering-workflow/references/resumption-protocol.md §Step 2b`. Pre-compression protection list (what to rescue on imminent compression): `docs/ai-project-memory.md §Pre-compression protection list`. 
+ --- ## Evidence and quality vocabulary diff --git a/docs/harness-evolution-discipline.md b/docs/harness-evolution-discipline.md new file mode 100644 index 0000000..bf3640f --- /dev/null +++ b/docs/harness-evolution-discipline.md @@ -0,0 +1,132 @@ +# Harness Evolution Discipline + +> **English TL;DR** +> Every component of this methodology — every gate, every required artifact, every role-separation rule, every self-check question — was authored against a specific failure mode observable in the model class it was written for. Some of those failure modes weaken or move as model capability changes. This discipline names the obligation to **re-evaluate canonical methodology components per material model release** rather than treating the methodology as a static deposit. Distinct from [`docs/anti-entropy-discipline.md`](anti-entropy-discipline.md) (project-local discipline-provenance drift) and from [`skills/engineering-workflow/references/mode-decision-tree.md`](../skills/engineering-workflow/references/mode-decision-tree.md) §Mode upgrade / downgrade (per-change mode shift) — this is a per-model-release sweep targeting **canonical methodology** components. + +--- + +## Why this discipline exists + +The methodology in `docs/` and `skills/engineering-workflow/` is built from many small components, each shaped by a specific failure observed in the model class it was authored against: + +- The §Single-agent anti-collusion rule ([`multi-agent-handoff.md`](multi-agent-handoff.md)) exists because models tend to praise their own work when asked to evaluate it. +- The Pre-handoff self-check questions exist because models declare "done" without evidence. +- The Compaction algorithm ([`change-manifest-spec.md`](change-manifest-spec.md) §Compaction algorithm) exists because session state spills over context windows. +- The §11 verbal-completion-illusion rule ([`ai-operating-contract.md`](ai-operating-contract.md)) exists because models silently end turns at action transitions. 
+- The §12 Context anxiety rule ([`ai-operating-contract.md`](ai-operating-contract.md)) exists because models prematurely truncate work under perceived context pressure. +- The §Anti-rationalization rules ([`multi-agent-handoff.md`](multi-agent-handoff.md)) exist because Reviewer agents talk themselves into approval after finding issues. + +Each component **encodes an assumption** about what the model cannot do reliably without scaffolding. Those assumptions are not permanent. As models improve, three outcomes are possible per component: + +- **Tighter** — the failure happens less often but still happens; the discipline still earns its place, with the rationale updated. +- **Outdated** — the failure no longer surfaces under representative use; the discipline shifts from preventive to vestigial; retiring it lets the methodology shed weight. +- **Moved** — a new failure mode emerges that no existing discipline addresses; a new component is needed. + +Without a discipline that periodically re-evaluates which assumptions still bind, the methodology accumulates ceremony monotonically — every observed incident motivates a new safeguard, and no symmetric pressure removes obsolete ones. [`anti-entropy-discipline.md`](anti-entropy-discipline.md) §Rule 3 addresses this for *project-local* disciplines via the `Discipline-provenance sweep`, but explicitly excludes canonical methodology components ("origins by definition"). This document names the sweep that **does** apply to canonical methodology components: a per-model-release re-evaluation. 
+ +--- + +## What this discipline is and is not + +| Dimension | This discipline (Harness evolution) | Adjacent disciplines | +|---|---|---| +| **Targets** | Canonical methodology components in `docs/` and `skills/engineering-workflow/` whose load-bearing-ness depends on a specific model-class failure mode | Project-local disciplines added by adopting teams ([`anti-entropy-discipline.md`](anti-entropy-discipline.md) Discipline-provenance sweep). Per-change mode escalation ([`mode-decision-tree.md`](../skills/engineering-workflow/references/mode-decision-tree.md) §Mode upgrade / downgrade) | +| **Trigger** | A **material model release** — a model-class change with measurable capability shifts on agent-relevant axes (instruction-following, long-context retrieval, coherence on multi-step tasks, evidence discipline, tool-use precision) | Calendar cadence (sweep classes in [`anti-entropy-discipline.md`](anti-entropy-discipline.md) §Cadence). Per-change discovery signals (`mode-decision-tree.md` §Mode upgrade) | +| **Cadence** | Per material model release of any model class the methodology supports as a runtime target. Patch / minor model updates do not trigger; class transitions do | Per minor version (sweep classes); per change (mode shift) | +| **Scope of edit** | Re-justification, retirement, or addition of canonical methodology components | Project-local discipline retirement; per-change mode shift | +| **Output shape** | Single bundled change (or small set of related changes) recording the sweep, the components evaluated, and the per-component outcome | Individual per-finding cleanup PRs (anti-entropy); a single per-change mode declaration (mode-decision-tree) | + +The discipline does **not** target the operating contract's evergreen rules. 
§1 honest reporting, §2 scope discipline, §3 SoT before consumers, §5 active escalation, §7 multi-agent role separation — these are not assumptions about a specific model class; they are properties of any AI-assisted engineering system regardless of the model. The targets are components shaped by **specific failure modes** observable in **specific model classes**. + +--- + +## The procedure + +Per material model release of a model the methodology bridges to: + +### Step 1 — Map the release's measured capability shifts + +Read the release's published capability claims and the methodology's existing failure-mode list (the `Why this rule exists` paragraphs in `ai-operating-contract.md` §11 / §12, the rationale in §Pre-handoff self-check, §Anti-rationalization rules, §Single-agent anti-collusion rule, etc.). For each capability axis named in the release, ask: *which existing component was authored against this exact failure?* + +The mapping is the input for Step 2; do not act on it directly. A claimed shift does not by itself motivate a methodology edit — only an *empirical re-test* showing the failure mode no longer surfaces (or moved) does. + +### Step 2 — Empirically re-test the targeted components + +Run a small set of representative changes through the methodology with the new model class. The set should include at least: + +- One **Lean-mode** change (single surface, ≤5-minute verification). +- One **Full-mode** change with multi-surface scope. +- One task at the model's current **capability frontier** — the kind of task that would have previously required heavy scaffolding to complete reliably. + +For each component identified in Step 1, record: + +- Where the component **fired and added value** (assumption still binds). +- Where the component **fired but found nothing** (assumption may no longer bind under representative use; flag for closer look). +- Where a **new failure surfaced** that no component caught (a candidate for a new component). 
+ +Empirical re-testing is the load-bearing step. A capability claim is a hypothesis about agent behaviour; the re-test is the evidence. Skipping Step 2 and acting on Step 1 alone is the *"the new model is much better, drop the scaffolding"* anti-pattern below. + +### Step 3 — Classify each evaluated component + +Each evaluated component falls into exactly one of three outcomes: + +- **Re-justified.** Component still fires; the assumption it encodes is still load-bearing on the new model class. The component's `Why this rule exists` paragraph (or equivalent rationale section) gains a one-line addition naming the model class against which the assumption was re-validated. +- **Retire-eligible.** Component no longer fires under representative use. Open a sweep-backed retirement change per [`skills/engineering-workflow/references/mode-decision-tree.md`](../skills/engineering-workflow/references/mode-decision-tree.md) §Scenarios that force Lean (the row covering harness-evolution-sweep-backed retirements). The retirement is sweep-backed (this discipline's evidence) and therefore eligible for Lean rather than Full mode — the same asymmetric-cost lever the `Discipline-provenance sweep` exposes for project-local disciplines, applied here to canonical methodology. +- **New-failure-surfaced.** The empirical re-test revealed a failure mode no existing component addresses. Open a Full-mode change to add a new component, per the standard L1+ canonical-methodology-edit path. The new component carries a rationale paragraph naming the model class the failure was observed on, so future sweeps can re-evaluate it the same way. + +A component the sweep did not exercise in Step 2 is **not classified** — leave it untouched. The sweep is bounded by what it actually tested; an un-tested component does not silently move into "re-justified" by default. 
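The three outcomes might be captured per component in a sweep record shaped roughly like this (the YAML shape and every field name are hypothetical; the normative record is the per-component rationale edit plus the CHANGELOG entry):

```yaml
# Hypothetical per-component sweep record (shape illustrative only).
sweep:
  model_class: "<model class>"        # the release the sweep ran against
  components:
    - id: "ai-operating-contract.md §11"
      outcome: re-justified           # fired and added value in Step 2
    - id: "change-manifest-spec.md §Compaction algorithm"
      outcome: retire-eligible        # fired but found nothing; sweep-backed Lean retirement
    - id: "(new component candidate)"
      outcome: new-failure-surfaced   # Full-mode change to add a component
  untested: "left unclassified"       # the sweep is bounded by what it actually tested
```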
### Step 4 — Record the sweep

Each component touched by the sweep gets a one-line addition to its rationale ("re-justified per harness-evolution sweep against `<model class>`", or the equivalent for retired / added components). The sweep itself is recorded as a single CHANGELOG entry that names:

- The model class the sweep was run against.
- The set of components evaluated.
- The per-component outcome (re-justified / retired / added).
- A pointer to the empirical re-test artifacts (the changes run, the observed behaviours).

Per-component edits cite the sweep entry; the sweep entry is the audit trail.

---

## Concrete examples (illustrative shape, not normative)

The following examples demonstrate the discipline's question shape. **The examples below are hypothetical** — no specific model is named or evaluated; readers running the discipline against a concrete release will produce the real evaluations.

- **Pre-handoff self-check Q2 (reference-existence verification, [`multi-agent-handoff.md`](multi-agent-handoff.md) §Pre-handoff self-check).** Authored against models that fabricated function names. *Re-evaluation question:* under the new model class, does Q2 still find non-existent identifiers at a non-trivial rate, or has the failure dropped below the noise floor?
- **§12 Context anxiety ([`ai-operating-contract.md`](ai-operating-contract.md)).** Authored against models that prematurely truncate work under perceived context pressure. *Re-evaluation question:* does the new model class still exhibit premature task closure on long sessions, or has the failure mode moved (e.g. shifted into a different failure shape such as silent scope expansion)?
- **§Anti-rationalization rules ([`multi-agent-handoff.md`](multi-agent-handoff.md) §Reviewer).** Authored against Reviewer agents that talk themselves into approval after finding issues. *Re-evaluation question:* still load-bearing, less load-bearing, or has a new rationalization pattern emerged that the four rules do not catch?
- **Compaction algorithm ([`change-manifest-spec.md`](change-manifest-spec.md) §Compaction algorithm).** Authored against models whose Manifest growth rate exceeded their context budget on long changes. *Re-evaluation question:* with the new model class's larger / more efficient context handling, does the compaction algorithm still fire on representative Full-mode changes, or has its threshold moved upward?

The point is not the examples themselves but the **shape of the question**: every component carries an implicit assumption; the discipline asks whether the assumption still holds.

---

## Anti-patterns

- *"The methodology has worked for years; no need to re-evaluate."* The methodology has worked because it was authored against the failure modes of a specific model class. Model classes change; the methodology's grip can loosen silently as failures it was designed to catch drop below the noise floor and as new failures surface that it does not catch.
- *"The new model is much better; drop most of the scaffolding."* Average capability and tail-failure rate are different axes. A model can be much better on average and still exhibit the same long-tail failures the discipline was authored against. Drop a component only when re-evaluation **shows** the failure mode no longer surfaces under representative use, not when general capability "feels" higher.
- *"Add new components for every interesting capability claim."* New capabilities do not by themselves motivate new disciplines; new *failure modes* do. A claim of "better instruction-following" does not motivate a new component; an observed new failure shape does. Symmetrically, the absence of a capability claim about a particular failure mode does not by itself motivate retirement — empirical Step 2 evidence does.
- *"Run the sweep on every patch release."* Patch / minor model updates do not motivate sweeps. The cadence is **material model release** — class transitions with measurable agent-axis shifts. Running the sweep on every minor release wastes effort and trains the methodology against noise; the methodology starts churning to chase capability micro-shifts instead of binding on stable failure shapes.
- *"Treat the sweep as additive only."* The discipline's whole point is to surface retirements as well as additions. A sweep that produces only "re-justified" outcomes across many components on a major model release suggests Step 2 was performed shallowly (or skipped) — not that the methodology happened to be perfectly calibrated. The asymmetric-cost lever exists precisely to make retirement cheaper than addition; failing to use it is itself a discipline failure.
- *"Bundle the sweep findings into the next big methodology change."* Sweep findings should be recorded promptly. Bundling them with unrelated work mixes the model-class evidence trail with other rationales; the next sweep cannot distinguish "this rule changed because the model improved" from "this rule changed because we restructured the section."

---

## Cadence

- **Trigger:** material model release of any model class the methodology supports as a runtime target.
- **Window:** the sweep should run within ~30 days of release. Past that, behaviour observations from the new model class are already shaping per-change disciplines silently; the sweep loses its evidentiary clarity.
- **Owner:** repo maintainer (or whoever holds the equivalent authority for canonical methodology edits). The sweep is **not** a per-adopter responsibility — adopting teams running the methodology against a new model are consumers of the sweep's output, not its drivers.
- **Output deadline:** a sweep with no recorded outcome (no CHANGELOG entry, no per-component rationale update, no retirement / addition PRs) within the window is itself a sign the sweep did not occur. The window doubles as a forcing function on actually doing the work, not just intending to.

---

## Relationship to other documents

- [`docs/anti-entropy-discipline.md`](anti-entropy-discipline.md) §Rule 3 — sweep classes targeting *project-local* discipline-provenance drift. This discipline targets *canonical-methodology* model-capability drift. The two are complementary: anti-entropy asks "is this project-local discipline still justifiable?"; harness-evolution asks "is this canonical component still load-bearing on the current model class?". A project running both gets both kinds of weight-shedding.
- [`skills/engineering-workflow/references/mode-decision-tree.md`](../skills/engineering-workflow/references/mode-decision-tree.md) §Scenarios that force Lean — the sweep-backed Lean-mode retirement path. A retirement opened by this discipline rides the same asymmetric-cost lever already used by anti-entropy. Adding a new component still routes through the canonical-methodology-content row (L1+ → Full).
- [`docs/ai-operating-contract.md`](ai-operating-contract.md) §1–§10 — the evergreen rules this discipline does not target. Re-evaluation focuses on §11, §12, the §Anti-rationalization rules, the §Pre-handoff self-check questions, the Compaction algorithm, and similar capability-class-shaped components.
- [`docs/file-role-map.md`](file-role-map.md) — the index of canonical SoT files; the discipline's targets are exactly the rules whose homes are listed in that map. A new component added by the sweep gets a row in the map; a retired component's row is removed.
- [`CLAUDE.md`](../CLAUDE.md) §5 — the cross-cutting term update obligation. A re-justified / retired / added component triggers the same propagation discipline already required: every consumer of the affected rule must update in the same change. The sweep does not relax this.
- [`docs/repo-as-context-discipline.md`](repo-as-context-discipline.md) — the principle that drives the methodology's fundamental shape (anything an agent cannot reach in-context does not exist). This discipline preserves that principle's load-bearing-ness across model releases: as models shift, *what* must be reachable may shift, but *that* it must be reachable does not.

diff --git a/docs/mechanical-enforcement-discipline.md b/docs/mechanical-enforcement-discipline.md
index 3abddb7..b3be9cb 100644
--- a/docs/mechanical-enforcement-discipline.md
+++ b/docs/mechanical-enforcement-discipline.md
@@ -118,6 +118,46 @@ The right answer is between these. Concretely:

---

## Boundary with non-mechanical evaluation

The three axes above describe what mechanical enforcement carries. Two evaluation kinds run alongside it that mechanical enforcement is structurally unsuited to perform; together they form the **three evaluator types** the methodology relies on. The previous section (*Reviewer effort going to mechanical issues* as an under-enforcement signal) names the boundary in the negative; this section names it in the positive — what each evaluator is *for*, and how the three layer.

### Three evaluator types

| Evaluator | Where its rules live | What it catches well | What it cannot catch | Cost profile |
|---|---|---|---|---|
| **Mechanical** (this doc) | `mechanical-enforcement-discipline.md` (this file); `runtime-hook-contract.md`; `automation-contract.md` | Architecture invariants, taste invariants, doc freshness — anything expressible as a pass/fail predicate over source / structure / artifact shape | Subjective quality, cross-cutting reasoning, runtime behaviour an artifact-shape check cannot reach | Cheapest. Fires on every event; uniform coverage; deterministic; produces durable check-log evidence |
| **Application-driven** | [`docs/cross-cutting-concerns.md §Application-driven verification`](cross-cutting-concerns.md); [`skills/engineering-workflow/references/application-driven-loop.md`](../skills/engineering-workflow/references/application-driven-loop.md) | Runtime behaviour: the page actually renders, the API actually returns the correct shape under load, the log entry actually appears in the deployed stack, the metric actually moves | Pure structural / algorithmic correctness (better caught mechanically); judgment-heavy concerns (better caught agentically) | Per-check cost (running the application is not free); mandatory on user / operational surfaces above L2 per [`docs/autonomy-ladder-discipline.md`](autonomy-ladder-discipline.md) |
| **Agentic Reviewer audit** | [`docs/multi-agent-handoff.md §Reviewer`](multi-agent-handoff.md); [`docs/cross-cutting-concerns.md`](cross-cutting-concerns.md) (the six dimensions the Reviewer audits) | Cross-cutting concerns (security / performance / observability / testability / error-handling / build-time risk), breaking-change-level classification, rollback-mode appropriateness, surface-coverage assessment, claim substantiation | Drift in style / format / structure (would saturate Reviewer attention with mechanical work — the under-enforcement signal above); deterministic predicates (better caught mechanically) | Per-change cost; bounded by §Anti-rationalization rules in `multi-agent-handoff.md`; capped by `review-loop-pattern.md` iteration cap |

The three are **layers, not alternatives**. A well-tuned methodology applies all three:

- **Mechanical is the floor.** Fires on every event, costs almost nothing per check, catches the failures expressible as pass/fail. Every methodology-conformant repository has at least one mechanical check per axis.
- **Application-driven is the bridge.** When the change touches a surface where source-level evidence is insufficient (user surface, operational surface), the evidence is mandated to come from the running application. The check execution is mechanical (run the app, capture the artifact); the *interpretation* of the artifact may be mechanical (a structural test on the captured DOM / log line) or agentic (a Reviewer reading the captured behaviour against the AC). Application-driven verification is the locus where mechanical and agentic actually meet.
- **Agentic Reviewer audit is the ceiling.** Catches what the floor and the bridge cannot — concerns whose evaluation needs judgment. The Reviewer should never be doing what a mechanical check should have caught (the §How much enforcement is right *Reviewer effort going to mechanical issues* signal above); the Reviewer's attention is reserved for what only judgment can assess.

### Allocation rule (Planner-side, Phase 3)

When the Planner writes the `evidence_plan` rows in Phase 3 ([`docs/phase-gate-discipline.md`](phase-gate-discipline.md), [`docs/change-manifest-spec.md`](change-manifest-spec.md) §Verification plan), each acceptance criterion is allocated to the **cheapest evaluator that catches its failure shape**. The decision flow:

1. **If the AC's failure shape can be expressed as a pass/fail predicate over source, structure, or static artifact** — allocate to a **mechanical** evaluator (lint, structural test, schema validator, type check).
2. **Else if the AC's failure shape requires observing the running application** (UI behaviour, runtime contract, observability emission, deployed-artifact correctness) — allocate to **application-driven** verification per [`cross-cutting-concerns.md §Application-driven verification`](cross-cutting-concerns.md). Within the application-driven row, decide whether interpretation is mechanical (a deterministic structural assertion on the captured artifact) or agentic (the Reviewer reads the captured behaviour and judges it against the AC).
3. **Else (judgment-heavy, cross-cutting)** — allocate to the **agentic Reviewer audit**. The corresponding evidence row is one whose `tier` and `type` (per [`docs/evidence-quality-per-type.md`](evidence-quality-per-type.md)) anticipate Reviewer-side substantiation.

The allocation is **per-AC**, not per-change. A single change may have AC rows allocated across all three evaluators — that is the expected shape, not an outlier. A change whose `evidence_plan` rows are *all* mechanical is **under-evaluated** if the change touches cross-cutting concerns (the Reviewer's audit will surface findings the evidence_plan did not anticipate). A change whose evidence is *all* agentic is **over-evaluated** if the same checks were achievable mechanically (the Reviewer is paying attention to noise the floor should have caught).

This rule is parallel to the AC-as-Sprint-Contract discipline in [`docs/multi-agent-handoff.md §Acceptance criteria as a Sprint Contract`](multi-agent-handoff.md): same time axis (the Planner clears it before the Implementer starts), complementary axis. The Sprint Contract discipline asks "is each AC pre-verifiable?"; this allocation rule asks "by which evaluator?".

### Anti-pattern: routing by familiarity rather than by failure shape

Adopters often default to the evaluator they are most comfortable with. A mechanical-heavy team writes lint rules for everything, including subjective quality concerns the rules cannot actually express — the resulting checks are high-noise, get bypassed, and stop catching the failures they were meant to catch.
An agentic-heavy team relies on the Reviewer to catch every dimension, including deterministic ones that should have been mechanical — the Reviewer's attention saturates on lint-level findings and the cross-cutting concerns the audit exists for go uncaught. Both routings are by *familiarity*, not by *failure shape*.

The decision is per-AC (what shape does this failure take?), not per-team (what evaluator is the team comfortable with?). The cost of mis-routing is asymmetric: a mechanical check on a judgment-shaped AC produces a high-noise signal that gets bypassed (the mechanical floor weakens for everyone); an agentic check on a deterministic AC consumes Reviewer attention that should have been free for cross-cutting concerns (the agentic ceiling weakens). Misallocation in either direction degrades the layer it touches and the layers above it.

Detection: a Reviewer's `review_notes` repeatedly surfacing findings of the form *"lint should have caught this"* (signal of under-mechanical-enforcement) or *"every iteration of this audit catches the same shape"* (signal of over-agentic-allocation, where a mechanical rule would have prevented the iteration in the first place) is the visible form of this anti-pattern. Both detection signals are §How much enforcement is right symptoms reframed at the allocation layer.

---

## Anti-patterns

- *Reverse-axis enforcement.* A taste rule treated as architecture (block on every log-format slip) erodes credibility; an architecture rule treated as taste (warn on a layer-boundary violation) becomes a leak. The axis dictates the default severity.

@@ -147,3 +187,5 @@ The right answer is between these. Concretely:

- [`docs/adoption-anti-metrics.md`](adoption-anti-metrics.md) — the over-enforcement counter-pressure; Hook sprawl and Ceremony accumulation are the two anti-metrics this discipline most often risks tripping.
- [`docs/output-craft-discipline.md`](output-craft-discipline.md) — the output-side counterpart; mechanical-enforcement output is held to the same earn-its-place rule.
- [`docs/anti-entropy-discipline.md`](anti-entropy-discipline.md) — the time-axis counterpart; mechanical enforcement catches single-edit drift, anti-entropy catches accumulated drift across many edits.
- [`docs/multi-agent-handoff.md §Reviewer`](multi-agent-handoff.md) — the **agentic-evaluator counterpart** named in §Boundary with non-mechanical evaluation; together with this document and `cross-cutting-concerns.md §Application-driven verification`, the three form the layered evaluator stack the methodology relies on. The Reviewer-side back-pointer is the *Must not do — Spend audit attention on what a mechanical check should have caught* row in `multi-agent-handoff.md §Reviewer`.
- [`docs/cross-cutting-concerns.md §Application-driven verification`](cross-cutting-concerns.md) — the **bridge-evaluator counterpart**: when source-level evidence is insufficient (user / operational surfaces), evidence is mandated to come from the running application. The discipline lives there; this document names where it sits in the three-evaluator stack.

diff --git a/docs/multi-agent-handoff.md b/docs/multi-agent-handoff.md
index f933c36..9a84d08 100644
--- a/docs/multi-agent-handoff.md
+++ b/docs/multi-agent-handoff.md
@@ -69,6 +69,28 @@ The Task Prompt is **the Planner's output**, not a manifest field. It travels al

**Mode application.** In Lean mode the six columns collapse into the Lean-spec note's task + boundaries sections — the columns are still answered, just compactly. In Zero-ceremony mode the Planner ≡ Implementer collapse makes the Task Prompt implicit (the agent briefs itself); the columns remain the disciplined questions to answer before starting.
#### Acceptance criteria as a Sprint Contract

The acceptance-criteria column is the closest thing in the methodology to a sprint contract: a written, pre-handoff agreement on **what "done" looks like** that binds three downstream actors — the Implementer (who must satisfy it), the Reviewer (who must audit it), and the future-self / takeover session (who must resume it). Two disciplines follow from that contract role; both are Planner-side and apply at the time the AC is written, not at the time the Implementer ships.

**(1) Reviewer-anticipation rule (Planner → Reviewer direction).** Before handing the Task Prompt to the Implementer, the Planner asks itself: *"If the Reviewer audited this AC, what specifically would they look for?"* Each AC must answer that question with a concrete `file:line` + evidence-type pair **at the time of writing**, not at the time of Implementer egress. §Pre-handoff self-check Q1 (Implementer egress) is the **last line of defence**; the Reviewer-anticipation discipline catches the same failure **at the source**, before the Implementer has spent cycles producing work whose AC could not have been verified anyway. An AC for which the Planner cannot pre-answer "how would the Reviewer know this is met?" is an AC that will trigger a §Conflict resolution Tier-2 escalation later — surface the gap now, in the AC, rather than rediscover it after the diff exists.

This is **not** the Reviewer participating in Phase 2. The Reviewer enters at Phase 5 with the manifest in `phase: review`, as defined in §Reviewer below; their identity, capability envelope, and anti-collusion separation are unchanged. The Planner *imagines* the Reviewer's audit; the Reviewer remains a separate identity. Bringing a real Reviewer into Phase 2 would collapse the role separation (§Single-agent anti-collusion rule); imagining their audit while writing the AC is the Planner's discipline, not a new role interaction.
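The Reviewer-anticipation question can be smoke-tested mechanically at AC-writing time. A sketch under an assumed in-memory AC shape — `look_at` and `evidence_type` are hypothetical illustration fields, not manifest-schema fields:

```python
import re

def reviewer_can_audit(ac):
    """Return True if the AC already names a concrete audit target
    (file:line or path shape) and an evidence type, at writing time."""
    target = ac.get("look_at", "")
    named_target = re.fullmatch(r"[\w./-]+(:\d+(-\d+)?)?", target) is not None
    return named_target and bool(ac.get("evidence_type"))

ac_ok = {
    "text": "POST /orders rejects malformed payloads with 422 and a machine-readable error body",
    "look_at": "api/orders/handlers.py:88-120",
    "evidence_type": "structural_test",
}
ac_post_hoc = {"text": "the endpoint should feel robust"}  # no pre-named audit path
```

A failing check here is the "surface the gap now" signal: strengthen the AC before the Task Prompt ships.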
**(2) Reverse-shape rule (AC text → Implementer direction).** The wording of an AC steers what the Implementer optimises for, often in ways the Planner did not intend. An AC stated as "the endpoint returns within 50ms" pulls the Implementer toward latency tuning (caching, query rewrite); an AC stated as "the endpoint returns the correct shape under malformed input" pulls toward schema correctness and error handling. Both are valid; writing only one **silently de-prioritises the other** because the Implementer reads the AC text as the contract — what is not in the AC is not in the contract, regardless of what the Planner held privately as "obvious."

This is the Task Prompt analogue of [`docs/agent-persona-discipline.md`](agent-persona-discipline.md)'s observation that the medium of the output reverse-shapes the persona that produces it. Here: **AC text reverse-shapes the implementation choices the Implementer will make.** The Planner's obligation is to write AC that names every dimension the change cares about — correctness, performance, security, observability, accessibility — even when one dimension feels self-evident from context. Implicit dimensions get implicit work.

**Self-check before handoff.** Three questions, all answerable in writing:

1. *Pre-verifiability.* For each AC, can the Planner — before any code is written — name the specific `file:line` (or path-shape) and evidence-type that would substantiate it? An AC whose verification path can be named only after the work is done is a post-hoc AC, not a pre-handoff contract.
2. *Dimension coverage.* Does the AC set cover every dimension the change cares about explicitly, or are some dimensions left implicit and therefore at risk of being de-prioritised? An "obvious but unwritten" requirement is the same shape as the §Anti-patterns failure of *Conversation treated as SoT* — present in the Planner's head, absent from the artifact the Implementer reads.
3. *Verifiability symmetry.* If the Reviewer (or a takeover session) read only the AC, the manifest, and the diff — without seeing the conversation that produced them — would they know how to audit each criterion?

Any "no" or "unsure" answer is a signal to strengthen the AC **before** the Task Prompt ships, not after the Implementer hands off and the gap surfaces as a Discovery Loop or send-back.

**Mode application.** In Lean mode these three questions collapse into "can I cite where to look and what counts as proof?" applied to the Lean-spec's task / boundaries sections — same questions, less ceremony. In Zero-ceremony mode the Planner ≡ Implementer collapse means the same agent runs the discipline against itself; the questions remain the disciplined ones to answer before starting work.

### Implementer

**Responsibilities:**

@@ -117,6 +139,7 @@ Compaction history: this self-check originally had five questions (1.8.0 form).

- Write implementation code (if an issue is found, send back to the Implementer).
- Rewrite the Planner's or Implementer's fields (can only flag disagreement in `review_notes`).
- Spend audit attention on what a mechanical check should have caught (lint, format, type, doc-staleness, schema-shape). Such findings are an **under-enforcement signal upstream**, not Reviewer work — surface them in `review_notes` with a one-line note pointing at the missing mechanical check, then move on. The Reviewer's attention is reserved for what only judgment can assess; the boundary is defined in [`docs/mechanical-enforcement-discipline.md §Boundary with non-mechanical evaluation`](mechanical-enforcement-discipline.md).

#### Anti-rationalization rules

@@ -265,6 +288,8 @@ The tool-permission matrix above declares the *baseline* envelope per role. The

**The principle.** Lower-risk changes operate under the baseline envelope.
As risk rises, *more* of the role envelope's defaults convert from "may" to "must": evidence rows that were optional become required; reviewer identity that was advised becomes mandated; approvals that were AI-permissible become human-required. The schema's existing `escalations[*].trigger` enum already encodes the leaves (`rollback_mode_3`, `breaking_change_l3_or_l4`, `auth_pii_path_touched`, …); this matrix is the **routing layer** that names which trigger to raise based on the risk profile, so a Planner reading the matrix can resolve "what gating applies" deterministically without re-deriving it from first principles each time.

**Risk is one axis; capability frontier is another.** The matrix above scales the role envelope on `breaking_change.level × rollback_mode` — a *blast-radius* axis. A second axis exists in parallel that the matrix deliberately does not encode: how far the task sits from what the current model class does reliably solo (the *capability-frontier* axis). Reviewer stringency, fan-out width, and specialist invocation can scale on this axis even at low blast radius — a low-L, mode-1 change at the capability frontier (a domain the model has not previously demonstrated, an unfamiliar SoT pattern, a multi-step task at the edge of long-context coherence) earns more Reviewer attention and benefits more from a registered specialist than the same risk profile applied to a well-trodden change shape. Conversely, a change profile the model has executed reliably across many prior changes can be reviewed with the baseline envelope even when the matrix's risk-axis row would technically permit more.

The matrix's *additional gating* column is a floor, not a ceiling — capability-frontier signals (Discovery-loop frequency on similar prior changes, a novel SoT pattern, a model-class-new task domain) can motivate raising envelope strictness above the floor, recorded as an `escalations[*]` entry naming the capability-frontier rationale rather than a risk-axis trigger. The risk axis is the encoded, mechanical enforcement boundary; the capability-frontier axis is the human / Planner judgment signal that lives alongside it. Re-evaluation of where the capability frontier sits is a [`harness-evolution-discipline.md`](harness-evolution-discipline.md) concern; per-change sensitivity to it is a Planner concern.

**No new schema fields.** The matrix uses only the schema's existing fields:

- `breaking_change.level` (already required)

@@ -287,6 +312,16 @@ The matrix is the canonical source for how risk-level governs role envelope; run

## Single-agent anti-collusion rule

### Why this rule exists

The structural rule below is the *enforcement layer* for an underlying behavioural failure: **AI agents asked to evaluate work they have produced systematically over-report quality.** The pattern is not confined to specific failure shapes — it appears as the Reviewer's *anti-rationalization* failures (rule 1 confidence-without-substantiation, rule 2 unsubstantiated `pass` cells), as the Implementer's *self-supervising loop* in `ai-operating-contract.md §Rejected patterns`, as the *autonomous self-terminating loop* failure, and as the broader pattern of an agent praising its own output even when an external reader would call the work mediocre. The mechanism is shared across these surfaces: the same identity that produced the work cannot reliably hold an adversarial stance toward it.

This is not a tooling problem, and it cannot be fixed at the prompt layer alone — instructing an agent to "be critical of your own work" makes the surface text more critical without making the evaluation more accurate. The reliable fix is **structural separation**: the work and the audit are produced under different identities, with the auditor's tool envelope mechanically prevented from touching the work (the §Tool-permission matrix's *Reviewer has no write tools* row is the load-bearing form).
Self-evaluation bias does not respect role labels — it takes hold whenever the same identity holds both production authority and audit authority over the same artifact, regardless of how the second pass is framed (*"reflection," "second look," "verifier sub-step," "self-correction"* are surface forms of the same identity-collapse).

The rule below is therefore not a procedural preference; it is the structural counter-pressure to a behavioural failure the methodology cannot prompt-engineer its way around. Re-evaluation of whether the failure still binds on a given model class is a [`harness-evolution-discipline.md`](harness-evolution-discipline.md) concern; the rule itself stays as long as the failure does.

### The rule

**Rule.** Within a single change, the *same* agent identity (same model invocation, same sub-agent spawn, same human account) must not play more than one of `{Planner, Implementer, Reviewer}`.

Specifically forbidden combinations, in order of risk:

diff --git a/skills/engineering-workflow/references/mode-decision-tree.md b/skills/engineering-workflow/references/mode-decision-tree.md
index b67fefe..b898e86 100644
--- a/skills/engineering-workflow/references/mode-decision-tree.md
+++ b/skills/engineering-workflow/references/mode-decision-tree.md
@@ -69,6 +69,7 @@ The following **should not** use Full mode (avoid over-engineering):

| Small-scope refactor in a well-tested area | Tests are the verification; no new contract. |
| New log / metric without behavior change | Pure operational-surface enhancement. |
| Retirement of a project-local discipline backed by a recorded `Discipline-provenance sweep` finding (per [`anti-entropy-discipline.md`](../../../docs/anti-entropy-discipline.md) Rule 3) | The sweep finding is the decay evidence; retirement is single-surface, single-consumer, ≤5-min verification (delete the discipline + bridge-pointer cleanup). Adding a new discipline still routes through the canonical-methodology-content row above (L1+ → Full); only sweep-backed *retirement* drops to Lean. This is the asymmetric-cost lever that lets the methodology shed weight over time rather than only accumulate it. Self-applying-methodology check: the retirement change still goes through Phases 0–7; the Lean-mode collapse is on artifact set, not on phase rigour. |
| Retirement of a canonical methodology component backed by a recorded `Harness-evolution sweep` finding (per [`harness-evolution-discipline.md`](../../../docs/harness-evolution-discipline.md) §The procedure §Step 3) | The model-release-driven empirical re-test is the decay evidence; retirement is single-surface (the canonical SoT file the component lives in), single-consumer-pattern (consumers cite by name, mechanically replaceable), ≤5-min verification (delete the component + back-pointer cleanup at every consumer per `CLAUDE.md §5`). Adding a new component still routes through the canonical-methodology-content row above (L1+ → Full); only sweep-backed *retirement* drops to Lean. Symmetric to the Discipline-provenance sweep row above — both rows ride the asymmetric-cost lever to let the methodology shed weight, the Discipline-provenance row applying it to project-local discipline drift and this row to canonical-methodology model-capability drift. Self-applying-methodology check: the retirement change still goes through Phases 0–7; the Lean-mode collapse is on artifact set, not on phase rigour. The CHANGELOG entry must cite the harness-evolution sweep entry that produced the finding (the sweep's audit trail is what makes the Lean drop legitimate; a "retirement" with no recorded sweep is an L1+ Full-mode change). |

## Scenarios that force Three-line delivery (not Lean, not Full)

diff --git a/skills/engineering-workflow/references/resumption-protocol.md b/skills/engineering-workflow/references/resumption-protocol.md
index d879e46..d58b1a5 100644
--- a/skills/engineering-workflow/references/resumption-protocol.md
+++ b/skills/engineering-workflow/references/resumption-protocol.md
@@ -109,6 +109,8 @@ Before reading any artifact beyond the Manifest, estimate the cumulative read si

The percentage is advisory — context sizes differ across runtimes — but naming a threshold forces the check to actually happen rather than drifting into "I'll just read it all."

**Outgoing-session counterpart.** The 30% rule is the *incoming* session's pre-read estimate. The symmetric *outgoing* rule — estimate before declaring done, do not silently truncate planned work when context feels tight — is `docs/ai-operating-contract.md §12 Context anxiety`. Together they close both ends of the session boundary: the incoming session does not over-read into exhaustion, and the outgoing session does not under-execute into premature wrap-up.

## Step 3: determine the current mode

- Is this task still Lean?
- Or has the scope grown enough to warrant Full?
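The incoming-session 30% rule above can be sketched as arithmetic. The threshold, the bytes-per-token ratio, and the function shape are all illustrative assumptions — the rule itself says the percentage is advisory and runtimes differ:

```python
def within_read_budget(artifact_sizes_bytes, context_window_tokens,
                       threshold=0.30, bytes_per_token=4):
    """Advisory incoming-session check: would reading every listed artifact
    beyond the Manifest exceed `threshold` of the context window?
    Returns True when the planned reads fit the budget. All constants
    are illustrative assumptions, not normative values."""
    estimated_tokens = sum(artifact_sizes_bytes) / bytes_per_token
    return estimated_tokens <= threshold * context_window_tokens
```

Naming even a rough check like this forces the estimate to actually happen rather than drifting into reading everything.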