Refactor and re-scope mutation testing for core execution paths after PR #374

## Summary

PR #374 gets the naming/test-pyramid cleanup over the line, but mutation-testing signal is still weak in the highest-risk files. We need a follow-up pass that treats this as a design and measurement problem, not just a "write more tests" problem.

The immediate goal is to turn mutation testing into a useful guardrail for kata's core execution paths instead of spending cycles on cosmetic CLI/wiring mutants.

## Why this needs a follow-up issue

During the PR #374 follow-up work:

- We added targeted mutation-oriented tests in:
  - `src/cli/commands/execute.test.ts`
  - `src/features/execute/workflow-runner.test.ts`
  - `src/infrastructure/execution/session-bridge.test.ts`
  - `src/features/cycle-management/cooldown-session.test.ts`
  - `src/features/cycle-management/cooldown-session-prepare.test.ts`
- We found and fixed one real production bug in `src/cli/commands/execute.ts`:
  - `execute cycle ... --prepare --agent/--kataka` was reading `opts()` instead of `optsWithGlobals()`, which dropped agent attribution on the cycle subcommand path.
- The focused test run passes:
  - `vitest run src/cli/commands/execute.test.ts src/features/execute/workflow-runner.test.ts src/infrastructure/execution/session-bridge.test.ts src/features/cycle-management/cooldown-session.test.ts src/features/cycle-management/cooldown-session-prepare.test.ts`
  - Result: `5` files passed, `325` tests passed.

But the mutation-testing signal is still poor where it matters most:

- Focused Stryker run for `src/cli/commands/execute.ts` on the PR-374-based worktree:
  - `872` mutants
  - mutation score `43.39`
  - `305` killed
  - `208` survived
  - `190` no coverage
  - `169` errors
- Not every dominant survivor points to a missing test:
  - many are command descriptions, option help text, banner strings, and other presentation-only literals
  - some are real uncovered branches in delegated CLI paths (`status`, `stats`, `--hint`, invalid agent handling, local `--json`)
  - some come from monolithic orchestration files where pure logic and side effects are mixed together

## Problem statement

The current shape of the mutation target makes it too easy to burn time on low-value survivors while still missing real risk:

1. `execute.ts` mixes command registration, CLI output, parsing, fallback logic, and business helpers in one file.
2. `session-bridge.ts` mixes prompt rendering, file IO, status aggregation, fallback resolution, and lifecycle updates.
3. `cooldown-session.ts` mixes orchestration, heuristics, persistence, cleanup, and enrichment logic in one class.
4. The mutation denominator is inflated by presentation-only code that should not carry the same weight as core execution logic.

If we keep treating this as a raw "% killed" chase, we'll spend time snapshot-testing help text instead of improving the actual quality bar.

## Goals

- Raise mutation-testing signal for the real execution paths in:
  - `src/cli/commands/execute.ts`
  - `src/features/execute/workflow-runner.ts`
  - `src/infrastructure/execution/session-bridge.ts`
  - `src/features/cycle-management/cooldown-session.ts`
- Separate pure logic from CLI/wiring so mutation testing can target code that is worth mutating.
- Define a clear mutation-testing scope and gating strategy for kata.
- Kill the remaining meaningful survivors without padding the suite with low-signal assertions.

## Non-goals

- Chasing repo-wide mutation score by asserting every description string or help-text literal.
- Broadly disabling whole Stryker mutators such as `ConditionalExpression` or `LogicalOperator`.
- Reworking unrelated config/test harness pieces unless we find a real defect.

## Current high-signal findings

### `src/cli/commands/execute.ts`

Likely still the highest-value follow-up target.

What remains meaningful:

- delegated `status` / `stats` behavior
- invalid `--agent` / `--kataka` handling
- `--hint` parsing failures and edge cases
- local-vs-global `--json` behavior on nested subcommands
- some `--next` fallback paths

What looks low-value:

- `.description(...)` strings
- `.option(..., "help text")` strings
- CLI formatting-only lines and banners

### `src/features/execute/workflow-runner.ts`

This is in much better shape after the recent test additions.

Remaining likely work:

- artifact listing fallbacks / malformed metadata paths
- any remaining sort-order or serialization edge cases
- review whether the file should expose more helper functions for direct testing

### `src/infrastructure/execution/session-bridge.ts`

Still worth another pass because it mixes multiple responsibilities.

Likely remaining meaningful areas:

- `getAgentContext()` reconstruction and metadata canonicalization
- cycle-status aggregation and pending placeholder behavior
- non-JSON / malformed file filtering
- named-kata fallback resolution

### `src/features/cycle-management/cooldown-session.ts`

Still large and side-effect-heavy.

Likely remaining meaningful areas:

- learning-capture thresholds and heuristics
- enrichment fallback behavior
- incomplete-run detection / warnings
- expiry / stale-learning cleanup
- stale synthesis cleanup and related file filtering

## Proposed follow-up plan

### Phase 1: Re-baseline the weak files

- Run focused Stryker only on:
  - `execute.ts`
  - `workflow-runner.ts`
  - `session-bridge.ts`
  - `cooldown-session.ts`
- Capture per-file before/after counts, not just a single repo-wide percentage.
- Treat `execute.ts` separately from the engine/session files because its CLI wiring skews the denominator.
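
One way to capture the per-file counts is to tally them from Stryker's JSON report, assuming the `json` reporter is enabled alongside the HTML one. The sketch below assumes the report follows the mutation-testing report schema (`files` keyed by path, each holding a `mutants` array with a `status`). `tallyByFile`, `FileBaseline`, and the sample data are hypothetical.

```typescript
// Minimal subset of the mutation-testing report schema (assumed shape).
type MutantStatus =
  | "Killed" | "Survived" | "NoCoverage" | "Timeout"
  | "CompileError" | "RuntimeError" | "Ignored" | "Pending";

interface MutationReport {
  files: Record<string, { mutants: { status: MutantStatus }[] }>;
}

interface FileBaseline {
  killed: number;
  survived: number;
  noCoverage: number;
  errors: number;
  total: number;
}

// Hypothetical helper: count mutants per file so each weak file gets its own baseline row.
function tallyByFile(report: MutationReport): Record<string, FileBaseline> {
  const out: Record<string, FileBaseline> = {};
  for (const path of Object.keys(report.files)) {
    const row: FileBaseline = { killed: 0, survived: 0, noCoverage: 0, errors: 0, total: 0 };
    for (const m of report.files[path].mutants) {
      row.total += 1;
      // Choice made for this sketch: timeouts are counted together with killed mutants.
      if (m.status === "Killed" || m.status === "Timeout") row.killed += 1;
      else if (m.status === "Survived") row.survived += 1;
      else if (m.status === "NoCoverage") row.noCoverage += 1;
      else if (m.status === "CompileError" || m.status === "RuntimeError") row.errors += 1;
      // "Ignored" and "Pending" only contribute to `total`.
    }
    out[path] = row;
  }
  return out;
}

// Made-up sample data, only to show the output shape.
const sample: MutationReport = {
  files: {
    "src/cli/commands/execute.ts": {
      mutants: [
        { status: "Killed" },
        { status: "Survived" },
        { status: "NoCoverage" },
        { status: "RuntimeError" },
      ],
    },
    "src/features/execute/workflow-runner.ts": {
      mutants: [{ status: "Killed" }, { status: "Timeout" }],
    },
  },
};

const baseline = tallyByFile(sample);
```

Storing one such row per file before and after each phase makes it visible when `execute.ts` moves the total while the engine/session files stay flat.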

### Phase 2: Extract pure helpers out of `execute.ts`

Move these into a small helper module (or modules) with direct unit tests:

- `formatExplain`
- `parseHintFlags`
- `parseBetOption`
- saved-kata load/save/delete/list helpers
- duration formatting helpers

Target outcome:

- fewer mutants tied to command-registration boilerplate
- better unit-level mutation signal on actual logic
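
To illustrate the kind of extraction meant here, the sketch below shows a duration formatter as a standalone pure function. The name `formatDuration`, its output format, and its thresholds are assumptions for illustration; the real helper in `execute.ts` may behave differently.

```typescript
// Hypothetical helper: the real duration formatter in execute.ts may differ.
// Pure function: no Commander objects, no console output, no file access.
function formatDuration(ms: number): string {
  if (!isFinite(ms) || ms < 0) return "unknown";
  if (ms < 1000) return `${Math.floor(ms)}ms`;
  const totalSeconds = Math.floor(ms / 1000);
  if (totalSeconds < 60) return `${totalSeconds}s`;
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  return `${minutes}m ${seconds}s`;
}

// Boundary inputs on both sides of each threshold.
const boundaryCases: [number, string][] = [
  [999, "999ms"],
  [1000, "1s"],
  [59999, "59s"],
  [60000, "1m 0s"],
  [-1, "unknown"],
];
```

Asserting on both sides of each threshold is what kills boundary mutants such as `<` becoming `<=`, and it needs no command-registration setup at all.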

### Phase 3: Extract testable helpers from the orchestration files

Candidates:

- `session-bridge.ts`
  - cycle-status aggregation
  - last-activity selection
  - stage resolution / fallback
  - budget estimation helper
- `cooldown-session.ts`
  - alert-level calculation
  - learning-capture heuristics
  - incomplete-run classification
  - stale synthesis cleanup filtering

Target outcome:

- keep the orchestration classes thinner
- make the mutation target mostly deterministic business logic
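
As a sketch of what such an extraction could look like for cycle-status aggregation, the function below takes already-loaded run statuses and returns a summary. The `RunStatus` values, the `CycleStatusSummary` shape, and the precedence rules are all assumptions; the real `session-bridge.ts` types are not shown in this issue.

```typescript
// Hypothetical shapes: the real session-bridge types are not shown in this issue.
type RunStatus = "pending" | "running" | "complete" | "failed";

interface CycleStatusSummary {
  total: number;
  byStatus: Record<RunStatus, number>;
  overall: RunStatus;
}

// Pure aggregation: no file IO, so every branch is reachable from a plain unit test.
function aggregateCycleStatus(runs: RunStatus[]): CycleStatusSummary {
  const byStatus: Record<RunStatus, number> = { pending: 0, running: 0, complete: 0, failed: 0 };
  for (const status of runs) byStatus[status] += 1;

  let overall: RunStatus;
  if (runs.length === 0) overall = "pending"; // assumed placeholder for a cycle with no runs
  else if (byStatus.failed > 0) overall = "failed";
  else if (byStatus.running > 0) overall = "running";
  else if (byStatus.pending > 0) overall = "pending";
  else overall = "complete";

  return { total: runs.length, byStatus, overall };
}
```

The orchestration class would keep reading and filtering files, then hand the parsed statuses to a function like this one, so mutants in the precedence logic can be killed without touching the filesystem.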

### Phase 4: Add only high-value tests for real survivors

Prioritize tests that distinguish real behavior, especially:

- `execute`
  - invalid `--hint` stage/category/strategy/empty-flavor cases
  - local-vs-global `--json`
  - saved-kata fallback/error handling
- `session-bridge`
  - exact pending placeholder details
  - reconstructed context metadata
  - bridge-run and history filtering
- `cooldown-session`
  - expiry/archival cases
  - missing-description fallbacks
  - zero/undefined budget/report enrichment edge cases

### Phase 5: Decide what should count in the mutation gate

If presentation-only CLI literals still dominate the denominator after extraction:

- exclude only presentation-only code paths from the mutation target, or
- use narrow inline Stryker disables with rationale on those exact lines, or
- keep CLI registration in a non-blocking mutation suite and gate only core logic

Do **not** lower the bar across the board without documenting why.
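
For the inline-disable option, a narrow directive could look like the sketch below, assuming the Stryker version in use supports inline `// Stryker disable next-line` comments. `BANNER_PREFIX`, `bannerFor`, and `isValidAgentId` are hypothetical names used only to contrast presentation with behavior.

```typescript
// Narrow disable: only the StringLiteral mutator, only the next line, with a stated reason.
// Stryker disable next-line StringLiteral: banner text is presentation-only
const BANNER_PREFIX = "kata execute";

// Not disabled: how the banner is assembled is still open to mutation.
function bannerFor(stage: string): string {
  return `${BANNER_PREFIX}: ${stage}`;
}

// Left fully mutable: this is behavior, not presentation.
function isValidAgentId(id: string): boolean {
  return id.trim().length > 0;
}
```

Keeping the reason on the directive itself gives the short justification the acceptance criteria ask for, right next to the excluded mutant.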

## Acceptance criteria

- We have a documented per-file mutation baseline for the four weak files.
- `execute.ts` is split so pure parsing/formatting logic is mutation-testable outside the command-registration wrapper.
- `session-bridge.ts` and/or `cooldown-session.ts` have high-signal helper extraction where it materially improves testability.
- Remaining survivors in the four weak files are triaged into:
  - real missing tests
  - acceptable presentation-only mutants
  - code-smell/refactor follow-up
- The mutation gate is updated to reflect the intended quality bar for kata core logic.
- We can explain, in one short note, why any ignored/excluded mutants are acceptable.

## Open questions

- Should CLI registration files participate in the blocking mutation gate at all?
- Should kata gate mutation score per file or per focused suite rather than repo-wide?
- Do we want repo-specific `area:*` labels for kata (`area:cli`, `area:engine`, `area:skills`) added/synced so future issues can be labeled consistently with the standard?

## Suggested implementation order for next session

1. Re-run focused Stryker on the PR-374 base for the four weak files and record the baseline.
2. Extract `execute` helpers and add direct tests first.
3. Re-run Stryker for `execute`.
4. Pick either `session-bridge` or `cooldown-session` as the next extraction target, not both at once.
5. Revisit mutation gating only after the helper extraction improves signal.

## References

- PR #374
- Follow-up worktree branch: `codex-mutation-pr374-followup-20260313`
- Focused Stryker report path from the PR-based worktree:
  - `reports/mutation/mutation.html`

