Refactor and re-scope mutation testing for core execution paths after PR #374 (#375)

@cmbays

Summary

PR #374 gets the naming/test-pyramid cleanup over the line, but mutation-testing signal is still weak in the highest-risk files. We need a follow-up pass that treats this as a design and measurement problem, not just a "write more tests" problem.

The immediate goal is to turn mutation testing into a useful guardrail for kata's core execution paths instead of spending cycles on cosmetic CLI/wiring mutants.

Why this needs a follow-up issue

During the PR #374 follow-up work:

  • We added targeted mutation-oriented tests in:
    • src/cli/commands/execute.test.ts
    • src/features/execute/workflow-runner.test.ts
    • src/infrastructure/execution/session-bridge.test.ts
    • src/features/cycle-management/cooldown-session.test.ts
    • src/features/cycle-management/cooldown-session-prepare.test.ts
  • We found and fixed one real production bug in src/cli/commands/execute.ts:
    • execute cycle ... --prepare --agent/--kataka was reading opts() instead of optsWithGlobals(), which dropped agent attribution on the cycle subcommand path.
  • Focused verification is good:
    • vitest run src/cli/commands/execute.test.ts src/features/execute/workflow-runner.test.ts src/infrastructure/execution/session-bridge.test.ts src/features/cycle-management/cooldown-session.test.ts src/features/cycle-management/cooldown-session-prepare.test.ts
    • Result: 5 files passed, 325 tests passed.

But the mutation-testing signal is still poor where it matters most:

  • Focused Stryker run for src/cli/commands/execute.ts on the PR-374-based worktree:
    • 872 mutants
    • mutation score 43.39%
    • 305 killed
    • 208 survived
    • 190 no coverage
    • 169 errors
  • Not all of the dominant survivors point to missing tests:
    • many are command descriptions, option help text, banner strings, and other presentation-only literals
    • some are real uncovered branches in delegated CLI paths (status, stats, --hint, invalid agent handling, local --json)
    • some come from monolithic orchestration files where pure logic and side effects are mixed together
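
As a check on the figures above, the reported score is consistent with Stryker's definition of mutation score: detected mutants over valid mutants, with errored mutants left out of the denominator. A minimal sketch of the arithmetic, assuming no timed-out mutants since the run summary lists none; the type and function names are illustrative only:

```typescript
// Counts reported for src/cli/commands/execute.ts in the focused run above.
// Assumption: no timed-out mutants, since the run summary lists none.
interface MutantCounts {
  killed: number;
  survived: number;
  noCoverage: number;
  errors: number;
}

// Mutation score = detected / valid; errored mutants are not counted as valid.
function mutationScore(c: MutantCounts): number {
  const valid = c.killed + c.survived + c.noCoverage;
  return valid === 0 ? 0 : (c.killed / valid) * 100;
}

// Score restricted to mutants that at least one test executes.
function coveredScore(c: MutantCounts): number {
  const covered = c.killed + c.survived;
  return covered === 0 ? 0 : (c.killed / covered) * 100;
}

const executeTs: MutantCounts = { killed: 305, survived: 208, noCoverage: 190, errors: 169 };

console.log(mutationScore(executeTs).toFixed(2)); // 43.39
console.log(coveredScore(executeTs).toFixed(2)); // 59.45
```

The gap between the two figures (about 16 points) comes entirely from the 190 uncovered mutants, so it is worth knowing how many of those are presentation literals before chasing the headline number.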

Problem statement

The current shape of the mutation target makes it too easy to burn time on low-value survivors while still missing real risk:

  1. execute.ts mixes command registration, CLI output, parsing, fallback logic, and business helpers in one file.
  2. session-bridge.ts mixes prompt rendering, file IO, status aggregation, fallback resolution, and lifecycle updates.
  3. cooldown-session.ts mixes orchestration, heuristics, persistence, cleanup, and enrichment logic in one class.
  4. The mutation denominator is inflated by presentation-only code that should not carry the same weight as core execution logic.

If we keep treating this as a raw "% killed" chase, we'll spend time snapshot-testing help text instead of improving the actual quality bar.

Goals

  • Raise mutation-testing signal for the real execution paths in:
    • src/cli/commands/execute.ts
    • src/features/execute/workflow-runner.ts
    • src/infrastructure/execution/session-bridge.ts
    • src/features/cycle-management/cooldown-session.ts
  • Separate pure logic from CLI/wiring so mutation testing can target code that is worth mutating.
  • Define a clear mutation-testing scope and gating strategy for kata.
  • Kill the remaining meaningful survivors without padding the suite with low-signal assertions.

Non-goals

  • Chasing repo-wide mutation score by asserting every description string or help-text literal.
  • Broadly disabling whole mutation classes like ConditionalExpression or LogicalOperator.
  • Reworking unrelated config/test harness pieces unless we find a real defect.

Current high-signal findings

src/cli/commands/execute.ts

Likely still the highest-value follow-up target.

What remains meaningful:

  • delegated status / stats behavior
  • invalid --agent / --kataka handling
  • --hint parsing failures and edge cases
  • local-vs-global --json behavior on nested subcommands
  • some --next fallback paths

What looks low-value:

  • .description(...) strings
  • .option(..., "help text") strings
  • CLI formatting-only lines and banners

src/features/execute/workflow-runner.ts

This is in much better shape after the recent test additions.

Remaining likely work:

  • artifact listing fallbacks / malformed metadata paths
  • any remaining sort-order or serialization edge cases
  • review whether the file should expose more helper functions for direct testing

src/infrastructure/execution/session-bridge.ts

Still worth another pass because it mixes multiple responsibilities.

Likely remaining meaningful areas:

  • getAgentContext() reconstruction and metadata canonicalization
  • cycle-status aggregation and pending placeholder behavior
  • non-JSON / malformed file filtering
  • named-kata fallback resolution

src/features/cycle-management/cooldown-session.ts

Still large and side-effect-heavy.

Likely remaining meaningful areas:

  • learning-capture thresholds and heuristics
  • enrichment fallback behavior
  • incomplete-run detection / warnings
  • expiry / stale-learning cleanup
  • stale synthesis cleanup and related file filtering

Proposed follow-up plan

Phase 1: Re-baseline the weak files

  • Run focused Stryker only on:
    • execute.ts
    • workflow-runner.ts
    • session-bridge.ts
    • cooldown-session.ts
  • Capture per-file before/after counts, not just one repo-wide percent.
  • Treat execute.ts separately from the engine/session files because its CLI wiring skews the denominator.
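
One way to capture per-file counts is to tally mutant statuses from Stryker's JSON report instead of reading a single aggregate score. The sketch below assumes the report follows the mutation-testing report schema (a `files` map whose entries each carry a `mutants` array with a `status` field). `tallyByFile` is a hypothetical helper, not existing kata code, and the sample data is made up for illustration:

```typescript
// Minimal slice of the mutation-testing report schema used by this sketch.
type MutantStatus =
  | "Killed" | "Survived" | "NoCoverage" | "Timeout"
  | "CompileError" | "RuntimeError" | "Ignored" | "Pending";

interface MutationReport {
  files: Record<string, { mutants: { status: MutantStatus }[] }>;
}

type Tally = Partial<Record<MutantStatus, number>>;

// Count mutants per status for each file in the report.
function tallyByFile(report: MutationReport): Record<string, Tally> {
  const out: Record<string, Tally> = {};
  for (const [file, { mutants }] of Object.entries(report.files)) {
    const tally: Tally = {};
    for (const m of mutants) {
      tally[m.status] = (tally[m.status] ?? 0) + 1;
    }
    out[file] = tally;
  }
  return out;
}

// Illustrative data only; real input would be the parsed JSON report.
const sample: MutationReport = {
  files: {
    "src/cli/commands/execute.ts": {
      mutants: [{ status: "Killed" }, { status: "Survived" }, { status: "Survived" }],
    },
    "src/features/execute/workflow-runner.ts": {
      mutants: [{ status: "Killed" }, { status: "NoCoverage" }],
    },
  },
};

console.log(tallyByFile(sample));
```

Storing this output before and after each phase gives a per-file diff that does not depend on the repo-wide percentage.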

Phase 2: Extract pure helpers out of execute.ts

Move these into a small helper module (or modules) with direct unit tests:

  • formatExplain
  • parseHintFlags
  • parseBetOption
  • saved-kata load/save/delete/list helpers
  • duration formatting helpers

Target outcome:

  • fewer mutants tied to command-registration boilerplate
  • better unit-level mutation signal on actual logic
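
To illustrate what an extracted helper with direct tests might look like, here is a minimal sketch of a duration formatter. The function name, signature, and output format are assumptions made for illustration; the existing helper in execute.ts may behave differently:

```typescript
// Hypothetical pure helper; the real duration formatting in execute.ts may differ.
function formatDuration(ms: number): string {
  if (!Number.isFinite(ms) || ms < 0) return "0s";
  const totalSeconds = Math.floor(ms / 1000);
  if (totalSeconds < 60) return `${totalSeconds}s`;
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  if (minutes < 60) return `${minutes}m ${seconds}s`;
  const hours = Math.floor(minutes / 60);
  return `${hours}h ${minutes % 60}m`;
}

// Inputs on either side of each threshold, so boundary mutants
// (for example `<` changed to `<=`) produce a different result.
const cases: [number, string][] = [
  [-1, "0s"],
  [59_999, "59s"],
  [60_000, "1m 0s"],
  [3_599_999, "59m 59s"],
  [3_600_000, "1h 0m"],
];

for (const [input, expected] of cases) {
  const actual = formatDuration(input);
  if (actual !== expected) {
    throw new Error(`formatDuration(${input}) returned ${actual}, expected ${expected}`);
  }
}
```

Because the helper has no CLI or IO dependencies, a focused Stryker run over its module would mutate only logic like the comparisons above, not command-registration strings.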

Phase 3: Extract testable helpers from the orchestration files

Candidates:

  • session-bridge.ts
    • cycle-status aggregation
    • last-activity selection
    • stage resolution / fallback
    • budget estimation helper
  • cooldown-session.ts
    • alert-level calculation
    • learning-capture heuristics
    • incomplete-run classification
    • stale synthesis cleanup filtering

Target outcome:

  • keep the orchestration classes thinner
  • make the mutation target mostly deterministic business logic
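
The split between a thin orchestration layer and deterministic logic could look roughly like the sketch below, using cycle-status aggregation as the example. All types, names, and precedence rules here are assumptions for illustration and are not taken from session-bridge.ts:

```typescript
// Hypothetical shapes; the real session-bridge types and rules may differ.
type RunStatus = "pending" | "running" | "complete" | "failed";

interface CycleStatusSummary {
  total: number;
  byStatus: Record<RunStatus, number>;
  overall: RunStatus;
}

// Pure: no file IO, no clock, no logging. This is the part worth mutating.
function aggregateCycleStatus(statuses: RunStatus[]): CycleStatusSummary {
  const byStatus: Record<RunStatus, number> = { pending: 0, running: 0, complete: 0, failed: 0 };
  for (const s of statuses) byStatus[s] += 1;

  // Assumed precedence: failed > running > complete (all runs) > pending.
  let overall: RunStatus;
  if (byStatus.failed > 0) overall = "failed";
  else if (byStatus.running > 0) overall = "running";
  else if (statuses.length > 0 && byStatus.complete === statuses.length) overall = "complete";
  else overall = "pending";

  return { total: statuses.length, byStatus, overall };
}

// Thin shell: the orchestration layer only gathers inputs and delegates.
function summarizeCycle(loadStatuses: () => RunStatus[]): CycleStatusSummary {
  return aggregateCycleStatus(loadStatuses());
}

console.log(summarizeCycle(() => ["complete", "pending"]).overall); // pending
```

With this shape, tests for the aggregation rules need no fixtures on disk, and the shell can be covered by a small number of integration tests.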

Phase 4: Add only high-value tests for real survivors

Prioritize tests that distinguish real behavior, especially:

  • execute
    • invalid --hint stage/category/strategy/empty-flavor cases
    • local-vs-global --json
    • saved-kata fallback/error handling
  • session-bridge
    • exact pending placeholder details
    • reconstructed context metadata
    • bridge-run and history filtering
  • cooldown-session
    • expiry/archival cases
    • missing-description fallbacks
    • zero/undefined budget/report enrichment edge cases

Phase 5: Decide what should count in the mutation gate

If presentation-only CLI literals still dominate the denominator after extraction:

  • exclude only presentation-only code paths from the mutation target, or
  • use narrow inline Stryker disables with rationale on those exact lines, or
  • keep CLI registration in a non-blocking mutation suite and gate only core logic

Do not do a broad "just lower the bar" move without documenting why.
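
A narrow inline disable could be sketched as follows, using Stryker's `// Stryker disable <mutators>: <reason>` and `// Stryker restore <mutators>` comment directives. The `CommandStub` class is a stand-in so the example is self-contained, and the help strings are invented; the real registration code would use the CLI library's own builder:

```typescript
// Stand-in for a commander-style builder so the sketch runs on its own.
class CommandStub {
  descriptionText = "";
  options: { flags: string; help: string }[] = [];
  description(text: string): this {
    this.descriptionText = text;
    return this;
  }
  option(flags: string, help: string): this {
    this.options.push({ flags, help });
    return this;
  }
}

// Stryker disable StringLiteral,ObjectLiteral: presentation-only help text
const HELP_TEXT = {
  statusDescription: "Show cycle status",
  json: "Output as JSON",
};
// Stryker restore StringLiteral,ObjectLiteral

// The flag string "--json" stays outside the disabled region, so mutants
// that change actual CLI behavior are still generated and must be killed.
function registerStatus(cmd: CommandStub): CommandStub {
  return cmd
    .description(HELP_TEXT.statusDescription)
    .option("--json", HELP_TEXT.json);
}

const registered = registerStatus(new CommandStub());
console.log(registered.options.map((o) => o.flags)); // [ '--json' ]
```

Hoisting the literals into one block keeps the disabled region small and puts the rationale next to it, which is harder to do with a `next-line` disable on a call that mixes a flag name with its help text.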

Acceptance criteria

  • We have a documented per-file mutation baseline for the four weak files.
  • execute.ts is split so pure parsing/formatting logic is mutation-testable outside the command-registration wrapper.
  • session-bridge.ts and/or cooldown-session.ts have helpers extracted wherever that materially improves testability.
  • Remaining survivors in the four weak files are triaged into:
    • real missing tests
    • acceptable presentation-only mutants
    • code-smell/refactor follow-up
  • The mutation gate is updated to reflect the intended quality bar for kata core logic.
  • We can explain, in one short note, why any ignored/excluded mutants are acceptable.

Open questions

  • Should CLI registration files participate in the blocking mutation gate at all?
  • Should kata gate mutation score per file or per focused suite rather than repo-wide?
  • Do we want repo-specific area:* labels for kata (area:cli, area:engine, area:skills) added/synced so future issues can be labeled consistently with the standard?

Suggested implementation order for next session

  1. Re-run focused Stryker on the PR-374 base for the four weak files and record the baseline.
  2. Extract execute helpers and add direct tests first.
  3. Re-run Stryker for execute.
  4. Pick either session-bridge or cooldown-session as the next extraction target, not both at once.
  5. Revisit mutation gating only after the helper extraction improves signal.

Labels: cross-cutting, dogfooding, domain:methodology, priority:soon, status:triage, tech-debt, testing, type:design
