Summary
PR #374 gets the naming/test-pyramid cleanup over the line, but mutation-testing signal is still weak in the highest-risk files. We need a follow-up pass that treats this as a design and measurement problem, not just a "write more tests" problem.
The immediate goal is to turn mutation testing into a useful guardrail for kata's core execution paths instead of spending cycles on cosmetic CLI/wiring mutants.
Why this needs a follow-up issue
During the PR #374 follow-up work:
- We added targeted mutation-oriented tests in:
  - src/cli/commands/execute.test.ts
  - src/features/execute/workflow-runner.test.ts
  - src/infrastructure/execution/session-bridge.test.ts
  - src/features/cycle-management/cooldown-session.test.ts
  - src/features/cycle-management/cooldown-session-prepare.test.ts
- We found and fixed one real production bug in src/cli/commands/execute.ts: execute cycle ... --prepare --agent/--kataka was reading opts() instead of optsWithGlobals(), which dropped agent attribution on the cycle subcommand path.
- Focused verification is good:
  - vitest run src/cli/commands/execute.test.ts src/features/execute/workflow-runner.test.ts src/infrastructure/execution/session-bridge.test.ts src/features/cycle-management/cooldown-session.test.ts src/features/cycle-management/cooldown-session-prepare.test.ts
  - Result: 5 files passed, 325 tests passed.
But the mutation-testing signal is still poor where it matters most:
- Focused Stryker run for src/cli/commands/execute.ts on the PR-374-based worktree:
  - 872 mutants
  - mutation score 43.39
  - 305 killed
  - 208 survived
  - 190 no coverage
  - 169 errors
- The dominant survivors are not all "missing important tests":
  - many are command descriptions, option help text, banner strings, and other presentation-only literals
  - some are real uncovered branches in delegated CLI paths (status, stats, --hint, invalid agent handling, local --json)
  - some come from monolithic orchestration files where pure logic and side effects are mixed together
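For reference, the reported score can be reproduced from the counts above. This is a minimal sketch that assumes Stryker's documented definition: detected mutants (killed plus timed out) divided by valid mutants (detected plus survived plus no coverage), with errored mutants left out of the denominator. The timeout count is not listed above and is assumed to be zero, which is consistent with the four listed counts summing to 872. The type and function names are illustrative only.

```typescript
// Counts from the focused Stryker run on src/cli/commands/execute.ts.
// `timeout` is an assumption: it is not listed in the summary above.
interface MutantCounts {
  killed: number;
  timeout: number;
  survived: number;
  noCoverage: number;
  errors: number;
}

const executeCounts: MutantCounts = {
  killed: 305,
  timeout: 0,
  survived: 208,
  noCoverage: 190,
  errors: 169,
};

// Score over all valid mutants; errored mutants are excluded.
function mutationScore(c: MutantCounts): number {
  const detected = c.killed + c.timeout;
  const valid = detected + c.survived + c.noCoverage;
  return valid === 0 ? 0 : (detected / valid) * 100;
}

// Score over covered mutants only; no-coverage mutants are excluded as well.
function coveredScore(c: MutantCounts): number {
  const detected = c.killed + c.timeout;
  const covered = detected + c.survived;
  return covered === 0 ? 0 : (detected / covered) * 100;
}

console.log(mutationScore(executeCounts).toFixed(2)); // 43.39
console.log(coveredScore(executeCounts).toFixed(2)); // 59.45
```

Of the 398 undetected mutants, 190 are no-coverage rather than survived, so part of the gap is code that no test reaches at all rather than code with weak assertions.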
Problem statement
The current mutation target shape makes it too easy to burn time on low-value survivors while still missing real risk:
- execute.ts mixes command registration, CLI output, parsing, fallback logic, and business helpers in one file.
- session-bridge.ts mixes prompt rendering, file IO, status aggregation, fallback resolution, and lifecycle updates.
- cooldown-session.ts mixes orchestration, heuristics, persistence, cleanup, and enrichment logic in one class.
- The mutation denominator is inflated by presentation-only code that should not carry the same weight as core execution logic.
If we keep treating this as a raw "% killed" chase, we'll spend time snapshot-testing help text instead of improving the actual quality bar.
Goals
- Raise mutation-testing signal for the real execution paths in:
  - src/cli/commands/execute.ts
  - src/features/execute/workflow-runner.ts
  - src/infrastructure/execution/session-bridge.ts
  - src/features/cycle-management/cooldown-session.ts
- Separate pure logic from CLI/wiring so mutation testing can target code that is worth mutating.
- Define a clear mutation-testing scope and gating strategy for kata.
- Kill the remaining meaningful survivors without padding the suite with low-signal assertions.
Non-goals
- Chasing repo-wide mutation score by asserting every description string or help-text literal.
- Broadly disabling whole mutation classes like ConditionalExpression or LogicalOperator.
- Reworking unrelated config/test harness pieces unless we find a real defect.
Current high-signal findings
src/cli/commands/execute.ts
Likely still the highest-value follow-up target.
What remains meaningful:
- delegated status/stats behavior
- invalid --agent/--kataka handling
- --hint parsing failures and edge cases
- local-vs-global --json behavior on nested subcommands
- some --next fallback paths
What looks low-value:
- .description(...) strings
- .option(..., "help text") strings
- CLI formatting-only lines and banners
src/features/execute/workflow-runner.ts
This is in much better shape after the recent test additions.
Remaining likely work:
- artifact listing fallbacks / malformed metadata paths
- any remaining sort-order or serialization edge cases
- review whether the file should expose more helper functions for direct testing
src/infrastructure/execution/session-bridge.ts
Still worth another pass because it mixes multiple responsibilities.
Likely remaining meaningful areas:
- getAgentContext() reconstruction and metadata canonicalization
- cycle-status aggregation and pending placeholder behavior
- non-JSON / malformed file filtering
- named-kata fallback resolution
src/features/cycle-management/cooldown-session.ts
Still large and side-effect-heavy.
Likely remaining meaningful areas:
- learning-capture thresholds and heuristics
- enrichment fallback behavior
- incomplete-run detection / warnings
- expiry / stale-learning cleanup
- stale synthesis cleanup and related file filtering
Proposed follow-up plan
Phase 1: Re-baseline the weak files
- Run focused Stryker only on:
  - execute.ts
  - workflow-runner.ts
  - session-bridge.ts
  - cooldown-session.ts
- Capture per-file before/after counts, not just one repo-wide percent.
- Treat execute.ts separately from the engine/session files because its CLI wiring skews the denominator.
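One way to capture the per-file counts is to tally mutant statuses from Stryker's JSON report. The sketch below assumes the JSON reporter is enabled and that its output follows the mutation testing report schema (a files map keyed by path, each entry holding a mutants array with a status field). Only the fields the sketch reads are modeled; the function name, the FileBaseline shape, and the sample data are hypothetical.

```typescript
// Minimal slice of the mutation testing report schema; only fields read
// by this sketch are declared.
type MutantStatus =
  | "Killed" | "Survived" | "NoCoverage" | "Timeout"
  | "CompileError" | "RuntimeError" | "Ignored" | "Pending";

interface MutationReport {
  files: Record<string, { mutants: { status: MutantStatus }[] }>;
}

interface FileBaseline {
  file: string;
  total: number;
  killed: number;
  timeout: number;
  survived: number;
  noCoverage: number;
  errors: number;
  score: number; // percent, computed over valid mutants only
}

function summarizePerFile(report: MutationReport): FileBaseline[] {
  return Object.keys(report.files).map((file) => {
    const mutants = report.files[file].mutants;
    const count = (...statuses: MutantStatus[]): number =>
      mutants.filter((m) => statuses.indexOf(m.status) !== -1).length;

    const killed = count("Killed");
    const timeout = count("Timeout");
    const survived = count("Survived");
    const noCoverage = count("NoCoverage");
    const errors = count("CompileError", "RuntimeError");
    const valid = killed + timeout + survived + noCoverage;
    const score = valid === 0 ? 0 : ((killed + timeout) / valid) * 100;

    return { file, total: mutants.length, killed, timeout, survived, noCoverage, errors, score };
  });
}

// Hand-written sample, not real data, to show the output shape.
const sampleReport: MutationReport = {
  files: {
    "src/example/a.ts": {
      mutants: [
        { status: "Killed" },
        { status: "Survived" },
        { status: "NoCoverage" },
        { status: "CompileError" },
      ],
    },
  },
};

console.log(summarizePerFile(sampleReport));
```

Storing this output before and after each phase gives a per-file diff that a single repo-wide percentage would hide.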
Phase 2: Extract pure helpers out of execute.ts
Move these into a small helper module (or modules) with direct unit tests:
- formatExplain
- parseHintFlags
- parseBetOption
- saved-kata load/save/delete/list helpers
- duration formatting helpers
Target outcome:
- fewer mutants tied to command-registration boilerplate
- better unit-level mutation signal on actual logic
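As an illustration of the target outcome, here is a sketch of what an extracted duration-formatting helper and its boundary cases could look like. The function name, signature, and output format are hypothetical; the real helper in execute.ts may behave differently. The point is that a pure function with explicit boundaries can be tested at exactly the values where comparison mutants (for example `<` becoming `<=`) change the result.

```typescript
// Hypothetical pure helper of the kind Phase 2 would move out of execute.ts.
// The real helper's name and output format are not shown in this issue.
function formatDuration(ms: number): string {
  if (!isFinite(ms) || ms < 0) return "unknown";
  if (ms < 1000) return `${Math.floor(ms)}ms`;
  const totalSeconds = Math.floor(ms / 1000);
  if (totalSeconds < 60) return `${totalSeconds}s`;
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  return seconds === 0 ? `${minutes}m` : `${minutes}m ${seconds}s`;
}

// Each pair sits on or next to a branch boundary, so a mutated comparison
// produces a different string for at least one of them.
const boundaryCases: Array<[number, string]> = [
  [0, "0ms"],
  [999, "999ms"],
  [1000, "1s"],
  [59999, "59s"],
  [60000, "1m"],
  [61000, "1m 1s"],
  [-1, "unknown"],
];

for (const [input, expected] of boundaryCases) {
  const actual = formatDuration(input);
  console.log(`${input} -> ${actual} (${actual === expected ? "ok" : "MISMATCH"})`);
}
```

Because the helper has no dependency on command registration, its mutants are counted against logic only, and none of them are help-text string literals.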
Phase 3: Extract testable helpers from the orchestration files
Candidates:
- session-bridge.ts
  - cycle-status aggregation
  - last-activity selection
  - stage resolution / fallback
  - budget estimation helper
- cooldown-session.ts
  - alert-level calculation
  - learning-capture heuristics
  - incomplete-run classification
  - stale synthesis cleanup filtering
Target outcome:
- keep the orchestration classes thinner
- make the mutation target mostly deterministic business logic
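A sketch of what one such extraction might look like, using cycle-status aggregation and last-activity selection as the example. All types and names here are hypothetical, since the real session-bridge.ts shapes are not shown in this issue. The function takes plain data and returns plain data, so the orchestration class would only be responsible for reading files and passing the parsed runs in.

```typescript
// Hypothetical shapes; the real session-bridge types may differ.
type RunState = "pending" | "in-progress" | "complete" | "failed";

interface BetRun {
  betId: string;
  state: RunState;
  // ISO-8601 UTC timestamp; assumed absent for pending placeholders.
  lastActivityAt?: string;
}

interface CycleStatus {
  total: number;
  byState: Record<RunState, number>;
  lastActivityAt: string | null;
}

function aggregateCycleStatus(runs: readonly BetRun[]): CycleStatus {
  const byState: Record<RunState, number> = {
    pending: 0,
    "in-progress": 0,
    complete: 0,
    failed: 0,
  };
  let lastActivityAt: string | null = null;

  for (const run of runs) {
    byState[run.state] += 1;
    // Lexicographic comparison is valid only because all timestamps are
    // assumed to share the same ISO-8601 UTC format.
    if (
      run.lastActivityAt !== undefined &&
      (lastActivityAt === null || run.lastActivityAt > lastActivityAt)
    ) {
      lastActivityAt = run.lastActivityAt;
    }
  }

  return { total: runs.length, byState, lastActivityAt };
}

// Sample data for illustration only.
const sampleRuns: BetRun[] = [
  { betId: "a", state: "complete", lastActivityAt: "2026-03-12T10:00:00Z" },
  { betId: "b", state: "in-progress", lastActivityAt: "2026-03-13T09:30:00Z" },
  { betId: "c", state: "pending" },
];

console.log(aggregateCycleStatus(sampleRuns));
```

With the IO removed, tests can assert exact counts and the exact selected timestamp, which is what distinguishes surviving arithmetic and comparison mutants from killed ones.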
Phase 4: Add only high-value tests for real survivors
Prioritize tests that distinguish real behavior, especially:
- execute
  - invalid --hint stage/category/strategy/empty-flavor cases
  - local-vs-global --json
  - saved-kata fallback/error handling
  - invalid
- session-bridge
  - exact pending placeholder details
  - reconstructed context metadata
  - bridge-run and history filtering
- cooldown-session
  - expiry/archival cases
  - missing-description fallbacks
  - zero/undefined budget/report enrichment edge cases
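To make "tests that distinguish real behavior" concrete, here is a table-driven sketch for invalid hint input. The hint syntax, the error codes, and the known stage and strategy values are all hypothetical placeholders; kata's real --hint grammar is not described in this issue. The idea it illustrates is that each invalid case asserts a distinct, specific failure, so a mutant that reorders or removes one validation branch changes an observable result instead of still producing "some error".

```typescript
// Hypothetical hint syntax "<stage>:<strategy>:<flavor>[,<flavor>...]".
// The real --hint grammar and error reporting in kata may differ.
type HintError = "malformed" | "unknown-stage" | "unknown-strategy" | "empty-flavor";

type HintResult =
  | { ok: true; stage: string; strategy: string; flavors: string[] }
  | { ok: false; error: HintError };

interface KnownValues {
  stages: readonly string[];
  strategies: readonly string[];
}

function parseHint(raw: string, known: KnownValues): HintResult {
  const parts = raw.split(":");
  if (parts.length !== 3) return { ok: false, error: "malformed" };
  const [stage, strategy, flavorList] = parts;
  if (known.stages.indexOf(stage) === -1) return { ok: false, error: "unknown-stage" };
  if (known.strategies.indexOf(strategy) === -1) return { ok: false, error: "unknown-strategy" };
  const flavors = flavorList.split(",").map((f) => f.trim());
  if (flavors.some((f) => f === "")) return { ok: false, error: "empty-flavor" };
  return { ok: true, stage, strategy, flavors };
}

// Placeholder vocabulary for the example only.
const known: KnownValues = { stages: ["build"], strategies: ["prefer"] };

// One row per validation branch, each expecting a different error code.
const invalidCases: Array<[string, HintError]> = [
  ["build", "malformed"],
  ["nope:prefer:tdd", "unknown-stage"],
  ["build:nope:tdd", "unknown-strategy"],
  ["build:prefer:tdd,", "empty-flavor"],
  ["build:prefer:", "empty-flavor"],
];

for (const [raw, expected] of invalidCases) {
  const result = parseHint(raw, known);
  const actual = result.ok === false ? result.error : "none";
  console.log(`${JSON.stringify(raw)} -> ${actual} (expected ${expected})`);
}
```

A test that only checked "parsing fails" would pass for every row even if two branches were swapped, which is the kind of low-signal assertion this phase is meant to avoid.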
Phase 5: Decide what should count in the mutation gate
If presentation-only CLI literals still dominate the denominator after extraction:
- exclude only presentation-only code paths from the mutation target, or
- use narrow inline Stryker disables with rationale on those exact lines, or
- keep CLI registration in a non-blocking mutation suite and gate only core logic
Do not do a broad "just lower the bar" move without documenting why.
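The three options could look roughly like the sketch below. It is written as a typed object so it can be checked in isolation rather than as a real config file. The option names (mutate, thresholds, testRunner, reporters) and the inline disable comment syntax follow Stryker's documented configuration; the helper module path, the threshold numbers, and the help-text constant are placeholders, not decisions.

```typescript
// Local stand-in for the subset of Stryker options used here.
interface StrykerConfigSketch {
  testRunner: string;
  mutate: string[];
  reporters: string[];
  thresholds: { high: number; low: number; break: number | null };
}

// Blocking gate: core logic only. Threshold values are placeholders.
const coreLogicGate: StrykerConfigSketch = {
  testRunner: "vitest",
  mutate: [
    "src/features/execute/workflow-runner.ts",
    "src/infrastructure/execution/session-bridge.ts",
    "src/features/cycle-management/cooldown-session.ts",
    // Hypothetical location for helpers extracted from execute.ts.
    "src/cli/commands/execute-helpers.ts",
  ],
  reporters: ["html", "json", "clear-text"],
  thresholds: { high: 80, low: 60, break: 50 },
};

// Non-blocking suite: CLI registration is still mutated and reported,
// but `break: null` means the score never fails the run.
const cliWiringSuite: StrykerConfigSketch = {
  testRunner: "vitest",
  mutate: ["src/cli/commands/execute.ts"],
  reporters: ["html", "clear-text"],
  thresholds: { high: 80, low: 60, break: null },
};

// Narrow inline disable on one presentation-only line, with a rationale.
// Stryker disable next-line StringLiteral: help text is presentation-only
const exampleHelpText = "placeholder help text";

console.log(coreLogicGate.mutate.length, cliWiringSuite.thresholds.break, exampleHelpText);
```

Whichever option is chosen, keeping the rationale next to the exclusion (in the config comment or the disable comment) is what makes the later "one short note" explanation cheap to write.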
Acceptance criteria
- We have a documented per-file mutation baseline for the four weak files.
- execute.ts is split so pure parsing/formatting logic is mutation-testable outside the command-registration wrapper.
- session-bridge.ts and/or cooldown-session.ts have high-signal helper extraction where it materially improves testability.
- Remaining survivors in the four weak files are triaged into:
  - real missing tests
  - acceptable presentation-only mutants
  - code-smell/refactor follow-up
- The mutation gate is updated to reflect the intended quality bar for kata core logic.
- We can explain, in one short note, why any ignored/excluded mutants are acceptable.
Open questions
- Should CLI registration files participate in the blocking mutation gate at all?
- Should kata gate mutation score per file or per focused suite rather than repo-wide?
- Do we want repo-specific area:* labels for kata (area:cli, area:engine, area:skills) added/synced so future issues can be labeled consistently with the standard?
Suggested implementation order for next session
- Re-run focused Stryker on the PR-374 base for the four weak files and record the baseline.
- Extract execute helpers and add direct tests first.
- Re-run Stryker for execute.
- Pick either session-bridge or cooldown-session as the next extraction target, not both at once.
- Revisit mutation gating only after the helper extraction improves signal.
References
- PR #374: Unify agent naming and formalize test pyramid
- Follow-up worktree branch: codex-mutation-pr374-followup-20260313
- Focused Stryker report path from the PR-based worktree: reports/mutation/mutation.html