
Planner lacks diminishing-returns awareness — grinds through low-impact mechanical findings #187

@peteromallet

Description


Summary

The planner has no mechanism to recognize when continued work on a dimension has negligible impact on the overall score. In practice, this means the bot can spend hours grinding through low-confidence mechanical findings (e.g., props interface splitting at 97% Code Quality) while high-impact subjective dimensions sit at 42-50% — a 14-27x impact differential per score point.

More broadly: mechanical detectors like props probably shouldn't generate standalone work items at their current thresholds. They should either inform subjective reviews, or only fire at much more extreme thresholds.

What happened

On March 2-3 2026, the bot worked through a reigh codebase queue. After a plan reset at 22:03, it processed items in default sort order: test_coverage (126 items), unused (18), structural (7), deprecated (5), etc. At 01:02, all 14 subjective dimensions were skipped due to a transient codex batch auth/hang issue. This left 9 props findings as the entire queue.

The bot then spent ~90 minutes applying the same transformation to every file: splitting the props interface into 3-5 sub-interfaces joined with &. It did this on interfaces with 14-18 fields where:

  • Every sub-interface was local (never exported, never imported elsewhere)
  • Every prop was consumed by the same component (no forwarding, no reuse)
  • The splits added naming overhead without improving readability
  • ~70 files were modified with pure type-level noise

The result was a net-negative change: more lines, more names to remember, no behavioral improvement.

Why this is a problem

1. No impact-threshold cutoff

The planner doesn't know that fixing all 35 remaining props findings would move the Code Quality dimension from 97.2% to ~99.2% — a 2% gain on a dimension that's already near-ceiling. Meanwhile, Design Coherence sits at 42.8%. Each point of Design Coherence is worth ~14x more to the overall score than each point of Code Quality.

The planner should be able to reason: "This dimension is above X%, and the remaining findings are all low-confidence. The expected score gain from working this queue is Y. There are other dimensions where Y would be 14-27x larger. I should work on those instead."
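The comparison above can be sketched as a simple expected-gain calculation. This is an illustrative sketch only: the `DimensionState` shape and `overall_impact` helper are hypothetical, and it assumes the overall score weights dimensions equally.

```python
from dataclasses import dataclass

@dataclass
class DimensionState:
    name: str
    score: float            # current dimension score, 0-100
    weight: float           # contribution to the overall score
    remaining_gain: float   # max dimension gain if every queued finding is fixed

def overall_impact(dim: DimensionState) -> float:
    """Expected overall-score gain from exhausting this dimension's queue."""
    return dim.weight * dim.remaining_gain

# Numbers from this incident: Code Quality is near-ceiling,
# Design Coherence has ~50 points of headroom.
dims = [
    DimensionState("Code Quality", score=97.2, weight=1.0, remaining_gain=2.0),
    DimensionState("Design Coherence", score=42.8, weight=1.0, remaining_gain=50.0),
]

# Work the dimension whose queue moves the overall score the most.
best = max(dims, key=overall_impact)
print(best.name)  # Design Coherence
```

Even this crude version would have routed the bot away from the props queue, since the props queue's total possible impact is a fraction of a single point.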

2. No queue diversity enforcement

Once subjective items were skipped, the bot was locked into a single detector category with no escape. There's no mechanism to say "I've been doing the same kind of work for N minutes/items, maybe I should reconsider my queue."

3. The detector threshold is too aggressive for standalone work

The props detector fires at >14 properties with low confidence. But 15-20 props is completely normal for React components that accept event handlers, styling overrides, and ref forwarding. At this threshold, the detector generates findings for interfaces that no human reviewer would flag.

The current behavior is: detector fires → finding created → bot treats it as a work item → bot invents a remedy (always "split into sub-interfaces") → bot applies it mechanically.

The problem is that the detector has no semantic analysis:

  • It doesn't check if sub-interfaces would ever be reused
  • It doesn't check if the component is a leaf (all props consumed locally) vs a container (props forwarded to children)
  • It doesn't check if the interface already has logical groupings via comments
  • The action field is always "unknown", so the bot gets no guidance on how to fix — just that something is "bloated"

Suggestions

Option A: Mechanical findings inform subjective reviews, not standalone work

This is probably the right architectural direction. The props detector's output is useful context for a subjective review ("this interface has 18 fields — is that a readability problem in practice?") but not useful as a standalone work item ("split this interface into sub-interfaces").

The workflow would be:

  1. Mechanical detectors (props, unused, structural) run and produce signals
  2. These signals are fed as context into subjective dimension evaluations
  3. The subjective reviewer decides whether the mechanical signal represents an actual quality problem in context
  4. Only subjective findings become work items

This prevents the "blind mechanical transformation" failure mode entirely. The bot would never split an interface just because it has >14 fields — it would only split one if a subjective review determined the interface was actually hard to read or maintain.
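The workflow above could be wired roughly as follows. Everything here is hypothetical scaffolding (`Signal`, `Finding`, `subjective_review` are illustrative names, and the review step stands in for an LLM-backed evaluation); the point is only that mechanical detectors emit measurements, and work items are minted exclusively by the subjective pass.

```python
from dataclasses import dataclass

@dataclass
class Signal:
    """Mechanical detector output: a measurement, never a work item."""
    detector: str
    target: str
    value: int

@dataclass
class Finding:
    """Only subjective reviews may produce these."""
    target: str
    rationale: str

def subjective_review(signals: list[Signal]) -> list[Finding]:
    # Placeholder judgment: a real implementation would ask the subjective
    # reviewer whether each measured signal is an actual problem in context.
    findings = []
    for s in signals:
        if s.detector == "props" and s.value > 25:
            findings.append(Finding(s.target, f"{s.value} props hurt readability"))
    return findings

signals = [Signal("props", "Toolbar.tsx", 16), Signal("props", "Form.tsx", 31)]
work_items = subjective_review(signals)
# Toolbar.tsx's 16 props are context only; just Form.tsx becomes a work item.
```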

Option B: Much higher thresholds for standalone mechanical findings

If mechanical findings stay as standalone work items, the thresholds need to be much more extreme:

  • props at >14 fields: informational only (shown in reports, not queued)
  • props at >25 fields: low confidence finding (queued but deprioritized)
  • props at >40 fields: medium confidence finding
  • props at >55 fields: high confidence finding

At 25+ fields, you're much more likely to have a genuine readability problem. At 14 fields, you're just flagging normal React components.
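The tiering proposed above is trivial to encode. A minimal sketch, using the threshold values from this issue (`classify_props_finding` is a hypothetical helper, not an existing API):

```python
def classify_props_finding(field_count: int) -> str:
    """Map a props-interface field count to a finding tier."""
    if field_count > 55:
        return "high"            # queued with high confidence
    if field_count > 40:
        return "medium"
    if field_count > 25:
        return "low"             # queued but deprioritized
    if field_count > 14:
        return "informational"   # shown in reports, never queued
    return "none"

print(classify_props_finding(16))  # informational
print(classify_props_finding(31))  # low
```

Under this scheme, every interface the bot touched in the March 2-3 incident (14-18 fields) would have been informational at most.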

Option C: Diminishing-returns circuit breaker in the planner

Regardless of A or B, the planner should have awareness of:

  • Dimension ceiling proximity: "Code Quality is at 97%. Maximum possible gain from remaining findings is 2.8%. Other dimensions have 50%+ headroom."
  • Impact-per-item threshold: "Each remaining props fix moves the score by 0.045%. The minimum threshold for worth-doing is 0.1%."
  • Queue homogeneity detection: "The entire queue is from one detector category. This suggests the queue is exhausted of meaningful work, not that this category is the most important."
  • Stagnation detection: "I've completed 10 items from the same detector in the last hour and the dimension score hasn't meaningfully changed."
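The four checks above compose naturally into a single circuit-breaker predicate. A hedged sketch, where the `QueueSnapshot` shape and all thresholds are illustrative rather than drawn from the real planner:

```python
from dataclasses import dataclass

@dataclass
class QueueSnapshot:
    dimension_score: float        # 0-100
    max_remaining_gain: float     # ceiling proximity
    gain_per_item: float          # expected overall-score gain per fix
    detector_categories: set[str]
    items_done_same_detector: int
    score_delta_last_hour: float

def should_break(q: QueueSnapshot) -> bool:
    ceiling    = q.dimension_score > 95 and q.max_remaining_gain < 5
    low_impact = q.gain_per_item < 0.1   # worth-doing floor from this issue
    homogenous = len(q.detector_categories) == 1
    stagnant   = (q.items_done_same_detector >= 10
                  and q.score_delta_last_hour < 0.5)
    return ceiling or low_impact or homogenous or stagnant

# The props incident trips three of the four checks at once.
stuck = QueueSnapshot(97.2, 2.8, 0.045, {"props"}, 12, 0.2)
print(should_break(stuck))  # True
```

Any one check firing should at minimum trigger a queue re-evaluation rather than another mechanical fix.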

Option D: Require action prescriptions from detectors

The props detector currently sets action: "unknown" on every finding. The bot then invents its own remedy, which is always "split into sub-interfaces." If the detector prescribed specific actions based on semantic analysis (e.g., "extract callback props into a separate interface because they're forwarded to child X" vs "no action recommended — props are all consumed locally"), the bot could avoid applying inappropriate transformations.
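A sketch of what a prescribing detector could look like. The `PropsAnalysis` fields are hypothetical inputs a real detector would need to compute (forwarding analysis, local-consumption analysis); the point is that the action comes from semantics, not from the field count alone:

```python
from dataclasses import dataclass

@dataclass
class PropsAnalysis:
    field_count: int
    forwarded_groups: list[str]   # prop groups passed through to children
    all_consumed_locally: bool    # leaf component: no forwarding, no reuse

def prescribe_action(a: PropsAnalysis) -> str:
    if a.field_count <= 14:
        return "no action recommended"
    if a.all_consumed_locally:
        # Leaf component: splitting only adds naming overhead.
        return "no action recommended — props are all consumed locally"
    if a.forwarded_groups:
        groups = ", ".join(a.forwarded_groups)
        return f"extract forwarded props ({groups}) into a separate interface"
    return "unknown"

leaf = PropsAnalysis(18, [], True)
print(prescribe_action(leaf))
# no action recommended — props are all consumed locally
```

Every interface in the incident was a leaf in this sense, so a prescribing detector would have emitted "no action" across the board.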

Recommended path

I'd lean toward A + C: make mechanical detectors feed into subjective reviews rather than generating standalone work items, AND add diminishing-returns awareness to the planner as a safety net. Option B is a reasonable interim fix if A is too big a change.

The key insight is: a line count is not a quality judgment. The number 14 doesn't tell you whether an interface is hard to read. Only contextual evaluation can determine that. Mechanical detectors are good at measuring; subjective reviews are good at judging. The architecture should respect that distinction.
