
Planner lacks diminishing-returns awareness — grinds through low-impact mechanical findings #187

@peteromallet

Description


Summary

The planner has no mechanism to recognize when continued work on a dimension has negligible impact on the overall score. In practice, this means the bot can spend hours grinding through low-confidence mechanical findings (e.g., props interface splitting at 97% Code Quality) while high-impact subjective dimensions sit at 42-50% — a 14-27x impact differential per score point.

More broadly: mechanical detectors like props probably shouldn't generate standalone work items at their current thresholds. They should either inform subjective reviews, or only fire at much more extreme thresholds.

What happened

On March 2-3 2026, the bot worked through a reigh codebase queue. After a plan reset at 22:03, it processed items in default sort order: test_coverage (126 items), unused (18), structural (7), deprecated (5), etc. At 01:02, all 14 subjective dimensions were skipped due to a transient codex batch auth/hang issue. This left 9 props findings as the entire queue.

The bot then spent ~90 minutes applying the same transformation to every file: splitting the props interface into 3-5 sub-interfaces joined with &. It did this on interfaces with 14-18 fields where:

  • Every sub-interface was local (never exported, never imported elsewhere)
  • Every prop was consumed by the same component (no forwarding, no reuse)
  • The splits added naming overhead without improving readability
  • ~70 files were modified with pure type-level noise

The result was a net-negative change: more lines, more names to remember, no behavioral improvement.

Why this is a problem

1. No impact-threshold cutoff

The planner doesn't know that fixing all 35 remaining props findings would move the Code Quality dimension from 97.2% to ~99.2% — a 2% gain on a dimension that's already near-ceiling. Meanwhile, Design Coherence sits at 42.8%. Each point of Design Coherence is worth ~14x more to the overall score than each point of Code Quality.

The planner should be able to reason: "This dimension is above X%, and the remaining findings are all low-confidence. The expected score gain from working this queue is Y. There are other dimensions where Y would be 14-27x larger. I should work on those instead."
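The comparison above can be sketched as a simple expected-gain calculation. This is an illustrative sketch only: the `DimensionState` shape and `overall_impact` helper are hypothetical, and it assumes the overall score weights dimensions equally.

```python
from dataclasses import dataclass

@dataclass
class DimensionState:
    name: str
    score: float            # current dimension score, 0-100
    weight: float           # contribution to the overall score
    remaining_gain: float   # max dimension gain if every queued finding is fixed

def overall_impact(dim: DimensionState) -> float:
    """Expected overall-score gain from exhausting this dimension's queue."""
    return dim.weight * dim.remaining_gain

# Numbers from this incident: Code Quality is near-ceiling,
# Design Coherence has ~50 points of headroom.
dims = [
    DimensionState("Code Quality", score=97.2, weight=1.0, remaining_gain=2.0),
    DimensionState("Design Coherence", score=42.8, weight=1.0, remaining_gain=50.0),
]

# Work the dimension whose queue moves the overall score the most.
best = max(dims, key=overall_impact)
print(best.name)  # Design Coherence
```

Even this crude version would have routed the bot away from the props queue, since the props queue's total possible impact is a fraction of a single point.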

2. No queue diversity enforcement

Once subjective items were skipped, the bot was locked into a single detector category with no escape. There's no mechanism to say "I've been doing the same kind of work for N minutes/items, maybe I should reconsider my queue."

3. The detector threshold is too aggressive for standalone work

The props detector fires at >14 properties with low confidence. But 15-20 props is completely normal for React components that accept event handlers, styling overrides, and ref forwarding. At this threshold, the detector generates findings for interfaces that no human reviewer would flag.

The current behavior is: detector fires → finding created → bot treats it as a work item → bot invents a remedy (always "split into sub-interfaces") → bot applies it mechanically.

The problem is that the detector has no semantic analysis:

  • It doesn't check if sub-interfaces would ever be reused
  • It doesn't check if the component is a leaf (all props consumed locally) vs a container (props forwarded to children)
  • It doesn't check if the interface already has logical groupings via comments
  • The action field is always "unknown", so the bot gets no guidance on how to fix — just that something is "bloated"

Suggestions

Option A: Mechanical findings inform subjective reviews, not standalone work

This is probably the right architectural direction. The props detector's output is useful context for a subjective review ("this interface has 18 fields — is that a readability problem in practice?") but not useful as a standalone work item ("split this interface into sub-interfaces").

The workflow would be:

  1. Mechanical detectors (props, unused, structural) run and produce signals
  2. These signals are fed as context into subjective dimension evaluations
  3. The subjective reviewer decides whether the mechanical signal represents an actual quality problem in context
  4. Only subjective findings become work items

This prevents the "blind mechanical transformation" failure mode entirely. The bot would never split an interface just because it has >14 fields — it would only split one if a subjective review determined the interface was actually hard to read or maintain.
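The workflow above could be wired roughly as follows. Everything here is hypothetical scaffolding (`Signal`, `Finding`, `subjective_review` are illustrative names, and the review step stands in for an LLM-backed evaluation); the point is only that mechanical detectors emit measurements, and work items are minted exclusively by the subjective pass.

```python
from dataclasses import dataclass

@dataclass
class Signal:
    """Mechanical detector output: a measurement, never a work item."""
    detector: str
    target: str
    value: int

@dataclass
class Finding:
    """Only subjective reviews may produce these."""
    target: str
    rationale: str

def subjective_review(signals: list[Signal]) -> list[Finding]:
    # Placeholder judgment: a real implementation would ask the subjective
    # reviewer whether each measured signal is an actual problem in context.
    findings = []
    for s in signals:
        if s.detector == "props" and s.value > 25:
            findings.append(Finding(s.target, f"{s.value} props hurt readability"))
    return findings

signals = [Signal("props", "Toolbar.tsx", 16), Signal("props", "Form.tsx", 31)]
work_items = subjective_review(signals)
# Toolbar.tsx's 16 props are context only; just Form.tsx becomes a work item.
```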

Option B: Much higher thresholds for standalone mechanical findings

If mechanical findings stay as standalone work items, the thresholds need to be much more extreme:

  • props at >14 fields: informational only (shown in reports, not queued)
  • props at >25 fields: low confidence finding (queued but deprioritized)
  • props at >40 fields: medium confidence finding
  • props at >55 fields: high confidence finding

At 25+ fields, you're much more likely to have a genuine readability problem. At 14 fields, you're just flagging normal React components.
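The tiering proposed above is trivial to encode. A minimal sketch, using the threshold values from this issue (`classify_props_finding` is a hypothetical helper, not an existing API):

```python
def classify_props_finding(field_count: int) -> str:
    """Map a props-interface field count to a finding tier."""
    if field_count > 55:
        return "high"            # queued with high confidence
    if field_count > 40:
        return "medium"
    if field_count > 25:
        return "low"             # queued but deprioritized
    if field_count > 14:
        return "informational"   # shown in reports, never queued
    return "none"

print(classify_props_finding(16))  # informational
print(classify_props_finding(31))  # low
```

Under this scheme, every interface the bot touched in the March 2-3 incident (14-18 fields) would have been informational at most.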

Option C: Diminishing-returns circuit breaker in the planner

Regardless of A or B, the planner should have awareness of:

  • Dimension ceiling proximity: "Code Quality is at 97%. Maximum possible gain from remaining findings is 2.8%. Other dimensions have 50%+ headroom."
  • Impact-per-item threshold: "Each remaining props fix moves the score by 0.045%. The minimum threshold for worth-doing is 0.1%."
  • Queue homogeneity detection: "The entire queue is from one detector category. This suggests the queue is exhausted of meaningful work, not that this category is the most important."
  • Stagnation detection: "I've completed 10 items from the same detector in the last hour and the dimension score hasn't meaningfully changed."
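The four checks above compose naturally into a single circuit-breaker predicate. A hedged sketch, where the `QueueSnapshot` shape and all thresholds are illustrative rather than drawn from the real planner:

```python
from dataclasses import dataclass

@dataclass
class QueueSnapshot:
    dimension_score: float        # 0-100
    max_remaining_gain: float     # ceiling proximity
    gain_per_item: float          # expected overall-score gain per fix
    detector_categories: set[str]
    items_done_same_detector: int
    score_delta_last_hour: float

def should_break(q: QueueSnapshot) -> bool:
    ceiling    = q.dimension_score > 95 and q.max_remaining_gain < 5
    low_impact = q.gain_per_item < 0.1   # worth-doing floor from this issue
    homogenous = len(q.detector_categories) == 1
    stagnant   = (q.items_done_same_detector >= 10
                  and q.score_delta_last_hour < 0.5)
    return ceiling or low_impact or homogenous or stagnant

# The props incident trips three of the four checks at once.
stuck = QueueSnapshot(97.2, 2.8, 0.045, {"props"}, 12, 0.2)
print(should_break(stuck))  # True
```

Any one check firing should at minimum trigger a queue re-evaluation rather than another mechanical fix.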

Option D: Require action prescriptions from detectors

The props detector currently sets action: "unknown" on every finding. The bot then invents its own remedy, which is always "split into sub-interfaces." If the detector prescribed specific actions based on semantic analysis (e.g., "extract callback props into a separate interface because they're forwarded to child X" vs "no action recommended — props are all consumed locally"), the bot could avoid applying inappropriate transformations.
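A sketch of what a prescribing detector could look like. The `PropsAnalysis` fields are hypothetical inputs a real detector would need to compute (forwarding analysis, local-consumption analysis); the point is that the action comes from semantics, not from the field count alone:

```python
from dataclasses import dataclass

@dataclass
class PropsAnalysis:
    field_count: int
    forwarded_groups: list[str]   # prop groups passed through to children
    all_consumed_locally: bool    # leaf component: no forwarding, no reuse

def prescribe_action(a: PropsAnalysis) -> str:
    if a.field_count <= 14:
        return "no action recommended"
    if a.all_consumed_locally:
        # Leaf component: splitting only adds naming overhead.
        return "no action recommended — props are all consumed locally"
    if a.forwarded_groups:
        groups = ", ".join(a.forwarded_groups)
        return f"extract forwarded props ({groups}) into a separate interface"
    return "unknown"

leaf = PropsAnalysis(18, [], True)
print(prescribe_action(leaf))
# no action recommended — props are all consumed locally
```

Every interface in the incident was a leaf in this sense, so a prescribing detector would have emitted "no action" across the board.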

Recommended path

I'd lean toward A + C: make mechanical detectors feed into subjective reviews rather than generating standalone work items, AND add diminishing-returns awareness to the planner as a safety net. Option B is a reasonable interim fix if A is too big a change.

The key insight is: a line count is not a quality judgment. The number 14 doesn't tell you whether an interface is hard to read. Only contextual evaluation can determine that. Mechanical detectors are good at measuring; subjective reviews are good at judging. The architecture should respect that distinction.
