feat(gardener/classifier): expand digest budget and clean noise; cover leaves + soft_links

## TL;DR

The tree-digest builder (`src/products/gardener/engine/classifiers/tree-digest.ts`) is operating with a budget calibrated for an older context window and leaves several easy quality wins on the table. I found three concrete problems while inspecting a real gardener-comment run against paperclipai/paperclip:

1. **Digest budget is 100KB and truncates silently.** At roughly 4 bytes per token that is about 25K tokens, a small fraction of the 200K-token window Claude 4.5/4.6/4.7 actually give us.
2. **Noise pollutes the digest** — `.gardener-tree-cache/` auto-generated copies and `drift/` placeholder NODE.md files consume budget for zero signal.
3. **Digest ignores leaf files and `soft_links`** — classifier judges PRs against NODE.md one-liners only, misses the actual decisions.

Separately: there's no task-aware model selection, and users have no idea `GARDENER_CLASSIFIER_MODEL` exists.

## Repro / evidence

Real run: `first-tree gardener comment --pr 4368 --repo paperclipai/paperclip --tree-path ~/paperclip-tree`, instrumented to dump the full prompt. Stats:

| section | bytes |
|---|---|
| system prompt | 1,469 |
| tree digest | **24,794** (138 NODE.md entries, ~half redundant `.gardener-tree-cache` + `drift` placeholders) |
| PR body | 8,023 |
| diff | 100,719 (under 200KB cap) |
| **total prompt** | ~135KB |

138 nodes used 25KB of a 100KB budget, but only about half of those nodes carried useful signal. The real paperclip-tree only has ~80 real NODE.md files; the rest were `.gardener-tree-cache/` cache copies (entries like `.gardener-tree-cache/adapters/claude-local/NODE.md`) and `drift/paperclip-e392f6b1/.../NODE.md` stubs labeled "Auto-generated intermediate node for sync proposals".

I also verified the model claim separately: I ran each of `claude-haiku-4-5 / sonnet-4-5 / sonnet-4-6 / opus-4-6 / opus-4-7` through `claude -p` and asked it to report its own context size. All report **200K tokens**. So no model change buys extra budget — the win is purely from better use of the existing window.

## Proposals

### 1. Grow `DIGEST_BUDGET_BYTES` and make it model-aware

Today: hard-coded 100KB. Suggested:

- Default to **500KB**, enough to cover realistic trees (up to ~2700 NODE.md) with leaves and soft_links.
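- Or derive from the selected model: `digestCap = contextSize(model) / 4` (leaves room for PR body + diff + output). Context size is in tokens, so the result needs converting to bytes before it is compared against `DIGEST_BUDGET_BYTES`.
- Emit a warning on `stderr` when the cap is reached. Today nodes are dropped silently, which is confusing when a node is mysteriously missing from the citations.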
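
A minimal sketch of the model-aware cap plus the budget warning, not the actual `tree-digest.ts` code. Everything except `DIGEST_BUDGET_BYTES` is a hypothetical name: `CONTEXT_TOKENS`, `BYTES_PER_TOKEN`, `digestCapBytes`, `buildDigest`, and the `DigestEntry` shape are assumptions for illustration, and the 4 bytes-per-token ratio is a rough guess rather than a measured value.

```typescript
// Proposed fixed default from this issue.
const DIGEST_BUDGET_BYTES = 500 * 1024;

// ASSUMPTION: context sizes in tokens, as self-reported by the models above.
const CONTEXT_TOKENS: Record<string, number> = {
  "claude-haiku-4-5": 200_000,
  "claude-sonnet-4-6": 200_000,
};

// ASSUMPTION: rough average for English-heavy markdown; not measured.
const BYTES_PER_TOKEN = 4;

// A quarter of the context window, converted to bytes.
// Falls back to the fixed default for models we have no size for.
function digestCapBytes(model: string): number {
  const tokens = CONTEXT_TOKENS[model];
  if (tokens === undefined) return DIGEST_BUDGET_BYTES;
  return Math.floor((tokens / 4) * BYTES_PER_TOKEN);
}

// Hypothetical entry shape: one line per node in the digest.
interface DigestEntry {
  path: string;
  summary: string;
}

// Appends entries until the byte cap is reached, skips entries that do not
// fit, and reports the number of skipped entries through `warn`.
function buildDigest(
  entries: DigestEntry[],
  capBytes: number,
  warn: (msg: string) => void = console.warn, // console.warn writes to stderr in Node
): string {
  const encoder = new TextEncoder();
  const lines: string[] = [];
  let used = 0;
  let dropped = 0;
  for (const entry of entries) {
    const line = `- ${entry.path}: ${entry.summary}\n`;
    const size = encoder.encode(line).length; // UTF-8 byte length
    if (used + size > capBytes) {
      dropped++;
      continue;
    }
    lines.push(line);
    used += size;
  }
  if (dropped > 0) {
    warn(
      `tree-digest: budget of ${capBytes} bytes reached; ` +
        `dropped ${dropped} of ${entries.length} nodes`,
    );
  }
  return lines.join("");
}
```

Under these assumptions the derived cap for a 200K-token model comes out at 200,000 bytes, which is lower than the 500KB fixed default, so the two options above are worth comparing before picking one.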

### 2. Extend `SKIP_DIRS` and filter auto-generated drift

- Add `.gardener-tree-cache` to `SKIP_DIRS`.
- Skip any NODE.md whose extracted `summary` is exactly "Auto-generated intermediate node for sync proposals" (these are gardener's own scaffolding from `drift/<source-id>/.../NODE.md`, not real decisions).

On the run above this would drop ~half the 138 entries and return the digest to a signal-dense state.
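
The two filters could be sketched roughly as follows. The existing contents of `SKIP_DIRS` are not shown in this issue, so the `.git` and `node_modules` entries are assumptions, and `NodeEntry`, `DRIFT_PLACEHOLDER_SUMMARY`, and `isNoise` are hypothetical names.

```typescript
// ASSUMPTION: ".git" and "node_modules" stand in for whatever SKIP_DIRS
// already contains; only ".gardener-tree-cache" is the proposed addition.
const SKIP_DIRS = new Set<string>([".git", "node_modules", ".gardener-tree-cache"]);

// Exact summary text of gardener's own drift scaffolding nodes.
const DRIFT_PLACEHOLDER_SUMMARY =
  "Auto-generated intermediate node for sync proposals";

// Hypothetical shape of a collected NODE.md entry.
interface NodeEntry {
  path: string; // tree-relative, "/"-separated
  summary: string;
}

// True when the entry sits under a skipped directory or is a drift placeholder.
function isNoise(entry: NodeEntry): boolean {
  const segments = entry.path.split("/");
  if (segments.some((segment) => SKIP_DIRS.has(segment))) return true;
  return entry.summary.trim() === DRIFT_PLACEHOLDER_SUMMARY;
}

// Keeps only entries that carry real decisions.
function filterNoise(entries: NodeEntry[]): NodeEntry[] {
  return entries.filter((entry) => !isNoise(entry));
}
```

Matching on the summary text rather than on the `drift/` prefix keeps any hand-written node under `drift/` in the digest, at the cost of coupling the filter to the exact placeholder wording.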

### 3. Include leaves + `soft_links` in the digest

The richest signal is in leaf files (e.g. `product/task-system/issue-blockers/issue-graph-liveness.md`), not in the parent NODE.md section summary. Today the classifier never sees leaves, so it's judging a PR that touches a specific decision against a parent's one-line description.

Suggested heuristic: when a PR diff touches files whose paths overlap a NODE.md's domain path (e.g. PR touches `server/src/issues/...` → include `product/task-system/**` leaves), include those leaf files in the digest. Path-prefix match is cheap and local — no embedding needed.
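
A rough sketch of the path-prefix heuristic. How source paths map to tree domains is not specified in this issue, so the `DomainRule` table is an assumption, and `domainsTouched` and `selectLeaves` are hypothetical helper names.

```typescript
// ASSUMPTION: an explicit table maps source-repo path prefixes to tree domains.
interface DomainRule {
  sourcePrefix: string; // e.g. "server/src/issues/"
  treeDomain: string; // e.g. "product/task-system"
}

// Returns the tree domains whose source prefix matches any path in the diff.
function domainsTouched(diffPaths: string[], rules: DomainRule[]): string[] {
  const hit = new Set<string>();
  for (const path of diffPaths) {
    for (const rule of rules) {
      if (path.startsWith(rule.sourcePrefix)) hit.add(rule.treeDomain);
    }
  }
  return Array.from(hit);
}

// Picks leaf markdown files (anything except NODE.md) under the touched domains.
function selectLeaves(leafPaths: string[], domains: string[]): string[] {
  return leafPaths.filter((leaf) => {
    const isLeaf =
      leaf.endsWith(".md") && leaf !== "NODE.md" && !leaf.endsWith("/NODE.md");
    if (!isLeaf) return false;
    return domains.some((domain) => {
      // Normalize so "product/task" does not match "product/task-system/...".
      const prefix = domain.endsWith("/") ? domain : domain + "/";
      return leaf.startsWith(prefix);
    });
  });
}
```

The selected leaves would still need to go through the same byte budget as the NODE.md entries, since a broad prefix can pull in many files.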

Same argument for `soft_links`: if `product/task-system/issue-links/NODE.md` has `soft_links: [product/agent-model/NODE.md]`, include the linked node's summary too. That way the classifier can see the cross-domain relationship.
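
Soft-link expansion might look like the sketch below. The `TreeNode` shape and the `withSoftLinks` helper are hypothetical, and following links for one hop only is an assumption made here to keep the digest bounded.

```typescript
// Hypothetical in-memory shape of a parsed NODE.md.
interface TreeNode {
  path: string;
  summary: string;
  softLinks: string[]; // paths from the `soft_links` frontmatter
}

// Returns the selected nodes plus the nodes they soft-link to (one hop only),
// de-duplicated by path and in first-seen order.
function withSoftLinks(
  selected: string[],
  nodes: Map<string, TreeNode>,
): TreeNode[] {
  const out = new Map<string, TreeNode>();
  for (const path of selected) {
    const node = nodes.get(path);
    if (!node) continue; // unknown path: nothing to include
    out.set(node.path, node);
    for (const link of node.softLinks) {
      const target = nodes.get(link);
      if (target) out.set(target.path, target); // dangling links are ignored
    }
  }
  return Array.from(out.values());
}
```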

### 4. Task-aware default model

Today `DEFAULT_MODEL = "claude-haiku-4-5"` across every classifier call. For `gardener comment` this is fine: the call rate is high, the task is classification-only, and haiku handles it. But for `gardener sync --open-issues` and `gardener draft-node`, the LLM is generating tree-node bodies or reasoning about cross-domain drift, and haiku is underpowered for that.

Observed on two live paperclip PRs (one per model):

- PR #4368 (new adapter, should trigger "aligned with existing adapter pattern"), **haiku-4-5**: `ALIGNED / low`, zero cited nodes. Correct but shallow.
- PR #4367 (queue-sweep governance change), **sonnet-4-6** via `GARDENER_CLASSIFIER_MODEL=claude-sonnet-4-6`: `NEEDS_REVIEW / medium`, 4 cited nodes, flagged that the PR bundled an unrelated openclaw-gateway session-key change + pointed at `UNHEALTHY_AGENT_STATUSES` vs the canonical agent state machine.

Sonnet catches cross-domain signals haiku misses. Suggest:

- `gardener comment` default: haiku-4-5 (cheap, frequent).
- `gardener sync` / `gardener draft-node` default: sonnet-4-6 (quality-sensitive, rare).
- Both remain overridable via `GARDENER_CLASSIFIER_MODEL`.
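
The defaults above could be resolved along these lines. This is a sketch, not the code in `select.ts`: `GardenerTask`, `TASK_DEFAULT_MODEL`, and `resolveModel` are hypothetical names, and only the `GARDENER_CLASSIFIER_MODEL` variable and the model IDs come from this issue.

```typescript
// Hypothetical task identifiers for the three commands discussed above.
type GardenerTask = "comment" | "sync" | "draft-node";

// Proposed per-task defaults: cheap model for frequent classification,
// stronger model for rare, quality-sensitive generation.
const TASK_DEFAULT_MODEL: Record<GardenerTask, string> = {
  comment: "claude-haiku-4-5",
  sync: "claude-sonnet-4-6",
  "draft-node": "claude-sonnet-4-6",
};

// The env override wins when set to a non-empty value; otherwise the
// task-specific default applies.
function resolveModel(
  task: GardenerTask,
  env: Record<string, string | undefined>,
): string {
  const override = env["GARDENER_CLASSIFIER_MODEL"]?.trim();
  return override ? override : TASK_DEFAULT_MODEL[task];
}
```

In Node the call site would pass `process.env` as the second argument; taking the environment as a parameter keeps the function easy to test.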

### 5. Document `GARDENER_CLASSIFIER_MODEL` in onboarding

It exists in the code (`select.ts:50`) and in `install-workflow.ts` workflow comments, but it's not in `skills/first-tree/references/onboarding.md`. Users don't know they can upgrade the model for their tree without code changes. Suggest a one-line mention in Step 6 or the Pitfalls section.

## Acceptance criteria

- [ ] `DIGEST_BUDGET_BYTES` default raised (≥ 500KB) and/or derived from model context size.
- [ ] Budget exhaustion emits a stderr warning that includes the number of nodes dropped.
- [ ] `.gardener-tree-cache` in `SKIP_DIRS`; `drift/` placeholder auto-generated nodes filtered out.
- [ ] Digest optionally includes leaf `.md` files when diff paths overlap node domain.
- [ ] Digest optionally includes soft-linked nodes' summaries when cited via `soft_links` frontmatter.
- [ ] Default model is task-specific (haiku for comment, sonnet-4-6 for sync/draft-node) unless overridden.
- [ ] `GARDENER_CLASSIFIER_MODEL` documented in onboarding.md.
- [ ] Existing tests still pass; new tests for each heuristic (noise filter, leaf inclusion, soft_link inclusion, budget warning).

## E2E test requirement

Before merging, verify:

1. On paperclipai/paperclip, `gardener comment --pr <open PR>` runs end-to-end with the new digest builder; emitted comment cites real (non-hallucinated) paths.
2. The `.gardener-tree-cache` / drift placeholders are absent from the posted citation list.
3. Budget warning prints to stderr when cap is hit (can repro by lowering the cap in a test env).
4. `GARDENER_CLASSIFIER_MODEL=claude-sonnet-4-6 gardener comment ...` runs with sonnet; matches the observed behavior on PR #4367 above.

## Env / context

- First-tree @ v0.3.2 (main as of 2026-04-24).
- Classifier call site: `src/products/gardener/engine/classifiers/claude-cli.ts`.
- Budget site: `src/products/gardener/engine/classifiers/tree-digest.ts:25`.
- Related: #272 (claude-cli classifier), #339 (DIFF_CAP raise + diff-noise filter — same class of fix for the diff side).

/cc @serenakeyitan
