fix(gardener/classifiers): raise DIFF_CAP and filter noise #339
serenakeyitan merged 2 commits into main
Refs #338.

Both classifiers were capping the diff at 20KB (roughly 6% of Haiku 4.5's 200K context window), which truncated typical feature PRs at ~300-400 lines and let lockfile regeneration eat the entire budget.

- DIFF_CAP: 20_000 → 200_000 (both anthropic.ts and claude-cli.ts)
- DIGEST_BUDGET_BYTES: 30_000 → 100_000 (tree-digest.ts)
- New shared diff-filter.ts extracting DIFF_NOISE_PATTERNS from sync.ts. Both classifiers now strip lockfiles, dist/build/out/coverage/node_modules/__pycache__ hunks, and minified/map/snap artifacts BEFORE applying the byte cap, so the cap bounds real code instead of noise.
- Skip the whole Diff section when filtering leaves nothing behind.

Total input rises from ~55KB to ~310KB (~38% of Haiku's window), leaving headroom for prompt-caching stability, tokenizer variance, and Anthropic's soft ~180-190K prompt-length threshold.

Tests:
- New tests/gardener/gardener-diff-filter.test.ts (11 cases)
- pnpm typecheck clean
- pnpm test: 1203 passed, 0 failed
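The filter-then-cap ordering described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in diff-filter.ts: the pattern list, `isNoisePath`, `filterNoise`, and `prepareDiff` are hypothetical names, and the cap here truncates by string length as a stand-in for the byte cap.

```typescript
// Illustrative pattern list; the real DIFF_NOISE_PATTERNS may differ.
const NOISE_PATTERNS: RegExp[] = [
  /(^|\/)(package-lock\.json|pnpm-lock\.yaml|yarn\.lock)$/, // lockfiles
  /(^|\/)(dist|build|out|coverage|node_modules|__pycache__)\//, // generated dirs
  /\.(min\.js|min\.css|map|snap)$/, // minified, source-map, snapshot artifacts
];

function isNoisePath(path: string): boolean {
  return NOISE_PATTERNS.some((re) => re.test(path));
}

// Split a unified diff into per-file sections and drop the noisy ones.
// A section whose header cannot be parsed is kept (fail-open).
function filterNoise(diff: string): string {
  return diff
    .split(/(?=^diff --git )/m)
    .filter((section) => {
      const header = section.split("\n", 1)[0] ?? "";
      const match = /^diff --git a\/.+ b\/(.+)$/.exec(header);
      return match === null || !isNoisePath(match[1] ?? "");
    })
    .join("");
}

const DIFF_CAP = 200_000;

// Filter first, then cap. Simplification: slice() counts UTF-16 code units,
// not bytes, so this only approximates a byte cap for non-ASCII content.
function prepareDiff(diff: string, cap: number = DIFF_CAP): string {
  return filterNoise(diff).slice(0, cap);
}
```

With this ordering, a large lockfile section is removed before truncation, so whatever survives the cap is code a reviewer would actually read.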
yuezengwu left a comment
Reviewed the diff and ran the new tests locally (11/11 passing, typecheck clean). The fix is well-scoped: filtering noise before the byte cap is the right ordering, the ~38% of Haiku's 200K window for total prompt input leaves sane headroom for prompt-caching + tokenizer variance, and the new filterDiffNoise tests cover the important shapes (lockfiles, dist hunks, all-noise → empty, fail-open on malformed headers, real-code preservation). Guarding the ## Diff section so an all-noise PR gets no empty code fence is a nice touch.
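The `## Diff` guard mentioned above could look something like the sketch below. `buildPrompt` and the `## Title` / `## Description` sections are hypothetical; only the idea of omitting the Diff section when the filtered diff is empty comes from the PR.

```typescript
// Built at runtime so the sketch does not embed a literal code fence.
const FENCE = "`".repeat(3);

// Hypothetical prompt assembly: the Diff section is appended only when
// the filtered diff still has content.
function buildPrompt(title: string, description: string, filteredDiff: string): string {
  const sections: string[] = [
    `## Title\n${title}`,
    `## Description\n${description}`,
  ];
  if (filteredDiff.trim().length > 0) {
    sections.push(`## Diff\n${FENCE}diff\n${filteredDiff}\n${FENCE}`);
  }
  return sections.join("\n\n");
}
```

A lockfile-only PR would then produce a prompt with no Diff heading at all, rather than a heading followed by an empty fence.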
One thing that does not match the PR description, worth a small follow-up:
- The description says "Extract `DIFF_NOISE_PATTERNS` from `sync.ts` into a new shared `engine/classifiers/diff-filter.ts`", but `src/products/gardener/engine/sync.ts` (lines 678–686) still defines its own private `DIFF_NOISE_PATTERNS` and `isDiffNoise`. The two lists are currently identical, but there are now two sources of truth to keep in sync. `formatPrDiffForPrompt` operates on `PrFileChange[]` (file-level) while the new helper operates on raw unified-diff text, so they can't share the whole function, but `sync.ts` could still `import { DIFF_NOISE_PATTERNS, isDiffNoise } from "./classifiers/diff-filter.js"` and drop its local copy. Not a blocker, but worth closing the loop so the next pattern tweak doesn't have to be made in two files.
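As a rough sketch of that follow-up, the file-level path can reuse the same predicate even though it never sees raw diff text. Everything here is shown in one file for brevity; in the repo the first two declarations would live in the shared module and be imported. The pattern list is abbreviated, and the fields of `PrFileChange` and the helper `keepReviewableFiles` are assumptions made for illustration.

```typescript
// Would be exported from the shared diff-filter module (abbreviated list).
const DIFF_NOISE_PATTERNS: readonly RegExp[] = [
  /(^|\/)(package-lock\.json|pnpm-lock\.yaml|yarn\.lock)$/,
  /(^|\/)(dist|build|out|coverage|node_modules|__pycache__)\//,
];

function isDiffNoise(path: string): boolean {
  return DIFF_NOISE_PATTERNS.some((re) => re.test(path));
}

// Assumed shape; the real PrFileChange type may carry different fields.
interface PrFileChange {
  path: string;
  additions: number;
  deletions: number;
}

// File-level caller: filters structured entries with the shared predicate
// instead of keeping a private copy of the pattern list.
function keepReviewableFiles(files: PrFileChange[]): PrFileChange[] {
  return files.filter((file) => !isDiffNoise(file.path));
}
```

The two call sites still differ in input shape, but a pattern change now happens in one place.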
Minor/optional observations:
- The `b/(.+)$` header regex will misbehave on filenames containing `b/` (e.g. a path like `foo b/bar.ts`). Extremely unlikely in real repos; fail-open means the worst case is "noise slips through," so not worth fixing unless you want belt-and-suspenders.
- The split on `/(?=^diff --git )/m` assumes that literal string never appears at the start of a line inside a patch body. Fine for `gh pr diff` output in practice; just noting for anyone who runs into a diff-of-a-diff edge case later.
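To make the first edge case concrete, here is a small sketch. `HEADER_RE` is an assumed reconstruction of the header regex (the real one may be anchored differently), and `pathFromHeaderSymmetric` is a hypothetical alternative, not something the PR contains. It relies on the old and new paths being identical when a file is not renamed.

```typescript
// Assumed shape of the header regex discussed above.
const HEADER_RE = / b\/(.+)$/;

function pathFromHeaderNaive(header: string): string | null {
  const match = HEADER_RE.exec(header);
  return match ? match[1] ?? null : null;
}

// Alternative for the non-rename case: the header reads
// "diff --git a/<p> b/<p>", so <p> can be recovered from the length alone.
// Returns null (caller can fail open) for renames or malformed headers.
function pathFromHeaderSymmetric(header: string): string | null {
  const prefix = "diff --git a/";
  if (!header.startsWith(prefix)) return null;
  const rest = header.slice(prefix.length);
  const len = (rest.length - " b/".length) / 2;
  if (!Number.isInteger(len) || len <= 0) return null;
  const path = rest.slice(0, len);
  return rest === `${path} b/${path}` ? path : null;
}

const tricky = "diff --git a/foo b/bar.ts b/foo b/bar.ts";
const naive = pathFromHeaderNaive(tricky); // not "foo b/bar.ts"
const symmetric = pathFromHeaderSymmetric(tricky); // "foo b/bar.ts"
```

Since the naive version only causes noise to slip through, the extra logic is optional, as the review says.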
Approving — the duplication note is the only thing I'd actually want addressed, and a follow-up PR is fine if you prefer to keep this one tight.
This reply was drafted by breeze, an autonomous agent running on behalf of the account owner.
Bumps version to 0.3.2. Includes:

- #339 fix(gardener/classifiers): raise DIFF_CAP and filter noise (refs #338)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Summary
- Raise `DIFF_CAP` from 20KB → 200KB in both `anthropic.ts` and `claude-cli.ts`. 20KB is ~6% of Haiku 4.5's 200K window and truncates typical feature PRs at ~300-400 lines.
- Raise `DIGEST_BUDGET_BYTES` from 30KB → 100KB in `tree-digest.ts`.
- Extract `DIFF_NOISE_PATTERNS` from `sync.ts` into a new shared `engine/classifiers/diff-filter.ts`. Both classifiers now strip lockfiles, `dist/build/out/coverage/node_modules/__pycache__` hunks, and minified/map/snap artifacts BEFORE applying the byte cap, so the cap bounds real code instead of noise.
- Skip the `## Diff` section when filtering leaves nothing (e.g. lockfile-only PRs now get a clean prompt instead of an empty code fence).

Total prompt input rises from ~55KB to ~310KB (~38% of Haiku's 200K window), leaving headroom for prompt-caching stability, tokenizer variance on mixed Chinese/code content, and Anthropic's soft ~180-190K prompt-length threshold.
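The ~38% figure above can be reproduced with a back-of-the-envelope conversion from bytes to tokens. The ratio of 4 bytes per token used below is an assumption for illustration, not a number stated in the PR; mixed Chinese/code content would tokenize differently, which is the variance the headroom is meant to absorb.

```typescript
const CONTEXT_WINDOW_TOKENS = 200_000;

// Assumption: roughly 4 bytes per token. Real ratios vary with content.
const ASSUMED_BYTES_PER_TOKEN = 4;

// Fraction of the context window consumed by a prompt of the given size.
function windowShare(promptBytes: number): number {
  const approxTokens = promptBytes / ASSUMED_BYTES_PER_TOKEN;
  return approxTokens / CONTEXT_WINDOW_TOKENS;
}

// ~310KB of total prompt input after this change.
const shareAfter = windowShare(310_000); // 0.3875, i.e. a little under 39%
```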
Refs #338.
Test plan
- `pnpm typecheck` clean
- `pnpm test`: 1203 passed, 0 failed, 51 skipped (no regressions)
- `tests/gardener/gardener-diff-filter.test.ts`: 11 cases covering lockfile drop, dist/ drop, minified artifacts, all-noise PRs (empty output), fail-open on malformed hunks, and real-code-only preservation
- Run `first-tree gardener comment --pr N --repo o/r` on a lockfile-heavy PR locally and confirm the verdict is grounded in real code, not lock diff