feat(optimize): MCP tool coverage detector with cache-aware costing #223
Adds a per-tool optimizer finding for MCP servers whose schema is loaded
on every turn but rarely invoked. Builds on the existing server-level
`detectUnusedMcp` (zero invocations) by reporting partial-use cases:
"loaded 54 tools, called 0" or "loaded 26 tools, called 2 (8% coverage)".
Inventory comes from Claude Code's JSONL `attachment.deferred_tools_delta`
entries: `addedNames` lists the exact tools available at that turn,
including every fully-qualified `mcp__<server>__<tool>` name. We union
across all delta entries in a session (not just the first) because tool
availability can change mid-session when the user reloads MCP config or
a subagent inherits a different tool set. Names that don't match the
`mcp__<server>__<tool>` shape with both segments non-empty are rejected
at extraction so downstream `split('__')` consumers can't be poisoned.
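The shape check and the per-session union described above can be sketched roughly as follows. This is a minimal illustration, not the PR's `extractMcpInventory`: `parseMcpToolName` and `unionInventory` are hypothetical names, and the input is assumed to be the `addedNames` arrays already pulled out of each delta entry.

```typescript
// Hypothetical helper: split a fully-qualified MCP tool name into its parts,
// or return null when the name is not `mcp__<server>__<tool>` with both
// segments non-empty.
function parseMcpToolName(name: string): { server: string; tool: string } | null {
  const prefix = 'mcp__'
  if (!name.startsWith(prefix)) return null
  const rest = name.slice(prefix.length)
  const sep = rest.indexOf('__')
  // sep === -1: no tool segment at all; sep === 0: empty server segment.
  if (sep <= 0) return null
  const tool = rest.slice(sep + 2)
  if (tool.length === 0) return null
  return { server: rest.slice(0, sep), tool }
}

// Union the names from every delta entry in a session (not just the first),
// keep only well-formed MCP names, and return a sorted unique list.
function unionInventory(addedNamesPerDelta: string[][]): string[] {
  const seen = new Set<string>()
  for (const names of addedNamesPerDelta) {
    for (const name of names) {
      if (parseMcpToolName(name) !== null) seen.add(name)
    }
  }
  return Array.from(seen).sort()
}
```

Splitting on the first `__` after the prefix keeps single underscores inside a server name such as `claude_ai_Asana` intact, while `mcp__server`, `mcp__server__`, and `mcp____tool` are all rejected.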
Token-savings estimates are cache-aware. MCP tool schemas live in the
cached prefix of the system prompt: a session pays the full input price
on each cache-creation turn (rebuilds happen every ~5 minutes of
inactivity) and the cache-read discount on subsequent turns. Each call's
contribution is capped at its observed `cacheCreationInputTokens` /
`cacheReadInputTokens` so we never claim more MCP overhead than the
call's own cache buckets could contain.
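As a rough sketch of the per-call cap, assuming a single estimate of the unused schema size: `estimateEffectiveSchemaTokens` and `unusedSchemaTokens` are hypothetical names, and the two multipliers (1.25x for cache writes, 0.1x for cache reads, relative to the base input price) are assumptions that should be checked against current pricing.

```typescript
// Assumed per-call usage shape; the two field names come from the text above.
interface CallUsage {
  cacheCreationInputTokens: number
  cacheReadInputTokens: number
}

// Price multipliers relative to the base input price (assumed values).
const CACHE_WRITE_MULTIPLIER = 1.25
const CACHE_READ_MULTIPLIER = 0.1

// Estimate the price-weighted tokens attributable to unused MCP schema.
// `unusedSchemaTokens` is a hypothetical estimate of how many prompt tokens
// the unused tool schemas occupy.
function estimateEffectiveSchemaTokens(calls: CallUsage[], unusedSchemaTokens: number): number {
  let total = 0
  for (const call of calls) {
    // Never claim more than the call's own cache buckets contain.
    const written = Math.min(unusedSchemaTokens, call.cacheCreationInputTokens)
    const read = Math.min(unusedSchemaTokens, call.cacheReadInputTokens)
    // A call can both write and read cache, so both buckets are counted.
    total += written * CACHE_WRITE_MULTIPLIER + read * CACHE_READ_MULTIPLIER
  }
  return total
}
```

With 2,000 unused schema tokens, a call that wrote 10,000 cache tokens contributes 2,000 x 1.25, while a call that wrote only 500 contributes 500 x 1.25 because the cap binds.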
When multiple servers are flagged, costing happens in a single combined
pass: the per-call cap applies to the total unused-schema budget across
all flagged servers, not per server. Two flagged servers cannot both
independently claim the same call's cache bucket, which would otherwise
overstate `tokensSaved` and misclassify findings as high impact.
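The difference between capping per server and capping once per call can be illustrated with a small sketch. Both function names are hypothetical, and only the cache-read bucket is modelled to keep the example short.

```typescript
// Assumed minimal per-call shape for this illustration.
interface CallUsage {
  cacheReadInputTokens: number
}

// Naive: cap each server separately, then add. Two servers can each claim
// the whole bucket, so the sum may exceed what the call actually read.
function perServerCapped(call: CallUsage, unusedTokensByServer: number[]): number {
  let total = 0
  for (const t of unusedTokensByServer) total += Math.min(t, call.cacheReadInputTokens)
  return total
}

// Combined: sum the unused-schema budget first, then cap once per call.
function combinedCapped(call: CallUsage, unusedTokensByServer: number[]): number {
  let budget = 0
  for (const t of unusedTokensByServer) budget += t
  return Math.min(budget, call.cacheReadInputTokens)
}

// Two flagged servers with 3,000 unused schema tokens each, one call that
// read 4,000 cached tokens.
const call: CallUsage = { cacheReadInputTokens: 4000 }
const naive = perServerCapped(call, [3000, 3000]) // 6000: more than the bucket holds
const combined = combinedCapped(call, [3000, 3000]) // 4000: bounded by the bucket
```

When the combined budget fits inside the bucket, the two approaches agree; they diverge only when the per-server claims together exceed what the call could contain.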
A session counts toward `loadedSessions` (and toward the cost estimate)
only if its observed inventory included the server. Pure invocation-only
sessions, where the server appears in `mcpBreakdown` or `call.mcpTools`
without any matching `deferred_tools_delta`, do not satisfy the
`>= 2 sessions` threshold on their own. The same invariant applies in
`estimateMcpSchemaCost` so the two passes agree.
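A minimal sketch of that invariant, assuming a simplified session shape: `SessionLike`, `invokedServers`, and `countLoadedSessions` are hypothetical, and only `mcpInventory` mirrors the optional field this PR adds to `SessionSummary`.

```typescript
// Assumed minimal session shape for this illustration.
interface SessionLike {
  mcpInventory?: string[]
  invokedServers: string[]
}

// Count sessions whose observed inventory lists at least one tool of `server`.
// Sessions that only invoked the server, with no inventory, are not counted.
function countLoadedSessions(sessions: SessionLike[], server: string): number {
  const prefix = `mcp__${server}__`
  let loaded = 0
  for (const s of sessions) {
    const inventory = s.mcpInventory ?? []
    if (inventory.some((name) => name.startsWith(prefix))) loaded += 1
  }
  return loaded
}

const sessions: SessionLike[] = [
  { mcpInventory: ['mcp__github__create_issue'], invokedServers: ['github'] },
  { invokedServers: ['github'] }, // invocation-only: no inventory observed
  { mcpInventory: ['mcp__excel__read_sheet'], invokedServers: [] },
]
// One loaded session for `github`, so it stays below the `>= 2` threshold.
const githubLoaded = countLoadedSessions(sessions, 'github')
```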
Coverage is computed against the inventory only: invocations of names
not present in any observed inventory (older config, hallucinated tool,
typo) do not inflate `toolsInvoked` and cannot drive `unusedCount`
negative. `toolsInvoked` is derived as `inventory.size - unusedTools.length`
to keep both numbers consistent.
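The derivation can be sketched as below. `Coverage` and `computeCoverage` are hypothetical names, and both sets are assumed to hold fully-qualified tool names for a single server.

```typescript
// Assumed result shape for this illustration.
interface Coverage {
  toolsAvailable: number
  toolsInvoked: number
  unusedTools: string[]
}

function computeCoverage(inventory: Set<string>, invoked: Set<string>): Coverage {
  const unusedTools: string[] = []
  inventory.forEach((name) => {
    if (!invoked.has(name)) unusedTools.push(name)
  })
  return {
    toolsAvailable: inventory.size,
    // Derived from the inventory, so invocations of names outside the
    // inventory cannot inflate the count or push the unused count negative.
    toolsInvoked: inventory.size - unusedTools.length,
    unusedTools: unusedTools.sort(),
  }
}
```

Because `toolsInvoked` is computed by subtraction rather than by counting the invoked set, `toolsInvoked + unusedTools.length` always equals the inventory size.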
`detectUnusedMcp` and the new detector are explicitly disjoint:
`detectUnusedMcp` skips servers that the coverage detector will report,
not every server that happens to be in any inventory, so a small
inventoried-but-uninvoked server below the coverage thresholds still
gets flagged as "configured but never called."
Thresholds for the coverage finding:
- > 10 tools available (small servers are noise)
- < 20% coverage
- >= 2 sessions with observed inventory
- High impact when total effective tokens >= 200_000 or >= 3 servers flagged
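Read as predicates, the gates above might look like this. It is a sketch only: `ServerCoverage`, `clearsCoverageThresholds`, and `isHighImpact` are hypothetical names, and the label used for findings that are not high impact is not stated here, so the second function returns a boolean.

```typescript
// Assumed per-server aggregate for this illustration.
interface ServerCoverage {
  toolsAvailable: number
  toolsInvoked: number
  loadedSessions: number
}

// A server is reported only when all three gates pass.
function clearsCoverageThresholds(s: ServerCoverage): boolean {
  if (s.toolsAvailable <= 10) return false // small servers are noise
  if (s.loadedSessions < 2) return false // needs >= 2 sessions with observed inventory
  return s.toolsInvoked / s.toolsAvailable < 0.2 // strictly below 20% coverage
}

// High impact when the token estimate is large or many servers are flagged.
function isHighImpact(flaggedServers: number, totalEffectiveTokens: number): boolean {
  return totalEffectiveTokens >= 200_000 || flaggedServers >= 3
}
```

For example, a server with 2 of 26 tools invoked across 5 loaded sessions clears the gates (about 8% coverage), while one with exactly 10 tools never does.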
Smoke-tested on a real account: 7 servers flagged across 93 sessions
(`office-word-mcp` 0/54, `notebooklm-mcp` 0/38, `office-ppt-mcp` 0/37,
`excel-mcp-server` 0/25, `github-mcp-server` 2/26, `peekaboo` 3/22, plus
`claude_ai_Asana`). Combined-cap costing keeps `tokensSaved` honest.
Changes:
- src/types.ts: optional `mcpInventory: string[]` on `SessionSummary`.
Provider-agnostic field; currently populated only by the Claude parser.
- src/parser.ts: `extractMcpInventory` walks all entries, validates
fully-qualified names, returns sorted unique list. `buildSessionSummary`
passes it through; field is omitted when empty so JSON exports stay
clean.
- src/optimize.ts: `aggregateMcpCoverage`, `estimateMcpSchemaCost`
(single- and multi-server signatures), `detectMcpToolCoverage`. Wired
into `scanAndDetect`. `detectUnusedMcp` updated to be disjoint with the
new detector.
- tests/mcp-coverage.test.ts: 23 cases covering aggregation, costing,
combined-cap behaviour, threshold gates, invocation-only-session
filtering, foreign-tool invocations, cache rebuild events, write+read
on the same call, multi-server pluralisation.
- tests/parser-mcp-inventory.test.ts: 12 cases for the JSONL extractor
including malformed name rejection and tolerant attachment parsing.
- CHANGELOG.md: entry under Unreleased / Added (CLI).
Closes getagentseal#2
Solid work. Clean separation between extractMcpInventory (parser), aggregateMcpCoverage (aggregation), estimateMcpSchemaCost (costing), and detectMcpToolCoverage (finding emission). Each piece is independently testable and tested. 35 new tests covering edge cases well: malformed names, invocation-only sessions, foreign tools, cache rebuilds, multi-server cap, threshold gates, pluralization. Comments are dense but justified here. The domain (cache pricing, inventory semantics) is genuinely complex and the invariants are non-obvious. Two things to address:
Otherwise this is one of the highest quality external PRs on this repo. Nice iteration.
Thanks for the review. I addressed both cleanup points in e46b20b:
Validated locally with:
Please take another look when you have a chance.
- Use 1.25x multiplier for cache-write tokens to match Anthropic's actual pricing (was incorrectly using 1x)
- Shell-quote server names in `claude mcp remove` fix text to prevent issues with unusual server names
Summary
Closes #2.
Adds a per-tool optimizer finding for MCP servers whose schema is loaded on every turn but rarely invoked. Builds on the existing server-level `detectUnusedMcp` (zero invocations) by reporting partial-use cases like "loaded 54 tools, called 0" or "loaded 26 tools, called 2 (8% coverage)".
Smoke-tested on a real account: 7 servers flagged across 93 sessions (`office-word-mcp` 0/54, `notebooklm-mcp` 0/38, `office-ppt-mcp` 0/37, `excel-mcp-server` 0/25, `github-mcp-server` 2/26, `peekaboo` 3/22, plus `claude_ai_Asana`).

Inventory source
Claude Code's JSONL writes `attachment.deferred_tools_delta` entries whose `addedNames` array lists the exact tools available at that turn, including every fully-qualified `mcp__<server>__<tool>` name. We union across all delta entries in a session (not just the first) because tool availability can change mid-session when MCP config reloads or a subagent inherits a different tool set.
Names that don't match the `mcp__<server>__<tool>` shape with both segments non-empty are rejected at extraction so downstream `split('__')` consumers can't be poisoned.

Token-savings estimation
MCP tool schemas live in the cached prefix of the system prompt: a session pays the full input price on each cache-creation turn (rebuilds happen every ~5 minutes of inactivity) and the cache-read discount on subsequent turns. Each call's contribution is capped at its observed `cacheCreationInputTokens` / `cacheReadInputTokens` so we never claim more MCP overhead than the call's own cache buckets could contain.
When multiple servers are flagged, costing is a single combined pass: the per-call cap applies to the total unused-schema budget across all flagged servers, not per server. Two flagged servers can't independently claim the same call's cache bucket and overstate `tokensSaved`.

Correctness invariants
- A session counts toward `loadedSessions` (and toward the cost estimate) only if its observed inventory included the server. Pure invocation-only sessions, where the server appears in `mcpBreakdown` or `call.mcpTools` without any matching `deferred_tools_delta`, do not satisfy the `>= 2 sessions` threshold on their own.
- Coverage is computed against the inventory only: invocations of names not present in any observed inventory (older config, hallucinated tool, typo) do not inflate `toolsInvoked` and cannot drive `unusedCount` negative. `toolsInvoked` is derived as `inventory.size - unusedTools.length` to keep both numbers consistent.
- `detectUnusedMcp` and the new detector are explicitly disjoint: `detectUnusedMcp` skips servers that the coverage detector will actually report (i.e. those clearing its thresholds), not every server that happens to be in any inventory. A small inventoried-but-uninvoked server below the coverage thresholds still gets flagged as "configured but never called."

Thresholds
- `> 10` tools available (small servers are noise)
- `< 20%` coverage
- `>= 2` sessions with observed inventory
- High impact when total effective tokens `>= 200_000` or `>= 3` servers flagged

Changes
- `src/types.ts`: optional `mcpInventory: string[]` on `SessionSummary`. Provider-agnostic field; currently populated only by the Claude parser.
- `src/parser.ts`: `extractMcpInventory` walks all entries, validates fully-qualified names, returns sorted unique list. `buildSessionSummary` passes it through; the field is omitted when empty so JSON exports stay clean.
- `src/optimize.ts`: `aggregateMcpCoverage`, `estimateMcpSchemaCost` (single- and multi-server signatures), `detectMcpToolCoverage`. Wired into `scanAndDetect`. `detectUnusedMcp` updated to be disjoint with the new detector.
- `tests/mcp-coverage.test.ts`: 23 cases covering aggregation, costing, combined-cap behaviour, threshold gates, invocation-only-session filtering, foreign-tool invocations, cache rebuild events, write+read on the same call, multi-server pluralisation, backward-compat single-server signature.
- `tests/parser-mcp-inventory.test.ts`: 12 cases for the JSONL extractor including malformed name rejection (`mcp__server`, `mcp__server__`, `mcp____tool`) and tolerant attachment parsing.
- `CHANGELOG.md`: entry under Unreleased / Added (CLI).

Scope notes
- `deferred_tools_delta` is Claude Code-specific. The field is provider-agnostic on `SessionSummary` so other parsers can populate it later, but no other provider exposes the same telemetry today.
- `mcpInventory` is an optional field. All existing schemas, exports, and CLI flags are unaffected.

Test plan
Reviews considered
Design and implementation went through three rounds of code review (Codex GPT-5.5 high, Gemini 3.1 Pro Preview, an internal Sonnet reviewer) before this PR. Concrete findings addressed end-to-end:
- `loadedSessions` counted from invocation-only sessions, diluting the threshold
- `toolsInvoked` counting tools not present in inventory
- `continue` after `cacheCreationInputTokens` skipping the same call's `cacheReadInputTokens`
- `extractMcpInventory` accepting malformed names
- … `cacheCreation` events per session)
- `tokensSaved` over-count when multiple servers flagged share a cache bucket
- … `detectUnusedMcp` for inventoried-but-uninvoked small servers