Provider-native prompt caching is a separate concern from local response caching (tracked in #51) and is not currently exposed by AgentLoom. Both Anthropic and OpenAI offer it; Anthropic bills cached prompt prefixes at ~10% of the normal input rate and OpenAI at 50%; neither discount is reachable from AgentLoom workflows today.
#51 (local LLM response caching): AgentLoom hashes (provider + model + prompts + params); on a cache hit it skips the API call entirely and returns the stored response. Useful for dev/test/replay. Backends: file/redis/sqlite. Doesn't help if the input changes even slightly.
This issue (provider-native prompt caching): the API call still happens, but the provider caches the prompt prefix (system prompt, tool definitions, RAG context). The cached portion is billed at a steep discount (~10% of the normal input rate on Anthropic, 50% on OpenAI); the non-cached portion (user message, dynamic content) is processed normally. It works even when the user message differs across calls, because the expensive shared context is what gets cached.
A testing harness like AgentTest amortizes huge system prompts across hundreds of scenario evaluations — exactly the workload prompt caching is designed for. Without it, evaluating 100 scenarios with a 5000-token system prompt costs ~5x more than necessary.
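The ~5x figure can be sanity-checked with a quick back-of-envelope calculation. The per-call variable-content size (~500 tokens) and the Anthropic-style rates (cached reads at 0.10x, initial cache write at 1.25x) are assumptions for illustration, not from the issue:

```python
# Assumptions: 5000-token shared system prompt, ~500 tokens of variable
# content per call, 100 calls, cached reads at 10% of the input rate,
# initial cache write at 1.25x.
SYSTEM, VARIABLE, CALLS = 5000, 500, 100

# Uncached: every call pays full price for the whole prompt.
uncached = CALLS * (SYSTEM + VARIABLE)

# Cached: the first call writes the prefix at 1.25x; the remaining calls
# read it at 0.10x and pay full price only for the variable part.
cached = (1.25 * SYSTEM + VARIABLE) + (CALLS - 1) * (0.10 * SYSTEM + VARIABLE)

print(round(uncached / cached, 1))  # → 5.2
```

With a smaller variable portion the ratio climbs toward the 10x ceiling set by the cached-read discount; with a larger one it shrinks, so "~5x" is a reasonable middle estimate.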
Proposal
1. Anthropic — cache_control breakpoints:
Anthropic's API accepts cache_control: {"type": "ephemeral"} on message blocks. Anything before a cache breakpoint is cached for ~5 minutes; subsequent requests with identical prefix bytes hit the cache.
```yaml
- id: evaluate
  type: llm_call
  system_prompt: "{state.judge_rubric}"     # 4KB rubric, identical across all evals
  prompt: "{state.candidate_response}"      # variable per scenario
  cache_breakpoints:
    - position: system_prompt_end           # cache the system prompt
    - position: messages[2].content_end     # cache up to 3rd message
```
The adapter inserts cache_control: {"type": "ephemeral"} on the appropriate content blocks. Response includes usage.cache_creation_input_tokens and usage.cache_read_input_tokens — surfaced via TokenUsage.
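A minimal sketch of the insertion the adapter would perform, assuming the Anthropic Messages API shape where `system` can be a list of content blocks (the helper name and signature are illustrative, not AgentLoom's actual adapter API):

```python
def with_cache_breakpoint(system_text: str) -> list[dict]:
    """Build an Anthropic-style `system` content-block list whose final
    block carries a cache breakpoint. Hypothetical helper for illustration.
    """
    return [
        {
            "type": "text",
            "text": system_text,
            # Everything up to and including this block is cached (~5 min
            # TTL); identical prefixes on later requests hit the cache.
            "cache_control": {"type": "ephemeral"},
        }
    ]
```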
2. OpenAI — automatic prompt caching:
OpenAI auto-caches prompts ≥ 1024 tokens (transparent to caller). The response includes usage.prompt_tokens_details.cached_tokens. AgentLoom currently discards this. Surface it via TokenUsage.
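Surfacing it is a small parsing change; a sketch under the assumption that the adapter sees the raw `usage` payload as a dict (helper name is hypothetical):

```python
def cached_tokens_from_openai(usage: dict) -> int:
    """Pull the cached-token count out of an OpenAI chat-completion `usage`
    payload. Defensive .get() chain because `prompt_tokens_details` is only
    present on newer API responses; defaults to 0 when absent.
    """
    return (usage.get("prompt_tokens_details") or {}).get("cached_tokens", 0)
```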
3. Extended TokenUsage:
```python
class TokenUsage(BaseModel):
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0
    # NEW
    cached_tokens: int = 0          # tokens served from cache (cheap)
    cache_creation_tokens: int = 0  # Anthropic-specific: tokens written to cache
                                    # (slightly more expensive than uncached)
    reasoning_tokens: int = 0       # from #127
```
4. Cost recomputation:
pricing.yaml (per #6) gains cached_input and cache_creation fields per model, and calculate_cost() extends to account for the new tiers:
```yaml
claude-sonnet-4-5:
  input: 3.00          # USD per 1M tokens
  cached_input: 0.30   # 10% of input
  cache_creation: 3.75
  output: 15.00
gpt-4o:
  input: 2.50
  cached_input: 1.25   # 50% of input (OpenAI)
  output: 10.00
```
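A sketch of what the cache-aware cost calculation could look like, assuming `usage` carries the extended TokenUsage fields as a dict and `pricing` is the per-model mapping from pricing.yaml in USD per 1M tokens (signature and field names are illustrative, not AgentLoom's actual API):

```python
def calculate_cost(usage: dict, pricing: dict) -> float:
    """Hypothetical cache-aware cost calculation, for illustration only."""
    per_m = 1_000_000
    cached = usage.get("cached_tokens", 0)
    created = usage.get("cache_creation_tokens", 0)
    # Tokens billed at the normal input rate: whatever was neither read
    # from nor written to the cache.
    uncached = usage["prompt_tokens"] - cached - created
    return (
        uncached * pricing["input"]
        + cached * pricing.get("cached_input", pricing["input"])
        + created * pricing.get("cache_creation", pricing["input"])
        + usage["completion_tokens"] * pricing["output"]
    ) / per_m
```

Falling back to the full input rate when a model has no `cached_input`/`cache_creation` entry keeps existing pricing.yaml files valid unchanged.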
5. Observability:

- gen_ai.usage.cache_read_input_tokens (OTel GenAI semantic convention)
- gen_ai.usage.cache_creation_input_tokens (Anthropic-specific extension)
- agentloom_cache_savings_usd_total{provider, model}: counts the saved cost vs. the uncached path.

6. Inspection:

```
agentloom info cache-savings
# Last 24 hours: $4.32 saved across 1,247 cached calls (Anthropic)
```

Scope

- src/agentloom/core/results.py: extend TokenUsage with cache fields.
- src/agentloom/providers/anthropic.py: accept cache_breakpoints from step config; insert cache_control on the relevant content blocks; parse cache usage from the response.
- src/agentloom/providers/openai.py: parse prompt_tokens_details.cached_tokens from the response.
- src/agentloom/providers/pricing.py (or pricing.yaml): extended pricing fields.
- src/agentloom/providers/pricing.py::calculate_cost(): account for the cached and cache-creation token tiers.
- src/agentloom/core/models.py: StepDefinition.cache_breakpoints field.
- src/agentloom/observability/metrics.py: savings counter.
- examples/: system-prompt caching example for batch evaluation.

Regression tests

- test_anthropic_cache_breakpoints_inserts_cache_control_headers
- test_anthropic_cache_read_tokens_parsed_from_response
- test_openai_cached_tokens_parsed_from_response
- test_pricing_with_cached_tokens_uses_discounted_rate
- test_pricing_with_cache_creation_uses_creation_rate
- test_total_cost_correct_with_mixed_cache_and_uncached
- test_cache_savings_metric_recorded

Notes

- Complementary to #51 (local response cache). Both can coexist: the local cache short-circuits the call entirely, while the provider cache reduces cost when the call still happens. For testing harnesses that re-evaluate scenarios, the combined effect of #51 (skip identical re-runs) and this issue (cheaper distinct re-runs) is multiplicative.
- Anthropic's cache TTL is fixed at ~5 minutes (per their docs). For workflows with longer gaps between cached calls, cache misses are expected; document this.
- Google does not currently expose user-controllable prompt caching comparable to the OpenAI/Anthropic APIs; revisit when/if it does.
- This is independent and small enough to ship in a single PR per provider.