
add provider-native prompt caching (Anthropic cache_control, OpenAI cached_tokens) #126

@cchinchilla-dev

Description

Provider-native prompt caching is a separate concern from local response caching (tracked in #51) and is not currently exposed by AgentLoom. Both Anthropic and OpenAI offer it: Anthropic bills cached prompt prefixes at roughly 10% of the normal input rate, OpenAI at roughly 50%. Neither is reachable from AgentLoom workflows today.

The distinction from #51:

  • add LLM response caching #51 (local response caching): AgentLoom hashes (provider + model + prompts + params) → on a hit, the API call is skipped entirely and the stored response is returned. Useful for dev/test/replay; backed by a file, Redis, or SQLite store. Doesn't help if the input changes even slightly.
  • This issue (provider-native prompt caching): the API call still happens, but the provider caches the prompt prefix (system prompt, tool definitions, RAG context). The cached portion is billed at ~10% of normal cost. The non-cached portion (user message, dynamic content) is processed normally. Works even when the user message differs across calls — the expensive shared context is cached.

A testing harness like AgentTest amortizes huge system prompts across hundreds of scenario evaluations — exactly the workload prompt caching is designed for. Without it, evaluating 100 scenarios with a 5000-token system prompt costs ~5x more than necessary.
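
A back-of-envelope check of that figure, using the claude-sonnet-4-5 prices proposed below and an assumed 500-token candidate response per scenario (input cost only; output tokens are unaffected by caching):

# Rough input-cost comparison for 100 evals sharing a 5,000-token system prompt.
# Prices are the proposed claude-sonnet-4-5 figures from pricing.yaml below;
# the 500-token candidate response is an assumption for illustration.
INPUT, CACHED_INPUT, CACHE_CREATION = 3.00, 0.30, 3.75  # USD per 1M tokens
scenarios, system_tokens, candidate_tokens = 100, 5_000, 500

uncached = scenarios * (system_tokens + candidate_tokens) * INPUT / 1e6
cached = (
    system_tokens * CACHE_CREATION / 1e6                    # first call writes the cache
    + (scenarios - 1) * system_tokens * CACHED_INPUT / 1e6  # remaining calls read it
    + scenarios * candidate_tokens * INPUT / 1e6            # dynamic part at full price
)
print(f"${uncached:.2f} uncached vs ${cached:.2f} cached")  # ~$1.65 vs ~$0.32, about 5x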

Proposal

1. Anthropic — cache_control breakpoints:

Anthropic's API accepts cache_control: {"type": "ephemeral"} on message blocks. Anything before a cache breakpoint is cached for ~5 minutes; subsequent requests with identical prefix bytes hit the cache.

- id: evaluate
  type: llm_call
  system_prompt: "{state.judge_rubric}"   # 4KB rubric, identical across all evals
  prompt: "{state.candidate_response}"    # variable per scenario
  cache_breakpoints:
    - position: system_prompt_end          # cache the system prompt
    - position: messages[2].content_end    # cache up to 3rd message

The adapter inserts cache_control: {"type": "ephemeral"} on the appropriate content blocks. Response includes usage.cache_creation_input_tokens and usage.cache_read_input_tokens — surfaced via TokenUsage.
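
A minimal sketch of the provider call the adapter would emit, using the anthropic Python SDK directly and reducing the cache_breakpoints mapping to the system-prompt case; error handling and the full breakpoint-to-block translation are omitted:

import anthropic

client = anthropic.Anthropic()

def evaluate_with_cached_rubric(judge_rubric: str, candidate_response: str):
    # cache_control on the system block asks Anthropic to cache everything up to
    # and including that block (ephemeral, ~5 min TTL, refreshed on each hit).
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": judge_rubric,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": candidate_response}],
    )
    usage = response.usage
    # The two cache fields below are what the extended TokenUsage would surface.
    return response, {
        "cache_creation_tokens": usage.cache_creation_input_tokens or 0,
        "cached_tokens": usage.cache_read_input_tokens or 0,
    }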

2. OpenAI — automatic prompt caching:

OpenAI auto-caches prompts ≥ 1024 tokens (transparent to caller). The response includes usage.prompt_tokens_details.cached_tokens. AgentLoom currently discards this. Surface it via TokenUsage.
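
A sketch of the parsing side with the openai Python SDK; prompt_tokens_details can be missing on models without automatic caching, hence the defensive getattr:

from openai import OpenAI

client = OpenAI()

def cached_tokens_from(response) -> int:
    # usage.prompt_tokens_details.cached_tokens is set when a prompt prefix
    # (>= 1024 tokens) was served from OpenAI's automatic cache.
    details = getattr(response.usage, "prompt_tokens_details", None)
    return getattr(details, "cached_tokens", 0) or 0

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "..."},  # long shared prefix
        {"role": "user", "content": "..."},    # dynamic per-scenario content
    ],
)
print(cached_tokens_from(response))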

3. Extended TokenUsage:

class TokenUsage(BaseModel):
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0
    # NEW
    cached_tokens: int = 0              # tokens served from cache (cheap)
    cache_creation_tokens: int = 0      # Anthropic-specific: tokens written to cache (slightly more expensive than uncached)
    reasoning_tokens: int = 0           # from #127
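
For example (a hypothetical adapter-side helper; field names from the class above), both providers' usage objects could be normalized so that prompt_tokens is always the total prompt size, which the cost formula below relies on:

def to_token_usage(prompt: int, completion: int, cached: int = 0, cache_creation: int = 0) -> TokenUsage:
    # prompt is the total prompt size, including any cached or cache-write tokens;
    # adapters normalize to this convention before cost calculation.
    return TokenUsage(
        prompt_tokens=prompt,
        completion_tokens=completion,
        total_tokens=prompt + completion,
        cached_tokens=cached,
        cache_creation_tokens=cache_creation,
    )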

4. Cost recomputation:

pricing.yaml (per #6) gains cached_input and cache_creation fields per model:

claude-sonnet-4-5:
  input: 3.00         # USD per 1M tokens
  cached_input: 0.30  # 10% of input
  cache_creation: 3.75
  output: 15.00

gpt-4o:
  input: 2.50
  cached_input: 1.25  # 50% of input (OpenAI)
  output: 10.00

calculate_cost() extends to:

cost = (
    (prompt_tokens - cached_tokens - cache_creation_tokens) * input_price +
    cached_tokens * cached_input_price +
    cache_creation_tokens * cache_creation_price +
    completion_tokens * output_price
) / 1_000_000
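
As a self-contained sketch (the exact signature of calculate_cost is an assumption; missing pricing fields fall back to the plain input rate so models without cache pricing are unaffected):

def calculate_cost(usage: TokenUsage, pricing: dict) -> float:
    """USD cost, splitting prompt tokens into uncached / cached / cache-write tiers.

    Assumes prompt_tokens is the total prompt size and that cached_tokens and
    cache_creation_tokens are subsets of it (the adapters normalize to this).
    """
    input_price = pricing["input"]
    uncached = usage.prompt_tokens - usage.cached_tokens - usage.cache_creation_tokens
    return (
        uncached * input_price
        + usage.cached_tokens * pricing.get("cached_input", input_price)
        + usage.cache_creation_tokens * pricing.get("cache_creation", input_price)
        + usage.completion_tokens * pricing["output"]
    ) / 1_000_000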

5. Observability:

  • gen_ai.usage.cache_read_input_tokens (OTel GenAI v1.x semantic convention)
  • gen_ai.usage.cache_creation_input_tokens (Anthropic-specific extension)
  • New metric: agentloom_cache_savings_usd_total{provider, model} — counts the saved cost vs the uncached path (sketch after this list).
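
A sketch of the savings counter, assuming the OpenTelemetry metrics API (AgentLoom's actual metrics backend may differ); only cache reads are counted, since the cache-write premium already shows up in calculate_cost() and a counter must only ever increase:

from opentelemetry import metrics

meter = metrics.get_meter("agentloom")
cache_savings = meter.create_counter(
    "agentloom_cache_savings_usd_total",
    unit="USD",
    description="Cost avoided by provider-side prompt caching vs the uncached path",
)

def record_cache_savings(usage: TokenUsage, pricing: dict, provider: str, model: str) -> None:
    # Savings = cached tokens billed at the discounted rate instead of the full
    # input rate. The cache-write premium is not subtracted here; it is already
    # captured by calculate_cost().
    saved = usage.cached_tokens * (
        pricing["input"] - pricing.get("cached_input", pricing["input"])
    ) / 1_000_000
    if saved > 0:
        cache_savings.add(saved, {"provider": provider, "model": model})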

6. Inspection:

agentloom info cache-savings
# Last 24 hours: $4.32 saved across 1,247 cached calls (Anthropic)

Scope

  • src/agentloom/core/results.py — extend TokenUsage with cache fields.
  • src/agentloom/providers/anthropic.py — accept cache_breakpoints from step config; insert cache_control on the targeted content blocks; parse cache usage from the response.
  • src/agentloom/providers/openai.py — parse prompt_tokens_details.cached_tokens from response.
  • src/agentloom/providers/pricing.py (or pricing.yaml) — extended pricing fields.
  • src/agentloom/providers/pricing.py::calculate_cost() — account for cached/cache_creation token tiers.
  • src/agentloom/core/models.py — StepDefinition.cache_breakpoints field.
  • src/agentloom/observability/metrics.py — savings counter.
  • examples/ — system prompt caching example for batch evaluation.

Regression tests

  • test_anthropic_cache_breakpoints_inserts_cache_control_headers
  • test_anthropic_cache_read_tokens_parsed_from_response
  • test_openai_cached_tokens_parsed_from_response
  • test_pricing_with_cached_tokens_uses_discounted_rate
  • test_pricing_with_cache_creation_uses_creation_rate
  • test_total_cost_correct_with_mixed_cache_and_uncached
  • test_cache_savings_metric_recorded

Notes

  • Complementary to add LLM response caching #51 (local response cache). Both can coexist — local cache short-circuits the call entirely; provider cache reduces cost when the call still happens.
  • For testing harnesses that re-evaluate scenarios, the combined effect of add LLM response caching #51 (skip identical re-runs) + this (cheaper distinct re-runs) is multiplicative.
  • Anthropic's ephemeral cache TTL is ~5 minutes, refreshed on each cache hit (per their docs). For workflows with longer gaps between cached calls, cache misses are expected. Document this.
  • Google's Gemini context caching uses a different, explicit mechanism (managed cached-content objects) rather than the OpenAI/Anthropic-style prefix caching; it is out of scope here and can be revisited in a follow-up.
  • This is independent and small enough to ship in a single PR per provider.
