
add provider-native prompt caching (Anthropic cache_control, OpenAI cached_tokens) #126

@cchinchilla-dev

Description

Provider-native prompt caching is a separate concern from local response caching (tracked in #51) and is not currently exposed by AgentLoom. Both Anthropic and OpenAI offer it: Anthropic bills cached prompt prefixes at roughly 10% of the normal input rate, OpenAI at roughly 50%. Neither is reachable from AgentLoom workflows today.

The distinction from #51:

  • add LLM response caching #51 (local response caching): AgentLoom hashes (provider + model + prompts + params) → on a hit, the API call is skipped entirely and the stored response is returned. Useful for dev/test/replay; backed by a file, Redis, or SQLite store. Doesn't help if the input changes even slightly.
  • This issue (provider-native prompt caching): the API call still happens, but the provider caches the prompt prefix (system prompt, tool definitions, RAG context). The cached portion is billed at ~10% of normal cost. The non-cached portion (user message, dynamic content) is processed normally. Works even when the user message differs across calls — the expensive shared context is cached.

A testing harness like AgentTest amortizes huge system prompts across hundreds of scenario evaluations — exactly the workload prompt caching is designed for. Without it, evaluating 100 scenarios with a 5000-token system prompt costs ~5x more than necessary.
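
A back-of-envelope check of that figure, using the claude-sonnet-4-5 prices proposed below and an assumed 500-token candidate response per scenario (input cost only; output tokens are unaffected by caching):

# Rough input-cost comparison for 100 evals sharing a 5,000-token system prompt.
# Prices are the proposed claude-sonnet-4-5 figures from pricing.yaml below;
# the 500-token candidate response is an assumption for illustration.
INPUT, CACHED_INPUT, CACHE_CREATION = 3.00, 0.30, 3.75  # USD per 1M tokens
scenarios, system_tokens, candidate_tokens = 100, 5_000, 500

uncached = scenarios * (system_tokens + candidate_tokens) * INPUT / 1e6
cached = (
    system_tokens * CACHE_CREATION / 1e6                    # first call writes the cache
    + (scenarios - 1) * system_tokens * CACHED_INPUT / 1e6  # remaining calls read it
    + scenarios * candidate_tokens * INPUT / 1e6            # dynamic part at full price
)
print(f"${uncached:.2f} uncached vs ${cached:.2f} cached")  # ~$1.65 vs ~$0.32, about 5x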

Proposal

1. Anthropic — cache_control breakpoints:

Anthropic's API accepts cache_control: {"type": "ephemeral"} on message blocks. Anything before a cache breakpoint is cached for ~5 minutes; subsequent requests with identical prefix bytes hit the cache.

- id: evaluate
  type: llm_call
  system_prompt: "{state.judge_rubric}"   # 4KB rubric, identical across all evals
  prompt: "{state.candidate_response}"    # variable per scenario
  cache_breakpoints:
    - position: system_prompt_end          # cache the system prompt
    - position: messages[2].content_end    # cache up to 3rd message

The adapter inserts cache_control: {"type": "ephemeral"} on the appropriate content blocks. Response includes usage.cache_creation_input_tokens and usage.cache_read_input_tokens — surfaced via TokenUsage.
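
A minimal sketch of the provider call the adapter would emit, using the anthropic Python SDK directly and reducing the cache_breakpoints mapping to the system-prompt case; error handling and the full breakpoint-to-block translation are omitted:

import anthropic

client = anthropic.Anthropic()

def evaluate_with_cached_rubric(judge_rubric: str, candidate_response: str):
    # cache_control on the system block asks Anthropic to cache everything up to
    # and including that block (ephemeral, ~5 min TTL, refreshed on each hit).
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": judge_rubric,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": candidate_response}],
    )
    usage = response.usage
    # The two cache fields below are what the extended TokenUsage would surface.
    return response, {
        "cache_creation_tokens": usage.cache_creation_input_tokens or 0,
        "cached_tokens": usage.cache_read_input_tokens or 0,
    }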

2. OpenAI — automatic prompt caching:

OpenAI auto-caches prompts ≥ 1024 tokens (transparent to caller). The response includes usage.prompt_tokens_details.cached_tokens. AgentLoom currently discards this. Surface it via TokenUsage.
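
A sketch of the parsing side with the openai Python SDK; prompt_tokens_details can be missing on models without automatic caching, hence the defensive getattr:

from openai import OpenAI

client = OpenAI()

def cached_tokens_from(response) -> int:
    # usage.prompt_tokens_details.cached_tokens is set when a prompt prefix
    # (>= 1024 tokens) was served from OpenAI's automatic cache.
    details = getattr(response.usage, "prompt_tokens_details", None)
    return getattr(details, "cached_tokens", 0) or 0

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "..."},  # long shared prefix
        {"role": "user", "content": "..."},    # dynamic per-scenario content
    ],
)
print(cached_tokens_from(response))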

3. Extended TokenUsage:

class TokenUsage(BaseModel):
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0
    # NEW
    cached_tokens: int = 0              # tokens served from cache (cheap)
    cache_creation_tokens: int = 0      # Anthropic-specific: tokens written to cache (slightly more expensive than uncached)
    reasoning_tokens: int = 0           # from #127
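
For example (a hypothetical adapter-side helper; field names from the class above), both providers' usage objects could be normalized so that prompt_tokens is always the total prompt size, which the cost formula below relies on:

def to_token_usage(prompt: int, completion: int, cached: int = 0, cache_creation: int = 0) -> TokenUsage:
    # prompt is the total prompt size, including any cached or cache-write tokens;
    # adapters normalize to this convention before cost calculation.
    return TokenUsage(
        prompt_tokens=prompt,
        completion_tokens=completion,
        total_tokens=prompt + completion,
        cached_tokens=cached,
        cache_creation_tokens=cache_creation,
    )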

4. Cost recomputation:

pricing.yaml (per #6) gains cached_input and cache_creation fields per model:

claude-sonnet-4-5:
  input: 3.00         # USD per 1M tokens
  cached_input: 0.30  # 10% of input
  cache_creation: 3.75
  output: 15.00

gpt-4o:
  input: 2.50
  cached_input: 1.25  # 50% of input (OpenAI)
  output: 10.00

calculate_cost() extends to:

cost = (
    (prompt_tokens - cached_tokens - cache_creation_tokens) * input_price +
    cached_tokens * cached_input_price +
    cache_creation_tokens * cache_creation_price +
    completion_tokens * output_price
) / 1_000_000
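
As a self-contained sketch (the exact signature of calculate_cost is an assumption; missing pricing fields fall back to the plain input rate so models without cache pricing are unaffected):

def calculate_cost(usage: TokenUsage, pricing: dict) -> float:
    """USD cost, splitting prompt tokens into uncached / cached / cache-write tiers.

    Assumes prompt_tokens is the total prompt size and that cached_tokens and
    cache_creation_tokens are subsets of it (the adapters normalize to this).
    """
    input_price = pricing["input"]
    uncached = usage.prompt_tokens - usage.cached_tokens - usage.cache_creation_tokens
    return (
        uncached * input_price
        + usage.cached_tokens * pricing.get("cached_input", input_price)
        + usage.cache_creation_tokens * pricing.get("cache_creation", input_price)
        + usage.completion_tokens * pricing["output"]
    ) / 1_000_000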

5. Observability:

  • gen_ai.usage.cache_read_input_tokens (OTel GenAI v1.x semantic convention)
  • gen_ai.usage.cache_creation_input_tokens (Anthropic-specific extension)
  • New metric: agentloom_cache_savings_usd_total{provider, model} — counts the saved cost vs the uncached path (sketch after this list).
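
A sketch of the savings counter, assuming the OpenTelemetry metrics API (AgentLoom's actual metrics backend may differ); only cache reads are counted, since the cache-write premium already shows up in calculate_cost() and a counter must only ever increase:

from opentelemetry import metrics

meter = metrics.get_meter("agentloom")
cache_savings = meter.create_counter(
    "agentloom_cache_savings_usd_total",
    unit="USD",
    description="Cost avoided by provider-side prompt caching vs the uncached path",
)

def record_cache_savings(usage: TokenUsage, pricing: dict, provider: str, model: str) -> None:
    # Savings = cached tokens billed at the discounted rate instead of the full
    # input rate. The cache-write premium is not subtracted here; it is already
    # captured by calculate_cost().
    saved = usage.cached_tokens * (
        pricing["input"] - pricing.get("cached_input", pricing["input"])
    ) / 1_000_000
    if saved > 0:
        cache_savings.add(saved, {"provider": provider, "model": model})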

6. Inspection:

agentloom info cache-savings
# Last 24 hours: $4.32 saved across 1,247 cached calls (Anthropic)

Scope

  • src/agentloom/core/results.py — extend TokenUsage with cache fields.
  • src/agentloom/providers/anthropic.py — accept cache_breakpoints from step config; insert cache_control on the targeted content blocks; parse cache usage from the response.
  • src/agentloom/providers/openai.py — parse prompt_tokens_details.cached_tokens from response.
  • src/agentloom/providers/pricing.py (or pricing.yaml) — extended pricing fields.
  • src/agentloom/providers/pricing.py::calculate_cost() — account for cached/cache_creation token tiers.
  • src/agentloom/core/models.py — StepDefinition.cache_breakpoints field.
  • src/agentloom/observability/metrics.py — savings counter.
  • examples/ — system prompt caching example for batch evaluation.

Regression tests

  • test_anthropic_cache_breakpoints_inserts_cache_control_headers
  • test_anthropic_cache_read_tokens_parsed_from_response
  • test_openai_cached_tokens_parsed_from_response
  • test_pricing_with_cached_tokens_uses_discounted_rate
  • test_pricing_with_cache_creation_uses_creation_rate
  • test_total_cost_correct_with_mixed_cache_and_uncached
  • test_cache_savings_metric_recorded

Notes

  • Complementary to add LLM response caching #51 (local response cache). Both can coexist — local cache short-circuits the call entirely; provider cache reduces cost when the call still happens.
  • For testing harnesses that re-evaluate scenarios, the combined effect of add LLM response caching #51 (skip identical re-runs) + this (cheaper distinct re-runs) is multiplicative.
  • Anthropic's ephemeral cache TTL is ~5 minutes, refreshed on each cache hit (per their docs). For workflows with longer gaps between cached calls, cache misses are expected. Document this.
  • Google's Gemini context caching uses a different, explicit mechanism (managed cached-content objects) rather than the OpenAI/Anthropic-style prefix caching; it is out of scope here and can be revisited in a follow-up.
  • This is independent and small enough to ship in a single PR per provider.
