Anthropic Prompt Caching: Align CONVERSATION_HISTORY with Anthropic's incremental caching pattern
This commit updates the CONVERSATION_HISTORY cache strategy to align with
Anthropic's official documentation and cookbook examples
(https://github.com/anthropics/claude-cookbooks/blob/main/misc/prompt_caching.ipynb)
for incremental conversation caching.
**Cache breakpoint placement:**
- Before: Cache breakpoint on penultimate (second-to-last) user message
- After: Cache breakpoint on last user message
**Aggregate eligibility:**
- Before: Only considered user messages for min content length check
- After: Considers all message types (user, assistant, tool) within 20-block
lookback window for aggregate eligibility
Anthropic's documentation and cookbook demonstrate incremental caching by
placing cache_control on the LAST user message:
```python
result.append({
    "role": "user",
    "content": [{
        "type": "text",
        "text": turn["content"][0]["text"],
        "cache_control": {"type": "ephemeral"}  # On LAST user message
    }]
})
```
This pattern is also shown in their official docs:
https://docs.claude.com/en/docs/build-with-claude/prompt-caching#large-context-caching-example
Anthropic's caching system uses prefix matching to find the longest matching
prefix from the cache. By placing cache_control on the last user message,
we enable the following incremental caching pattern:
```
Turn 1: Cache [System + User1]
Turn 2: Reuse [System + User1], process [Assistant1 + User2],
        cache [System + User1 + Assistant1 + User2]
Turn 3: Reuse [System + User1 + Assistant1 + User2],
        process [Assistant2 + User3],
        cache [System + User1 + Assistant1 + User2 + Assistant2 + User3]
```
The cache grows incrementally with each turn, building a larger prefix that
can be reused. This is the recommended pattern from Anthropic.
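To make the placement concrete, here is a minimal Java sketch of that idea. It uses plain maps shaped like Anthropic request content blocks purely for illustration; it is not Spring AI's internal code, and the helper names are made up for this example:
```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LastUserMessageBreakpoint {

    // Attach an ephemeral cache_control to the final content block of the LAST
    // user message (previously the penultimate user message carried it).
    static void markLastUserMessage(List<Map<String, Object>> messages) {
        for (int i = messages.size() - 1; i >= 0; i--) {
            Map<String, Object> message = messages.get(i);
            if ("user".equals(message.get("role"))) {
                @SuppressWarnings("unchecked")
                List<Map<String, Object>> content = (List<Map<String, Object>>) message.get("content");
                content.get(content.size() - 1).put("cache_control", Map.of("type", "ephemeral"));
                return; // only the last user message gets this breakpoint
            }
        }
    }

    // Hypothetical helper to build a single-text-block message.
    static Map<String, Object> message(String role, String text) {
        List<Map<String, Object>> content = new ArrayList<>();
        content.add(new LinkedHashMap<>(Map.of("type", "text", "text", text)));
        return new LinkedHashMap<>(Map.of("role", role, "content", content));
    }

    public static void main(String[] args) {
        List<Map<String, Object>> messages = new ArrayList<>(List.of(
                message("user", "What does this document cover?"),
                message("assistant", "It covers prompt caching."),
                message("user", "Summarize the caching section.")));
        markLastUserMessage(messages);
        // The last user message now carries {"cache_control": {"type": "ephemeral"}}.
        System.out.println(messages.get(messages.size() - 1));
    }
}
```
Because the breakpoint moves forward with each new user message, every request's prefix matches the previously cached prefix, which is what makes the incremental pattern above work.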
The new implementation considers all message types (user, assistant, tool)
within the 20-block lookback window when checking the minimum content length;
a sketch of this check follows the list below. This ensures that:
- Short user questions don't prevent caching when the conversation has long
assistant responses
- The full conversation context is considered for the 1024+ token minimum
- Eligibility checking aligns with Anthropic's note: "The automatic prefix checking
only looks back approximately 20 content blocks from each explicit breakpoint"
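The aggregate check can be sketched roughly as follows. This is illustrative only, not the actual Spring AI implementation; the 4-characters-per-token estimate is a crude stand-in for real token counting, and 1024 tokens is Anthropic's documented minimum for most models:
```java
import java.util.List;

public class AggregateEligibilitySketch {

    record Block(String role, String text) {}

    static final int LOOKBACK_BLOCKS = 20;    // Anthropic checks ~20 blocks behind a breakpoint
    static final int MIN_CACHE_TOKENS = 1024; // documented minimum for most Claude models

    // Sum estimated tokens over the last LOOKBACK_BLOCKS blocks, regardless of role.
    static boolean meetsMinimum(List<Block> blocks) {
        int start = Math.max(0, blocks.size() - LOOKBACK_BLOCKS);
        long estimatedTokens = 0;
        for (int i = start; i < blocks.size(); i++) {
            estimatedTokens += blocks.get(i).text().length() / 4; // rough chars-per-token heuristic
        }
        return estimatedTokens >= MIN_CACHE_TOKENS;
    }

    public static void main(String[] args) {
        List<Block> conversation = List.of(
                new Block("user", "Short question?"),
                new Block("assistant", "A long, detailed assistant answer. ".repeat(300)),
                new Block("user", "Brief follow-up?"));
        // A short final user question no longer blocks caching, because the long
        // assistant response counts toward the aggregate length.
        System.out.println(meetsMinimum(conversation)); // true
    }
}
```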
**Breaking changes:** None. This is an implementation detail of the CONVERSATION_HISTORY strategy.
The API surface remains unchanged. Users may observe:
- Different cache hit patterns (should be more effective)
- Cache metrics may show higher cache read tokens as conversations grow (see the
example below)
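As a rough illustration of how those metrics could be observed, here is a hedged sketch. It assumes the native Anthropic usage record exposes `cacheCreationInputTokens()` and `cacheReadInputTokens()` accessors mirroring the API's `cache_creation_input_tokens` / `cache_read_input_tokens` fields; verify the accessor names against the `AnthropicApi.Usage` type in your Spring AI version:
```java
import org.springframework.ai.anthropic.api.AnthropicApi;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.model.ChatResponse;

class CacheMetricsProbe {

    // Assumption: AnthropicApi.Usage exposes cacheCreationInputTokens()/cacheReadInputTokens()
    // accessors mirroring the API's cache_creation_input_tokens / cache_read_input_tokens fields.
    static void printCacheMetrics(ChatClient chatClient) {
        ChatResponse response = chatClient.prompt()
                .user("Summarize our conversation so far")
                .call()
                .chatResponse();

        Object nativeUsage = response.getMetadata().getUsage().getNativeUsage();
        if (nativeUsage instanceof AnthropicApi.Usage usage) {
            // Cache reads should grow across turns as the cached prefix lengthens.
            System.out.println("cache_creation_input_tokens: " + usage.cacheCreationInputTokens());
            System.out.println("cache_read_input_tokens: " + usage.cacheReadInputTokens());
        }
    }
}
```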
**Test changes:**
- Updated `shouldRespectMinLengthForUserHistoryCaching()` to test aggregate
eligibility with combined message lengths
- Renamed `shouldApplyCacheControlToLastUserMessageForConversationHistory()`
(from `shouldRespectAllButLastUserMessageForUserHistoryCaching`)
- Added `shouldDemonstrateIncrementalCachingAcrossMultipleTurns()` integration
test showing cache growth pattern across 4 conversation turns
- Updated mock test assertions to verify last message has cache_control
**Documentation changes:** Updated anthropic-chat.adoc to clarify:
- CONVERSATION_HISTORY strategy description now mentions incremental prefix
caching
- Code example comments updated to reflect cache breakpoint on last user
message
- Implementation Details section expanded with explanation of prefix matching
and aggregate eligibility checking
**References:**
- Anthropic Prompt Caching Docs: https://docs.claude.com/en/docs/build-with-claude/prompt-caching
- Anthropic Cookbook: https://github.com/anthropics/claude-cookbooks/blob/main/misc/prompt_caching.ipynb
Signed-off-by: Soby Chacko <soby.chacko@broadcom.com>
Files changed:

- models/spring-ai-anthropic/src/test/java/org/springframework/ai/anthropic/AnthropicPromptCachingMockTest.java
- spring-ai-docs/src/main/antora/modules/ROOT/pages/api/chat/anthropic-chat.adoc (8 additions, 3 deletions)

Diff of spring-ai-docs/src/main/antora/modules/ROOT/pages/api/chat/anthropic-chat.adoc:
```diff
@@ -213,9 +213,9 @@ Different models have different minimum token thresholds for cache effectiveness
 Spring AI provides strategic cache placement through the `AnthropicCacheStrategy` enum:
 
 * `NONE`: Disables prompt caching completely
-* `SYSTEM_ONLY`: Caches only the system message content
+* `SYSTEM_ONLY`: Caches only the system message content
 * `SYSTEM_AND_TOOLS`: Caches system message and the last tool definition
-* `CONVERSATION_HISTORY`: Caches conversation history in chat memory scenarios
+* `CONVERSATION_HISTORY`: Caches the entire conversation history by placing cache breakpoints on tools (if present), system message, and the last user message. This enables incremental prefix caching for multi-turn conversations
 
 This strategic approach ensures optimal cache breakpoint placement while staying within Anthropic's 4-breakpoint limit.
@@ -620,13 +620,18 @@ Even small changes will require a new cache entry.
 The prompt caching implementation in Spring AI follows these key design principles:
 
 1. **Strategic Cache Placement**: Cache breakpoints are automatically placed at optimal locations based on the chosen strategy, ensuring compliance with Anthropic's 4-breakpoint limit.
+- `CONVERSATION_HISTORY` places cache breakpoints on: tools (if present), system message, and the last user message
+- This enables Anthropic's prefix matching to incrementally cache the growing conversation history
+- Each turn builds on the previous cached prefix, maximizing cache reuse
 
 2. **Provider Portability**: Cache configuration is done through `AnthropicChatOptions` rather than individual messages, preserving compatibility when switching between different AI providers.
 
 3. **Thread Safety**: The cache breakpoint tracking is implemented with thread-safe mechanisms to handle concurrent requests correctly.
 
 4. **Automatic Content Ordering**: The implementation ensures proper on-the-wire ordering of JSON content blocks and cache controls according to Anthropic's API requirements.
 
+5. **Aggregate Eligibility Checking**: For `CONVERSATION_HISTORY`, the implementation considers all message types (user, assistant, tool) within the last ~20 content blocks when determining if the combined content meets the minimum token threshold for caching.
+
 === Future Enhancements
 
 The current cache strategies are designed to handle **90% of common use cases** effectively. For applications requiring more granular control, future enhancements may include:
```