Follow up on discussion from #4300

Context

First off, thanks for landing prompt caching support—this is a big unlock. The recent work wires cache_control into Spring AI’s Anthropic integration via AnthropicChatOptions, including a single cacheTtl and a “strategy” knob, plus usage tracking fields for cache reads/writes. That’s a solid baseline. 

In real applications, we can squeeze substantially more value (and avoid pitfalls) by making caching aware of message types and by optimizing how we “spend” Anthropic’s limited cache breakpoints.

For reference, I prototyped this in a small PR (https://github.com/sobychacko/spring-ai/pull/2) against @sobychacko's branch; it adds fine-grained controls (per-message-type TTLs, eligibility, and minimum sizes) on top of some of the other changes already incorporated into #4300.

Why now

Anthropic’s behavior has a few important constraints/opportunities:

• You can define up to 4 cache breakpoints; automatic prefix checking looks back ~20 content blocks. Using multiple, intentional breakpoints drives higher hit rates.
• Default TTL is 5 minutes, and 1 hour is also available. Mixing TTLs is allowed, but longer TTLs must appear before shorter TTLs; the order affects both correctness and billing.
• Minimum cacheable length is 1024 tokens (most models) or 2048 for Haiku; short content won’t be cached even if marked. A library-level guard keeps us from wasting breakpoints on tiny content.

You can use both 1-hour and 5-minute cache controls in the same request, but with an important constraint: cache entries with a longer TTL must appear before entries with a shorter TTL (i.e., a 1-hour cache entry must appear before any 5-minute cache entries).
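
To make the breakpoint and ordering rules concrete, here is a minimal sketch of the kind of guard a library could run before sending a request. CacheSegment, Ttl, and CacheBreakpointRules are hypothetical names used only for illustration; they are not existing Spring AI or Anthropic SDK types.

import java.util.List;

enum Ttl { ONE_HOUR, FIVE_MINUTES }

record CacheSegment(String label, Ttl ttl) {
}

final class CacheBreakpointRules {

    private static final int MAX_BREAKPOINTS = 4;

    // Rejects plans that exceed 4 breakpoints or place a 1h entry after a 5m entry.
    static void validate(List<CacheSegment> segments) {
        if (segments.size() > MAX_BREAKPOINTS) {
            throw new IllegalArgumentException("at most " + MAX_BREAKPOINTS + " cache breakpoints are allowed");
        }
        boolean seenFiveMinute = false;
        for (CacheSegment segment : segments) {
            if (segment.ttl() == Ttl.FIVE_MINUTES) {
                seenFiveMinute = true;
            }
            else if (seenFiveMinute) {
                throw new IllegalArgumentException(
                        "1h cache entry '" + segment.label() + "' must appear before any 5m entries");
            }
        }
    }
}

With a guard like that, a plan such as [tools (1h), system (1h), last user turn (5m)] passes, while a plan that puts a 1h entry after a 5m entry fails fast in the library instead of at the API.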

Given those constraints, two knobs consistently drive better cache utilization in production:

1. Per-message-type TTL (e.g., SYSTEM and TOOLS at 1h; USER and ASSISTANT at 5m)
2. Minimum content size thresholds (global + per-message-type) so we don’t “spend” scarce breakpoints on small or low-value segments

In my testing, those two changes notably improved “tokens cached as a % of total input” while staying within Anthropic’s rules.

Proposal

Message-type-aware caching policy

Introduce a provider-specific CacheControlConfiguration on AnthropicChatOptions that controls:

• Eligible roles: EnumSet<MessageType> (typically SYSTEM and TOOL; often not USER/ASSISTANT)
• Max cache blocks: int maxCacheBlocks (defaults to 4; users may set it lower if for some reason they want to). Perhaps let’s not include this.
• Minimum size for cache eligibility: int minLengthGlobal plus Map<MessageType, Integer> minLengthOverrides (length measured as chars, or via a pluggable token estimator). I considered exposing a Function that could be applied to the content to estimate its size rather than always looking at the String length.
• Per-message-type TTL: Map<MessageType, Ttl> where Ttl ∈ {"5m", "1h"}; enforce Anthropic’s ordering rule (1h segments must appear before 5m ones).

This allows production users to express intent directly (e.g., “I always want my long, stable system spec and tool schema at 1h, but only cache user/assistant turns at 5m when they’re big enough to matter.”)

Note: The merged code already routes caching via AnthropicChatOptions at request creation time to keep message classes portable—this proposal keeps that design, just adds policy depth. 

Example API sketch (illustrative)

AnthropicChatOptions options =
    AnthropicChatOptions.builder()
        .cachePolicy(CacheControlConfiguration.builder()
            .eligible(EnumSet.of(MessageType.SYSTEM, MessageType.TOOL))
            .maxCacheBlocks(3) // budget carefully under Anthropic's max of 4
            .minLengthGlobal(800) // skip tiny segments
            .minLengthOverride(MessageType.TOOL, 1800) // Override global default for TOOL messages
            .ttl(MessageType.SYSTEM, "1h") // Set specific TTL for System messages
            .ttl(MessageType.TOOL, "5m")
            .build())
        .build();
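
For completeness, one possible shape for the configuration object those builder calls assume, using Spring AI's existing MessageType enum. This is a sketch only and not necessarily how the prototype PR models it.

import java.util.Map;
import java.util.Set;

import org.springframework.ai.chat.messages.MessageType;

// Sketch of the proposed policy holder; the builder above would sit on top of this.
public record CacheControlConfiguration(
        Set<MessageType> eligible,                    // roles allowed to carry cache_control
        int maxCacheBlocks,                           // capped at Anthropic's limit of 4
        int minLengthGlobal,                          // default minimum content length
        Map<MessageType, Integer> minLengthOverrides, // per-type minimum length overrides
        Map<MessageType, String> ttl) {               // "5m" or "1h" per message type
}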

Under the hood, the request builder:

• Emits ContentBlocks for system and applies cache_control selectively based on the policy.
• Applies the same policy to tools and (optionally) messages, enforcing TTL ordering and the ≤4 breakpoint limit (see the sketch below).
• Skips cache marks for small segments so we don’t waste a breakpoint or rely on server-side rejection.
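
As an illustration of that selection step, here is a rough sketch of the per-segment decision. Type and method names are hypothetical, not taken from the prototype PR; TTL ordering would be enforced separately, e.g. by a guard like the one sketched earlier.

import java.util.Map;
import java.util.Set;

import org.springframework.ai.chat.messages.MessageType;

// Illustrative only: decides, per segment, whether to attach a cache_control mark.
final class CacheControlPlanner {

    private final Set<MessageType> eligible;
    private final int minLengthGlobal;
    private final Map<MessageType, Integer> minLengthOverrides;
    private int breakpointsUsed = 0;

    CacheControlPlanner(Set<MessageType> eligible, int minLengthGlobal,
            Map<MessageType, Integer> minLengthOverrides) {
        this.eligible = eligible;
        this.minLengthGlobal = minLengthGlobal;
        this.minLengthOverrides = minLengthOverrides;
    }

    // Returns true if this segment should carry a cache_control mark.
    boolean shouldCache(MessageType type, String content) {
        if (!eligible.contains(type)) {
            return false;                                   // role not opted in
        }
        int minLength = minLengthOverrides.getOrDefault(type, minLengthGlobal);
        if (content.length() < minLength) {
            return false;                                   // too small to be worth a breakpoint
        }
        if (breakpointsUsed >= 4) {
            return false;                                   // Anthropic allows at most 4 breakpoints
        }
        breakpointsUsed++;
        return true;
    }
}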

Benefits

• Higher hit rates, lower cost: Cache stable, heavy segments (system/tool schemas, long context) at 1h while leaving volatile parts at 5m. Per Anthropic’s pricing, cache reads cost ~10% of base input tokens and cache writes cost 1.25×/2× base for 5m/1h respectively, so the mix often nets out strongly in favor of caching (a quick back-of-the-envelope example follows below).
• Fewer foot-guns: The library guards the “max 4 breakpoints” rule, honors TTL ordering, and avoids marking short segments that wouldn’t cache anyway.
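
To make the pricing point concrete, here is a back-of-the-envelope comparison using the multipliers above; the token count and request count are made up for illustration.

// Cost comparison for a 5,000-token stable prefix reused 10 times within an hour,
// using the multipliers cited above (reads ~0.1x base, 1h writes 2x base).
public class CacheCostSketch {

    public static void main(String[] args) {
        double baseTokens = 5_000;
        int requests = 10;

        double uncached = requests * baseTokens;              // 50,000 base-priced tokens
        double oneHourCache = 2.0 * baseTokens                // one 1h cache write at 2x
                + (requests - 1) * 0.1 * baseTokens;          // nine reads at ~10% of base

        System.out.printf("uncached=%.0f cached=%.0f savings=%.0f%%%n",
                uncached, oneHourCache, 100 * (1 - oneHourCache / uncached));
        // prints roughly: uncached=50000 cached=14500 savings=71%
    }
}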

Comment From: sobychacko

@adase11 Thanks for this report. Are you open to sending a PR based on the work you've already done against the previous PR branch? We will be more than happy to review the changes. Thanks!

Comment From: adase11

Yep, more than happy to. I'll have something together shortly.

Comment From: adase11

@sobychacko PR here #4342