Bug description

I'm experiencing what appears to be multiple API calls to the LLM provider when calling different methods on the same ChatClient.CallResponseSpec instance. When I call both chatResponse() and entity() on the same response object, it seems like two separate API calls are being made instead of reusing a cached response.

I'm not entirely sure if this is a bug in Spring AI or if I'm misunderstanding how the API is supposed to work, but I'm looking for guidance on the correct approach.

My use case requires both structured output parsing (via .entity()) and metadata access (via .chatResponse().getMetadata()), specifically for token usage tracking. Currently, I can't find a way to get both from a single API call.

The observed behavior includes:

  • What seems like doubled API costs
  • Significantly increased response times (2+ seconds instead of milliseconds)
  • Difficulty implementing proper token usage tracking

Environment

  • Spring AI version: 1.0.0-M7
  • Java version: 17
  • Spring Boot version: 3.4.4
  • LLM Provider: OpenAI (also using Anthropic)
  • Maven dependencies: spring-ai-starter-model-openai, spring-ai-starter-model-anthropic

Steps to reproduce

  1. Create a ChatClient instance
  2. Use .entity() for structured output:

```java
// This works for structured output but provides no access to metadata
MyStructuredResponse modelAnswer = getChatClient()
        .prompt(prompt)
        .call()
        .entity(MyStructuredResponse.class);

// No way to access token usage from this call
// Would need a separate call to get metadata:
Usage usage = getChatClient()
        .prompt(prompt) // Different prompt due to augmentation
        .call()
        .chatResponse()
        .getMetadata()
        .getUsage();
```

The issue is that .entity() doesn't provide any way to access the ChatResponse or its metadata from the same call.

Expected behavior

I would expect a way to access metadata (particularly token usage) when using structured output. Possible solutions could include:

  1. An enhanced method like:

         StructuredResponseWithMetadata<MyClass> result = response.entityWithMetadata(MyClass.class);
         MyClass data = result.getEntity();
         Usage usage = result.getMetadata().getUsage();

  2. Or making metadata accessible on the structured response itself:

         CallResponseSpec response = getChatClient().prompt(prompt).call();
         MyClass data = response.entity(MyClass.class);
         Usage usage = response.getLastCallMetadata().getUsage(); // Access metadata from the entity() call

The key need is to track token usage for cost management when using structured output.
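
To make the request concrete, here is a minimal, provider-agnostic sketch of the shape I have in mind: one model invocation whose result carries both the parsed entity and the token usage. Everything in it is hypothetical. `TokenUsage`, `RawReply`, `EntityWithUsage`, and `callOnce` are stand-ins I made up for illustration, not Spring AI types, and the model is a stub that returns fixed values.

```java
import java.util.function.Function;

// Hypothetical stand-ins for illustration only; none of these are Spring AI types.
record TokenUsage(int promptTokens, int completionTokens) {
    int totalTokens() {
        return promptTokens + completionTokens;
    }
}

// What a single model invocation returns: the raw text plus its usage.
record RawReply(String text, TokenUsage usage) {}

// Pairs the parsed entity with the usage reported for the same invocation.
record EntityWithUsage<T>(T entity, TokenUsage usage) {}

final class SingleCallStructuredOutput {

    // Invokes the model exactly once, then parses the text and keeps the usage.
    static <T> EntityWithUsage<T> callOnce(Function<String, RawReply> model,
                                           String prompt,
                                           String formatInstructions,
                                           Function<String, T> parser) {
        RawReply reply = model.apply(prompt + "\n" + formatInstructions);
        return new EntityWithUsage<>(parser.apply(reply.text()), reply.usage());
    }

    public static void main(String[] args) {
        // Stub model: returns a fixed JSON string and made-up token counts.
        Function<String, RawReply> stubModel =
                p -> new RawReply("{\"message\":\"Hello World\"}", new TokenUsage(12, 7));

        EntityWithUsage<String> result = callOnce(
                stubModel,
                "Return a JSON object with a 'message' field containing 'Hello World'",
                "Respond with JSON only.",
                text -> text); // identity parser; a real one would map JSON to a class

        System.out.println(result.entity());
        System.out.println("total tokens: " + result.usage().totalTokens());
    }
}
```

The point of the sketch is only that the format instructions are appended before the single invocation, so parsing and usage tracking both read from the same reply.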

Minimal Complete Reproducible example

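import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.metadata.Usage;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController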
public class TestController {

    @Autowired
    private ChatClient chatClient;

    @GetMapping("/test-double-call")
    public ResponseEntity<String> testDoubleCall() {
        String prompt = "Return a JSON object with a 'message' field containing 'Hello World'";

        // Time the total operation
        long totalStart = System.nanoTime();

        // Get the response spec
        ChatClient.CallResponseSpec response = chatClient.prompt(prompt).call();

        // First access - measure time
        long firstStart = System.nanoTime();
        Usage usage = response.chatResponse().getMetadata().getUsage();
        long firstEnd = System.nanoTime();
        System.out.println("First call (chatResponse) duration: " + (firstEnd - firstStart) / 1_000_000 + " ms");

        // Second access - measure time  
        long secondStart = System.nanoTime();
        String content = response.entity(String.class);
        long secondEnd = System.nanoTime();
        System.out.println("Second call (entity) duration: " + (secondEnd - secondStart) / 1_000_000 + " ms");

        long totalEnd = System.nanoTime();
        System.out.println("Total duration: " + (totalEnd - totalStart) / 1_000_000 + " ms");

        return ResponseEntity.ok("Check console logs - you'll see two long durations instead of one API call + fast cached access");
    }
}
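
Timing alone is an indirect signal, so a call counter may be a more reliable way to tell "two invocations" apart from "one invocation plus cached access". The sketch below shows the idea with plain Java only: `CallCounting` and `memoize` are hypothetical helpers, and the `Supplier` is a stand-in for the remote call. In a real test the counter would more likely sit on the outgoing HTTP requests.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class CallCounting {

    // Wraps a supplier so the delegate runs at most once; later calls reuse the value.
    static <T> Supplier<T> memoize(Supplier<T> delegate) {
        return new Supplier<T>() {
            private T value;
            private boolean computed;

            @Override
            public synchronized T get() {
                if (!computed) {
                    value = delegate.get();
                    computed = true;
                }
                return value;
            }
        };
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Stand-in for the remote call; each invocation bumps the counter.
        Supplier<String> remote = () -> {
            calls.incrementAndGet();
            return "{\"message\":\"Hello World\"}";
        };

        // Uncached: every access triggers the delegate again.
        remote.get();
        remote.get();
        System.out.println("uncached invocations: " + calls.get());

        // Memoized: the delegate runs once and the second access is served from memory.
        calls.set(0);
        Supplier<String> cached = memoize(remote);
        cached.get();
        cached.get();
        System.out.println("memoized invocations: " + calls.get());
    }
}
```

The memoized variant is the behavior I originally expected from reusing the same response object; the uncached variant matches what the timings above suggest is happening.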

Additional context

I understand that .entity() must perform prompt augmentation to include structured output instructions, so it makes sense that it can't reuse a previous response. However, this creates a gap in the API where structured output users cannot access token usage metadata.

Current workarounds are suboptimal:

  1. Using only chatResponse() and manually parsing JSON (loses structured output convenience)
  2. Making two separate calls with different prompts (doubles API costs)
  3. Using lower-level APIs (loses ChatClient benefits)

Is there a recommended pattern or planned feature to access metadata when using .entity() for structured output? This would be valuable for cost tracking and observability in production applications.