Expected Behavior
When creating a UserMessage with an image URL using Media, I should be able to specify the image detail option ("low", "high", "auto"), which is supported by the OpenAI API for vision models like GPT-4o.
For example, I expect something like:
UserMessage.builder()
    .text("What do you see?")
    .media(List.of(Media.builder()
        .mimeType(MimeTypeUtils.IMAGE_PNG)
        .data(URI.create("https://example.com/image.png"))
        .detail("low") // <== This field doesn't currently exist
        .build()))
    .build();
The resulting request payload should include:
{
  "type": "image_url",
  "image_url": {
    "url": "https://example.com/image.png",
    "detail": "low"
  }
}
Current Behavior
Currently, the Media abstraction does not support setting a detail value. Even though the internal MediaContent.ImageUrl class accepts a detail parameter, the mapToMediaContent(...) function in OpenAiChatModel uses a constructor that sets it to null.
As a result, it is not possible to control image quality when using image URLs. This is a problem when optimizing for latency or when handling large/multiple images.
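To illustrate the mechanism described above, here is a minimal, self-contained sketch. The `ImageUrl` record and `toJson` helper are hypothetical stand-ins written for this example, not the actual Spring AI classes, and the assumption that a null detail is simply omitted from the payload is mine. It only shows how a convenience constructor that defaults detail to null leaves callers with no way to set it.

```java
// Stand-in types for illustration only; these are NOT the Spring AI classes.
public class DetailDropSketch {

    // Mirrors the shape of an image_url content part: a URL plus an optional detail.
    record ImageUrl(String url, String detail) {
        // Convenience constructor: callers that go through it can never set detail.
        ImageUrl(String url) {
            this(url, null);
        }
    }

    // Renders the part as JSON, omitting "detail" when it is null (an assumption of this sketch).
    static String toJson(ImageUrl image) {
        StringBuilder json = new StringBuilder();
        json.append("{\"type\":\"image_url\",\"image_url\":{\"url\":\"").append(image.url()).append("\"");
        if (image.detail() != null) {
            json.append(",\"detail\":\"").append(image.detail()).append("\"");
        }
        return json.append("}}").toString();
    }

    public static void main(String[] args) {
        String url = "https://example.com/image.png";
        System.out.println(toJson(new ImageUrl(url)));        // no "detail" key in the output
        System.out.println(toJson(new ImageUrl(url, "low"))); // output includes "detail":"low"
    }
}
```

If the mapping code only ever calls the one-argument constructor, the second form is unreachable from user code, which matches the behavior reported here.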
Context
I'm building a system using GPT-4o's multimodal capabilities and leveraging Spring AI for easier integration. When sending multiple or large images via URL, being able to reduce the image detail to "low" would improve performance.
However, without this feature:
- The full-size image is always sent
- We experience longer response times from the LLM
- We have no control over performance trade-offs
I am considering customizing OpenAiChatModel and overriding mapToMediaContent to manually inject the detail value, but this workaround adds unnecessary complexity.
If this feature could be added, either by extending the Media class or by offering a more flexible mapping hook, I'd be very happy to contribute.
Thanks again for your great work!
Comment From: sunyuhan1998
Indeed, it appears that this issue also exists in other models (e.g., Mistral, MiniMax, ZhiPu).
Comment From: dev-jonghoonpark
How about this solution?
I resolved the issue by creating an ImageWithDetail class that extends Media, allowing us to add detail data without significantly changing the existing structure.
If the maintainers find this approach acceptable, I will submit a PR implemented in this direction.
test code:
@Test
void imageWithDetail() throws IOException {
    var userMessage = UserMessage.builder()
        .text("Explain what do you see on this picture?")
        .media(List.of(ImageWithDetail.low(Media.builder()
            .mimeType(MimeTypeUtils.IMAGE_PNG)
            .data(URI.create("https://docs.spring.io/spring-ai/reference/_images/multimodal.test.png"))
            .build())))
        .build();

    ChatResponse response = this.chatModel
        .call(new Prompt(List.of(userMessage), OpenAiChatOptions.builder().model("gpt-4o").build()));

    logger.info(response.getResult().getOutput().getText());
    assertThat(response.getResult().getOutput().getText()).containsAnyOf("bananas", "apple", "bowl", "basket",
            "fruit stand");
}
The test results confirm that the intended detail value is included in the request.
{"type":"image_url","image_url":{"url":"https://docs.spring.io/spring-ai/reference/_images/multimodal.test.png","detail":"low"}}
ImageWithDetail.java:
public class ImageWithDetail extends Media {

    private final String detail;

    private ImageWithDetail(Media media, String detail) {
        super(media.getMimeType(), media.getData(), media.getId(), media.getName());
        this.detail = detail;
    }

    public static Media low(Media media) {
        return new ImageWithDetail(media, "low");
    }

    public static Media high(Media media) {
        return new ImageWithDetail(media, "high");
    }

    public static Media auto(Media media) {
        return new ImageWithDetail(media, "auto");
    }

    public String getDetail() {
        return detail;
    }
}
OpenAiChatModel.java:
private MediaContent mapToMediaContent(Media media) {
    var mimeType = media.getMimeType();
    if (MimeTypeUtils.parseMimeType("audio/mp3").equals(mimeType)) {
        return new MediaContent(
                new MediaContent.InputAudio(fromAudioData(media.getData()), MediaContent.InputAudio.Format.MP3));
    }
    if (MimeTypeUtils.parseMimeType("audio/wav").equals(mimeType)) {
        return new MediaContent(
                new MediaContent.InputAudio(fromAudioData(media.getData()), MediaContent.InputAudio.Format.WAV));
    }
    if (MimeTypeUtils.parseMimeType("application/pdf").equals(mimeType)) {
        return new MediaContent(new MediaContent.InputFile(media.getName(),
                this.fromMediaData(media.getMimeType(), media.getData())));
    }
    else if (media instanceof ImageWithDetail imageWithDetail) {
        return new MediaContent(new MediaContent.ImageUrl(this.fromMediaData(media.getMimeType(), media.getData()),
                imageWithDetail.getDetail()));
    }
    else {
        return new MediaContent(
                new MediaContent.ImageUrl(this.fromMediaData(media.getMimeType(), media.getData())));
    }
}
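For readers who want to try the subclass-and-dispatch idea without a Spring AI checkout, the following is a stripped-down sketch. The `Media` and `ImageWithDetail` classes here are simplified stand-ins that only carry a URI, and `detailFor` is a hypothetical helper name; none of this is the real library code. It isolates the one decision the mapping makes: use the detail when the media is an `ImageWithDetail`, otherwise fall back to null.

```java
import java.net.URI;

// Stand-in types for illustration only; these are NOT the Spring AI classes.
public class SubclassDispatchSketch {

    // Minimal stand-in for Media: just the data location.
    static class Media {
        private final URI data;

        Media(URI data) {
            this.data = data;
        }

        URI getData() {
            return data;
        }
    }

    // Subclass that carries the extra detail value, following the proposal above.
    static final class ImageWithDetail extends Media {
        private final String detail;

        private ImageWithDetail(Media media, String detail) {
            super(media.getData());
            this.detail = detail;
        }

        static Media low(Media media) {
            return new ImageWithDetail(media, "low");
        }

        String getDetail() {
            return detail;
        }
    }

    // Returns the detail to send, or null when the media is a plain Media.
    static String detailFor(Media media) {
        if (media instanceof ImageWithDetail imageWithDetail) {
            return imageWithDetail.getDetail();
        }
        return null;
    }

    public static void main(String[] args) {
        Media plain = new Media(URI.create("https://example.com/image.png"));
        System.out.println(detailFor(plain));                      // prints "null"
        System.out.println(detailFor(ImageWithDetail.low(plain))); // prints "low"
    }
}
```

Because the factory methods return the base `Media` type, existing call sites that accept `Media` keep compiling, and only the mapping code needs to know about the subclass.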
Comment From: sunyuhan1998
How about this solution?
I resolved the issue by creating an ImageWithDetail class that extends Media, allowing us to add detail data without significantly changing the existing structure.
I think it looks good. From the perspective of the Media class's original intent, directly modifying Media is not appropriate. Implementing a subclass seems to be a better approach. I really like your solution.
Comment From: weonest
How about this solution?
I resolved the issue by creating an ImageWithDetail class that extends Media, allowing us to add detail data without significantly changing the existing structure.
I think this is a great approach. Modifying the Media class directly wouldn't be appropriate given its original intent and design, so extending it via a subclass makes much more sense. Thanks so much for taking care of this. Really appreciate your help! 🙌