Expected Behavior
The Spring AI OpenAiAudioSpeechModel should allow callers to provide an optional instructions string that is forwarded to OpenAI’s POST /v1/audio/speech request body only for models that support it (e.g., gpt-4o-mini-tts). When the selected model does not support instructions (tts-1, tts-1-hd), the client should either ignore the field with a warning or return a documented error, remaining backward-compatible.
Pseudo-code:

```java
var options = OpenAiAudioSpeechOptions.builder()
        .model("gpt-4o-mini-tts")
        .voice("verse")
        .responseFormat(OpenAiAudioApi.SpeechRequest.AudioResponseFormat.PCM)
        .instructions("Friendly; warm tone; natural pauses; ~1.1x feel") // proposed new option
        .build();
```
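If the library chooses the "ignore with a warning" behavior for unsupported models, the guard could look roughly like this minimal sketch. The class and method names (`InstructionsSupport`, `sanitizeInstructions`) are illustrative, not existing Spring AI API; only the unsupported model IDs (`tts-1`, `tts-1-hd`) come from OpenAI's documentation:

```java
import java.util.Set;

public class InstructionsSupport {

    // Per OpenAI's docs, tts-1 and tts-1-hd do not accept the instructions field.
    private static final Set<String> UNSUPPORTED = Set.of("tts-1", "tts-1-hd");

    /**
     * Returns the instructions unchanged for models that support them,
     * or null (after logging a warning) for models that do not.
     */
    public static String sanitizeInstructions(String model, String instructions) {
        if (instructions != null && UNSUPPORTED.contains(model)) {
            System.err.println("Warning: model '" + model
                    + "' does not support instructions; dropping them.");
            return null;
        }
        return instructions;
    }
}
```

The alternative, a documented error, would simply throw an `IllegalArgumentException` in the same branch; either way the behavior stays backward-compatible for callers that never set the field.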
Current Behavior
OpenAiAudioSpeechModel exposes model/voice/format parameters but there is no way to pass instructions through to OpenAI. Style guidance must be embedded into the input text, which the TTS may read literally or interpret inconsistently. The OpenAI REST API already documents instructions (not supported by tts-1 / tts-1-hd), but Spring AI currently drops it.
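For reference, the documented REST request body already carries the field; a request with style guidance would look roughly like the following (field names per OpenAI's `POST /v1/audio/speech` reference, values illustrative):

```json
{
  "model": "gpt-4o-mini-tts",
  "input": "Thanks for calling! How can I help you today?",
  "voice": "verse",
  "response_format": "pcm",
  "instructions": "Friendly; warm tone; natural pauses"
}
```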
Context
- Impact: We generate real-time speech and need consistent prosody (e.g., friendly, upbeat, slight smile, natural pauses). Without instructions, we rely on punctuation hacks or meta-text in input, which hurts intelligibility.
- What we’re trying to accomplish: Pass a non-spoken style hint that the provider uses to shape delivery.
- Alternatives considered:
  - Embedding style in input → gets spoken or yields unstable results.
  - Custom HTTP client bypassing Spring AI → loses the benefits of Spring AI’s abstraction and configuration.
- Workarounds: None clean within OpenAiAudioSpeechModel.
Proposed change (high level):
- Add instructions (nullable) to the audio speech options/request type.
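
A rough sketch of the options change, following the builder style of the pseudo-code above. This is illustrative only, not the actual Spring AI source; the class is trimmed to the fields relevant here:

```java
// Illustrative fragment of an options type with the proposed nullable field.
public class OpenAiAudioSpeechOptions {

    private final String model;
    private final String voice;
    private final String instructions; // nullable; omitted from the request body when null

    private OpenAiAudioSpeechOptions(Builder b) {
        this.model = b.model;
        this.voice = b.voice;
        this.instructions = b.instructions;
    }

    public String getModel() { return model; }
    public String getVoice() { return voice; }
    public String getInstructions() { return instructions; }

    public static Builder builder() { return new Builder(); }

    public static class Builder {
        private String model;
        private String voice;
        private String instructions;

        public Builder model(String model) { this.model = model; return this; }
        public Builder voice(String voice) { this.voice = voice; return this; }
        public Builder instructions(String instructions) { this.instructions = instructions; return this; }
        public OpenAiAudioSpeechOptions build() { return new OpenAiAudioSpeechOptions(this); }
    }
}
```

Because the field defaults to null and is only serialized when present, existing callers and the tts-1 / tts-1-hd models are unaffected.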