Google Gemini 3.1 Flash TTS Enhances Speech Control

Google has released Gemini 3.1 Flash TTS, a text-to-speech model that gives developers finer control over how generated speech sounds. The model is available in preview through the Gemini API, Google AI Studio, Vertex AI, and Google Vids. It introduces more than 200 audio tags that developers can embed in text prompts to adjust tone, pacing, accent, and emotion.

For example, developers can apply cues such as whispers, laughter, or curiosity to shape delivery, which allows a more deliberate, narrative-driven audio style. The model supports more than 70 languages, including Hindi, Japanese, and German, and ships with 30 prebuilt voices for faster deployment.
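As a rough illustration of embedding cues in a prompt, the sketch below builds a tagged script as plain text. The square-bracket tag syntax and the tag names ("whispers", "laughs") are assumptions made for illustration, and the helpers `tag` and `tagged_line` are hypothetical; check the current Gemini documentation for the supported tag list and format.

```python
def tag(cue: str) -> str:
    """Wrap a delivery cue in square brackets (assumed tag syntax)."""
    cleaned = cue.strip().lower()
    if not cleaned:
        raise ValueError("cue must not be empty")
    return f"[{cleaned}]"


def tagged_line(text: str, *cues: str) -> str:
    """Prefix a line of script with zero or more audio tags."""
    prefix = " ".join(tag(c) for c in cues)
    # strip() removes the leading space when no cues are given
    return f"{prefix} {text}".strip()


# Build a two-line script; each line carries its own delivery cue.
script = "\n".join([
    tagged_line("I have a secret to tell you.", "whispers"),
    tagged_line("You will never guess what happened!", "laughs"),
])
print(script)
```

Keeping the tags inline with the text means the whole script stays a single string that can be sent as one prompt.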

The model also supports multi-speaker dialogue natively, maintaining natural conversational flow within a single output. Developers can therefore create podcasts, scripted content, and voice assistants without making and stitching together a separate call for each voice. The model also ranks among the top performers on recent TTS leaderboards.
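To make the single-call idea concrete, here is a minimal sketch that assembles one request body covering two speakers. The field names (`multiSpeakerVoiceConfig`, `speakerVoiceConfigs`, `prebuiltVoiceConfig`) follow the REST request shape documented for earlier Gemini TTS preview models; that this model accepts the same shape is an assumption. The helper `build_multi_speaker_request`, the speaker names, and the voice names are illustrative only.

```python
import json


def build_multi_speaker_request(script: str, voices: dict[str, str]) -> dict:
    """Build one request body that maps each speaker label to a prebuilt voice.

    The field names are assumed from earlier Gemini TTS previews and may
    differ for this model.
    """
    speaker_configs = [
        {
            "speaker": speaker,
            "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice}},
        }
        for speaker, voice in voices.items()
    ]
    return {
        "contents": [{"parts": [{"text": script}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "multiSpeakerVoiceConfig": {"speakerVoiceConfigs": speaker_configs}
            },
        },
    }


# Speaker labels in the script must match the keys in the voice mapping.
dialogue = "Host: Welcome back to the show.\nGuest: Thanks for having me."
request_body = build_multi_speaker_request(
    dialogue, {"Host": "Kore", "Guest": "Puck"}  # voice names are illustrative
)
print(json.dumps(request_body, indent=2))
```

Both speakers travel in the same payload, so the dialogue comes back as one audio output rather than separate clips that need to be merged.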

Watermarking and Technical Capabilities

Google embeds a SynthID watermark in all generated audio to help identify AI-created content. The watermark is imperceptible and does not affect audio quality, so it supports content authenticity without reducing usability.

Developers can access the model through the gemini-3.1-flash-tts-preview ID in the Gemini API. It accepts up to 8,192 input tokens and produces up to 16,384 output tokens, limits that accommodate longer scripts and more complex audio generation tasks.
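One practical consequence of the input limit is that very long scripts may need to be split before submission. The sketch below shows one way to do that at paragraph boundaries. The four-characters-per-token estimate is a rough assumption, not the model's tokenizer, and the helpers `estimate_tokens` and `split_script` are hypothetical; a real integration should use the API's own token counting.

```python
import math

MODEL_ID = "gemini-3.1-flash-tts-preview"
MAX_INPUT_TOKENS = 8192
CHARS_PER_TOKEN = 4  # rough assumption; not the model's actual tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate the token count from the character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def split_script(script: str, limit: int = MAX_INPUT_TOKENS) -> list[str]:
    """Group paragraphs into chunks whose estimated size stays within the limit.

    A single paragraph larger than the limit is kept as its own oversized
    chunk; this sketch does not split inside a paragraph.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in script.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and estimate_tokens(candidate) > limit:
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


long_script = "\n\n".join(f"Paragraph {i}: " + "word " * 50 for i in range(5))
for i, chunk in enumerate(split_script(long_script, limit=150)):
    print(i, estimate_tokens(chunk))
```

Splitting on blank lines keeps each chunk a sequence of whole paragraphs, which avoids cutting a sentence in the middle of a generated clip.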

Expanding Google’s Voice AI Stack

This release follows the earlier launch of Gemini 3.1 Flash Live, which focused on real-time voice interactions. Together, the two releases extend Google’s voice AI stack across both real-time and generated speech.