Advanced audio dialog and generation with Gemini 2.5 – Google Blog
Jun 03, 2025
Here’s a closer look at what’s new in Gemini 2.5 for audio dialog and generation.
Gemini is built from the ground up to be multimodal, natively understanding and generating content across text, images, audio, video and code. At I/O we showed how Gemini 2.5 marks a significant step forward with new capabilities in AI-powered audio dialog and generation.
We’re already using these models to bring audio to users globally, across numerous products, prototypes and languages. NotebookLM’s Audio Overviews and Project Astra are just two examples. Here’s a closer look at what you can do with Gemini 2.5 native audio capabilities.
Human conversation is rich and nuanced, with meaning conveyed not just by what is said, but how it’s spoken — through tone, accent and even non-speech vocalizations, like laughter. We believe conversation will be a key way we interact with AI. That’s why Gemini reasons and generates speech natively in audio, enabling effective, real-time communication.
Native audio dialog with Gemini 2.5 Flash preview features:
Text-to-speech technology is evolving rapidly, and with our latest models we’re moving beyond naturalness to give you unprecedented control over generated audio. Now you can generate anything from short snippets to long-form narratives, precisely dictating style, tone, emotional expression and performance, all steerable through natural language prompts.
Additional controls and capabilities include:
For controllable speech generation (TTS), choose Gemini 2.5 Pro Preview for state-of-the-art quality on complex prompts, or Gemini 2.5 Flash Preview for cost-efficient everyday applications. With either model, developers can dynamically create audio for announcements, stories, podcasts, video games and more.
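As a rough illustration of steering speech with a natural language prompt, here is a minimal Python sketch. It is not taken from this post and rests on several assumptions: that the `google-genai` SDK is installed, that an API key is available in the environment, that the model ID `gemini-2.5-flash-preview-tts` and the voice name `Kore` are valid for your account, and that the API returns raw 16-bit mono PCM at 24 kHz. The helper names `build_styled_prompt`, `save_pcm_as_wav` and `synthesize` are hypothetical and exist only for this example.

```python
import wave


def build_styled_prompt(style: str, text: str) -> str:
    """Prefix the text to be spoken with a natural-language style instruction."""
    return f"{style.strip()}: {text.strip()}"


def save_pcm_as_wav(path: str, pcm: bytes, sample_rate: int = 24000,
                    channels: int = 1, sample_width: int = 2) -> None:
    """Wrap raw PCM bytes in a WAV container.

    The defaults (24 kHz, mono, 16-bit) are an assumption about the
    audio format returned by the API.
    """
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)


def synthesize(prompt: str,
               model: str = "gemini-2.5-flash-preview-tts",  # assumed model ID
               voice: str = "Kore") -> bytes:                # assumed voice name
    """Request audio for the prompt and return the raw audio bytes."""
    # Imported lazily so the helpers above work without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client()  # expects an API key in the environment
    response = client.models.generate_content(
        model=model,
        contents=prompt,
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name=voice
                    )
                )
            ),
        ),
    )
    return response.candidates[0].content.parts[0].inline_data.data


if __name__ == "__main__":
    prompt = build_styled_prompt(
        "Say in a warm, unhurried voice",
        "Welcome aboard. We will be departing shortly.",
    )
    save_pcm_as_wav("announcement.wav", synthesize(prompt))
```

The style instruction is ordinary text in the prompt, so changing the delivery means editing a sentence, not a parameter. Under the same assumptions, switching the `model` argument to the Pro preview's TTS model ID would trade cost for quality on more complex prompts.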
We’ve proactively assessed potential risks throughout every stage of the development process for these native audio features, using what we’ve learned to inform our mitigation strategies. We validate these measures through rigorous internal and external safety evaluations, including comprehensive red teaming for responsible deployment. Additionally, all audio outputs from our models are embedded with SynthID, our watermarking technology, to ensure transparency by making AI-generated audio identifiable.
We’re bringing native audio outputs to Gemini 2.5 models, giving developers new capabilities to build richer, more interactive applications via the Gemini API in Google AI Studio or Vertex AI.
To begin exploring, developers can try native audio dialog with Gemini 2.5 Flash preview in Google AI Studio’s stream tab. Controllable speech generation (TTS) is available in preview for both Gemini 2.5 Pro and Flash by selecting speech generation in the generate media tab within Google AI Studio.