7/10
Model Release
9 Jun 2026, 17:00 UTC
Google DeepMind releases Gemini 3.5 Live Translate for real-time audio translation with voice preservation.
Native speech-to-speech models that preserve paralinguistic features such as tone and pace mark a significant shift away from cascaded ASR-LLM-TTS pipelines. With API access in Google AI Studio, developers can build low-latency, voice-native applications without orchestrating separate models themselves. This directly challenges the position of OpenAI's Realtime API in the voice modality.
What happened
Google DeepMind has announced the release of Gemini 3.5 Live Translate, a new audio-centric model designed for real-time, cross-language speech translation. Supporting over 70 languages, the model is immediately accessible to consumers via the Google Translate app and available to developers as an API preview within Google AI Studio.
Technical details
Unlike traditional translation stacks that rely on a cascaded pipeline (Automatic Speech Recognition -> Text Translation -> Text-to-Speech), Gemini 3.5 Live Translate operates natively in the audio domain. The key technical differentiator is its ability to preserve the speaker's original tone, pitch, and speaking pace in the translated output in real time. This points to a direct speech-to-speech architecture capable of processing and generating acoustic tokens with minimal latency, bypassing the information loss (such as emotion and emphasis) typically associated with intermediate text representations.
Why it matters
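To make the contrast concrete, the cascaded stack described under Technical details can be sketched in a few lines. This is a toy illustration, not Google's or any vendor's actual implementation: the `Audio` type, the three stage functions, the two-word dictionary, and the per-stage latency figures are all hypothetical stand-ins. It shows two properties attributed to cascades: prosody is discarded once speech becomes text, and, assuming the stages run strictly one after another, their latencies add up.

```python
from dataclasses import dataclass


@dataclass
class Audio:
    """Toy stand-in for an utterance (hypothetical type, for illustration only)."""
    words: str            # what was said
    pitch_hz: float       # crude proxy for tone/pitch
    words_per_sec: float  # crude proxy for speaking pace


# Hypothetical per-stage latencies in milliseconds (illustrative, not measured).
STAGE_LATENCY_MS = {"asr": 300, "translate": 200, "tts": 250}


def asr(audio: Audio) -> str:
    """Speech -> text. Only the words survive; pitch and pace are dropped here."""
    return audio.words


def translate(text: str) -> str:
    """Text -> text. A two-word dictionary stands in for a translation model."""
    toy_dictionary = {"hello": "hola", "friend": "amigo"}
    return " ".join(toy_dictionary.get(word, word) for word in text.split())


def tts(text: str) -> Audio:
    """Text -> speech. With no prosody input, the voice falls back to defaults."""
    return Audio(words=text, pitch_hz=120.0, words_per_sec=2.5)


def cascaded_translate(audio: Audio) -> tuple[Audio, int]:
    """Run the three stages in sequence and report the summed stage latency."""
    output = tts(translate(asr(audio)))
    total_latency_ms = sum(STAGE_LATENCY_MS.values())
    return output, total_latency_ms


speaker = Audio(words="hello friend", pitch_hz=210.0, words_per_sec=4.0)
result, latency_ms = cascaded_translate(speaker)
print(result.words, result.pitch_hz, result.words_per_sec, latency_ms)
```

The speaker's 210 Hz pitch and 4 words-per-second pace never reach `tts`, because the only thing passed between stages is a string. A native speech-to-speech model avoids that text bottleneck by design, which is the architectural claim made for Gemini 3.5 Live Translate.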
For engineers and product builders, this release simplifies the voice application stack. Managing latency, state, and prosody across separate ASR, LLM, and TTS models is notoriously difficult and computationally expensive. By exposing a unified, voice-preserving API in Google AI Studio, Google is significantly lowering the barrier to entry for building seamless, multilingual voice agents. This places Gemini 3.5 in direct competition with OpenAI's Realtime API (GPT-4o), signaling an intensifying race to dominate the native audio modality. The preservation of paralinguistic features is particularly critical for enterprise use cases like customer service, telehealth, and global gaming, where nuance and empathy are required.
What to watch next
Watch for developer benchmarks on the API's end-to-end latency and cost per second of audio compared to OpenAI's offerings. Additionally, monitor how well the model handles overlapping speech, background noise, and rapid language switching, all of which are common failure modes for real-time audio models. We should also look for potential integrations of this model into Google Workspace applications like Google Meet and into Android's OS-level live translation features.
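For readers planning their own comparison, the two metrics named above can be summarized with a small helper. This is a minimal sketch under stated assumptions: the `Sample` record and `summarize` function are hypothetical names, the p95 uses the nearest-rank method, and the numbers are made-up placeholders, not measurements of any Google or OpenAI API.

```python
import math
import statistics
from dataclasses import dataclass


@dataclass
class Sample:
    """One benchmark request (hypothetical record layout)."""
    audio_seconds: float  # length of the input audio
    latency_ms: float     # time from end of speech to first translated audio
    cost_usd: float       # billed cost of the request


def summarize(samples: list[Sample]) -> dict[str, float]:
    """Return median latency, nearest-rank p95 latency, and cost per audio second."""
    if not samples:
        raise ValueError("need at least one sample")
    latencies = sorted(s.latency_ms for s in samples)
    total_audio = sum(s.audio_seconds for s in samples)
    total_cost = sum(s.cost_usd for s in samples)
    # Nearest-rank percentile: the smallest value with at least 95% of samples at or below it.
    rank = max(1, math.ceil(0.95 * len(latencies)))
    return {
        "median_latency_ms": statistics.median(latencies),
        "p95_latency_ms": latencies[rank - 1],
        "cost_per_audio_second_usd": total_cost / total_audio,
    }


# Placeholder data only; replace with real measurements from each provider.
placeholder_samples = [
    Sample(audio_seconds=10, latency_ms=400, cost_usd=0.02),
    Sample(audio_seconds=20, latency_ms=600, cost_usd=0.04),
    Sample(audio_seconds=10, latency_ms=500, cost_usd=0.02),
    Sample(audio_seconds=10, latency_ms=900, cost_usd=0.02),
]
summary = summarize(placeholder_samples)
print(summary)
```

Reporting a tail percentile alongside the median matters for conversational use, since occasional slow responses are what users notice. Running the same script against recordings with overlapping speech or background noise would also cover the failure modes listed above.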
gemini
audio-models
speech-to-speech
api-release
google-deepmind