What are audio tags in Gemini 3.1 Flash TTS?

Audio tags are natural language commands embedded in text input to control vocal style, pace, and delivery, like [whisper] or [excited]. They offer intuitive voice control without needing XML-like markup.

How does Gemini 3.1 Flash TTS compare to ElevenLabs?

Gemini 3.1 Flash TTS scores 1,211 Elo on the TTS leaderboard and supports 70+ languages with SynthID watermarking. However, it lacks the micro-expressions and breath sounds that make ElevenLabs feel more human.

Is Gemini 3.1 Flash TTS available for developers?

Yes, it’s in preview on the Gemini API, Google AI Studio, Vertex AI, and Google Vids. Pricing details are not fully public yet, and the audio tags feature may still have inconsistencies.

Gemini 3.1 Flash TTS: Google’s AI Speech Model with Real Voice Control

Google just released Gemini 3.1 Flash TTS, and honestly, it’s the most interesting AI speech model I’ve seen in a while. Not because it sounds more natural—though it does—but because they finally gave developers real control over how the voice sounds.

The model is rolling out now in preview on the Gemini API, Google AI Studio, Vertex AI for enterprises, and Google Vids for Workspace users. So if you’re building something with TTS, you can start poking at it today.

What makes it different?

The big new feature here is something Google calls “audio tags.” Basically, you embed natural language commands directly into the text input to control vocal style, pace, and delivery. Think of them as inline stage directions for speech: you write [whisper] or [fast] or [excited] in the text, and the model follows those cues.

This isn’t entirely new territory. Amazon Polly has had SSML tags for years, and ElevenLabs has some style controls. But Google’s approach feels more intuitive because it uses plain English commands rather than XML-like markup. You don’t need to learn a separate syntax—just write what you want.
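To make the contrast concrete, here’s a minimal sketch that renders the same whispered line both ways. The bracketed form follows the [whisper] example above; the SSML form uses Amazon Polly’s whispered effect element. Both helper names are made up for illustration, and nothing here calls either service.

```python
from xml.sax.saxutils import escape


def as_audio_tag(text: str, tag: str) -> str:
    """Prefix text with a bracketed plain-English cue, e.g. [whisper]."""
    return f"[{tag}] {text}"


def as_polly_ssml_whisper(text: str) -> str:
    """Wrap text in Amazon Polly's whispered effect inside an SSML <speak> root."""
    # SSML is XML, so &, < and > in the text must be escaped first.
    return (
        "<speak>"
        f'<amazon:effect name="whispered">{escape(text)}</amazon:effect>'
        "</speak>"
    )


line = "Don't tell anyone, but the demo is live."
print(as_audio_tag(line, "whisper"))
print(as_polly_ssml_whisper(line))
```

The tag version is just a string with a cue in front of it, while the SSML version needs a root element, a namespaced tag, and XML escaping. That difference is most of the ergonomic argument.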

On the quality front, Gemini 3.1 Flash TTS scored an Elo of 1,211 on the Artificial Analysis TTS leaderboard. For context, that benchmark runs thousands of blind preference tests with humans. Google claims this puts them in the “most attractive quadrant” for balancing quality and cost. I’d take those leaderboard claims with a grain of salt—they’re useful for comparison but don’t always translate to real-world use cases.
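For a rough sense of what an Elo gap means in practice, the standard Elo formula converts a rating difference into an expected head-to-head preference rate. The sketch below assumes the leaderboard follows that standard formula with the usual 400-point scale, which I haven’t verified; the 1,150 rival rating is a made-up number, not a real leaderboard entry.

```python
def expected_preference(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation: probability that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


gemini_elo = 1211          # figure quoted above
hypothetical_rival = 1150  # illustrative only, not a real entry

p = expected_preference(gemini_elo, hypothetical_rival)
print(f"Expected preference rate: {p:.1%}")
```

Under those assumptions, a 61-point lead works out to listeners preferring the higher-rated model a little under 59% of the time. That’s a real edge, but far from a landslide.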

The language and watermarking story

The model supports 70+ languages, and Google’s coverage tends to be broader than most competitors. If you need something in Swahili or Welsh, you’re probably better off here than with most alternatives.

All generated audio gets watermarked with SynthID. That’s Google’s invisible watermarking tech, which embeds an imperceptible signal in the audio itself. It’s designed to survive compression, speed changes, and even background noise. I’ve tested SynthID on images before, and it’s surprisingly robust. Audio watermarking is trickier, but Google has been working on this for a while.

Where it falls short

Let’s be real: this is a preview. The audio tags feature is new, and I’ve seen demos where the model ignores certain tags or misinterprets them. Google says they’re still tuning the system, so expect some inconsistency.
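Given that inconsistency, one defensive habit is to lint your scripts before sending them, so you only rely on tags you’ve actually heard work. Here’s a minimal sketch, assuming tags are short bracketed words or phrases like the examples above. The allowlist is a placeholder you’d fill in from your own listening tests, not an official list.

```python
import re

# Tags you have verified by ear; placeholder values, not an official list.
VERIFIED_TAGS = {"whisper", "fast", "excited"}

# Matches anything between a pair of square brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_tags(script: str) -> list[str]:
    """Return every bracketed tag in the script, lowercased, in order."""
    return [m.strip().lower() for m in TAG_PATTERN.findall(script)]


def unverified_tags(script: str) -> list[str]:
    """Return tags missing from VERIFIED_TAGS, in order, without duplicates."""
    flagged: list[str] = []
    for tag in find_tags(script):
        if tag not in VERIFIED_TAGS and tag not in flagged:
            flagged.append(tag)
    return flagged


script = "[excited] We shipped it! [sighs] Finally. [Whisper] Don't jinx it."
print(unverified_tags(script))
```

Anything the check flags is a candidate for a quick listening test before it goes anywhere near production.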

Also, the pricing details aren’t fully public yet. The Vertex AI preview page mentions per-character billing, but the exact rates aren’t listed. If you’re building a production system, you’ll need to wait for the GA pricing.
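Until rates are published, you can still wire up the arithmetic. The sketch below assumes billing is a flat price per million input characters and counts every character, tags included; both are assumptions on my part. The rate is a placeholder I made up, so swap in the real figure once Google lists it.

```python
from decimal import Decimal

# Placeholder rate in USD per 1,000,000 characters. NOT a published price.
HYPOTHETICAL_USD_PER_MILLION_CHARS = Decimal("10.00")


def estimate_cost(text: str, usd_per_million: Decimal) -> Decimal:
    """Estimate cost in USD under flat per-character billing."""
    return Decimal(len(text)) * usd_per_million / Decimal(1_000_000)


chapter = "x" * 30_000  # stand-in for a 30,000-character chapter
print(estimate_cost(chapter, HYPOTHETICAL_USD_PER_MILLION_CHARS))
```

Using Decimal rather than floats keeps the totals exact when you start summing thousands of small requests.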

And while the multi-speaker dialogue support is nice for podcasts or audiobooks, the voices still lack the micro-expressions and breath sounds that make ElevenLabs or Descript feel more human. It’s expressive, sure, but you wouldn’t mistake it for a real person yet.

Should you care?

If you’re a developer building voice assistants, narration tools, or accessibility features, yes. The audio tags alone make this worth trying—being able to script a voice that sounds excited, then calm, then whispering, all in one API call, is genuinely useful.
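As a rough illustration of that single-call idea, the helper below stitches styled segments into one tagged string that you would pass as the text input of one request. The function name and segment format are my own invention. Only the bracketed-tag convention comes from Google’s description, and a tag like [calm] is a plain-English guess you’d need to verify.

```python
def build_script(segments: list[tuple[str, str]]) -> str:
    """Join (tag, text) pairs into one input string with inline audio tags.

    An empty tag means "no style change" for that segment.
    """
    parts = []
    for tag, text in segments:
        text = text.strip()
        parts.append(f"[{tag}] {text}" if tag else text)
    return " ".join(parts)


script = build_script([
    ("excited", "We just hit one million downloads!"),
    ("calm", "Thank you to everyone who made it happen."),
    ("whisper", "The next release is even bigger."),
])
print(script)
```

Keeping the segments as data rather than a hand-typed string makes it easy to swap a tag out when the model ignores it.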

If you’re a content creator looking for the most natural-sounding AI voice, you might still prefer ElevenLabs or Play.ht for now. But Google is pitching its model on cost, even though exact rates aren’t out yet, and it integrates directly with their ecosystem, which matters if you’re already using Google Cloud.

I’ve been playing with it in AI Studio this morning. The latency is decent—about 500ms for a short sentence—and the voice quality is noticeably better than the previous Gemini TTS model. Not groundbreaking, but a solid step forward.
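If you want a number for your own setup rather than my impression from one morning, a small timing harness is enough. This sketch times any callable with time.perf_counter and reports the median over a few runs. The fake_synthesize function is a stand-in that just sleeps, so replace it with your real TTS call.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(
    synthesize: Callable[[str], object], text: str, runs: int = 5
) -> float:
    """Call synthesize(text) several times and return the median wall time in ms."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        synthesize(text)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def fake_synthesize(text: str) -> bytes:
    """Stand-in for a real TTS request; sleeps briefly and returns no audio."""
    time.sleep(0.01)
    return b""


print(f"{median_latency_ms(fake_synthesize, 'Hello there.'):.0f} ms")
```

Note that this measures the full round trip of whatever you pass in, network included, so results will vary with region and connection.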

One thing I’ll note: the documentation is still sparse. Google’s API docs for the audio tags aren’t complete yet, and you’ll need to experiment to figure out what works. Typical Google launch—ship first, document later.

Overall, Gemini 3.1 Flash TTS is a welcome addition to the TTS landscape. It’s not perfect, but the audio tags feature is genuinely novel and could change how developers think about speech generation. I’ll be watching to see how quickly Google iterates on this.
