Mistral Voxtral TTS - Open-Source Voice AI That Fits on Your Phone

I was today days old when I learned that Mistral quietly dropped a text-to-speech model that they claim beats ElevenLabs — and the whole thing is open-weight, free to download, and small enough to run on a smartphone.

The model is called Voxtral TTS (Hugging Face repo: mistralai/Voxtral-4B-TTS-2603), and Mistral released it on March 26, 2026, less than two weeks ago. Quick context: Mistral's Voxtral family started life back in July 2025 as a speech understanding model (speech-to-text). This new release is the TTS sibling: it goes the other direction, turning text into natural-sounding voice.

Why this is a big deal:

It's actually open. Voxtral TTS is licensed Apache 2.0 and the weights are right there on Hugging Face. Download it, run it locally, fine-tune it, ship it in your app. No API keys, no per-character billing, no rate limits. Compare that to ElevenLabs, where realistic voice generation lives behind a subscription paywall and metered usage.
Mistral says it beats ElevenLabs. In their announcement, Mistral claims Voxtral TTS matches or outperforms the leading closed-source TTS providers on naturalness and expressiveness benchmarks. That's a bold claim coming from an open model, and it's the kind of thing the community will absolutely stress-test over the next few weeks. For now, it's a "Mistral says" claim — but if it holds up, it's a watershed moment for open voice AI.
It runs on your phone. This is the part that blew my mind. Voxtral TTS is a 4-billion-parameter model — small enough to fit on consumer hardware, including smartphones. Think about that: the same quality of voice synthesis that used to require a cloud GPU can now live on the device in your pocket. Offline voice assistants, on-device audiobook narration, accessibility tools that don't need internet — all suddenly possible.
~70ms latency for streaming. Mistral reports roughly 70 milliseconds of latency for a 10-second audio sample, with full streaming support. That's the kind of responsiveness you need for real-time applications: live dubbing, interactive agents, game dialogue, you name it.
Nine languages out of the box. English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic — all supported natively in a single model. No per-language plan, no extra modules. Just pick a language and go.
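
To put the "4 billion parameters fits on a phone" point in numbers, here is a back-of-the-envelope sketch of how much storage the weights alone would need at a few common precisions. The parameter count comes from the model name; the precisions listed are generic assumptions for illustration, not formats Mistral has said it ships, and the estimate ignores activations, audio buffers, and runtime overhead.

```python
# Rough weight-only memory estimate for a 4-billion-parameter model.
# Assumption: the precisions below are generic examples, not Mistral's
# published formats. Activations and runtime overhead are ignored.
PARAMS = 4_000_000_000  # parameter count implied by "Voxtral-4B"

BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16/bf16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_footprint_gb(params: int, bytes_per_param: float) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


for name, width in BYTES_PER_PARAM.items():
    print(f"{name:>9}: {weight_footprint_gb(PARAMS, width):.1f} GB")
```

At 16-bit precision that works out to about 8 GB of weights, and about 2 GB at 4-bit, which is why quantization tends to matter so much for whether a model this size is comfortable on a phone.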

The "today days old" angle here writes itself. For the last few years, the best text-to-speech has been a closed ecosystem — you paid ElevenLabs or one of the cloud giants, and that was basically your only option if you wanted voices that didn't sound like a GPS from 2008. Now there's a credible open alternative that you can actually run on a laptop, a Raspberry Pi, or a phone.

And because it's Apache 2.0, the next wave of hobby projects, indie apps, and weird fun experiments just got a brand-new building block. Voice notes that read themselves back. Local language-learning apps with native-sounding pronunciation. Dungeons & Dragons tools that narrate your campaign out loud. Accessibility readers that work on a plane. All of it suddenly possible without sending a single character to a cloud API.

If you want to play with it, Hugging Face has the weights, Mistral has a hosted API for folks who'd rather not self-host, and the model card has example code to get you speaking in ~10 lines of Python.
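
If you just want the weights on disk, a minimal sketch using the `huggingface_hub` library's `snapshot_download` could look like the following. It assumes the repo ID mentioned above is public and not gated, and that `huggingface_hub` is installed; the `voxtral-tts` target folder and the `download_weights` helper are hypothetical names chosen for illustration. This only fetches files. For actually generating speech, follow the example code on the model card.

```python
# Minimal sketch: fetch the Voxtral TTS weights from Hugging Face.
# Assumptions: the repo is public and ungated, and huggingface_hub is
# installed (pip install huggingface_hub). The helper name and the
# default folder are hypothetical, chosen for this example.
REPO_ID = "mistralai/Voxtral-4B-TTS-2603"  # repo ID as given in this post


def download_weights(local_dir: str = "voxtral-tts") -> str:
    """Download every file in the repo and return the local folder path."""
    # Imported here so the module loads even if the library is missing.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


# Usage (downloads several gigabytes, so it is left commented out):
# path = download_weights()
# print("Weights saved to", path)
```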

Hear it for yourself:

Interactive demo on Hugging Face — Type in any text, pick from 20 built-in voices (or upload a 5–30 second clip to clone your own), and download the result. No API key, no account, no developer background required: huggingface.co/spaces/mistralai/voxtral-tts-demo
Mistral's official voice-emulation showcase — Mistral's announcement page has a “Voice Emulation” section with eight preset voices (Margaret, Sanchit, Angele, Gustavo, Khyathi, Nick, Yassir, Patrick) you can click through to hear the range: mistral.ai/news/voxtral-tts#voice-emulation
Cascaded speech-to-speech translation demo — Same announcement page, further down, has a cross-lingual demo where an English speaker's voice gets preserved while the language switches to French: mistral.ai/news/voxtral-tts#cascaded-speech-to-speech-translation
Community sample page — voxtral-tts.com has curated examples (podcast intro, customer-support script, voice-clone comparison, Spanish/French) all generated with Voxtral TTS, no cherry-picking.

Check it out: Mistral announcement • Hugging Face model card • Mistral TTS API docs
