cantari
Open benchmark

Every engine, scored on the same script.

No vendor marketing. Ranked on third-party Quality Elo from the Artificial Analysis Speech Arena, with our own measured latency, languages, cloning, price, and rights in the open. Filter by what you are building.

Scores from an independent arena, latencies we measured ourselves, engine costs in the open: this is how you know the routing is on your side, not the vendor’s. We put it all in one place so the trade-offs are yours to judge.

Showing all 5 engines, unfiltered, in leaderboard order (new engines last, unranked).

Gemini Flash · Acted scripts, audiobooks, dramatic reads · 2770ms
Grok Voice · Character and persona reads in English · 2444ms
Kokoro · Drafts, high volume, cost-sensitive jobs · 973ms
MAI Voice 2 · Styled English reads with speed and intensity control · 2426ms
Zonos · American and British accent reads in English · 4523ms

Latency: our own measurement 2026-06-10 on the same routed path that serves the studio, not a server SLA. Quality Elo: from the Artificial Analysis Speech Arena (full leaderboard below).

01 · Gemini Flash · Acted scripts, audiobooks, dramatic reads · 1225 Elo · 2770ms (measured 2026-06-10) · 24 langs
02 · Grok Voice · Character and persona reads in English · 1197 Elo · 2444ms (measured 2026-06-10) · 1 lang
03 · Kokoro · Drafts, high volume, cost-sensitive jobs · 1060 Elo · 973ms (measured 2026-06-10) · 8 langs
04 · MAI Voice 2 · Styled English reads with speed and intensity control · 1007* Elo · 2426ms (measured 2026-06-10) · 1 lang
05 · Zonos · American and British accent reads in English · 1000* Elo · 4523ms (measured 2026-06-10) · 1 lang

Quality Elo from the Artificial Analysis Speech Arena, retrieved June 10, 2026. For context, the top-rated model overall is Fun-Realtime-TTS at 1228.06. Latencies are our own measured wall-clock numbers.

* MAI Voice 2: Score is for MAI-Voice-1; MAI-Voice-2 is not yet arena-rated.

* Zonos: Baseline rating with limited arena votes so far.

Latency: our own measurement 2026-06-10 on the same routed path that serves the studio (script: "The northern lights drifted across the sky, slow and silent, like breathing.") - not a server SLA. Quality Elo: third-party, from the Artificial Analysis Speech Arena (see attribution above).

Fastest engine (Kokoro): 973ms
Most expressive (Gemini): 2770ms
Engines measured: 5
Last measured: 2026-06-10
How we score

The same test, every time.

One script, measured the same way across every engine so the numbers are comparable and reproducible.

Step 1: Same script, every engine

Every engine receives the identical passage: "The northern lights drifted across the sky, slow and silent, like breathing." No engine gets a head start.

Step 2: Measure wall-clock latency

We record the wall-clock time from sending the request to receiving the full audio bytes, through our routing gateway, measured locally. Each engine gets three runs and we publish the median. This is not a server SLA.
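The measurement in Step 2 can be sketched as follows. This is a minimal illustration, not our production harness: `median_latency_ms`, `synthesize`, and `fake_engine` are hypothetical names, and `synthesize` stands in for any blocking call that returns only once the full audio bytes have arrived.

```python
import statistics
import time
from typing import Callable

SCRIPT = (
    "The northern lights drifted across the sky, "
    "slow and silent, like breathing."
)


def median_latency_ms(
    synthesize: Callable[[str], bytes],
    script: str = SCRIPT,
    runs: int = 3,
) -> float:
    """Time `runs` requests end to end and return the median in milliseconds.

    `synthesize` is a hypothetical client call (an assumption for this
    sketch); it must block until all audio bytes have been received.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        audio = synthesize(script)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if not audio:
            raise ValueError("engine returned no audio")
        samples.append(elapsed_ms)
    return statistics.median(samples)


def fake_engine(text: str) -> bytes:
    """Stand-in engine for illustration: waits briefly, returns dummy bytes."""
    time.sleep(0.01)
    return b"\x00" * 16


if __name__ == "__main__":
    print(f"median latency: {median_latency_ms(fake_engine):.0f}ms")
```

With three runs the median is simply the middle value, so one slow outlier caused by network or load does not move the published number.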

Step 3: Quality from a third-party arena

Quality Elo comes from the Artificial Analysis Speech Arena, a user-vote arena where listeners blind-compare engines. We do not score quality ourselves, so no engine grades its own homework.

Step 4: Publish in the open

Results are dated (last run: 2026-06-10) and re-run when engines change.

Questions

Method and questions.

How is latency measured?
Wall-clock time from request to full audio bytes, measured locally through our routing gateway on the same routed speech endpoint that serves the studio. Each engine ran the same short script three times and the median was recorded (2026-06-10). This is not a server SLA: network and load conditions on the day affect the result.
Where does the quality score come from?
The Quality Elo is a third-party number from the Artificial Analysis Speech Arena (retrieved June 10, 2026), a public arena where listeners blind-compare engines and the votes produce an Elo rating. We do not invent or self-score it. Our Gemini engine rates 1225, within about three Elo of Fun-Realtime-TTS at 1228.06, the top of all roughly 85 rated models. Two of our engines carry caveats: MAI is shown with MAI-Voice-1's rating because MAI-Voice-2 is not yet arena-rated, and Zonos's rating is a baseline with limited votes so far. Both caveats are footnoted on the table.
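To get a feel for what an Elo gap means, here is a rough sketch that converts a rating difference into an expected preference rate. It assumes the standard logistic Elo model with a 400-point scale; the arena's exact rating method is not described here, so treat the output as an approximation. `expected_score` is a hypothetical helper name.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Chance that A is preferred over B under the standard logistic Elo model.

    Assumption: the 400-point scale of classic Elo. The arena may use a
    different scale or fitting method.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    top_model = 1228.06   # Fun-Realtime-TTS
    gemini_flash = 1225.0
    kokoro = 1060.0

    print(f"top model vs Gemini Flash: {expected_score(top_model, gemini_flash):.3f}")
    print(f"Gemini Flash vs Kokoro:    {expected_score(gemini_flash, kokoro):.3f}")
```

Under that assumption a gap of about three points is close to a coin flip in a blind comparison, while a gap of well over a hundred points is a clear listener preference.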
How is latency different from the quality score?
Latency is our own objective measurement: how long until the audio arrived on the path our studio uses. The Quality Elo is a third-party perceptual rating: how human listeners judged the voice in blind comparisons. A fast engine can sound flat; a slow engine can be expressive. The benchmark shows both so you can trade them off for your use case.
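One way to make that trade-off concrete is to set a latency budget and take the highest-rated engine that fits. The sketch below uses the figures from the leaderboard above; `Engine` and `best_within_budget` are hypothetical names for illustration, and the `provisional` flag marks the two footnoted ratings.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Engine:
    name: str
    elo: int
    latency_ms: int
    provisional: bool  # True when the rating carries a footnoted caveat


# Figures as published on 2026-06-10.
ENGINES: List[Engine] = [
    Engine("Gemini Flash", 1225, 2770, False),
    Engine("Grok Voice", 1197, 2444, False),
    Engine("Kokoro", 1060, 973, False),
    Engine("MAI Voice 2", 1007, 2426, True),
    Engine("Zonos", 1000, 4523, True),
]


def best_within_budget(
    max_latency_ms: int, engines: List[Engine] = ENGINES
) -> Optional[Engine]:
    """Return the highest-Elo engine at or under the budget, or None."""
    candidates = [e for e in engines if e.latency_ms <= max_latency_ms]
    return max(candidates, key=lambda e: e.elo, default=None)


if __name__ == "__main__":
    for budget in (1000, 2500, 3000):
        pick = best_within_budget(budget)
        print(budget, pick.name if pick else "no engine fits")
```

This ranks on Elo alone. Language coverage, price, and rights would be further filters in a real selection, and a provisional rating deserves a listen before you rely on it.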
How often is it updated?
The benchmark is re-run when an engine changes its model or pricing, or when a new engine is added. The date is always shown (current: 2026-06-10) so you know how fresh the data is. If an engine changes significantly between runs, both the old and new readings are noted.
Which engine should I pick for audiobooks?
Gemini Flash leads on Quality Elo and emotional consistency because it follows bracketed cues across long files. Grok Voice works well for character-driven English reads. Kokoro is the cost-effective option for drafts and high-volume work. MAI Voice 2 and Zonos sit lower on the arena, so listen to their samples before committing a long project to them.
Can I use these engines commercially?
Gemini Flash, Grok Voice, and MAI Voice 2 allow commercial use as served on Cantari; outputs are yours. Kokoro and Zonos are Apache-2.0 models, a license that permits commercial use. The per-engine rights are shown on each model card.
Keep comparing