Research · June 10, 2026 · 3 min read

The open voice benchmark, and why we run it

Vendor demos tell you which engine sounds best in the vendor's hands. We wanted numbers nobody here can tilt.

Last updated June 10, 2026

Every vendor wins its own demo

Shop for an AI voice engine and you will hear the same word everywhere: best. The most natural voice. The most lifelike speech. Every vendor says it, and every vendor has a demo reel to prove it. The reels are real audio. They are also hand-picked, scripted to the engine's strengths, and impossible to compare with anyone else's reel.

That leaves buyers in a bad spot. You cannot line up five marketing pages and learn anything, because the scripts differ, the conditions differ, and nobody publishes the takes that went badly. The one thing you actually need, the same material run through every engine under the same conditions, is exactly what vendor marketing is built to avoid.

We run five engines in one studio, and we route work between them. That routing is a promise: we will send your script to the engine that serves it best, not the one that serves us best. A promise like that needs evidence, so we built the evidence first.

Our two-part answer

We split the question in two, because quality and speed deserve different referees.

For quality, we do not referee at all. We publish the Quality Elo from the Artificial Analysis Speech Arena, a public arena where listeners blind-compare engines and the votes produce a rating, the way chess ratings work. We did not build the arena, we cannot vote our own engines up, and we print the score whether it flatters us or not. No engine grades its own homework here, including ours.
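To make the chess comparison concrete, here is a minimal sketch of how a rating gap turns into an expected preference rate. It assumes the classic Elo formula with a 400-point scale; the arena's exact rating model is not described in this post, so treat the output as a rough reading of the gap, not the arena's own arithmetic. The function name `expected_win_rate` is ours, for illustration only.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Chance that A is preferred over B under the classic Elo model.

    Assumption: logistic curve on a 400-point scale, as in chess.
    The arena may fit its ratings differently.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Arena ratings published in this post (retrieved 2026-06-10).
gemini_flash = 1225.13
kokoro = 1060.25

p = expected_win_rate(gemini_flash, kokoro)
print(f"Gemini Flash preferred over Kokoro in about {p:.0%} of blind pairings")
```

Two engines with equal ratings come out at 50 percent, and swapping the arguments gives the complementary probability, which is the property that makes a single number per engine comparable across the whole roster.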

For speed, the arena cannot help, because latency depends on the path the audio takes to reach you. So we measure that ourselves: one short script, sent to every engine through the same gateway, timed from request to the complete audio file, median of three runs. The script is committed to our repo and every figure carries the date it was measured. The full method has its own post.
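The timing rule above can be sketched in a few lines. This is an illustration under stated assumptions, not our actual harness: `synthesize` stands in for whatever call sends the script through the gateway and returns only once the complete audio file is available, and `fake_engine` is a placeholder so the example runs without any network access.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(
    synthesize: Callable[[str], bytes], script: str, runs: int = 3
) -> float:
    """Wall-clock time from request to complete audio, median over `runs`."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        audio = synthesize(script)  # hypothetical: blocks until the full file is back
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if not audio:
            raise RuntimeError("engine returned no audio; the run is not comparable")
        samples.append(elapsed_ms)
    return statistics.median(samples)


def fake_engine(script: str) -> bytes:
    """Stand-in engine: sleeps briefly, then returns placeholder bytes."""
    time.sleep(0.01)
    return b"RIFF" + script.encode("utf-8")


latency = median_latency_ms(fake_engine, "The quick brown fox.")
print(f"median of 3 runs: {latency:.0f} ms")
```

The median is used rather than the mean so that one slow run, for example a cold start, does not drag the published figure.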

The numbers, in one table

Here is the whole roster. Quality is third-party. Latency is ours.

| Engine | Quality Elo | Our measured latency |
| --- | --- | --- |
| Gemini Flash | 1225.13 | 2,770 ms |
| Grok Voice | 1196.92 | 2,444 ms |
| Kokoro | 1060.25 | 973 ms |
| MAI Voice 2 | 1006.96 | 2,426 ms |
| Zonos | 1000.00 | 4,523 ms |

* Quality Elo from the Artificial Analysis Speech Arena, retrieved 2026-06-10. User-vote arena ratings, not our scores.

* Latency: our own wall-clock measurement to full audio on the same routed path that serves the studio, median of 3 runs, measured 2026-06-10. Not a server SLA.

* MAI Voice 2: Score is for MAI-Voice-1; MAI-Voice-2 is not yet arena-rated.

* Zonos: Baseline rating with limited arena votes so far.

The embarrassing-honesty rule

A benchmark you only publish when you win is a press release. Ours has one rule: the number goes up either way.

So the table shows our losses. On the same arena, Eleven v3, the strongest ElevenLabs model, rates 1176, which beats three of our five engines. Kokoro sits at 1060.25, and it stays in the roster anyway, because it is our fastest, cheapest draft engine, not our blind-test winner. Different jobs, different tools.

The footnotes are part of the rule too. Where the data is thin, we say so in print. MAI Voice 2 carries the rating of MAI-Voice-1 because version 2 is not yet arena-rated. Zonos sits at the 1000 baseline with limited votes so far. Both scores will move as votes come in, possibly down. We will print that too.

The wins are real as well, and worth saying plainly: Gemini Flash rates 1225.13, within about three Elo points of Fun-Realtime-TTS at 1228.06, the top-rated of the roughly 85 models in the arena.
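The comparisons in this section are plain arithmetic on the published ratings, and anyone can redo them. A small sketch, using only the figures quoted in this post (the variable names are ours):

```python
# Arena ratings quoted in this post, retrieved 2026-06-10.
ROSTER = {
    "Gemini Flash": 1225.13,
    "Grok Voice": 1196.92,
    "Kokoro": 1060.25,
    "MAI Voice 2": 1006.96,  # rating of MAI-Voice-1; version 2 is not yet arena-rated
    "Zonos": 1000.00,        # baseline rating, limited votes so far
}
ELEVEN_V3 = 1176.0
FUN_REALTIME_TTS = 1228.06

# Which of our engines rate below Eleven v3?
beaten = sorted(name for name, elo in ROSTER.items() if elo < ELEVEN_V3)

# How far is Gemini Flash from the top-rated model?
gap_to_top = round(FUN_REALTIME_TTS - ROSTER["Gemini Flash"], 2)

print(f"Eleven v3 outrates {len(beaten)} of {len(ROSTER)}: {', '.join(beaten)}")
print(f"Gemini Flash trails the top model by {gap_to_top} Elo")
```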

Routing follows the data

The benchmark is not a poster on the wall. It is the routing table. When the studio suggests Gemini Flash for an acted audiobook chapter and Kokoro for a forty-script draft batch, that suggestion is these numbers, applied to your job. What "acted" means in the script (bracketed cues versus a flat read) is unpacked in "cued TTS vs flat reads."
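As a rough illustration of numbers driving a suggestion, the sketch below picks the highest-rated engine for an acted read and the fastest for a draft batch. It is a simplification under stated assumptions: the job labels `"acted"` and `"draft_batch"` and the `suggest_engine` function are hypothetical, cost is left out because no prices appear in this post, and the studio's real routing may weigh more than two columns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Engine:
    name: str
    quality_elo: float  # third-party arena rating
    latency_ms: int     # our measured median, request to full audio


# Figures from the table in this post, dated 2026-06-10.
ROSTER = [
    Engine("Gemini Flash", 1225.13, 2770),
    Engine("Grok Voice", 1196.92, 2444),
    Engine("Kokoro", 1060.25, 973),
    Engine("MAI Voice 2", 1006.96, 2426),
    Engine("Zonos", 1000.00, 4523),
]


def suggest_engine(job_kind: str, roster: list[Engine] = ROSTER) -> Engine:
    """Hypothetical rule: quality for acted reads, speed for draft batches."""
    if job_kind == "acted":
        return max(roster, key=lambda e: e.quality_elo)
    if job_kind == "draft_batch":
        return min(roster, key=lambda e: e.latency_ms)
    raise ValueError(f"unknown job kind: {job_kind!r}")


print(suggest_engine("acted").name)
print(suggest_engine("draft_batch").name)
```

Because the rule reads straight from the roster, a change in any rating or latency figure changes the suggestion without anyone editing the logic.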

It also disciplines us. If an engine's score falls, our routing has to follow it down, in public, on a page we told you to watch. That is the point of doing this in the open: you should never have to trust our taste, only our arithmetic.

The live, filterable version is at the open benchmark, with samples for every engine on the engines page. When the numbers change, the pages change. That is the whole idea.

Quality Elo data: third-party, from the Artificial Analysis Speech Arena, retrieved 2026-06-10. Latency figures are our own measured wall-clock numbers, not a server SLA.


Check our work, then make your own.

The benchmark is live and the studio is free to start. Every claim above is one click from its source.
