Engineering · June 10, 2026 · 3 min read

How we measure voice latency (and why vendors won't)

Wall-clock to the complete audio file, median of three, same script, same gateway. The script is in the repo.

Last updated June 10, 2026

Two very different meanings of fast

When a voice vendor says fast, ask: fast until what? The flattering answer is time to first byte, the moment the first sliver of audio leaves the server. Streaming engines can make that number look magical, a few hundred milliseconds, while the full clip is still seconds away.

First-byte time matters if you are building a live phone agent. But the creators we build for are making audiobook chapters, video voice-overs, and podcast segments. You cannot edit, export, or publish the first byte. You work with the whole file, so the honest clock runs from pressing generate to holding the complete audio.

That is what we measure: wall-clock time to full audio. It is the number that decides whether your draft loop feels instant or sticky, and it is the number most vendor pages will not print, because it is bigger and it is checkable.
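To make the two clocks concrete, here is a minimal sketch that times both on the same request. It is an illustration, not the code in our measuring script. `timeStream`, `startRequest`, and the injectable `now` are names made up for this example, and it assumes the audio arrives as an async iterable of byte chunks (in recent Node versions, a `fetch` response body can be consumed this way).

```javascript
// Time one generation on both clocks: first byte and full audio.
// `startRequest` is a hypothetical function that kicks off the request and
// returns (or resolves to) an async iterable of byte chunks.
// `now` is injectable so the logic can be exercised with a fake clock.
async function timeStream(startRequest, now = () => performance.now()) {
  const start = now(); // the clock starts before the request is issued
  const chunks = await startRequest();

  let firstByteMs = null;
  let bytes = 0;
  for await (const chunk of chunks) {
    if (firstByteMs === null && chunk.length > 0) {
      firstByteMs = now() - start; // the flattering number
    }
    bytes += chunk.length;
  }

  const fullAudioMs = now() - start; // the number this post reports
  return { firstByteMs, fullAudioMs, bytes };
}
```

For a streaming engine the two fields can differ by seconds on the same request, which is the whole point: only `fullAudioMs` tells you when there is a file to edit.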

The method, in five lines

We kept the method small enough to read in a minute and boring enough to trust:

One script for every engine, no favorites: "The northern lights drifted across the sky, slow and silent, like breathing."
Every engine is called through the same routing gateway, the same path our studio uses in production. Nobody gets a private fast lane.
Three runs per engine, and we publish the median, so one lucky or unlucky run cannot tilt the result.
The measuring script (scripts/measure-latency.mjs) is committed to the repo, and it writes the data file the site reads. The published table and the measurement cannot drift apart.
Every figure is dated. The current run is 2026-06-10.
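The core of that method fits in a few lines. What follows is a sketch under stated assumptions, not a copy of scripts/measure-latency.mjs: `measureEngine` and `median` are illustrative names, and `generate` stands in for whatever call goes through the gateway and resolves with the complete audio bytes.

```javascript
// Median of a list of numbers (sorts a copy, numerically).
function median(samples) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Run one engine several times and return a dated record.
// `generate` is a hypothetical async function: script in, full audio bytes out.
async function measureEngine(name, generate, options = {}) {
  const { runs = 3, script, date, now = () => performance.now() } = options;
  const timings = [];
  let bytes = 0;
  for (let i = 0; i < runs; i++) {
    // Runs are sequential so they do not compete with each other.
    const start = now();
    const audio = await generate(script); // resolves only with the whole clip
    timings.push(Math.round(now() - start));
    bytes = audio.length;
  }
  return { engine: name, medianMs: median(timings), runs, bytes, measured: date };
}
```

Returning the date inside the record, rather than adding it later, is one way to keep a figure and its date from being separated when the record is written to a data file.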

The numbers

Here is the current run, fastest first. The audio sizes differ because the engines return different formats and bitrates for the same sentence, which is worth seeing in the open too.

| Engine | Median, 3 runs | Audio returned |
| --- | --- | --- |
| Kokoro | 973 ms | 34,272 bytes |
| MAI Voice 2 | 2,426 ms | 128,640 bytes |
| Grok Voice | 2,444 ms | 80,256 bytes |
| Gemini Flash | 2,770 ms | 336,000 bytes |
| Zonos | 4,523 ms | 51,427 bytes |

* Latency: our own wall-clock measurement to full audio on the same routed path that serves the studio, median of 3 runs, measured 2026-06-10. Not a server SLA.

* Local network conditions on the day affect these numbers, which is one more reason they carry a date.

What a second means for a creator

The spread is the story. Kokoro returns the full clip in 973 ms, just under a second. That is a draft loop: change a word, regenerate, listen, and repeat, without ever losing your place in the script. It is why Kokoro is our draft engine even though it ranks lower in the quality arena.

Gemini Flash takes 2.8 seconds for the same sentence, and earns it: it is the most expressive engine we run and the one we route final takes to. Waiting under three seconds for a keeper take is nothing. Waiting under three seconds between every micro-edit of a draft would wear you down.

Zonos is the slowest of the five at 4.5 seconds, and we publish that as plainly as the wins. The pattern we suggest in the studio falls straight out of these numbers: draft on the sub-second engine, finish on the expressive one. The same trade-offs are laid out on the benchmark page and the engines page. The direction half of that pattern (plain read versus bracketed cues) is covered in our post "cued TTS vs flat reads".
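The arithmetic behind that pattern is easy to run yourself. The sketch below uses the medians from the 2026-06-10 table and assumes a session of 20 regenerations. That session size is an illustration, not something we measured, and the function names are invented for this example.

```javascript
// Medians in milliseconds, copied from the 2026-06-10 run.
const MEDIAN_MS = {
  "Kokoro": 973,
  "MAI Voice 2": 2426,
  "Grok Voice": 2444,
  "Gemini Flash": 2770,
  "Zonos": 4523,
};

// Total time spent waiting on regenerations during one editing session.
function loopWaitMs(engine, edits) {
  if (!(engine in MEDIAN_MS)) throw new Error(`unknown engine: ${engine}`);
  return MEDIAN_MS[engine] * edits;
}

// The engine with the lowest measured median.
function fastestEngine() {
  return Object.entries(MEDIAN_MS).sort((a, b) => a[1] - b[1])[0][0];
}

const EDITS = 20; // assumed session size, not a measured figure
console.log(fastestEngine(), loopWaitMs("Kokoro", EDITS)); // Kokoro 19460
console.log(loopWaitMs("Gemini Flash", EDITS)); // 55400
```

Under that assumption, twenty micro-edits cost about 19.5 seconds of waiting on Kokoro and about 55.4 seconds on Gemini Flash. A single final take on Gemini Flash still costs only 2.8 seconds.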

Why we re-measure, and why vendors won't

A latency number without a date is a rumor. Engines get upgraded, gateways change providers, networks have moods. So the benchmark is re-run when an engine changes and the date is printed next to every figure, on the site and in this post.

Vendors rarely publish numbers like these, and the reasons are not mysterious. A wall-clock figure measured through a real gateway is bigger than a first-byte figure measured inside their own datacenter. It goes stale, so it has to be maintained. And it invites checking, since anyone with the same script can run the same test. Those are all costs if your goal is a glossy page, and all features if your goal is trust.

One honest caveat: our numbers are a real-world measurement, not a guarantee. They were taken from one machine, on one day, through one gateway. Treat them as a fair comparison between engines under identical conditions, not as a service-level promise. That is also exactly what makes them useful: it is the test your own first generation will actually face. And on a flat plan, re-running the comparison yourself costs you nothing extra.


Check our work, then make your own.

The benchmark is live and the studio is free to start. Every claim above is one click from its source.
