What is TTS latency? Time to first byte vs full audio

TTS latency is how long an engine takes to speak. The number depends entirely on where you stop the clock: the first streamed byte, or the complete audio file.

Updated June 11, 2026

The definition

TTS (text-to-speech) latency is the time between submitting text and getting audio back. Simple, except that vendors stop the clock at different moments, and the choice changes the number dramatically.

Time to first byte (TTFB) stops the clock when the first sliver of audio leaves the server. Streaming engines make this look spectacular, a few hundred milliseconds, while the rest of the clip is still being generated. Time to full audio stops the clock when the complete file has arrived, so it is never the smaller number and is usually the less flattering one.
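The two clocks can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's method: `measure_latency` and `fake_stream` are hypothetical names, the stream is simulated with `time.sleep` rather than a real engine, and both clocks are read on the client side as chunks arrive, which is an assumption about where you measure.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_latency(chunks: Iterable[bytes]) -> Tuple[float, float, int]:
    """Return (ttfb_seconds, full_audio_seconds, total_bytes) for one stream.

    The clock starts just before iteration begins, so `chunks` should be a
    lazy iterator that only issues the request when first advanced.
    """
    start = time.perf_counter()
    ttfb = None
    total_bytes = 0
    for chunk in chunks:
        # First clock: stop at the first non-empty chunk of audio.
        if chunk and ttfb is None:
            ttfb = time.perf_counter() - start
        total_bytes += len(chunk)
    # Second clock: stop only when the stream is exhausted.
    full = time.perf_counter() - start
    if ttfb is None:
        raise ValueError("stream produced no audio bytes")
    return ttfb, full, total_bytes


def fake_stream(first_delay: float = 0.05,
                chunk_delay: float = 0.01,
                n_chunks: int = 5) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming engine (simulated delays)."""
    time.sleep(first_delay)          # time the engine takes to start speaking
    for _ in range(n_chunks):
        yield b"\x00" * 1024         # one chunk of placeholder audio
        time.sleep(chunk_delay)      # time spent generating the next chunk
```

Calling `measure_latency(fake_stream())` returns a TTFB close to the simulated start-up delay and a noticeably larger time to full audio, which is the gap the two definitions describe. With a real engine you would pass the chunk iterator from whatever streaming client you use.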

Which clock matters for you

TTFB is the right metric for live, conversational uses: a phone agent or an assistant needs to start speaking immediately, and nobody is waiting for a file. Time to full audio is the right metric for creators, because a chapter, a voice-over, or a podcast segment is only useful once the whole file exists and can be played back, edited, and exported.

When a marketing page says "fast" without saying where the clock stops, assume it means TTFB. The two metrics can differ by an order of magnitude on the same engine and the same sentence.

Real numbers, for scale

On our own published run (one short script, every engine through the same production path, median of three runs, measured 2026-06-10), the spread to full audio runs from 973 ms on Kokoro, the fastest engine on the roster, to 4,523 ms on Zonos, the slowest. Gemini Flash, the expressive default, lands at 2,770 ms.

That spread is why drafting and finishing want different engines: a sub-second loop keeps you editing, while a few seconds is a fine price for a keeper take.

These are measurements, dated and reproducible, not a service-level agreement. The live table is on the open benchmark, and the full method is written up in the latency post.
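The median-of-three procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the published benchmark harness: `median_full_audio_ms` and `spread_ratio` are hypothetical helper names, and `synthesize` stands for any blocking call that returns the complete audio as bytes.

```python
import statistics
import time
from typing import Callable


def median_full_audio_ms(synthesize: Callable[[str], bytes],
                         text: str,
                         runs: int = 3) -> float:
    """Median wall-clock time to full audio, in milliseconds, over `runs` calls.

    `synthesize` is assumed to block until the whole file is available.
    """
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        audio = synthesize(text)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if not audio:
            raise ValueError("engine returned no audio")
        samples_ms.append(elapsed_ms)
    # The median damps a single slow or fast outlier among the runs.
    return statistics.median(samples_ms)


def spread_ratio(fastest_ms: float, slowest_ms: float) -> float:
    """How many times slower the slowest engine is than the fastest."""
    return slowest_ms / fastest_ms
```

With the figures quoted above, `spread_ratio(973, 4523)` comes out at roughly 4.6, so the slowest engine on the roster takes a little under five times as long as the fastest to deliver the same short script.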

On Cantari

We publish wall-clock time to full audio for every engine because that is the clock a creator actually feels, and we re-measure when engines change. How to read those figures alongside quality is covered in the engines guide.