Question 1

How is latency measured?

Accepted Answer

Latency is the wall-clock time from request to full audio bytes, measured locally through the same routing gateway and speech endpoint that serve the studio. Each engine ran the same short script three times and the median was recorded (2026-06-10). This is not a server SLA: network and load conditions on the day affect the result.
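The measurement described above can be sketched roughly as follows. This is a minimal illustration, not the actual benchmark harness: `median_latency` and `fake_engine` are hypothetical names, and `fake_engine` stands in for a real request that blocks until the full audio has arrived.

```python
import statistics
import time
from typing import Callable


def median_latency(synthesize: Callable[[str], bytes], script: str, runs: int = 3) -> float:
    """Return the median wall-clock seconds from request to full audio bytes."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        audio = synthesize(script)  # assumed to block until all audio bytes arrive
        elapsed = time.perf_counter() - start
        if not audio:
            raise RuntimeError("engine returned no audio")
        timings.append(elapsed)
    # The median damps a single slow or fast outlier among the runs.
    return statistics.median(timings)


def fake_engine(text: str) -> bytes:
    """Hypothetical stand-in for a request through the routing gateway."""
    time.sleep(0.01)  # simulated network and synthesis time
    return b"\x00" * len(text)


if __name__ == "__main__":
    print(f"{median_latency(fake_engine, 'Hello, world.'):.3f} s")
```

Because the timer wraps the whole call, the figure includes network and gateway time as well as synthesis, which is why it varies with conditions on the day.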

Question 2

Where does the quality score come from?

Accepted Answer

The Quality Elo is a third-party number from the Artificial Analysis Speech Arena (retrieved 2026-06-10), a public arena where listeners blind-compare engines and the votes produce an Elo rating. We do not invent or self-score it. Our Gemini engine rates 1225, about three Elo points below Fun-Realtime-TTS at 1228.06, which tops the roughly 85 rated models. Two of our engines carry caveats: MAI is shown with MAI-Voice-1's rating because MAI-Voice-2 is not yet arena-rated, and Zonos's rating is a baseline with limited votes so far. Both caveats are footnoted on the table.
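To give a feel for what a gap of about three Elo points means, here is a small sketch using the standard logistic Elo expected-score formula with a 400-point scale. That formula is an assumption for illustration; the arena's exact rating method is not described here, and `elo_expected_score` is a hypothetical helper name.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected share of head-to-head wins for A over B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


if __name__ == "__main__":
    # Ratings quoted in the answer above.
    p = elo_expected_score(1228.06, 1225.0)
    print(f"Expected preference for the higher-rated engine: {p:.1%}")
```

Under that assumption, the higher-rated engine would be preferred in roughly 50.4% of blind comparisons, so a three-point gap is close to a coin flip.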

Question 3

How is latency different from the quality score?

Accepted Answer

Latency is our own objective measurement: how long until the audio arrived on the path our studio uses. The Quality Elo is a third-party perceptual rating: how human listeners judged the voice in blind comparisons. A fast engine can sound flat; a slow engine can be expressive. The benchmark shows both so you can trade them off for your use case.

Question 4

How often is it updated?

Accepted Answer

The benchmark is re-run when an engine changes its model or pricing, or when a new engine is added. The date is always shown (current: 2026-06-10) so you know how fresh the data is. If an engine changes significantly between runs, both the old and new readings are noted.

Question 5

Which engine should I pick for audiobooks?

Accepted Answer

Gemini Flash leads on Quality Elo and emotional consistency because it follows bracketed cues across long files. Grok Voice works well for character-driven English reads. Kokoro is the cost-effective option for drafts and high-volume work. MAI Voice 2 and Zonos sit lower on the arena, so listen to their samples before committing a long project to them.

Question 6

Can I use these engines commercially?

Accepted Answer

Gemini Flash, Grok Voice, and MAI Voice 2 allow commercial use as served on Cantari; outputs are yours. Kokoro and Zonos are Apache-2.0 models, a license that permits commercial use. The per-engine rights are shown on each model card.

Every engine, scored on the same script.

The same test, every time.

Step 1: Same script, every engine

Step 2: Measure wall-clock latency

Step 3: Quality from a third-party arena

Step 4: Publish in the open

Method and questions.