
The five engines, and which one to pick

The full roster by character, plus how to read the benchmark: third-party quality scores and latencies we measure ourselves.

Updated June 11, 2026

One studio, five engines

Cantari routes your work across five real voice engines. They are not interchangeable: each has a character, and the honest way to choose is by the job in front of you.

| Engine | Character | Best for |
|---|---|---|
| Gemini Flash | Expressive, acts your [cues] | Acted scripts, audiobooks, dramatic reads |
| Kokoro | Fastest drafts | Drafts, high volume, cost-sensitive jobs |
| Grok Voice | Five personas | Character and persona reads in English |
| MAI Voice 2 | Real style and speed controls | Styled English reads with speed and intensity control |
| Zonos | American and British voices | American and British accent reads in English |

Engines are described by character on purpose. Two of the five (Kokoro and Zonos) are open-weight Apache-2.0 models, which also makes their licensing answer especially simple.

How to read the quality score

The quality number is not ours. It is the Quality Elo from the Artificial Analysis Speech Arena, a public arena where listeners blind-compare engines and the votes produce a rating, the way chess ratings work. We did not build the arena and we cannot vote our own engines up.
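For readers who have not met chess-style ratings, here is a minimal sketch of how a single blind vote could move two ratings. It assumes the classic Elo formula with a 400-point scale and a K-factor of 32; the arena's exact rating method is not described on this page, so treat `expected_score`, `update`, and both constants as illustrative assumptions rather than the arena's implementation.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Chance that a listener prefers A over B under the classic Elo model.

    The 400-point scale is the chess convention, assumed here for illustration.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return both ratings after one blind comparison.

    k controls how far a single vote moves a rating (assumed value).
    """
    expected_a = expected_score(rating_a, rating_b)
    actual_a = 1.0 if a_won else 0.0
    delta = k * (actual_a - expected_a)
    # The winner gains exactly what the loser gives up.
    return rating_a + delta, rating_b - delta
```

Under these assumptions, a gap of about 165 points (Gemini Flash at 1225.13 against Kokoro at 1060.25) corresponds to the higher-rated engine being preferred in roughly 72% of head-to-head votes. A rating is a statement about preference odds, not a percentage score.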

We print the score whether it flatters us or not, with the retrieval date next to it. Where the data is thin, the footnote says so in print: that is part of the method, not an apology.

How to read the latency number

Latency is ours, and it measures the clock a creator actually feels: wall-clock time from pressing generate to holding the complete audio file, not time to the first streamed byte. One short script, the same for every engine, three runs each, and we publish the median with the date it was measured.

It is a fair comparison between engines under identical conditions, not a service-level promise. Network conditions on the day move these numbers, which is exactly why they carry a date.
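The method above can be sketched in a few lines. This is an illustration under stated assumptions, not our benchmark harness: `generate` is a hypothetical callable standing in for one engine, and it must return only once the complete audio is in hand, so the timer covers the full wait rather than time to first byte.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(
    generate: Callable[[str], bytes],  # hypothetical: blocks until the full clip exists
    script: str,
    runs: int = 3,
) -> float:
    """Median wall-clock time, in milliseconds, from request to complete audio."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        audio = generate(script)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if not audio:
            raise RuntimeError("engine returned no audio; run not counted")
        samples.append(elapsed_ms)
    # The median keeps one slow network hiccup from skewing the published number.
    return statistics.median(samples)
```

Passing the same `script` to every engine is what makes the results comparable. With three runs, the median is simply the middle value, so a single outlier in either direction is ignored.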

The numbers

The current roster, best arena score first. Quality is third-party; latency is ours.

| Engine | Quality Elo | Our measured latency |
|---|---|---|
| Gemini Flash | 1225.13 | 2,770 ms |
| Grok Voice | 1196.92 | 2,444 ms |
| Kokoro | 1060.25 | 973 ms |
| MAI Voice 2 | 1006.96 | 2,426 ms |
| Zonos | 1000.00 | 4,523 ms |

* Quality Elo from the Artificial Analysis Speech Arena, retrieved 2026-06-10. User-vote arena ratings, not our scores.

* Latency: our own wall-clock measurement to full audio, same script for every engine, median of 3 runs, measured 2026-06-10. Not a server SLA.

* MAI Voice 2: Score is for MAI-Voice-1; MAI-Voice-2 is not yet arena-rated.

* Zonos: Baseline rating with limited arena votes so far.

What the spread means in practice

Kokoro returns a full clip in under a second (973 ms in the current run), which makes it the draft loop: change a word, regenerate, listen, again. Gemini Flash takes longer (2,770 ms) and earns it as the most expressive engine in the roster, the one that acts bracketed cues, so it is where final takes go.

That draft-fast, finish-expressive pattern is the studio's own suggestion, and it falls straight out of these numbers rather than out of taste.
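As a rough illustration of that pattern, the sketch below picks the lowest-latency engine for drafts and the highest-rated one for final takes, using the figures from the table above. The `Engine` class, the `pick` function, and the stage names are hypothetical and not part of any Cantari API. Using Quality Elo as the stand-in for "most expressive" is also an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Engine:
    name: str
    quality_elo: float  # third-party arena rating, retrieved 2026-06-10
    latency_ms: int     # our median of 3 runs, measured 2026-06-10


ROSTER = [
    Engine("Gemini Flash", 1225.13, 2770),
    Engine("Grok Voice", 1196.92, 2444),
    Engine("Kokoro", 1060.25, 973),
    Engine("MAI Voice 2", 1006.96, 2426),  # score is for MAI-Voice-1, per the footnote
    Engine("Zonos", 1000.00, 4523),        # baseline rating, limited votes so far
]


def pick(stage: str, roster=ROSTER) -> Engine:
    """Draft on the fastest engine, finish on the highest-rated one."""
    if stage == "draft":
        return min(roster, key=lambda e: e.latency_ms)
    if stage == "final":
        return max(roster, key=lambda e: e.quality_elo)
    raise ValueError(f"unknown stage: {stage!r}")
```

With the current numbers, `pick("draft")` returns Kokoro and `pick("final")` returns Gemini Flash. Because the choice is computed from the table rather than hard-coded, it shifts on its own when the published figures change.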

Go deeper

The live, filterable leaderboard lives on the open benchmark page, and every engine has a detail page with real audio samples, linked from the engines page. When the numbers change, those pages change: you should never have to trust our taste, only our arithmetic.