The five engines, and which one to pick
The full roster by character, plus how to read the benchmark: third-party quality scores and latencies we measure ourselves.
Updated June 11, 2026
One studio, five engines
Cantari routes your work across five real voice engines. They are not interchangeable: each has a character, and the honest way to choose is by the job in front of you.
Engines are described by character on purpose. Two of the five, Kokoro and Zonos, are open-weight models released under Apache-2.0, which also keeps their licensing answer simple.
How to read the quality score
The quality number is not ours. It is the Quality Elo from the Artificial Analysis Speech Arena, a public arena where listeners blind-compare engines and the votes produce a rating, the way chess ratings work. We did not build the arena and we cannot vote our own engines up.
We print the score whether it flatters us or not, with the retrieval date next to it. Where the data is thin, the footnote says so in print: that is part of the method, not an apology.
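For readers who have not met chess ratings, here is a minimal sketch of how one blind vote can move two ratings. It uses the classic Elo formula; the arena's exact rating method is not described on this page, so treat the function names, the K-factor of 32, and the 400-point scale as illustrative assumptions rather than the arena's implementation.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the classic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a: float, rating_b: float, a_won: bool,
                   k: float = 32.0) -> tuple[float, float]:
    """Apply one blind-comparison vote and return the new (A, B) ratings.

    k is an assumed K-factor; a real arena may use a different value
    or a different rating model altogether.
    """
    surprise = (1.0 if a_won else 0.0) - expected_score(rating_a, rating_b)
    delta = k * surprise
    # Whatever A gains, B loses: the update is zero-sum.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two engines start level; a listener prefers A.
    print(update_ratings(1000.0, 1000.0, a_won=True))
```

An upset (a low-rated engine beating a high-rated one) moves both ratings more than an expected result does, which is why a rating built on few votes can still swing widely.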
How to read the latency number
Latency is ours, and it measures the clock a creator actually feels: wall-clock time from pressing generate to holding the complete audio file, not time to the first streamed byte. One short script, the same for every engine, three runs each, and we publish the median with the date it was measured.
It is a fair comparison between engines under identical conditions, not a service-level promise. Network conditions on the day move these numbers, which is exactly why they carry a date.
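The method above can be sketched in a few lines. This is an illustration under stated assumptions, not our benchmark harness: `generate` is a hypothetical callable standing in for any engine, and it is assumed to return only once the complete audio is in hand, so the timer covers the full file rather than the first streamed byte.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(generate: Callable[[str], bytes], script: str,
                      runs: int = 3,
                      clock: Callable[[], float] = time.perf_counter) -> float:
    """Median wall-clock time, in milliseconds, to receive the full audio.

    generate is a hypothetical engine call that blocks until the
    complete audio is available. clock is injectable so the function
    can be checked without real waiting.
    """
    samples = []
    for _ in range(runs):
        start = clock()
        audio = generate(script)  # blocks until the whole clip is returned
        elapsed_ms = (clock() - start) * 1000.0
        if not audio:
            raise RuntimeError("engine returned no audio")
        samples.append(elapsed_ms)
    # The median keeps one slow network hiccup from skewing the result.
    return statistics.median(samples)


if __name__ == "__main__":
    def fake_engine(text: str) -> bytes:
        time.sleep(0.05)  # stand-in for a real synthesis call
        return b"\x00" * 16

    print(f"{median_latency_ms(fake_engine, 'One short script.'):.0f} ms")
```

With three runs, the median is simply the middle measurement, so a single unusually fast or slow run does not change the published figure.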
The numbers
The current roster, best arena score first. Quality is third-party; latency is ours.
* Quality Elo from the Artificial Analysis Speech Arena, retrieved 2026-06-10. User-vote arena ratings, not our scores.
* Latency: our own wall-clock measurement to full audio, same script for every engine, median of 3 runs, measured 2026-06-10. Not a server SLA.
* MAI Voice 2: Score is for MAI-Voice-1; MAI-Voice-2 is not yet arena-rated.
* Zonos: Baseline rating with limited arena votes so far.
What the spread means in practice
Kokoro returns a full clip in under a second (973 ms in the current run), which makes it the draft loop: change a word, regenerate, listen, repeat. Gemini Flash takes longer (2,770 ms) and earns it as the most expressive engine in the roster, the one that acts out bracketed cues, so it is where final takes go.
That draft-fast, finish-expressive pattern is the studio's own suggestion, and it falls straight out of these numbers rather than out of taste.
Go deeper
The open benchmark is the live, filterable leaderboard, and the engines page links to a detail page with real audio samples for every engine. When the numbers change, those pages change: you should never have to trust our taste, only our arithmetic.