cantari
Engines

The engines behind the layer.

Every model we route to, presented as a card: specs, scores, rights, and a preview sample. Self-hosted engines plug in the same way.

We publish each engine’s real cost for one reason: so you can see that we route on merit and charge one flat price. The raw rates below are the numbers we pay underneath, published in the open so you never have to take our word for what is fair.

  • Real engines today: 5
  • Cheapest engine (Kokoro): $0.62/M
  • Fastest measured latency: 973ms
  • Endpoint for all engines: 1
Head to head

The whole roster on one line.

Every engine against the same rows, so the trade-offs are obvious before you read a single card. Press play in any column to hear it.

| Engine | Gemini Flash | Grok Voice | Kokoro | MAI Voice 2 | Zonos |
| Quality Elo* | 1225 | 1197 | 1060 | 1007* | 1000* |
| Measured latency | 2770ms | 2444ms | 973ms | 2426ms | 4523ms |
| Languages | 24 languages | 1 language | 8 languages | 1 language | 1 language |
| Follows [cues] | Yes, acts cues | Plain read | Plain read | Plain read | Plain read |
| Cloning | No | No | No | No | No |
| Price | $1/M in + $20/M out | $15/M in · $0 out | $0.62/M in · $0 out | $22/M in · $0 out | $7/M in · $0 out |
| Rights | Commercial use; outputs are yours | Commercial use; outputs are yours | Apache-2.0 model; commercial OK | Commercial use; outputs are yours | Apache-2.0 model; commercial OK |

* Quality Elo is from the Artificial Analysis Speech Arena, retrieved June 10, 2026. It is a user-vote arena rating; the top-rated model overall is Fun-Realtime-TTS at 1228.06. Latency figures are our own wall-clock measurements, taken 2026-06-10 on the same routed path that serves the studio; they are not a server SLA. Engine rates are the providers’ published list rates, checked 2026-06-11; when a provider moves a rate, we update it here.

* MAI Voice 2: Score is for MAI-Voice-1; MAI-Voice-2 is not yet arena-rated.

* Zonos: Baseline rating with limited arena votes so far.
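To make the "in" and "out" rates in the table concrete, here is a minimal sketch of the arithmetic. It assumes "/M" means "per million billed units" (the page does not say whether a unit is a character or a token), and the names `Rate`, `RATES`, and `engine_cost_usd` are made up for this illustration. It estimates the underlying engine rate only, not the flat price you are charged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    usd_per_m_in: float   # list rate per million input units
    usd_per_m_out: float  # list rate per million output units


# List rates copied from the comparison table (checked 2026-06-11).
RATES = {
    "Gemini Flash": Rate(1.00, 20.00),
    "Grok Voice": Rate(15.00, 0.00),
    "Kokoro": Rate(0.62, 0.00),
    "MAI Voice 2": Rate(22.00, 0.00),
    "Zonos": Rate(7.00, 0.00),
}


def engine_cost_usd(engine: str, units_in: int, units_out: int = 0) -> float:
    """Underlying engine cost in USD for a given number of billed units.

    Assumption: "/M" is "per million units"; the unit itself is not
    defined on the page, so treat the result as a rough estimate.
    """
    rate = RATES[engine]
    total = units_in * rate.usd_per_m_in + units_out * rate.usd_per_m_out
    return total / 1_000_000


if __name__ == "__main__":
    # One million input units through each engine, no output units.
    for name in RATES:
        print(f"{name}: ${engine_cost_usd(name, 1_000_000):.2f}")
```

Only Gemini Flash has a non-zero "out" rate in the table, so for the other four engines the second term drops out and the cost scales with input alone.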

Engine roster

Every engine, side by side.

Specs, rights, sample voices, and pricing shown on every card. Compare them against the open benchmark before you pick.

Gemini Flash
Expressive - follows [cues]

The only engine here that acts your bracketed [emotion] directions.

Quality Elo: 1225
Latency: 2770 ms (measured 2026-06-10)
Languages: 24
Price: $1/M in + $20/M out
Rights: Commercial use; outputs are yours
Cue-following · Expressive
Gemini Flash in detail →

Kokoro
Lightweight - plain read

Cheapest. Clean, plain read. Ignores cues.

Quality Elo: 1060
Latency: 973 ms (measured 2026-06-10)
Languages: 8
Price: $0.62/M in · $0 out
Rights: Apache-2.0 model; commercial OK
Cheapest · Fast
Kokoro in detail →

Grok Voice
Persona voices - plain read

xAI voice with 5 personas. Plain read, ignores cues.

Quality Elo: 1197
Latency: 2444 ms (measured 2026-06-10)
Languages: 1
Price: $15/M in · $0 out
Rights: Commercial use; outputs are yours
5 personas · English
Grok Voice in detail →

MAI Voice 2
Styled voice - real controls

Microsoft voice with real style and speed controls.

Quality Elo: 1007 *
Latency: 2426 ms (measured 2026-06-10)
Languages: 1
Price: $22/M in · $0 out
Rights: Commercial use; outputs are yours
Style controls · English
MAI Voice 2 in detail →

Zonos
Open-weight - plain read

Open-weight Zyphra engine with four accent voices.

Quality Elo: 1000 *
Latency: 4523 ms (measured 2026-06-10)
Languages: 1
Price: $7/M in · $0 out
Rights: Apache-2.0 model; commercial OK
4 accents · Open-source
Zonos in detail →

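If you would rather compare the cards in code than by eye, the roster fits in a small data structure. This is only a sketch built from the numbers on the cards: the `Engine` class and `shortlist` helper are invented for this example and say nothing about how routing actually works. Note that the MAI Voice 2 score is for MAI-Voice-1 and the Zonos score is a baseline, per the footnotes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Engine:
    name: str
    elo: int             # Quality Elo as shown on the card
    latency_ms: int      # measured 2026-06-10
    languages: int
    usd_per_m_in: float  # input list rate only; Gemini Flash also bills output
    follows_cues: bool


# Values copied from the engine cards.
ROSTER = [
    Engine("Gemini Flash", 1225, 2770, 24, 1.00, True),
    Engine("Grok Voice", 1197, 2444, 1, 15.00, False),
    Engine("Kokoro", 1060, 973, 8, 0.62, False),
    Engine("MAI Voice 2", 1007, 2426, 1, 22.00, False),  # score is for MAI-Voice-1
    Engine("Zonos", 1000, 4523, 1, 7.00, False),         # baseline rating
]


def shortlist(min_elo: int = 0,
              max_latency_ms: Optional[int] = None,
              need_cues: bool = False) -> List[Engine]:
    """Return engines that meet every constraint, best Elo first."""
    picked = [
        e for e in ROSTER
        if e.elo >= min_elo
        and (max_latency_ms is None or e.latency_ms <= max_latency_ms)
        and (e.follows_cues or not need_cues)
    ]
    return sorted(picked, key=lambda e: e.elo, reverse=True)


if __name__ == "__main__":
    # Engines rated at least 1050 that responded within 2.5 seconds.
    for engine in shortlist(min_elo=1050, max_latency_ms=2500):
        print(engine.name, engine.elo, f"{engine.latency_ms} ms")
```

Sorting by Elo is one arbitrary choice; swap the sort key for `latency_ms` or `usd_per_m_in` if speed or cost matters more to you.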
One integration

One endpoint. Every engine.

Send the same request shape to every engine through a single API route. Swap engines by changing one field in your payload. Your key never leaves the server, and new engines plug in without touching your integration.
  • Single POST /api/speech endpoint for all engines
  • Switch engines with one field change
  • API key managed server-side only
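As a rough sketch of what "one field" could look like in practice, the example below builds two requests that differ only in the engine name. Everything beyond `POST /api/speech` is an assumption: the host, the `engine` and `text` field names, and the engine identifiers are placeholders, so check the console for the real request shape. No API key appears in the request because the key is managed server-side.

```python
import json
import urllib.request

# Hypothetical host; only the /api/speech path comes from this page.
BASE_URL = "https://example.invalid"


def build_speech_request(engine: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the single speech endpoint.

    The payload field names "engine" and "text" are assumed for
    illustration and may not match the real API.
    """
    payload = {"engine": engine, "text": text}
    return urllib.request.Request(
        url=f"{BASE_URL}/api/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Same request shape, different engine: only one field changes.
    for engine_id in ("kokoro", "gemini-flash"):  # hypothetical identifiers
        req = build_speech_request(engine_id, "Hello from one endpoint.")
        print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Sending the request would be a call to `urllib.request.urlopen(req)` against the real host; it is left out here so the sketch runs without network access.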
Open the console
One endpoint

POST /api/speech

Gemini Flash: $1/M in + $20/M out
Kokoro: $0.62/M in · $0 out
Grok Voice: $15/M in · $0 out
MAI Voice 2: $22/M in · $0 out
Zonos: $7/M in · $0 out

Swap engines with one field - key stays server-side

Hear the engines for yourself.

Open the console and generate real audio through each engine. No sign-up required to listen.