Engines

The engines behind the layer.

Every model we route to, presented as a card: specs, scores, rights, and a preview sample. Self-hosted engines plug in the same way.

We publish each engine’s real cost for one reason: so you can see that we route on merit and charge one flat price. The raw rates we pay each provider are listed here in the open, so you never have to take our word for what is fair.

5: Real engines today
$0.62/M: Cheapest engine (Kokoro)
973ms: Fastest measured latency
1: Endpoint for all engines

Head to head

The whole roster on one line.

Every engine against the same rows, so the trade-offs are obvious before you read a single card. Press play in any column to hear it.

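| Engine | Gemini Flash | Grok Voice | Kokoro | MAI Voice 2 | Zonos |
| --- | --- | --- | --- | --- | --- |
| Quality Elo* | 1225 | 1197 | 1060 | 1007* | 1000* |
| Measured latency | 2770ms | 2444ms | 973ms | 2426ms | 4523ms |
| Languages | 24 languages | 1 language | 8 languages | 1 language | 1 language |
| Follows [cues] | Yes, acts cues | Plain read | Plain read | Plain read | Plain read |
| Cloning | | | | | |
| Price | $1/M in + $20/M out | $15/M in · $0 out | $0.62/M in · $0 out | $22/M in · $0 out | $7/M in · $0 out |
| Rights | Commercial use; outputs are yours | Commercial use; outputs are yours | Apache-2.0 model; commercial OK | Commercial use; outputs are yours | Apache-2.0 model; commercial OK |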

* Quality Elo is from the Artificial Analysis Speech Arena, retrieved 2026-06-10. It is a user-vote arena rating; the top-rated model overall is Fun-Realtime-TTS at 1228.06. Latency figures are our own wall-clock measurements, taken 2026-06-10 on the same routed path that serves the studio; they are not a server SLA. Engine rates are the providers’ published list rates, checked 2026-06-11; when a provider moves a rate, we update it here.

* MAI Voice 2: Score is for MAI-Voice-1; MAI-Voice-2 is not yet arena-rated.

* Zonos: Baseline rating with limited arena votes so far.
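For a rough sense of what a gap in Quality Elo means, the sketch below converts two ratings from the table into an expected head-to-head preference rate. It assumes the arena follows the conventional logistic Elo curve with a 400-point scale, which this page does not state; `expected_win_rate` is a hypothetical helper name, not part of any published API.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of votes for A over B under the standard Elo logistic curve.

    Assumption: the 400-point scale is the conventional Elo default,
    not a figure confirmed for the Speech Arena.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings as listed on this page (retrieved 2026-06-10).
gemini_flash = 1225
kokoro = 1060

rate = expected_win_rate(gemini_flash, kokoro)
print(f"Gemini Flash preferred over Kokoro in about {rate:.0%} of votes")
```

Under that assumption, equal ratings give an even split, and the two baseline-rated engines (MAI Voice 2 and Zonos) should be read with the caveats in the footnotes above.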

Engine roster

Every engine, side by side.

Specs, rights, sample voices, and pricing are shown on every card. Compare them against the open benchmark before you pick.

Gemini Flash

Expressive - follows [cues]

The only engine here that acts on your bracketed [emotion] directions.

Quality Elo: 1225
Latency: 2770 ms (measured 2026-06-10)
Languages: 24
Price: $1/M in + $20/M out
Rights: Commercial use; outputs are yours

Cue-following · Expressive

Gemini Flash in detail →

Kokoro

Lightweight - plain read

Cheapest. Clean, plain read. Ignores cues.

Quality Elo: 1060
Latency: 973 ms (measured 2026-06-10)
Languages: 8
Price: $0.62/M in · $0 out
Rights: Apache-2.0 model; commercial OK

Cheapest · Fast

Kokoro in detail →

Grok Voice

Persona voices - plain read

xAI voice with 5 personas. Plain read, ignores cues.

Quality Elo: 1197
Latency: 2444 ms (measured 2026-06-10)
Languages: 1
Price: $15/M in · $0 out
Rights: Commercial use; outputs are yours

5 personas · English

Grok Voice in detail →

MAI Voice 2

Styled voice - real controls

Microsoft voice with real style and speed controls.

Quality Elo: 1007 *
Latency: 2426 ms (measured 2026-06-10)
Languages: 1
Price: $22/M in · $0 out
Rights: Commercial use; outputs are yours

Style controls · English

MAI Voice 2 in detail →

Zonos

Open-weight - plain read

Open-weight Zyphra engine with four accent voices.

Quality Elo: 1000 *
Latency: 4523 ms (measured 2026-06-10)
Languages: 1
Price: $7/M in · $0 out
Rights: Apache-2.0 model; commercial OK

4 accents · Open-source

Zonos in detail →
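The card figures can be compared programmatically. The sketch below copies the Elo, latency, and list-rate values from the cards and shows two small calculations: the raw engine cost for a job, and the fastest engine above a chosen Elo floor. It is an illustration only, not the routing logic used by the service. The names `Engine`, `engine_cost_usd`, and `fastest_with_min_elo` are hypothetical, and the billing unit behind "/M" is assumed to be one million units, since the page does not say what the unit is. These are provider list rates, not the flat price a customer is charged.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Engine:
    name: str
    elo: int              # Quality Elo as listed on the card
    latency_ms: int       # measured 2026-06-10
    usd_per_m_in: float   # list rate per million input units (unit assumed)
    usd_per_m_out: float  # list rate per million output units (unit assumed)


# Values copied from the cards above. The MAI Voice 2 score is for
# MAI-Voice-1, and the Zonos score is a baseline rating (see footnotes).
ROSTER = [
    Engine("Gemini Flash", 1225, 2770, 1.00, 20.00),
    Engine("Kokoro", 1060, 973, 0.62, 0.00),
    Engine("Grok Voice", 1197, 2444, 15.00, 0.00),
    Engine("MAI Voice 2", 1007, 2426, 22.00, 0.00),
    Engine("Zonos", 1000, 4523, 7.00, 0.00),
]


def engine_cost_usd(engine: Engine, units_in: int, units_out: int = 0) -> float:
    """Raw provider cost in USD for a job of the given input/output size."""
    total = units_in * engine.usd_per_m_in + units_out * engine.usd_per_m_out
    return total / 1_000_000


def fastest_with_min_elo(roster: list[Engine], min_elo: int) -> Optional[Engine]:
    """Lowest-latency engine whose Elo is at least min_elo, or None."""
    candidates = [e for e in roster if e.elo >= min_elo]
    return min(candidates, key=lambda e: e.latency_ms) if candidates else None


kokoro = ROSTER[1]
print(f"{kokoro.name}: ${engine_cost_usd(kokoro, 2_000_000):.2f} for 2M input units")

pick = fastest_with_min_elo(ROSTER, min_elo=1100)
print(f"Fastest engine at Elo >= 1100: {pick.name if pick else 'none'}")
```

A filter like this makes the trade-off explicit: raising the Elo floor narrows the field to the higher-rated engines, which on the measured numbers are also the slower ones.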

One integration

One endpoint. Every engine.

Send the same request shape to every engine through a single API route. Swap engines by changing one field in your payload. Your key never leaves the server, and new engines plug in without touching your integration.

Single POST /api/speech endpoint for all engines
Switch engines with one field change
API key managed server-side only
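As a minimal sketch of the "one field" idea, the example below builds (without sending) two requests to the single endpoint that differ only in the engine they name. Only the `POST /api/speech` route comes from this page. The base URL, the JSON field names `engine` and `text`, and the engine identifiers `kokoro` and `gemini-flash` are assumptions made for illustration; check the console or API reference for the real request shape.

```python
import json
import urllib.request

# Hypothetical base URL; only the /api/speech path comes from this page.
BASE_URL = "https://example.com"


def build_speech_request(engine: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the single speech endpoint.

    Assumptions: the JSON field names "engine" and "text" are illustrative.
    No API key is attached here because the page says the key is managed
    server-side.
    """
    payload = json.dumps({"engine": engine, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/speech",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Swapping engines changes exactly one field in the payload.
first = build_speech_request("kokoro", "Hello there.")
second = build_speech_request("gemini-flash", "Hello there.")

print(first.get_method(), first.full_url)
print(json.loads(first.data))
print(json.loads(second.data))
```

Both requests share the same URL, method, and headers, so client code written against one engine would not need to change when another is selected.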

Open the console →

One endpoint

POST /api/speech

Gemini Flash: $1/M in + $20/M out

Kokoro: $0.62/M in · $0 out

Grok Voice: $15/M in · $0 out

MAI Voice 2: $22/M in · $0 out

Zonos: $7/M in · $0 out

Swap engines with one field - key stays server-side

Hear the engines for yourself.

Open the console and generate real audio through each engine. No sign-up required to listen.

Open console →

See the benchmark →