cantari

Turn any script into voice you own.

Write or paste your text, pick from real AI voice engines, and generate finished text to speech audio in seconds. Add bracketed cues when you want a voice to act them, then export MP3 or WAV with commercial rights.

No credit card · 5 real engines · The audio is yours


Welcome back. [warmly] It is good to have you here.

Try a cue

Same line. Three deliveries.

The bracketed cue is an instruction the engine acts. Pick one and press play: each take is real Gemini Flash output, recorded unedited.

Pick a mood

[whispering] Don't go in there... something moved.

Real Gemini Flash output, unedited. About 1,000 characters is a minute of audio.

How it works

From a blank script to audio you own.

Step 1: Write your script

Type or paste up to 30,000 characters. Add bracketed [emotion] cues anywhere you want a voice to act them.

Step 2: Pick an engine and voice

Choose from real engines and their own voices, or follow the open benchmark to the one that fits the job.

Step 3: Generate

One request returns finished audio in seconds. Preview it right in the console before you keep it.

Step 4: Own and export

Save to your library and export MP3 or WAV. Commercial rights, worldwide, no watermark.

Capabilities

What Text to Speech gives you.

Voices that act your cues

Gemini Flash reads bracketed [whispering] or [excited] directions as performance, not as text. Plain-read engines tell you up front when they will not.

Every engine, one studio

Gemini Flash, Kokoro, Grok Voice, MAI Voice 2, and Zonos all live in the same text to speech studio. Switch between them with one tap and pick the voice that fits the moment.

Long-form scripts

Up to 30,000 characters per generation, so a full article or chapter section goes through in a single pass.
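
The two figures quoted on this page, roughly 1,000 characters per minute of audio and 30,000 characters per generation, make it easy to sanity-check a script before you paste it. Here is a minimal Python sketch. The function names are ours for illustration, and the per-minute figure is a rule of thumb, not a guarantee for any particular engine or voice.

```python
# Figures quoted on this page; treat the per-minute rate as approximate.
CHARS_PER_MINUTE = 1_000
MAX_CHARS_PER_GENERATION = 30_000


def estimate_minutes(script: str) -> float:
    """Rough audio length in minutes, based only on character count."""
    return len(script) / CHARS_PER_MINUTE


def fits_one_generation(script: str) -> bool:
    """True if the script is within the per-generation character limit."""
    return len(script) <= MAX_CHARS_PER_GENERATION
```

By that rule of thumb, a script right at the 30,000-character limit comes out to about half an hour of audio.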

Export and own

Download every generation as MP3 or WAV. A plain-language license per generation, not a credit meter.

Pick by measured numbers

The open benchmark publishes real wall-clock latency and languages so you choose an engine on evidence, not marketing.

Flat pricing

Start free on the open-weight engine and move up only when a job earns it. No per-character credit anxiety.

Powered by

The engines behind it.

Gemini Flash
Expressive - follows [cues]

The only engine here that acts your bracketed [emotion] directions.

Quality Elo: 1225
Latency: 2770 ms (measured 2026-06-10)
Languages: 24
Rights: Commercial use; outputs are yours
Cue-following · Expressive

Kokoro
Lightweight - plain read

Cheapest. Clean, plain read. Ignores cues.

Quality Elo: 1060
Latency: 973 ms (measured 2026-06-10)
Languages: 8
Rights: Apache-2.0 model; commercial OK
Cheapest · Fast

Grok Voice
Persona voices - plain read

xAI voice with 5 personas. Plain read, ignores cues.

Quality Elo: 1197
Latency: 2444 ms (measured 2026-06-10)
Languages: 1
Rights: Commercial use; outputs are yours
5 personas · English

MAI Voice 2
Styled voice - real controls

Microsoft voice with real style and speed controls.

Quality Elo: 1007 *
Latency: 2426 ms (measured 2026-06-10)
Languages: 1
Rights: Commercial use; outputs are yours
Style controls · English

Zonos
Open-weight - plain read

Open-weight Zyphra engine with four accent voices.

Quality Elo: 1000 *
Latency: 4523 ms (measured 2026-06-10)
Languages: 1
Rights: Apache-2.0 model; commercial OK
4 accents · Open-source

Quality Elo from the Artificial Analysis Speech Arena, retrieved June 10, 2026. Latencies are our own real wall-clock numbers.
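
If you would rather filter these numbers than eyeball them, the figures above fit in a few lines of Python. This is an illustrative sketch, not part of the product: the `Engine` class and `pick_engine` function are hypothetical names, the data is copied from the cards above as measured on 2026-06-10, and ranking by Elo is just one reasonable tie-breaker.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Engine:
    name: str
    elo: int           # Quality Elo as listed above
    latency_ms: int    # wall-clock latency as listed above
    languages: int
    follows_cues: bool


# Values copied from the engine cards. The MAI Voice 2 and Zonos Elo
# figures carry an asterisk on the page.
ENGINES = [
    Engine("Gemini Flash", 1225, 2770, 24, True),
    Engine("Kokoro", 1060, 973, 8, False),
    Engine("Grok Voice", 1197, 2444, 1, False),
    Engine("MAI Voice 2", 1007, 2426, 1, False),
    Engine("Zonos", 1000, 4523, 1, False),
]


def pick_engine(
    max_latency_ms: Optional[int] = None,
    needs_cues: bool = False,
    min_languages: int = 1,
) -> Optional[Engine]:
    """Return the highest-Elo engine that meets every constraint, or None."""
    candidates = [
        e for e in ENGINES
        if (not needs_cues or e.follows_cues)
        and e.languages >= min_languages
        and (max_latency_ms is None or e.latency_ms <= max_latency_ms)
    ]
    return max(candidates, key=lambda e: e.elo, default=None)
```

For example, `pick_engine(needs_cues=True)` returns Gemini Flash, while `pick_engine(max_latency_ms=1000)` returns Kokoro, the only engine listed under one second.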

Hear it

Real audio, straight from the engines.

Each sample is real engine output for the line on its card, recorded unedited. Open the console to generate your own.

Plain read

Your words, in a voice you own. Generated in seconds, exported in one click.

Warm welcome

Welcome back. It is good to have you here, so let us pick up right where we left off.

Acted cue

[whispering] Lean in, because this part is just between us.

The plain definition

What makes text to speech sound human.

Text to speech is the conversion of written words into spoken audio by a model trained on human speech. The definition takes one sentence; the craft is everything after it, because two engines reading the same script can land the same sentence completely differently.

The first difference is delivery. Most text to speech reads every sentence at one even pitch, which is exactly what makes it register as synthetic. An engine that acts direction closes that gap: write [weary] before a line and Gemini Flash performs the weariness instead of reading the cue as text. The plain-read engines here say up front that they will ignore cues, which is its own kind of useful.
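
If you keep one script and send it to both kinds of engine, it can help to see which cues it contains and to make a clean copy for the plain-read engines. Here is a small Python sketch under one assumption: a cue is any short bracketed direction such as [weary] or [whispering]. The function names are ours, and the studio may well handle this for you.

```python
import re

# Assumption: a cue is text inside one pair of square brackets, no nesting.
CUE_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def list_cues(script: str) -> list[str]:
    """Return the cue names in order of appearance, without brackets."""
    return CUE_PATTERN.findall(script)


def strip_cues(script: str) -> str:
    """Return the script with cues removed, for a plain-read engine."""
    without_cues = CUE_PATTERN.sub("", script)
    # Collapse runs of spaces left behind where a cue was removed.
    return re.sub(r"[ \t]{2,}", " ", without_cues).strip()
```

Run on the sample line from the top of this page, `list_cues` finds `warmly` and `strip_cues` leaves the sentence reading as ordinary prose.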

The second difference is fit. A news digest wants Kokoro's fast even read, a character wants one of Grok Voice's five personas, and a tutorial wants MAI Voice 2 slowed to teaching pace. No single engine wins every text to speech job, which is why this studio holds five and the open benchmark publishes how each one actually measures.

The third difference is the script itself. Spell numbers the way you would say them, break up long sentences, and read the result aloud once before generating. Text to speech rewards writing for the ear, and the expressive engines reward it most of all.

Questions

The honest answers.

What Text to Speech can and cannot do today, in plain language.

Is the audio real, or prerecorded?
Real. The console generates live audio through real engines on every click, and the samples on this page are real engine output, recorded unedited. No voice actors, no marketing mockups.

Which engine should I pick?
Use Gemini Flash when you want a voice to act your bracketed [emotion] cues, Kokoro for the fastest clean drafts, and Grok Voice for its five English personas. MAI Voice 2 adds real style and speed controls, and Zonos brings four American and British voices. The open benchmark shows the trade-offs side by side.

Who owns what I generate?
You do. Every generation is yours to export as MP3 or WAV and ship commercially, worldwide, with no watermark.

How long can my script be?
Up to 30,000 characters per text to speech generation. For a full audiobook you can chain sections together, which is what our Audiobook Studio workflow is built to do.
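
To picture what chaining sections means for a manuscript longer than the limit, here is a rough Python sketch that splits a script at paragraph breaks so each section stays under 30,000 characters. It is an illustration only: `split_script` is a hypothetical helper, it assumes paragraphs are separated by blank lines, and it does not describe how Audiobook Studio itself divides a book.

```python
MAX_CHARS_PER_GENERATION = 30_000  # limit quoted on this page


def split_script(script: str, limit: int = MAX_CHARS_PER_GENERATION) -> list[str]:
    """Group paragraphs into sections that are each at most `limit` characters."""
    sections: list[str] = []
    current = ""
    for paragraph in script.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if len(paragraph) > limit:
            # This sketch never cuts mid-paragraph; the caller must split it.
            raise ValueError("a single paragraph exceeds the limit")
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= limit:
            current = candidate
        else:
            # Adding this paragraph would overflow, so close the section.
            sections.append(current)
            current = paragraph
    if current:
        sections.append(current)
    return sections
```

Splitting at paragraph breaks rather than at a fixed character count keeps each section starting on a natural pause, so the joins between generated files are less likely to fall mid-sentence.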

How does pricing work?
Flat, not per-credit. The free tier costs $0 and runs on the open-weight Kokoro engine. Paid tiers unlock every engine for a flat monthly price, so you are never charged per character.

Is text to speech good enough for finished work, or only drafts?
Both, and you draw the line. Kokoro drafts fast and free; the expressive engines render the keepers. The samples on this page are unedited engine output, so judge finished quality with your own ears rather than our adjectives.

Keep exploring

Start generating with Text to Speech.

Free to start, no credit meter. Open the console and hear it for yourself.