
How to use the Text to Speech studio

The editor, choosing an engine by character, directing a read with bracketed cues, the fine controls, and Enhance.

Updated June 11, 2026

The studio at a glance

The studio has three surfaces. The editor on the left holds your script, up to 30,000 characters per generation. The panel on the right has two tabs: Settings, where you pick the voice and engine, and History, which lists your recent takes. The transport bar appears along the bottom after your first generation, with playback, a scrubber, and a download button.

Bracketed [cues] in your script are highlighted as you type, so you can see your stage directions at a glance.
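
The highlighting described above can be pictured as a pattern match over the script. This is a minimal sketch, not the studio's actual implementation: the name `find_cues` is hypothetical, and it assumes a cue is any non-empty text inside a single pair of square brackets on one line.

```python
import re

# Assumption: a cue is non-empty text inside one pair of square brackets,
# with no nested brackets and no line break. The studio's real rule may differ.
CUE_PATTERN = re.compile(r"\[([^\[\]\n]+)\]")


def find_cues(script: str) -> list[tuple[int, int, str]]:
    """Return (start, end, cue_text) for every bracketed cue in the script."""
    return [(m.start(), m.end(), m.group(1)) for m in CUE_PATTERN.finditer(script)]


script = "[pause] [excited] And Ember was about to find it."
for start, end, cue in find_cues(script):
    print(start, end, cue)
# 0 7 pause
# 8 17 excited
```

The character offsets are what an editor would need to paint a highlight over each cue while leaving the spoken words untouched.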

Choosing an engine

Five engines share the studio, and they have different characters. Pick by what the job needs, not by habit:

Engine        | Character                     | Bracketed cues
Gemini Flash  | Expressive, acts your [cues]  | Acts them
Kokoro        | Fastest drafts                | Plain read
Grok Voice    | Five personas                 | Plain read
MAI Voice 2   | Real style and speed controls | Plain read
Zonos         | American and British voices   | Plain read

A good rhythm: draft on Kokoro, where regeneration is nearly instant, then switch to Gemini Flash for the final acted take. The numbers behind that advice are in the engines guide.
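
If it helps to see the engine choice as data, the sketch below restates the five engines and the draft-then-final rhythm in Python. The `Engine` class and `pick_engine` function are illustrative names only and are not part of any studio API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Engine:
    name: str
    character: str
    acts_cues: bool  # True if the engine performs bracketed cues


# Values copied from the engine comparison in this guide.
ENGINES = [
    Engine("Gemini Flash", "Expressive, acts your [cues]", True),
    Engine("Kokoro", "Fastest drafts", False),
    Engine("Grok Voice", "Five personas", False),
    Engine("MAI Voice 2", "Real style and speed controls", False),
    Engine("Zonos", "American and British voices", False),
]


def pick_engine(final_take: bool) -> Engine:
    """Draft on Kokoro, then switch to Gemini Flash for the acted final take."""
    wanted = "Gemini Flash" if final_take else "Kokoro"
    return next(e for e in ENGINES if e.name == wanted)


print(pick_engine(final_take=False).name)  # Kokoro
print(pick_engine(final_take=True).name)   # Gemini Flash
```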

Directing with bracketed cues

A cue is a stage direction in square brackets. It is not read aloud; it directs the line that follows it. Write cues inline, or tap one in the Expression palette to insert it at your cursor:

[warmly] Once upon a time, in a valley so green the
hills looked painted, there lived a small fox named Ember.

[whispering] But the valley kept a secret.

[pause] [excited] And Ember was about to find it.

  • The palette ships these cues: [whispering], [calm], [quietly], [nervously], [trembling], [breathing heavily], [excited], [pause]. You can also write your own, like [sarcastically] or [out of breath].
  • Gemini Flash is the engine that acts cues. It is the default, and the studio badges it Acts [cues].
  • The other four engines read plainly: they skip the brackets rather than perform them, and the studio labels them Plain read so there is no surprise.
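
To preview what a plain-read engine would actually speak, you can strip the cues from a script yourself. The sketch below is one way to do that. It assumes cues are simply dropped and the leftover spacing tidied; the guide does not document how each engine skips brackets internally, and `spoken_text` is a hypothetical helper.

```python
import re

# Assumption: a cue is non-empty text inside one pair of square brackets.
CUE_PATTERN = re.compile(r"\[[^\[\]\n]+\]")


def spoken_text(script: str) -> str:
    """Return the script with bracketed cues removed, line by line."""
    without_cues = CUE_PATTERN.sub("", script)
    # Collapse the spaces left behind by removed cues, but keep line breaks.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in without_cues.splitlines()]
    return "\n".join(lines)


print(spoken_text("[pause] [excited] And Ember was about to find it."))
# And Ember was about to find it.
```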

Fine controls (MAI Voice 2)

MAI Voice 2 is the one engine here that accepts real delivery parameters, so it is the one engine that shows sliders. Speed runs from 0.5x to 2x. Style picks a delivery: Neutral, cheerful, excited, sad, or angry. Once a style other than Neutral is chosen, an Intensity slider controls how strongly that style is applied. Neutral sends no style parameter at all.
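
The rules above can be summarized as a small validation function. This is a sketch under stated assumptions: the key names `speed`, `style`, and `intensity`, the default speed of 1.0, and the function `build_delivery_params` are all hypothetical, and the Intensity range is not documented here, so it is passed through unchecked.

```python
from typing import Optional

STYLES = {"cheerful", "excited", "sad", "angry"}


def build_delivery_params(
    speed: float = 1.0,              # assumed default; the guide only gives the range
    style: Optional[str] = None,     # None or "neutral" means no style
    intensity: Optional[float] = None,
) -> dict:
    """Build a parameter dict following the slider rules described above."""
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    params: dict = {"speed": speed}
    if style is None or style == "neutral":
        # Neutral sends no style parameter at all, so intensity is dropped too.
        return params
    if style not in STYLES:
        raise ValueError(f"unknown style: {style!r}")
    params["style"] = style
    if intensity is not None:
        # The valid intensity range is not documented, so no range check here.
        params["intensity"] = intensity
    return params


print(build_delivery_params(speed=1.25, style="neutral"))
# {'speed': 1.25}
```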

The other engines show no sliders, because sliders would have no effect on them. No fake sliders, ever.

Enhance: the auto-director

Enhance script sends your text to a directing pass that inserts [emotion] cues where they help: a [warmly] before the reunion, a [pause] before the reveal. Your words are not rewritten; the cues are added around them.

It is one click, and one click to undo: the Undo enhance button restores your script exactly as it was. Enhance is available when the selected engine acts cues; on plain-read engines the button is disabled, because inserting cues an engine will ignore would just be noise.
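
The behavior described above (cues added around unchanged words, a one-step undo, and no Enhance on plain-read engines) can be modeled in a few lines. This is an illustrative sketch only: `Editor`, `words_only`, and `toy_director` are hypothetical names, and the real directing pass is replaced by a stand-in that prepends a single cue.

```python
import re
from typing import Callable, Optional

# Assumption: a cue is non-empty text inside one pair of square brackets.
CUE_PATTERN = re.compile(r"\[[^\[\]\n]+\]")


def words_only(script: str) -> list[str]:
    """The script's words with every bracketed cue removed."""
    return CUE_PATTERN.sub(" ", script).split()


class Editor:
    def __init__(self, script: str, engine_acts_cues: bool) -> None:
        self.script = script
        self.engine_acts_cues = engine_acts_cues
        self._before_enhance: Optional[str] = None

    def enhance(self, director: Callable[[str], str]) -> bool:
        """Apply a directing pass; return False on plain-read engines."""
        if not self.engine_acts_cues:
            return False  # the button is disabled in this case
        enhanced = director(self.script)
        if words_only(enhanced) != words_only(self.script):
            raise ValueError("directing pass changed the words")
        self._before_enhance = self.script  # snapshot for Undo enhance
        self.script = enhanced
        return True

    def undo_enhance(self) -> None:
        """Restore the script exactly as it was before the last enhance."""
        if self._before_enhance is not None:
            self.script = self._before_enhance
            self._before_enhance = None


def toy_director(script: str) -> str:
    # Stand-in for the real directing pass: prepend one cue.
    return "[warmly] " + script


editor = Editor("They met again at the gate.", engine_acts_cues=True)
editor.enhance(toy_director)
print(editor.script)  # [warmly] They met again at the gate.
editor.undo_enhance()
print(editor.script)  # They met again at the gate.
```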

Generate, save, download

Press Generate, or Ctrl+Enter (Cmd+Enter on a Mac) from inside the editor. While the clip renders, the transport bar shimmers; when the clip is ready, it plays.

When you are signed in, every take is saved to your private library automatically. The History tab keeps your recent takes beside the editor, each with its script (Use script puts it back in the editor) and a download link. Downloads are standard MP3 or WAV files, depending on the engine.