
What is text to speech? TTS, explained simply

Text to speech (TTS) is software that turns written words into spoken audio. This entry covers what modern neural engines actually do and how to estimate a script's running time in minutes.

Updated June 11, 2026

The definition

Text to speech, usually shortened to TTS, is software that converts written text into spoken audio. You give it a script and a voice; it returns an audio file of that voice reading your words. Modern TTS is generative: a neural model trained on enormous amounts of recorded speech produces the read, which is why today's output sounds like a person rather than a navigation system.

The output is an ordinary audio file, typically MP3 or WAV, that you can edit, publish, and distribute like any other recording.
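
Because the output is an ordinary file, standard tooling can read it. As a minimal sketch, assuming the engine returned an uncompressed WAV, Python's built-in wave module can report how long the recording runs. The function name wav_duration_seconds and the file name in the usage note are illustrative, not part of any TTS product.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the playing time of a WAV file in seconds."""
    # The wave module reads uncompressed PCM WAV only; an MP3 export
    # would need a third-party library instead.
    with wave.open(path, "rb") as wav_file:
        frame_count = wav_file.getnframes()   # total audio frames in the file
        frame_rate = wav_file.getframerate()  # frames per second, e.g. 44100
    return frame_count / frame_rate
```

Calling wav_duration_seconds("narration.wav") on a hypothetical export returns its length in seconds; dividing by 60 gives minutes.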

Why engines sound different from each other

There is no single TTS. Each engine is a different model with a different training history, so each has a character: one excels at acted, emotional reads, another at clean fast narration, another at a specific accent. Treating engines as interchangeable is the most common beginner mistake; matching the engine to the job is most of the craft.

Some engines also accept stage directions written in square brackets, a capability covered in the cue entry. Most do not, and simply skip the bracketed text.

Sizing a script

A useful planning ratio: roughly 1,000 characters of script become about one minute of audio. Read speed varies with the voice and the writing, so treat the ratio as an estimate, not a promise. A 5,000-character blog post runs around five minutes; an 80,000-word novel, assuming roughly six characters per word with spaces included, lands near eight to nine hours.
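
The planning ratio can be sketched in a few lines of Python. This is a rough illustration, not Cantari's calculator: the 1,000 characters per minute figure comes from the text above, while the six characters per word used to convert word counts is an assumption, and all function names are hypothetical.

```python
# From the text: roughly 1,000 characters of script per minute of audio.
CHARS_PER_MINUTE = 1_000

# Assumption (not from the text): average word length including the
# trailing space, used only to turn a word count into a character count.
AVG_CHARS_PER_WORD = 6


def estimate_minutes_from_chars(char_count: int,
                                chars_per_minute: int = CHARS_PER_MINUTE) -> float:
    """Estimate audio length in minutes from a character count."""
    if char_count < 0:
        raise ValueError("char_count must be non-negative")
    if chars_per_minute <= 0:
        raise ValueError("chars_per_minute must be positive")
    return char_count / chars_per_minute


def estimate_minutes_from_words(word_count: int,
                                avg_chars_per_word: int = AVG_CHARS_PER_WORD,
                                chars_per_minute: int = CHARS_PER_MINUTE) -> float:
    """Estimate audio length in minutes from a word count."""
    return estimate_minutes_from_chars(word_count * avg_chars_per_word,
                                       chars_per_minute)


def format_duration(minutes: float) -> str:
    """Render a minute count as 'H h M min', or 'M min' under an hour."""
    hours, mins = divmod(round(minutes), 60)
    return f"{hours} h {mins} min" if hours else f"{mins} min"


if __name__ == "__main__":
    print(format_duration(estimate_minutes_from_chars(5_000)))    # blog post
    print(format_duration(estimate_minutes_from_words(80_000)))   # novel
```

Under these assumptions the blog post comes out at 5 minutes and the novel at 8 hours, the low end of the range quoted above. A slower voice or longer average words pushes the figure up, which is why the ratio is only an estimate.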

Want the conversion done for you? The words to minutes calculator does this arithmetic interactively.

Text to speech on Cantari

Cantari runs five engines side by side in one studio, described honestly by character: Gemini Flash for expressive acted reads, Kokoro for fast unlimited drafts, Grok Voice for personas, MAI Voice 2 for style and speed controls, and Zonos for American and British accents.

Start with the Text to Speech tool, or read the studio guide for the editor, the engines, and the controls. The full roster comparison, with third-party quality scores, lives in the engines guide.