
How to turn a manuscript into an audiobook

Import a manuscript, split it into chapters and segments, render each line as real audio, and export stitched chapter WAVs. Beta, documented honestly.

Updated June 11, 2026

What the studio is

Audiobook Studio is the chaptered long-form workspace, currently in beta: the speech pipeline underneath is the same live one the rest of the app uses, while the studio itself is newer. Every render is real audio; there is no fake progress anywhere in it.

Two things persist in different places, and the studio says so plainly. Your project structure (chapters, segment text, voice choices) auto-saves to this browser. Rendered takes auto-save to your library when you are signed in. The in-session players are temporary, so to get a stitched chapter, export it in the same session you rendered it.

Chapters and segments

A project is a list of chapters, and each chapter is a list of segments, usually a paragraph each. Each chapter carries a default engine and voice; new segments inherit it, and any individual segment can override it.

Segments are the unit of work. You edit a segment's text, render just that segment, and play or download just that take. A chapter-level Render button renders only the segments that need it, sequentially, with a real progress count and a working Stop button. If one segment fails, the run stops there and every finished take is kept.
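As an illustration only, the chapter-level run described above could be sketched like this. The names `Segment`, `needs_render`, and `render_chapter`, and the `render` callable, are hypothetical and not the studio's actual code; the sketch just mirrors the stated behavior: skip segments that already have a current take, go one at a time, report a real count, honor Stop, and halt on the first failure while keeping finished takes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Segment:
    text: str
    take: Optional[bytes] = None  # rendered audio, if any
    stale: bool = False           # True when edited since render


def needs_render(seg: Segment) -> bool:
    # A segment needs work if it has no take or its take is out of date.
    return seg.take is None or seg.stale


def render_chapter(
    segments: List[Segment],
    render: Callable[[str], bytes],                       # hypothetical engine call
    should_stop: Callable[[], bool] = lambda: False,      # wired to a Stop button
    on_progress: Callable[[int, int], None] = lambda done, total: None,
) -> Tuple[int, int]:
    pending = [s for s in segments if needs_render(s)]
    done = 0
    for seg in pending:
        if should_stop():
            break
        try:
            audio = render(seg.text)
        except Exception:
            break  # stop at the first failure; earlier takes are untouched
        seg.take = audio
        seg.stale = False
        done += 1
        on_progress(done, len(pending))
    return done, len(pending)
```

Because each take is stored as soon as it finishes, stopping or failing partway leaves every completed segment playable.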

When you edit a segment that already has a take, it is marked "edited since render", and it keeps playing the old take until you re-render it. The studio never silently swaps audio out from under you.

Importing a manuscript

"Import manuscript" accepts a paste or a plain .txt file (read on your machine), up to 150,000 characters, which is about two and a half hours of audio at the house rate of roughly 1,000 characters per minute. Larger books should be split into parts and imported one at a time.
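The arithmetic behind that estimate, plus one possible way to cut a larger book into importable parts, can be sketched as follows. The constants come from the figures above; the function names and the choice to split on blank-line paragraph boundaries are assumptions for illustration, not something the studio does for you.

```python
CHARS_PER_MINUTE = 1_000      # house rate quoted above (approximate)
IMPORT_LIMIT_CHARS = 150_000  # import limit quoted above


def estimated_minutes(char_count: int, chars_per_minute: int = CHARS_PER_MINUTE) -> float:
    return char_count / chars_per_minute


def split_into_parts(text: str, limit: int = IMPORT_LIMIT_CHARS) -> list:
    """Greedily pack whole paragraphs into parts of at most `limit` characters.

    Assumption: paragraphs are separated by blank lines. A single paragraph
    longer than the limit is left as its own oversized part rather than cut.
    """
    parts, current = [], ""
    for para in text.split("\n\n"):
        candidate = para if not current else current + "\n\n" + para
        if len(candidate) <= limit or not current:
            current = candidate
        else:
            parts.append(current)
            current = para
    if current:
        parts.append(current)
    return parts
```

At the limit, `estimated_minutes(150_000)` is 150 minutes, which is the two and a half hours mentioned above.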

Chapter detection is structure-only by design: a model reads a line-numbered view of your manuscript and returns only where each chapter starts and what to call it. Your original text is then split locally by those line numbers. The model never rewrites or echoes your manuscript at this step, so your words come through untouched.
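A minimal sketch of that structure-only split, under stated assumptions: the `N: ` prefix format, the function names, and the `(line_number, title)` shape of the model's answer are all hypothetical. The point it illustrates is that only line numbers and titles come back, and the chapter bodies are sliced from the original text.

```python
from typing import List, Tuple


def number_lines(manuscript: str) -> str:
    # The view sent for detection: each line prefixed with its 1-indexed number.
    lines = manuscript.splitlines()
    return "\n".join(f"{i}: {line}" for i, line in enumerate(lines, start=1))


def split_by_chapter_starts(
    manuscript: str, starts: List[Tuple[int, str]]
) -> List[Tuple[str, str]]:
    """Slice the original text at the reported 1-indexed start lines.

    Assumptions in this sketch: the heading line stays in the chapter body,
    and any text before the first reported start is attached to the first
    chapter so that nothing is dropped.
    """
    lines = manuscript.splitlines()
    ordered = sorted(starts)
    chapters = []
    for idx, (start, title) in enumerate(ordered):
        first = 1 if idx == 0 else start
        last = ordered[idx + 1][0] - 1 if idx + 1 < len(ordered) else len(lines)
        chapters.append((title, "\n".join(lines[first - 1:last])))
    return chapters
```

Joining the returned bodies with newlines reproduces the input lines exactly, which is the property the paragraph above describes.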

If detection cannot find headings, the manuscript is split into evenly sized chapters of about 3,000 words each, with an honest note telling you that happened. You can rename chapters anytime.
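One way such a fallback could work is sketched below. This is an assumption about the approach, not the studio's implementation: it picks a chapter count from the total word count, then closes each chapter at the first paragraph boundary that reaches its share, so paragraphs are never cut in half.

```python
def split_evenly(manuscript: str, target_words: int = 3000) -> list:
    """Split into roughly equal chapters of about `target_words` words each."""
    paragraphs = [p for p in manuscript.split("\n\n") if p.strip()]
    counts = [len(p.split()) for p in paragraphs]
    total = sum(counts)
    n = max(1, round(total / target_words))  # number of chapters
    per_chapter = total / n                  # each chapter's share of words

    chapters, current, seen = [], [], 0
    for para, count in zip(paragraphs, counts):
        current.append(para)
        seen += count
        # Close a chapter once the running total reaches its boundary;
        # the last chapter takes whatever remains.
        if len(chapters) < n - 1 and seen >= per_chapter * (len(chapters) + 1):
            chapters.append("\n\n".join(current))
            current = []
    if current:
        chapters.append("\n\n".join(current))
    return chapters
```

A short manuscript comes back as a single chapter, since the chapter count never drops below one.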

There is also a quick paste path on the empty state that needs no model at all: lines starting with "# " begin a new chapter, and blank lines split segments.

# Chapter One
The first paragraph becomes segment one.

The second paragraph becomes segment two.

# Chapter Two
And so on.
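The quick paste rules are simple enough to sketch without a model. The function name, the dictionary shape, the default title for text that appears before any heading, and joining consecutive non-blank lines with a space are all assumptions made for this illustration.

```python
def parse_quick_paste(text: str, default_title: str = "Chapter 1") -> list:
    """Lines starting with "# " begin a chapter; blank lines split segments."""
    chapters = []  # each item: {"title": str, "segments": [str, ...]}
    buffer = []    # lines of the segment currently being collected

    def flush():
        if buffer:
            if not chapters:
                # Text before any heading goes into a default chapter.
                chapters.append({"title": default_title, "segments": []})
            chapters[-1]["segments"].append(" ".join(buffer))
            buffer.clear()

    for raw in text.splitlines():
        if raw.startswith("# "):
            flush()
            chapters.append({"title": raw[2:].strip(), "segments": []})
        elif not raw.strip():
            flush()
        else:
            buffer.append(raw.strip())
    flush()
    return chapters
```

Run on the sample above, this yields two chapters: "Chapter One" with two segments and "Chapter Two" with one.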

The optional speakable pass

Import offers a "Make it speakable" checkbox. When on, each chapter is rewritten so engines read it correctly: abbreviations expand into full words ("Dr." to "Doctor", "etc." to "et cetera", "e.g." to "for example"), and numbers, dates, years, ordinals, currency amounts, and units become spoken words ("3rd" to "third", "1982" to "nineteen eighty-two", "5 km" to "five kilometers").

The pass is told never to paraphrase, cut, reorder, or add content beyond those expansions, and a length guard rejects any rewrite that comes back suspiciously short. Chapters longer than 12,000 characters are kept as written, with a note. If the pass hits repeated trouble it stops early, keeps the remaining chapters as written, and tells you so. Nothing is hidden: every per-chapter note is shown before the modal closes.
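The guards in that paragraph can be sketched as a small wrapper around the rewrite step. Only the 12,000-character limit comes from the text above; the 0.9 length ratio, the failure threshold of 2, the note wording, and the `rewrite` callable are hypothetical placeholders, since the actual values are not documented here.

```python
from typing import Callable, List, Optional, Tuple

MAX_CHAPTER_CHARS = 12_000  # from the text above


def accept_rewrite(original: str, rewritten: str, min_ratio: float = 0.9) -> bool:
    # Expansions make text longer, so a much shorter result suggests content
    # was cut. The 0.9 ratio is an assumed value for illustration.
    return len(rewritten) >= min_ratio * len(original)


def speakable_pass(
    chapters: List[Tuple[str, str]],  # (title, text)
    rewrite: Callable[[str], str],    # hypothetical model call
    max_failures: int = 2,            # assumed meaning of "repeated trouble"
) -> List[Tuple[str, str, Optional[str]]]:
    results, failures = [], 0
    for title, text in chapters:
        if failures >= max_failures:
            results.append((title, text, "kept as written: pass stopped early"))
            continue
        if len(text) > MAX_CHAPTER_CHARS:
            results.append((title, text, "kept as written: longer than 12,000 characters"))
            continue
        try:
            candidate = rewrite(text)
        except Exception:
            failures += 1
            results.append((title, text, "kept as written: rewrite failed"))
            continue
        if accept_rewrite(text, candidate):
            results.append((title, candidate, None))
        else:
            failures += 1
            results.append((title, text, "kept as written: rewrite came back too short"))
    return results
```

Every chapter gets an entry with either the rewrite or the original plus a note, so the notes can all be shown at the end.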

Keeping the voice consistent

Pick your narrator at the chapter level before you render, so every segment inherits it. Per-segment voice overrides are great for dialogue or a second narrator, but switching engines mid-chapter is audible: engines differ in tone, pacing, and output format.

Changing a segment's voice or engine marks its existing take stale, exactly like a text edit, so the "edited since render" marker also protects you from accidentally shipping a chapter with one segment in the wrong voice.
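One plausible way to get this behavior is to remember the inputs a take was rendered from and compare them with the segment's current inputs. The class and field names below are hypothetical; this is a sketch of the idea, not the studio's data model.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Segment:
    text: str
    engine: str
    voice: str
    take: Optional[bytes] = None
    rendered_from: Optional[Tuple[str, str, str]] = None  # inputs behind the take

    def inputs(self) -> Tuple[str, str, str]:
        return (self.text, self.engine, self.voice)

    def store_take(self, audio: bytes) -> None:
        # Record the audio together with the exact inputs that produced it.
        self.take = audio
        self.rendered_from = self.inputs()

    @property
    def edited_since_render(self) -> bool:
        # Stale when a take exists but text, engine, or voice changed since.
        return self.take is not None and self.rendered_from != self.inputs()
```

Note that the old take is never discarded on edit; it simply becomes stale until `store_take` is called again with a fresh render.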

Auto Tag can add performance [cues] to a segment for the cue-following Gemini Flash engine, with a one-click undo if you do not like the result.

Exports and where audio lives

"Export chapter" stitches the chapter's rendered takes, in order, into a single 24 kHz mono WAV and downloads it, named after the chapter. The stitch happens entirely in your browser from the takes you already rendered, so every segment in the chapter needs a current take first; the button is gated until then.
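The studio does this in the browser, but the idea of stitching is easy to show with Python's standard `wave` module. This sketch assumes every take is already a 24 kHz mono WAV with 16-bit samples; the 16-bit width is an assumption, and takes delivered as MP3 would need decoding first, which is not shown. The function name is hypothetical.

```python
import io
import wave

SAMPLE_RATE = 24_000  # 24 kHz, as stated above
CHANNELS = 1          # mono, as stated above
SAMPLE_WIDTH = 2      # bytes per sample; 16-bit PCM is assumed here


def stitch_wavs(takes: list) -> bytes:
    """Concatenate WAV takes (as bytes, in order) into one WAV file."""
    if not takes:
        raise ValueError("nothing to stitch: every segment needs a take")
    out = io.BytesIO()
    with wave.open(out, "wb") as writer:
        writer.setnchannels(CHANNELS)
        writer.setsampwidth(SAMPLE_WIDTH)
        writer.setframerate(SAMPLE_RATE)
        for take in takes:
            with wave.open(io.BytesIO(take), "rb") as reader:
                fmt = (reader.getnchannels(), reader.getsampwidth(), reader.getframerate())
                if fmt != (CHANNELS, SAMPLE_WIDTH, SAMPLE_RATE):
                    raise ValueError(f"unexpected take format: {fmt}")
                # Append this take's raw PCM frames after the previous ones.
                writer.writeframes(reader.readframes(reader.getnframes()))
    return out.getvalue()
```

Because the frames are appended without re-encoding, the stitched file's length is the sum of the takes' lengths.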

Individual takes are also yours: each segment's player downloads that take directly, as MP3 or WAV depending on the engine that rendered it, and each successful render auto-saves to your library when you are signed in.