What is speakable text normalization?
Normalization rewrites numbers, dates, and abbreviations into the words a voice should actually say: Dr. into Doctor, 1982 into nineteen eighty-two. Here is why that matters for long-form audio.
Updated June 11, 2026
The definition
Text normalization is the step that rewrites written forms into speakable words before synthesis. Print is full of shorthand that readers expand without thinking: Dr., 3rd, $5,200, 5 km, e.g., 1982. A voice engine has to make those expansions explicitly, and normalization is the name for doing it.
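To make the definition concrete, here is a minimal sketch of an expansion step in Python. It is an illustration only, not how any particular engine or Cantari works: the lookup table, the function names, and the rule that every number from 1100 to 1999 is read as a year are all assumptions made for the example.

```python
import re

# Hypothetical lookup table; a real normalizer would need far more entries.
ABBREVIATIONS = {
    "Dr.": "Doctor",
    "e.g.": "for example",
}

ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def two_digits(n: int) -> str:
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def year_to_words(year: int) -> str:
    """Read a year from 1100 to 1999 as two pairs of digits."""
    high, low = divmod(year, 100)
    if low == 0:
        return f"{two_digits(high)} hundred"
    if low < 10:
        return f"{two_digits(high)} oh {ONES[low]}"
    return f"{two_digits(high)} {two_digits(low)}"


def normalize(text: str) -> str:
    """Expand known abbreviations, then read 1100-1999 as years (naive)."""
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    return re.sub(r"\b1[1-9]\d{2}\b",
                  lambda m: year_to_words(int(m.group(0))),
                  text)


print(normalize("Dr. Lee was born in 1982."))
```

The sketch deliberately ignores context: it would also read "1982 widgets" as a year, which is the kind of wrong guess a real normalizer has to avoid.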
Why it goes wrong
The hard part is ambiguity. St. is Saint in St. Augustine and Street in Main St.; 1982 is a year in one sentence and a quantity in another; 3/4 might be a fraction or the fourth of March. Engines normalize internally and usually guess well, but a single wrong guess in hour seven of an audiobook is exactly the kind of error a listener remembers.
That is why long-form producers normalize explicitly, in the script, where a human can check the result, instead of trusting every engine's silent guess.
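The St. case can be sketched in a few lines of Python. The heuristic below is an assumption chosen for illustration (St. directly before a capitalized word is read as Saint, otherwise as Street), and expand_st is a hypothetical name. It returns every decision alongside the rewritten text so that a human can review the guesses instead of trusting them silently.

```python
import re

ST_PATTERN = re.compile(r"\bSt\.")


def expand_st(text: str):
    """Expand each "St." and record the choice so a person can check it."""
    decisions = []

    def choose(match: re.Match) -> str:
        following = text[match.end():]
        # Assumed heuristic: a capitalized word right after "St." suggests
        # a saint's name; anything else is treated as a street.
        word = "Saint" if re.match(r"\s+[A-Z]", following) else "Street"
        decisions.append((match.start(), word))
        return word

    return ST_PATTERN.sub(choose, text), decisions


rewritten, decisions = expand_st("We met on Main St. near St. Augustine.")
print(rewritten)
for position, word in decisions:
    print(f"offset {position}: read as {word}")
```

The heuristic fails when St. ends a sentence and the next sentence starts with a capital letter ("on Main St. The shop..."), so the recorded decisions are what make a human check possible.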
The speakable pass on Cantari
The Audiobook Studio offers a Make it speakable checkbox at import. When it is on, each chapter is rewritten with expansions only: abbreviations become full words, and numbers, dates, ordinals, currency, and units become spoken forms.
The pass runs under guardrails: it is instructed never to paraphrase, cut, reorder, or add content; a length guard rejects any rewrite that comes back suspiciously short; and every per-chapter note is shown before the import finishes. Your words, expanded, never replaced.
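A length guard of this kind can be sketched as follows. How Cantari actually measures length and what threshold it uses are not stated here, so the word-count comparison, the 0.9 ratio, and the name passes_length_guard are assumptions for illustration. The idea is that an expansion-only rewrite should almost never have fewer words than the original, so a noticeably shorter result suggests content was cut.

```python
def passes_length_guard(original: str, rewritten: str,
                        min_ratio: float = 0.9) -> bool:
    """Return False when the rewrite is suspiciously short.

    min_ratio is a hypothetical threshold: the rewrite must keep at least
    this fraction of the original word count.
    """
    original_words = len(original.split())
    if original_words == 0:
        return True  # an empty chapter has nothing to lose
    return len(rewritten.split()) / original_words >= min_ratio


original = "Dr. Lee paid $5,200 in 1982."
expanded = ("Doctor Lee paid five thousand two hundred dollars "
            "in nineteen eighty-two.")
truncated = "Doctor Lee paid."

print(passes_length_guard(original, expanded))
print(passes_length_guard(original, truncated))
```

A guard like this only catches cuts. It cannot detect paraphrasing or reordering that preserves length, which is why showing the per-chapter notes for review still matters.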
For a short script in the Text to Speech studio, the manual version works fine: write tricky strings out as words wherever a specific reading matters.
Related ideas
Normalization decides what words get said; cues decide how they are delivered. The two are independent, and good long-form scripts use both.