Glossary

What is speech to text? Transcription, explained simply

Speech to text (STT) turns recorded speech into written text. This entry covers how modern transcription models work, what limits their accuracy, and what they cannot do yet.

Updated June 11, 2026

The definition

Speech to text, shortened to STT and also called automatic speech recognition (ASR), is the reverse of text to speech: you give it audio of someone speaking, and it returns the words as written text. It is the technology behind transcripts, captions, dictation, and voice search.

Modern transcription is done by large neural models trained on paired audio and text. The best-known family is Whisper, which is why you will see capable systems described as Whisper-class: trained on huge multilingual audio sets, robust to accents, and accurate enough that the transcript usually needs only light cleanup.

What limits accuracy

Accuracy is mostly decided before the model ever runs, by the recording itself. The things that hurt a transcript:

Distance and echo: a speaker far from the microphone in a reflective room.
Overlap: two people talking at once is genuinely hard for any system.
Background sound: music or noise beds under the voice.
Specialized vocabulary: names, jargon, and code-switching between languages mid-sentence.
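
To see how much any of these factors costs you, compare a machine transcript against a hand-corrected one. A common measure is word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the machine output into the reference, divided by the number of words in the reference. The sketch below is a minimal illustration, not any product's scoring method. The function name and sample sentences are made up for the example, and it assumes simple lowercase, whitespace-split words with no punctuation handling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper (hypothetical name). Assumes whitespace tokenization
    and case-insensitive comparison; punctuation is not normalized.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# A misheard name: one substituted word out of six reference words.
reference = "please call doctor nguyen before noon"
hypothesis = "please call doctor win before noon"
print(word_error_rate(reference, hypothesis))
```

Scoring the same speech recorded under different conditions (close microphone versus far, with and without background music) gives a rough sense of which factor matters most for your material.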

What transcription is not

Plain transcription returns the words, nothing else. Knowing who said what is a separate problem called speaker diarization, and per-word timing for captions is another. Some products bundle these; many, honestly, do not, and it is worth checking before you build a workflow on that assumption.
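
One way to picture the difference is as data shapes. The sketch below uses hypothetical types (Word, Segment) that do not come from any particular product or API. It shows that a plain transcript is what remains once speaker labels and timings are thrown away, so you cannot recover them from the text alone.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One word with per-word timing (what captions need)."""
    text: str
    start: float  # seconds from the start of the audio
    end: float


@dataclass
class Segment:
    """A run of words attributed to one speaker (what diarization adds)."""
    speaker: str
    words: list[Word]


def plain_text(segments: list[Segment]) -> str:
    """Flatten to words only: this is all plain transcription returns."""
    return " ".join(w.text for seg in segments for w in seg.words)


def labeled_text(segments: list[Segment]) -> str:
    """Keep speaker labels, one line per segment; needs diarization."""
    return "\n".join(
        f"{seg.speaker}: {' '.join(w.text for w in seg.words)}"
        for seg in segments
    )


# Made-up sample data for illustration.
segments = [
    Segment("Speaker A", [Word("Shall", 0.00, 0.30),
                          Word("we", 0.30, 0.45),
                          Word("start?", 0.45, 0.90)]),
    Segment("Speaker B", [Word("Yes.", 1.20, 1.50)]),
]

print(plain_text(segments))
print(labeled_text(segments))
```

If a workflow needs the labeled or timed form, the tool has to produce it at transcription time; the flattened text carries no trace of either.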

Speech to text on Cantari

The Speech to Text tool takes an upload or an in-browser recording and returns an editable plain-text transcript from a Whisper-class model, usually in seconds. No speaker labels and no timestamps today, and the page says so rather than faking structure.

Details, formats, and the honest metering answer are in the Speech to Text guide. Transcription is also step one of the dubbing pipeline, where the transcript becomes the script for a translated re-voice.