
How to transcribe audio to text

Upload or record audio and get an accurate plain-text transcript from a Whisper-class model. Formats, the 25 MB cap, and what metering really counts.

Updated June 11, 2026

Upload or record

Speech to Text turns a recording into an editable transcript. You can drop in a file or record directly in the browser, up to 5 minutes per take. A Whisper-class model does the transcription, and the transcript usually comes back in seconds.

Accepted uploads: mp3, wav, m4a, webm, ogg, and flac, up to 25 MB per file. The mpga, mpeg, mp4, and oga extensions are read too; for mp4 video we read the audio track.
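If you want to check a file before uploading, the rules above can be sketched as a small pre-flight check. This is a minimal sketch, not the tool's own validation: the function name `check_upload` and its return shape are hypothetical, and it assumes "25 MB" means 25 × 1024 × 1024 bytes, which this page does not specify.

```python
from pathlib import Path

# Extensions taken from the list above.
PRIMARY_EXTENSIONS = {"mp3", "wav", "m4a", "webm", "ogg", "flac"}
ALSO_READ_EXTENSIONS = {"mpga", "mpeg", "mp4", "oga"}

# Assumption: "25 MB" is read as 25 MiB. The page does not say which unit is meant.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


def check_upload(filename: str, size_bytes: int) -> tuple[bool, str]:
    """Return (ok, reason) for a candidate upload (hypothetical helper)."""
    # Compare extensions case-insensitively and without the leading dot.
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext not in PRIMARY_EXTENSIONS | ALSO_READ_EXTENSIONS:
        return False, f"unsupported extension: {ext or '(none)'}"
    if size_bytes > MAX_UPLOAD_BYTES:
        return False, "file exceeds the 25 MB cap"
    return True, "ok"
```

A check like this only looks at the name and size. It cannot tell whether the file really contains audio in the claimed format.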

What the transcript looks like

The transcript comes back as plain text. There are no speaker labels and no timestamps today, because the transcription endpoint we use does not provide them, and we will not fake structure it did not return. If you need who-said-what or per-word timing, this tool does not do that yet.

Plain text only for now: no speaker diarization, no word or segment timestamps. The docs will change when the tool does.

Copy, download, save

Once a transcript exists, three actions appear: Copy puts the full text on your clipboard, Download .txt saves it as a plain-text file named after your audio, and Save to library stores the source audio together with its transcript in your private library (sign-in required for the library).

Action | What you get | Needs sign-in
Copy | Transcript on your clipboard | No
Download .txt | Plain-text file on your machine | No
Save to library | Source audio + transcript in your library | Yes
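If you script your own post-processing, the "named after your audio" behavior of Download .txt can be mirrored in a few lines. This is a sketch under assumptions: it guesses that the name is the audio file's name with its last extension swapped for `.txt`, and both function names are hypothetical.

```python
from pathlib import Path


def transcript_filename(audio_filename: str) -> str:
    """Derive a .txt name from the audio file's name (assumed naming rule)."""
    # Only the last extension is replaced; any directory part is dropped.
    return Path(audio_filename).with_suffix(".txt").name


def save_transcript(text: str, audio_filename: str, directory: str = ".") -> Path:
    """Write the transcript as UTF-8 plain text next to other downloads."""
    target = Path(directory) / transcript_filename(audio_filename)
    target.write_text(text, encoding="utf-8")
    return target
```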

How transcription is metered

Honest answer: by the transcript's character count, not by audio minutes. After a successful job, the number of characters in the returned transcript is recorded against the same monthly character allowance the voice tools spend. Since a minute of audio comes to roughly 1,000 characters of transcript, the two views track each other closely in practice.
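To budget ahead of time, the rule of thumb above converts in both directions. This is a rough estimate only: the 1,000 characters per minute figure is the approximation from this page, real speech varies with pace and pauses, and the function names are hypothetical.

```python
# Rule of thumb from the text above, not a measured rate.
CHARS_PER_MINUTE = 1_000


def estimate_characters(audio_seconds: float) -> int:
    """Estimate how many allowance characters a recording will use."""
    return round(audio_seconds / 60 * CHARS_PER_MINUTE)


def estimate_minutes_left(remaining_characters: int) -> float:
    """Estimate how many minutes of audio the remaining allowance covers."""
    return remaining_characters / CHARS_PER_MINUTE
```

For example, a full 5-minute take (300 seconds) is estimated at 5,000 characters. The amount actually recorded is the length of the transcript that comes back, not this estimate.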

If your monthly allowance is already spent, new transcriptions are blocked until it resets on the 1st of the month. A failed or empty transcription costs you nothing; usage is only recorded after success.
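The metering rules above can be sketched as a small model. This is an illustration of the described behavior, not the service's implementation: `Allowance`, `AllowanceSpent`, and `meter_transcription` are hypothetical names, characters are counted with Python's `len`, a whitespace-only result is treated as empty, and a transcript longer than the remaining allowance is recorded in full. The page does not specify those last three points.

```python
from dataclasses import dataclass
from typing import Callable


class AllowanceSpent(Exception):
    """Raised when the monthly allowance is already used up."""


@dataclass
class Allowance:
    monthly_limit: int  # characters per month
    used: int = 0       # characters recorded so far this month

    @property
    def remaining(self) -> int:
        return max(self.monthly_limit - self.used, 0)


def meter_transcription(allowance: Allowance, transcribe: Callable[[], str]) -> str:
    """Run a transcription job and record usage only if it succeeds."""
    # Blocked up front when nothing is left for the month.
    if allowance.remaining == 0:
        raise AllowanceSpent("monthly allowance spent; resets on the 1st")

    # If the job raises, we never reach the recording step, so a failure costs nothing.
    transcript = transcribe()

    # An empty result also costs nothing (whitespace-only counts as empty here).
    if not transcript.strip():
        return ""

    # Usage is the transcript's character count, not the audio's duration.
    allowance.used += len(transcript)
    return transcript
```

The order of operations is the point: the allowance check happens before the job, and the character count is recorded only after a non-empty transcript is in hand.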

You can try transcription without an account, within a polite rate limit. Signed-in use runs against your allowance instead.

Getting a better transcript

The model is good, but source quality still rules. Clear speech, close to the microphone, with minimal background noise transcribes best. Compressed formats like mp3 are fine; what hurts is room echo, crosstalk, and music under the voice.

The transcript header shows your file and the model class that transcribed it, so you always know what produced the text in front of you.