cantari
Formats

Every format we read and write.

These lists are generated from the same configuration the studio runs on, so they cannot drift from what the tools actually accept. If a format is missing here, the upload will tell you the same thing.

Verified against the studio · Updated 2026-06-10

Bring audio in

What you can upload.

Three tools take audio files. Dubbing transcribes first, so it reads the same uploads as Speech to Text; Voice Cloning takes the same set minus .mp4 video.

Bring words in

Manuscripts and chapters.

The Audiobook Studio takes text, not audio: paste the whole book or import a plain-text file, then export finished chapters.

Audiobook Studio

Manuscripts up to 150,000 characters

Paste a whole manuscript or import a plain-text file; export finished chapters as audio.

In: You can upload
  • .txt (Plain text): Paste your manuscript or import a .txt file.

Take audio out

What your generations export as.

No uploads here; you type, we generate. The engine you pick decides the container, and every export carries commercial rights.

Text to Speech

Scripts up to 30,000 characters

Type or paste a script; the engine you pick decides whether the audio comes back as MP3 or WAV.

Sound & Music

Prompts up to 600 characters

A plain-language prompt returns a finished instrumental clip.

In: You can upload

No file uploads: describe the clip you want.

At a glance

The full matrix.

Every format against every tool, computed from the same registry as the cards above.

Format  Speech to Text  Voice Cloning  Dubbing & Translation  Audiobook Studio  Text to Speech  Sound & Music
.mp3    in              in             in, out                out               out             out
.wav    in              in             in, out                out               out             -
.m4a    in              in             in                     -                 -               -
.mp4    in              -              in                     -                 -               -
.webm   in              in             in                     -                 -               -
.ogg    in              in             in                     -                 -               -
.flac   in              in             in                     -                 -               -
.txt    out             -              -                      in                -               -

Key: in = accepted as upload; out = available as export; - = neither.

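If you want to check a file name before uploading, the matrix above can be mirrored as a small lookup table. This is only a sketch: the dictionary layout and the names `UPLOADS`, `EXPORTS`, `accepts`, and `tools_accepting` are hypothetical, and the studio's actual registry format is not shown on this page.

```python
import os
from typing import Dict, List, Set

AUDIO = {".mp3", ".wav", ".m4a", ".webm", ".ogg", ".flac"}

# Hypothetical mirror of the matrix on this page, not the studio's real registry.
UPLOADS: Dict[str, Set[str]] = {
    "Speech to Text": AUDIO | {".mp4"},
    "Voice Cloning": set(AUDIO),
    "Dubbing & Translation": AUDIO | {".mp4"},
    "Audiobook Studio": {".txt"},
    "Text to Speech": set(),
    "Sound & Music": set(),
}
EXPORTS: Dict[str, Set[str]] = {
    "Speech to Text": {".txt"},
    "Voice Cloning": set(),
    "Dubbing & Translation": {".mp3", ".wav"},
    "Audiobook Studio": {".mp3", ".wav"},
    "Text to Speech": {".mp3", ".wav"},
    "Sound & Music": {".mp3"},
}


def accepts(tool: str, filename: str) -> bool:
    """Return True if the tool lists the file's extension as an upload."""
    ext = os.path.splitext(filename)[1].lower()
    return ext in UPLOADS.get(tool, set())


def tools_accepting(filename: str) -> List[str]:
    """List every tool whose upload set contains the file's extension."""
    return [tool for tool in UPLOADS if accepts(tool, filename)]
```

For example, `accepts("Voice Cloning", "clip.mp4")` is false under this table, while `tools_accepting("book.txt")` returns only the Audiobook Studio.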
Convert guides

One page per conversion.

Each common conversion has its own guide: what the format is, what actually produces such files, and the real limits, with the same registry behind every number.

Straight answers

Format questions, answered honestly.

Is there a file size limit?
Yes, and we publish the real numbers: Speech to Text and Dubbing take uploads up to 25 MB per file, Voice Cloning takes reference clips up to 20 MB and up to two minutes long, and the Audiobook Studio takes manuscripts up to 150,000 characters, which is about two and a half hours of finished narration.
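A rough pre-upload check built from those numbers might look like the sketch below. The function names are hypothetical, and it assumes 1 MB means 1,000,000 bytes, which the page does not specify. The narration estimate uses only the rate implied by the figures above: 150,000 characters in about 150 minutes, or roughly 1,000 characters per minute.

```python
from typing import List, Optional

MB = 1_000_000  # assumption: decimal megabytes; the page does not say which

# Limits as published in the answer above.
MAX_UPLOAD_BYTES = {
    "Speech to Text": 25 * MB,
    "Dubbing & Translation": 25 * MB,
    "Voice Cloning": 20 * MB,
}
MAX_CLONE_SECONDS = 120          # reference clips up to two minutes
MAX_MANUSCRIPT_CHARS = 150_000   # Audiobook Studio manuscripts

# Implied rate: 150,000 characters over about 2.5 hours (150 minutes).
CHARS_PER_MINUTE = MAX_MANUSCRIPT_CHARS / 150


def check_upload(tool: str, size_bytes: int,
                 duration_seconds: Optional[float] = None) -> List[str]:
    """Return a list of limit violations; an empty list means the file fits."""
    problems: List[str] = []
    limit = MAX_UPLOAD_BYTES.get(tool)
    if limit is not None and size_bytes > limit:
        problems.append(f"{tool}: file is over {limit // MB} MB")
    if (tool == "Voice Cloning" and duration_seconds is not None
            and duration_seconds > MAX_CLONE_SECONDS):
        problems.append(f"{tool}: clip is longer than {MAX_CLONE_SECONDS} seconds")
    return problems


def estimate_narration_minutes(char_count: int) -> float:
    """Approximate finished narration length from a character count."""
    return char_count / CHARS_PER_MINUTE
```

Under these assumptions a 75,000-character manuscript comes out at roughly 75 minutes of narration.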
What about video files like MP4?
MP4 works today in Speech to Text and Dubbing: we read the audio track and ignore the picture. Other video formats like MOV are not supported yet; if you can export the audio as MP3, WAV, or M4A first, every tool will read it.
What format should I record in?
Whatever your recorder makes. Browser recordings come out as WebM and phone voice memos as M4A, and we read both. For voice cloning, in-browser recordings are re-encoded to clean WAV automatically before upload, so you do not have to convert anything yourself.
Why do some exports come back as WAV and others as MP3?
The engine decides. Gemini Flash generates raw PCM that we deliver as WAV; the other engines and your cloned voices return MP3. Stitched chapter exports from the Audiobook Studio are always 24 kHz mono WAV, and Sound & Music clips are MP3.
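To illustrate what delivering raw PCM as WAV involves, here is a minimal sketch using Python's standard `wave` module. It is not the studio's pipeline: the helper name `pcm_to_wav` is hypothetical, the 16-bit sample width is an assumption, and the 24 kHz mono defaults are borrowed from the chapter-export format mentioned above rather than from any published engine spec.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 24_000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap headerless PCM samples in a WAV container.

    sample_width is in bytes; 2 means 16-bit samples (an assumption).
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)  # the header sizes are filled in on close
    return buf.getvalue()


# One second of 16-bit mono silence at 24 kHz.
silence = b"\x00\x00" * 24_000
wav_bytes = pcm_to_wav(silence)
```

The audio data itself is unchanged; the WAV file only adds a header that records the sample rate, channel count, and sample width so players know how to read it.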

Bring a file and see for yourself.

Drop a recording into Speech to Text and read the transcript in seconds. Free to start, no credit meter.