Supported audio and file formats

Every format each tool reads and writes, with the real caps: 25 MB uploads, 20 MB clone clips, 150,000-character manuscripts, MP3 and WAV out.

Updated June 11, 2026

The short version

The audio tools read the common formats: mp3, wav, m4a, webm, ogg, and flac, plus a few aliases (mpga and mpeg read as MP3, oga as OGG, and mp4 video is read for its audio track where listed). Output is always a standard file: MP3 or WAV for audio, .txt for transcripts. Nothing proprietary, nothing locked.

The canonical, always-current matrix lives on the formats page; it is generated from the same registry the upload validators use, so it cannot drift from what the tools actually accept. This guide is the readable companion to it. Each common conversion, from MP3 to text through text to WAV, also has its own short guide, linked from the formats page.

What each tool reads

Inputs and the real caps the servers enforce, tool by tool:

Tool | Accepts | Caps
Speech to Text | mp3, wav, m4a, mp4 (audio track), webm, ogg, flac | 25 MB per file
Dubbing & Translation | mp3, wav, m4a, mp4 (audio track), webm, ogg, flac | 25 MB per file; scripts up to 30,000 characters
Voice Cloning | mp3, wav, m4a, webm, ogg, flac | 20 MB per clip; up to 2 minutes of reference audio
Audiobook Studio | Pasted text or a .txt file | Manuscripts up to 150,000 characters
Text to Speech | Typed or pasted text | Scripts up to 30,000 characters
Sound & Music | A plain-language prompt | Prompts up to 600 characters
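
The caps in the table translate naturally into a check you can run before uploading. What follows is a minimal sketch in Python, not the service's own validator: the tool keys, the `ToolLimits` type, and the `check_file` and `check_text` helpers are hypothetical names. It assumes 1 MB means 1,048,576 bytes and that characters are counted as Unicode code points; this guide specifies neither.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional

# Assumption: the guide does not say whether "MB" means 10^6 or 2^20 bytes.
MB = 1024 * 1024

AUDIO = frozenset({"mp3", "wav", "m4a", "webm", "ogg", "flac"})


@dataclass(frozen=True)
class ToolLimits:
    extensions: FrozenSet[str]  # accepted upload extensions (empty: no uploads)
    max_bytes: Optional[int]    # per-file size cap, where the guide states one
    max_chars: Optional[int]    # text cap, where the guide states one


# Hypothetical tool keys; the values are copied from the table above.
# The 2-minute reference-audio limit for cloning is not checked here.
LIMITS = {
    "speech-to-text": ToolLimits(AUDIO | {"mp4"}, 25 * MB, None),
    "dubbing": ToolLimits(AUDIO | {"mp4"}, 25 * MB, 30_000),
    "voice-cloning": ToolLimits(AUDIO, 20 * MB, None),
    "audiobook-studio": ToolLimits(frozenset({"txt"}), None, 150_000),
    "text-to-speech": ToolLimits(frozenset(), None, 30_000),
    "sound-and-music": ToolLimits(frozenset(), None, 600),
}


def check_file(tool: str, filename: str, size_bytes: int) -> List[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    limits = LIMITS[tool]
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    problems = []
    if ext not in limits.extensions:
        problems.append(f"{tool} does not list .{ext or '?'} as an accepted format")
    if limits.max_bytes is not None and size_bytes > limits.max_bytes:
        problems.append(f"file is {size_bytes} bytes; the cap is {limits.max_bytes}")
    return problems


def check_text(tool: str, text: str) -> List[str]:
    """Return a list of problems for typed or pasted text."""
    limits = LIMITS[tool]
    if limits.max_chars is not None and len(text) > limits.max_chars:
        return [f"text is {len(text)} characters; the cap is {limits.max_chars}"]
    return []
```

For example, `check_file("voice-cloning", "reference.wav", 22 * MB)` reports one problem (the size), while `check_file("speech-to-text", "talk.mp3", 10 * MB)` reports none. The extension check only reads the file name, so it is a first filter rather than proof that the file will be accepted.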

What each tool writes

Output formats are decided by the engine doing the work, and they are always standard files you can open anywhere:

  • Text to Speech: Gemini Flash generates PCM that we deliver as WAV; Kokoro, Grok Voice, MAI Voice 2, Zonos, and your cloned voices return MP3.
  • Audiobook Studio: stitched chapter exports are 24 kHz mono WAV, assembled in your browser; individual takes download as MP3 or WAV depending on the engine that rendered them.
  • Dubbing & Translation: dubbed audio comes back as MP3 or WAV, depending on the voice engine; today dubbing re-voices on Gemini Flash, which delivers WAV.
  • Sound & Music: finished instrumental clips from Lyria 3, as MP3.
  • Speech to Text: a plain-text transcript you can copy or download as .txt.

What format should I record in?

If you control the recording, WAV or FLAC keeps everything; nothing is thrown away to compression. That matters most for voice cloning, where the engine learns from every detail of the reference clip.

If you already have an mp3 or an m4a (what phone voice memos usually produce), use it as is. Re-encoding a compressed file to WAV does not add quality back; it just makes the file bigger.

Browser recordings are handled for you. The in-app recorder produces webm, which every upload tool accepts, and the voice cloning studio re-encodes recordings to clean 24 kHz mono WAV on your machine before upload.
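
If you prepare WAV files yourself and want to see whether they already match the 24 kHz mono layout mentioned above, Python's standard `wave` module can read the header. This is a small sketch for local inspection only; the helper names `describe_wav` and `is_24khz_mono` are made up for illustration, and nothing in this guide says uploads must match this layout.

```python
import wave


def describe_wav(source) -> dict:
    """Read basic properties from a PCM WAV file (a path or a binary file object)."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "channels": wav.getnchannels(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "seconds": wav.getnframes() / rate,
        }


def is_24khz_mono(source) -> bool:
    """True when the file is 24 kHz with a single channel."""
    info = describe_wav(source)
    return info["sample_rate_hz"] == 24_000 and info["channels"] == 1
```

Calling `describe_wav("take.wav")` on a file of your own returns its sample rate, channel count, bit depth, and duration. The `wave` module handles uncompressed PCM WAV only; it does not read mp3, m4a, webm, ogg, or flac.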

Hitting the 25 MB cap with a WAV recording? Export the same audio as FLAC (lossless, smaller) or a high-bitrate mp3 and it will usually fit.
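
To see why WAV reaches the cap so quickly, it helps to do the arithmetic: uncompressed PCM takes sample rate × channels × bytes per sample for every second of audio. The sketch below is a rough estimate, assuming 1 MB = 1,048,576 bytes and a minimal 44-byte WAV header (real files can carry extra metadata); the function names are illustrative.

```python
MB = 1024 * 1024  # assumption: the guide does not define "MB" precisely


def pcm_bytes_per_second(sample_rate_hz: int, channels: int, bits_per_sample: int) -> int:
    """Bytes of uncompressed PCM audio produced per second."""
    return sample_rate_hz * channels * bits_per_sample // 8


def max_wav_seconds(cap_bytes: int, sample_rate_hz: int, channels: int,
                    bits_per_sample: int, header_bytes: int = 44) -> float:
    """Approximate duration of PCM WAV audio that fits under a size cap."""
    rate = pcm_bytes_per_second(sample_rate_hz, channels, bits_per_sample)
    return (cap_bytes - header_bytes) / rate


# CD-style stereo versus 24 kHz mono, both 16-bit, against a 25 MB cap.
stereo_seconds = max_wav_seconds(25 * MB, 44_100, 2, 16)
mono_seconds = max_wav_seconds(25 * MB, 24_000, 1, 16)
```

Under these assumptions, 16-bit 44.1 kHz stereo fits about two and a half minutes in 25 MB, while 16-bit 24 kHz mono fits about nine minutes. The numbers say nothing about how much smaller a FLAC or mp3 export will be, since that depends on the audio and the encoder settings.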

The aliases, for completeness

A few extensions are read as formats you already know: .mpga and .mpeg are read as MP3, .oga is read as OGG, and .mp4 is a video container whose audio track we read in Speech to Text and Dubbing. If a file plays but will not upload, renaming is not the fix; check the format list above and the per-format notes on the formats page.
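
The alias rules above amount to a small lookup. Here is a hedged sketch of that mapping in Python; `ALIASES` and `canonical_format` are hypothetical names, not part of any published API, and the function looks only at the file name.

```python
import os

# Aliases exactly as listed in this guide.
ALIASES = {"mpga": "mp3", "mpeg": "mp3", "oga": "ogg"}


def canonical_format(filename: str) -> str:
    """Map a file name to the format name the guide uses; '' if no extension."""
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    return ALIASES.get(ext, ext)
```

So `canonical_format("memo.MPGA")` gives `"mp3"` and `canonical_format("voice.oga")` gives `"ogg"`, while an extension that is not an alias passes through unchanged. Because the extension is only a label, this lookup cannot tell you what is actually inside the file, which is why renaming a file does not change whether it uploads.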