Supported audio and file formats
Every format each tool reads and writes, with the real caps: 25 MB uploads, 20 MB clone clips, 150,000-character manuscripts, MP3 and WAV out.
Updated June 11, 2026
The short version
The audio tools read the common formats: mp3, wav, m4a, webm, ogg, and flac, plus a few aliases (mpga and mpeg read as MP3, oga as OGG, and mp4 video is read for its audio track where listed). Output is always a standard file: MP3 or WAV for audio, .txt for transcripts. Nothing proprietary, nothing locked.
The canonical, always-current matrix lives on the formats page; it is generated from the same registry the upload validators use, so it cannot drift from what the tools actually accept. This guide is the readable companion to it. Each common conversion also has its own short guide, from MP3 to text through text to WAV, all linked from the formats page.
What each tool reads
Inputs and the real caps the servers enforce, tool by tool:
What each tool writes
Output formats are decided by the engine doing the work, and they are always standard files you can open anywhere:
- Text to Speech: Gemini Flash generates PCM that we deliver as WAV; Kokoro, Grok Voice, MAI Voice 2, Zonos, and your cloned voices return MP3.
- Audiobook Studio: stitched chapter exports are 24 kHz mono WAV, assembled in your browser; individual takes download as MP3 or WAV depending on the engine that rendered them.
- Dubbing & Translation: dubbed audio comes back as MP3 or WAV, depending on the voice engine. Today dubbing re-voices with Gemini Flash, so it arrives as WAV.
- Sound & Music: Lyria 3 returns finished instrumental clips as MP3.
- Speech to Text: a plain-text transcript you can copy or download as .txt.
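If you script around downloads, the engine-to-format rules above can be written down as a small lookup. This is a minimal sketch, not part of any product API: the dictionary keys are just this guide's engine names, and `expected_extension` is a hypothetical helper.

```python
# Illustrative lookup only: keys are the engine names used in this guide,
# not identifiers from a real API.
ENGINE_OUTPUT = {
    "Gemini Flash": "wav",   # generates PCM, delivered as WAV
    "Kokoro": "mp3",
    "Grok Voice": "mp3",
    "MAI Voice 2": "mp3",
    "Zonos": "mp3",
    "Cloned voice": "mp3",
    "Lyria 3": "mp3",        # Sound & Music clips
}


def expected_extension(engine: str) -> str:
    """Return the file extension this guide says the engine delivers."""
    try:
        return ENGINE_OUTPUT[engine]
    except KeyError:
        # Engines not named in this guide are deliberately not guessed at.
        raise ValueError(f"engine not covered by this guide: {engine!r}") from None
```

For example, `expected_extension("Gemini Flash")` returns `"wav"`, which is also what a dub rendered today would come back as.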
What format should I record in?
If you control the recording, WAV or FLAC keeps everything; nothing is thrown away to compression. That matters most for voice cloning, where the engine learns from every detail of the reference clip.
If you already have an mp3 or an m4a (what phone voice memos usually produce), use it as is. Re-encoding a compressed file to WAV does not add quality back; it only makes the file bigger.
Browser recordings are handled for you. The in-app recorder produces webm, which every upload tool accepts, and the voice cloning studio re-encodes recordings to clean 24 kHz mono WAV on your machine before upload.
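If you prepare reference clips yourself and want to match that 24 kHz mono WAV target, you can check a file with Python's standard `wave` module. This is a sketch under assumptions: `is_24k_mono_wav` is a hypothetical helper name, and nothing here reflects how the studio performs its own re-encoding.

```python
import wave


def is_24k_mono_wav(source) -> bool:
    """Return True if `source` (a path or binary file object) is a
    PCM WAV file sampled at 24 kHz with a single channel."""
    try:
        with wave.open(source, "rb") as wav:
            return wav.getframerate() == 24_000 and wav.getnchannels() == 1
    except (wave.Error, EOFError):
        # Not a WAV file the standard library can parse (or an empty file).
        return False
```

A file that fails this check is not necessarily unusable; the check only tells you whether it already matches the format the studio converts recordings to.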
Hitting the 25 MB cap with a WAV recording? Export the same audio as FLAC (lossless, smaller) or a high-bitrate mp3 and it will usually fit.
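To see how quickly uncompressed audio reaches the cap, you can work out the longest WAV that fits. The sketch below assumes 16-bit PCM, a plain 44-byte WAV header, and that "25 MB" means 25,000,000 bytes; the guide does not say whether the cap is decimal or binary, so treat the result as an estimate.

```python
UPLOAD_CAP_BYTES = 25 * 1_000_000  # assumption: decimal megabytes
WAV_HEADER_BYTES = 44              # canonical PCM header; extra chunks add more


def max_wav_seconds(sample_rate: int, channels: int,
                    bits_per_sample: int = 16,
                    cap_bytes: int = UPLOAD_CAP_BYTES) -> float:
    """Longest uncompressed PCM WAV, in seconds, that fits under the cap."""
    bytes_per_second = sample_rate * channels * bits_per_sample // 8
    return (cap_bytes - WAV_HEADER_BYTES) / bytes_per_second


if __name__ == "__main__":
    for label, rate, ch in [("44.1 kHz stereo", 44_100, 2),
                            ("24 kHz mono", 24_000, 1)]:
        print(f"{label}: about {max_wav_seconds(rate, ch):.0f} s")
```

Under those assumptions, a 44.1 kHz stereo WAV fills the cap in a little under two and a half minutes, while 24 kHz mono lasts a little under nine. That is why a FLAC or high-bitrate mp3 export usually solves the problem.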
The aliases, for completeness
A few extensions are read as formats you already know: .mpga and .mpeg are read as MP3, .oga is read as OGG, and .mp4 is a video container whose audio track we read in Speech to Text and Dubbing. If a file plays but will not upload, renaming is not the fix; check the format list above and the per-format notes on the formats page.