How we verify every fixed sample on this site
Engines intermittently truncate long takes. Every fixed clip here is transcribed back and checked before it ships.
Last updated June 11, 2026
A demo that lies is worse than no demo
This site is full of audio: benchmark rows, engine cards, voice previews, gallery clips, worked examples on the use-case pages. Every one of those fixed clips, 78 of them in the current build, is real output from the engine named beside it. That is the house rule, and it created an engineering problem worth writing down.
The problem is not fakery; it is flakiness. Voice engines intermittently truncate long takes: the clip arrives, sounds perfect for eight seconds, and is missing its final sentence. Generate one sample by hand and you catch it by ear. Generate a site's worth and you will eventually publish a clip that dies mid-sentence under a card promising studio-grade audio.
A demo like that is worse than no demo, because it is the one moment a visitor can check our claims directly with their own ears.
Pre-rendered from the same data the pages read
The fix starts with where the sample list comes from: nowhere that is maintained by hand. The generation script imports the same modules the site's components render (the engine registry, the gallery, the use-case examples, the tool pages) and enumerates every fixed engine, voice, and line combination they contain. If a page gains a sample line, the script's inventory gains it on the next run; there is no second list to forget to update.
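As a rough illustration of that single-inventory idea, here is a minimal sketch in Python. It is not the site's actual script: the names ENGINE_REGISTRY, GALLERY, SampleSpec and enumerate_samples, and the shape of the entries, are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleSpec:
    """One fixed clip: which engine speaks which line in which voice."""
    engine: str
    voice: str
    text: str


# Hypothetical stand-ins for the data modules the pages render from.
ENGINE_REGISTRY = [
    {"engine": "engine-a", "voice": "voice-1", "lines": ["Welcome to the studio."]},
    {"engine": "engine-b", "voice": "voice-2", "lines": ["Welcome to the studio."]},
]
GALLERY = [
    {
        "engine": "engine-a",
        "voice": "voice-1",
        "lines": ["Welcome to the studio.", "Here is a longer gallery line."],
    },
]


def enumerate_samples(*sources) -> list[SampleSpec]:
    """Walk every data source and collect each unique combination once."""
    seen: set[SampleSpec] = set()
    inventory: list[SampleSpec] = []
    for source in sources:
        for entry in source:
            for line in entry["lines"]:
                spec = SampleSpec(entry["engine"], entry["voice"], line)
                if spec not in seen:  # the same line on two pages is one clip
                    seen.add(spec)
                    inventory.append(spec)
    return inventory


inventory = enumerate_samples(ENGINE_REGISTRY, GALLERY)
```

Because the inventory is derived from the page data on every run, adding a line to a page is the only step needed to add a clip.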
Each clip is then rendered through the exact synthesis path the live studio uses, per-engine input shaping and all, and saved under a filename derived from its content key. A clip that already exists is skipped, not re-billed, so re-runs are cheap and the set converges instead of churning.
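One way to get that skip-if-present behaviour is to hash the clip's inputs into its filename. The sketch below assumes the content key is just engine, voice and text, that clips are stored as .mp3 files, and that a synthesize callable exists; the real key and file format are not specified here, and content_key, clip_path and render_missing are hypothetical names.

```python
import hashlib
from pathlib import Path


def content_key(engine: str, voice: str, text: str) -> str:
    """Stable short hash of everything that determines the audio (assumed inputs)."""
    # A separator that cannot appear in normal text keeps ("a", "bc") != ("ab", "c").
    payload = "\x1f".join((engine, voice, text)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def clip_path(out_dir: Path, engine: str, voice: str, text: str) -> Path:
    # The ".mp3" extension is an assumption for this sketch.
    return out_dir / f"{content_key(engine, voice, text)}.mp3"


def render_missing(specs, out_dir: Path, synthesize) -> list[Path]:
    """Render only the clips that are not on disk yet; return the new paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for engine, voice, text in specs:
        path = clip_path(out_dir, engine, voice, text)
        if path.exists():
            continue  # already rendered on an earlier run: skip, do not re-bill
        path.write_bytes(synthesize(engine, voice, text))
        created.append(path)
    return created
```

Running render_missing twice over the same specs calls the engine only on the first pass, which is what lets the set converge instead of churning.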
The transcribe-back loop
Then comes the part that earns the word verified. Every newly generated clip is checked by a second model before it can ship:
- The clip is transcribed back to text by a Whisper-class model, the same transcription call our Speech to Text tool makes.
- The transcript is normalized and must contain the last three words the engine was supposed to speak. Truncation, the failure mode we actually see, always eats the end of a take, so the end is what we check.
- Bracketed [cues] are stripped from the expectation first, because they are stage directions, not words an engine should say.
- A failing take is deleted and regenerated, up to three attempts per clip.
- If any clip still cannot verify, the whole run aborts and writes nothing: the site keeps its previous verified set rather than shipping one bad clip.
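The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline's real code: synthesize and transcribe are hypothetical callables standing in for the engine and the transcription model, the normalization rules are a guess, and a script with no checkable words is treated as a failure here by choice.

```python
import re

CUE_RE = re.compile(r"\[[^\]]*\]")  # bracketed [cues] are stage directions


def normalize(text: str) -> list[str]:
    """Lowercase, drop apostrophes, turn other punctuation into spaces, split."""
    lowered = text.lower().replace("'", "").replace("\u2019", "")
    return re.sub(r"[^\w\s]", " ", lowered).split()


def expected_tail(script: str, n: int = 3) -> list[str]:
    """The last n words the engine was supposed to speak, cues removed."""
    return normalize(CUE_RE.sub(" ", script))[-n:]


def contains_tail(transcript: str, script: str, n: int = 3) -> bool:
    """True if the transcript contains the expected final words in order."""
    tail = expected_tail(script, n)
    if not tail:
        return False  # assumption: nothing checkable counts as a failure
    words = normalize(transcript)
    return any(words[i:i + len(tail)] == tail
               for i in range(len(words) - len(tail) + 1))


class VerificationError(RuntimeError):
    pass


def generate_verified(script: str, synthesize, transcribe, max_attempts: int = 3) -> bytes:
    """Regenerate until the tail verifies, giving up after max_attempts."""
    for _ in range(max_attempts):
        audio = synthesize(script)
        if contains_tail(transcribe(audio), script):
            return audio
        # A failing take is simply discarded and the loop tries again.
    raise VerificationError(f"could not verify after {max_attempts} attempts: {script!r}")


def run_all(scripts, synthesize, transcribe) -> dict[str, bytes]:
    """All-or-nothing: any VerificationError propagates before anything is returned."""
    return {s: generate_verified(s, synthesize, transcribe) for s in scripts}
```

In this sketch the caller writes files only after run_all returns, so a single unverifiable clip leaves the previous set untouched.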
The numbers
The current verified set, counted at build time from the same generated module the audio players read:
* Counted from the generated sample manifest at build time, so this table updates itself when the pipeline next runs.
Why instant playback, and what stays live
The pleasant side effect is speed. When you press play on a fixed sample, you are fetching a small static file, not waiting on a generation queue, so playback is instant and costs nothing per listen. That is why the gallery can put dozens of real clips on one page and why the benchmark rows respond like a local music player.
Two things deliberately stay generated live: anything you type yourself, and cloned voices. Pre-rendering your words is impossible, and quietly swapping in canned audio would break the only rule that makes the samples worth anything. The players check the verified set first and fall back to the live engines for everything else: the same engines, the same path.
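The lookup order described here can be pictured with a small sketch. It assumes the verified set is exposed as a mapping from an (engine, voice, text) key to a static file URL and that a live synthesis callable returns a URL; resolve_audio, the manifest shape and the example path are all hypothetical.

```python
from typing import Callable, Mapping

SampleKey = tuple[str, str, str]  # (engine, voice, text), assumed key shape


def resolve_audio(
    key: SampleKey,
    manifest: Mapping[SampleKey, str],
    synthesize_live: Callable[[SampleKey], str],
) -> tuple[str, str]:
    """Return ("static", url) for a verified clip, else ("live", url)."""
    url = manifest.get(key)
    if url is not None:
        return ("static", url)  # pre-rendered and verified: instant playback
    # Typed text and cloned voices never appear in the manifest,
    # so they always reach the live engines.
    return ("live", synthesize_live(key))


manifest = {("engine-a", "voice-1", "Welcome to the studio."): "/samples/example.mp3"}
```

A fixed sample resolves to its static file, while the same engine and voice with user-typed text falls through to live generation.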
The clip date is printed where the samples live, and when an engine changes, the pipeline re-runs and this post's numbers change with it. Fixed audio, honestly earned, is still real audio. It just got checked twice.
Check our work, then make your own.
The benchmark is live and the studio is free to start. Every claim above is one click from its source.
