Dialog practice in two languages, from one studio.
Every lesson needs listening audio: dialogs, drills, comprehension passages, in the target language and the learner's own. Generate both sides on the multilingual engine, and regenerate the set when the curriculum changes.
No credit card · Real engines · The audio is yours

Your learners can read Spanish; what they cannot do is survive hearing it at full speed. Every unit needs dialog audio, two speakers with natural pacing, re-recorded whenever the curriculum changes. Hiring native speakers for every drill in a sixty-unit course is how edtech budgets die.
What language learning actually needs.
Listening practice is the most expensive part of a language curriculum to produce: it needs native-sounding speech in the target language, plus instruction in the learner's language, for every drill in every unit. Studio sessions per language pair do not scale, so courses ship with too little audio and learners meet the spoken language for the first time in the wild. Generated speech on a multilingual engine makes listening drills as revisable as the worksheets around them.
Real features, mapped to the job.
Every item here works today, or says plainly where it is still in progress.
Two languages, one engine
Gemini Flash is the multilingual engine here, so the target-language line and the instruction line come from the same studio, in the same session.
Dialog with intent
Bracketed cues like [cheerfully] or [slowly, clearly] shape the delivery, so a drill can sound like a market stall rather than a dictation test.
Drills that revise with the curriculum
When unit four gets rewritten, regenerate unit four. The flat allowance means audio stops being the reason a curriculum update waits.
Translate existing lessons
The live dubbing pipeline carries recorded lesson audio into eight languages: transcribe, translate, re-voice, with the script editable at each step.
Listening drill, unit 4: at the market (A2)
Listen to the vendor's question, then answer out loud in the pause.
¿Cuánto quiere? ¿Medio kilo, o un kilo entero?
She asked how much you want: half a kilo, or a whole kilo.
Gemini Flash voices both languages in one drill. It is the same multilingual engine behind the dubbing pipeline, so the Spanish is generated speech, not a recording you have to license.
- ~1 min of listening drill from every 1,000 characters
- 2 speakers in a dialog, one multilingual engine
- 8 languages the dubbing pipeline translates into
How it goes, step by step.
Step 1: Write the drill
Script the dialog with speaker labels: instruction lines in the learner's language, practice lines in the target language.
Step 2: Voice each speaker
Pick a Gemini Flash voice per speaker so the vendor and the narrator stay distinct across the unit.
Step 3: Generate and listen for pacing
Generate the lines, check the target-language pacing, and add cues like [slowly, clearly] where beginners need room.
Step 4: Export to your course
Export MP3 or WAV per line or per drill, yours to embed in the app or LMS.
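As a rough illustration of steps 1 to 3, the sketch below parses a labeled drill script, checks that every speaker has a voice assigned, and estimates running time from the "about 1 minute per 1,000 characters" figure on this page. It does not call Cantari: the `Speaker: [cue] text` line format, the voice names, and the assumption that only spoken text (not the bracketed cue) counts toward the estimate are all hypothetical and chosen for illustration.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional

# Assumed line format (not a documented Cantari format):
#   Speaker: [optional cue] spoken text
LINE_RE = re.compile(
    r"^(?P<speaker>[^:]+):\s*(?:\[(?P<cue>[^\]]+)\]\s*)?(?P<text>.+)$"
)


@dataclass
class DrillLine:
    speaker: str
    cue: Optional[str]  # e.g. "cheerfully" or "slowly, clearly"
    text: str


def parse_drill(script: str) -> List[DrillLine]:
    """Split a script into labeled lines, rejecting lines without a speaker."""
    lines = []
    for raw in script.strip().splitlines():
        raw = raw.strip()
        if not raw:
            continue
        match = LINE_RE.match(raw)
        if match is None:
            raise ValueError(f"Line has no speaker label: {raw!r}")
        lines.append(
            DrillLine(match["speaker"].strip(), match["cue"], match["text"].strip())
        )
    return lines


def missing_voices(lines: List[DrillLine], voices: Dict[str, str]) -> List[str]:
    """Return speakers that appear in the drill but have no voice assigned."""
    return sorted({line.speaker for line in lines} - set(voices))


def estimate_minutes(lines: List[DrillLine], chars_per_minute: int = 1000) -> float:
    """Rough duration from the page's ~1 min per 1,000 characters figure."""
    return sum(len(line.text) for line in lines) / chars_per_minute


SCRIPT = """
Narrator: Listen to the vendor's question, then answer out loud in the pause.
Vendor: [cheerfully] ¿Cuánto quiere? ¿Medio kilo, o un kilo entero?
Narrator: She asked how much you want: half a kilo, or a whole kilo.
"""

# Placeholder voice names; substitute whichever voices you pick in the studio.
VOICES = {"Narrator": "voice-a", "Vendor": "voice-b"}

drill = parse_drill(SCRIPT)
unassigned = missing_voices(drill, VOICES)
```

Keeping the cue separate from the spoken text makes it easy to add or change a cue such as [slowly, clearly] on one line without touching the wording learners hear.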
Designing language-learning audio that learners can keep up with.
Repetition is the curriculum, so make it cheap
A learner needs the same structure voiced five ways: statement, question, faster, slower, with a distractor. On a per-character meter those variants are where a language-learning budget quietly dies; on a flat allowance, generating all five becomes the default lesson design instead of a luxury.
Two voices keep the drill legible
Hold one voice for instructions in the learner's language and a different voice for the target language, and never swap them. Learners stop translating the frame and start listening for the content, because the voice itself tells them which language is coming next.
Mind the engine's language list
Gemini Flash, the multilingual engine here, speaks roughly two dozen languages, and the dubbing pipeline translates into eight. Before scripting a unit, check the language pair you teach against those lists. If your target language is missing, we would rather you learn that on this page than after a pilot.
Start with Gemini Flash.
Gemini Flash is the only multilingual engine here and the only one that acts on cues, which is exactly the pairing a language drill needs: real target-language speech, delivered with intent.
The only engine here that acts your bracketed [emotion] directions.
- Quality Elo: 1225
- Latency: 2770 ms (measured 2026-06-10)
- Languages: 24
- Rights: Commercial use; outputs are yours
“Escucha otra vez. ¿Cuánto cuesta el kilo de tomates?”
The honest answers.
What Cantari can and cannot do for language learning today, in plain language.
Which languages can I generate?
Can I slow the audio down for beginners?
Do I own the drill audio?
Try Cantari for language learning.
Free to start, no credit meter. Open the studio and hear it for yourself.