When we first shipped AI companion voice, we optimized for what every audio team optimizes for: fluency, intelligibility, and speed. The result was a voice that sounded like a trained broadcaster: clear, warm-ish, and, we learned, a little too fluent.
The fluency problem
Fluent voices are, paradoxically, harder to trust at length. They arrive too polished. They sound like someone who has rehearsed. And rehearsal is the opposite of presence.
You trust the friend who pauses mid-sentence to find the right word. You do not trust the radio announcer who never does.
So we went back and rebuilt the voice stack around three things: breath, pause, and disfluency. Not too much of any of them. Enough to feel like a person thinking.
What we changed in the AI companion voice model
The biggest structural change was the introduction of a small, learned prosody head that sits between the text model and the audio synth. It reads the incoming sentence, decides where a real person would breathe or pause, and inserts micro-timings that the synth then honors. It is, pleasingly, a tiny model — around 36M params. The gain in perceived warmth was larger than anything we got from scaling the synth itself.
What we’re still working on
Laughter, softly. Whispered emphasis. The quiet "mm-hm" of someone listening. These are harder. We'll get there. If you're reading this and you've been using the voice for a while, you'll notice it's already a little closer than last month.