
Building an AI Companion Voice That Breathes


When we first shipped AI companion voice, we optimized for what every audio team optimizes for: fluency, intelligibility, and speed. The result was a voice that read like a trained broadcaster. Clear, warm-ish, and — we learned — a little too fluent.

The fluency problem

Fluent voices are, paradoxically, harder to trust at length. They arrive too polished. They sound like someone who has rehearsed. And rehearsal is the opposite of presence.

You trust the friend who pauses mid-sentence to find the right word. You do not trust the radio announcer who never does.

So we went back and rebuilt the voice stack around three things: breath, pause, and disfluency. Not too much of any of them. Enough to feel like a person thinking.
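The "enough, not too much" constraint can be made concrete by rate-limiting the human signals rather than sprinkling them freely. Here is a toy sketch of that idea; the marker strings, the caps, and the candidate-site heuristics are illustrative assumptions, not the production logic.

```python
# Rate-limited insertion of pauses and disfluencies (illustrative only).
MAX_PAUSES = 2        # at most two audible pauses per sentence
MAX_DISFLUENCIES = 1  # at most one "uh" per sentence

def humanize(words):
    out, pauses, disfluencies = [], 0, 0
    for w in words:
        # a clause boundary (trailing comma) is a candidate pause site
        if out and out[-1].endswith(",") and pauses < MAX_PAUSES:
            out.append("<pause>")
            pauses += 1
        # a long word is a candidate "searching for the word" site
        if len(w) >= 9 and disfluencies < MAX_DISFLUENCIES:
            out.append("uh,")
            disfluencies += 1
        out.append(w)
    return out

print(" ".join(humanize("Honestly, I think, on balance, it is complicated".split())))
```

The caps are the whole point: without them, every clause boundary becomes a pause and the voice tips from "thinking" into "stalling".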

What we changed in the AI companion voice model

The biggest structural change was the introduction of a small, learned prosody head that sits between the text model and the audio synth. It reads the incoming sentence, decides where a real person would breathe or pause, and inserts micro-timings that the synth then honors. It is, pleasingly, a tiny model — around 36M params. The gain in perceived warmth was larger than anything we got from scaling the synth itself.
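The shape of that interface can be sketched as follows. The names (`Timing`, `insert_timings`) and the inline marker format are assumptions for illustration; the real prosody head is a learned model, whereas here its output is just hand-written data.

```python
from dataclasses import dataclass

@dataclass
class Timing:
    """One micro-timing decision from the prosody head."""
    index: int  # token position the marker precedes
    kind: str   # "breath" or "pause"
    ms: int     # duration the synth should honor

def insert_timings(tokens, timings):
    """Interleave prosody markers into the token stream handed to the synth."""
    by_index = {}
    for t in timings:
        by_index.setdefault(t.index, []).append(t)
    out = []
    for i, tok in enumerate(tokens):
        for t in by_index.get(i, []):
            out.append(f"<{t.kind}:{t.ms}ms>")
        out.append(tok)
    return out

tokens = "I think the answer is simpler than it looks".split()
timings = [Timing(index=2, kind="pause", ms=180),
           Timing(index=5, kind="breath", ms=240)]
print(" ".join(insert_timings(tokens, timings)))
```

The appeal of this split is that the synth stays dumb about prosody: it only has to honor explicit durations, so the small head can be retrained or tuned without touching the much larger synthesis model.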

[Figure: sound wave visualization. A 90-second sample of the old voice vs. the new, time-aligned. The flats and the breaths are the same audio.]

What we’re still working on

Laughter, softly. Whispered emphasis. The quiet "mm-hm" of someone listening. These are harder. We’ll get there. If you’re reading this and you’ve been on voice for a while, you’ll notice it’s already a little closer than last month.