The world of voice AI, with Mati Staniszewski of ElevenLabs

**John** (0:02)
Mati Staniszewski co-founded ElevenLabs in 2022 and has since scaled it into the $11 billion leader in AI audio. He's credited with capturing the humanness of speech, including realistic emotional inflection, and they're now expanding into everything from agentic workflows to music.
Thank you for joining us.

**Mati Staniszewski** (0:18)
Thanks for having me.

**John** (0:21)
Let me give you a place to start. I know how an LLM works at a high level; describe to me how an audio model works. If we were looking to build a toy one from scratch, Karpathy style, how does it work?

**Mati Staniszewski** (0:36)
In the early days, you would try to replicate speech exactly the way the human body produces it. So you would try to build an analog machine that effectively recreates a vocal tract. Then that progressed into trying to create digital signals for speech. Bell Labs was one of the first to try to create a structured set of signals that would represent speech. And that is the first precursor to what we do today.
Then you would try to stitch in phonemes, effectively the different sounds of how we humans speak, and then try to concatenate them together. That's another important part of the equation: based on the most probable next word, you would effectively pull the phonemes from your library of phonemes and bring them together. And then on to modern history, where we now effectively use neural nets similar to those in other domains. So you predict the next sound based on, of course, the context of the previous sounds, if it's streaming speech. If it's, let's say, a full context of audio, you use a combination of predicting the phonemes and the contextual text element of that work. And here, credit to my co-founder, Piotr, who effectively came up with that new idea of how you can create voice models that are reliable, high quality, and quick, by bringing a lot of the ideas from transformer models and from diffusion models into the speech space. So that prediction of the next token in the phoneme space wasn't something that was possible before.
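Editor's sketch (not part of the conversation): the phoneme-stitching approach described above can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not how any production system works: the unit library, the phoneme labels, and the `concatenate_units` helper are all hypothetical, and sine tones stand in for the recorded speech snippets a real concatenative synthesizer would store.

```python
import numpy as np

SAMPLE_RATE = 16000  # assumed sample rate for the toy example

def toy_unit(freq_hz, n_samples=1600):
    """Stand-in for a recorded phoneme: a short sine tone (assumption)."""
    t = np.arange(n_samples) / SAMPLE_RATE
    return np.sin(2 * np.pi * freq_hz * t)

# Hypothetical unit library mapping phoneme labels to stored waveforms.
UNIT_LIBRARY = {
    "HH": toy_unit(200),
    "AH": toy_unit(300),
    "L": toy_unit(250),
    "OW": toy_unit(350),
}

def concatenate_units(phonemes, fade=160):
    """Look up each phoneme's waveform and join them with a linear crossfade."""
    out = UNIT_LIBRARY[phonemes[0]].copy()
    ramp = np.linspace(0.0, 1.0, fade)
    for label in phonemes[1:]:
        nxt = UNIT_LIBRARY[label]
        # Blend the tail of the audio so far with the head of the next unit.
        out[-fade:] = out[-fade:] * (1.0 - ramp) + nxt[:fade] * ramp
        out = np.concatenate([out, nxt[fade:]])
    return out

audio = concatenate_units(["HH", "AH", "L", "OW"])
```

The crossfade is the simplest way to hide the seams between units. The limitation is visible in the structure itself: every sound the system can make has to already exist in the library, which is the constraint the neural approaches discussed next move away from.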
You always talk briefly about how you operate in the text space and in the waveform space, but there's also a mel spectrogram space. So usually you go from text to mel spectrogram to waveform.

**John** (2:15)
So what's a spectrogram space?

**Mati Staniszewski** (2:17)
It's like a visual representation of how the speech sounds across pitch, across energy, and then you transform that into a waveform.
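
Editor's sketch (not part of the conversation): the representation described above can be computed from a waveform in a few lines. This is a minimal NumPy sketch of a log-mel spectrogram, assuming a common textbook mel formula and arbitrary illustrative settings (16 kHz audio, 512-sample frames, 40 mel bands); the function names are hypothetical and this is not ElevenLabs' pipeline.

```python
import numpy as np

def hz_to_mel(f):
    # One widely used mel-scale formula (an assumption; variants exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sample_rate / 2, n_fft // 2 + 1)
    bank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, mid, hi = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def mel_spectrogram(wave, sample_rate=16000, n_fft=512, hop=128, n_mels=40):
    """Return a (frames, mel bands) array of log energies."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # energy per frequency bin
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel_energy + 1e-10)

# One second of a 440 Hz tone as a stand-in for speech.
t = np.arange(16000) / 16000
tone = np.sin(2 * np.pi * 440 * t)
spec = mel_spectrogram(tone)
print(spec.shape)  # (time frames, mel bands)
```

Each row is one short slice of time and each column is one pitch band, so plotting the array gives the "visual representation" mentioned above: time on one axis, pitch on the other, energy as brightness. Going the other way, from this array back to a waveform, is the decoding step the conversation turns to next.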

**John** (2:24)
Got it.

**Mati Staniszewski** (2:24)
So when WaveNet and the Tacotron models came along, they would effectively go from text to mel spectrogram, so that visual representation, and then the question is how you decode and encode that into the waveform to bring it across. And Piotr figured out how to abstract some of those steps and do the decoding and encoding a lot better.
So predicting the next phoneme was one of the big pieces. And the second big piece was, how do you bring context into the equation? What I mean by context is, if a voice actor was reading a piece of copy, they would know that, okay, this is a dialogue sequence, I need to deliver it as dialogue. If it's a happy sentence, I might need to read it as a happy sentence. But what happens before and after also comes into the equation, and you need to bring that across. And then there's a last big piece. A voice model has the sound of how you intonate the given fragment. But the second big part is the voice itself: the characteristics of accent, of style, of prosody across that voice. So when you actually try to vocalize something, when you create a voice model and turn text into audio, you need the text, and you also need the voice reference for how you want it to be spoken. So here is kind of the second big innovation: part of the context is how you decode and encode those features.
So when Bell Labs came up with their initial representation of speech, the big piece there was that you would effectively have hard-coded parameters for that speech. With ElevenLabs models...

**John** (3:49)
Hard-coded parameters for an enthusiastic speaker, British accents, that kind of stuff.

**Mati Staniszewski** (3:54)
Exactly. That kind of stuff, like the set of pitch elements that you can select, the set of energy spectrograms you can select from. And in our approach, effectively, you give the model the open-ended ability to select what those parameters should be. So it's not going to be a British, Polish, Spanish, or English speaker; the model will deduce those itself. The same goes for the other parameters that are not hard-coded, whether it's enthusiasm, whether it's sadness, et cetera.