Dario Amodei — "We are near the end of the exponential" Transcript — Dwarkesh Podcast

**Dwarkesh Patel** (0:00)
So, we talked three years ago. I'm curious, in your view, what has been the biggest update of the last three years? What has been the biggest difference between what it felt like three years ago versus now?

**Dario Amodei** (0:08)
Yeah. I would say, actually, the underlying technology, the exponential of the technology, has gone, broadly speaking, about as I expected it to go. I mean, there's plus or minus a year or two here, there's plus or minus a year or two there. I don't know that I would have predicted the specific direction of code. But actually, when I look at the exponential, it is roughly what I expected in terms of the march of the models from smart high school student to smart college student to beginning to do PhD and professional stuff, and in the case of code, reaching beyond that. The frontier is a little bit uneven. It's roughly what I expected. I will tell you though what the most surprising thing has been.
The most surprising thing has been the lack of public recognition of how close we are to the end of the exponential. To me, it is absolutely wild that you have people, within the bubble and outside the bubble, talking about just the same tired old hot-button political issues, and around us, we're near the end of the exponential.

**Dwarkesh Patel** (1:22)
I want to understand what that exponential looks like right now, because the first question I asked you when we recorded three years ago was, what's up with scaling? Why does it work? I have a similar question now, but I feel like it's a more complicated question because, at least from the public's point of view, yes, three years ago, there were these well-known public trends where across many orders of magnitude of compute, you could see how the loss improves. Now, we have RL scaling and there's no publicly known scaling law for it.
It's not even clear what exactly the story is of, is this supposed to be teaching the model skills, is this supposed to be teaching meta-learning? What is the scaling hypothesis at this point?

**Dario Amodei** (1:58)
Yeah. So I have actually the same hypothesis that I had even all the way back in 2017. So in 2017, I think I talked about it last time, but I wrote a doc called The Big Blob of Compute Hypothesis.
It wasn't about the scaling of language models in particular. When I wrote it, GPT-1 had just come out, right? So that was one among many things, right? Back in those days, there was robotics. People tried to work on reasoning as a separate thing from language models. There was scaling of the kind of RL that happened in AlphaGo and that happened with Dota at OpenAI, and people remember StarCraft at DeepMind, the AlphaStar. So it was written as a more general document. The specific thing I said was the following.
Rich Sutton put out "The Bitter Lesson" a couple of years later, but the hypothesis is basically the same. What it says is, all the cleverness, all the techniques, all the "we need a new method to do something" kind of thinking, that doesn't matter very much. There are only a few things that matter, and I think I listed seven of them.
One is how much raw compute you have. The second is the quantity of data that you have. Then the third is the quality and distribution of data, right? It needs to be a broad, broad distribution of data. The fourth is, I think, how long you train for. The fifth is you need an objective function that can scale to the moon. So, the pre-training objective function is one such objective function, right? Another objective function is the kind of RL objective function that says you have a goal, you're going to go out and reach the goal. Within that, of course, there are objective rewards like you see in math and coding. And there are more subjective rewards like you see in RL from human feedback or higher-order versions of that. And then the sixth and seventh were things around normalization or conditioning, just getting the numerical stability so that the big blob of compute flows in this laminar way instead of running into problems.
So that was the hypothesis, and it's a hypothesis I still hold. I don't think I've seen very much that is not in line with that hypothesis. And so the pre-training scaling laws were one example of what we see there, and indeed, those have continued going. I think now it's been widely reported, we feel good about pre-training. Pre-training is continuing to give us gains. What has changed is that now we're also seeing the same thing for RL, right? So we're seeing a pre-training phase, and then we're seeing an RL phase on top of that.
And with RL, it's actually just the same. Even other companies have published, in some of their releases, things that say, look, we train the model on math contests, AIME or other things of that kind, and how well the model does is log-linear in how long we've trained it. And we see that as well. And it's not just math contests. It's a wide variety of RL tasks. And so we're seeing the same scaling in RL that we saw for pre-training.
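To make "log-linear" concrete: it means the score grows by a roughly constant amount each time the amount of training is multiplied by a fixed factor, i.e. score ≈ a + b · log10(training steps). The following is a minimal sketch of fitting that relationship, not anything taken from the conversation. The function name `fit_log_linear`, the constants `TRUE_A` and `TRUE_B`, and all of the data points are hypothetical and synthetic; they are not real benchmark results from any lab.

```python
import math


def fit_log_linear(steps, scores):
    """Least-squares fit of score = a + b * log10(steps); returns (a, b)."""
    xs = [math.log10(s) for s in steps]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    # Standard closed-form simple linear regression on (log10(steps), score).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Synthetic, made-up data (an assumption for illustration only):
# points generated exactly from score = TRUE_A + TRUE_B * log10(steps).
TRUE_A, TRUE_B = 10.0, 15.0
steps = [10 ** k for k in range(1, 6)]  # 10, 100, ..., 100000
scores = [TRUE_A + TRUE_B * math.log10(s) for s in steps]

a, b = fit_log_linear(steps, scores)

# Under a log-linear trend, every 10x increase in training adds b points.
gain_per_10x = b
print(f"intercept={a:.2f}, gain per 10x training={gain_per_10x:.2f}")
```

On a plot with a logarithmic x-axis, data following this trend falls on a straight line, which is the shape described for the published RL training curves.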