John Schulman (OpenAI Cofounder) — Reasoning, RLHF, & plan for 2027 AGI Transcript — Dwarkesh Podcast

**Dwarkesh Patel** (0:00)
Today, I have the pleasure to speak with John Schulman, who is one of the co-founders of OpenAI and leads the post-training team here.
And he also led the creation of ChatGPT and is the author of many of the most important and widely cited papers in AI and RL, including PPO, and many others. So John, really excited to chat with you. Thanks for coming on the podcast.

**John Schulman** (0:21)
Thanks for having me on the podcast. I'm a big fan.

**Dwarkesh Patel** (0:23)
Thank you, thank you for saying that.
So the first question I had is, we have these distinctions between pre-training and post-training, beyond what is actually happening in terms of loss function and training regimes. I'm just curious, taking a step back conceptually, like what kind of thing is pre-training creating? What does post-training do on top of that?

**John Schulman** (0:44)
In pre-training, you're basically training to imitate all of the content on the internet or on the web, including websites and code and so forth. So you get a model that can basically generate content that looks like random web pages from the internet. And the model is also trained to maximize likelihood where it has to put a probability on everything. So the objective is basically predicting the next token, given the previous tokens. Tokens are like words or parts of words. And since the model has to put a probability on it, and we're training to maximize log probability, it ends up being very calibrated. So it can not only generate all of the content of the web, it can also assign probabilities to everything.
So the base model can effectively take on all of these different personas or generate all these different kinds of content. And then when we do post-training, we're usually targeting a narrower range of behavior where we basically want the model to behave like this kind of chat assistant. And it's a more specific persona where it's trying to be helpful. It's not trying to imitate a person. It's answering your questions or doing your tasks.
And we're optimizing on a different objective, which is more about producing outputs that humans will like and find useful, as opposed to just trying to imitate this raw content from the web.
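
The next-token objective described above can be illustrated with a minimal sketch. It is not from the conversation: the toy corpus, the bigram model, and the names `fit_bigram` and `avg_log_prob` are assumptions chosen for illustration, and real pre-training uses neural networks over far larger contexts.

```python
import math
from collections import Counter, defaultdict

def fit_bigram(tokens):
    """Estimate P(next | previous) from counts.

    For a bigram model, these relative frequencies are the
    maximum-likelihood estimate, so every context gets a proper
    probability distribution over next tokens.
    """
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {tok: c / sum(ctr.values()) for tok, c in ctr.items()}
        for prev, ctr in counts.items()
    }

def avg_log_prob(model, tokens):
    """Average log probability of each token given the previous one.

    This is the quantity that training pushes up. Pairs never seen
    during fitting would raise KeyError in this toy version.
    """
    pairs = list(zip(tokens, tokens[1:]))
    total = sum(math.log(model[prev][nxt]) for prev, nxt in pairs)
    return total / len(pairs)

# Hypothetical toy corpus; tokens here are whole words.
corpus = "the cat sat on the mat".split()
model = fit_bigram(corpus)

print(model["the"])                 # probabilities over tokens that follow "the"
print(avg_log_prob(model, corpus))  # average log probability per predicted token
```

In this toy corpus "the" is followed once by "cat" and once by "mat", so the model assigns each a probability of 0.5, a small instance of a model having to put a probability on everything it might see next.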

**Dwarkesh Patel** (2:21)
I think maybe I should take a step back and ask.
Right now, we have these models that are pretty good at acting as chatbots. Just taking a step back from how these processes work currently, what kinds of things will the models released by the end of the year be capable of doing? And what do you see the progress looking like if you carry this forward for the next five years?

**John Schulman** (2:40)
Oh yeah, five years. Yeah, I think the models will get quite a bit better.

**Dwarkesh Patel** (2:44)
But in what way in the course of five years?

**John Schulman** (2:46)
Yeah, so I think even in one or two years, we'll find that you can use them for a lot more involved tasks than they can do now. So for example, you could imagine having the models carry out a whole coding project instead of maybe giving you one suggestion on how to write a function.
So you could imagine giving the model sort of high-level instructions on what to code up, and it'll go and write many files and test it, look at the output, iterate on that a bit. So just much more complex tasks.

**Dwarkesh Patel** (3:31)
And fundamentally, is the unlock that it can act coherently for long enough to write multiple files of code? Or what has changed between now and then?

**John Schulman** (3:39)
Yeah, I would say this will come from some combination of just training the models to do harder tasks like this. I'd say most of the training data is more like doing single steps at a time. And I would expect us to do more training of the models to carry out these longer projects.
So I'd say any kind of training, like doing RL to learn how to do these tasks, however you do it, whether you're supervising the final output or supervising each step, I think any kind of training at carrying out these long projects is going to make them a lot better. And since the whole area is pretty new, I'd say there's just a lot of low-hanging fruit.

**Dwarkesh Patel** (4:36)
Interesting.

**John Schulman** (4:38)
In doing this kind of training.
