**Jacob Effron** (0:00)
Noam Shazeer and Jack Rae really need no introduction. The two are at the forefront of Google's Gemini LLM efforts, and have been involved in some of the most important discoveries in AI in the last decade: Noam as one of the co-inventors of the Transformer and mixture of experts, Jack as a key part of many DeepMind breakthroughs. It's a real privilege of the job to get to sit with these two and ask them literally every top-of-mind question in AI today. We talked about how far test time compute will get us, and the spaces where it will and won't work. We talked about how the infrastructure needs will be different for test time compute versus the large pre-training paradigm. We also hit on the impressive pace at which open-source models have caught up with closed-source peers and their reactions to DeepSeek. We talked about their reception and reaction to Ilya saying that this test time compute paradigm won't get us all the way to AGI, as well as Yann LeCun saying this current generation of models can't actually have any novel thoughts. We talked about what it actually looks like to do cutting-edge AI research today and what their day-to-days look like, as well as the future model milestones that actually matter to them. And then we also got Noam's reflections on Character, as well as both of their responses to what AGI means for the role of humanity. I think folks are going to love this. It was really just a pleasure to get to speak with both Noam and Jack. Without further ado, here they are.
Well, Noam and Jack, thanks so much for coming on the podcast.
**Noam Shazeer** (1:13)
Oh, thank you.
**Jacob Effron** (1:14)
Are we sitting in the very office that the Transformer was invented?
**Noam Shazeer** (1:18)
No, this is a new building. We were in 1965 Charleston, I think, probably about half a mile away.
**Jacob Effron** (1:25)
Half a mile, so it's in the air.
**Noam Shazeer** (1:26)
Pretty close, pretty close.
**Jacob Effron** (1:28)
Well, many things to dive into today. I mean, obviously, want to start with some of the latest Gemini 2 models, and obviously all the work you've been doing around test time compute and Gemini 2.0 Flash Thinking. I guess just at the highest level for our listeners, how do you characterize where these models work today, where they don't work as well? And as you were kind of experimenting with them, what surprised you most about those results?
**Jack Rae** (1:48)
One surprising thing is when we started the particular concerted effort to build a lot of research into test time compute, into Gemini, and then think about shipping it, is that we were really focused on starting out with reasoning tasks. So math and code were big areas of focus. And it wasn't really clear, whilst we're sprinting in that domain, we obviously want to broaden it naturally over time, but it wasn't really clear how that would work. Would there be any kind of sense of generalization? Would thinking be useful beyond those reasoning tasks if we're just concentrating on those as researchers? And I think it was pretty fun to see one of the early models that we felt like had been trained to try and match the style of Gemini Flash. So it had been trained with thinking, but then it had also undergone some kind of training to actually be just generally a nice style, a nice model to talk with.
And it was actually very fun seeing thinking interact and improve creative tasks as well. You could ask the model to compose an essay on a particular topic, and actually A, the thought content was very fun to read, and it would go through various different ideas, and then it would go through revisions of the idea or things that it should cut. And that was kind of fun. And then also the output felt really nice. So that was one thing that surprised me.
**Jacob Effron** (3:14)
Any surprise for you, Noam?
**Noam Shazeer** (3:16)
Well, yeah, I mean, in general, I'm all for generality, like let's train something that's great at everything. It is important. And I was skeptical at first of like, okay, this intense focus on things like math. But it is very important to have good benchmarks that are going to encourage you to be able to reason about the difficult tasks. Because a lot of things will drop perplexity, like add more parameters to the model and memorize more.
So it's nice to have the evals that can distinguish better some of the more difficult problems.
**Jacob Effron** (4:05)
What evals are even meaningful to you at this point? I mean, obviously, I feel like people are trying to hill climb the same set of evals that feel increasingly less relevant to day-to-day work. What do you guys do when you're testing these models?