Genie 3: An Infinite World Model with Shlomi Fruchter and Jack Parker-Holder

Google DeepMind: The Podcast

August 22, 2025

In this episode, Professor Hannah Fry speaks with Jack Parker-Holder and Shlomi Fruchter about Genie 3, a general-purpose world model that can generate an unprecedented diversity of interactive environments.
Speakers: Jack Parker-Holder, Shlomi Fruchter, Hannah Fry
**Jack Parker-Holder** (0:07)
I think this is really a step change as a foundation model in terms of breadth, generality, and capability. Now that we're there, there's a whole host of different things it could be used for and have an impact on.

**Shlomi Fruchter** (0:18)
The number of details that we have in those visual worlds is just staggering. And the quality of the memory, considering how much information it has to actually remember.

**Hannah Fry** (0:31)
Welcome back to Google DeepMind: The Podcast. I'm Professor Hannah Fry. Now, the latest video generation models have impressed the entire world. They've created this near-perfect imitation of reality. But the limitation of video is that you are just a viewer rather than a participant. And that's not how humans experience the real world, right? We, instead, can navigate environments we've never been to and still have an expectation of what we're likely to encounter. We can explore in every feasible direction, kind of without limits, and interact with things that we chance upon along the way. And that is the next great frontier for this technology: to move beyond generating a perfect recording of a scene and towards building a dynamic simulation of a world we can finally step into.
Enter Genie 3, a prototype world model that can generate an unprecedented variety of interactive environments. It's already been described as a stepping stone towards AGI. And with me today are two of its creators, Shlomi Fruchter, Research Director, and Jack Parker-Holder, Research Scientist. Welcome to the podcast. Thanks. Okay, let's get straight into it. What is Genie 3?

**Jack Parker-Holder** (1:45)
It's a real time interactive world model that allows you to create diverse, visually interesting worlds from a text prompt. So there is no underlying game engine, no structure, no code. It's just a neural network that's predicting every single pixel in reaction to inputs from the user and also the past. And so the flexibility and the diversity of things you can create in basically no time is quite unprecedented.
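The loop Jack describes, a network predicting each new frame from the user's inputs and everything generated so far, can be sketched as follows. This is a toy illustration of the autoregressive control flow only, not Genie 3's actual model; the `predict_next_frame` stand-in and all names here are hypothetical:

```python
# Toy sketch of an autoregressive interactive world model (assumption:
# this is NOT Genie 3's architecture). A real system predicts every pixel
# with a large neural network; here predict_next_frame just moves a single
# marker by the user's action, to show how each frame is generated from
# the full history of frames and actions, one step at a time, with no
# underlying game engine.

H, W = 8, 8  # tiny "frame" for illustration: (row, col) of one marker

def predict_next_frame(frames, actions):
    """Stand-in for the model: next state from past frames + latest action."""
    r, c = frames[-1]
    dr, dc = actions[-1]
    return ((r + dr) % H, (c + dc) % W)

def interactive_rollout(first_frame, user_actions):
    frames, actions = [first_frame], []
    for a in user_actions:          # one user input per step (e.g. keyboard)
        actions.append(a)
        frames.append(predict_next_frame(frames, actions))
    return frames

rollout = interactive_rollout((0, 0), [(0, 1), (0, 1), (1, 0)])
print(rollout)  # [(0, 0), (0, 1), (0, 2), (1, 2)]
```

The key structural point is that nothing is precomputed: each step waits for the user's action before the next frame exists, which is what distinguishes this from generating a fixed video clip up front.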

**Hannah Fry** (2:10)
You haven't had a whole army of artists sitting in rooms constructing a world in order to be able to interact with it.

**Shlomi Fruchter** (2:16)
Yes, I think the key point is that you can create any world that you can imagine, right? And that's not something that you can do with a game engine, right?

**Hannah Fry** (2:23)
Well, let's OK, let's have a look at it, because you've got some demos for me, right?

**Shlomi Fruchter** (2:26)
Yeah, so we have a few. The first one, I think you might like it. So it's basically playing a cat.

**Hannah Fry** (2:32)
OK, so you've got me.

**Shlomi Fruchter** (2:34)
Ginger cat, not so.

**Hannah Fry** (2:35)
Excellent. And what you have here is a beautiful ginger cat wandering around an apartment. It's very beautifully furnished. It's got these nice Persian rugs and wooden floor. It's got a sofa that it's currently trying to jump on. But it's not doing it itself, right? You're prompting its movement.

**Shlomi Fruchter** (2:53)
Yes, exactly. I'm just using the keyboard to control the cat. So I can look around, move the cat, basically tell it where to go, so it can jump over the sofa.

**Hannah Fry** (3:03)
OK.

**Shlomi Fruchter** (3:03)
And I really like walking into the sunlight here.

**Hannah Fry** (3:07)
So this is reacting to the inputs that you're giving it.

**Shlomi Fruchter** (3:11)
Yes, exactly.

**Hannah Fry** (3:12)
Is the light going to change as you go into the... Oh, look at that. Look at that. Yes.

**Shlomi Fruchter** (3:18)
So the model is basically trying to predict what's going to happen next based on the sequence of inputs that it gets, and it does it in real time.

**Hannah Fry** (3:27)
I mean, the detail that you're seeing in this 3D environment is also quite reminiscent of some of the stuff that we're seeing with Veo. Like, if I didn't know that you were interacting with it, how is it different from that?

**Shlomi Fruchter** (3:38)
So when you create a video using Veo, you provide a prompt, and then the model figures out how to create the entire video of, say, eight seconds from start to finish. And once it's ready, you cannot change how the camera moves around, and you definitely cannot explore beyond those eight seconds, right?

**Hannah Fry** (3:59)
Can you use an image to prompt this, or is it only text?

**Shlomi Fruchter** (4:01)
Yes, so we found out that we can actually use an image or a video to prompt the model. In this particular case, we found that we can actually use paintings. For example, this is Nighthawks by Edward Hopper.
