Moonlake: Causal World Models should be Multimodal, Interactive, and Efficient — with Chris Manning and Fan-yun Sun

Latent Space: The AI Engineer Podcast

April 2, 2026

We’ve been on a bit of a mini World Models series over the last quarter: from introducing the topic with Yi Tay, to exploring Marble with World Labs’ Fei-Fei Li and Justin Johnson, to previewing World Models learned from massive gaming datasets with General Intuition’s Pim de Witte (who has now...
Speakers: Chris Manning, swyx, Fan-yun Sun
**Chris Manning** (0:00)
I think this whole space is extremely difficult as things are emerging now. And I mean, it's not only for world models, I think it's for everything, including text-based models, right? Because, you know, in the early days, it seemed very easy to have good benchmarks, because we could do things like question-answering benchmarks. But, you know, these days, so much of what people are wanting to do is nothing like that, right? If you're wanting to get some recommendations about which backpack would be best for you for your trip in Europe next month, it's not so easy to come up with a benchmark. And it's the same problem with these world models.

**swyx** (0:44)
Before we get into today's episode, I just have a small message for listeners. Thank you. We wouldn't be able to bring you the AI engineering, science, and entertainment content that you so clearly want if you didn't choose to click in and tune into our content. We've been approached by sponsors on an almost daily basis, but fortunately enough of you actually subscribe to us to keep all this sustainable without ads, and we want to keep it that way. But I just have one favor to ask all of you. The single most powerful, completely free thing you can do is to click that subscribe button. It's the only thing I'll ever ask of you, and it means absolutely everything to me and my team that works so hard to bring Latent Space to you each and every week. If you do it, I promise you, we'll never stop working to make the show even better. Now let's get into it.
Okay, we're back in the studio with Moonlake's two leads. I guess there's other founders as well, but Sun and Chris Manning, welcome to the studio.

**Fan-yun Sun** (1:42)
Thanks a lot, thanks for having us.

**swyx** (1:45)
You guys have burst onto the scene with a really refreshing new take on world models. I would just want to sort of, I guess, ask how the two of you came together. Chris, you're a legend in NLP and just AI in general. You're his grad student, I guess.

**Fan-yun Sun** (2:01)
Actually, my co-founder. Oh, yeah. I should give a lot of credit to my co-founder, Sharon. Yeah. She was actually working with Professor Fei-Fei Li and Justin Johnson, and then she ended up working with Ron and Chris Manning here. So I got connected to Chris initially, actually, through my co-founder.

**swyx** (2:18)
What is Moonlake? Actually, I'm also very curious about the name, but why going into world models?

**Fan-yun Sun** (2:25)
So I was working a lot with Nvidia research during my PhD years on essentially generating interactive worlds to train reinforcement learning agents or embodied AI agents. And there were two observations, one in academia and one in industry. In industry, folks like Nvidia are actually paying a lot of dollars to purchase these types of interactive worlds, whether it's for the sake of evaluating or training the robots, or policies or models. And in academia, the same thing is happening. More specifically, when I was working with Nvidia on the Synthetic Data Foundation Model Training Project, we were generating a lot of this synthetic data and showing that, hey, this synthetic data is actually as useful as real-world data when it comes to multimodal pre-training. But then, like I said, there are a lot of dollars being paid out to external vendors or other folks to manually curate these types of data.
It was very clear to us that, okay, on our way to, let's call it, embodied general intelligence, models need to learn the consequences of their actions, which means that they need interactive data. The demand for those types of data is growing exponentially, but everybody's thinking about it from a pure, say, video generation perspective or something else. We feel like the true opportunity is actually building reasoning models that can do these things the way humans do them today. So that's a little bit on the genesis of Moonlake. I think the reason I got into world models was partly a philosophical take on the world, where I believe in simulation theory and things like that. But on the other hand, it's really just, oh, there's an opportunity there, and I feel like nobody's pursuing it the way I think it should be done.

**Chris Manning** (4:04)
I can say a little bit about that. Yeah, so the overall goal is the pursuit of artificial intelligence. Most of my career has been doing that in the language space, and that's been just extremely productive, as we all know from the story of the last few years. I don't have to tell you how much we've achieved with large language models. But although they've been extremely effective for ramping up language and general intelligence, language is clearly not the whole world. There's this multimodal world of vision, sound, taste, that you'd like to be dealing with, more than just language. And then the question is how to do it, despite a huge investment in the computer vision space, right? As a research field, computer vision has for decades been far, far larger than the language space, actually.
