Why Video Agent models are next — Ethan He, xAI Grok Imagine

Latent Space: The AI Engineer Podcast

June 1, 2026

We’re announcing AIEWF speakers this week! Take the AI Engineering Survey!
Speakers: Swyx, Ethan He, Vibhu
**Swyx** (0:05)
Okay, we're here in the studio with Ethan He, most recently of xAI. Welcome.

**Ethan He** (0:09)
Yeah, thank you. Glad to be here.

**Swyx** (0:11)
We're also here with Vibhu. You first came to us, or joined the Latent Space world, because you were working on Cosmos at Nvidia, and you did a great paper. We loved it. You presented it as well. So thank you for doing that.

**Vibhu** (0:22)
Yeah.

**Ethan He** (0:23)
I also presented the MoEs.

**Vibhu** (0:26)
Yes.

**Ethan He** (0:27)
Twice at Latent Space.

**Swyx** (0:29)
Yeah. How did you actually hear about us? Did we reach out to you? Is that how it worked?

**Ethan He** (0:33)
No, actually, it was the community. I realized there is this online community where people talk about AI and also learn from each other through papers every week at the paper club. It's very nice. Yeah.

**Swyx** (0:49)
I think it's been three years non-stop; we haven't stopped even on Christmas and New Year's. Many weeks I want to stop, but it's good.

**Vibhu** (0:58)
I think you had posted that you worked on a paper and I was like, oh, very cool. We have a paper club. Present it. I might have reached out to you after.

**Swyx** (1:05)
Yeah, because it's an amateur club, right?

**Ethan He** (1:07)
Yeah.

**Swyx** (1:08)
So it's very unusual. But sometimes we have authors come by and actually explain the paper. Today, we just did the Poolside paper, which is apparently very good.

**Vibhu** (1:18)
Came out yesterday.
Pretty interesting, right? Fully open. They talk about everything, so it's a good one. We'd recommend people read it.

**Swyx** (1:25)
Bring us up to speed on your transition to xAI. I actually don't even know when you joined. Just tell us the fuller story of the transition.

**Ethan He** (1:34)
Before xAI, I was working on the Cosmos world model at Nvidia. Cosmos is a giant video foundation model that aims to simulate the world, and it serves as a foundation for roboticists to build on top of. There, once I built Cosmos One, I realized that this thing also has a scaling law, similar to language models. We need to scale up the video models further.
That's why I realized I needed to move somewhere with much more compute resources. That's how I...

**Swyx** (2:12)
Than Nvidia.

**Vibhu** (2:14)
The GPU-rich kingdom sounds.

**Ethan He** (2:19)
Yeah.

**Vibhu** (2:19)
And timeline-wise, when was Cosmos? It was pretty early, right? It was an open world model, open paper.

**Ethan He** (2:25)
It was like the end of 2024.

**Vibhu** (2:28)
End of 2024.

**Ethan He** (2:29)
Yeah. Then in mid-2025, I moved to xAI. I joined at the time when xAI was about to build video models and multimodal models. There was no infra, no data, and no model. With just a few engineers, we built it in three months and released the first model, Grok Imagine 0.9.
And since then, I kept working on video models, moving more from pre-training to post-training of the video models. For example, reference-to-video, kind of like the cameo feature, and video extensions. And before I left, I worked on a world model, leading a small team focused on real-time, long-horizon video generation.

**Swyx** (3:24)
Can you give a rough roadmap of, okay, you're on a brand new team. Grok previously was only text, so they partnered with BFL for their image gen stuff. What are the building blocks? You have compute, data you can procure somewhere.
What is the sequence of things that people should think about when they're setting up a new team?

**Vibhu** (3:43)
I mean, actually, even deeper, not just data you can procure. You guys had to go through getting the data too, right? So you shipped it pretty fast.

**Swyx** (3:51)
Yeah, three months is actually surprisingly fast.

**Ethan He** (3:55)
Yeah, one thing I'd say is it's thanks to my experience at Nvidia. Because the first time, when we were building Cosmos together, we built it for about a year. So this is the second time I've done it.
Roughly, I have an idea of what to do. I'd say the most important thing is the talent. Everyone was very strong and worked very closely with each other towards a common goal. So that sped things up a lot. You reduce the communication overhead among people, and everyone can work towards the same goal.
Every day, there weren't that many meetings on the calendar, maybe a sync a day. And after that, it's just all building. It was pretty fun at that time. And another thing is that xAI has very strong foundations of data, data infrastructure, and model infrastructure. And the support there can help model development a lot. When I look at training models, I don't...
