**Vishal Misra** (0:00)
Anthropic makes great products. Claude Code is fantastic, co-workers fantastic, but they are just crystals of silicon doing matrix multiplication. They don't have consciousness, they don't have an inner monologue. You take an LLM and train it on pre-1916 or pre-1911 physics, and see if it can come up with the theory of relativity. If it does, then we have AGI.
**Martin Casado** (0:21)
Just today, by the way, Dario allegedly said that you can't rule out that they're conscious.
**Vishal Misra** (0:27)
Maybe you can rule out their thoughts.
**Martin Casado** (0:29)
Come on.
**Vishal Misra** (0:30)
To get to what is called AGI, I think there are two things that need to happen.
**Unknown** (0:35)
Five years ago, Vishal Misra got GPT-3 to translate natural language into a domain-specific language it had never seen before. It worked. He had no idea why. So he set out to build a mathematical model of how LLMs actually function. The result? A series of papers showing that transformers update their predictions in a precise, mathematically predictable way. In controlled experiments, the models match the theoretically correct answer almost perfectly. But pattern matching is not intelligence. LLMs learn correlation. They don't build models of cause and effect. To get to AGI, Misra argues, we need the ability to keep learning after training and a move from correlation to causation. Martin Casado speaks with Vishal Misra, Professor and Vice Dean of Computing and AI at Columbia University.
**Martin Casado** (1:28)
Vishal, it's great to have you in again.
**Vishal Misra** (1:30)
Great to be back.
**Martin Casado** (1:31)
This is one of my favorite topics, which is how do LLMs actually work? And I think that, in my opinion, you've done kind of the best work on this, modeling it out.
**Vishal Misra** (1:39)
Thank you.
**Martin Casado** (1:39)
For those that did not see the original one, maybe it's probably worth doing just a quick background on kind of what led you to this point, and then we'll just go into the current work that you've been doing.
**Vishal Misra** (1:50)
Five years ago, when GPT-3 was first released, I got early access to it, and I started playing with it, and I was trying to solve a problem related to querying a cricket database. And I got GPT-3 to do in-context learning, few-shot learning, and it was kind of the first, at least to me, it was the first known implementation of RAG, Retrieval Augmented Generation, which I used to solve this problem of querying, getting GPT-3 to translate natural language into something that could be used to query a database that GPT-3 had no idea about. I had no access to GPT-3's internals, but I was still able to use it to solve that problem. So it worked beautifully. We deployed this in production at ESPN in September 2021.
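The approach Misra describes, showing the model a handful of worked example pairs and then the new question, comes down to few-shot prompt construction. A minimal sketch follows; the cricket questions and the query syntax are hypothetical stand-ins, since the transcript does not describe the actual DSL used at ESPN.

```python
# Hypothetical few-shot example pairs: natural-language question -> query.
# The query syntax here is made up for illustration.
EXAMPLES = [
    ("how many runs did the top scorer make in 2019",
     "stats(metric=runs, year=2019, agg=max)"),
    ("which bowler took the most wickets in 2019",
     "stats(metric=wickets, year=2019, agg=max, role=bowler)"),
]

def build_prompt(question: str) -> str:
    """Concatenate the worked examples, then the new question,
    leaving the answer slot open for the model to complete."""
    parts = [f"Q: {q}\nA: {a}" for q, a in EXAMPLES]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

# The resulting string would be sent to the model, which completes
# the final "A:" line with a query in the demonstrated syntax.
prompt = build_prompt("who scored the most centuries in 2020")
```

No model internals are needed: the examples in the prompt alone steer the completion toward the target language, which is what made this work against a black-box GPT-3 API.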
**Martin Casado** (2:40)
Wow. You did the first implementation of RAG in 2021?
**Vishal Misra** (2:44)
No, no, no, in 2020
**Martin Casado** (2:45)
2020
**Vishal Misra** (2:46)
2020, I got it working, and by the time you talked to all the lawyers at ESPN and productionized it, it took a while. But October 2020, I had this architecture working. But after I got it to work, I was amazed that it worked. I wanted to understand how it worked. And I looked at the "Attention Is All You Need" paper and all the other deep learning architecture papers, and I couldn't understand why it worked. So then I started getting deep into building a mathematical model.
**Martin Casado** (3:20)
And now you've published a series of papers. The first one that I read is the one where you had your matrix abstraction. So maybe we'll talk about that and then we'll talk about the more recent work. So perhaps we'll just start with the first one, which is you're trying to come up with a mathematical model of how LLMs work. And you came up with one, which was very helpful to me. And at the time, you were actually trying to figure out how in-context learning was working. And you came up with an abstraction for LLMs, which is basically a very large matrix, and you used that to describe it. So maybe you can kind of walk through that work very quickly.
**Vishal Misra** (3:50)
Sure. Yeah. So what you do is you imagine this huge, gigantic matrix where every row of the matrix corresponds to a prompt. And the way these LLMs work is, given a prompt, they construct a distribution of probabilities over the next token. The next token is roughly the next word. So every LLM has a vocabulary; GPT and its variants have a vocabulary of about 50,000 tokens. So given a prompt, it will come up with a distribution of what the next token should be, and then all these models sample from that distribution.
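The next-token machinery Misra describes, a score for every vocabulary entry normalized into a probability distribution and then sampled, can be sketched as follows. A toy five-word vocabulary stands in for GPT's roughly 50,000 tokens, and the logit values are made up for illustration.

```python
import math
import random

def softmax(logits):
    # Subtract the max for numerical stability, exponentiate, normalize.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary standing in for a real ~50,000-token vocabulary.
vocab = ["the", "cat", "sat", "on", "mat"]
# Hypothetical scores the model assigns to each candidate next token.
logits = [2.0, 0.5, 0.1, -1.0, -1.0]

probs = softmax(logits)
# Sample from the distribution rather than always taking the argmax,
# which is why the same prompt can yield different continuations.
next_token = random.choices(vocab, weights=probs, k=1)[0]
```

Sampling from the distribution, instead of greedily picking the highest-probability token, is what makes generation stochastic in the way Misra describes.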