#181 - Andrew Ng - Why Data Engineering is Critical to Data-Centric AI Transcript — Monday Morning Data Chat

**Joe** (0:01)
Hi, Andrew.

**Andrew Ng** (0:03)
Hey, good to see you, Joe.

**Joe** (0:04)
Good to see you too. How's things?

**Andrew Ng** (0:09)
Exciting times.

**Joe** (0:10)
Awesome. Well, yeah, thanks for joining the show today. It's good to see you. Yeah, so we're here to talk about a topic, I think, near and dear to you, which is data-centric AI, but also how data engineering is critical to data-centric AI. I guess to back up, walk us through the beginnings of your thoughts around data-centric AI, because I think before that, there was more of a model-centric world we were in.

**Andrew Ng** (0:40)
Yeah. I think for many decades, AI progress was driven, it feels like, primarily by people, say, downloading data sets off the Internet, and then spending a lot of time trying to invent new math, invent new models, so they could do better on that data. That's fine, nothing's wrong with that; because of that recipe, AI made a lot of progress. But I think many practitioners of AI, including you and me and many others, have known that if we're trying to build something practical and ship it, sometimes engineering the data is much more fruitful than trying to engineer the math or the model.
And so I wound up creating and trying to popularize this term data-centric AI to coalesce a lot of the already ongoing work on engineering the data rather than the model. And it's been quite exciting actually to see how much momentum the data-centric AI community has.

**Matt** (1:35)
I think we have to ask because, I mean, this is the main thing in the air right now, generative AI. Does data-centric AI have something to say about how we train large language models, for example? What are your thoughts on that?

**Andrew Ng** (1:47)
I think very much so. In fact, everything from training the foundation model to post-training, maybe fine-tuning, to even some of the deployment usage seems to keep on running into data issues. I know that in the popular press, people talk a lot about scaling laws and building bigger transformer networks or whatever to train on even more data, and that is a key part of it. But when I talk to my friends who are involved in the actual day-to-day of how to get these models to work, a large fraction of their efforts, I'm tempted to say more than 50 percent, but a very large fraction of their efforts is actually thinking through how to get the right data to feed into these foundation models. And then of course, even after someone else has trained a large language model, a large foundation model, a lot of data work goes into fine-tuning it, and then also into practical deployment and usage. Maybe doing few-shot learning, there's a lot of data-oriented thinking there as well. So not everything is data-centric AI, but even in GenAI and foundation models, a much larger fraction of it is than I think people widely appreciate. I could actually talk about this for a long time, but you guys are-

**Joe** (2:57)
Oh no, please do.

**Matt** (2:58)
Please do, keep going. Yeah, we're here to listen.

**Andrew Ng** (3:01)
Oh, so maybe one fun thing.
In the Llama 3.1 paper that Meta released, I think one of the coolest things, and there are a lot of cool things in that paper, but one of the coolest things was that Meta used an earlier version of the model, used Llama 3, and then an agentic workflow, to basically use Llama 3 to generate coding puzzles that were then used as training data to train Llama 3.1. And I think this has always been one of the puzzles, of how to get synthetic data to work to train foundation models. Using an agentic workflow, we use the earlier model, but let it think for a long time, iterate over something over and over, to come up with a good result, and then you train the next generation of model to come up with an equally good answer very quickly, rather than needing to think about it over and over. I thought that was one really nice recipe for creating data to train foundation models. And really, when I talk to my friends training some of the very large foundation models, a lot of the headspace is, boy, can I sign the right licensing deals with the publishers to get the data? And of all this data, which do I invest dollars in to buy? Or for the preference tuning, be it RLHF or DPO, to align it with human values, what's the labeling schema? How do I get that data? And then it turns out that while there are certainly innovations on training, transformer networks and all that, it feels like there's at least as much, maybe even more, it's hard to compare. There's certainly a lot of thinking day to day on how to get the data to train these models.
