Evals, Feedback Loops, and the Engineering That Makes AI Work

AI + a16z

February 17, 2026

Martin Casado speaks with Ankur Goyal, founder and CEO of Braintrust, about where engineering actually matters in AI and where it doesn't.
Speakers: Ankur Goyal, Martin Casado
**Ankur Goyal** (0:00)
AI is continuous and systems are discrete. Humans fundamentally think a little bit more in terms of systems, predictability, reliability, and consistency than they do in terms of non-determinism.

**Martin Casado** (0:13)
In the past, large companies that received lots of money were basically naturally rate-limited by engineering speed. These frontier labs don't have that problem. They can literally just raise money and build a model based on the money. They just throw more compute, more data at it. So you kind of have to ask the question: what is going to end up limiting them?

**Ankur Goyal** (0:34)
If you're building an agent and you're using pre-trained models or whatever, it's a fool's errand to think that the agent that you're building, the way that you're providing context to it, all that stuff, is not engineering, that it shouldn't be engineered, and that it should be bitter-lesson-pilled, meaning you should build that system in a way that you can throw it away tomorrow. Right now, we're kind of building, you know, like God. And so, it's possible and probably economically viable to keep throwing capital at the problem to make God 1% smarter. But when you can't make God 1% smarter, there is like an insane opportunity to engineer God to be more efficient.

**SPEAKER_3** (1:15)
The AI industry keeps reaching for brute force. Frontier labs throw compute at training runs instead of optimizing. Developers give agents a Unix environment instead of structured tools. Teams chase the latest model instead of engineering the one they have.

But the pattern Ankur Goyal keeps seeing is the opposite. The companies shipping AI products that actually work aren't the ones using the smartest models. They're the ones with the best engineering around the models: the evals, the feedback loops, the testing harnesses. This conversation covers where that discipline matters and where it doesn't, the cycle between open-source and closed-source models, why Chinese models show high token volume but low dollar spend, and a benchmark comparing Bash vs. SQL for agents with results Goyal calls comical. Martin Casado speaks with Ankur Goyal, founder and CEO of Braintrust.

**Martin Casado** (2:04)
I think people watching this will know you, but let's just very quickly go to your background, mostly just to set the stage, because I want to talk a lot about whether AI is actually a systems problem or not.

**Ankur Goyal** (2:12)
Great.

**Martin Casado** (2:13)
So do you mind just kind of giving the rough sketch?

**Ankur Goyal** (2:14)
Yes. Nice to meet you, Martin.
I'm Ankur. Prior to Braintrust, back in ancient history, I used to work on relational databases. Way before LLMs, I saw deep learning come out and become a thing, and started to get excited about how the way that we query and work with data, which was primarily SQL, is not as powerful as what we could do. So I started a company almost 10 years ago now called Impira, where we did-

**Martin Casado** (2:43)
It's been that long?

**Ankur Goyal** (2:45)
Yeah, it's been... time flies, yeah.

**Martin Casado** (2:47)
Wow.

**Ankur Goyal** (2:47)
Where we did AI-powered document extraction. And that turned out to be one use case that worked pretty well, although we were using computer vision models and stuff. Way back then. Yeah, I mean, at that point, they were way more powerful than language models. And so there we had a bunch of customers who were doing different use cases, and we might make our invoice stuff better, and then make the bank stuff worse, or make the bank stuff better and the invoice stuff worse. And we had to get really good at avoiding that. And that's when we built internal tools to do evals, and then get good data to do evals and sort of run this feedback loop pipeline. I didn't think too much of it at the time, but then we got acquired by Figma, and I led the AI team there, and we had exactly the same problem, but this time, like building on top of LLMs.

**Martin Casado** (3:33)
By the way, you know what's interesting? Everybody says the word eval. You say the word eval, I say the word eval. I found out very few people actually know what an eval is. Specifically, what an eval is.

**Ankur Goyal** (3:42)
Yeah, I think evals are like the scientific method applied to software engineering with non-deterministic systems like AI systems. So you come up with a hypothesis. Let's say I'm going to try out a new model, or I'm going to tweak my prompt, or I'm going to throw more context into this agent by fetching from whatever, this API, and I suspect that this is going to improve the quality of my agent, or I suspect that it will make it faster, whatever. So you come up with a hypothesis, and then you essentially simulate running the system on a set of inputs, and you observe the outputs. You might have ground truth, you might not have ground truth, and you try to measure what the difference is. And then you quantitatively look at the difference. And I think what's really important is you qualitatively understand, like hey, this thing, okay, it says it got better, but let me look at it with my eyes and see if it actually was better or not. And you sort of, by reconciling that, not only do you double check things with your intuition, but you also give yourself the opportunity to improve the next eval.
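The loop Goyal describes (hypothesis, simulate on a set of inputs, score against ground truth, then reconcile quantitatively and qualitatively) can be sketched in a few lines. This is a minimal illustrative harness, not Braintrust's API; the task, scorer, and dataset below are all hypothetical stand-ins.

```python
# Minimal eval-loop sketch. An "eval" here: run the system under test
# over a fixed set of inputs, score each output against ground truth,
# and aggregate, so a prompt or model change can be compared run-to-run.

def task(question: str) -> str:
    """Stand-in for the system under test (e.g. an LLM or agent call)."""
    canned = {"2+2": "4", "capital of France": "Paris"}
    return canned.get(question, "unknown")

def exact_match(output: str, expected: str) -> float:
    """Scorer: 1.0 if the output matches ground truth, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_eval(dataset: list[dict]) -> tuple[float, list[dict]]:
    """Run the task on every case, score it, and return the average."""
    results = []
    for case in dataset:
        output = task(case["input"])
        results.append({
            "input": case["input"],
            "output": output,
            "score": exact_match(output, case["expected"]),
        })
    avg = sum(r["score"] for r in results) / len(results)
    return avg, results

dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "largest planet", "expected": "Jupiter"},  # this one fails
]

avg, results = run_eval(dataset)
print(f"avg score: {avg:.2f}")  # quantitative: did the change help?
for r in results:               # qualitative: eyeball each transcript
    print(r)
```

The quantitative average tells you whether the hypothesis moved the needle; printing the per-case results is the "look at it with my eyes" step, which is also where you find the cases that should go into the next eval.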
