**Nathaniel Whittemore** (0:00)
Today on The AI Daily Brief, we're doing a 101 on one of the most important concepts in AI right now, Harness Engineering. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.
All right, friends, quick announcements before we dive in. First of all, thank you to today's sponsors, KPMG, Blitzy, Drata and Mercury. To get an ad-free version of the show, go to patreon.com/aidailybrief, or you can subscribe on Apple Podcasts. Ad-free is just $3 a month. If you are interested in sponsoring the show or really finding out anything else about the show, head on over to aidailybrief.ai or shoot us a note at sponsors@aidailybrief.ai. One final note before we dive in, today is hopefully the last day for a while that I will be on the road traveling. So this episode was recorded at the end of last week. If for some reason Sam Altman decided to release Spud over the weekend and you're wondering why the heck this is the episode you're getting, that is why, but I will be back, I promise, very soon. In the meantime, this gave me a chance to dive a little deeper on something that I think is extremely important and I've wanted to explore for a while, which is harness engineering.
Today we are digging into a topic that first, you might have heard this term floating around a little bit, but second, even if you haven't, if you are among the subset of the audience that has been dabbling with Claude Code or Codex or even using OpenClaw, you have been living in and doing this thing whether you realize it or not. I'm talking about harness engineering. You might notice that there is a lineage of engineering disciplines that we focus on that has changed over the years in AI. In 2023 and 2024, we talked a lot about prompt engineering, the art and the science of finding the right ways to prompt the model to get the results that you wanted. There was so much in prompt engineering that people spent so much time on. Think about the things that everyone used to recommend, like getting the model to adopt a persona or later on, the whole idea of JSON engineering where people hyper-structured their prompts in the way that an engineer might. Now, last year in 2025, we started to talk a lot more about context engineering. The idea of context engineering was that it turned out that what mattered for AI performance was not just the way you spoke to the model, but what set of information or context that model had access to. Take the example of asking ChatGPT to help you create a marketing campaign. One part of getting good results, sure, might be what you prompted for and how you asked it, but obviously it's kind of intuitive that if ChatGPT had access to information about the performance of all your past marketing campaigns, it might be able to be more informed in how it helped you. So context engineering was all about the way that we brought together different context and gave AI access to it. Now, interestingly, context engineering actually kind of has had divergent meanings for different people.
For engineers and developers, context engineering has often been about designing the systems that surround AI and agents in order to better interact with and use context, dealing with problems like persistence and memory and state. And in a way, this is kind of a part of what we'll talk about with harness engineering. For laypeople, for non-technical users, context engineering has been much more about what's the best way to give AI access to the information it needs to help me do its job. Now it's important to note that while prompt engineering might have decreased a little bit in its importance scale, context engineering is still very much alive and important. In fact, I did that entire episode about a week ago about how to build a personal context portfolio so that you could transport your personal context from LLM to LLM or agent to agent without having to repeat yourself every time. But the term du jour right now is harness engineering, which is effectively about everything you put around a model, the systems, the tooling, the access that help it do what it's meant to do. And when one starts to look around, you kind of start to see the harness engineering conversation popping up everywhere. At the beginning of April, Cursor launched its newest version, Cursor 3. In their announcement post, they wrote, Software development is changing and so is Cursor. In the last year, we moved from manually editing files to working with agents that write most of our code. How we create software will continue to evolve as we enter the third era of software development, where fleets of agents work autonomously to ship improvements. We're building towards this future, but there is a lot of work left to make it happen. Engineers are still micromanaging individual agents, trying to keep track of different conversations, and jumping between multiple terminals, tools and windows. We're introducing Cursor 3, a unified workspace for building software with agents.
The new Cursor interface brings clarity to the work agents produce, pulling you up to a higher level of abstraction with the ability to dig deeper when you want. It's faster, cleaner and more powerful with the multi-repo layout, seamless handoff between local and cloud agents, and the option to switch back to the Cursor IDE at any time. So all of the features that they then go on to announce, having all of your agents in one place, the ability to run many agents in parallel, new UX for handoff between local and cloud, all of this is the instantiation of harness engineering into a product. Even more recently, we got Claude Managed Agents. In their announcement post, they said explicitly it pairs an agent harness tuned for performance with production infrastructure. And in the accompanying blog post, they basically say this is kind of all about harnesses. The post was called Scaling Managed Agents, Decoupling the Brain from the Hands. Now of course, in this metaphor, the brain is the model and the hands are the harness. Harnesses, they write, encode assumptions that go stale as models improve. Managed Agents, then, is built around interfaces that stay stable as harnesses change. Now we'll maybe come back later to some of the specifics of that new product, but again the point here is that harness engineering is kind of everywhere. At the beginning of March, Latent Space dropped a post called Is Harness Engineering Real? And to provide another analogy, their team references back to when they worked in finance. It doesn't say for sure, but I assume this is Shawn slash swyx writing because this was part of his experience set, but whoever it was wrote, A common debate in my finance days was about the value of the human versus the value of the seat. If a trader made 3 million in profits, how much of it was because of her skills and how much was because of the position, institution and brand she is in, and any generally competent human could have made the same results?
They continue, The same debate is currently raging in Harness Engineering, the system subset of Agent Engineering and the main job of Agent Labs. Agent Labs, by the way, are how the Latent Space team refers to everyone like Cursor, Cognition, etc. The central tension, they continue, is between Big Model and Big Harness. An AI framework founder you all know once confided in me at an OpenAI event, I'm not sure these guys even want me to exist. To define Harness, they write, In every engineering discipline, a harness is the same thing. The layer that connects, protects and orchestrates components without doing the work itself. They continue, Talking with the Big Model guys, you really see it. Every podcast with Boris Cherny and Cat Wu, the creators of Claude Code, emphasizes how minimal the harness of Claude Code is, meaning their job is mostly letting the model express its full power in the way that only the model maker knows best. In one interview, Boris said, I would like to say there's nothing that secret in the sauce. Generally, our approach is, all the secret sauce, it's all in the model. And this is the thinnest possible wrapper over the model. We literally could not build anything more minimal. Cat added, it is very much the simplest thing I think by design. Noam Brown from OpenAI seems to agree. They quote him as saying, Before the reasoning models emerged, there was like all of this work that went into engineering agentic systems that made a lot of calls to GPT-4o or these non-reasoning models to get reasoning behavior. And then it turns out, we just created reasoning models and you don't need this complex behavior. In fact, in many ways, it makes it worse. You just give the reasoning model the same question without any sort of scaffolding and it just does it.
And so people are building scaffolding on top of the reasoning models right now, but I think in many ways, these scaffolds will just be replaced by the reasoning models and models in general becoming more capable. On the other side, says Latent Space, are the big harness guys. Jerry Liu from LlamaIndex wrote a post on this on X that he titled The Model Harness Is Everything. He added a picture that sums up his point as saying, Agent reasoning is exponentially improving, but models are blank slates. The biggest barrier to AI value is the user's own ability to context and workflow engineer the models. The more complex the business process, the more complex the prompt that users need to define. Now where Latent Space comes out is that while they might have some bias towards the big model thesis, actually referencing the bitter lesson that we talked about in episodes a couple of weeks ago, they also acknowledge that harness engineering has real value. So let's dive a little deeper into what harness engineering actually is. And for part of our guide, we're going to use a post from humanlayer.dev from the middle of March called Skill Issue, Harness Engineering for Coding Agents. Author Kyle writes, We spent the last year watching coding agents fail in every conceivable way, ignoring instructions, executing dangerous commands unprompted and going in circles on the simplest of tasks. Every time the instinct was the same. We just need better models, GPT-6 will fix it, we just need better instruction following. It'll work when the niche library I'm using is in the training data. But over the course of dozens of projects and hundreds of agent sessions, we kept arriving at the same conclusion. It's not a model problem, it's a configuration problem. Yes, models will get smarter and yes, some existing failure modes will disappear. And then because they are smarter, we will give them new problems which are bigger and harder, and they will continue to fail in unexpected ways.
Unexpected failure modes are a fundamental problem for non-deterministic systems. So instead of praying for GPT-6 or Codex Ultra High Extended to save us all, what if we focused instead on answering the question, how do we get the most out of today's models? And the next point that Kyle makes is the one that I was saying before, which is that most of us who have been dabbling in these systems, be it OpenClaw or Claude Code or Codex, have been doing harness engineering whether we realize it or not. He continues, There are lots of ways to get better performance out of your coding agent. If you use coding agents for moderately hard tasks, you've probably configured your coding agent a bit. Have you used skills? MCP servers? Subagents? Memory? AGENTS.md files? A coding agent equals AI models plus a harness. These are all technically separate concepts, but they are all part of the coding agent's configuration surface. Basically, what does the model use to interact with its environment? Harness engineering, they write, describes the practice of leveraging these configuration points to customize and improve your coding agent's output quality and reliability. They continue by arguing that harness engineering is the subset of context engineering, which primarily involves leveraging harness configuration points to carefully manage the context window of coding agents. It answers, how do we give our coding agents new capabilities? How do we teach it things about our code base that aren't in the training data? How do we increase task success rates beyond magic prompts? And one of the things that they point out is that harnesses aren't just one thing. To some extent, harnesses work backwards from what models can't do natively to create some component to solve for that. In another post from Viv from LangChain, called The Anatomy of an Agent Harness, Viv added a chart that showed the desired agent behavior versus what the harness adds.
For example, the simple one that's a part of every Claude Code session. If the desired agent behavior is to write and execute code, the harness adds bash and code execution. If the desired agent behavior is safe execution and default tooling, the harness adds sandboxed environments and tooling. If the desired agent behavior is remembering and accessing new knowledge, the harness is going to need to provide memory files, web search and MCPs. And importantly, when you've heard about all of these techniques, like Karpathy's Auto Research or the Ralph Wiggum Loops, those are harness additions to get to the desired agent behavior of completing long-horizon work. They also point out that this is something that the big labs are talking about quite a bit now too. Back in February, OpenAI dropped a post called Harness Engineering, Leveraging Codex in an Agent First World. The place that they start from in this post is the goal of building and shipping an internal beta of a software product with zero lines of manually written code. That has been the context through which they have had to figure out what needed to be part of the harness that they were designing. One of the big things that they found was effectively that in this new approach to engineering, they had to uncover new ways of giving the agent progressively more context. This is this idea which you might have heard me talk about before called progressive disclosure, which is a key part of the way that agent skills have been designed, where skills that provide context effectively unfold, with the agent being able to access the minimum amount of information to know if it needs to go deeper into that skill without having to crowd out its context window with all sorts of unnecessary information. The key part of the story though is in some of the last lines in the post.
They conclude, Our most difficult challenges now center on designing environments, feedback loops and control systems that help agents accomplish our goal, building and maintaining complex, reliable software at scale. That is a very different proposition than just making a model better.
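For anyone reading along rather than listening, here is a rough way to picture the progressive disclosure idea mentioned above. This is only a minimal sketch, not how any particular product implements skills: the `Skill` class, the `build_index` and `load_skill` helpers, and the two example skills are all hypothetical names made up for illustration. The point it shows is that the agent's context starts with a one-line summary per skill, and the full instructions are pulled in only when a skill is actually opened.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    # Hypothetical structure for illustration; real skill formats may differ.
    name: str
    summary: str  # short description that is always visible to the agent
    body: str     # full instructions, loaded only on demand


def build_index(skills):
    """Return the small always-in-context listing: one line per skill."""
    return "\n".join(f"- {s.name}: {s.summary}" for s in skills)


def load_skill(skills, name):
    """Return the full body of one skill when the agent asks to go deeper."""
    for s in skills:
        if s.name == name:
            return s.body
    raise KeyError(f"unknown skill: {name}")


# Two made-up skills, purely as sample data.
skills = [
    Skill(
        "release-notes",
        "Draft release notes from merged changes.",
        "Step 1: list merged changes. Step 2: group by area. Step 3: write a summary.",
    ),
    Skill(
        "db-migration",
        "Write and check a schema migration.",
        "Step 1: write the migration. Step 2: run it on a copy. Step 3: add a rollback.",
    ),
]

# What the agent sees up front: summaries only.
index = build_index(skills)

# What gets added to context only after the agent decides it needs this skill.
full = load_skill(skills, "db-migration")

print(index)
print(full)
```

The design choice worth noticing is that the cost of having many skills stays small, since only the index grows with the number of skills, while the detailed steps take up context window space only when they are relevant to the task at hand.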