GPT 5.4 First Test Results

**Nathaniel Whittemore** (0:00)
Today on The AI Daily Brief, GPT 5.4 is here and these are both the first impressions from the broader world as well as my first test results. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.
All right, friends, quick announcements before we dive in. First of all, thank you to today's sponsors, KPMG, AIUC, Blitzy and PromptQL. To get an ad-free version of the show, go to patreon.com/aidailybrief, or you can subscribe on Apple Podcasts. If you are interested in sponsoring the show, send us a note at sponsors@aidailybrief.ai. Lastly, as often happens when we have a big new model release, no headlines today. We are just going to spend all of our time on this exciting new model. So without any further ado, let's dive in.
After a couple of weeks where mostly we've been talking about big macro issues like the Pentagon and Anthropic and all that sort of thing, we finally have the cool fresh breeze of a new exciting model to test. And this one indeed is pretty exciting. Ethan Mollick tweeted, I think we've been through enough release cycles for models at this point to say that the latest model from OpenAI or Anthropic or Google is generally going to be the best model in the world upon release, with some jagged edges, until the next release by one of the big three. Now with that background, a different way to look at where we've been is that it's simply been OpenAI's turn.
However, the expectations coming into GPT 5.4 were a little bit higher than they might have been for some of OpenAI's more recent releases. Ever since the release of GPT 5, all the big model providers got the memo that trying to promise too much in each update, rather than just being very incremental, was a pretty scary proposition. That's what got us the 5.1, 5.2, 5.3, now 5.4 kind of paradigm. But of course, it's not just OpenAI doing that. Google and Anthropic are both on that same plan as well. And yet, even with that, 5.4 has had a little bit more hype and anticipation around it than some of the previous iterative models that we've gotten more recently. This was theoretically supposed to be the big outcome of OpenAI's Code Red, which was launched back in December.
And what's more, the buzz for the last week or week and a half or so has been that this one was really meaningful. Enough so that it almost felt to me like some of the more recent leaks to publications like The Information were almost trying to tamp down expectations. To take one example, rumors had been flying that there was a 2 million token context window, whereas The Information's reporting from the last couple of days suggested it was just 1 million. Seemed to me to be a little bit of expectation setting through leaks.
In any case, on Thursday afternoon we actually got the model. And the initial buzz was strong. Ben Hylak wrote, I've been using GPT 5.4 for the past few weeks. In a sea of endless model drops and benchmark maxing, this model is the first in a long time to be worth your time to try. Honestly, didn't expect OpenAI to pull this off. So let's talk first about how OpenAI frames things, look at some of the early reactions in the community, and then we'll walk through a more comprehensive case study with a project that I recently did to put the new capabilities through the wringer.
Now one of the interesting things about this week is that this was not the first new OpenAI model we got. Just a couple of days ago, we got GPT 5.3 Instant, although OpenAI started promising almost immediately that 5.4 was coming. 5.3 Instant was, as we've talked about, a speed and personality play. The announcement tweet called it more accurate, less cringe. This actually was part of the inspiration for our episode about what's going to actually matter in consumer AI, as this was so clearly aimed at that default sort of experience that the average ChatGPT user is going to have because they're not optimizing the model selector for what they think they want. GPT 5.4, although still having a lot of offshoots, does feel like they're trying to bring together the models under a coherent banner.
They write, GPT 5.4 brings together the best of our recent advances in reasoning, coding, and agentic workflows into a single frontier model. It incorporates industry-leading coding capabilities of GPT 5.3 Codex, while improving how the model works across tools, software environments, and professional tasks involving spreadsheets, presentations, and documents. The result is a model that gets complex real work done accurately, effectively, and efficiently, delivering what you ask for with less back and forth. In other words, if 5.3 Instant was for the personal use cases, GPT 5.4 was, as the subheader says, designed for professional work. As The Information had it in that leak, we do indeed get a 1 million token context window, which, among other things, improves 5.4's ability on tasks that require longer thinking.
The quotes from early testers that they included were raving, even more than the ones that they usually include. Brendan Foody, the CEO at Mercor, writes, GPT 5.4 is the best model we've ever tried. It's now top of the leaderboard on our APEX-Agents benchmark, which measures model performance for professional services work. It excels at creating long-horizon deliverables such as slide decks, financial models and legal analysis, delivering top performance while running faster and at a lower cost than competitive frontier models. Indeed, these professional tasks are a big focus of OpenAI's announcement blog. They compare GPT 5.4's outputs to GPT 5.2's on spreadsheets, documents and presentations, all with significant upgrades.
And if overall knowledge work is what OpenAI chose to focus on, there was also a big emphasis on computer use capabilities. We are, of course, living in an OpenClaw world, and so it makes sense that this is becoming increasingly important. Computer use is actually listed above coding, perhaps because the jump from 5.3 Codex to 5.4 isn't all that big.
Indeed, they basically said that 5.4 is the integration of 5.3 Codex with other improved aspects of the model, meaning it seems to me that their latest round of big coding innovations was embedded in 5.3 Codex itself.
There's a sub-theme running throughout the announcement about efficiency. In the first part of the announcement post, they write that GPT 5.4 is our most token-efficient reasoning model, using significantly fewer tokens to solve problems when compared to GPT 5.2, translating to reduced token usage and faster speeds. In the coding section, they also talk about fast mode in Codex, which they say delivers up to 1.5x faster token velocity. They write, it's the same model and same intelligence, just faster. That means users can move through coding tasks, iteration and debugging while staying in flow.
The new efficiency gains also show up in ToolSearch. They write, previously, when a model was given tools, all tool definitions were included in the prompt up front. For systems with many tools, this could add thousands or even tens of thousands of tokens to every request, increasing cost, slowing responses, and crowding the context with information the model might never use. With ToolSearch, GPT 5.4 instead receives a lightweight list of available tools, along with a ToolSearch capability. When the model needs to use a tool, it can look up that tool's definition and append it to the conversation at that moment. This approach, they say, dramatically reduces the number of tokens required. They evaluated 250 tasks from Scale's MCP Atlas and found that this new configuration had the same accuracy but reduced total token usage by 47%. That, combined with improved accuracy on tool calling, obviously makes it really appealing for agentic use cases.
There are a few other parts of the announcement, including improved web search and improved steerability, but those are the big hits. And as people dug in, there were a few things that stood out.
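As a side note on the ToolSearch idea described above, here is a minimal Python sketch of the difference between putting every tool definition in the prompt up front and sending only a lightweight list with on-demand lookup. Everything in it is an assumption made for illustration: the `Tool` and `Conversation` classes, the `tool_search` method, the 50 made-up tools, and the word-count stand-in for tokens are all hypothetical and are not OpenAI's API or implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Tool:
    name: str
    summary: str      # one-line description shown in the lightweight list
    definition: str   # full definition, only needed when the tool is used


def rough_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())


def upfront_prompt(tools: list) -> str:
    """Older approach: every full tool definition goes into the prompt."""
    return "\n".join(t.definition for t in tools)


def lightweight_prompt(tools: list) -> str:
    """Deferred approach: only names and one-line summaries go in up front."""
    return "\n".join(f"{t.name}: {t.summary}" for t in tools)


@dataclass
class Conversation:
    tools: dict                      # hypothetical registry: name -> Tool
    messages: list = field(default_factory=list)

    def tool_search(self, name: str) -> str:
        """Look up one tool's definition and append it to the conversation."""
        definition = self.tools[name].definition
        self.messages.append(definition)
        return definition


# 50 hypothetical tools, each with a 200-word definition.
tools = [
    Tool(f"tool_{i}", f"does task {i}", " ".join(["param"] * 200))
    for i in range(50)
]

upfront_cost = rough_tokens(upfront_prompt(tools))

convo = Conversation(tools={t.name: t for t in tools})
convo.tool_search("tool_7")  # the model decides it needs exactly one tool
deferred_cost = rough_tokens(lightweight_prompt(tools)) + sum(
    rough_tokens(m) for m in convo.messages
)

print(upfront_cost, deferred_cost)
```

In this toy setup the deferred version pays for the short list plus the one definition actually used, which is the mechanism the announcement credits for the token savings. The size of the saving here is an artifact of the made-up numbers and says nothing about the 47% figure OpenAI reported.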
First of all, that efficiency is showing up in people's tests. Greg Kamradt, the president of ARC Prize, said that on ARC-AGI-2, they were seeing a consistent 20% lift versus 5.2 at the same price. Still, the three benchmarks that people were most discussing were around coding, computer use and GDPval. On coding, people flagged that GPT 5.4 was only nominally better than 5.3 Codex on benchmarks like SWE-Bench Pro, but most people understood that that wasn't the core value proposition of this updated model.
The computer use improvement got more of a discussion. This is maybe the most concerned I've ever seen people with computer use, and I think that's the difference between the pre and post OpenClaw world. Now that people have got all these Mac minis running around with their OpenClaw agents having pretty much unfettered access to them, how good the models are at using the computer becomes much more relevant in a day-to-day way, not just a theoretical way. Rahul Agarwal writes, GPT 5.4 is here and it can use a computer better than a human? OpenAI shipped GPT 5.4 on March 5th. The headline isn't the reasoning improvements. It's that this is their first general purpose model with native state-of-the-art computer use. It can operate websites and software autonomously, issue keyboard and mouse commands, write and execute code, and navigate full desktop environments. On OSWorld-Verified, it hit 75%, which is above human-level performance at 72.4%, and a massive jump from GPT 5.2's 47.3%. That's not incremental, that's a step change. When agents can reliably navigate desktops, the bottleneck on automation shifts from can the model do it to do you trust it enough to let it. That's the question nobody has a good answer to yet.
Jamie Cuffe from Pace wrote an X article about this specifically. He called it, We stress tested GPT 5.4 on the hardest UI on the Internet. And he writes, People don't realize how good AI computer use has actually gotten.
Until they see it tackle the hardest UIs in existence, legacy insurance portals. At Pace, we build AI agents for insurance workflows like submission intake and first notice of loss. Because we operate in this space, we use legacy enterprise software as our ultimate benchmark. If AI can reliably navigate a 20-year-old hyperdense insurance portal without hallucinating a click, it can navigate anything. For a long time, the technology just wasn't there. But after spending the last few months working closely with OpenAI to stress test their new GPT 5.4 model inside these environments, it became clear the paradigm has shifted.
A couple of things they point to as much better. The first is click accuracy. Historically, Jamie writes, the biggest failure point was simply clicking the right thing. Enterprise software is incredibly dense. Layouts are cluttered, buttons are tiny, and the systems were designed decades ago. Earlier models would frequently miss targets. GPT 5.4 is vastly better at grounding itself visually, clicking precisely where it needs to, even on a crowded screen. They also point out improvements in long trajectory reasoning, speed and time to iteration, as well as memory.
Now, there was also a lot of chatter around GDPval, basically the measure of the model's performance on professional work tasks. GDPval, remember, tests knowledge work spanning 44 occupations from the top nine industries that contribute to US GDP. The win rate versus industry professionals was 49.8% for GPT 5.2, 60% for 5.2 Pro, and between 69.2% and 70.8% for the GPT 5.4 family. That's just wins. When you include ties, that number rises to 82 to 83%. Ethan Mollick pointed out what this means in terms of time savings. He wrote, given the GDPval benchmark for GPT 5.4, the new model ties or beats humans, as judged by other experts, at professional tasks 82% of the time.
If you give a 7-hour task to AI, even with failure rates and the need to check results, you'd save 4 hours and 38 minutes on average.
And in addition to the general performance increases across the GDPval set, it's very clear that OpenAI is aggressively going after certain industries even more. COO Brad Lightcap, for example, tweeted, The team worked extremely hard to make GPT 5.4 great for finance. It's much improved for financial modeling and analysis, integrates directly into Excel, and connects to Factiva, Daloopa, S&P Global, and many more. It does feel like a Codex moment is coming here.
But what about the overall impressions outside of the individual benchmarks? Latent Space wrote, We've learned to take for granted that OpenAI is the smartest kid in the room, always reporting state-of-the-art evals. But this set of updates feels much more substantial and confident than any OpenAI launch in recent history. The Every vibe check summed it up, OpenAI is back. Three months ago, writes the team at Every, OpenAI was losing the agentic coding race. Claude Code had captured developers' hearts, and Opus 4.5 was shipping at a level other models couldn't touch. Meanwhile, OpenAI's coding agent, Codex, felt like it was built for an older era of coding with AI. It was precise but rigid, powerful but personality-less, and not good with tools or able to run for long periods of time autonomously. OpenAI's latest model release, GPT 5.4, along with their other recent releases, GPT 5.3 Codex, GPT 5.3 Codex Spark, and the Codex desktop app, shifts the competitive balance back towards OpenAI on the coding front. The new model produces plans that are thorough and technically precise and have a user-focused and human feel that has been missing from OpenAI's previous coding models. In our testing, GPT 5.4 reviews code with more depth than GPT 5.3 Codex and has a noticeably more conversational voice.
With a few tweaks, it became our preferred model to use in our OpenClaw, especially given that it is half the price of Opus 4.6. Even Kieran Klaassen, our diehard Claude devotee, is now reaching for GPT 5.4 daily since we started testing it a week ago. As ever, they say, there are trade-offs. GPT 5.4 has a tendency to expand the task well beyond what you asked for and to call tasks done before they're finished. It sometimes completed tasks in obviously wrong ways, then lied about it. The bigger story here is OpenAI's trajectory. From the Codex desktop app to GPT 5.3 Codex and to GPT 5.4, the company is iterating fast and many members of the team now use its tools and models daily for coding, a significant change from a few months ago.
A couple of things that they said they liked about it. It did proactive research without being asked. It had a more human voice than previous Codex models. And it was roughly twice as fast as Opus. What they didn't like included scope creep on multi-step conversations, misreading completion reports in OpenClaw, and overengineering. One of their team members called it too eager, adding things that aren't wrong but aren't necessary. Ultimately, they summed it up with their title, OpenAI is Back.
Other people who reported their own experience also had positive things to say. Dr. Derya Unutmaz wrote, I asked GPT 5.4 Pro to write a 10,000 word literary article on my beloved T cells. As I read through it, I am both mesmerized and deeply emotional. It's just so beautifully written. When I was tweeting about what I found to be 5.4's over-verbosity, Klick Health's Simon Smith wrote, Counterpoint, it's the best writing model from OpenAI I've seen, and probably better than the best Claude models now at writing, and only needs a bit of nudging to write extraordinarily well. I've done a ton of writing tests with 5.4 now, and it's capable of writing every imaginable way with great empathy, creativity, wit, and concision. And it has personality again.
How I AI's Claire Vo wrote, What I love: speaks more like a human than 5.2 and 5.3, a million-token context window, tool use is chef's kiss, first model where I've experienced the true go-investigate and fix experience that feels robust. What needs work: basic front-end and UX taste, still loves a bullet point, latency and stability. The next day she updated it and said she's starting to have loving feelings towards the model. Basically, it did some really difficult tech and data work really well without a lot of support.
Matt Shumer, who you might remember from Something Big Is Happening, calls GPT 5.4, in short, the best model in the world by far. On coding, he writes, coding capabilities are ridiculous. It's essentially flawless. Inside Codex, it's insanely reliable. Coding is essentially solved. There's not much more to say on this. It's just that good. Now, he did find weaknesses, including frontend taste, which is something that I'm going to get into in just a moment as well. Mark Tenenholtz from Perplexity also pointed out that while the model itself is great, the updates to the actual Codex CLI experience are really good as well. He called them the real hero. So much less friction than the previous approval system, he says.