What I Learned Testing GPT-5.5

**Nathaniel Whittemore** (0:00)
GPT-5.5, aka Spud, is here, but doesn't live up to expectations. This is one of the most hyped models we've had in a very long time, and we are going to go through all of the first reactions, the benchmarks, and of course, about a dozen of my own tests. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.
All right, friends, quick announcements before we dive in. Now, aidailybrief.ai is, of course, where you can find out about all the different things going on in our ecosystem. That includes things like the AIDB New Year's program, Claw Camp, etc. To try to make things a little bit easier, as we have some perhaps new free programs forthcoming, I'm actually launching an AI Daily Brief account system, so that you can just sign up once and then add yourself to programs as they come up without having to sign up again each and every time. If you go to aidailybrief.ai right now, you can claim your username and be first in line to hear about another free program we have launching tomorrow on an operator's bonus episode.
Well friends, it is here. Ever since back in December, when OpenAI declared a code red, we knew that they were deep in the lab cooking something good, or at least we hoped it would be good. Certainly the last few months have seen the company regain its verve, particularly around Codex, which has grown from just a couple hundred thousand users at the beginning of the year to over four million now. We've heard about the elimination of side quests, TBPN acquisition notwithstanding, and overall that focus has seemed to reshape the company.
And ultimately, leaked memos and grand statements about focus don't matter a fig if they don't produce results. Now honestly, for OpenAI, the stakes heading into the 5.5 release had been increased dramatically because of their competition with Anthropic. Maybe the biggest story for the last few weeks in AI has been the model that we don't have in Anthropic's Mythos. Anthropic basically said to the world, we've got a new powerful model that is a step change in capabilities, but it's too powerful right now for us to provide to the average user. Now of course, in some cases, there has been skepticism that the power is the real reason that Anthropic isn't delivering this. Some have speculated that it has more to do with compute constraints than true cybersecurity concerns. But it has seemed like the limited set of partner companies that have had access have validated that it is indeed a very good model.
Whatever OpenAI put out next, then, was always going to be their response to that missing Mythos model, and the expectations were ratcheted up accordingly. On Friday at 2pm, OpenAI dropped GPT-5.5. In their announcement tweet, they called it a new class of intelligence for real work and powering agents, built to understand complex goals, use tools, check its work, and carry more tasks through to completion. It marks, they wrote, a new way of getting computer work done. Some of the use cases they pointed to as where it excelled were writing, debugging code, researching, analyzing data, creating documents and spreadsheets, operating software, and moving across tools until a task is finished. In other words, this is a knowledge work model. And certainly the benchmarks seemed to slap. Taking just a comparison to Opus 4.7, whereas Opus 4.7 scored a 69.4% on Terminal Bench 2.0, an agentic coding benchmark, GPT-5.5 scored an 82.7. On the real-world task benchmark GDPVal, Opus 4.7 scores an 80.3, while GPT-5.5 gets an 84.9.
Overall, the model ranks right at the top of Artificial Analysis' overall benchmarks, with the extra-high version being the first model to ever score in the 60s. Artificial Analysis themselves write, "GPT-5.5 takes OpenAI back to the clear number 1. OpenAI's new model tops the Artificial Analysis Intelligence Index by 3 points, breaking a three-way tie with Anthropic and Google."
Now, while obviously all of that is good news for both OpenAI and for people who like powerful models, not every benchmark was that clear-cut. Andon Labs found that GPT-5.5 was behind Opus 4.7 on Vending Bench, which tasks the model with running a profitable vending machine business. In that test, GPT-5.5 was about on par with Opus 4.6.
In Vending Bench Arena, which is a multiplayer variant that introduces competition, GPT-5.5 did actually beat Opus 4.7 by a healthy margin, and Andon Labs also noted that 5.5 didn't display any of the underhanded tactics that Opus had, like lying to suppliers or stiffing customers on refunds.
Vals AI, which maintains a range of benchmarks that test professional tasks, including finance, medical and legal fields, found that Opus 4.7 still comes out ahead, although GPT-5.5 was a decent jump over 5.4.