9 Codex Tips From the Codex Team

**Nathaniel Whittemore** (0:00)
Today on the AI Daily Brief, nine Codex tips from the Codex team.
Before that, in the headlines, yeah, we got a verdict in the Elon OpenAI trial, but that's much less interesting than Composer 2.5.
The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.
All right, friends, quick announcements before we dive in.
You can apply for our new growth engineer role. I'm going to be closing that soon, so if you are interested, get your application in. And you can also find a link to register for the third cohort of Enterprise Claw, which is coming up soon. Basically, if you get all excited about the Codex talk today and want to get that spirit of agent building across your company, that's what Enterprise Claw is going to be good for. But with that out of the way, let's talk Composer 2.5 and what it says about Cursor in the AI race.
One of the questions coming into this year was whether what swyx calls the agent labs, but what we now might call the harness-first labs (companies like Cursor, Cognition, etc.), would be able to compete on the model front. The concern for these companies, of course, was that if they were totally beholden to the models from the big labs, and those labs started to move in the direction of building their own harnesses as well, it could squeeze out the space for the Cursors and Cognitions of the world. And at the same time, the Cursors and Cognitions of the world had something valuable in the form of the data exhaust from the usage of their platforms, which theoretically gave them insight into how people were actually interacting with these models, which could turn into a valuable asset for training their own models. Whether it could or couldn't, it was clear that this was a direction that they were going to start to head, and that the space between these so-called agent labs and the model labs was destined to close. The model labs were going to move into the harness space, and the agent or harness labs were going to move into the model space.
In January, CEO Michael Truell told staff that it was, quote, "wartime," recognizing that Cursor's business model was being eroded from both sides. Claude Code was coming after them on the harness side, but they also couldn't keep eating the cost of serving Anthropic models at a discount. With that in mind, he said that the company's number one priority was to build the best coding model this year.
The release of Composer 2 in March was a decent first step, but it was still mostly about bringing down costs. Where users adopted the cheap in-house model, it was mostly for simple tasks. However, it failed to drive new users to the platform. The just-launched Composer 2.5 could be a different story.
The model appears competitive on the key benchmarks. It scored 69.3 percent on Terminal-Bench 2.0, which is just behind Opus 4.7 at 69.4 percent. On SWE-bench Multilingual, it scored 79.8 percent, comparable to both Opus 4.7 at 80.5 percent and GPT-5.5 at 77.8 percent. On Cursor's in-house benchmark, which tests more difficult coding tasks, Composer 2.5 scored 63.2 percent, just about a point behind both Opus 4.7 and GPT-5.5.
Now, the coding performance gets Composer 2.5 into the ballpark of usability for serious coders, but the bigger part of the story is the cost. Cursor is serving this model at just 50 cents per million input tokens and $2.50 per million output tokens, making it half the cost of Opus 4.7 or GPT-5.5. They also seem to have delivered some big token efficiency gains. Their benchmark run on SWE-bench came in under a dollar per task, compared to around $5 per task for GPT-5.5 on extra high settings, or $11 per task for Opus 4.7 on max settings. That has led Cursor to claim that their model is 10x more efficient compared to similarly capable models.
Now, Composer 2.5 is still built on top of the same base model as Composer 2, which is Moonshot's Kimi 2.5. That implies that the entire performance boost came from better reinforcement learning techniques, and suggests that even if people don't switch en masse from Opus and GPT to Composer, there is a ton of room to post-train leading open-source models to compete at the frontier, especially around these discrete tasks. Cursor also announced that they are in the middle of training a new model from scratch using xAI's Colossus 2 training cluster. They wrote, "With Colossus 2's million H100 equivalents and our combined data and training techniques, we expect this to be a major leap in model capability."
Leon Lin summed up the model release, posting, "So basically we got an Opus 4.7 model that costs 10x less. I have to test this." Later in the day, he checked back in with the results, writing, "Pretty fast and efficient model. Does a great job. I'd say it's almost as strong as Opus 4.7 or in some cases just at the same level. Cheap model, good at front end, still a bit generic design when used without skills." Analyst Max Weinbach agreed, writing, "Composer 2.5 is very good. It's good at doing more than just quick iterations of front end now. I will probably use it over Claude and Cursor."
Addressing what had changed from Composer 2, he added, "Composer 2 was good at some quick tweaks but I didn't trust it enough to do more. I trust 2.5 a lot more." Ellie from Prime Intellect noted that following the xAI deal, which you might remember is where xAI has the option to buy them for $60 billion, Cursor is now competitive with the frontier labs on both model performance and training compute. Certainly, Elon seems excited, spamming retweets of all of the excited posts about the model.