**Nathaniel Whittemore** (0:00)
Today on The AI Daily Brief, Why AI Advantage Compounds. Before that in the headlines, AI benchmarks for the real world. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.
Welcome back to the AI Daily Brief Headlines Edition, all the daily AI news you need in around five minutes. Regular listeners will know I'm not a super fan of benchmarks, specifically the way that we use them around new model releases, for all the reasons that other people complain about benchmarks as well. Many of them are saturated, meaning the gradations between different models are incredibly small. Many of them can be gamed. Mostly, though, they just don't operate in the real world that we're using these models in, and so they don't tell us all that much about how those models work in the real world.
Lately, however, there has been an effort to build evals and benchmarks that are grounded in the real world, and one of them, which was introduced by OpenAI in September, is called GDPval. Basically, that test measures performance against a set of economically valuable tasks representing over 44 occupations. The benchmark measures the capability to complete knowledge work tasks end to end, including following instructions, researching, doing the actual work, and then delivering the final product. Now, grading under OpenAI's approach was difficult. They relied on expert graders, a group of experienced professionals from the same occupations that were represented in the data set. They paired those human graders with an automated grader, but said at the time that it wasn't yet as reliable as the expert graders, so they weren't in a rush to replace them.
Well, now Artificial Analysis has completed an evaluation harness based on that set of GDPval tasks to allow the benchmark to be run on any LLM. As you might imagine, they are relying on an AI-based grading pipeline to allow the benchmark to be run autonomously and at scale. They wrote, "We think this makes it today's best way to compare general agentic performance of language models." They're referring to their setup as GDPval-AA, for Artificial Analysis.
And in their testing run, they found that despite OpenAI creating the benchmark, Opus 4.5 was the leading model. GPT-5 was in second place. Claude Sonnet 4.5 was third. Interestingly, GPT-5.1 underperformed GPT-5, but still did well enough to land in fourth place. And DeepSeek 3.2 and Gemini 3 Pro were tied for fifth place. GPT-5.1's slight underperformance was curious, but Artificial Analysis noted that the model used half as many tokens to complete the tasks. While the drop in quality is only slight, it still highlights that 5.1's increased efficiency does come at a cost. While Opus 4.5 was on top of the charts, it was very expensive at $608, which was more than twice the cost of any other model that they tested. DeepSeek 3.2 was the standout for cost efficiency, completing the benchmark run for $29, one-twentieth the cost of Opus, and a third of the cost of the Gemini 3 Pro run that it tied with on score. Ultimately, like anything, I think you need to continue to be skeptical of most benchmarks, but at least this one is trying to get at the actual types of tasks that people in the real world will be using these tools for.
Speaking of OpenAI, one quick note on them. The Information reports that their sources say that ChatGPT is now nearing 900 million weekly active users, up from the 800 million that they've been stuck at for some time.
Moving over into the world of drama and intrigue, there's bombshell news from the chip war: reports claim that DeepSeek has built a Blackwell training cluster from smuggled chips. The Information reports that DeepSeek has begun developing its next frontier model on a cluster of several thousand of NVIDIA's state-of-the-art Blackwell chips. These chips are banned for export into China even under Trump's recently announced reductions in export controls.
The Information is citing six people with knowledge of the matter as their sources, seeming to suggest that the sources are China-based with close ties to DeepSeek rather than US-based intelligence. The reporting claims that DeepSeek secured this chip cluster by importing the chips via data center providers in a third country. The claim is that NVIDIA servers were delivered and installed in third-party data centers, inspected by vendors for compliance with the export controls, then dismantled and smuggled into China as individual components. Now, if true, the report changes the landscape of the chip war. It would be the strongest reporting to date to suggest that Chinese labs have been able to smuggle enough cutting-edge chips to build a commercial training cluster. Previous reports have largely been about smuggling small batches of chips for research, and the incidents referenced were often from years ago. Now, NVIDIA, for their part, is calling the report into question, stating, "We haven't seen any substantiation or received tips of phantom data centers constructed to deceive us and our OEM partners, then deconstructed, smuggled and reconstructed somewhere else." However, they did leave open the possibility that this was true, saying, "While such smuggling seems far-fetched, we pursue any tip we receive."
Alongside the DeepSeek report, we also have news that Beijing is holding emergency meetings with tech companies on the new possibility of H200 imports. Officials reportedly met with representatives from Alibaba, ByteDance and Tencent this week to assess their demand for H200s. Officials reportedly asked for specific information on how many H200s the tech companies need for training, which suggests that Beijing is preparing to take up the White House's offer and allow the chips into the country.
Now, we've talked a lot about the US's strategic decision-making, but as The Information points out, the, quote, drumbeat of actions highlights Chinese policymakers' dilemma: whether to support AI development that needs powerful chips China can't yet produce, or push through the adoption of homegrown chips to eventually rid the country of US technology.