The Importance of AI Model Release Agreements

**SPEAKER_1** (0:00)
Today on the podcast, Dario Amodei and Jamie Dimon were both on stage in New York while Anthropic was shipping 10 pre-built finance agents. We're also talking about their full Microsoft 365 integrations, a ton of other updates, and the new run rate coming out of Anthropic. In addition, every major US frontier lab has agreed to hand over their models to the Commerce Department before launch. This is kind of a new thing that's getting drafted up where the government wants to look at the models. A lot of people have opinions on that. Also, Google DeepMind staff in London have voted 98% to unionize, and they're doing this over the Pentagon and the IDF's deal. We've kind of covered this in the past. There's a bunch of Google employees that have signed letters saying they don't want Google to work with the Pentagon. Also, Cerebras has filed for a $26.6 billion IPO, and OpenAI is gonna get 33 million shares of that because of their partnership. PayPal has announced 4,500 AI-driven layoffs, and basically, they're having to redo everything at PayPal, and they're trying to use AI as the catalyst for bringing the company back. They are way down from their high of 2021.
IBM Think is trying to reframe how you think of watsonx Orchestrate, and there is also a Harvard study that finds that OpenAI's o1 model out-diagnoses ER doctors in real triage cases. A lot to cover. Let's get into it.

The first story I wanted to talk about was this Harvard Medical School study. They did it with the team over at Beth Israel Deaconess, and they just published a science paper that basically said OpenAI's o1-preview was beating a bunch of people inside of the ER, a bunch of doctors specifically. They put it head to head with two internal medicine attendings. So basically they had 76 real ER cases, and the ER, by the way, I'm going to say is kind of an interesting place, because this isn't like doctors that know their patients and work with their patients and understand their history. I think you get much better results there, but to be fair, the ER is a real use case, it's a real thing. And so inside of the ER, they gave the same electronic medical records, the same vitals, the same intake nurse notes, all of that, and they did it at the triage stage, where information is basically the hardest to get. You have to make a call and you have to do it very quickly in the ER, right? You're trying to keep people alive, they're like bleeding to death, they're having some sort of big issue.
The o1 model hit the exact or very close diagnosis 67% of the time, and the two attending doctors hit it 55% and 50% of the time. So none of them were perfect, but OpenAI's model was getting 67% and the doctors were getting 55% and 50%.
Once patients were admitted, o1 hit 81.6% versus 78.9% and 69.7% for the doctors. So everybody got better once the patients were actually admitted, but when they're coming into triage and you're trying to figure out what's wrong with them, OpenAI's model was doing really, really well, which is very promising. They did this across six total experiments and the model went up against hundreds of different physicians, and it kind of held its own in one. I think a lot of people were like, oh, did they only do this with like a couple doctors or like a couple places or a couple things, but no, they did this with hundreds of different doctors, across a ton of different experiments and a ton of different cases. So Peter Brodeur, a Harvard clinical fellow at Beth Israel and one of the study's co-authors, summarized it in a quote he gave to Fortune. He said, "We used to evaluate models with multiple choice tests. Now we're scoring close to 100 percent. We're already at the ceiling." Now, personally, if I'm going to put a caveat on all of this, I will say that these are text-only inputs. There's no imaging, there's nothing multimodal. And the authors are also pretty clear on that. They're saying that this isn't approval for clinical development. But I think there's definitely a liability question right now that kind of nobody has solved. If o1 disagrees with the attending physicians and then the attendings override it, who's going to get sued for malpractice when a patient has an issue? Are they going to look at what the AI model said and what the physician disagreed with? I mean, this would go both ways, right? Let's say the physician agrees with what the AI model says and it turns out to be wrong, who's liable? So these are some big questions that need to be solved. The next story I want to talk about is IBM.
They held their Think 2026 keynote today and they announced what to me basically looks like the most aggressive enterprise AI repositioning that they've done in the last few years. I remember going to some AI conferences maybe like three or four years ago and IBM was there talking about IBM Watson. This isn't really something that you hear a ton about in the mainstream. They've done some enterprise deals but I don't think it's as big as they had hoped. And what they're really pivoting and kind of pointing this towards is called watsonx Orchestrate, and this is basically what IBM is now calling, quote, an "agentic control plane." So this is basically the layer that's going to let all the Fortune 500 companies deploy agents from Anthropic, OpenAI, Google, plus all of their own internal agents. And they can do this all with a single policy. So, you know, they have some identity, they kind of have an audit layer, and it can look across everything. They also shipped IBM Bob, which is an agentic developer partner. I just, I don't know, leave it up to IBM to name their agent Bob. It's just, I don't know, it feels so on brand for IBM. Like, you know, you have Claude from Anthropic, you've got Gemini from Google, and we got Bob from IBM.