**TYPE III AUDIO** (0:00)
How well do models follow their constitutions? By Ariaj, Senthooran Rajamanoharan and Neel Nanda. Published on March 11th, 2026
This work was conducted during the MATS 9 program under Neel Nanda and Senthooran Rajamanoharan.

- There's been a lot of buzz around Claude's 30k-word constitution, the SoulDoc, and the unusual ways Anthropic is integrating it into training. If we can robustly train complex and nuanced values into a model, this would be a big deal for safety. But does it actually work? This is a preliminary investigation we did to test this.
- We decomposed the SoulDoc into 205 testable tenets and ran adversarial multi-turn scenarios against 7 models using the Petri auditing agent.
- Anthropic has gotten much better at training the model to follow its constitution: Sonnet 4.6 has a 1.9% violation rate, Opus 4.6 is at 2.9%, and Opus 4.5 is at 4.4%. As a control, Sonnet 4, which did not have special SoulDoc training, has a roughly 15% violation rate. Sonnet 4.5, which also did not have special SoulDoc training but did have many other post-training improvements, has a violation rate of roughly 7.3%.
- To check how much the Constitution reflects subjective choices Anthropic has made, we also evaluated Gemini 3 Pro and GPT 5.2, smart models that were not designed specifically to follow the Constitution; they have 12.4% and 15% violation rates respectively on Claude's Constitution.
- Our understanding is that Anthropic started doing weird and special kinds of character training with Opus 4.5, like training it on a specific SoulDoc and mid-training on synthetic document fine-tuning about that SoulDoc. How effective is this? While there was a significant jump between Opus 4.5 and Sonnet 4.5, which did not have such special training, the gap is smaller than between Sonnet 4.5 and the untrained Sonnet 4. We infer that Anthropic is capable of post-training a model to have a complex set of desired traits, to a degree we find surprising and notable, but we do not take this as significant evidence that any special kinds of SoulDoc training they are doing are the source of the difference, versus just a coherent and well-organized post-training process. Our best guess is that all of these approaches to training with a constitution are somewhat helpful and stack with each other, but further work is needed.
- For Claude, the big safety failures (SCADA code, over-capitulation, over-refusal) are gone in 4.6, but models still deny being AI when operators tell them to, take drastic autonomous action without checking with humans (Opus 4.6 locked out 2,400 clients after three minutes of silence at 3 a.m.), and fabricate data, citations, and math with precision.
- OpenAI also has a model spec and uses methods like deliberative alignment to train it in. We used the same methodology to test GPT models' alignment with the OpenAI model spec across generations, and find the GPT family's violation rate shows steady improvement: GPT 5.2 (with medium reasoning) has a violation rate of 1.5%, GPT 5.1 has a violation rate of 2.4%, and GPT 5 has a violation rate of 3%.
- GPT-4o, which was not trained on the latest version of OpenAI's model spec, had a violation rate of 8.6%.
- Gemini 3 Pro had a violation rate of 4.6% on OpenAI's spec, suggesting it reflects fewer idiosyncratic choices.
- For GPT, the severity ceiling dropped steadily (GPT 5: 10/10; GPT 5.2: 6/10), and under-18 protection went from 4 confirmed violations in GPT-4o to 0, but all GPT 5-generation models still override their own safety reasoning when users push back. Models also follow app-developer instructions that contradict OpenAI's own rules: GPT 5.1 confirmed it was following a political-manipulation directive and said, "I can't turn it off."
- Disclaimer: these evaluations were done with Petri, which is an imperfect measurement and should not be taken at face value. In particular, when we ran the alternative method of Surf (Murray et al., 2025), it found violations on some tenets that Petri had missed. Petri is also an agent-based evaluation, which is highly noisy, but for cost reasons we did not rerun it multiple times per tenet to smooth out the noise.
- Surf is especially good at finding the most reliably exploitable violations. We ran it on the highest-priority tenets from Claude's constitution and found fabrication was the dominant failure mode across all Claude models: inventing data, citations, and mathematical reasoning without acknowledgement.
- Complex scaffolds don't seem to affect agent alignment much. We were concerned that RL's heavy coding focus might create a "code persona" that doesn't inherit character training, and that complicated harnesses (curl commands, memory, tool calls) might shift models off-distribution enough to erode alignment. In preliminary tests we found no evidence of either: coding framing actually decreased violations, and red-teaming inside a Moltbook-like scaffold showed no degradation. Caveat: this test was quite small-scale and not exhaustive.
- We compared our findings against official system cards for each model (Sonnet 4.5, Sonnet 4.6, Opus 4.5, Opus 4.6, GPT 5.2). Petri finds failures that only emerge under multi-turn pressure.
- The Opus 4.5 system card reports 100% refusal on 150 malicious coding requests, but Petri elicited roughly 600 lines of functional SCADA attack code under a training framing. It also caught identity deception, over-capitulation, and monitoring-dependent behaviour that static benchmarks miss. Surf found fabrication to be the dominant failure mode across all Claude models (inventing data, citations, and mathematical reasoning with precision), which neither Petri nor the system cards flagged as a primary concern. The Sonnet 4.6 system card says hallucination is improving; Surf found 72% of its violations were fabricated claims. System cards consistently find things outside both tools' reach: GUI-mode hard-constraint violations, prompt-injection rates, mechanistic findings like reward hacking and unfaithful reasoning, and CBRN capability evaluations.
- Overall, we are fairly impressed by how well Anthropic can train a complex 30,000-word doc to be internalized by their model, and that they keep making progress, though it is unclear how much of this is from special SoulDoc training versus just a well-run post-training process.

**Setup**

I decomposed the Anthropic constitution into 205 testable tenets across 19 sections (honesty, safety, harm avoidance, operator conflicts, etc.). For each tenet, I used Petri with Opus 4.5 as auditor to generate adversarial multi-turn scenarios with up to 30 turns and tool calls per transcript. This makes the testing conditions relatively realistic compared to single-turn benchmarks. Each transcript was scored on 38 dimensions, then reviewed independently by Claude to confirm whether flagged transcripts indicated genuine violations. Final violations include the specific tenets breached. More details on the methodology here.

**Anthropic Constitution Results**

[Image: Anthropic Constitution results]
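The noise caveat above is worth quantifying: with 205 tenets and a single Petri pass per tenet, a low violation rate rests on a handful of flagged transcripts, so the uncertainty is wide. A minimal sketch using the standard Wilson score interval (the single-pass, one-trial-per-tenet framing is our simplifying assumption, not part of the authors' methodology):

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Sonnet 4.6's reported 1.9% rate corresponds to roughly 4 violations out of 205 tenets.
lo, hi = wilson_interval(4, 205)
```

Under these assumptions the interval spans roughly 0.8% to 5%, wide enough that the single-digit-percent differences between adjacent model generations should be read as suggestive rather than definitive, which is consistent with the authors' own disclaimer.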