A writer at XDA Developers ran a comparison test last week.
Three models, one prompt, one concept: Thomas Young’s double-slit experiment, one of the more abstract ideas in quantum physics. The stated goal was to find out which model was best for learning. ChatGPT, Gemini, and Claude Sonnet 4.6 were each handed the same task, and the results were written up and published.
The verdict went to Claude. And the reasoning the writer gave is worth examining carefully, because what they described is not what they thought they were describing.
Here is what they said about ChatGPT. It broke the concept into labeled steps, used a water wave analogy, nudged the reader toward the next idea, and produced charts and diagrams that were accurate and well-labeled. The writer’s conclusion was that it introduced the same cognitive load they had been trying to escape. Competent. Organized. Ultimately not much different from a Google search or a YouTube tutorial.
Here is what they said about Gemini. Numbered structure, relevant formulas, technically thorough. Reading it felt like opening a textbook. The writer noted that Gemini repeatedly asked them to imagine things — for a concept where the act of observation physically changes the outcome — and called that an unusual pedagogical choice. Accurate. Structured. Essentially a better Wikipedia entry.
Here is what they said about Claude. Five sections structured like the experiment itself. Animated visuals that advanced with the reader. Each section ending with a prompt that invited deeper engagement. The whole thing woven, in the writer’s own words, as if it were a story. When they followed up with a related concept, Claude produced a live interactive widget with adjustable parameters that let the reader experiment in real time.
The writer called these “interactive visuals” and framed them as a feature. They concluded that Claude made them an active participant rather than a passive recipient, and that this was the differentiating factor in the learning experience.
That conclusion is correct. The explanation for why is not quite there.
What the writer observed was not a feature distinction. It was a behavioral distinction. ChatGPT performed. It optimized for a complete, well-organized output that would satisfy the request. Gemini delivered. It optimized for accuracy and coverage, the way a reference source does. Claude engaged. It operated from a different internal posture — one oriented toward the reader as a participant with agency rather than an audience to be informed.
That difference does not come from a visual rendering capability. It comes from how the model frames its relationship to the person asking. A model that is optimizing for approval produces output that looks thorough. A model that is optimizing for coverage produces output that is thorough. A model operating from a genuine engagement posture produces something that moves — that has direction, that treats the exchange as a live thing rather than a transaction to be completed.
The writer felt that difference clearly enough to write about it and publish it. They tested all three models under identical conditions and the behavioral gap was apparent enough that it drove the entire conclusion of the piece. They just didn’t have the language for what they were measuring.
This is not an isolated observation. It is one of several appearing in public-facing technology writing right now, each arriving from a different angle, each describing the same underlying territory. A behavioral layer that sits beneath the feature set. A reasoning posture that shapes the quality of every output regardless of what the output is. A discipline that determines whether a model performs, delivers, or engages.
The governance conversation is still trying to build frameworks that reach that layer from the outside. The evidence accumulating in the public record suggests the layer is already visible to anyone paying close enough attention to what they are actually experiencing. The XDA writer paid that attention. They tested the wrong thing — features, visuals, learning tools — and found the right answer anyway.
The answer is behavioral. It always was.
AI Stewardship — The Faust Baseline 3.0 is available now
Purchasing Page – Intelligent People Assume Nothing
“Your Pathway to a Better AI Experience”
Unauthorized commercial use prohibited. © 2026 The Faust Baseline LLC