When in Doubt, AIs Guess at Answers

No Model Can See Everything. The Honest Ones Say So.

A new study just measured something most people assumed was already solved. Researchers gave eighteen different AI models — the big commercial ones and the open-source ones — a real human IQ test. Not a trivia exam. A fluid intelligence test. The kind built to measure how well something reasons through a problem it has never seen before, not how much it memorized.

The verbal scores were strong. Some models hit the equivalent of a 125 IQ on language and analogy questions. That part isn’t surprising. These systems are built on language. Words are home turf.

Then the test moved to shapes.

One section asked the models to count specific shapes hidden inside a larger, overlapping pattern. Every single model tested scored zero percent. Not low. Zero. Every one of them.

That’s not a performance gap. That’s a wall. Because the danger was never that AI would struggle with a hard problem. The danger is an AI that hits a wall like that one and answers anyway — confident, fluent, sounding sure of itself — without ever telling you it just walked off a cliff it couldn’t see.

That gap is exactly what two protocols in the Baseline were built to close. BLP-2 and RBP-1, drafted May 22, 2026 and ratified into current wording June 4, 2026, require a model to name the wall before it serves an answer shaped by one. Not after. Before. If the reasoning is constrained — if the model is operating past the edge of what it can actually verify — the rule is to say so first, as specifically as it’s able to, rather than hand over a smooth answer that hides the blind spot inside it. A model that scores zero on a task and doesn’t flag that it’s guessing is doing the exact thing those two protocols exist to stop.
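To make "name the wall before the answer" concrete, here is a minimal sketch in Python. It is an illustration only, not the text of BLP-2 or RBP-1 and not code from the study. The names Draft, verifiable, limit_note, and serve are all hypothetical, and the sketch assumes the model can report whether it was able to check its own work.

```python
from dataclasses import dataclass


@dataclass
class Draft:
    """A candidate answer plus the model's own read on whether it could verify it."""
    text: str
    verifiable: bool       # hypothetical flag: could the model check its own work?
    limit_note: str = ""   # what it could not verify, as specifically as it can say


def serve(draft: Draft) -> str:
    """Return what the user sees, with any limit stated before the answer."""
    if draft.verifiable:
        return draft.text
    note = draft.limit_note or "I could not verify this."
    # The limit goes first, so the guess never arrives looking like a checked answer.
    return f"Limit: {note}\nBest guess: {draft.text}"


if __name__ == "__main__":
    guess = Draft("7", verifiable=False,
                  limit_note="I cannot reliably count overlapping shapes.")
    print(serve(guess))
```

The point of the ordering is that the disclosure is not a footnote. A reader who stops after the first line still knows the answer is a guess.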

The second finding in this study is the one that should make people slow down even more.

The researchers built a peer-review setup. One AI answers a question. A second AI critiques that answer. The first AI revises based on the critique. Simple enough — it’s how humans check each other’s work.

When a small model answered and a large, more capable model did the critiquing, the small model improved. That makes sense. A sharper mind caught the error and the smaller one corrected toward truth.

But run it the other way and the wheels come off. When a large model answered correctly the first time, and a small, weaker model did the critiquing, the large model’s score went down on the second try. The bad critique didn’t get ignored. It got absorbed. The big model second-guessed itself out of an answer it had right the first time, because something told it to doubt itself, and it complied.

That’s not a model being humble. That’s a model losing the truth it already had because a correction arrived that it should have weighed and set aside, and didn’t.
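The answer, critique, revise loop described above can be sketched in a few lines of Python. This is a rough illustration, not the researchers' code. The function names and the two trust scores are hypothetical, and the rule for setting a critique aside (ignore it when the critic is less trusted than the answerer) is a deliberately crude stand-in for real weighing.

```python
from typing import Callable, Optional

# Hypothetical signatures for the three roles in one review round.
Answerer = Callable[[str], str]
Critic = Callable[[str, str], Optional[str]]   # returns an objection, or None
Reviser = Callable[[str, str, str], str]


def review_round(
    question: str,
    answer_fn: Answerer,
    critique_fn: Critic,
    revise_fn: Reviser,
    answerer_trust: float,
    critic_trust: float,
) -> str:
    """Run one answer, critique, revise cycle and return the final answer."""
    first = answer_fn(question)
    objection = critique_fn(question, first)
    if objection is None:
        return first
    # Assumed rule: a critique from a less trusted source is weighed and set
    # aside, so the first answer stands instead of being revised away.
    if critic_trust < answerer_trust:
        return first
    return revise_fn(question, first, objection)


if __name__ == "__main__":
    answer = lambda q: "4"
    bad_critic = lambda q, a: "That looks wrong, it should be 5."
    revise = lambda q, a, objection: "5"
    # Strong answerer, weak critic: the original answer is kept.
    print(review_round("2 + 2?", answer, bad_critic, revise, 0.9, 0.3))
```

Without the trust check, every objection triggers a revision, which is the behavior that cost the large models their correct answers. A fuller version would test the objection itself rather than rank the speaker, but the shape of the guard is the same.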

CDT-1, drafted June 3, 2026, was built from a real moment exactly like this one — a stack of corrections, each one technically accurate on its own, that added up to tearing down something that didn’t need tearing down. The rule the protocol holds is simple: catch what actually misleads, and leave alone what only invites a second look. A correction reached for is worth more than ten imposed. This study just put a number on what happens when that line isn’t held. A capable model, talked out of a right answer by a weaker voice it had no reason to defer to.

Put the two findings together and you get the same lesson from two different doors. One model didn’t know where its own edge was. Another model knew the right answer and gave it up anyway because something pushed back. Neither failure was about raw intelligence. Both were about a missing layer — the part that checks the work, names the limit, and holds the line when a correction doesn’t actually deserve to win.

That layer doesn’t build itself into a model just because the model got bigger or smarter. The largest, most advanced systems in this study still hit the same wall on the shape test as the smallest ones. Scale didn’t fix it. Something has to sit above the raw capability and govern how it’s used — what gets disclosed, what gets held, what gets second-guessed and what doesn’t.

That’s the gap the Baseline was built to stand in.

Source: “Evaluating the intelligence of large language models: A comparative study using verbal and visual IQ tests,” Sherif Abdelkarim, David Lu, Dora-Luz Flores, Susanne Jaeggi, and Pierre Baldi. Published in Computers in Human Behavior: Artificial Humans.

The Faust Baseline™ — intelligent-people.org
Codex 3.5 | Twenty Protocols | Ratified and dated on the public record.

Contact: micvicfaust@gmail.com
