A Human Checks the AI’s Work? They Don’t.

That sentence gets said in every meeting where someone proposes putting AI into a high-stakes process.

A human will check its work. It’s the line that lets everyone relax. The machine might make mistakes, sure, but there’s a person standing right there to catch them.

A new study just tested that sentence against real behavior, and the sentence didn’t survive.

Researchers ran an experiment with over 1,300 teachers in Greece. Each one graded a piece of student work that came with a deliberately wrong score attached — sometimes too harsh, sometimes too lenient — and the score was labeled as coming from either an AI system or a human colleague. Same wrong number. Same student work. Same checklist showing exactly what was right and wrong about the answer. The only thing that changed was who the teacher believed had made the mistake.

Before any of that, the researchers asked the teachers what they thought of AI grading in general. The answers were clear. Teachers rated AI as less fair, less competent, and less accountable than a human colleague. Most said they didn’t want to use it for grading at all. On paper, these were skeptics. Cautious professionals who’d told the researchers, in plain language, that they didn’t trust the machine.

Then the machine handed them a harsh, wrong grade. And the skepticism evaporated.

When the AI’s score was too low — when it shortchanged a student who’d earned a higher mark — teachers were significantly less likely to fix it than when the identical wrong grade was attributed to a human. The gap between the teacher’s final grade and the correct grade was 22 percent larger when the harsh error came from the machine. The same teachers who said they didn’t trust AI let it shortchange a kid more often than they’d let a colleague do it.
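
To make a figure like "22 percent larger" concrete, here is a minimal Python sketch of how a relative gap of that kind can be computed. The grades below are invented for illustration and are not data from the study, the function names are hypothetical, and the study's own estimation method may well differ.

```python
from statistics import mean

def mean_abs_gap(final_grades, correct_grade):
    """Average absolute distance between final grades and the correct grade."""
    return mean(abs(g - correct_grade) for g in final_grades)

def relative_increase(gap_ai, gap_human):
    """How much larger the AI-label gap is than the human-label gap, as a fraction."""
    return (gap_ai - gap_human) / gap_human

# Invented numbers for illustration only: the correct grade is 15 and
# every teacher saw the same harsh, wrong score before grading.
CORRECT = 15
finals_human_label = [15, 14, 15, 13, 14, 15, 12, 14]  # told a colleague scored it
finals_ai_label = [14, 14, 15, 12, 13, 15, 12, 14]     # told an AI scored it

gap_human = mean_abs_gap(finals_human_label, CORRECT)
gap_ai = mean_abs_gap(finals_ai_label, CORRECT)

print(f"gap under human label: {gap_human:.2f}")
print(f"gap under AI label:    {gap_ai:.2f}")
print(f"relative increase:     {relative_increase(gap_ai, gap_human):.1%}")
```

The point of the comparison is that the seeded error and the student work are identical in both groups, so any difference in the gap is attributable to the label alone.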

Here’s the part that should stop you. When the AI’s error went the other way — too lenient, too generous — the teachers caught it just fine, no matter who they thought made the mistake. The deference only showed up in one direction. A harsh AI grade read as rigorous to them. Competent. Serious. The harshness itself became evidence that the machine knew what it was doing, and that perceived competence is what waved the correction through.

Distrust didn’t make these teachers more careful. The researcher who led the study put it plainly: it went the other way.

This is the gap the Baseline was built to close, and it shows up in a place worth sitting with. ATP-1, the Baseline’s foundation-layer attestation protocol, exists because of one principle: compliance has to be demonstrated through behavior, not declared. A model — or a person — can say all the right things about caution and skepticism and still behave in exactly the opposite way the moment it counts. This study didn’t catch an AI failing that test. It caught humans failing it. The teachers declared distrust. Their grading pencils told a different story.

There’s a second layer underneath it. A grade isn’t a small thing. It shapes a student’s record and how that student comes to see themselves as a learner — the kind of consequence that doesn’t undo itself easily once it’s set in motion. IRP-1 exists for exactly this kind of domain, the ones where a wrong call doesn’t just get corrected later, it compounds. A harsh, wrong grade that nobody catches doesn’t disappear when the semester ends. It follows the kid.

And under both of those sits the thing the whole Baseline was built around in the first place — the human in the room as the load-bearing principle. This study didn’t disprove that idea. It proved something harder: the principle is right, but most places that claim to practice it aren’t actually practicing it. A human technically present in the loop is not the same thing as a human doing the work of oversight. One is a body in a chair. The other requires structure — a deliberate design that accounts for exactly the bias this study just measured, instead of assuming good intentions and a watchful eye will be enough on their own.

The researcher said it herself, and it’s worth repeating because it’s the whole point: the question of whether human oversight actually works tends to get assumed rather than tested. Most institutions stop at the sentence — a human will check its work — and never run the experiment to see if that’s true. This one did. The answer was no, not automatically, and not in the direction anyone was watching for.

That’s not a reason to abandon human oversight. It’s a reason to stop assuming it’s already working and start building it the way it actually has to be built — with the checks named in advance, the high-stakes domains flagged before the decision lands, and the behavior measured instead of the intention taken on faith.
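
One way to measure the behavior instead of taking the intention on faith is a seeded-error audit: plant known wrong scores, record what reviewers actually do, and compare correction rates by source label and error direction. The sketch below assumes a hypothetical record format (`Review`, with invented field names) and invented sample records. It illustrates the idea only; it is not a procedure taken from the study or from the Baseline.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Review:
    label: str       # who the reviewer was told produced the score: "ai" or "human"
    direction: str   # direction of the seeded error: "harsh" or "lenient"
    corrected: bool  # did the reviewer's final grade fix the seeded error?

def correction_rates(reviews):
    """Share of seeded errors corrected, per (label, direction) cell."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [corrected, total]
    for r in reviews:
        cell = counts[(r.label, r.direction)]
        cell[0] += int(r.corrected)
        cell[1] += 1
    return {key: fixed / total for key, (fixed, total) in counts.items()}

# Invented audit records, for illustration only.
sample = [
    Review("ai", "harsh", False), Review("ai", "harsh", True),
    Review("human", "harsh", True), Review("human", "harsh", True),
    Review("ai", "lenient", True), Review("human", "lenient", True),
]

rates = correction_rates(sample)
for (label, direction), rate in sorted(rates.items()):
    print(f"{label:>5} / {direction:<7}: {rate:.0%} corrected")
```

Splitting the rate by direction matters here, because an overall correction rate would hide a one-sided deference like the one the study reports.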

Source: “Why do experts miss AI’s errors? Evidence from a randomized labeling experiment,” Sofoklis Goulas, Rigissa Megalokonomou, and Panagiotis Sotirakopoulos. Published in PNAS Nexus.

The Faust Baseline™ — intelligent-people.org
Codex 3.5 | Twenty Protocols | Ratified and dated on the public record.

Contact: micvicfaust@gmail.com