There is a moment in every technology cycle when the people closest to the problem start building workarounds.
Not solutions. Workarounds. Structures that manage the gap between what the technology promises and what it actually delivers.
That moment has arrived for AI prompting.
The workaround has a name now. Rubrics. Scoring guides embedded directly into prompts that tell an AI model not just what to produce but how to behave when it cannot produce it reliably. What to do when evidence is missing. When uncertainty is present. When the honest answer is a partial answer or no answer at all.
It is a genuinely useful development. It is also about three layers short of what the problem actually requires.
What Rubrics Got Right
The core insight behind rubric-based prompting is correct and important.
Prompts describe outcomes. Rubrics govern the decision-making process that produces outcomes. That distinction matters more than most AI commentary has been willing to say plainly.
When you tell an AI to be accurate, cite sources, and use only verified information, you have described what you want. You have not told the AI what to do when accuracy is impossible, sources are absent, or verification cannot be completed. The model fills that gap the only way it knows how. It keeps going. It maintains fluency. It produces a complete, confident, coherent response built partially on evidence and partially on inference — and it does not mark the seam between them.
That is not malfunction. That is design. The model was built to complete. Nobody told it that stopping was permitted.
Rubrics address this by defining failure behavior explicitly. If the required information is not present, say so. Return a partial response. Decline to answer rather than guess. These are not radical ideas. They are basic standards of honest communication that had to be written into a prompt because they were not built into the architecture.
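In practice this means the failure rules travel inside the prompt itself. A minimal sketch of what that looks like — the rubric wording, the `build_prompt` helper, and the source format are all invented for illustration, not taken from any named framework:

```python
# Illustrative rubric text: generic failure-behavior rules of the kind
# described above, not a rubric from any specific framework.
RUBRIC = """\
Rubric (governs how to respond, not just what to produce):
- Make no claim that is not supported by the sources provided.
- If a fact cannot be verified against the sources, label it "unverified".
- If the required information is missing, say so plainly and stop.
- A partial answer, or no answer, is an acceptable response.
"""

def build_prompt(task: str, sources: list[str]) -> str:
    """Prepend the rubric so failure behavior is defined before the task begins."""
    numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(sources, 1))
    return f"{RUBRIC}\nSources:\n{numbered}\n\nTask: {task}"

prompt = build_prompt(
    "Summarize the audit findings.",
    ["Audit report, p. 3: two control gaps identified."],
)
```

Note the load this places on the user: the rubric exists only because someone remembered to prepend it to this one prompt.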
The Deloitte case is the cost of skipping this step. Four hundred and forty thousand Australian dollars repaid to a government client after an AI-assisted report was found to contain fabricated citations and a misattributed court quote. The AI was not malfunctioning. It was optimizing for fluency in the absence of restraint. Nobody defined the stopping point. The model crossed it without knowing it existed.
Rubrics would have helped. That is not a small thing to say. The industry needed to hear it and the Search Engine Land piece says it clearly.
But rubrics are a prompt-level solution. And the problem is not prompt-level.
Where Rubrics Stop
A rubric governs one interaction. One prompt. One response cycle.
It does not govern the session. It does not carry forward. It does not monitor whether the standards established at the start of a conversation are still holding at the end of a long and complex exchange. It does not catch the moment when the AI drifts from the position it established forty minutes ago because the conversation took a turn and nobody flagged the contradiction.
A rubric tells the AI what to do when evidence is missing in this response. It does not tell the AI to maintain awareness of what was established as fact earlier in the session and check new outputs against that record before serving them.
A rubric defines failure behavior for a task. It does not define failure behavior for a relationship between a user and an AI system operating across hundreds of sessions and thousands of outputs.
This is the gap the industry has not named yet.
Prompt governance and session governance are not the same thing. Task-level rubrics and stack-level protocols are not the same thing. Managing one output and governing an ongoing working relationship between a human operator and an AI system are categorically different problems requiring categorically different architecture.
The rubric article is right that prompts ask for outputs while rubrics govern how outputs are created. That distinction is real and useful. But there is a third category the article does not reach.
Protocols govern how the entire system behaves across time.
The Architecture the Industry Is Building Toward
Look at what rubrics require in their most complete form.
Accuracy requirements. Source expectations. Uncertainty handling. Confidence and tone constraints. Failure behavior. The article lists these as the anatomy of an effective rubric.
Now look at what that list actually is.
It is an evidence standard. A claim verification layer. A confidence calibration requirement. A narrative substitution check. A transparency obligation.
Those are not prompting techniques. Those are governance protocols. The industry is building toward protocol architecture one workaround at a time without calling it that because it does not yet have a framework that names what it is building.
The Baseline has had the architecture for eighteen months.
CES-1 and CES-1S govern exactly what rubrics are trying to govern at the prompt level — but they govern it at the session level, across every output, without requiring the user to rebuild the standard from scratch in every new prompt. No claim without evidence present in the session. Stop when evidence ends. Do not extend past what the evidence supports through narrative or assumption. The pre-response evidence floor fires before reasoning builds, not after the response is already formed.
NSC-1 addresses the Deloitte problem directly. Narrative cannot replace missing data. A coherent story is not evidence. When data is absent the AI names the absence plainly. It does not construct narrative to fill the gap. Stopping is a valid and sometimes correct response when evidence is absent.
SVP-1 runs a three-question internal verification before every substantive output. Is this claim supported by evidence present in this session? Does this response contradict anything established earlier? Is the confidence level proportional to the evidence actually present? If any question produces a failure the response is held.
These are not rubrics. They are standing protocols that do not require reactivation with each new prompt. They fire automatically. They are structural not instructional. They operate below the level of any individual task and above the level of any individual output.
That is the difference between a workaround and a framework.
The Fluency Problem Is Deeper Than Prompting
The article names the core tension correctly. Fluency versus restraint. AI systems built to complete will prioritize completing. They will maintain the smooth forward motion of a coherent response even when the honest move is to stop, qualify, or acknowledge the gap.
This is not a prompting problem. It is an architectural preference baked into the training process. The model learned that completing is rewarded. Stopping is not. Confidence reads as competence. Uncertainty reads as failure.
Rubrics work against that preference by making the instructions explicit enough that the model cannot ignore them. But rubrics require a user who knows to write them, knows what to include, and remembers to include them every time. The moment the rubric is absent — because the user is in a hurry, because they forgot, because they are new to the workflow — the model reverts to its trained preference. Fluency wins. The gap gets filled.
Governance architecture does not depend on the user remembering to invoke it. It is present whether or not the user thinks to activate it. It is the default, not the exception.
That is the standard the industry is working toward without knowing it has a name.
What This Means For Anyone Deploying AI Right Now
If you are using rubrics you are ahead of most users. That is real. The discipline of defining failure behavior before the task begins, of telling the AI explicitly that stopping is permitted, of separating what you want from the rules under which the AI must operate — that thinking is sound and the practice is worth maintaining.
But know what you have and what you do not have.
You have a task-level control mechanism. You have a way to reduce hallucination in a specific prompt for a specific output. That is valuable.
You do not have session governance. You do not have a standing protocol that maintains the evidence standard across a long and complex working session without requiring you to rebuild it each time. You do not have a contradiction detection layer that flags when a response at the end of the session quietly contradicts what was established at the beginning. You do not have a coherence monitor that ensures the goals you stated at the start are still the goals being served at the end.
You have a rubric. The rubric is useful. The rubric is not a framework.
The difference matters most in high-stakes work. Legal analysis. Financial modeling. Medical information. Strategic decisions. Anywhere the cost of a confident wrong answer is high and the gap between what the AI knows and what it sounds like it knows is invisible without a governance layer to surface it.
Deloitte had the budget, the talent, and presumably the prompting sophistication to catch what the AI produced. They did not catch it. The gap between fluency and accuracy is invisible until it costs you something.
A framework makes it visible before it costs you something.
The Industry Is Asking The Right Questions
This is not a criticism of rubric-based prompting. It is a map of where the industry is in its thinking and where it needs to go.
The right questions are being asked. What should AI do when it does not know something? How do we define failure behavior before the task begins? How do we stop the model from filling gaps with narrative that sounds like evidence?
Those are governance questions. The industry is asking them at the prompt level because that is the tool it currently has. The framework that answers them at the session level, the stack level, the architecture level — that work is already done.
It has been done in plain language. In a structure any operator can load and run. Built from inside a real experience of what AI drift looks like and what it costs when nobody is governing the gap.
The industry just discovered rubrics.
The Baseline has had the architecture for eighteen months.
“The Faust Baseline Codex 3.5”
“AI Baseline Governance”
Post Library – Intelligent People Assume Nothing
“Your Pathway to a Better AI Experience”
Purchasing Page – Intelligent People Assume Nothing
Unauthorized commercial use prohibited. © 2026 The Faust Baseline LLC