The article on LLM poetry and “greatness” circles a question most people are asking sideways instead of head-on.

Can AI write great poetry?

The honest answer is: it can write convincing poetry. It can write competent poetry. It can even write lines that feel startling or beautiful in isolation. But greatness—real, durable greatness—keeps slipping out of reach.

That failure is not accidental. And it is not about creativity.

It is about judgment.

Craft Was Never the Problem

The article gets one thing right immediately: technique is no longer the bottleneck.

Modern language models can handle:

  • meter
  • rhyme
  • formal constraint
  • metaphor
  • turn and closure

They can imitate almost any poetic surface that has been digitized. If greatness were only a matter of skill, we would already be done.

But anyone who has spent time with poetry knows this: craft is the entry fee, not the destination.

The Real Test: Particular → Universal

The author’s definition of greatness is old, durable, and correct.

A great poem begins in a specific life, in a specific place, inside a specific culture—and somehow reaches beyond it. The universal emerges because the poem refuses to float free of its particulars.

This is where AI poetry consistently falters.

Not because it lacks vocabulary.
Not because it lacks training data.
But because it lacks position.

Pattern Is Not Position

Language models work by moving from general pattern toward manufactured detail. Even when they produce specificity, it is usually decorative rather than earned.

They can describe:

  • a street
  • a face
  • a ritual
  • a historical reference

But they do not naturally carry what matters most: stake.

A great poem costs something to write.
It risks being wrong.
It risks being embarrassing.
It risks being misunderstood.

That risk shapes the language. Readers feel it even if they can’t name it.

Models don’t risk anything unless the system forces them to.

Why Gwern’s Experiments Matter

Gwern’s work stands out because he doesn’t treat the model as an oracle. He treats it as a workshop apprentice.

His process matters more than any single poem:

  • analysis before generation
  • multiple divergent drafts
  • explicit critique
  • role separation
  • revision across time

That is not automation. That is authorship distributed across tools.
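
Made concrete, the loop looks something like the Python sketch below. Everything in it is an assumption for illustration: generate stands in for whatever model call you use, and none of the prompts or names are Gwern's actual tooling.

    # A minimal sketch of a draft-critique-revise workshop loop.
    # `generate` is a placeholder for any LLM call; all prompts and
    # names here are illustrative assumptions, not Gwern's tooling.

    def generate(prompt: str) -> str:
        raise NotImplementedError("plug in your model call here")

    def workshop(theme: str, n_drafts: int = 4, rounds: int = 3) -> str:
        # Analysis before generation: name the problem first.
        brief = generate(
            f"Before writing, analyze what a poem about '{theme}' must do. "
            "Name the constraints that matter and the cliches to refuse."
        )
        # Multiple divergent drafts, not one answer.
        drafts = [
            generate(f"Brief:\n{brief}\n\nWrite draft #{i + 1}, taking a "
                     "different approach from any previous draft.")
            for i in range(n_drafts)
        ]
        # Selection stays with the human; drafts[0] is only a stand-in.
        poem = drafts[0]
        # Explicit critique, with the critic as a separate role,
        # and revision across time rather than a single pass.
        for _ in range(rounds):
            critique = generate(f"As a harsh editor, critique this poem "
                                f"line by line:\n{poem}")
            poem = generate(f"Revise the poem to answer this critique.\n\n"
                            f"Poem:\n{poem}\n\nCritique:\n{critique}")
        return poem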

In other words, the human is still holding the line on judgment.

Gwern is supplying:

  • the problem worth solving
  • the constraints that matter
  • the sense of taste
  • the refusal to accept the first answer

The model supplies leverage. Not authority.

That distinction is everything.

Why Mercor Is Aiming at a Different Target

Mercor’s approach is not wrong—but it is aimed elsewhere.

They are optimizing for:

  • preference alignment
  • consistency
  • professional acceptability
  • scalable judgment

Poetry is being used as a stress test for taste, not as an end in itself.

The problem is structural: when you reward what most experts agree is “good,” you inevitably smooth away the edge cases. But in poetry—and in judgment more broadly—the edge cases are often where truth lives.

A rubric can penalize clichés.
A rubric can reward coherence.
A rubric can score technique.

A rubric cannot detect earned necessity.

That’s not a failure of poetry. It’s a limit of optimization.
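
To see the limit concretely, imagine the rubric as code. Every feature below is invented for illustration; what matters is the shape of the feature list.

    # A toy rubric scorer. All features are invented for illustration;
    # the point is what the feature list cannot contain.
    CLICHES = {"heart of gold", "time heals", "silver lining"}

    def rubric_score(poem: str) -> float:
        text = poem.lower()
        lines = [l for l in text.splitlines() if l.strip()]
        # Penalize cliches: easy.
        cliche_penalty = sum(phrase in text for phrase in CLICHES)
        # Reward "coherence", crudely proxied by words shared across lines.
        words = [set(l.split()) for l in lines]
        coherence = sum(len(a & b) for a, b in zip(words, words[1:]))
        # Score "technique", crudely proxied by line-length regularity.
        technique = sum(6 <= len(l.split()) <= 11 for l in lines)
        # No feature can be added here for earned necessity: whether these
        # words had to be written, by this speaker, from this position.
        return coherence + technique - 3.0 * cliche_penalty

Any feature that can be written down can be optimized, and anything optimized gets smoothed toward the median. Necessity never makes it onto the list.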

This Is Where the Baseline Enters Quietly

What the article is really diagnosing—without naming it—is the same failure mode the Baseline was built to prevent.

The Baseline starts from a different assumption:

Judgment cannot be trusted unless it is anchored to position, constraint, and consequence.

Not preference.
Not popularity.
Not satisfaction scores.

Position means:

  • Who is speaking
  • From where
  • Under what constraints
  • With what responsibility

Without that, outputs drift toward median comfort. That’s true in poetry, law, medicine, arbitration, and governance.

The Inversion That Matters

Most AI systems move like this:

General pattern → Manufactured particular → User approval

The Baseline reverses the flow:

Concrete situation → Defined role → Named constraints → Reasoned judgment → Limited generalization

That inversion is the difference between imitation and responsibility.
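
As a rough picture of that reversal (the JudgmentRequest type and its fields are hypothetical, not the Baseline's actual schema), the system refuses to generate until position is supplied:

    # A hypothetical picture of the inverted flow: no position, no output.
    # The type and field names are illustrative, not the Baseline's schema.
    from dataclasses import dataclass

    @dataclass
    class JudgmentRequest:
        situation: str          # the concrete case comes first
        role: str               # who is speaking, and from where
        constraints: list[str]  # named, auditable limits
        responsibility: str     # what the speaker answers for

    def build_prompt(req: JudgmentRequest) -> str:
        # Refuse to reason from pattern alone.
        for name in ("situation", "role", "responsibility"):
            if not getattr(req, name).strip():
                raise ValueError(f"missing {name}: judgment is not anchored")
        if not req.constraints:
            raise ValueError("no named constraints: output drifts to the median")
        # Only now is a model invoked, scoped to this position.
        return (f"Situation: {req.situation}\n"
                f"Role: {req.role}\n"
                f"Constraints: {'; '.join(req.constraints)}\n"
                f"Responsibility: {req.responsibility}\n"
                "Reason from this position only. Generalize last, and narrowly.")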

It is also why the Baseline resists “traction” as a primary metric. Traction measures comfort. It does not measure correctness, truth, or durability.

Culture Is Not the Missing Ingredient — Stake Is

The article says LLMs lack culture. That’s partially true, but it misses the sharper point.

Models can absorb cultural artifacts.
They can reference history.
They can reproduce symbolic systems.

What they lack is stake.

The Baseline does not try to give AI culture. That would be fantasy.

Instead, it temporarily loans the system a position:

  • a role
  • a boundary
  • an obligation
  • an audit trail

That loan can be revoked. That is what makes it safe.
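
One way to picture the loan (the class below is hypothetical, not the Baseline's implementation):

    # A hypothetical picture of a revocable position loan: role, boundary,
    # obligation, and an audit trail, with revocation built in.
    from datetime import datetime, timezone

    class PositionLoan:
        def __init__(self, role: str, boundary: str, obligation: str):
            self.role = role
            self.boundary = boundary
            self.obligation = obligation
            self.audit: list[str] = []  # every act leaves a trace
            self.active = True

        def _log(self, event: str) -> None:
            stamp = datetime.now(timezone.utc).isoformat()
            self.audit.append(f"{stamp} {event}")

        def act(self, output: str) -> str:
            # A revoked loan produces nothing: no position, no output.
            if not self.active:
                raise PermissionError("loan revoked: no position, no output")
            self._log(f"acted as {self.role} within {self.boundary}")
            return output

        def revoke(self, reason: str) -> None:
            self.active = False
            self._log(f"revoked: {reason}")

The revocation is the safety property: the system never owns the position, it only borrows it, and the audit trail records every act performed under the loan.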

Why Poetry Is the Canary

Poetry exposes this problem early because it compresses everything:

  • judgment without a single correct answer
  • long-range coherence
  • emotional truth
  • cultural embedding
  • resistance to averaging

If a system fails here, it will fail quietly elsewhere—just with higher stakes.

The article gets this right without saying it outright.

The Quiet Conclusion

If AI ever produces something we are comfortable calling “great,” it will not be because models got bigger.

It will be because:

  • position was enforced
  • consequence was acknowledged
  • selection was human
  • revision was slow
  • and judgment was allowed to resist optimization

That is not a poetry lesson.

That is a governance lesson.

And it happens to be the ground the Baseline has been standing on all along.

