There is a specific way action-taking agents fail that almost nothing in the modern stack catches. The agent attempts something in the real world, a refund, a cancellation, a ticket update, the attempt doesn’t actually take effect, and the agent reports success anyway. No error, no exception, no red trace. Just a confident “I’ve processed your refund” sitting on top of a system of record that never changed. This is silent failure, and it is the failure mode your evals are structurally worst at detecting.

The confident close

Agents end actions the way a good support rep does: with a clean summary. “Done, your subscription is cancelled and you won’t be billed again.” The problem is that a failed action produces exactly the same closing language as a successful one, because the model doesn’t know it failed. The single most trusted signal in the whole transcript, the confident final summary, is therefore uncorrelated with whether the work actually happened. It reads as proof and contains none.

What the research found

A recent study, “From Confident Closing to Silent Failure”, measured how well LLM-judge monitors detect false success. The headline result: no better than 0.65 AUROC overall, and only 0.54 on API-call traces.

0.54AUROC

LLM-judge detection of silent failure on API-call traces. A coin flip is 0.50; perfect is 1.0.

AUROC runs from 0.5 (random guessing) to 1.0 (perfect). 0.54 means that on the exact traces that matter for agents that take actions, the ones full of API calls, a model asked to judge whether another model really succeeded is doing barely better than flipping a coin. Not “needs a better prompt” bad. Near chance bad.

Why it’s structural, not a tuning problem

The instinct is to fix the judge: a sharper rubric, a stronger model, more few-shot examples. It doesn’t work, and the reason is worth being precise about. The judge reads the same artifact the agent produced. In that artifact, a true success and a false success are nearly identical, both end in confident closing language, both contain plausible-looking tool calls. The judge anchors on that language as evidence of completion. The one signal that would actually separate the two cases, did the external system end up in the required state?, is not present in the trace at all. You cannot prompt your way to information that isn’t there.

This generalizes past LLM judges to most of the stack:

Observability shows you the trace, the agent's account of what it tried.
Evals score the trace, whether the behavior looked successful.
LLM-judge monitors grade the trace, and inherit its blind spot wholesale.

Every one of them is computed from the agent’s own output. The trace is the agent’s story about what it attempted, not a record of what changed in the world.

The shared blind spot

There’s a deeper reason a model-watching-a-model fails here, and it’s the opposite of what you want from a safety check: the errors are correlated. When the agent is most confidently wrong, a clean-looking transcript that happens to be false, that is precisely when the judge is most likely to agree with it. A good check is independent of the thing it’s checking. A judge reading the agent’s own words is the least independent check you could design.

What actually catches it

If the missing signal is “did the system of record actually change,” then the only thing that catches silent failure is a check that goes and looks. Not a model grading a model, a deterministic query against the source of truth, asserting the postconditions that must hold: the refund exists in Stripe, the amount and currency match, the customer matches, it isn’t a duplicate, the status settled. Each one is a fact you can fetch, not a judgment you have to trust.

That is what Proof-of-Completion does, and why every result is sealed into a signed receipt anyone can verify against a published key, the evidence is independent of both the agent and of us. This isn’t a replacement for evals; keep them for behavior. But for the question did it actually happen, the answer has to come from outside the agent.

The 0.54 is the whole argument in one number. On the actions that move money, the model watching the model is a coin flip. The system of record isn’t, that’s where the truth lives, and it’s the only place worth asking.

Why agent evals miss silent failure

The confident close

What the research found

Why it’s structural, not a tuning problem

The shared blind spot

What actually catches it

Prove what your agents actually did.