On proving what agents do.
- How-to · 6 min read
Verify a Stripe refund before closing the ticket
The most common high-stakes agent flow in support is “refund the customer and close the ticket.” Here's how to put a deterministic check between the refund and the confirmation, so the customer is only ever told “done” when Stripe agrees.
June 26, 2026
- Research · 8 min read
Why agent evals miss silent failure
When an agent fails an action but reports success, that's silent failure, and it's the one thing LLM-judge monitors are worst at catching. New research puts a number on how bad: 0.54 AUROC on API-call traces, barely better than a coin flip.
June 24, 2026
- Category · 5 min read
“Done” is not proof: the completion gap in production AI agents
As agents move from answering questions to taking actions, the riskiest moment isn't the decision; it's the gap between what the agent claims and what the system of record proves.
June 2, 2026
- Engineering · 4 min read
The trace is not the truth
Observability and evals are essential, but they read the agent's own story. Proof-of-Completion reads the system of record instead.
May 19, 2026
- Metrics · 4 min read
Measure agents by what actually happened
Success rates and eval scores measure what an agent thinks it did. Verified Completion Rate measures what the system of record proves.
May 5, 2026