A year ago, the hard question about an AI agent was whether it should act. Today, agents are issuing refunds, applying credits, cancelling subscriptions, and closing tickets, and the hard question has moved. It is no longer only whether the agent should act, but whether the action actually happened.
That distinction sounds academic until you watch it fail in production. An agent tells a customer their refund is processed. The trace looks clean: a confident tool call, a 200, a tidy summary. But Stripe has no matching refund. The retry after a timeout quietly created a duplicate. The ticket was closed before payment ever settled. The agent reported success; the business inherited a discrepancy.
The completion gap
We call this the completion gap: the difference between what an agent claims it completed and what the system of record proves actually happened. It is where agent trust quietly breaks, because every layer in the stack is looking somewhere else. Authorization checked permission before the action. Observability recorded the trace of the attempt. Evals scored whether the behavior looked successful. None of them go back afterward and ask the system of record what really changed.
The gap is widest exactly where it hurts most: actions that touch money and customer trust. A wrong “done” on a refund is not a logging problem; it is a chargeback, a reopened ticket, and a broken promise, all created in a single confident step.
Closing it deterministically
The fix is not a better model grading another model. It is Proof-of-Completion: deterministic verification of each postcondition against the system of record, sealed into a signed receipt. Existence, amount, customer, status, duplicates, policy: each is checked where the truth lives, then proven. “Done” stops being a claim and starts being evidence.
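To make the idea concrete, here is a minimal sketch of what such a check could look like for a refund. Everything in it is an assumption made for illustration: the `RefundClaim` shape, the `verify_refund`, `seal_receipt`, and `receipt_is_authentic` helpers, the record field names, and the rule that a charge should carry at most one refund are all hypothetical. An in-memory list stands in for the system of record, and an HMAC stands in for whatever signing scheme a real receipt would use.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RefundClaim:
    """What the agent says it did (hypothetical shape)."""
    refund_id: str
    customer_id: str
    charge_id: str
    amount_cents: int


def verify_refund(claim, ledger, max_refund_cents):
    """Check each postcondition against records from the system of record.

    `ledger` is a list of dicts standing in for the payment provider's
    refund records; the field names are assumptions.
    """
    record = next((r for r in ledger if r["id"] == claim.refund_id), None)
    same_charge = [r for r in ledger if r["charge_id"] == claim.charge_id]
    found = record is not None
    return {
        "existence": found,
        "amount": found and record["amount_cents"] == claim.amount_cents,
        "customer": found and record["customer_id"] == claim.customer_id,
        "status": found and record["status"] == "succeeded",
        # Simplifying assumption: at most one refund per charge.
        "duplicates": len(same_charge) <= 1,
        "policy": claim.amount_cents <= max_refund_cents,
    }


def _canonical(body):
    # Stable serialization so the same body always yields the same bytes.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode()


def seal_receipt(claim, checks, key):
    """Bundle the check results and sign them with HMAC-SHA256."""
    body = {
        "refund_id": claim.refund_id,
        "checks": checks,
        "verified": all(checks.values()),
    }
    signature = hmac.new(key, _canonical(body), hashlib.sha256).hexdigest()
    return {**body, "signature": signature}


def receipt_is_authentic(receipt, key):
    """Recompute the signature and compare in constant time."""
    body = {k: v for k, v in receipt.items() if k != "signature"}
    expected = hmac.new(key, _canonical(body), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, receipt["signature"])


# Usage with made-up identifiers and a demo key.
ledger = [{"id": "re_1", "charge_id": "ch_1", "customer_id": "cus_1",
           "amount_cents": 2500, "status": "succeeded"}]
claim = RefundClaim("re_1", "cus_1", "ch_1", 2500)
checks = verify_refund(claim, ledger, max_refund_cents=10_000)
receipt = seal_receipt(claim, checks, key=b"demo-key")
print(receipt["verified"])  # True
```

The point of the sketch is that nothing in it is a judgment call: every check is a comparison against a record, so the same claim and the same ledger always produce the same receipt. In practice the ledger would be read from the payment provider after the action, and an asymmetric signature may be preferable to a shared HMAC key if third parties need to verify receipts.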