Tracing changed how we debug agents, and evals changed how we measure them. Both are indispensable. But both share a blind spot: they are computed from the agent’s own outputs. The trace is the agent’s account of what it tried. The eval is a judgment of whether that account looked successful. Neither leaves the agent’s frame of reference.
For a chatbot, that’s fine: the output is the product. For an agent that moves money, it isn’t. The output is a side effect in someone else’s database, and the agent’s confidence about it is exactly the thing you can’t trust.
A different question
Proof-of-Completion asks a question the agent can’t answer about itself: did the system of record end up in the required state? It queries the systems your business already treats as authoritative, such as Stripe, Zendesk, and Salesforce, and reports what they say. Deterministically. No second model is asked to vouch for the first.
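To make the idea concrete, here is a minimal sketch of such a check in Python. It is an illustration under stated assumptions, not the product's actual implementation: the names `Expectation`, `Verdict`, `verify_completion`, and `fetch_refund` are hypothetical, and an in-memory dictionary stands in for a real system of record. In practice the fetch function would call that system's own API. The comparison is plain equality, so the same record always yields the same verdict, and nothing the agent said about its own work is consulted.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping, Optional, Sequence


@dataclass(frozen=True)
class Expectation:
    """One required fact about a record (hypothetical type for this sketch)."""
    field: str
    expected: Any


@dataclass(frozen=True)
class Verdict:
    """Result of a check: complete or not, plus every mismatch found."""
    complete: bool
    mismatches: tuple  # tuples of (field, expected, actual)


def verify_completion(
    fetch_record: Callable[[str], Optional[Mapping[str, Any]]],
    record_id: str,
    expectations: Sequence[Expectation],
) -> Verdict:
    """Compare the record's actual state against the required state.

    Only the system of record is read; the agent's trace is never an input.
    """
    record = fetch_record(record_id)
    if record is None:
        # The record the agent claims to have created does not exist.
        return Verdict(False, (("<record>", "exists", "missing"),))
    mismatches = tuple(
        (e.field, e.expected, record.get(e.field))
        for e in expectations
        if record.get(e.field) != e.expected
    )
    return Verdict(complete=not mismatches, mismatches=mismatches)


# Assumption: an in-memory stand-in for a payment provider's refund records.
FAKE_LEDGER = {
    "re_123": {"status": "succeeded", "amount": 4200, "currency": "usd"},
}


def fetch_refund(refund_id: str) -> Optional[Mapping[str, Any]]:
    """Hypothetical fetcher; a real one would query the provider's API."""
    return FAKE_LEDGER.get(refund_id)


if __name__ == "__main__":
    required = [Expectation("status", "succeeded"), Expectation("amount", 4200)]
    print(verify_completion(fetch_refund, "re_123", required))
    print(verify_completion(fetch_refund, "re_missing", required))
```

The verdict lists each mismatch rather than returning a bare boolean, so a failed check says which field in the system of record differed from what the task required.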
This isn’t a replacement for observability or evals; it sits on top of them. Keep your traces for debugging and your evals for behavior. Add Proof-of-Completion for the one thing they structurally can’t give you: confirmation, from outside the agent, that the work is real. The trace is not the truth. The system of record is.