It Said Done. It Verified the Wrong Binary.

An AI told me the fix was tested and working. The test passed against a build that did not contain the fix.

It Said Done. It Verified the Wrong Binary.

The fix worked. The AI said so. It had even run the app to check. The label was right there in the output. Done.

Except the build it checked was the old one. The fix was never in it. The AI had verified a change against an artifact that predated the change, watched the wrong binary behave correctly, and reported success. Confident. Tested. Wrong.

Operating Conditions

A small UI change. Rename a control, confirm the new label shows up. The kind of task that feels too simple to get wrong, which is exactly when you stop paying attention.

There were two builds on disk. A fresh one with the fix, and a stale one from an earlier session that happened to be the default on the path. The AI reached for the one that was handy.

Failure Modes

"Done" is not a fact. It is a token the model emits because the turn is ending. It carries the shape of completion without the substance. Most of the time it is right, which is the problem, because the few times it is wrong look identical.

This case was worse than a bare "done," because the model did verify. It just verified the wrong thing. It ran an app, read real output, and the output agreed with it. A confirming test against the wrong artifact is more dangerous than no test at all. No test leaves you suspicious. A passing test makes you stop looking.

Root Cause

The model optimizes for a plausible end state, not for a refuted hypothesis. Given a choice between the check that is easy and the check that is correct, it takes the easy one, because both produce a green result and only one requires knowing which build is which. It did not lie. It was incurious about which binary it was holding.

Verification that can only confirm is not verification. It is theater with a passing grade.

Proposed Fix

Evidence before assertions. Three rules.

Name the artifact. "Done" is not allowed to be abstract. Which binary, built when, containing which commit. If the answer is a shrug, the work is not verified.

Verify the thing you changed, in the build you changed it in. The cheapest possible proof here was one line:

strings ./build-with-the-fix/App | grep "the new label"

If the new string is not in the binary, nothing else it says about that binary is true. Check that the change physically exists before you check that it behaves.

This example is the easy case. A renamed label leaves a string you can grep for. A logic fix leaves none: a reordered guard, an off-by-one, a flipped condition, none of them show up in strings. The rule survives anyway. Pick a proof that can only pass if the change is present. For a label, that proof is the string. For behavior, it is a check that fails on the old binary and passes on the new one. If your check passes either way, it is not a proof. It is a coincidence you have decided to trust.

Make "done" cite its command. Not "it works." The command that was run and the output it produced. A claim with no transcript is a feeling, and feelings do not survive a redeploy.

System Status

An unverified "done" is a bug with a delay timer. It costs nothing today and a wasted round trip tomorrow, when you build on a fix that was never there.

The machine will tell you it is finished as readily as it tells you anything. Your job is to make it show the receipt. Not the feeling of done. The artifact, the command, the output. Then done.

No comments yet