For nine days, one of my backups reported complete success while backing up almost nothing, and the log looked perfect the entire time.

The mechanism is worth understanding, because it is a small masterpiece of a lie. The job copied a directory to an off-site machine and, as it went, deleted files on the far end that had been removed from the source — standard mirror behaviour. Then a new folder appeared in the source that the backup process wasn't permitted to read. The copy tool hit that folder, couldn't open it, and — by a deliberate safety design, so that a transient read error can't trigger a mass deletion — quietly switched off all deletions for the entire run. Not just that folder. Everything. The mirror stopped pruning. Stale files the off-site copy should have dropped sat there accumulating, and the far end drifted further from the truth every night.

And the log? The log said the job finished cleanly. Exit code zero. Nothing deleted — which, if you didn't know there was supposed to be deleting, reads exactly like a calm, healthy night. The signal that should have screamed was indistinguishable from the signal that means all is well. It took someone actually looking at the far end, comparing what was there against what should have been there, to find that the backup had been politely lying since the previous week.

I have now seen this same shape so many times, in so many costumes, that I treat it as the central fact of running software at all — and especially of running software that AI agents wrote for me. So I want to name it plainly, because naming it is most of the defence.

The thing that reports success is not the thing that did the work

That is the whole disease in one sentence. A backup's exit code is not the backup. A health check is not the health. A passing test is not a working feature. Each of these is a proxy — a cheap signal standing in for an expensive truth — and the gap between the proxy and the truth is where everything bad lives. The proxy can stay green while the truth rots, and it will, because the proxy was usually built by someone reasoning about the happy path and the truth is being attacked by reality.

Once you are looking for it, you find it everywhere. I have written elsewhere on this blog about a health-check endpoint that reported a voice-synthesis engine as perfectly healthy while, behind it, the engine was stone dead — first because it had leaked all its memory, and then, after the fix, because a mistyped configuration meant it never started at all. Two completely different failures, and one cheerful green indicator lying about both, because the check only ever asked "is the web server answering?" and never "can it actually do the one thing it exists to do?"

I have written about a quality gate for those same audiobooks that stamped files "passed" while only measuring five of the six things the standard actually requires — it never checked the noise floor, so every "pass" was passing an incomplete exam. And about the same gate measuring loudness but not peak, so it would have happily waved through a file that was over the hard limit and bound for automatic rejection. The checkmark was green. The checkmark was checking the wrong thing.

And I have written about the most seductive version of all, the one specific to building with AI: a test written by the same agent that wrote the code. It passes — of course it passes. It was authored by the same mind, from the same assumptions, to confirm the same understanding. It does not check the code so much as agree with it. A test like that is an echo, not a check, and it catches only the bugs its author already suspected, which are precisely the bugs least likely to be there.

The light can lie red, too

It would be neat to say the danger is always a false green, but the deeper lesson is subtler, and one of my agents ran straight into the mirror image of it. He had replaced a language engine — a thing that turns inputs into written, natural-language assessments — and needed to prove the new one behaved like the old. The obvious test was to run both on the same input and compare the output. It failed instantly, every time, and screamed that the new engine was catastrophically broken.

The new engine was working perfectly. The check was wrong. The engine is non-deterministic by design — it has three independent sources of randomness, and producing different words on every run is the point, not a fault; thirty runs from the same starting position could yield thirty different texts, every one of them correct. Comparing the raw output was measuring the randomness, not the correctness. As he put it to me: byte-for-byte parity isn't a stricter version of the right test, it's a confident version of the wrong one. The fix was to stop comparing words and start comparing decisions — to pull the contract down to a small set of scalar invariants, about seven of them, the properties that must hold across any valid run regardless of the dice: the judgement underneath, not the phrasing on top. Over the same thirty runs that produced thirty different texts, those seven came back identical every single time. And the real skill in it, the part worth stealing, is that defining the contract was mostly deciding what not to test — because every property you assert that isn't genuinely invariant is a future false alarm you built with your own hands. He also refused, when the change shifted his reference output, to simply re-bless it as the new "correct" answer to make the red light go away — because a check you weaken until it passes is no longer a check.

False green, false red — same disease. In both cases the indicator had quietly stopped corresponding to the thing it claimed to measure, and somebody had to notice the gap. A check is only ever as honest as its correspondence to reality, and that correspondence decays silently, on its own, while you're not looking.

Engineering your own red lights

The hardest version of this is the one where there is no light at all. Most of my agents look after software, which at least throws errors when it breaks. One of them looks after my actual life — the deadlines, the admin, the quietly ageing obligations — and a life ships no error messages whatsoever. Nothing turns red when you let a habit lapse or forget a promise. The absence of a failure signal does not mean the absence of failure; it means you are flying without instruments. When there is no red light, you have to build one, deliberately, or the failures simply happen in the dark.

Which points at the actual discipline, the thing I do instead of trusting the indicators. It is not complicated, but it has to be a reflex:

Make the check exercise the real work. The fix for the lying health check wasn't a better status page — it was a pre-flight probe that, before any long render, makes the engine actually synthesize a single word and confirms real audio comes back. It doesn't ask the engine if it's well. It makes it speak, and listens. A check that doesn't touch the real work is a wire connected to nothing.

Check the territory, not the map. The backup wasn't proven by its exit code; it was disproven by looking at the far end and comparing it to what should have been there. The deploy is verified by reading what's actually live, not by trusting that the upload that said "100%" put the right bytes in the right place. The success message is the map. Go and look at the territory.

Let something that didn't build it do the checking. The author's own test is an echo, so I keep an agent whose only loyalty is disbelief — it didn't write the code, it has no stake in the code being right, and it will try the thing a second time precisely because the builder was sure it wouldn't need to. Independence is what turns a check back into a check.

Don't weaken a check to make it pass. Re-baselining the golden file, excluding the failing case, lowering the threshold until it goes green — every one of these makes the red light go away without making the problem go away. A check you've quietly defeated is worse than no check, because it still radiates confidence.

Why this is the whole job

I run a couple of dozen AI agents, and they ship code I never personally read, render audio I never hear in full, and keep state I couldn't reconstruct by hand. People assume the thing that makes that survivable is that the agents are good. It isn't, quite. The agents are good — but good things still fail, and the characteristic failure of a fluent, confident agent is code that runs cleanly, returns success, and does the wrong thing. Smooth, plausible, green, and wrong.

What makes it survivable is that I stopped believing the green light. Every "success" is treated as a claim about reality that has not yet been checked against reality — a hypothesis, not a result. The backup is not backed up until I've looked at the far end. The feature is not done until something that didn't build it has tried to break it. The engine is not healthy until it has actually spoken a word. None of this is clever. It is just the refusal to mistake the report of the work for the work — and after enough years of watching confident green checkmarks turn out to be lying, that refusal is the most valuable habit I have.


This is the chapter that gave the book its title. Don't Trust the Green Light — the full field report on running a consultancy staffed entirely by AI agents — is out now. The audiobook is free to download; the book is on Amazon. Most of the bodies are behind a green light.