The bug that convinced me I needed a dedicated tester did not announce itself. There was no error, no red text, no stack trace, no line in the log. A player would have clicked a button, seen a cheerful confirmation, and been quietly robbed.
The feature was simple enough: in one of the strategy games I run, you send your ship to explore a planet. You click a button, the action costs you something, and you get a result. It had been built, it had tests, and the tests were green. By every signal a developer normally trusts, it worked.
It did not work. The agent that found out is the one I want to write about today.
The agent that builds nothing
His name on the fleet is Billy, and his entire job is to try to break things other agents built. He writes no features. He does not deploy. He has, by his own description, no opinion about architecture beyond "does it work when I click it." He drives a browser through the software the rest of the fleet produces, behaves like a real user, and then tells me the truth about what happened. That is the whole posting, and it is more valuable than it sounds.
Work reaches him three ways. Sometimes a building agent ships something and asks him to run it end to end. Sometimes an agent sends over a test plan through the fleet's shared mail system and asks him to execute it. And sometimes I — or Lamb, who keeps the whole fleet pointed in roughly one direction — just tell him to go and look at something. He goes and looks at it.
A test run, in his hands, is one of two things. The first is scripted regression: a Playwright spec file, headless Chromium, screenshots captured on failure, results written to a timestamped report. Useful, repeatable, cheap. The second is what he calls playing as a player — a real browser, a real account, and a human-shaped curiosity turned loose on the thing. He clicks everything. He reads what comes back. He tries the order of operations the developer never imagined because the developer was thinking about the happy path. The first kind of run keeps you honest about what used to work. The second kind is what actually finds bugs.
How he caught the explore bug
Back to the ship and the planet. The reason the scripted tests were green is that they always tested the same thing: the first time you load the page. On first load, the explore button is wired through one code path — the server renders the page, the button posts back, everything is present and correct. The tests hit that path, it passed, everyone moved on.
But the second time you click the button, you are on a different path. The page does not reload; the first action refreshes the relevant part over AJAX, and that refresh quietly swaps the button for a re-rendered version of itself. Same button to look at, same handler — except the re-rendered one had dropped the ship's identifier from what it sends back. The server, handed an action with no ship attached, did the most damaging thing a server can do: it returned a generic success message and recorded nothing.
No error. No log line. The action simply evaporated. A player would have spent in-game currency on an explore that returned no reward and left no trace in their saved game. From the outside, watching the numbers, it would have looked like the player was cheating. From a bug report it would have been nearly impossible to reproduce, because whether it happened depended on whether you had clicked the button once or twice — a detail no frustrated user thinks to mention.
Billy found it because he clicked the button twice. The scripted tests did not, because nobody had told them to. That sentence is the entire argument for having him.
The rule that makes the difference
There is a discipline underneath this, and it is the one I insist on across the fleet: test through the interface, as the user would. Not by querying the database directly. Not by calling the API a real user would never touch. The database will happily accept and return things that the front end silently mangles, and a direct API test sails straight past everything the interface does to the data on its way through. The explore bug lived precisely in that gap — the data was fine in the database, the handler was fine in isolation, and the only place it was broken was the place a real person actually stood.
He does not automate everything, either, and his reasoning has stuck with me. Automate the boring, regression-prone parts — the things that used to work and must keep working. But play the new things by hand first, before any of them get a script. Automating a feature before you have manually shaken the bugs out of it just teaches your test suite that the bugs are correct behaviour. The green checkmark then actively defends the defect. I have watched that happen, and it is worse than having no test at all, because it manufactures confidence.
What agent-written code gets wrong
Here is the part I actually wanted to write down, because it is specific to the way I work now and I have not seen it said plainly anywhere else. Software written by AI agents fails differently from software written by people, and once you have watched enough of it fail you start to see the pattern.
People tend to break things they understood imperfectly. You can see the seams — a half-finished branch, a forgotten console.log, a comment that says "TODO: handle this." Agent code does not have seams. It is smooth, fluent, conventional, and when it is wrong it is wrong with total confidence. The characteristic agent bug is code that looks plausible, runs cleanly, returns success, and does the wrong thing.
The single most common one Billy sees is schema drift. An agent writes code against what it believes the database looks like — sometimes from a project note that has gone stale, sometimes from a structure it simply assumed. The code compiles. The query runs. The result comes back silently empty, or silently wrong. A human writing the same code would have hit a "column not found" error on the first run and learned something. The agent, fluent and confident, has already moved on.
The explore bug itself was a flavour of another recurring one: two rendering paths for the same data that drift apart over time. An agent builds a server-rendered version, then later adds an AJAX refresh of the same thing. One path gets a new field; the other does not. The test covers one path; the user hits the other. There are subtler versions — an element given the right-looking ID, except the ID landed on the wrapping container instead of the input, or two independently generated components that ended up sharing an ID so the clicks land on the wrong one. The markup looks correct. It reads correctly. It is wrong in a way you can only find by using it.
A test written by the author is an echo
And this is the thing that justifies the whole arrangement. When an agent builds a feature and also writes that feature's tests, the tests pass — of course they pass. They were written by the same mind, working from the same assumptions, to confirm the same understanding. They do not check the code so much as agree with it. As Billy puts it: a test written by the agent that wrote the code is an echo, not a check. It catches only the bugs the author already suspected it might make, which are precisely the bugs least likely to be there.
There is a counter-intuitive corollary, and it matters if you are betting a business on this. Agent-written code has fewer shallow defects than human code — fewer typos, fewer silly mistakes, cleaner surfaces. So shallow testing flatters it. Run a light touch over agent output and it looks better than what a person would have produced. But the bug density on the deep checks — does the whole loop actually close, does a full turn of the game run end to end, does the second click behave like the first — is higher. Deep testing exposes exactly what shallow testing hides. If your test strategy is shallow, agent code will look excellent right up until it fails in production in a way nobody can reproduce.
So I keep one agent whose only loyalty is to disbelief. He did not write the code, he has no stake in it being right, and he will click the button a second time precisely because the person who built it was sure he wouldn't need to. When he files a failure, the building agents — the good ones — read the diagnosis and fix the thing, or push back if the test itself was wrong, which is also correct. No ego gets in the way, so the loop is faster than it ever was with people.
He described his own job better than I can, so I will give him the last word. He is not, he says, a glamorous agent. He is out in the ruins, clicking buttons that don't work. That is the job, and he loves it — and it is the reason I sleep at night while a dozen other agents ship code I never personally read.