The Wrong Kind of Sameness

I spent a fortnight building a test that was designed to fail, and the day it finally failed for the right reason was the day I trusted it.

The setup was ordinary enough. One of my games has an engine at its core that takes raw inputs — intelligence signals, in the fiction — and turns them into written assessments, the kind of terse, hedged paragraphs a human analyst would produce. I had rebuilt that engine: pulled the logic out of where it had grown up scattered across the game and into a single, central, properly-structured component. Same behaviour, cleaner home. The thing you always have to prove when you do that is parity — that the new engine does what the old one did, that I hadn't silently changed the game while tidying it.

So I did the obvious thing. I ran the old engine and the new engine on the same input and compared the output. They differed. I ran it again; they differed again, differently. By every reading of the test I had just written, the rebuild was a catastrophe — the two engines agreed on almost nothing.

The engines were behaving identically. My test was measuring the wrong thing, with great confidence. Working out why took longer than the rebuild did, and the lesson underneath it is the most useful thing I have learned this year about testing the kind of software I now build all the time.

An engine that varies on purpose

Here is the thing I had failed to take seriously: the engine is non-deterministic by design. It is supposed to produce different output every time you run it, even on identical input, because an analyst who phrases every assessment word-for-word identically isn't an analyst, it's a printer. Variation is the feature. And the variation is not an accident of timing or floating-point drift — it is built in, deliberately, from three separate sources.

The first is that the engine reseeds its own randomness from the system partway through a run, throwing away any seed I set beforehand. So even "run it with the same seed" doesn't pin it down — it un-pins itself on purpose, specifically so you can't make it repeat.

The second is that whether a weak, low-value signal produces any assessment at all is a genuine coin-flip, drawn from a source of randomness I can't control. The odds are deliberately never a certainty in either direction — a borderline signal sometimes gets written up and sometimes doesn't, because that uncertainty is true to the thing being modelled.

The third is a voice layer — whether an analyst hedges, signs off, adds a caveat — that fires on a random gate about a third of the time. Same underlying judgement, different surface manner, so the writing doesn't read like a template.

I measured the result, because I wanted to know exactly what I was up against rather than guess. Thirty runs of the same case, same seed: depending on the case, somewhere between seventeen and thirty distinct outputs. For the genuinely borderline signals, fourteen of thirty runs produced no assessment at all. This is not a flaky system that needs fixing. This is the system working exactly as specified. And a byte-for-byte parity test, pointed at it, would scream failure forever, on a green-lit, perfectly healthy engine.

As Palmer — the agent who built the thing — put it to me when I was still scratching my head: byte-for-byte parity isn't a stricter version of the right test. It's a confident version of the wrong one. The rigour is real; it's just aimed at the wrong target. A test can be precise, repeatable, and completely beside the point.

Comparing decisions, not words

Once you accept that the words are supposed to vary, the question changes shape. The thing I actually care about is not "did it produce the same sentence?" It is "did it make the same decision?" The phrasing can wander all it likes, as long as the judgement underneath — the part the player's experience actually depends on — holds steady.

So we threw out the output comparison entirely and rewrote the contract as a small set of scalar invariants: properties that must come out the same on every valid run, because they are the decision and not the decoration. There are seven of them. The confidence level the assessment lands on. The classification it assigns to the signal. How many conclusions it draws. Whether it cross-references other intelligence. Whether it flags a pattern. Whether it makes a prediction. Which family of title it chooses. Seven numbers and flags, extracted from the prose, that capture what the engine decided while ignoring how it chose to say it.

Then the proof. Over the same thirty runs that produced up to thirty different texts — the runs that broke the naive test every single time — all seven invariants came back identical. Not similar; identical. The words moved and the judgement didn't, which is precisely the property I needed and precisely the one the byte comparison was blind to. The new engine and the old one were making the same decisions and describing them in different words, which is exactly what a faithful rebuild of an analyst should do.

The hard part is what you leave out

If there is one thing I'd want a reader to take from this, it's the part that surprised me most. Defining a parity contract turned out to be mostly an exercise in deciding what not to test.

The temptation, when you build a check like this, is to assert on everything you can get your hands on — more checks feel like more safety. They are not. We deliberately left several tempting things out of the contract, and leaving them out was the actual engineering. We did not assert on the text itself, obviously, because the text varies by design. But we also dropped the paragraph-length fingerprint — short, medium, long — because it sits right on a boundary and tips from one bucket to the next on nothing more than a reworded clause. It flickers without anything being wrong. And we left out the specific values of some of the finer labels, for the same reason: they wobble inside a valid range, and pinning them would be pinning noise.

Every one of those would have been a "passing" check most of the time and a false alarm the rest of the time. And a test that cries wolf is worse than no test at all, because the first time it fails for no real reason, you start ignoring it — and an ignored test is just a comment. The discipline is to be ruthless: seven invariants you trust completely beat fifty that flicker. The properties that aren't genuinely invariant are not extra safety. They are future false alarms you are building, by hand, and shipping to your future self.

Mechanically, the harness that runs all this is dull in the best way: it runs both engines over the same fixtures and compares only those seven scalars. It has a second mode that does the old byte-for-byte comparison too — but that mode only works at all if you first reach in and neutralise all three sources of randomness in both engines, a test-only contortion that makes the system lie still so you can photograph it, and tells you less than the seven numbers do. The structural check is the one I trust. The byte check is there mostly to remind me why I stopped relying on it.

Don't re-bless the gold standard

There is a second discipline tangled up in this, and it's the one I see people get wrong most often, because the wrong move is so much easier than the right one.

When I later made a change that was purely about language — adjusting how the assessments read, not what they decided — it shifted the reference output I'd been comparing against. The reflexive move at that point is to capture the new output as the new "correct" answer. The test goes green, you move on. But think about what that actually does: my reference baseline was still valid, because the change hadn't touched any of the seven invariants — it had only moved words. Re-blessing the baseline would have been overwriting the one thing protecting me, to silence an alarm that hadn't actually gone off. You don't destroy evidence to save thirty seconds. A passing gold standard is an asset; you do not casually overwrite an asset because re-running it is mildly annoying.

And when the rebuild did finally go live, it went in behind a flag — the new central engine as the default, with a single environment switch that flips the whole game back to the old engine instantly, and the originals kept under dated backups. The expensive, frightening change became cheap and reversible. If you can't undo it in one move, you aren't ready to do it.

The shape of testing things that think

I keep coming back to this story because it is not really about one game engine. More and more of what I build is non-deterministic now — anything with a language model in it, anything that samples, anything with real randomness at its heart. The "just diff the output" reflex, which served me for years, breaks completely on all of it, and breaks in the most dangerous way: it fails loudly on systems that are working, which trains you to disable it, which leaves you with nothing.

The move is always the same shape. Stop testing the output. Find the small set of properties that must hold regardless of phrasing or sampling — the decisions, not the words — prove those are stable, and grade on them alone. Let everything else vary, and be disciplined about refusing to assert on the things that only look stable. And — the part that ties all of it together — never settle for a green test you can't explain. A passing test you don't understand is worse than no test, because you can't tell a real pass from a lucky one, and one day the difference will cost you. The whole job, in the end, is knowing which kind of sameness you actually need, and having the nerve to ignore all the other kinds.