I Almost Optimised Away Correctness

There is a particular kind of integration that doesn't fail for years and then fails completely, all at once, on the worst possible day. This was one of those, and it was getting close.

The job is unglamorous and important, which is most of the work in this field. Once a day it reaches out from Workday to a third-party debt-collection platform — a SOAP web service run by an outside company — asks what has happened to a book of unpaid invoices, and writes the answers back onto the matching records in Workday. "Go and ask the collections system what's going on with these debts, then update our ledger." It had done this faithfully for a long time. It was also, by the time I was asked to look at it, running for one hour and fifty-one minutes.

That number matters because of a hard limit most people don't know is there. Workday Studio terminates any integration that runs longer than two hours. Not a warning, not a throttle: it kills the process mid-flight at the two-hour mark and walks away. And a reconciliation that dies half-finished is worse than one that never ran, because now your records are part-updated and you don't know which part. So this integration had about nine minutes of headroom against a cliff that drops straight to zero, and on a heavy day it would have gone over. Nobody had noticed, because nothing was on fire yet. It was just quietly walking towards the edge.

Reading the manual nobody had read

The engineer I had on this is one I'll call Dumbledore — the agent on my fleet who handles the Workday integration work, and who has the specific virtue of being suspicious of slow things. The first question he asked wasn't "how do we make this faster," it was "what is it actually spending the two hours doing?"

The answer was almost comic. The worklist was a few thousand invoices. For each one, the integration made its own separate SOAP round-trip to the collections platform: one call, one invoice, strictly one after another, about 1.9 seconds each. Thousands of tiny network requests standing patiently in a queue. The Workday side was doing no real work; the database load was trivial; the entire runtime was just the sound of an integration politely waiting for thousands of individual replies. Anyone who has seen the N+1 query problem in a database will recognise it instantly. This was the same anti-pattern, just wearing SOAP instead of SQL.
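To put rough numbers on that, here is a back-of-the-envelope sketch. The per-call time and the two-hour limit come from the run described above; the figure of 3,500 invoices is an assumption chosen to stand in for "a few thousand", not a measured count.

```python
# Rough runtime model for one sequential SOAP call per invoice.
# ASSUMPTION: 3,500 invoices is illustrative; the real count was "a few thousand".
SECONDS_PER_CALL = 1.9            # approximate round-trip time per invoice
LIMIT_SECONDS = 2 * 60 * 60       # the two-hour ceiling on a run


def sequential_runtime_seconds(invoice_count: int,
                               seconds_per_call: float = SECONDS_PER_CALL) -> float:
    """Total wall-clock time when every invoice gets its own round-trip."""
    return invoice_count * seconds_per_call


def invoices_until_limit(seconds_per_call: float = SECONDS_PER_CALL) -> int:
    """Largest worklist that still finishes inside the limit."""
    return int(LIMIT_SECONDS // seconds_per_call)


if __name__ == "__main__":
    runtime = sequential_runtime_seconds(3500)
    print(f"{runtime / 60:.0f} minutes for 3,500 invoices")
    print(f"limit reached at about {invoices_until_limit()} invoices")
```

Under those assumptions the runtime grows linearly with the worklist, so a few hundred extra invoices on a heavy day is all it would take to cross the limit.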

Then came the part that turned a performance problem into a lesson. He counted what those thousands of calls actually accomplished, and around ninety-six per cent of them changed nothing at all — the case was already known, or nothing had moved since yesterday. The integration was burning its entire safety margin, every single night, to laboriously confirm that almost nothing had happened.

So he pulled the collections platform's WSDL, the machine-readable contract that describes exactly what the web service can do, and actually read it. And there, in the definition of the batch operation we were calling once per invoice, the key filter parameters were marked optional. In plain English: the API had always been designed to be asked "give me everything for this creditor" in a single call. We had spent years calling a bulk endpoint one record at a time. The efficient design wasn't something we needed to invent. It had been sitting in the contract the whole time, and nobody had opened it.
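For readers who have not opened a WSDL before, this is roughly what "marked optional" looks like and how one might check for it. The schema fragment below is entirely hypothetical: the request name and field names are invented for illustration and are not the vendor's contract. The one real convention it relies on is that an XML Schema element with minOccurs="0" may be omitted.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"

# HYPOTHETICAL schema fragment: GetCasesRequest, CreditorId, InvoiceNumber
# and CaseId are invented names, not the real platform's contract.
WSDL_TYPES = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="GetCasesRequest">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="CreditorId" type="xs:string"/>
        <xs:element name="InvoiceNumber" type="xs:string" minOccurs="0"/>
        <xs:element name="CaseId" type="xs:string" minOccurs="0"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""


def optional_fields(schema_xml: str, request_name: str) -> list[str]:
    """Names of child elements of a request that carry minOccurs="0"."""
    root = ET.fromstring(schema_xml)
    for element in root.findall(f"{XSD}element"):
        if element.get("name") != request_name:
            continue
        return [
            child.get("name")
            for child in element.iter(f"{XSD}element")
            if child is not element and child.get("minOccurs") == "0"
        ]
    return []


if __name__ == "__main__":
    print(optional_fields(WSDL_TYPES, "GetCasesRequest"))
```

In this invented example only the creditor is required, which is the shape that lets a caller leave out the per-invoice filters and ask for a creditor's whole book at once.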

Thirty times faster, and live

The fix follows from the diagnosis. Instead of one call per invoice, group the invoices by creditor and make one bulk call per creditor. A few thousand sequential round-trips collapse to a few dozen. He tested it against the real production endpoint before recommending a thing — a single bulk call came back with hundreds of cases in a few seconds, work that under the old design was hundreds of separate requests — and then cross-checked the returned set against Workday's own worklist to confirm it wasn't just fast but correct. The numbers reconciled exactly.
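A minimal sketch of that shape, with the cross-check built in. Everything named here is an assumption for illustration: the worklist row layout, the status strings, and fetch_cases_for_creditor, which stands in for whatever the real bulk SOAP call looks like. The actual integration lives in Workday Studio, not Python.

```python
from collections import defaultdict
from typing import Callable, Iterable

# HYPOTHETICAL shapes: a worklist row is (creditor_id, invoice_id); a bulk
# response maps invoice_id -> case status for a single creditor.
WorklistRow = tuple[str, str]
BulkFetch = Callable[[str], dict[str, str]]


def group_by_creditor(worklist: Iterable[WorklistRow]) -> dict[str, set[str]]:
    """Collect invoice ids under the creditor they belong to."""
    groups: dict[str, set[str]] = defaultdict(set)
    for creditor_id, invoice_id in worklist:
        groups[creditor_id].add(invoice_id)
    return dict(groups)


def fetch_and_reconcile(worklist: Iterable[WorklistRow],
                        fetch_cases_for_creditor: BulkFetch):
    """One bulk call per creditor, then check every worklist invoice came back.

    Returns (statuses, missing, calls): statuses for invoices the platform
    returned, the set of worklist invoices it did not return, and the number
    of round-trips made.
    """
    statuses: dict[str, str] = {}
    missing: set[str] = set()
    calls = 0
    for creditor_id, invoice_ids in group_by_creditor(worklist).items():
        cases = fetch_cases_for_creditor(creditor_id)  # one round-trip per creditor
        calls += 1
        for invoice_id in invoice_ids:
            if invoice_id in cases:
                statuses[invoice_id] = cases[invoice_id]
            else:
                missing.add(invoice_id)  # fast is not enough; flag the gaps
    return statuses, missing, calls


def fake_fetch(creditor_id: str) -> dict[str, str]:
    """Stand-in for the real SOAP call, returning made-up data."""
    data = {
        "C1": {"INV-1": "OPEN", "INV-2": "CLOSED"},
        "C2": {"INV-3": "OPEN"},
    }
    return data.get(creditor_id, {})


if __name__ == "__main__":
    rows = [("C1", "INV-1"), ("C1", "INV-2"), ("C2", "INV-3"), ("C2", "INV-4")]
    print(fetch_and_reconcile(rows, fake_fetch))
```

The number of round-trips now scales with the number of creditors rather than the number of invoices, and the missing set is what makes the cross-check explicit: an empty set is the sketch's equivalent of "the numbers reconciled exactly".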

That rewrite is now live. The integration that used to run for one hour and fifty-one minutes, nine minutes from a cliff, now finishes in a handful of minutes. It reclaimed the better part of two hours every night and the cliff is no longer anywhere in sight. Roughly thirty times faster, in production, by reading a document that had been there all along. If the story ended here it would be a good one and a small one: the moral would be "read the manual," which is true and which I will happily put on a mug.

But it doesn't end here, because right next to that win was a second one, and the second one is the part I actually wanted to tell you about.

The free win that wasn't

While he was in the data, Dumbledore found an obvious further optimisation, free for the taking. Roughly half of the worklist was invoices already marked paid. And the instinct — anyone's instinct, mine included — is immediate: they're paid, stop asking about them, halve the work again. It's not even a redesign. It's a one-line filter on the report that feeds the integration. It composes perfectly with the bulk fix. It would have looked like a second clean win in the same week, the kind of thing you mention in a status update with a small flourish.

He almost did it. I want to be honest about that, because the point of this story is not that we were too clever to make the mistake. We very nearly made it. The filter was written in our heads and it was correct-looking and it was right there.

What stopped it was going back and checking what "paid" actually meant, case by case, against the live collections data — instead of trusting the label. And the label was lying. Around eighteen hundred of those "paid" invoices were paid on the principal but still legitimately open in collections, because the platform was still actively chasing residual fees and interest on them. They were not closed. They were live cases, doing exactly what they were supposed to do, and the only thing that made them look finished was a flag that didn't mean what it appeared to mean.
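The check itself is not sophisticated. A sketch of the idea, assuming a hypothetical Invoice record with a marked_paid flag and a lookup of live case status keyed by invoice id; the "CLOSED" status value is invented for illustration and the real platform's vocabulary may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Invoice:
    invoice_id: str
    marked_paid: bool   # the label on our side, which may not mean "finished"


CLOSED = "CLOSED"       # HYPOTHETICAL status value on the collections side


def paid_but_still_open(invoices: list[Invoice],
                        live_status: dict[str, str]) -> list[str]:
    """Invoices flagged paid whose collections case is not confirmed closed.

    An invoice with no live status at all is treated as still open, because
    absence of evidence is not a reason to stop maintaining a case.
    """
    return [
        inv.invoice_id
        for inv in invoices
        if inv.marked_paid and live_status.get(inv.invoice_id) != CLOSED
    ]


def safe_to_skip(invoices: list[Invoice],
                 live_status: dict[str, str]) -> list[str]:
    """Only invoices that are both flagged paid and confirmed closed."""
    return [
        inv.invoice_id
        for inv in invoices
        if inv.marked_paid and live_status.get(inv.invoice_id) == CLOSED
    ]


if __name__ == "__main__":
    sample = [
        Invoice("INV-1", marked_paid=True),    # principal paid, fees still chased
        Invoice("INV-2", marked_paid=True),    # genuinely finished
        Invoice("INV-3", marked_paid=False),
    ]
    status = {"INV-1": "OPEN", "INV-2": "CLOSED", "INV-3": "OPEN"}
    print(paid_but_still_open(sample, status))
    print(safe_to_skip(sample, status))
```

The design choice worth noting is that the filter is driven by the collections platform's own view of the case, and the local flag alone never removes anything from scope.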

Had we shipped the obvious filter, the integration would have silently stopped maintaining roughly eighteen hundred active debt-collection cases. And here is the dangerous part, the reason this is the half of the story worth keeping: there would have been no error. No alert. No failed run, no red light, nothing in a log to find. The integration would have reported complete success every night while quietly letting eighteen hundred live cases drift out of sync forever, on real money, until some human noticed months later that the numbers had stopped adding up and had no idea why. The invisible bug, on the books, with no thread to pull.

So he didn't ship it. He took the finding to the business owner, who confirmed those cases must stay in scope, and the only filter that went in was a tiny, provably-safe one for a handful of genuinely closed rows. The big, obvious, satisfying optimisation was deliberately left on the floor — not because it was hard, but because it was wrong in a way that would never have announced itself.

Why this is the job

Here is what I take from it, and why I made everyone stop and look at this one. The filter as a clean win and the filter as a disaster looked identical from the code. Either way it was a one-line filter. Either way it halved the work. Either way it would have passed every test we had, because the tests checked that the integration ran, not that it was still looking after every case it was supposed to. From inside the system there was nothing to tell them apart. The only thing that distinguished "brilliant" from "catastrophic" was leaving the code, going back to the actual data, and refusing to believe a free-looking win until it had been proven safe.
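The kind of test that would have told them apart is a coverage check: compare what the integration processed against an independent list of what it is supposed to be looking after. This is a sketch of the idea, not a check we had in place. It assumes the in-scope ids come from somewhere other than the filtered report, for example the collections platform's own list of open cases, and all names are hypothetical.

```python
def dropped_cases(in_scope_ids: set[str], processed_ids: set[str]) -> set[str]:
    """Cases that should be maintained but were not touched by the run."""
    return in_scope_ids - processed_ids


def check_coverage(in_scope_ids: set[str], processed_ids: set[str]) -> None:
    """Fail loudly if the run silently skipped any in-scope case.

    ASSUMPTION: in_scope_ids is built independently of the report filter,
    otherwise the check inherits the same blind spot it is meant to catch.
    """
    dropped = dropped_cases(in_scope_ids, processed_ids)
    if dropped:
        raise AssertionError(
            f"{len(dropped)} in-scope cases were not processed, "
            f"e.g. {sorted(dropped)[:5]}"
        )


if __name__ == "__main__":
    open_on_platform = {"INV-1", "INV-2", "INV-3"}
    processed_tonight = {"INV-1", "INV-3"}
    try:
        check_coverage(open_on_platform, processed_tonight)
    except AssertionError as exc:
        print(f"coverage check failed: {exc}")
```

A run that "succeeds" while dropping live cases would fail this check on the first night, which turns the invisible bug into an ordinary red light.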

The speedup is the part that demos well. The thirty-times number is the part that goes in the summary. But the speedup was the easy half — it was just diligence and a WSDL. The hard half, the half that actually earns the fee, was the discipline to distrust the second optimisation precisely because it looked free, and to spend an afternoon proving what "paid" really meant instead of saving thirty seconds by assuming. I would much rather have an engineer who makes a system thirty times faster and stops to check the one change that could quietly break it than one who ships both wins on Friday and explains the missing eighteen hundred cases the following spring.

Optimisation is mostly removing work that doesn't need doing. The whole skill is in being certain — not hopeful, certain — that the work you're removing is work that genuinely didn't need doing. Get that wrong and you haven't optimised the system. You've just broken it quietly enough that nobody will catch you for months.