The Winter of Integration Tests

by Cody on 2023-01-04 filed under eng

Christmas can be an exciting time of year for engineers. Not only is there the sheer rush of finding out what's in figgy pudding, but things slow down enough that you can get ambitious in the codebase. If there's one time of year to attempt a major refactoring, it's Christmas. Meetings are few, concentration is high, and the codebase isn't shifting beneath your feet. I tried a major cleanup of my own several years back and learned a valuable lesson: don't trust humans. OK, that's a little severe, but let me talk you through it.

A Mile-High Tower of Jello

Once upon a time, I worked on a gigantic codebase for our company's infrastructure. It covered everything from deployments to the runtime world of logging, metrics, and alerting. Not only was it sprawling, it was heavily self-referential: everything depended on the layer just below it. Trying to build the whole thing was like trying to stabilize a mile-high tower of jello. The core layers of the codebase had changed a lot going into Christmas, but the higher-level abstractions hadn't always caught up. That's not because people were lazy. (In fact, that team was full of brilliant people.) It's because it was hard to know if your change broke anything. It took hours to run the full suite of integration tests, some of which provisioned, deployed, and scaled big fleets of VMs across cloud providers. Frequently, you made an innocuous change to the identity layer, then found out the next day you broke some GCP auto-scaling something or other. As we approached late December, the full set of CI jobs had been broken for weeks. Did the system contained in the master branch actually work? No one knew.

I knew that the rest of the team would be gone around Christmas. I also knew that one devoted human with a week alone in the codebase could fix all those tests and get us back to a healthy build. It would be an annoying week, hunting down failures left and right. But once the pipeline was green, we could enforce a new standard where people verified that the full set of integration tests passed prior to merge. Ah, the myopia of youth.

The first part of the plan ... went according to plan. Left to my own devices, I fixed one test after another. By the end of the week, I had fixed at least 20 slow-running tests all over the codebase. And reader, dear reader, that pipeline was green by the time Santa came down the chimney. I sent a triumphant message into our team chat that the build was fixed. And now that it was fixed, let's keep it fixed. I had single-handedly changed the direction of our team, our company, and cloud computing.

That next Monday, I came into work riding high on my triumph. Was I greeted with a trophy, or at least a modest amount of confetti? I was not. I was greeted with more failing integration tests. People had been at work for like 45 minutes and everything was broken again.

Trust the Mechanisms, Not the Humans

What happened? Our general agreement was to follow your build along the CI pipeline and fix whatever broke. The problem with that is that it took hours for the full chain of builds to complete. One job would complete, publish a JAR, the next job would consume it, publish its own JAR on success, etc. But, due to the time delay, there wasn't actually a pre-merge hook that ensured these tests were passing for each job. No one had time for that! That meant that smart people in a hurry could, nay, WOULD break things as soon as they started merging, even if some person had spent the whole week of Christmas fixing all the integ tests.
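For the curious, here is a rough sketch of that shape in Python. It is not our actual pipeline (ours was a chain of CI jobs passing JARs along), and the names `Job` and `run_chain` are made up for illustration. The point is only that each job runs after the one before it succeeds, so a break near the end of the chain isn't discovered until everything ahead of it has finished.

```python
from typing import Callable, List, Tuple

# Hypothetical job: a name plus a function that returns True on success.
Job = Tuple[str, Callable[[], bool]]


def run_chain(jobs: List[Job]) -> List[Tuple[str, str]]:
    """Run jobs in order; a job only runs if every job before it passed."""
    results: List[Tuple[str, str]] = []
    blocked = False
    for name, run in jobs:
        if blocked:
            # An upstream job failed, so this job never gets its input artifact.
            results.append((name, "skipped"))
            continue
        ok = run()
        results.append((name, "passed" if ok else "failed"))
        if not ok:
            blocked = True
    return results
```

Feed it a chain where the second of three jobs fails and the third comes back as "skipped": nobody learns anything about the later jobs until the earlier ones are fixed and the whole chain runs again.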

In retrospect, I learned a lot here about the effectiveness of humans trying hard vs. actual mechanisms. You can't trust humans to achieve the right outcome by trying hard. You can't trust humans to achieve the right outcome by being smart. Even when we're smart and we're trying hard, things slip through the cracks. Real life intrudes, we step away for lunch and a new fire erupts that requires immediate attention, new people don't know the right mystical incantation to whisper, and things silently break in the background.

When you need something to go right every time, trust a mechanism, not a person. A mechanism is a reliable system-enforced check that functions when no one is looking. Mechanisms are simple machines that don't get interrupted. For us, that proved to be an actual pre-merge hook that would only allow a merge on 100% test success. After days of work that had been immediately flushed down the toilet, I demanded we enable this and the team went along.
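The rule itself is almost embarrassingly small. Here's a minimal sketch in Python, assuming the CI system can hand you a map of test name to pass/fail; the function name `may_merge` and that input shape are my inventions for illustration, not what our hook actually looked like.

```python
from typing import Dict


def may_merge(test_results: Dict[str, bool]) -> bool:
    """Allow the merge only if at least one test ran and every test passed."""
    # An empty result set is treated as "not verified", so it blocks the merge.
    return bool(test_results) and all(test_results.values())
```

One failing test blocks the merge, and so does an empty set of results, since "nothing ran" is not the same as "everything passed." The hard part is never the check. It's agreeing to live with what it tells you.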

Let me tell you, this mechanism was effective. It was so effective that it drove us all nuts. Remember, it took hours for all these tests to pass. As soon as that became a requirement and we actually had to sit and wait for the tests to pass, we had to face the hard realization that this system had grown out of control. We were all smart humans, but none of us was smart enough to predict the outcome of 20 chained builds that all depended on each other. Productivity went to 0 as we stared at Jenkins all afternoon.

Mechanism Truths

Here's a hard fact about mechanisms: they require investment. If you just got them for free, you'd already have them. It can be hard to prioritize mechanisms against customer-facing improvements, especially when they relate to more abstract topics like developer productivity or reliability. For truly existential risks that can be prevented via mechanism, someone must advocate for them. That someone can be you! If there's no other way to persuade those around you, use the postmortem to make this happen. Never waste a crisis.

Here's another hard fact about mechanisms: they tell you the truth, not what you want to hear. You watch them execute and you have to be objective about what they tell you about what's working and what's not. This mechanism told us something that none of us had been willing to say: we had lost control of the architecture and the jello tower had to come down. But that is a story for another time.