Yusuf Van Gieson
Senior Software Engineer

Finding a way to make on-call better

One of the ugly truths about software testing is that we don’t test our code nearly enough, because of some very basic problems:

  1. Most test setups don’t really work.
  2. Most teams don’t have time to write tests that do work.
  3. Even if we could write tests that work, we don’t have time to triage and act on the results.

We see this play out in every proof-of-concept. At some point, a prospect runs their software through Antithesis for the first time, and that almost always turns up a bunch of bugs, because Antithesis has just handed them a whole bunch of tests that actually work, without their having to write any of them. At this point, the third obstacle rears its ugly head.

Sometimes the prospect starts to panic, because they’re already drowning in bugs, and there’s no way they could possibly triage and fix all these in the next month/quarter/year. So they ask: is finding all these new bugs really helping?

The answer is yes, but not for the reason you think.

Knowing where all your bugs are is super useful, because it enables you to fix the new things first, and fixing the new things first is just about the best thing you can do for software quality.

Today I want to expand on this idea, and introduce a new UI we just released – Findings – that embeds this philosophy in the Antithesis Platform.

Why fix the new things first?

Most software teams fix the bad things first. They rank bugs in order of severity to prioritize their limited engineering resources. This is perfectly sane, and perfectly reasonable, and there’s one gigantic argument in favor of doing things this way, which is that any software team can do it.

But there’s also one gigantic drawback: fixing bugs in order of severity won’t help you fix bugs faster.

Consider:

  • Old bugs are hard to understand and fix.
  • Code is complex, so you’re never sure what might be growing in the corners you ignore.
  • You’re forever uncertain about when any bug you discover was actually introduced, which compounds the first problem, because it makes bisection really hard.

On the other hand, new bugs – if you can find them quickly and reliably – are the fastest ones to fix, because:

  • If you find a bug right after it’s introduced, you know exactly which code is responsible for it, so diagnosing and fixing the bug is much cheaper than if you need to investigate it. In the extreme case, the fix is virtually costless: CTRL+Z.
  • If you fix bugs right away, you’re conserving an immeasurable amount of mental space, because you don’t have to worry about what they might turn into.
  • You don’t need to worry about figuring out how severe a new issue might be – you just fix it.
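
To put rough numbers on that difference, here’s a minimal sketch of bisection over a commit history. It’s purely illustrative: `first_bad_commit` and `is_bad` are hypothetical names, not part of Antithesis or any real tool, and the sketch assumes the bug is deterministic, meaning that once a commit is bad, every later commit is bad too.

```python
from typing import Callable, Sequence, Tuple


def first_bad_commit(
    commits: Sequence[str], is_bad: Callable[[str], bool]
) -> Tuple[str, int]:
    """Binary-search for the commit that introduced a failure.

    Assumes commits are ordered oldest-first, the last commit is known
    to be bad, and badness is monotonic (no flaky results).
    Returns the culprit and the number of test runs it took to find it.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[hi] is bad
    runs = 0
    while lo < hi:
        mid = (lo + hi) // 2
        runs += 1  # each probe is one full run of the test suite
        if is_bad(commits[mid]):
            hi = mid  # culprit is at mid or earlier
        else:
            lo = mid + 1  # culprit is after mid
    return commits[lo], runs


# An old bug: 1,024 commits have landed since the last known-good state.
old_history = [f"c{i}" for i in range(1024)]
old_culprit, old_runs = first_bad_commit(old_history, lambda c: int(c[1:]) >= 700)

# A new bug: caught on the very commit that introduced it.
new_culprit, new_runs = first_bad_commit(["c1024"], lambda c: True)
```

In the first case, finding the culprit takes 10 full test runs, and only because the failure reproduces reliably. In the second, there’s nothing to search: the only candidate is the commit you just made.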

If you aren’t introducing new bugs, you have a stable foundation for development – whether that means building new features or burning down a huge backlog of older bugs. If you’re stuck in a software quality death spiral, drowning in so many bugs you can’t make any meaningful changes, this is the only reliable way out, because it’s the only approach that actually enables you to fix bugs faster than they’re being introduced.

Why not fix the new things first?

This isn’t exactly a groundbreaking philosophy. “No new tech debt” is an idea that’s been around for years, and various software engineering ideologies, like Extreme Programming, espouse similar ideas. The problem is that this sounds great, but is really hard to execute.

It’s really hard – technically, organizationally, and culturally – to implement a testing system comprehensive enough to give you any reasonable degree of confidence that you know where all your bugs are. Without such a testing setup, it’s nearly impossible to draw a line in the sand between the bugs you already have and the ones you’re about to introduce. And if your testing isn’t this thorough, you can’t be sure you’re not letting any new bugs through.

As a reference for what it takes to build this kind of testing in a large codebase, our Engineering Lead, Youhana, was on a team at Microsoft that implemented this approach. It took tens of thousands of unit tests, hundreds of integration tests, dozens of cloud integration tests, and an entire team of 800 engineers committed to stopping their deployment pipeline every time a test broke. Very few engineering organizations can support that level of resourcing and commitment.

Fortunately, we’ve already built the testing system you need – it’s called Antithesis!

Findings

Findings tell you when your system’s behavior has changed. They condense an extensive set of test results – the equivalent of tens of thousands of hours of human test creation – into something you can triage in seconds.

When we were building Findings, I thought a lot about the days I spent on-call at Google. I’d come on rotation, and the previous on-call and I would do a handoff, and the handoff was always a mess. There would be a pile of issues: some on fire, some resolved but not marked as such. There would also be a pile of logs and a test report with a bunch of red, and I’d have to put all of these together just to see which red flags I should tackle first. In the time it took to dig into any one issue enough to evaluate and bisect it, another half dozen would come in, and I’d end every shift with a long queue of things I hadn’t looked at. I’d hand this mess to the next on-call, brief them as well as I could, and the circle of life would start again.

This isn’t a knock on my colleagues (who were smart, committed engineers) – it was just how it went. Put another way, it’s a structural problem, not an individual one. The way we were testing and triaging made it impossible to tell which issues were new – and therefore easy to fix – and which were old and thorny and would bog us down.

If you’ve ever been an on-call engineer, this story’s probably familiar. So here’s what an on-call handoff looks like with Findings.

The initial triage that used to take me half my shift has been done for me. I know at a glance what’s new, what’s fixed, and what’s flaky.

Most importantly, I have a timeline, instead of having to piece one together myself. The timeline tells me exactly when each failure was introduced, so bisection is easy. You can see how this would make life as an on-call engineer much easier – and with each issue matched to a small set of recent commits, the code owners should have a much easier time fixing things too.
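
To make the idea of “new, fixed, and flaky” concrete, here’s a rough sketch of how a history of test runs could be condensed into that kind of summary. To be clear, this is not how Findings is implemented and not an Antithesis API: `Run`, `Finding`, `triage`, the `FLAKY_FLIPS` threshold, and the property names are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

# Hypothetical threshold: a property that changes state this many times
# within the window is treated as flaky rather than new or fixed.
FLAKY_FLIPS = 3


@dataclass
class Run:
    commit: str         # commit the run was executed against
    failures: Set[str]  # names of properties that failed in this run


@dataclass
class Finding:
    status: str  # "new", "ongoing", "fixed", or "flaky"
    since: str   # commit at which the property entered its current state


def triage(history: List[Run]) -> Dict[str, Finding]:
    """Summarize every property that ever failed. History is oldest-first."""
    findings: Dict[str, Finding] = {}
    names = set().union(*(run.failures for run in history))
    for name in sorted(names):
        outcomes = [name in run.failures for run in history]
        flips = sum(a != b for a, b in zip(outcomes, outcomes[1:]))
        if flips >= FLAKY_FLIPS:
            status = "flaky"
        elif not outcomes[-1]:
            status = "fixed"
        elif len(outcomes) == 1 or not outcomes[-2]:
            status = "new"
        else:
            status = "ongoing"
        # Walk back to the run where the current state began.
        i = len(outcomes) - 1
        while i > 0 and outcomes[i - 1] == outcomes[i]:
            i -= 1
        findings[name] = Finding(status, history[i].commit)
    return findings


# Example history with invented property names and commits.
history = [
    Run("a1", {"no_data_loss", "no_deadlock"}),
    Run("b2", {"no_data_loss", "no_deadlock", "retries_bounded"}),
    Run("c3", {"no_data_loss", "no_deadlock"}),
    Run("d4", {"no_data_loss", "retries_bounded"}),
    Run("e5", {"no_data_loss", "leader_elected"}),
]
summary = triage(history)
```

In this toy history, `leader_elected` comes out as new at commit `e5`, which is the one to fix first, while `no_data_loss` has been failing since `a1` and belongs in the backlog of thorny older bugs.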

If you keep this up for a few weeks, you should be able to get to a state where your list of Findings starts looking more like this. Now you’re fixing issues as they come in, so new ones get taken care of quickly, and your team can focus on either new feature work or cleaning up the thorny bugs you already had. I know I’m making this sound almost trivial, but sometimes the answer really is that simple, at least conceptually. This is definitely one area where it’s the implementation that’s daunting.

So I view Findings less as a feature than as a stake in the ground. Combined with Antithesis’ autonomous testing, they make it not just feasible but so easy to fix the new things first that there’s no conceivable reason not to.

What could this do for your team? Or your relationship with this enormous, complex system you spend your day building? Give us a call. Let’s find out.