Luis Fernandez
Senior Consultant

From zero to first bug

Celebrating bugs

I’m pretty excited about all the attention coming our way since our launch. I’m on the team of consultants that helps new and prospective customers get started with our platform, and it’s fun to see how quickly our latest customers go from “What’s Antithesis?” to “This is cool!”

A big thank-you to all of our early adopters for making that possible! In this post, I want to outline what the very earliest phases of working with us look like – starting from the moment you ask “What’s Antithesis?” and ending when you find your first bug on our platform.

Expectations, assumptions, and common questions

  • Yes, we expect to find a real bug in your code during onboarding.
  • Yes, we have an extremely paranoid security architecture, since customers share their source code, issue trackers, and top engineers with us. We’ll write about this in more detail someday, but tl;dr, we run an entirely separate copy of our infrastructure for every customer and concentrate security responsibilities in a small and heavily scrutinized trusted computing base (TCB).
  • Yes, we can find bugs in code written in any language. We have better tooling and support for some languages than others.
  • Yes, we can only work with OCI-compatible container images that run on Linux, at the moment.

Five steps to your first bug

Step 0. Understand the requirements

Your software must be a containerized, distributed system, for now, because we want to be able to manipulate the timing and delivery of requests and responses between services. The requirement for a distributed system is not a fundamental limitation of our system, but rather reflects the fact that our current product focus is on testing for fault-tolerance, and it’s hard for a single-process system to be fault-tolerant.

A distributed system is also a great way to dip your toes into autonomous testing: you can point us to your containerized services, and we can disrupt the environment in which those services operate without knowing anything about the internals of each service. This means you get a lot of the benefits of things like fuzzing and property-based testing without having to write complex generative workloads and harnesses.

Step 1. Set up the hermetic simulation environment for testing

When you express interest in working with us, we start working independently to scout out any issues we might face getting your environment running on our platform. We will look for publicly available Docker images and, if you are open-source, we will try to build your software from source.

Our team wants to do two things:

  1. find simple hooks in your code that allow us to tweak straight-line code execution with little or no effort from you,
  2. identify potential pitfalls in getting started.

Here are typical things we look for:

All of these items will influence the kinds of bugs we find, and how effectively we search the state space of your program. But the last two items are especially important during onboarding.

If you have an external dependency, it must be included in your test environment or it must be mocked. Remember that determinism depends on being able to control everything in the environment, which means the environment must be a closed system. Every time you touch the outside world, determinism is broken, so we simply prohibit communication over the internet during testing.

Because we’ve encountered many different kinds of external dependencies, we have solutions available for a lot of them. The easiest way to check if you have such a dependency is to try to run your Docker environment without internet access.
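One way to try that check locally is sketched below. It is only an illustration, not our tooling: the image name and network name are hypothetical, and the sketch simply builds two standard Docker CLI commands, one that creates an `--internal` network (containers on it can reach each other but not the outside world) and one that starts a container on that network.

```python
import shlex
from typing import List


def offline_check_commands(image: str, network: str = "no-internet") -> List[List[str]]:
    """Build Docker CLI commands that start `image` with no route to the internet.

    `image` and `network` are placeholder names chosen for this sketch.
    """
    return [
        # An --internal network allows container-to-container traffic only.
        ["docker", "network", "create", "--internal", network],
        # Start the service attached to that network and watch what breaks.
        ["docker", "run", "--rm", "--network", network, image],
    ]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before running them by hand.
    for cmd in offline_check_commands("example/my-service:latest"):
        print(shlex.join(cmd))
```

If the service fails to start, or logs connection errors to hosts outside the network, that is a strong hint that it has an external dependency that needs to be bundled or mocked.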

Deploying open source dependencies into your test is usually pretty straightforward, and we’ll often have solutions ready to go in those cases. For dependencies on external services, things can get a little trickier. We have off-the-shelf support in place for common AWS services, and slightly less off-the-shelf support for some other stuff. Your best bet is to ask about support for a given service. If we do already offer a productized mock, the shift to using our mock could be as easy as setting an environment variable. For AWS Lambda, we’ll ask you to follow some basic instructions from AWS.
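To make the "setting an environment variable" idea concrete, here is a minimal sketch of how a service might choose between a real endpoint and a mock. The variable name `STORAGE_ENDPOINT_URL` and both URLs are hypothetical, made up for this example; the actual variable depends on the service and the mock you use.

```python
import os
from typing import Mapping

# Hypothetical production endpoint, used when no override is present.
DEFAULT_ENDPOINT = "https://storage.example.com"


def storage_endpoint(env: Mapping[str, str] = os.environ) -> str:
    """Return the endpoint the service should talk to.

    If STORAGE_ENDPOINT_URL (a made-up name for this sketch) is set and
    non-empty, it wins, so a test environment can redirect traffic to a
    mock without any code change.
    """
    override = env.get("STORAGE_ENDPOINT_URL", "").strip()
    return override or DEFAULT_ENDPOINT


if __name__ == "__main__":
    print(storage_endpoint())
```

With this pattern, the test environment only has to export the variable (for example pointing at a mock container), and the production configuration stays untouched.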

Step 2. Help us verify your setup

If we’ve done the previous step right, this step is a non-step, too. At this point, we send you a log file representing the start-up sequence of your software as it runs on our platform.

Up until this point, most of the work involved in testing with us has been focused on setting up an environment that is compatible with our testing philosophy. The stuff covered in the preceding steps takes us about a week or two to get right. We tend to work behind the scenes on the details because you are probably not a build engineer and build pipelines are tricky to understand. We also don’t like distracting you with tedious orchestration issues so we try to troubleshoot those issues for you.

Step 3. Provide a workload and define the properties of your system

Once we have your distributed system up and running, we can shift our focus to the single most important piece of code affecting our ability to test anything meaningful – the workload. An existing test is not the perfect workload, but to start we ask you to pick one that has a predictable outcome, and then we run that test, indefinitely.

This is good enough because you probably haven’t exhaustively considered the conditions under which your “predictable outcome” is valid. Should a three-node cluster of your system still be available for a write with node A down, node B out of memory, and node C up? We won’t ask you to make your workload perfect in the first week. But we will ask you to make your expectations explicit and your test properties perfect.
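Making an expectation explicit can be as small as the sketch below. It assumes, purely for illustration, that your system accepts a write only when a strict majority of nodes is healthy and that an out-of-memory node does not count as healthy. Your real rule may differ; the point is that the rule is written down and checked.

```python
from enum import Enum
from typing import Mapping


class NodeState(Enum):
    UP = "up"
    DOWN = "down"
    OUT_OF_MEMORY = "out_of_memory"


def write_expected_to_succeed(nodes: Mapping[str, NodeState]) -> bool:
    """Assumed rule: a write needs a strict majority of nodes in the UP state."""
    healthy = sum(1 for state in nodes.values() if state is NodeState.UP)
    return healthy > len(nodes) // 2


def check_write_property(nodes: Mapping[str, NodeState], write_succeeded: bool) -> None:
    """Fail loudly when observed behavior contradicts the stated expectation."""
    expected = write_expected_to_succeed(nodes)
    assert write_succeeded == expected, (
        f"write_succeeded={write_succeeded} but expected {expected} for {dict(nodes)}"
    )


# The scenario from the text: node A down, node B out of memory, node C up.
scenario = {
    "A": NodeState.DOWN,
    "B": NodeState.OUT_OF_MEMORY,
    "C": NodeState.UP,
}
```

Under the assumed majority rule, `write_expected_to_succeed(scenario)` is `False`, so a write that succeeds in this state would be flagged just as a write that fails with two healthy nodes would be.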

What is a bug? Anytime your software, including the workload, behaves in a way that contradicts the documented, expected behavior, you have a bug. If you don’t define expectations, you can’t find bugs. If you can’t find bugs, your users will – they have expectations for the behavior of your software (even if you don’t).

Just like your users, we can’t entirely define the expectations for you, but we do try to tell you about some pretty universal ones – software shouldn’t crash, errors and exceptions should be handled, race conditions shouldn’t be possible, and more. You’ll need to add to this list of expectations, or test properties, over time.
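As a rough illustration of what a generic, log-based property can look like, the sketch below scans log lines for a few failure indicators. The pattern names and regular expressions are assumptions chosen for this example, not a list taken from our platform; a real setup would use indicators that match your own runtime and logging format.

```python
import re
from typing import Iterable, List, Tuple

# Hypothetical indicators of "universal" failures; extend for your own stack.
FAILURE_PATTERNS = {
    "python_traceback": re.compile(r"Traceback \(most recent call last\)"),
    "go_panic": re.compile(r"^panic: "),
    "oom": re.compile(r"out of memory", re.IGNORECASE),
}


def find_violations(lines: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Return (line_number, pattern_name, line) for every matching log line."""
    violations = []
    for number, line in enumerate(lines, start=1):
        for name, pattern in FAILURE_PATTERNS.items():
            if pattern.search(line):
                violations.append((number, name, line.rstrip("\n")))
    return violations


if __name__ == "__main__":
    sample = [
        "INFO service started",
        "panic: runtime error: index out of range",
    ]
    for violation in find_violations(sample):
        print(violation)
```

Properties like these catch crashes and unhandled errors, but they say nothing about whether your system did the right thing, which is why system-specific properties still matter.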

This is the hardest part about working with us. If you don’t invest time defining expected behavior by defining test properties unique to your system, we will only find the bugs that we can define completely generically. That isn’t zero, but it’s far from the full value of our platform.

We will perfectly reproduce every bug we find for you – which is pretty cool, but not as cool as finding bugs in your design and then reproducing those unexpected and rare situations, perfectly.

Step 4. Triage the bugs

If you are triaging, that means we’ve already found a bug…

After the first bug

There are a few other things we show you while we have your attention: how to use our SDK, how to set up a GitHub Action to run tests automatically, and how to read our reports – that stuff is pretty straightforward.

I have one more thing I can’t leave out because three years after joining Antithesis it still blows my mind and makes me smile whenever I see it – executing the exact same steps (with timestamps!) that lead to the one-in-a-hundred-thousand (or even one-in-a-million!) failure scenario and watching the failure play out exactly (with timestamps!) the same way it did the first time. And then it gets better: we rewind to exactly 0.1 seconds before the failure and inject a command that causes the failure to disappear. But that’s not enough – because you need to see exactly what causes the failure to come back. So we rewind a little more or a little less, and we run it forward a little more or a little less, and we inject the command or we don’t – thousands of times. Until you’ve identified and eliminated all of the bugs.
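The principle behind that kind of replay can be shown with a toy. This is not how Antithesis is implemented, and the event names are invented for the example. It only illustrates that when every source of randomness and time lives inside a closed, seeded simulation, running it again produces the same events at the same simulated timestamps.

```python
import random
from typing import List, Tuple

Event = Tuple[float, str]  # (simulated timestamp in seconds, action)


def simulate(seed: int, steps: int = 1000) -> List[Event]:
    """Toy closed-world run: every 'random' choice comes from one seeded generator."""
    rng = random.Random(seed)
    clock = 0.0
    trace: List[Event] = []
    for _ in range(steps):
        # Simulated time advances by a seeded amount; no wall-clock is consulted.
        clock += rng.uniform(0.001, 0.1)
        action = rng.choice(["request", "response", "drop_packet", "pause_node"])
        trace.append((round(clock, 6), action))
    return trace


if __name__ == "__main__":
    first = simulate(seed=42)
    replay = simulate(seed=42)
    # Same seed, same closed world: the traces match event for event.
    print(first == replay)
```

If `simulate` read the real clock or made a network call, the two traces could diverge, which is the toy version of why touching the outside world breaks determinism.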

There’s a lot more you can do to get the most out of Antithesis. Once somebody’s become a customer and is seeing the value, my team switches to helping them climb this maturity model.

Reliability Testing Maturity Model

Stages, from least to most mature:

  • Crawl: Uncover basic reliability bugs
  • Stand: Uncover functional bugs and more reliability issues
  • Walk: Uncover bugs the day after introduction, increasing productivity and confidence
  • Run: Most bugs found right away, reducing the burden of production emergencies
  • Fly: Hugely increased developer velocity, high confidence changes and releases

System under test:

  • Crawl: Bring your software as is without any modifications
  • Stand: Log messages added to guide Antithesis search and surface bugs
  • Walk: Software has coverage instrumentation
  • Run: Software uses Antithesis entropy for randomness
  • Fly: Full use of the Antithesis SDK

Workload, from least to most mature:

  • Bring your existing integration tests to exercise your software
  • Antithesis decides which integration tests to run and controls concurrency
  • Custom workload that enables Antithesis to explore the full breadth of your system’s functionality
  • Use feedback to assess workload coverage and expand functionality

Test properties:

  • Crawl: No configuration needed. Detect generic issues: exits, OOMs, process crashes
  • Stand: Basic test properties that parse log messages as bug indicators
  • Walk: Comprehensive set of functional properties
  • Run: Comprehensive set of reliability properties
  • Fly: Engineers use the Antithesis SDK to write and update assertions as they work

Fault injection, from least to most mature:

  • Basic network faults
  • Network faults + Thread pausing + Node termination + CPU modulation + Clock jitter + Node throttling
  • Use of quiescence periods to check liveness, eventual availability, etc.

Test frequency:

  • Crawl: Infrequent / Ad Hoc
  • Stand: Weekly
  • Walk: Nightly
  • Run: Continuous Integration
  • Fly: Engineers validate changes with Antithesis before merging

I love seeing customers get all the way to the right on the chart, but in so many ways the first step is still the most fun. You get to see the look in the customer’s eyes when you show them actual for-real magic. If that sounds like a job you’d enjoy, you can apply for our team here.