From zero to first bug

May 9, 2024

Celebrating bugs

I’m pretty excited about all the attention coming our way since our launch. I’m on the team of consultants that helps new and prospective customers get started with our platform, and it’s fun to see how quickly our latest customers go from “What’s Antithesis?” to “This is cool!”

Big thank you to all of our early adopters for making that possible! In this post, I want to outline what the very earliest phases of working with us look like – starting from the moment you ask “What’s Antithesis?” and ending when you find your first bug on our platform.

Expectations, assumptions, and common questions

Yes, we expect to find a real bug in your code during onboarding.
Yes, we have an extremely paranoid security architecture, since customers share their source code, issue trackers, and top engineers with us. We’ll write about this in more detail someday, but tl;dr, we run an entirely separate copy of our infrastructure for every customer and concentrate security responsibilities in a small and heavily-scrutinized trusted computing base (TCB).
Yes, we can find bugs in code written in any language. We have better tooling and support for some languages than others.
Yes, we can only work with OCI-compatible container images that run on Linux, at the moment.

Five steps to your first bug

Step 0. Understand the requirements

Step 1. Set up the hermetic simulation environment for testing

Step 2. Help us verify that your software runs the way you expect it to run

Step 3. Provide a workload and tell us the properties of your system

Step 4. Triage the bugs

Step 0. Understand the requirements

Your software must be a containerized, distributed system, for now, because we want to be able to manipulate the timing and delivery of requests and responses between services. The requirement for a distributed system is not a fundamental limitation of our system, but rather reflects the fact that our current product focus is on testing for fault-tolerance, and it’s hard for a single-process system to be fault-tolerant.

It’s also a great way to dip your toes into autonomous testing, because with a distributed system you can point us to your containerized services and we can disrupt the environment in which those services operate without knowing anything about the internals of each service. This means you get a lot of the benefits of things like fuzzing and property-based testing without having to write complex generative workloads and harnesses.

Step 1. Set up the hermetic simulation environment for testing

When you express interest in working with us, we start working independently to scout out any issues we might face getting your environment running on our platform. We will look for publicly available Docker images and, if you are open-source, we will try to build your software from source.

Our team wants to do two things:

find simple hooks in your code that allow us to tweak straight-line code execution without much or any effort from you,
identify potential pitfalls in getting started.

Here are typical things we look for:

Can we control the number of threads?
Can we inject our own source of entropy?
Is adding instrumentation to the build going to be super easy?
Can we control logging (level and format)?
Do the build or setup phases have any external dependencies?
What kinds of off-the-shelf tests do we have access to?

All of these items will influence the kinds of bugs we find, and how effectively we search the state space of your program. But the last two items are especially important during onboarding.
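To make “inject our own source of entropy” a bit more concrete, here is a minimal Python sketch. It isn’t taken from any customer’s code, and pick_retry_delays is a made-up helper. The only point is that the function receives its random number generator as a parameter instead of reaching for a global one, so whoever runs the test decides where the randomness comes from.

```python
import random


def pick_retry_delays(rng: random.Random, attempts: int) -> list[float]:
    """Return jittered backoff delays, drawing all randomness from `rng`.

    Hypothetical helper for illustration only.
    """
    delays = []
    for attempt in range(attempts):
        base = 0.1 * (2 ** attempt)                 # exponential backoff
        delays.append(base + rng.uniform(0, base))  # jitter from the injected source
    return delays


# Two runs fed the same seed make exactly the same "random" choices.
first = pick_retry_delays(random.Random(42), attempts=4)
second = pick_retry_delays(random.Random(42), attempts=4)
assert first == second
```

Code written this way is easy to drive from a controlled source of entropy; code that calls a global generator or reads the system clock for a seed takes more work.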

If you have an external dependency, it must be included in your test environment or it must be mocked. Remember that determinism depends on controlling everything in the environment; in other words, on a closed system. Every time you touch the outside world, determinism is broken, so we simply prohibit communication over the internet during testing.

Because we’ve encountered many different kinds of external dependencies, we have solutions available for a lot of them. The easiest way to check if you have such a dependency is to try to run your Docker environment with no internet access.

Deploying open source dependencies into your test is usually pretty straightforward, and we’ll often have solutions ready to go in those cases. For dependencies on external services, things can get a little trickier. We have off-the-shelf support in place for common AWS services, and slightly less off-the-shelf support for some other stuff. Your best bet is to ask about support for a given service. If we do already offer a productized mock, the shift to using our mock could be as easy as setting an environment variable. For AWS Lambda, we’ll ask you to follow some basic instructions from AWS.
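Here is a rough sketch of what “as easy as setting an environment variable” can look like from the application’s side. Everything named here is an assumption for illustration: STORAGE_ENDPOINT_URL, the placeholder URLs, and resolve_endpoint are invented, and the variable your service or SDK actually reads may be different.

```python
import os

# Placeholder for the real service URL (assumed, not a real endpoint).
REAL_ENDPOINT = "https://storage.example.com"


def resolve_endpoint(env=None) -> str:
    """Return the override from STORAGE_ENDPOINT_URL if set, else the real URL."""
    env = os.environ if env is None else env
    return env.get("STORAGE_ENDPOINT_URL", REAL_ENDPOINT)


# Production: the variable is unset, so the real service is used.
print(resolve_endpoint({}))

# Test environment: the variable points at a mock running inside the closed system.
print(resolve_endpoint({"STORAGE_ENDPOINT_URL": "http://mock-storage:4566"}))
```

If the endpoint is hard-coded instead, swapping in a mock means a code change rather than a configuration change.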

Step 2. Help us verify your setup

If we’ve done the previous step right, this step is a non-step. At this point, we send you a log file representing the start-up sequence of your software as it runs on our platform.

Up until this point, most of the work involved in testing with us has been focused on setting up an environment that is compatible with our testing philosophy. The work covered in the preceding steps takes us a week or two to get right. We tend to work behind the scenes on the details because you are probably not a build engineer, and build pipelines are tricky to understand. We also don’t like distracting you with tedious orchestration issues, so we try to troubleshoot those for you.

Step 3. Provide a workload and define the properties of your system

Once we have your distributed system up and running, we can shift our focus to the single most important piece of code impacting our ability to test anything at all meaningful – the workload. Although it’s not the perfect test template, we ask you to pick an existing test that has a predictable outcome and then we run that test, infinitely.

This is good enough because you probably haven’t exhaustively considered the conditions under which your “predictable outcome” is valid. Should a three-node cluster of your system still be available for a write with node A down, node B out of memory, and node C up? We won’t ask you to make your workload perfect in the first week. But we will ask you to make your expectations explicit and your test properties perfect.
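One way to make an expectation like that explicit is to write it down as code. The sketch below assumes a simple majority-quorum rule, which may well not be the rule your system follows; NodeState and write_should_succeed are made-up names for illustration.

```python
from enum import Enum


class NodeState(Enum):
    UP = "up"
    DOWN = "down"
    OUT_OF_MEMORY = "out_of_memory"


def write_should_succeed(states: dict[str, NodeState]) -> bool:
    """Assumed rule: a write needs a strict majority of nodes that are UP."""
    healthy = sum(1 for state in states.values() if state is NodeState.UP)
    return healthy > len(states) // 2


# The scenario from the text: A down, B out of memory, C up.
scenario = {
    "A": NodeState.DOWN,
    "B": NodeState.OUT_OF_MEMORY,
    "C": NodeState.UP,
}
print(write_should_succeed(scenario))  # False: only 1 of 3 nodes is healthy
```

Under that assumed rule, the cluster should refuse the write. If your system accepts it anyway, either the rule or the system is wrong, and both are worth knowing about.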

What is a bug? Anytime your software, including the workload, behaves in a way that contradicts the documented, expected behavior, you have a bug. If you don’t define expectations, you can’t find bugs. If you can’t find bugs, your users will – they have expectations for the behavior of your software (even if you don’t).

Just like your users, we can’t entirely define the expectations for you, but we do try to tell you about some pretty universal ones: software shouldn’t crash, errors and exceptions should be handled, race conditions shouldn’t be possible, and more. You’ll need to add to this list of expectations, or test properties, over time.
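As a rough illustration of what a generic, log-based property check might look like, here is a small Python sketch. The patterns and names (GENERIC_BUG_PATTERNS, find_violations) are assumptions made up for this example, not our actual checks, and real logs would need patterns matched to their own format.

```python
import re

# Assumed patterns for illustration; tune these to your own log format.
GENERIC_BUG_PATTERNS = {
    "crash": re.compile(r"\b(panic|segmentation fault|core dumped)\b", re.IGNORECASE),
    "unhandled_exception": re.compile(
        r"Traceback \(most recent call last\)|unhandled exception", re.IGNORECASE
    ),
    "out_of_memory": re.compile(r"\b(out of memory|OOM[- ]?killed)\b", re.IGNORECASE),
}


def find_violations(log_lines):
    """Return (line_number, property_name) for every line that matches a pattern."""
    violations = []
    for number, line in enumerate(log_lines, start=1):
        for name, pattern in GENERIC_BUG_PATTERNS.items():
            if pattern.search(line):
                violations.append((number, name))
    return violations


sample = [
    "INFO  node-a started",
    "ERROR node-b: Out of memory while applying batch",
    "FATAL node-c: panic: index out of range",
]
print(find_violations(sample))  # [(2, 'out_of_memory'), (3, 'crash')]
```

Checks like these know nothing about what your system is supposed to do, which is exactly why they only go so far.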

This is the hardest part about working with us. If you don’t invest time defining expected behavior by defining test properties unique to your system, we will only find the bugs that we can define completely generically. That isn’t zero, but it’s far from the full value of our platform.

We will perfectly reproduce every bug we find for you – which is pretty cool, but not as cool as finding bugs in your design and then reproducing those unexpected and rare situations, perfectly.

Step 4. Triage the bugs

If you are triaging, that means we’ve already found a bug…

After the first bug

There are a few other things we show you while we have your attention: how to use our SDK, how to set up a GitHub action to run tests automatically, and how to read our reports – that stuff is pretty straightforward.

I have one more thing I can’t leave out because three years after joining Antithesis it still blows my mind and makes me smile whenever I see it – executing the exact same steps (with timestamps!) that led to the one-in-a-hundred-thousand (or even one-in-a-million!) failure scenario and watching the failure play out exactly (with timestamps!) the same way it did the first time. And then it gets better: we rewind to exactly 0.1 seconds before the failure and inject a command that causes the failure to disappear. But that’s not enough – because you need to see exactly what causes the failure to come back. So we rewind a little more or a little less, and we run it forward a little more or a little less, and we inject the command or we don’t – thousands of times. Until you’ve identified and eliminated all of the bugs.
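If you want a feel for why replay and rewind fall out of determinism, here is a toy Python sketch. It is emphatically not how our platform works under the hood; it is a tiny simulation, invented for this post, in which a seeded random number generator is the only source of nondeterminism. The names simulate, resume, and FAIL_AT are made up.

```python
import random

FAIL_AT = 5  # the toy system "fails" when its counter reaches this value


def simulate(rng, value, first_step, last_step, reset_at=None):
    """Advance the toy system; return (failure_step or None, snapshots).

    snapshots[step] holds (value, rng_state) captured just before `step` runs,
    which is all that's needed to resume from that exact moment.
    """
    snapshots = {}
    for step in range(first_step, last_step):
        snapshots[step] = (value, rng.getstate())
        if step == reset_at:
            value = 0                    # the injected command
        value += rng.choice([-1, 1, 1])  # the only source of nondeterminism
        if value >= FAIL_AT:
            return step, snapshots
    return None, snapshots


def resume(snapshot, first_step, last_step, reset_at=None):
    """Restore a snapshot and run forward; return the failure step or None."""
    value, rng_state = snapshot
    rng = random.Random()
    rng.setstate(rng_state)
    return simulate(rng, value, first_step, last_step, reset_at)[0]


SEED, STEPS = 2024, 1000
failure_step, snapshots = simulate(random.Random(SEED), 0, 0, STEPS)

# Replaying from the start with the same seed fails at the same step.
replayed = simulate(random.Random(SEED), 0, 0, STEPS)[0]

# Rewinding to two steps before the failure and running forward fails again.
rewind_to = failure_step - 2
resumed = resume(snapshots[rewind_to], rewind_to, STEPS)

# Injecting a command at the rewind point makes the failure disappear
# within the window where it originally occurred.
with_command = resume(snapshots[rewind_to], rewind_to, failure_step + 1,
                      reset_at=rewind_to)

print(failure_step == replayed == resumed, with_command)
```

The real thing involves whole distributed systems rather than a counter, but the idea is the same: if nothing is left to chance, any moment can be revisited and perturbed.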

There’s a lot more you can do to get the most out of Antithesis. Once somebody’s become a customer and is seeing the value, my team switches to helping them climb this maturity model.

Reliability Testing Maturity Model
	Crawl	Stand	Walk	Run	Fly
Goal	Uncover basic reliability bugs	Uncover functional bugs and more reliability issues	Uncover bugs the day after introduction, increasing productivity and confidence	Most bugs found right away, reducing the burden of production emergencies	Hugely increased developer velocity, high confidence changes and releases
System under test	Bring your software as is without any modifications	Log messages added to guide Antithesis search and surface bugs	Software has coverage instrumentation	Software uses Antithesis entropy for randomness	Full use of the Antithesis SDK
Workload	Bring your existing integration tests to exercise your software	Antithesis decides which integration tests to run and controls concurrency	Custom workload that enables Antithesis to explore the full breadth of your system’s functionality		Use feedback to assess workload coverage and expand functionality
Test properties	No configuration needed. Detect generic issues: exits, OOMs, process crashes	Basic test properties that parse log messages as bug indicators	Comprehensive set of functional properties	Comprehensive set of reliability properties	Engineers use the Antithesis SDK to write and update assertions as they work.
Fault injection	Basic network faults	Network faults + Thread pausing + Node termination	+ CPU modulation + Clock jitter + Node throttling	Use of quiescence periods to check liveness, eventual availability, etc.
Test frequency	Infrequent / Ad Hoc	Weekly	Nightly	Continuous Integration	Engineers validate changes with Antithesis before merging

I love seeing customers get all the way to the right on the chart, but in so many ways the first step is still the most fun. You get to see the look in the customer’s eyes when you show them actual for-real magic. If that sounds like a job you’d enjoy, you can apply for our team here.
