Lucy Keer
Technical Writer

Did you get lucky or unlucky?


Some bugs are flaky, appearing only at specific times, or when the system is in a particular state, or in the presence of problems like network faults, or some annoying combination of all of the above. These problems are particularly frustrating to debug because they’re so hard to hit in the first place. They frequently make it through testing undetected, only to appear in production where debugging becomes far more expensive and disruptive.

Antithesis finds these tricky bugs by running your software many times, while injecting different combinations of faults, to try and flush them out. Once we’ve found one, our deterministic simulation environment can perfectly reproduce it, allowing you to replay it as many times as you want. However, the process of finding rare bugs in the first place takes time. For any given level of testing effort, some bugs will lurk at the edge of discoverability, appearing in some of your test runs but not others.

It’s difficult to figure out when these rare bugs are fixed. Say that you put in what you hope is a fix and then do another test run. For most bugs, seeing your Antithesis test switch from red to green in the new run is a good indication that your fix worked. For a rare bug, this isn’t enough. Did you really fix the problem, or were you just “lucky” enough to avoid it this time?

To help you figure this out, Antithesis provides extra information for rare bugs in your triage report. You’ll see a findability section, with a graph that estimates how many more runs you’ll need to do before you can be confident that the problem has been fixed. Here’s an example:

Example of the findability graph.

Each data point marks a test run. When we first encountered the bug (the leftmost point), the probability of the bug being present was 100%. It declined over a string of runs in which we didn’t see the bug, only to jump back to 100% when we encountered it again. The most recent data point is circled in purple. Future runs are projected in lighter purple, assuming evenly spaced runs of the same length.

The y-axis shows what we’re calling p_survival, the probability that we could have tested for that long without seeing the bug, under the hypothesis that the bug still exists. If the bug does still exist, the probability of seeing it increases as we carry out more runs. Equivalently, the probability of not seeing the bug decreases. If we haven’t seen the bug, we say that we have “survived” so far.

The green-circled point marks where we reach 94% confidence that the bug is no longer there.

To understand how we create this graph, we’ll start simpler. As a first measure of findability, we’ll estimate a mean time-to-bug: the mean CPU time it takes to find a bug in a given test run. First, though, let’s make a quick detour to discuss what “time” can mean in Antithesis, as this can be a confusing topic.

Time vs. time vs. time

There are three types of time relevant to Antithesis: CPU time, wall-clock time, and vtime. The first two are in common use, whereas vtime is Antithesis-specific.

We’ve already mentioned CPU time: the time that a central processing unit spends actively working, as opposed to sleeping or idling.

Wall-clock time is the actual time taken from the start of a task to the end. CPU time may be less than wall-clock time, for instance because the task needs to wait for other processes to finish. On the other hand, if the task is running in parallel on multiple CPUs, CPU time can be many times longer than wall-clock time. In Antithesis, we fuzz using every core of the machine, each of which is simulating its own parallel timeline.
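
To make the distinction concrete, here’s a minimal Python sketch (nothing Antithesis-specific, just the standard library) that measures both kinds of time around a task that first waits and then computes:

```python
import time

def busy_work(n):
    # Burn CPU with some pointless arithmetic.
    total = 0
    for i in range(n):
        total += i * i
    return total

wall_start = time.perf_counter()  # wall-clock time
cpu_start = time.process_time()   # CPU time for this process

time.sleep(1)           # waiting: wall-clock advances, CPU time barely moves
busy_work(10_000_000)   # computing: both clocks advance together

print(f"wall-clock: {time.perf_counter() - wall_start:.2f}s")
print(f"CPU:        {time.process_time() - cpu_start:.2f}s")
```

Run single-threaded like this, the CPU figure comes out smaller than the wall-clock one because of the sleep. Spread work across many cores, as we do when fuzzing, and total CPU time can instead be many times larger than wall-clock time.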

Antithesis also has a concept of simulated time within each timeline. We call this vtime (short for “virtual time”), and in general it also runs differently to wall-clock time. This is the time that your program measures — for example, if it sleeps for one second, then one second of vtime will elapse. Our multiverse map is essentially a tree of all the timelines with the horizontal axis given in vtime:

Example of the multiverse map, showing branching timelines with bugs marked on some of them.

In this post, we’re mostly going to be concerned with CPU time, as this is what we use to get a measure of how much testing Antithesis needs to do to find a bug. However, we’ll also need to take into account the branching structure of the multiverse, as we’ll see in the next section, so it’s useful to know about vtime too.

Estimating mean time-to-bug

The simplest possible thing we could do to estimate mean time-to-bug is to ignore the whole concept of branches and divide the total CPU time by the total number of bugs seen across all timelines.

Unfortunately, this won’t give us a very good estimate. To see why, let’s look at a simplified version of the multiverse map. A real multiverse map has a lot more branches, but here we’ll just draw a few:

A tree of branching histories, with vtime running horizontally and different branches arranged vertically. Some branches have one or more bugs, others don't.

You can see that there’s a single branch at first, and then there’s a “root moment” in vtime where the tree first splits into branches (five of them, in this simplified case). This is where the fault injector first gets turned on — from this point on, the branches will differ as different faults occur in different timelines. Each of these branches from the root moment is the start of a subtree which may branch further.

Once we’re inside a subtree, we can’t expect bugs in different branches of the subtree to be independent of each other. There might be some fault that “bakes in” the bug for all branches of the subtree (you can see this happen in the bottom subtree in the picture above). So we can only really count the first instance of the bug that we see in each subtree:

The same tree of branching histories, but with each separate subtree highlighted. The first bug in each subtree is highlighted in yellow.

We’ll take the CPU time spent in each subtree up to the moment of the first bug (this will cover multiple branches if the bug isn’t found in the first branch). If there’s no bug, we’ll take the whole subtree. In our example above, this gives us five time intervals, three where we saw the bug and two where we didn’t. The following diagram shows the time intervals as solid bars, with dotted lines showing the rest of the time spent in each subtree.

Diagram showing the CPU time to find the bug for each subtree. For the subtrees where a bug was found, the solid bars show the CPU time up to first seeing the bug, and dotted bars show the time spent in the rest of the tree. For the subtrees where no bug was found, the bars show the total time spent in the subtree.

If we’d seen the bug in all the subtrees, we could get a mean time-to-bug by just adding up all the time intervals to get a total time T, and then dividing by the number of subtrees N (N=5 in this example). However, we only saw the bug in three subtrees. In the other two, we might have seen it if the run had gone on for longer. So we still need to account for these incomplete runs somehow in our time-to-bug estimates. To handle incomplete data, we can borrow from the field of survival analysis.

The classic example question in survival analysis is “how long will the average patient in this study survive?”, with incomplete information because (hopefully!) some patients survive past the end of the study. We instead have the question “how long until we see the bug?”, and we have incomplete information because some subtrees end before we find it. (The survival analysis framework assumes that you’d eventually see the bug.)

How we should account for the incomplete runs depends on what we expect the distribution of the CPU time to first bug in a subtree to look like. We’ll assume that first bugs occur independently (the time to first bug in one subtree doesn’t affect the time to first bug in another) and at a constant average rate λ per unit time. This leads to an exponential distribution for the time until you see the bug, with a higher probability of seeing the bug soon, and a decaying probability of seeing it at later times. The speed of decay is set by the rate λ.

For example, if you saw 1 bug in a 200 CPU-hour run, your total survival time T might be 195 hours. So λ would be 1/195, and plugging this into the distribution you would get a 36% chance of not seeing the bug in one 200-hour run, and a 13% chance of not seeing it in two runs.
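
As a quick check on those numbers, here’s the exponential survival probability computed directly (a small Python sketch using the 195-hour survival time and 200-hour runs from the example):

```python
import math

lam = 1 / 195    # estimated rate: 1 bug per 195 CPU hours of survival time
run_hours = 200  # CPU hours per test run

for runs in (1, 2):
    # P(no bug after this much testing) = exp(-λt) for an exponential distribution
    p_no_bug = math.exp(-lam * runs * run_hours)
    print(f"after {runs} run(s): {p_no_bug:.0%} chance of not seeing the bug")
# -> 36% and 13%, matching the figures above
```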

For an exponential distribution, we can make a surprisingly simple estimate of the mean time-to-bug. Say you have M subtrees where you saw the bug (in our example, M=3). Then we find that the most likely value for the constant rate in our exponential distribution is λ = M/T. The mean time-to-bug is then the mean value of this distribution, which is 1/λ for an exponential distribution:

mean time-to-bug = T/M.

Notice that for common bugs that you find in every subtree, M is just N, the total number of subtrees, and this reduces to a simple average of the times. For rarer bugs that don’t turn up in every run, M will be smaller than N and so the mean time-to-bug estimate will be larger. This makes sense, because you’d expect the runs where you haven’t seen the bug yet to increase the mean time.
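
Putting the last few steps together, here’s a short sketch of the whole estimate. The per-subtree numbers are invented for illustration, but the bookkeeping is as described above: a subtree contributes its CPU time up to the first bug, or its whole CPU time if no bug was seen, and only subtrees where the bug appeared count towards M:

```python
# One entry per subtree: (cpu_hours, saw_bug).
# cpu_hours is the time up to the first bug if one was seen,
# or the subtree's total CPU time if not. Numbers are invented.
subtrees = [
    (40.0, True),
    (25.0, True),
    (60.0, False),
    (15.0, True),
    (55.0, False),
]

T = sum(hours for hours, _ in subtrees)   # total survival time
M = sum(1 for _, saw in subtrees if saw)  # subtrees where the bug appeared

print(f"T = {T:.0f} CPU hours, M = {M}")
print(f"estimated rate λ = M/T = {M / T:.4f} bugs per CPU hour")
print(f"mean time-to-bug = T/M = {T / M:.0f} CPU hours")
```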

Back to the findability graph

Now we’re ready to go back to the original question of how long we have to wait to be confident we’ve fixed the bug.

Let’s look at the findability graph from the beginning of the post again:

The findability graph again.

To create this graph, we calculate a p_survival value for each run: the probability that we could have tested for that long without seeing the bug if it were still there. When this drops low enough, we’ll count the bug as fixed.

For a run in which we saw the bug, calculating p_survival is easy. We have p_survival = 1 because no time has yet passed since we saw the bug.

Now let’s take a run where you didn’t see the bug. This is where the calculation from the previous section comes in. We found the most likely value of the rate parameter λ for an exponential distribution: λ = M/T, where M is the number of subtrees where you saw the bug, and T is the sum of all the time intervals.

As a first rough estimate, we can plug this value of λ into the exponential distribution to find the probability that you would not have seen the bug in a given length of CPU time. From there we can get a value for p_survival.
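
That rough estimate is a one-liner. In this sketch, t is the CPU time tested since we last saw the bug (the function name is ours, for illustration, not part of any API):

```python
import math

def p_survival_plugin(t, T, M):
    """Chance of seeing no bug in t more CPU hours, assuming the bug
    still exists and occurs at the most likely rate λ = M/T."""
    return math.exp(-(M / T) * t)

# The earlier example: M=1 bug, T=195 hours of survival time, one 200-hour run.
print(f"{p_survival_plugin(200, T=195, M=1):.0%}")  # -> 36%
```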

However, we actually do something a bit more complicated. The value of λ we calculated is the most likely value for the average rate, but with a small sample of bugs we may be uncertain about the exact value: maybe it’s actually a bit higher or lower. So really we want to take a weighted average of exponential distributions for different values of λ, weighting each by how likely we think that value is.

Once we do that, we find that the resulting curve, a mixture of exponential distributions, has a heavier tail than a single exponential distribution, meaning that it decays more slowly. This gives us more conservative estimates for when the bug is fixed than exponential decay, which is useful because rare bugs that reoccur in production are scary and we want to be sure that they’re definitely fixed.

Taking the example we used before of a total survival time T of 195 hours, with this heavier-tailed distribution you’d get a 49% chance of not seeing it in one 200 hour run, and a 33% chance of not seeing it in two runs. These are higher numbers than before, because the estimate is more conservative and gives more probability to seeing the bug later.
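
For illustration, here’s one standard weighting (a gamma-weighted mixture of exponentials, with λ integrated out) that yields the survival curve (T / (T + t))^M and reproduces the numbers above. Take it as a sketch of the idea rather than our exact production calculation:

```python
def p_survival_heavy(t, T, M):
    # Gamma-weighted mixture of exponentials: heavier-tailed than a single
    # exponential, so it decays more slowly and is more conservative.
    # (A sketch: the exact weighting we use in production may differ.)
    return (T / (T + t)) ** M

T, M = 195, 1
for runs in (1, 2):
    p = p_survival_heavy(runs * 200, T, M)
    print(f"after {runs} run(s): {p:.0%}")  # -> 49%, 33%
```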

There’s one extra complication. Each point on the graph represents a different run, and you might have made changes to your system between runs, altering the frequencies of bugs that you see. To help account for this, we take a weighted average of data from multiple runs to calculate total values for T and M, and use these to calculate the decay rate for the curve.

We now have all the pieces we need to understand the findability graph. We’ve already discussed the projected decay curve. You can also see similar decay behaviour for the existing runs. The difference is that these are real runs rather than projected ones, so they are spaced out variably in time and the shape of the curve is less obvious. They also decay at a different rate, because the calculated value of λ was different at the time. But it’s the same general concept.

Deciding when to stop looking

In software testing, we’re always in the difficult position of trying to prove a negative. We can prove that a bug still exists by finding it, but if we don’t find it we still can’t say with absolute certainty that it’s gone. This is frustrating when you just want to know that your software really, truly, for sure has no bugs. But certainty is a high bar, and very little meets that standard of proof.

From a practical point of view, what we can do is quantify how much testing we need to do to have confidence that your flaky bug is gone, and then throw that amount of disruption at your system. In the course of each overnight Antithesis test run, your system experiences the equivalent of years or decades of production chaos. The findability graph is a guide to how much chaos is enough.


If you’ve found yourself interested in findability, you might also like to find out more about how Antithesis works. See our video Antithesis in less than 5 minutes.