Finding More Bugs

Do these things first

Many of the most important techniques for finding more bugs are covered elsewhere in this documentation:

  • Ensure you have a workload that exercises the most important parts of your system.

  • Use Sometimes Assertions to validate that you’re testing what you think you are.

  • Confirm that we find your most important bugs every time and consider increasing your usage if we don’t.

  • Enable instrumentation for your code so we can bring more bug-finding techniques to bear.

  • Monitor the efficiency of your testing, and take steps to optimize the performance of your system under Antithesis.

If you’ve done all of the above, and want to push your testing even further, there are some advanced techniques that will help you find every last bug.

Make rare events common

A common problem in software testing is that some code paths are very rarely exercised in the normal course of operation: certain kinds of error handling, failure recovery, retries, and so on. Antithesis helps with this problem by injecting faults into your environment at an elevated rate, and you can help further by tuning your program for our test environment. But sometimes it’s worth going even further to try to provoke these rare events.

One technique that can accomplish this is “buggification”: when running under test, you occasionally make an endpoint or interface deliberately throw an error that its contract already permits. The goal is to increase test coverage of the code that has to handle those errors.

For example, consider this code, which implements an RPC interface that processes some data in a way that can occasionally fail:

void my_function(Data data) {
    try {
        int result = process_data(data);
        send(result);
    } catch (Exception e) {
        send("Unable to process data. Exception was: " + e);
    }
}

The caller of this function needs to be able to handle the error response, but if the process_data function very rarely throws, that error handling might not get tested frequently. This code could be buggified by adding a check to the beginning of the try block:

void my_function(Data data) {
    try {
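        // Inject an artificial failure about 1% of the time, only under test.
        // Throw an Exception subtype so that the catch block below handles it.
        if (running_under_antithesis() && random() <= 0.01) {
            throw new RuntimeException("Artificial error introduced by buggification in my_function()");
        }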
        int result = process_data(data);
        send(result);
    } catch (Exception e) {
        send("Unable to process data. Exception was: " + e);
    }
}

Now the function has a 1% chance of returning an error when tested in our environment, so the error handling of its callers is exercised far more often than it would be otherwise.
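If several call sites are buggified, it can help to centralize the decision in one place. The following is a minimal sketch, not part of any Antithesis SDK: the `Buggify` class, its constructor parameters, and the seeded `java.util.Random` are all assumptions made for illustration. How you detect that you are running under test (the `running_under_antithesis()` check above) is left to the caller, who passes the result in as `enabled`.

```java
import java.util.Random;

/**
 * Hypothetical helper (an illustration, not a real SDK class) that decides
 * whether a call site should inject an artificial failure.
 */
final class Buggify {
    private final boolean enabled;     // true only when running under test
    private final double probability;  // chance that a single check fires
    private final Random random;

    Buggify(boolean enabled, double probability, long seed) {
        if (probability < 0.0 || probability > 1.0) {
            throw new IllegalArgumentException("probability must be in [0, 1]");
        }
        this.enabled = enabled;
        this.probability = probability;
        this.random = new Random(seed);
    }

    /** Returns true when the caller should inject an artificial failure. */
    boolean shouldFail() {
        // nextDouble() is uniform in [0.0, 1.0), so a probability of 0.0
        // never fires and a probability of 1.0 always fires.
        return enabled && random.nextDouble() < probability;
    }

    /** Throws an artificial exception at the configured rate; otherwise does nothing. */
    void maybeThrow(String site) {
        if (shouldFail()) {
            throw new RuntimeException(
                "Artificial error introduced by buggification in " + site);
        }
    }
}
```

With a helper like this, the check at the top of the try block becomes a single call such as `buggify.maybeThrow("my_function()")`, and the injection rate can be tuned in one place.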

Swarm testing

Consider the following simple program:

static int counter = 0;

void increment() {
    counter = counter + 1;
    if (counter > 100) {
        crash_and_die();
    }
}

void decrement() {
    counter = counter - 1;
    if (counter < -100) {
        crash_and_die2();
    }
}

int get_counter() {
    return counter;
}

This program’s API has three functions: one that increments a counter, one that decrements it, and one that returns its value. Unfortunately, there is also a bug in the program! If the counter value ever becomes greater than 100, or less than -100, the program will crash. This may seem silly and contrived, but many real-world examples of bugs have a similar structure (consider a bug in a garbage-collection routine that only runs when some data structures get large).

A good randomized workload for this program would sometimes call the increment() function, sometimes call the decrement() function, and sometimes call the get_counter() function. But here’s the problem: if the workload calls each of those functions with equal probability, then increments and decrements tend to cancel each other out, and it becomes very unlikely that the counter will drift far enough in either direction to trigger the bug within a reasonably short test.

Swarm testing is an approach to solving this problem. The idea behind it is that you can sometimes find more bugs by turning off parts of your testing system, or parts of the software that you’re testing. In the example above, if we disabled the decrement() function, either in the workload or in the program itself, then we would be much more likely to find the bug that triggers when the counter grows too large.

The obvious drawback is that we would stop testing the decrement() function, and would never find any bugs in that code. So the swarm testing approach is to do this only sometimes. For example, each time the test harness initializes, it could randomly choose a subset of program features to enable and disable the rest.
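As a rough illustration of the idea, the sketch below picks a random subset of the three operations at the start of each run and then draws only from that subset. Everything here is an assumption made for the example: the `SwarmWorkload` class and its method names are hypothetical, and the counter is modeled inline rather than by calling the real program.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical swarm-style workload for the counter example. */
final class SwarmWorkload {
    enum Op { INCREMENT, DECREMENT, GET_COUNTER }

    /** Picks a random, non-empty subset of operations to enable for one run. */
    static List<Op> chooseEnabledOps(Random random) {
        List<Op> enabled = new ArrayList<>();
        for (Op op : Op.values()) {
            if (random.nextBoolean()) {
                enabled.add(op);
            }
        }
        if (enabled.isEmpty()) {
            // Never run with nothing enabled: fall back to one random operation.
            Op[] all = Op.values();
            enabled.add(all[random.nextInt(all.length)]);
        }
        return enabled;
    }

    /**
     * Runs up to `steps` operations drawn uniformly from `enabled`.
     * Returns true if the modeled counter left the range [-100, 100],
     * which is the point at which the example program would crash.
     */
    static boolean runOnce(List<Op> enabled, int steps, Random random) {
        int counter = 0;
        for (int i = 0; i < steps; i++) {
            Op op = enabled.get(random.nextInt(enabled.size()));
            switch (op) {
                case INCREMENT:
                    counter++;
                    break;
                case DECREMENT:
                    counter--;
                    break;
                case GET_COUNTER:
                    break; // reading the counter does not change it
            }
            if (counter > 100 || counter < -100) {
                return true;
            }
        }
        return false;
    }
}
```

A harness would call `chooseEnabledOps` once per test run and pass the result to `runOnce`. Runs that happen to disable decrement() reach the upper bound quickly, while other runs still exercise decrement(), so neither function goes untested across the whole swarm.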

The same technique can be used to randomly reset configuration and tuning parameters on each test, so that you’re more likely to encounter combinations of configuration settings that produce errors.
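A configuration-randomizing harness might look something like the sketch below. The parameter names (`cache_size_entries`, `retry_limit`, `batch_size`) and their ranges are invented for illustration and do not correspond to any real system; the `RandomConfig` class is likewise hypothetical. The configuration is derived from a seed so that the same settings can be regenerated later.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

/** Hypothetical generator of randomized configuration for one test run. */
final class RandomConfig {
    /** Builds a configuration deterministically from the given seed. */
    static Map<String, Integer> generate(long seed) {
        Random random = new Random(seed);
        Map<String, Integer> config = new LinkedHashMap<>();
        // Mix extreme and typical values so unusual combinations show up.
        config.put("cache_size_entries", pick(random, 1, 16, 1024, 65536));
        config.put("retry_limit", random.nextInt(6));       // 0 to 5 inclusive
        config.put("batch_size", 1 + random.nextInt(256));  // 1 to 256 inclusive
        return config;
    }

    /** Returns one of the given choices, chosen uniformly at random. */
    private static int pick(Random random, int... choices) {
        return choices[random.nextInt(choices.length)];
    }
}
```

Including deliberately small values, such as a cache with a single entry, is one way to make rarely taken paths like eviction run on almost every operation in some fraction of tests.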

Contact us if you need help with this.