Writing better software tests
This is a menu of techniques for strengthening your software testing.
These principles apply no matter how you’re testing your software today, but they’re especially relevant if, like most developers, you’re currently writing a battery of unit, integration, and end-to-end tests by hand. These techniques can all be layered into your tests by hand, by using libraries like Hypothesis, or with external tools like third-party fuzzers.
Antithesis incorporates many of these techniques into its autonomous testing platform, giving you a single solution that integrates virtually every best practice known in software testing today.
Leverage randomness
Example-based tests are deterministic and only cover scenarios you explicitly define. Incorporating randomness covers a wider range of cases and often flushes out bugs you can’t predict in advance.
Randomization can be applied effectively at any level of testing — unit, integration, or end-to-end. Unit tests benefit from randomizing inputs, while integration and end-to-end tests benefit from randomizing the functions being called and their ordering.
Property-based tests are inherently randomized – they use random inputs to ensure that the relevant properties hold in a wide range of situations.
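For instance, here is a minimal sketch of a randomized, property-based style unit test in C++. The my_sort function under test is a hypothetical stand-in; the point is that instead of asserting on a handful of hand-picked examples, the test generates random inputs and checks that the relevant properties hold for each one.

#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

// Hypothetical function under test: some custom sort we want to check.
std::vector<int> my_sort(std::vector<int> v) {
    std::sort(v.begin(), v.end());  // stand-in implementation for this sketch
    return v;
}

int main() {
    std::mt19937 rng(std::random_device{}());
    std::uniform_int_distribution<int> len_dist(0, 100);
    std::uniform_int_distribution<int> val_dist(-1000, 1000);

    for (int i = 0; i < 10000; i++) {
        // Generate a random input rather than a hand-picked example.
        std::vector<int> input(len_dist(rng));
        for (int& x : input) x = val_dist(rng);

        std::vector<int> output = my_sort(input);

        // Property 1: the output is sorted.
        assert(std::is_sorted(output.begin(), output.end()));
        // Property 2: the output is a permutation of the input.
        assert(std::is_permutation(output.begin(), output.end(),
                                   input.begin(), input.end()));
    }
    return 0;
}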
Randomness can also be applied to less obvious areas, like:
- System configuration
- Process concurrency
- Test duration
- Timing and pattern of validations
Popular random testing (or fuzzing) libraries include libFuzzer for C/C++, Randoop for Java, QuickCheck for Haskell, Go-fuzz for Go, cargo-fuzz for Rust, Hypothesis for Python, AFL/AFL++, and Honggfuzz.
The effectiveness of random testing is affected by the test input distribution, so once you’ve incorporated randomness, it helps to tune it.
Tuning randomness in testing
Consider a random test you wrote to test functions a() and b().
typedef void (*func_t)();
func_t choose_function() {
    return antithesis.random.choose([a,b]);
}
void test() {
    while (true) {
        func_t func = choose_function();
        func();
    }
}
This test will explore all combinations of function calls, given infinite time. But if you think that calling the function a() twice in a row is more likely to find a bug, it’s tempting to change the test to this:
typedef void (*func_t)();
func_t choose_function() {
    return antithesis.random.choose([a,b]);
}
void test() {
    while (true) {
        func_t func = choose_function();
        func();
        func();
    }
}
If your assumption is correct, this tuned version finds the bug slightly faster on average than a truly random sequence. However, it never tests combinations that interleave b() in between, like a() -> b() -> a(), so if there’s a bug hidden in that sequence of calls, your tests will never find it.
A better approach is to tune the randomness so that two a() calls in a row become more likely, not mandatory. That improves the quality of the test: it finds the suspected bug faster than pure randomness while still being able to reach every other ordering.
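One way to express that bias is to weight the random choice instead of hard-coding the repetition. Here is a minimal C++ sketch using the standard library rather than the pseudocode above; the 3:1 weighting toward a() is an arbitrary illustrative choice.

#include <random>

void a();  // the functions under test, as in the example above
void b();

typedef void (*func_t)();

// a() is chosen more often than b(), but every interleaving of
// a() and b() remains possible.
func_t choose_function(std::mt19937& rng) {
    static func_t functions[] = {a, b};
    std::discrete_distribution<int> weights({3.0, 1.0});
    return functions[weights(rng)];
}

void test() {
    std::mt19937 rng(std::random_device{}());
    while (true) {
        func_t func = choose_function(rng);
        func();
    }
}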
Enhancing pure random testing is an active research area. Some popular approaches include:
- Adaptive Random Testing (ART) - Improves random testing to generate more evenly spaced inputs.
- Randoop: Feedback-Directed Random Testing for Java - Uses execution results as feedback to generate bug-revealing test cases and regression test cases.
- AFL++ - A coverage-guided fuzzer that mutates random seeds to generate new test cases.
- Tuning Random Generators - Treats generator weights as parameters and optimizes them to maximize bug discovery.
Swarm testing
Consider the following simple program:
static int counter = 0;
void increment() {
    counter = counter + 1;
    if (counter > 100) {
        crash_and_die();
    }
}
void decrement() {
    counter = counter - 1;
    if (counter < -100) {
        crash_and_die2();
    }
}
int get_counter() {
    return counter;
}
This program’s API has three functions: one that increments a counter, one that decrements it, and one that returns its value.
Unfortunately, there’s also a bug in the program! If the counter value ever becomes greater than 100, or less than -100, the program will crash. This may seem contrived, but many real-world examples of bugs have a similar structure (consider a bug in a garbage-collection routine that only runs when some data structures get large).
A pure randomized workload for this program would sometimes call the increment() function, sometimes call the decrement() function, and sometimes call the get_counter() function. But if the workload calls each of those functions with equal probability, then it’s exponentially unlikely that the counter will ever get large enough or small enough to trigger the bug.
Swarm testing is an approach to solving this problem. The idea behind it is that you can sometimes find more bugs by turning off parts of your testing system, or parts of the software that you’re testing. In the example above, if you disabled the decrement() function, either in the workload or in the program itself, then you’d be much more likely to find the bug that triggers when the counter grows too large.
The problem, obviously, is that by doing this you’d stop testing the decrement() function, and would never find any bugs in that code. So, done correctly, swarm testing only turns off some of the software some of the time. Perhaps each time the test harness initializes, it randomly chooses a subset of program features to enable or disable, or it chooses a random combination of configuration settings that abstracts this behavior.
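A minimal C++ sketch of such a harness for the counter example above (the 50% keep-probability and the 1,000-operation run length are arbitrary choices):

#include <random>
#include <vector>

// The three API functions from the counter example above.
void increment();
void decrement();
int get_counter();

typedef void (*op_t)();
void call_get_counter() { get_counter(); }  // wrapper so every operation shares a signature

void test() {
    std::mt19937 rng(std::random_device{}());
    std::bernoulli_distribution keep(0.5);

    // Each time the harness starts, enable a random subset of operations.
    // Runs that happen to disable decrement() push the counter upward and
    // are far more likely to reach the crash at counter > 100.
    std::vector<op_t> all_ops = {increment, decrement, call_get_counter};
    std::vector<op_t> enabled;
    for (op_t op : all_ops) {
        if (keep(rng)) enabled.push_back(op);   // keep each operation with probability 1/2
    }
    if (enabled.empty()) enabled = all_ops;     // never disable everything

    std::uniform_int_distribution<size_t> pick(0, enabled.size() - 1);
    for (int i = 0; i < 1000; i++) {
        enabled[pick(rng)]();
    }
}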
Cover the entire surface
When you’re testing software, one of your objectives is to achieve a high level of coverage. Ideally, you want to cover the entire surface area of the system under test, and test any legal sequence of events in your system.
But every system has areas of code that are often overlooked, for example configuration or administration APIs. These are areas where bugs congregate. Your tests should, as much as possible, bring the system up from a cold start and thoroughly exercise the setup and configuration phase as well as administrative operations.
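With a randomized workload like the ones above, one concrete way to do this is to put configuration and administrative operations into the same pool of randomly chosen actions as the ordinary workload. Every operation name in this sketch is a hypothetical placeholder for your system’s API.

#include <random>
#include <vector>

typedef void (*op_t)();

// Ordinary workload operations plus the easily-overlooked ones.
void write_record();
void read_record();
void change_replication_setting();   // configuration API
void add_admin_user();               // administration API

void test() {
    // The harness itself brings the system up from a cold start before this
    // loop runs, so the setup phase is exercised on every test run.
    std::mt19937 rng(std::random_device{}());
    std::vector<op_t> ops = {write_record, read_record,
                             change_replication_setting, add_admin_user};
    std::uniform_int_distribution<size_t> pick(0, ops.size() - 1);
    while (true) {
        ops[pick(rng)]();   // configuration and admin paths get exercised too
    }
}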
Test “good” crashes
Sometimes software is supposed to exit.
It’s tempting to try to avoid creating “expected” panics, shutdowns, or failures during testing, but this is a mistake!
For instance, consider these examples:
- If a certain connection is unavailable for too long, the system shuts down.
- A surprise shutdown never results in inconsistent data.
- Our system eventually recovers from network-driven crashes.
These are all important behaviors for a system to have, and we should make sure they sometimes happen during our tests!
In practice, recovery processes tend to hide a lot of bugs, and we want to make sure we have a chance to catch them.
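A rough sketch of how a harness can treat an expected shutdown as a legitimate outcome rather than a failure, while still checking its guarantees (the Outcome enum and the helper functions are hypothetical):

#include <cstdio>

// Hypothetical helpers: drive the system one step and report how it is doing.
enum class Outcome { RUNNING, CLEAN_SHUTDOWN };
Outcome run_workload_step();
bool data_is_consistent();
void restart_system();

void test() {
    bool exercised_shutdown_path = false;
    for (int i = 0; i < 100000; i++) {
        if (run_workload_step() == Outcome::CLEAN_SHUTDOWN) {
            // An "expected" exit is not a test failure; check its guarantees instead.
            if (!data_is_consistent()) {
                std::printf("BUG: shutdown left the data inconsistent\n");
                return;
            }
            exercised_shutdown_path = true;
            restart_system();   // recovery code tends to hide bugs, so run it often
        }
    }
    if (!exercised_shutdown_path) {
        std::printf("WARNING: the expected-shutdown path was never exercised\n");
    }
}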
Make rare events common
Another common pitfall in testing is not exercising all the code paths of your system. Certain kinds of error handling, failure recovery, retries, and so on should be deliberately provoked during testing.
One technique for accomplishing this is “buggification”: when running under test, an endpoint or interface occasionally and deliberately throws errors that it is permitted to throw as part of its contract, purely to increase test coverage.
For example, consider this code which implements an RPC interface that processes some data in a way that can occasionally fail:
void my_function(Data data) {
    try {
        int result = process_data(data);
        send(result);
    } catch (Exception e) {
        send("Unable to process data. Exception was: " + e);
    }
}
The caller of this function needs to be able to handle exceptions, but if the process_data function very rarely throws, this error-handling might not get tested frequently. This code could be buggified by adding the following to the beginning of the try block:
void my_function(Data data) {
    try {
        if (running_under_test() && random() <= 0.01) {
            throw new RuntimeException("Artificial error introduced by buggification in my_function()");
        }
        int result = process_data(data);
        send(result);
    } catch (Exception e) {
        send("Unable to process data. Exception was: " + e);
    }
}
Now the function has a 1% chance of returning an error during testing, ensuring that the error handling of its callers is adequately tested.
Exercise concurrency
Incorporating concurrency significantly improves how you exercise entire categories of functionality in your system. Most systems support some degree of concurrent use: multiple clients connecting simultaneously to a service, multiple concurrent transactions on a database, or multi-threaded access to an in-process library. If your system supports any of these modes of behavior, then you also need to exercise it in this way.
The amount of concurrency (number of threads, number of simultaneously running containers, or number of pipelined asynchronous requests) is also an important tuning parameter. Having too much concurrency could swamp a service and cause it to fail in uninteresting ways, or could simply make the tests very inefficient.
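As a rough C++ sketch, the harness might run the same randomized client loop on a configurable number of threads. Both client_loop and the thread count here are illustrative assumptions.

#include <thread>
#include <vector>

// Hypothetical per-client workload: each thread issues its own randomized
// sequence of requests against the shared system under test.
void client_loop(int client_id);

// The amount of concurrency is itself a tuning parameter: too little and
// concurrency bugs stay hidden, too much and the service just falls over.
constexpr int NUM_CLIENTS = 8;

void test() {
    std::vector<std::thread> clients;
    for (int i = 0; i < NUM_CLIENTS; i++) {
        clients.emplace_back(client_loop, i);
    }
    for (std::thread& t : clients) {
        t.join();
    }
}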
In short, parallelism, like randomness, is something that needs to be both leveraged and managed.
Validate everything, everywhere but not all at once
In example-based testing, you’d generally only validate the output of the operation at the end. This leaves plenty of room for bugs to slip through.
Validate from both inside and outside your system
Many important validations in a piece of software can be defined with just local views of the system, such as:
void mark_dispatched(ref_ptr<ITask> t) {
    ASSERT(pending_tasks.contains(t), "A task should be pending when we dispatch it");
    ASSERT(std::find(dispatched_tasks.begin(), dispatched_tasks.end(), t) == dispatched_tasks.end(),
           "A task should not already be dispatched when we dispatch it");
    pending_tasks.erase(t);
    dispatched_tasks.push_back(t);
}
But we also need to validate end-to-end properties like consistency and availability. These can be checked from outside the system, in the client code, such as:
ASSERT(produced_tasks.size() == completed_tasks.size(), "Every produced task should be completed");
Validate continuously
Consider this example code:
void validate_system() {
    ...
}
typedef void (*func_t)();
func_t choose_function() {
    return antithesis.random.choose([a,b]);
}
void test() {
    for (int i = 0; i < 100000; i++) {
        func_t func = choose_function();
        func();
    }
    validate_system();
}
There are three important reasons not to do this.
First, the entire test needs to run to completion before you can tell if a bug has occurred. Depending on how long the test is and how many resources it uses, this can be very inefficient!
Second, it’s possible for bugs to “cancel out”. Imagine if the test provokes our system into a broken state, and then later, by random luck, gets back out of that state again. A validation phase at the end of the test would completely miss this sequence of events.
Third, debugging is just more difficult if there’s a long, complicated, and mostly irrelevant history leading up to the bug. We therefore recommend that you validate continuously, with a repeating pattern of work → validate → work.
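A continuously validating version of the test above might look like this sketch; validating after every tenth operation is an arbitrary choice, and how often you can afford to validate depends on how expensive validate_system() is.

typedef void (*func_t)();
void validate_system();    // the same system-wide checks as before
func_t choose_function();  // as defined above

void test() {
    for (int i = 0; i < 100000; i++) {
        func_t func = choose_function();
        func();
        if (i % 10 == 0) {
            validate_system();   // work -> validate -> work, not one check at the very end
        }
    }
    validate_system();           // a final pass is still worthwhile
}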
Validate eventually
At the same time, software properties like availability can be trickier to express. If a test relies on querying one of your services, for instance, it cannot pass while the network link between the workload and that service is down. What you want in this case is that eventually, once conditions improve, your system recovers.
It’s particularly important to distinguish between safety and availability (liveness) to prevent our tests from getting cluttered up with false positives. If our tests crash or log fatal error messages when they encounter “expected” errors, that will mask real bugs in our client library, which does need to work in production in the face of such issues.
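One way to express this kind of liveness property is to tolerate failures while the test runs and only report a bug if the system never recovers. In this sketch, service_reachable() is a hypothetical helper that issues a simple query, and the retry budget is arbitrary.

#include <cassert>
#include <chrono>
#include <thread>

bool service_reachable();   // hypothetical helper: does the service answer a simple query?

void check_eventual_recovery() {
    bool recovered = false;
    for (int attempt = 0; attempt < 600 && !recovered; attempt++) {
        // Transient failures here are expected (for example, while a network
        // fault is being injected) and are not treated as bugs.
        recovered = service_reachable();
        if (!recovered) {
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
    }
    // Only a failure to *ever* recover violates the liveness property.
    assert(recovered && "System should eventually recover once conditions improve");
}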
Validate at the end when necessary
There are advantages to validating throughout a test, but some powerful properties only make sense when there’s no more work to do. Properties that fit here are things like checking that our data is consistent, making sure a process finishes gracefully, or looking at the actual results of some systemwide operation.
Configure for testing
When you run system level tests, you should consider modifying your production configurations in the testing environment.
For example:
- Your production environment might have a periodic background process, such as data compaction, data validation, or garbage collection, that runs every 48 hours. Since most tests simulate a much shorter real-time duration, it might make sense to run this process more frequently.
- Your production code might have a heartbeat or failure-detection threshold that is set to a value on the order of minutes to avoid spuriously triggering during a GC pause. You may want to set this to a much lower value during testing, so you can test your failover code.
- A distributed storage system might split data into shards whenever the volume on a single node exceeds 1TB. If the entire test runs with 1MB of data, the shard-splitting and data-movement logic will never be exercised by our tests. In this case, you might lower the split threshold to 1KB.
- If you’re testing leader elections, the production configurations or thresholds that trigger a leader election should be scaled down in the testing environment.
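Each of these comes down to the same pattern: keep the production values as defaults and scale them down when running under test. A minimal sketch of that pattern, with illustrative names and values:

#include <cstdint>

struct Config {
    // Production defaults.
    int64_t compaction_interval_secs = 48 * 60 * 60;   // background process every 48 hours
    int64_t heartbeat_timeout_ms     = 5 * 60 * 1000;  // minutes, to survive GC pauses
    int64_t shard_split_threshold    = 1LL << 40;      // split shards at 1 TB
};

// Scale the thresholds down under test so the rare code paths actually run.
Config make_config(bool running_under_test) {
    Config c;
    if (running_under_test) {
        c.compaction_interval_secs = 30;     // compaction runs many times per test
        c.heartbeat_timeout_ms     = 500;    // failover logic triggers quickly
        c.shard_split_threshold    = 1024;   // shard splitting happens at 1 KB
    }
    return c;
}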