
Best practices for autonomous testing

Autonomous testing is the practice of using a computer to generate tests for a software system. This overcomes the key problem that occurs when human developers do software testing – the impossibility of anticipating, and writing a test for, every possible state of a software system.

This article outlines why it’s so difficult for human developers to write effective tests for software, and some strategies for using a computer program to do so instead.

The central problem of software testing

The goal of testing a piece of software, at least notionally, is to exercise all of the functionality in the system so you can see what happens.

This is hard because most systems are complex and stateful. For example, if we have an API with two functions, a and b, a naive guess would be that it suffices to have two tests:

void test1() {
    a();
}

void test2() {
    b();
}

But of course this isn’t true! Functions can have side effects on the state of our system – maybe one of these functions leaves our system in a state where calling the other one will have some new effect. And of course we need to try them in both orders.

void test3() {
    a();
    b();
}

void test4() {
    b();
    a();
}

But wait! Who says we can only call each of those functions once? What if one of the functions contains a memory leak and calling it a thousand times is what causes our program to crash? Pretty soon we can end up in this kind of situation:

void test37411() {
    b();
    a();
    a();
    a();
    a();
    b();
    a();
    ...etc.
}

And all of that is in a simplified model of an API with just two functions that each take zero parameters, and without considering concurrency, resilience to out-of-band faults, network errors, etc. This combinatorial explosion of possibilities is one of the fundamental reasons that testing is so hard, and why getting exhaustive behavioral coverage of a system is often impractical when a human is doing the testing.
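The explosion is easy to quantify: with n zero-argument functions there are n^k distinct call sequences of length k. A tiny sketch (the `sequences` helper here is hypothetical, just for counting):

```cpp
#include <cstdint>

// With n zero-argument functions, there are n^k distinct call sequences
// of length k. Even n = 2 explodes almost immediately.
uint64_t sequences(uint64_t n, unsigned k) {
    uint64_t total = 1;
    while (k--) total *= n;
    return total;
}
```

Even for our two-function API, sequences(2, 30) already exceeds a billion distinct tests.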

How does autonomous testing help?

In autonomous testing, we’re using a computer to generate tests for a software system.

For example, we can write a program that tries different combinations of things that a user or client could do, under different environmental conditions. If we run this in a loop for long enough, it will eventually discover every possible such combination, and hence every possible state of the system.

This would be a relatively crude, impractical version of autonomous testing, but it illustrates the underlying idea.
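As a sketch, such a crude generator might look like this, assuming two hypothetical operations `a` and `b` that mutate shared state:

```cpp
#include <functional>
#include <random>
#include <vector>

// Hypothetical system under test: two operations that mutate shared state.
static int counter = 0;
void a() { counter++; }
void b() { counter--; }

// Pool of every operation the generator is allowed to try.
std::vector<std::function<void()>> operations() {
    return {a, b};
}

// Crude autonomous tester: pick operations uniformly at random in a loop.
// Run long enough, this eventually produces every possible call sequence.
void run_random_test(unsigned seed, int steps) {
    std::mt19937 rng(seed);
    auto ops = operations();
    std::uniform_int_distribution<size_t> pick(0, ops.size() - 1);
    for (int i = 0; i < steps; i++) {
        ops[pick(rng)]();
    }
}
```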

In practice, most software systems have state spaces so large they’re effectively infinite. But if our program is sophisticated enough, it can discover all the most interesting behaviors, or at least an asymptotic approximation thereof, in a reasonable span of time.

Best practices for autonomous testing

When using a computer to generate tests, there are three principles to bear in mind:

  1. Try everything sometimes.
  2. Notice misbehavior when it happens.
  3. Leverage randomness.

Working with these ensures that we’re maximizing the leverage computers offer when we use them for test generation.

Try everything sometimes

Our goal is to make sure that we have some chance of producing any legal sequence of operations against our API. The most important way to achieve that is to make sure that the entire API surface is actually exercised.

This may seem obvious, but some functionality is easy to overlook – for example, configuration or administration APIs. As much as possible, the system should come up from a cold start, and we should call configuration or administration APIs to get it ready before testing the other functionality.
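One way to sketch this, assuming a hypothetical `Service` with an admin surface: bring the system up cold on every run and exercise its configuration API as part of the test, rather than treating it as a fixture.

```cpp
#include <string>

// Hypothetical system with an admin surface that's easy to overlook.
struct Service {
    bool configured = false;
    int writes = 0;
    void configure(const std::string&) { configured = true; }
    void write() { if (configured) writes++; }
};

// Cold-start harness: the configuration API is itself under test,
// so the admin surface gets covered on every run.
Service cold_start() {
    Service s;                     // fresh instance, no reused state
    s.configure("replication=3");  // exercised as part of the test
    return s;
}
```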

Test “good” crashes

Sometimes software is supposed to exit.

It’s tempting to try to avoid creating “expected” panics, shutdowns, or failures during testing, but this is a mistake!

For instance, consider these examples:

  • If a certain connection is unavailable for too long, the system shuts down.
  • A surprise shutdown never results in inconsistent data.
  • Our system eventually recovers from network-driven crashes.

These are all important behaviors for a system to have, and we should make sure they sometimes happen during our tests!

In practice, recovery processes tend to hide a lot of bugs, and we want to make sure we have a chance to catch them.
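A sketch of deliberately provoking a "good" crash and then checking the recovery path, assuming a hypothetical `Node` that is supposed to shut down after a long disconnection:

```cpp
// Hypothetical system that is *supposed* to shut down after losing a
// connection for too long, and to recover cleanly on restart.
struct Node {
    bool running = true;
    int committed = 0;
    void lose_connection_for(int seconds) {
        if (seconds > 30) running = false;  // expected, "good" crash
    }
    void restart() { running = true; }      // recovery path under test
};

// Provoke the expected shutdown on purpose, then check recovery invariants:
// a surprise shutdown must never lose or corrupt committed data.
bool recovers_after_expected_crash(Node& n) {
    int before = n.committed;
    n.lose_connection_for(60);  // trigger the expected crash
    n.restart();
    return n.running && n.committed == before;
}
```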

Exercise concurrency

A more subtle way in which we can fail to exercise entire categories of functionality in our system is by neglecting concurrency. Most systems support some degree of concurrent use: multiple clients connecting simultaneously to a service, multiple concurrent transactions on a database, or multi-threaded access to an in-process library. If our system supports any of these modes of behavior, then we also need to exercise it in this way.

The amount of concurrency (number of threads, number of simultaneously running containers, or number of pipelined asynchronous requests) is also an important tuning parameter. Having too much concurrency could swamp a service and cause it to fail in uninteresting ways, or could simply make the tests very inefficient.

In short, parallelism in an autonomous test is something that needs to be both leveraged and managed.
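A sketch of treating the concurrency level as a randomized tuning parameter rather than a constant (the `client_workload` and the bounds here are illustrative assumptions, not a real harness):

```cpp
#include <atomic>
#include <random>
#include <thread>
#include <vector>

std::atomic<int> ops_done{0};

// One client's worth of concurrent work against the system under test.
void client_workload(int steps) {
    for (int i = 0; i < steps; i++) ops_done++;
}

// The number of concurrent clients is itself randomized: enough to expose
// races, bounded so the test doesn't just measure overload.
int run_concurrent_test(unsigned seed) {
    std::mt19937 rng(seed);
    int n_threads = std::uniform_int_distribution<int>(2, 8)(rng);
    std::vector<std::thread> clients;
    for (int t = 0; t < n_threads; t++)
        clients.emplace_back(client_workload, 100);
    for (auto& c : clients) c.join();
    return n_threads;
}
```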

Notice misbehavior when it happens

All software tests need to be validated against an oracle, something that provides information to determine if a test run was correct.

One frequent shortcoming in test design – whether autonomous or manual – is how the validation is done.

Validate from both inside and out

Many important properties in a piece of software can be defined with just local views of the system, such as:

void RolloutScheduler::mark_dispatched(ref_ptr<IRollout> r) {
    ASSERT(pending_rollouts.contains(r), "A rollout should be pending when we dispatch it");
    ASSERT(std::find(dispatched_rollouts.begin(), dispatched_rollouts.end(), r) == dispatched_rollouts.end(), "A rollout should not already be dispatched when we dispatch it");
    pending_rollouts.erase(r);
    dispatched_rollouts.push_back(r);
}

Adding assertions to our code is a great way to understand and define these, but we also need to consider external or end-to-end properties like consistency and linearizability. These can only be understood from outside the system, so it’s important that our tests don’t rely only on oracles or validations within the system we’re testing, but have validations built into the client code that’s exercising our system as well.
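One way to sketch external validation: the test client keeps its own model of what it wrote and checks every read against it. The `KvService` here is a hypothetical stand-in for the system under test.

```cpp
#include <map>
#include <string>

// Hypothetical key-value service. Its internal assertions can't see the
// client's view of history, so the client validates end-to-end as well.
struct KvService {
    std::map<std::string, int> data;
    void put(const std::string& k, int v) { data[k] = v; }
    int get(const std::string& k) const { return data.at(k); }
};

// External oracle: the client records its own expectation of every write
// and compares each read against that model.
struct ValidatingClient {
    KvService& svc;
    std::map<std::string, int> model;  // client-side expectation
    void put(const std::string& k, int v) { svc.put(k, v); model[k] = v; }
    bool check(const std::string& k) const { return svc.get(k) == model.at(k); }
};
```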

Validate continuously

One common mistake is to only validate system function in a validation phase at the end of your tests, like this:

void validate_system() {
    ...
}

typedef void(*func_t)();

func_t choose_function() {
    return antithesis.random.choose([a, b]);
}

void test() {
    for (int i = 0; i < 100000; i++) {
        func_t func = choose_function();
        func();
    }
    validate_system();
}

There are three important reasons not to do this.

First, it means we need to run the entire test to completion before we can tell if a bug has occurred. Depending on how long the test is and how many resources it uses, this can be very inefficient!

Second, and more importantly, it’s possible for bugs to “cancel out”. Imagine if the test provokes our system into a broken state, and then later, by random luck, gets back out of that state again. A validation phase at the end of the test would completely miss this sequence of events.

Third, debugging is just more difficult if there’s a long, complicated, and mostly irrelevant history leading up to the bug.

We therefore recommend that you validate continuously, with a repeating pattern of work → validate → work.
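A sketch of that pattern, assuming a toy system with one cheap invariant we can afford to check at every step:

```cpp
#include <random>

// Hypothetical system state plus a cheap invariant check.
static int balance = 0;
void deposit()  { balance += 10; }
void withdraw() { if (balance >= 10) balance -= 10; }
bool invariant_holds() { return balance >= 0; }  // must hold at every step

// work -> validate -> work: checking inside the loop means a transiently
// broken state can't "cancel out" before we notice, and failures surface
// with a short history behind them.
bool run_test(unsigned seed, int steps) {
    std::mt19937 rng(seed);
    std::bernoulli_distribution coin(0.5);
    for (int i = 0; i < steps; i++) {
        coin(rng) ? deposit() : withdraw();
        if (!invariant_holds()) return false;  // fail at the first bad state
    }
    return true;
}
```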

Validate eventually

At the same time, other properties, like availability, can be trickier to express. While we try to architect our systems to be robust to the real-life failures we face, it’s simply true that a test which (for instance) relies on querying one of our services cannot pass while the network link between the workload and that service is down. The properties that we really care about in cases like these are that eventually, when conditions are better, our system is able to recover.

It’s particularly important to distinguish between safety and liveness properties, to prevent our tests from getting cluttered up with false positives. If our tests crash or log fatal error messages when they encounter “expected” errors, that will mask real bugs in our client library, which does need to work in production in the face of such issues.

Validate at the end when necessary

There are advantages to validating throughout a test, but some powerful properties only make sense when there’s no more work to do. Properties that fit here are things like checking that our data is consistent, making sure a process finishes gracefully, or looking at the actual results of some systemwide operation.
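A sketch of an end-of-test check, assuming a hypothetical reconciliation property that only makes sense once the system is quiescent and all work is done:

```cpp
#include <numeric>
#include <vector>

// End-of-test check: some properties only make sense at quiescence,
// e.g. that the final totals across the whole system reconcile.
bool final_consistency(const std::vector<int>& debits,
                       const std::vector<int>& credits) {
    auto sum = [](const std::vector<int>& v) {
        return std::accumulate(v.begin(), v.end(), 0);
    };
    return sum(debits) == sum(credits);  // systemwide invariant at the end
}
```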

Leverage randomness

One of the great strengths of autonomous testing is that it will frequently flush out bugs that test-writers can’t predict, by using randomness.

Every part of a test is an opportunity to increase its randomness. In addition to randomizing the functions we call, the order in which we call them, and the inputs we give them, we can double down and randomize things like:

  • How is the system configured?
  • How many processes are running at a time?
  • How long does the test run?
  • When do we check that things look the way we expect?

Since randomness is so powerful, it helps to write autonomous tests in the smallest coherent pieces we can, so the system has as many degrees of freedom as possible in composing a test.
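One way to sketch this is to make every such knob a field the generator draws at random; the parameter names and ranges below are illustrative assumptions, not recommendations.

```cpp
#include <random>

// Every test parameter is a degree of freedom for the generator, not a
// constant: configuration, concurrency, duration, and validation cadence.
struct TestPlan {
    int cache_mb;     // how is the system configured?
    int n_processes;  // how many processes are running at a time?
    int steps;        // how long does the test run?
    int check_every;  // when do we check that things look right?
};

TestPlan random_plan(unsigned seed) {
    std::mt19937 rng(seed);
    auto pick = [&](int lo, int hi) {
        return std::uniform_int_distribution<int>(lo, hi)(rng);
    };
    return {pick(16, 1024), pick(1, 8), pick(100, 100000), pick(1, 50)};
}
```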

It’s also useful here to consider the opposite case, in which we try to tune the randomness in our test, or give the system larger units of work. For example, suppose we thought that calling a twice in a row was more likely to find a bug. It might then be tempting to write this code:

typedef void(*func_t)();

func_t choose_function() {
    return antithesis.random.choose([a, b]);
}

void test() {
    while (true) {
        func_t func = choose_function();
        func();
        func();
    }
}

If you’re right, then this version will find bugs slightly faster on average than a truly random sequence. However, it’s guaranteed never to find a bug that requires the sequence a -> b -> a without a second intervening b. It’s most important to make sure that we aren’t inadvertently ruling out a possible sequence of test actions, since that creates an opening in which a bug can hide.
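If we really do believe repeated calls to a are fruitful, one safer alternative is to weight the random choice rather than hard-code the pair, so that no sequence is ruled out. A sketch, with hypothetical operations and weights:

```cpp
#include <random>

// Hypothetical operations, as in the earlier examples.
static int calls_a = 0, calls_b = 0;
void a() { calls_a++; }
void b() { calls_b++; }

// Bias toward a without ruling anything out: weight the per-step choice
// instead of hard-coding pairs, so every sequence (including
// a -> b -> a with no second b) remains possible.
void weighted_test(unsigned seed, int steps) {
    std::mt19937 rng(seed);
    std::discrete_distribution<int> pick({3.0, 1.0});  // a is 3x as likely
    for (int i = 0; i < steps; i++) {
        (pick(rng) == 0) ? a() : b();
    }
}
```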

In conclusion

Remember that autonomous testing starts with writing a program that will generate tests for you. To use autonomous testing effectively, this program should:

  1. Try everything sometimes.
  2. Notice misbehavior when it happens.
  3. Leverage randomness.

It’s possible to manually write a program that accounts for all three of these principles, but doing so can be complex. Autonomous testing platforms handle these considerations with varying degrees of sophistication – Antithesis, for instance, provides an opinionated framework that allows the platform to orchestrate fragments of test logic to maximize their effectiveness. Regardless of how you’re testing your software, it’s worth examining how a testing approach handles these concerns.