Principles of Test Composition

This page explains what makes a test template effective. The Test Composer will take care of most of this for you, but if you want to try writing a test template from scratch, or just to understand what the Test Composer is doing, you might want to read this.

Strengthening your test template tends to be a high-leverage way of improving your testing: it yields significant improvement per unit of effort spent. Many advanced users have well-developed test templates of their own, which simply run in the Test Composer as singleton_driver commands.

Introduction

The goal of testing a piece of software, at least notionally, is to exercise all of the functionality in the system, but that’s very hard because most systems are complex and stateful. For example, if we have an API with two functions, a and b, a naive guess would be that it suffices to have two tests:

void test1() {
    a();
}

void test2() {
    b();
}

But of course this isn’t true! Functions can have side effects on the state of our system – maybe one of these functions leaves our system in a state where calling the other one will have some new effect. And of course we need to try them in both orders.

void test3() {
    a();
    b();
}

void test4() {
    b();
    a();
}

But wait! Who says we can only call each of those functions once? What if one of the functions contains a memory leak and calling it a thousand times is what causes our program to crash? Pretty soon we can end up in this kind of situation:

void test37411() {
    b();
    a();
    a();
    a();
    a();
    b();
    a();
    ...etc.
}

And all of that is in a simplified model of an API with just two functions that each take zero parameters, and without considering concurrency, resilience to out-of-band faults, network errors, etc. This combinatorial explosion of possibilities is one of the fundamental reasons that testing is so hard, and why getting exhaustive behavioral coverage of a system is often impractical.

The Antithesis approach is the opposite of trying to exhaustively enumerate all possible test cases like the ones above. Instead, we write a program that runs in a loop and that, if allowed to run long enough, would eventually try every possible combination of things that a user or client could do.

Of course, we can never actually do that, because the space of possibilities is effectively infinite. But the goal is that running this program for a very long time will cause it to try all of the most interesting behavior, or at least an asymptotic approximation thereof.
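
As a rough sketch (written here in C++, with a() and b() standing in for whatever operations our API exposes), that program might look like this:

#include <random>

void a() { /* one API operation */ }
void b() { /* another API operation */ }

int main() {
    // Draw entropy from the system source rather than a fixed seed,
    // so each run can explore a different sequence.
    std::random_device rd;
    std::uniform_int_distribution<int> coin(0, 1);
    while (true) {
        // Every iteration picks the next operation at random, so any
        // finite sequence of calls is eventually possible.
        if (coin(rd) == 0) a(); else b();
    }
}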

The Antithesis Platform speeds up that process by using coverage instrumentation, sometimes assertions, and other forms of feedback to guide which paths are taken, but the nature of the program that’s doing the trying – what we call a test template – also matters.

There are three principles to bear in mind:

  1. Try everything sometimes
  2. Notice misbehavior when it happens
  3. Leverage autonomy

Try everything sometimes

Our goal is to make sure that we have some chance of producing any legal sequence of operations against our API. The most important way to achieve that is to make sure that the entire API surface is actually exercised in the test template.

This may seem obvious, but some functionality is easy to overlook: configuration or administration APIs, for example. As much as possible, our test template should bring the system up from a cold start, using those configuration or administration APIs to get it ready before testing the other functionality.
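
For example, a template might be structured like this sketch, where create_cluster and apply_configuration are hypothetical stand-ins for whatever administration surface your system actually exposes:

void create_cluster() { /* hypothetical: provision the system via its admin API */ }
void apply_configuration() { /* hypothetical: configure it as an operator would */ }
void run_random_operations() { /* the main randomized loop */ }

int main() {
    // Bring the system up from a cold start with the same
    // configuration and administration APIs a real operator would
    // use, so that this functionality is exercised on every run.
    create_cluster();
    apply_configuration();
    run_random_operations();
}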

Don’t forget “good” crashes

Sometimes our software is supposed to exit.

It’s tempting to treat “expected” panics, shutdowns, or failures as false positives and try to avoid them, but this is a mistake!

Properties like “if a certain connection is unavailable for too long, the system shuts down,” “a surprise-shutdown never results in inconsistent data,” or even “our system eventually recovers from network-driven crashes,” are just as vital as properties about a healthy system, and it’s just as important that they happen sometimes in our tests.

In practice, recovery processes tend to hide a lot of bugs, and we want to make sure we have a chance to catch them.
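
As a sketch of what exercising one of these paths might look like (all of the helper names here are hypothetical):

#include <cstdlib>

void hard_kill_node() { /* hypothetical: force an ungraceful shutdown */ }
void restart_node() { /* hypothetical: bring the node back up */ }
bool data_is_consistent() { /* hypothetical end-to-end check */ return true; }

void test_crash_recovery() {
    // The "expected" crash is just another operation to exercise; the
    // property we actually care about is what happens afterward.
    hard_kill_node();
    restart_node();
    if (!data_is_consistent()) {
        std::abort(); // a surprise shutdown corrupted data: surface it loudly
    }
}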

Exercise concurrency

A more subtle way in which we can fail to exercise entire categories of functionality in our system is by neglecting concurrency. Most systems support some degree of concurrent use: multiple clients connecting simultaneously to a service, multiple concurrent transactions on a database, or multi-threaded access to an in-process library. If our system supports any of these modes of behavior, then we also need to exercise it in this way.

The amount of concurrency (number of threads, number of simultaneously running containers, or number of pipelined asynchronous requests) is also an important tuning parameter. Having too much concurrency could swamp a service and cause it to fail in uninteresting ways, or could simply make the tests very inefficient.

The Test Composer can take care of managing parallelism for you, and provides tools for tuning the amount of concurrency in the system, but if you’re writing a test template from scratch, you may want to expose the degree of concurrency as a parameter that you can vary.
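
If you do, a minimal sketch might spawn a configurable number of identical clients:

#include <thread>
#include <vector>

void run_random_operations() { /* the single-client test loop */ }

// The degree of concurrency is a parameter, not a constant, so it can
// be varied from run to run.
void run_concurrent_clients(int num_clients) {
    std::vector<std::thread> clients;
    for (int i = 0; i < num_clients; i++) {
        clients.emplace_back(run_random_operations);
    }
    for (auto& client : clients) {
        client.join();
    }
}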

Notice misbehavior when it happens

Many of our most important test properties stem from assertions in our code, but they tend to have very local views of the system. Since it stands outside the rest of the system, the test template has a great view of external or end-to-end properties, and ought to take advantage of that.

Validate continuously

One common mistake is to only validate system function in a validation phase at the end of your tests, like this:

void validate_system() {
    ...
}

function choose_function() {
    return antithesis.random.choose([a, b]);
}

void test() {
    for (int i = 0; i < 100000; i++) {
        func = choose_function();
        func();
    }
    validate_system();
}

There are three important reasons not to do this.

First, it means we need to run the entire test to completion before we can tell if a bug has occurred. Depending on how long the test is and how many resources it uses, this can be very inefficient!

Second, and more importantly, it’s possible for bugs to “cancel out”. Imagine if the test provokes our system into a broken state, and then later, by random luck, gets back out of that state again. A validation phase at the end of the test would completely miss this sequence of events.

Third, debugging is just more difficult if there’s a long, complicated, and mostly irrelevant history leading up to the bug.

We therefore recommend that you validate continuously, with a repeating pattern of work → validate → work.
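
Rewriting the loop above in this style might look something like the following sketch (the cadence of one validation per 100 operations is arbitrary):

#include <random>

void a() { /* one API operation */ }
void b() { /* another API operation */ }
void validate_system() { /* end-to-end checks, as before */ }

void test() {
    std::random_device rd;
    std::uniform_int_distribution<int> coin(0, 1);
    for (int i = 0; i < 100000; i++) {
        if (coin(rd) == 0) a(); else b();
        // work -> validate -> work: a broken state is caught close to
        // the operations that caused it, and bugs can no longer
        // quietly cancel out before a single final check.
        if (i % 100 == 0) {
            validate_system();
        }
    }
    validate_system(); // one last check once all the work is done
}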

Validate eventually

At the same time, other properties, like availability, can be trickier to express. While we try to architect our systems to be robust to the real-life failures we face, it’s simply true that a test which (for instance) relies on querying one of our services cannot pass while the network link between the workload and that service is down. The properties that we really care about in cases like these are that eventually, when conditions are better, our system is able to recover.

It’s particularly important that our test template distinguishes between always and eventually properties, to keep our tests from getting cluttered with false positives. If our test crashes or logs fatal error messages when it encounters “expected” errors, that will mask real bugs in our client library, which does need to work in production in the face of such issues.
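
One way to express an eventually property is to retry through expected failures rather than aborting on them; a sketch, with a hypothetical try_query_service standing in for a real client call:

#include <chrono>
#include <thread>

bool try_query_service() { /* hypothetical: false on (expected) network error */ return true; }

// An eventually property: transient failures are expected while faults
// are being injected, so we retry instead of treating them as fatal.
void validate_recovery() {
    while (!try_query_service()) {
        // Expected while the network is down. Crashing or logging a
        // fatal error here would bury real bugs under false positives.
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    // Reaching this point means the system recovered once conditions improved.
}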

Validate at the end when necessary

There are advantages to validating throughout a workload, but some powerful properties only make sense when there’s no more work to do. Properties that fit here are things like checking that our data is consistent, making sure a process finishes gracefully, or looking at the actual results of some systemwide operation.
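
For instance, a workload that tracks what it has written can compare its own model against the system’s final state; a sketch, with hypothetical helpers:

#include <cstdlib>

long expected_total() { /* hypothetical: what the workload believes it wrote */ return 0; }
long actual_total() { /* hypothetical: what the system reports at the end */ return 0; }

// This check only makes sense once all in-flight work has drained,
// which is why it belongs at the end of the test.
void final_validation() {
    if (actual_total() != expected_total()) {
        std::abort(); // the totals disagree: the data is inconsistent
    }
}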

Leverage autonomy

One of the great strengths of autonomous testing is that, by using randomness, it frequently flushes out bugs that test-writers can’t predict.

Many languages, frameworks, and runtimes have built-in PRNG abstractions that are seeded once at startup. Within Antithesis, this is an anti-pattern, because it means that we cannot go back just a little bit and “change history”: the whole program must be restarted from scratch to get a different random sequence. Instead, you should get random values directly from the Antithesis SDK, or from the system random devices /dev/random or /dev/urandom. If that’s impractical, then at the very least you should periodically reseed the PRNG you are using from these sources.
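
As a minimal sketch of that fallback, using C++’s std::mt19937 as the stand-in language-level PRNG:

#include <fstream>
#include <random>

std::mt19937 rng; // a typical language-level PRNG

// Periodically reseeding from the system entropy source lets the
// platform "change history" without restarting the whole program.
void reseed_from_urandom() {
    std::ifstream urandom("/dev/urandom", std::ios::binary);
    std::mt19937::result_type seed = 0;
    urandom.read(reinterpret_cast<char*>(&seed), sizeof(seed));
    rng.seed(seed);
}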

Every part of our test is an opportunity to increase its randomness. In addition to randomizing the functions we call, the order in which we call them, and the inputs we give them, we can double down and randomize things like:

  • How is the system configured?
  • How many processes are running at a time?
  • How long does the test run?
  • When do we check that things look the way we expect?

Since randomness is so powerful, it helps to break our test template down into the smallest coherent pieces we can, so the system has as many degrees of freedom as possible in composing a test.

It’s also useful here to consider the opposite approach, in which we try to tune the randomness in our test, or give the Test Composer larger units of work. For example, suppose we thought that calling a twice in a row was more likely to find a bug. It might then be tempting to write this code:

function choose_function() {
    return antithesis.random.choose([a, b]);
}

void test() {
    while(true) {
        func = choose_function();
        func();
        func();
    }
}

If you’re right, then this version will find bugs slightly faster on average than a truly random sequence. However, it’s guaranteed never to find a bug that requires the sequence a -> b -> a without a second intervening b. It’s most important to make sure that we aren’t inadvertently ruling out a possible sequence of test actions, since that creates an opening in which a bug can hide.
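
A better way to encode the same hunch is to bias the odds without ruling any sequence out; in this sketch, a is simply chosen more often than b:

#include <random>

void a() { /* one API operation */ }
void b() { /* another API operation */ }

void test() {
    std::random_device rd;
    // a is chosen three times as often as b, but every sequence of
    // calls, including a -> b -> a, remains possible.
    std::discrete_distribution<int> weighted({3.0, 1.0});
    while (true) {
        if (weighted(rd) == 0) a(); else b();
    }
}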

Again, it’s possible to manually write a test template that accounts for all of this, but we believe the Test Composer is an extremely helpful and powerful tool in this regard.