Autonomous testing – how it works and when to use it
At Antithesis, we’ve built an autonomous testing tool, but the term is loosely defined in the industry. It’s often conflated with automated testing or AI-driven testing frameworks.
Here’s a definition that clarifies what autonomous testing is, the problems it solves, and when to use it.
What is autonomous testing?
Autonomous testing is the practice of using a computer to generate tests for a software system.
With the rise of LLMs, the term autonomous testing is sometimes used to refer to LLM-driven frameworks that create, run, and analyze tests for software systems. Such frameworks are a form of autonomous testing, since a computer is generating the tests, but they mostly write tests for smaller chunks of code – not whole systems.
Autonomous testing vs. property-based testing
Property-based testing involves writing checks that certain properties always hold in a system; it doesn't prescribe how the tests that exercise those properties are created.
Many popular property-based testing tools generate a wide range of random inputs, but machine-generated random inputs aren't the defining characteristic of property-based testing. There are approaches to property-based testing that aren't driven by randomness at all, such as concolic execution.
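As a concrete illustration, here's a minimal, hand-rolled property-based test in C. It doesn't use any particular property-based testing library, and the reverse function under test is hypothetical – the point is only that a property (reversing an array twice restores the original) is checked against many machine-generated inputs rather than a handful of hand-picked examples.

#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical function under test: reverse an int array in place. */
void reverse(int *xs, size_t n) {
    for (size_t i = 0; i < n / 2; i++) {
        int tmp = xs[i];
        xs[i] = xs[n - 1 - i];
        xs[n - 1 - i] = tmp;
    }
}

int main(void) {
    srand(42);  /* fixed seed so any failure can be reproduced */
    for (int trial = 0; trial < 10000; trial++) {
        size_t n = rand() % 64;
        int xs[64], copy[64];
        for (size_t i = 0; i < n; i++) xs[i] = rand();
        memcpy(copy, xs, n * sizeof(int));

        /* Property: reversing twice restores the original array. */
        reverse(xs, n);
        reverse(xs, n);
        assert(memcmp(xs, copy, n * sizeof(int)) == 0);
    }
    return 0;
}

A library such as QuickCheck or Hypothesis automates the input generation, shrinking, and reporting, but the shape of the test is the same: a property, plus machine-generated inputs.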
Autonomous testing also goes beyond computer generation of inputs – in autonomous testing, the computer is generating the whole test.
Autonomous testing vs. automated testing
Automated tests are simply tests that are run automatically, without human intervention. While automated testing doesn't inherently imply the tests are example-based, in practice they usually are. Take, for instance, the automated unit tests that run when software builds: they run the same set of test inputs on every test run.
Autonomous testing can itself be automated by integrating it into CI/CD pipelines, but unlike automated tests, autonomous tests are generated afresh on each test run.
Problems it solves
The goal of testing a piece of software, at least notionally, is to exercise all of the functionality in the system so you can see what happens.
This is hard because most systems are complex and stateful. For example, if we have an API with two functions, a and b, a naive guess would be that it suffices to have two tests:
void test1() {
    assert(a() == <expected result>);
}
void test2() {
    assert(b() == <expected result>);
}
But of course this isn’t true! Functions can have side effects on the state of our system – maybe one of these functions leaves our system in a state where calling the other one will have some new effect. And of course we need to try them in both orders.
void test3() {
    assert(a() == <expected result>);
    assert(b() == <expected result>);
}
void test4() {
    assert(b() == <expected result>);
    assert(a() == <expected result>);
}
But wait! Who says we can only call each of those functions once? What if one of the functions contains a memory leak and calling it a thousand times is what causes our program to crash?
Pretty soon we can end up in this kind of situation:
void test37411() {
    assert(b() == <expected result>);
    assert(a() == <expected result>);
    assert(a() == <expected result>);
    assert(a() == <expected result>);
    assert(a() == <expected result>);
    assert(b() == <expected result>);
    assert(a() == <expected result>);
    ...etc.
}
And all of that is in a simplified model of an API with just two functions that each take zero parameters, and without considering concurrency, resilience to out-of-band faults, network errors, etc. This combinatorial explosion of possibilities is one of the fundamental reasons that testing is so hard, and why getting exhaustive behavioral coverage of a system is often impractical when a human is doing the testing.
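To put rough numbers on it: with just two zero-argument functions there are 2^k distinct call sequences of length k, so by the time a sequence is 20 calls long there are over a million possibilities – and real APIs have far more than two functions, each with parameters of their own.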
With example-based testing, a lot of developer time is spent coming up with interesting test inputs that might reveal bugs, yet this yields very low input coverage and state-space coverage for the system under test.
Property-based testing, when combined with fuzzing, covers more of the input space and system state space, because the machine generates far more inputs than a developer could write by hand.
Autonomous testing goes beyond both by generating tests to exercise the whole system. For example, we can write a program that tries different combinations of things that a user or client could do, under different environmental conditions. If we run this in a loop for long enough, it will eventually discover every possible such combination, and hence every possible state of the system.
This would be a relatively crude, impractical version of autonomous testing, but it illustrates the underlying idea.
In practice, most software systems have state spaces so large they’re effectively infinite. But if autonomous testing generates end-to-end tests that exercise the whole system to explore the state space, a lot of interesting behaviors can be discovered in a reasonable span of time.
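To make that concrete, here's a deliberately crude sketch in C of the idea described above. It reuses the hypothetical a and b functions from the earlier examples, plus an imagined check_invariants function standing in for whatever properties we care about; a real autonomous testing tool would be far more sophisticated about choosing operations and injecting faults.

#include <stdlib.h>
#include <time.h>

/* Hypothetical API from the examples above. */
extern int a(void);
extern int b(void);

/* Hypothetical check that the system's invariants still hold. */
extern void check_invariants(void);

int main(void) {
    /* Seed from the clock so every run generates a different test. */
    srand((unsigned)time(NULL));

    for (;;) {
        /* Generate a random sequence of operations instead of a fixed script. */
        int steps = 1 + rand() % 1000;
        for (int i = 0; i < steps; i++) {
            if (rand() % 2) {
                a();
            } else {
                b();
            }
            check_invariants();  /* verify the properties after every step */
        }
    }
}

Run for long enough, a loop like this will eventually stumble into the same kinds of awkward operation sequences as test37411 above, without anyone having to write them by hand.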
This also highlights another difference between autonomous testing and LLM-driven testing frameworks. Current LLM-driven testing frameworks tend to focus on generating example-based tests to ensure that a particular piece of code functions as intended, whereas autonomous testing generates tests to see if the system ever doesn’t work.
Strengths and limitations
Autonomous testing saves developer time and increases confidence in your software, because:
- Instead of predicting test cases that might reveal a bug, developers can instruct the computer to generate tests autonomously, achieving greater coverage of the state-space than other testing approaches, for the same amount of time spent.
- Randomized exploration often uncovers bugs developers would never anticipate.
However, autonomous testing can be challenging to implement.
There aren’t many autonomous testing tools available, and setting up an autonomous testing framework is a complex, resource-intensive undertaking.
- Autonomous testing requires an understanding of which sequences of operations are legal, and which operations can legally run concurrently – the test generator has to model this, as the sketch below illustrates.
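Here's a minimal sketch, using hypothetical connect_service, send_message, and disconnect_service functions, of how a test generator has to track state so that it only issues operation sequences the system considers legal:

#include <stdlib.h>

/* Hypothetical API with ordering rules:
 * connect_service() must come before send_message(),
 * and disconnect_service() ends the session. */
extern void connect_service(void);
extern void send_message(void);
extern void disconnect_service(void);

int main(void) {
    srand(1);
    int connected = 0;  /* the generator must track state to stay legal */

    for (int step = 0; step < 100000; step++) {
        int choice = rand() % 3;
        if (!connected) {
            /* Only connecting is legal from this state. */
            connect_service();
            connected = 1;
        } else if (choice == 0) {
            disconnect_service();
            connected = 0;
        } else {
            send_message();
        }
    }
    return 0;
}

In a real system the set of legal operations, and the state that governs them, is much richer – which is part of what makes building this kind of harness a substantial effort.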
What kind of systems benefit most from autonomous testing?
Autonomous testing is applicable to any kind of software, but particularly excels at testing systems where concurrency, state, and state space exploration matter.
These include:
- Distributed databases (e.g. FoundationDB, MongoDB, and TigerBeetle).
- Financial transaction engines (e.g. Formance).
- Distributed systems infrastructure (e.g. Warpstream, Resonate, and Rising Wave).
- Blockchains and consensus protocols (e.g. Sui, developed by Mysten Labs).
- Microservice applications.
- Asynchronous workflows.
- Any complex business system built on distributed infrastructure.
Bugs in such systems tend to be difficult to detect with manually written, example-based tests.
Autonomous testing goes further than other testing approaches and is a scalable way to achieve confidence in today’s software. At Antithesis, we’ve seen firsthand how it uncovers critical bugs in systems where traditional testing falls short.