Autonomous testing - how it works and when to use it

At Antithesis, we’ve built an autonomous testing tool, but the term is loosely defined in the industry. It’s often conflated with automated testing or AI-driven testing frameworks.

Here’s a definition that clarifies what autonomous testing is, the problems it solves, and when to use it.

What is autonomous testing?

Autonomous testing is the practice of using a computer to generate tests for a software system.

With the rise of LLMs, the term autonomous testing is sometimes used to refer to LLM-driven frameworks that create, run, and analyze tests for software systems. Such frameworks are a form of autonomous testing, since a computer is generating the tests, but they mostly write tests for smaller chunks of code – not whole systems.

Autonomous testing vs. property-based testing

Property-based testing involves writing checks that a certain property holds in a system. The term says nothing about how those tests are created.

Many popular property-based testing tools use computers to generate a wide range of random inputs. But machine-generated random inputs aren’t the defining characteristic of property-based testing – other approaches, such as concolic execution, aren’t driven by randomness at all.

Autonomous testing also goes beyond computer generation of inputs – in autonomous testing, the computer is generating the whole test.

Autonomous testing vs. automated testing

Automated tests are simply tests that run automatically, without human intervention. Automated testing doesn’t inherently imply example-based tests, but in practice that’s usually what it means. Think, for instance, of the automated unit tests that run when software builds: they run the same set of test inputs on every test run.

Autonomous testing can itself be automated by integrating it into CI/CD pipelines, but unlike automated tests, autonomous tests are generated anew on each test run.

Problems it solves

The goal of testing a piece of software, at least notionally, is to exercise all of the functionality in the system so you can see what happens.

This is hard because most systems are complex and stateful. For example, if we have an API with two functions, a and b, a naive guess would be that it suffices to have two tests:

void test1() {
    assert(a() == <expected result>);
}

void test2() {
    assert(b() == <expected result>);
}

But of course this isn’t true! Functions can have side effects on the state of our system – maybe one of these functions leaves our system in a state where calling the other one will have some new effect. And of course we need to try them in both orders.

void test3() {
    assert(a() == <expected result>);
    assert(b() == <expected result>);
}

void test4() {
    assert(b() == <expected result>);
    assert(a() == <expected result>);
}

But wait! Who says we can only call each of those functions once? What if one of the functions contains a memory leak and calling it a thousand times is what causes our program to crash?

Pretty soon we can end up in this kind of situation:

void test37411() {
    assert(b() == <expected result>);
    assert(a() == <expected result>);
    assert(a() == <expected result>);
    assert(a() == <expected result>);
    assert(a() == <expected result>);
    assert(b() == <expected result>);
    assert(a() == <expected result>);
    ...etc.
}

And all of that is in a simplified model of an API with just two functions that each take zero parameters, and without considering concurrency, resilience to out-of-band faults, network errors, etc. This combinatorial explosion of possibilities is one of the fundamental reasons that testing is so hard, and why getting exhaustive behavioral coverage of a system is often impractical when a human is doing the testing.

With example-based testing, a lot of developer time is spent coming up with interesting test inputs that might find bugs. Even so, this yields very low coverage of the input space and state space of the system under test.

Property-based testing, especially when combined with fuzzing, covers more of the input space and state space.

Autonomous testing goes beyond both by generating tests to exercise the whole system. For example, we can write a program that tries different combinations of things that a user or client could do, under different environmental conditions. If we run this in a loop for long enough, it will eventually discover every possible such combination, and hence every possible state of the system.

This would be a relatively crude, impractical version of autonomous testing, but it illustrates the underlying idea.

In practice, most software systems have state spaces so large they’re effectively infinite. But if autonomous testing generates end-to-end tests that exercise the whole system to explore the state space, a lot of interesting behaviors can be discovered in a reasonable span of time.

This also highlights another difference between autonomous testing and LLM-driven testing frameworks. Current LLM-driven testing frameworks tend to focus on generating example-based tests to confirm that a particular piece of code functions as intended, whereas autonomous testing generates tests that search for situations where the system doesn’t work.

Strengths and limitations

Autonomous testing saves developer time and increases confidence in your software, because:

  • Instead of predicting test cases that might reveal a bug, developers can instruct the computer to generate tests autonomously, achieving greater coverage of the state-space than other testing approaches, for the same amount of time spent.
  • Randomized exploration often uncovers bugs developers would never anticipate.

However, autonomous testing can be challenging to implement.

  • There aren’t many autonomous testing tools available and setting up an autonomous testing framework is a complex, resource-intensive undertaking.
  • Autonomous testing requires understanding which sequences of operations are legal, and which operations can legally run concurrently.

What kind of systems benefit most from autonomous testing?

Autonomous testing is applicable to any kind of software, but particularly excels at testing systems where concurrency, state, and state space exploration matter.

These include:

  • Distributed databases (e.g. FoundationDB, MongoDB, and TigerBeetle).
  • Financial transaction engines (e.g. Formance).
  • Distributed systems infrastructure (e.g. WarpStream, Resonate, and RisingWave).
  • Blockchains and consensus protocols (e.g. Sui, developed by Mysten Labs).
  • Microservice applications.
  • Asynchronous workflows.
  • Any complex business system built on distributed infrastructure.

Bugs in such systems tend to be difficult to detect with manually written, example-based tests.

Autonomous testing achieves coverage that other testing approaches can’t, and is a scalable way to build confidence in today’s software. At Antithesis, we’ve seen firsthand how it uncovers critical bugs in systems where traditional testing falls short.
