Patrick McBride
CMO

What does AI testing done right look like?

So A. So I. So AI.

AI-based systems testing is rapidly becoming a thing, driven in part by the sheer volume of new AI-generated code to be tested and the clear quality issues associated with co-pilots and vibe coding. But what’s the best way to use AI for testing complex distributed systems? The approach many vendors are taking – essentially using AI to generate more tests in hopes of achieving better code coverage – is flawed. Instead, we believe combining proven testing methods for distributed systems with targeted AI usage is the best way forward, and enables teams to unlock the full promise of AI.

Perplexity on limitations and risks of vibe coding:

  • Code quality and security: The AI may generate code that is inefficient, insecure, or hard to maintain, especially if the developer doesn’t review it carefully.
  • Overreliance on AI: Developers may lose touch with the underlying logic or best practices if they rely too heavily on AI-generated solutions.
  • Not ideal for complex, production-critical systems: For high-stakes projects, manual review, testing, and optimization remain essential.

Why is just generating more tests with AI flawed?

Albert Einstein once said, “Doing the same thing over and over again and expecting different results is the definition of insanity.”1 The current model of example-based testing – writing test cases to explore the things that may go wrong in complex distributed systems – was already flawed in multiple ways. Using AI to accelerate it just puts a flawed approach on steroids.

First, this approach leads to additional complexity and a library of tests that are difficult to maintain and even harder to reason about. People were already complaining about flaky tests and change-detector tests, and recognized that poorly written tests were a significant maintenance burden – and that was when a human being had at least reviewed each of those test cases at some point. Now imagine how much worse this gets when AI is churning out tests faster than a first-year coder on their third can of Monster. Suddenly, human patience is no longer a limit on the number of flaky or poorly designed tests that get added to your codebase. This is not a recipe for success.

Second, example-based testing only accounts for the things engineers, or AI, think may go wrong. Remember that AI learns to write tests by reading your code and other tests humans wrote. It might spit tests out for you faster, but it’s never going to come up with something that truly explores the far corners of your system’s state space.

Third, even if you blanket your code with AI-written test cases, no amount of testing under “happy path” lab conditions will account for the kind of things that go wrong when code meets production reality. Production, especially with large, complex distributed systems, is a chaotic environment. Ensuring your code is up to the task in the varying weather conditions of production, or the crazy actions your users dream up, simply can’t be done using traditional methods.

Lastly, even if you do find more bugs with extra AI-generated test cases, this approach does nothing to help engineers pinpoint the root cause of failures, or test and validate fixes under the same conditions. More tests detecting more unsolved bugs is not a recipe for quicker, smoother deployments or happier, more productive engineers.

How should you test distributed systems with AI?

Before discussing AI, let’s begin by outlining what it takes to actually address the most critical issues in distributed systems testing.

Some qualities we need from this testing approach are:

  • It should thoroughly test systems from end to end, without requiring engineers or AI to design and write test cases to cover “all the possibilities.”
  • It should test as much of the state space – as many code paths – as possible, but under conditions that your code will face in the hostile and unpredictable environment of production.
  • It should enable engineers to recreate bugs and closely inspect the system state and events on the timeline leading to each failure, facilitating rapid and comprehensive debugging.
  • It should enable engineers to make fixes and retest them under the same conditions that caused the fault, ensuring the fix works.

With or without AI, these qualities are necessary to identify and resolve the kind of bugs that are difficult to find, can halt production, and consume significant amounts of your top engineers’ time.

The approach we use is an enhanced version of deterministic simulation testing (DST), which is a core pillar of AWS’ strategy for system correctness. The basic idea is to run your software in a simulation – but one that’s far more hostile than even your worst day in production. A “digital twin,” but for a software system.

Rather than generating numerous tests in the hope of covering all bases, your team, with support from our experts, defines what correct system behavior is. For example: “a user will never get billed twice for the same transaction,” or “the system will always recover from a failure within 5 seconds,” or “the server component never returns an HTTP 500 status code.” Our platform then runs many instances of your system in parallel, exploring millions of code paths to see if these properties are ever violated. This property-based testing approach significantly reduces the amount of work required by engineers on the front end.
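To make the idea concrete, here is a minimal sketch of what such a property can look like when expressed as a property-based test, using the open-source Hypothesis library in Python. The `Ledger` class and `bill` function are hypothetical stand-ins for your own billing code – not part of our platform or API – and in practice the property is checked against your real system running inside the simulation, not a toy class like this.

```python
# Minimal sketch of property-based testing with the Hypothesis library.
# Ledger/bill() are hypothetical stand-ins for your own billing code.
from hypothesis import given, strategies as st


class Ledger:
    """Toy billing system: records at most one charge per transaction id."""
    def __init__(self):
        self.charges = {}

    def bill(self, txn_id: str, amount: int) -> None:
        # Idempotent billing: retries of the same transaction are ignored.
        self.charges.setdefault(txn_id, amount)


# Property: no matter how many times a transaction is retried,
# the user is never billed twice for the same transaction id.
@given(st.lists(st.tuples(st.text(min_size=1),
                          st.integers(min_value=1, max_value=10_000))))
def test_never_billed_twice(requests):
    ledger = Ledger()
    for txn_id, amount in requests:
        ledger.bill(txn_id, amount)   # original request
        ledger.bill(txn_id, amount)   # simulated client retry
    # Exactly one charge per unique transaction id, regardless of retries.
    assert len(ledger.charges) == len({txn_id for txn_id, _ in requests})
```

The point is that you state the invariant once, and the tooling – not an engineer writing examples by hand – is responsible for hunting for inputs and histories that violate it.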

Our hypervisor provides a deterministic environment that can perfectly replicate the conditions that lead to each fault, allowing engineers to easily diagnose the cause of the bug. And when a fix is complete, they can test it equally thoroughly, to validate that their fix works.
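A rough sketch of the determinism idea, under the simplifying assumption that every source of nondeterminism in a run is driven by a single seed (our hypervisor enforces this below your code; `run_simulation` here is a hypothetical harness, not our API):

```python
# Minimal sketch of deterministic replay: drive all "randomness" in a
# simulated run from one seed, so any failing run can be re-executed exactly.
import random


def run_simulation(seed: int) -> bool:
    rng = random.Random(seed)        # the only source of nondeterminism
    # ... schedule messages, delays, and crashes using rng ...
    return rng.random() > 0.001      # False means a (simulated) property violation


# Explore many runs; keep the seeds that exposed a failure.
failing_seeds = [seed for seed in range(10_000) if not run_simulation(seed)]

# Because each run is a pure function of its seed, replaying a failing seed
# reproduces the exact same interleaving for debugging and for validating a fix.
for seed in failing_seeds:
    assert run_simulation(seed) == run_simulation(seed)
```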

Where can AI be most helpful in testing distributed systems?

The AI magic is baked into this testing model. We employ an AI-powered fuzzer that intelligently injects faults during these numerous test runs – network delays, race conditions, server and service failures – the types of issues commonly encountered in production, but not typically tested with traditional methods. Our AI identifies when your code is doing something unusual, and generates more test sequences to explore that interesting situation further, delving deeper and finding the most pernicious bugs.
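As a simplified illustration of what fault injection means in code (in our platform this happens beneath your software, at the environment level, rather than in a wrapper like this; `send_rpc` is a hypothetical stand-in for your network layer), a harness might drop, delay, and reorder messages like so:

```python
# Simplified sketch of fault injection in a test harness.
# send_rpc is a hypothetical stand-in for the real network send path.
import random


class FaultyNetwork:
    """Wraps a send function and injects drops, delays, and reordering."""
    def __init__(self, send_rpc, rng: random.Random, fault_rate: float = 0.05):
        self.send_rpc = send_rpc
        self.rng = rng
        self.fault_rate = fault_rate
        self.delayed = []                           # messages held back to force reordering

    def send(self, msg):
        roll = self.rng.random()
        if roll < self.fault_rate:                  # drop the message entirely
            return
        if roll < 3 * self.fault_rate:              # hold it back to deliver out of order
            self.delayed.append(msg)
            return
        self.send_rpc(msg)                          # deliver normally

    def flush(self):
        self.rng.shuffle(self.delayed)              # deliver held messages in shuffled order
        for msg in self.delayed:
            self.send_rpc(msg)
        self.delayed.clear()
```

Note that the fault schedule is driven by the same seeded randomness as the rest of the simulation, which is what keeps every injected fault reproducible.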

AI also drastically simplifies the initial setup. We use AI – and you can too – to generate the calling code that makes the digital twin of your system “do something,” or that emulates user behavior. A significant advantage of AI for this use case is that hallucinations become a valuable asset: your real users will sometimes make the same kinds of incorrect calls, and that shouldn’t cause your system to crash or become unresponsive.
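For illustration, generated calling code can be as simple as a workload driver that mixes valid and deliberately malformed requests. Everything here – the endpoint names and the `client` object – is hypothetical; the point is the shape of the workload, not a specific API:

```python
# Hypothetical sketch of generated "calling code": a workload driver that
# exercises an API with valid and malformed requests, much as real users
# (and AI hallucinations) would. client is a stand-in for your API client.
import random


def random_workload(client, rng: random.Random, steps: int = 1_000):
    for _ in range(steps):
        action = rng.choice(["create", "pay", "refund", "garbage"])
        if action == "create":
            client.create_order(user_id=rng.randint(1, 50))
        elif action == "pay":
            client.pay(order_id=rng.randint(-5, 1_000))        # may reference a nonexistent order
        elif action == "refund":
            client.refund(order_id=rng.randint(1, 1_000),
                          amount=rng.choice([0, -1, 10**9]))   # nonsensical amounts
        else:
            client.raw_request(rng.randbytes(rng.randint(0, 64)))  # malformed input
        # The system may reject any of these calls, but it should never crash or hang.
```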

We’re exploring many other uses of AI in our platform. From the beginning, we decided that our company would ensure our customers could leverage the substantial benefits of property-based testing, deterministic simulation testing, and an AI-powered fuzzer to expose deep-seated bugs, especially the unknown unknowns that can cause issues in production.

Having AI do the testing for you is actually a pretty good idea – you just need to do it right.