Marco Primi
Chaos monkey at Synadia

Hunting for one-in-a-million bugs in NATS

This is the story of a sneaky, terrifying bug hiding in the Raft layer of NATS, and how Antithesis helped us catch it. It shows the kind of issue that lurks in complex codebases, the kind that traditional testing methods can miss and Antithesis can snipe.


Vas ist NATS? (and how do you test it?)

NATS is a multi-language toolkit for building and connecting distributed systems. It’s a CNCF open-source project stewarded by Synadia.

When it comes to testing, NATS has a vast surface area. There is so much to test: from access control and permissions all the way to replicated storage. Everything must compose and work smoothly in a variety of different environments, under different loads, and in wildly different use cases. Because NATS sits low in the stack, dependability is of the essence.

We refer to our overall testing strategy as the 7-layer (testing) cake:

  1. Unit tests – no introduction needed.
  2. Integration tests – ranging from small multi-unit synthetic interactions to large, realistic scenarios involving multiple clusters and clients.
  3. Whole system tests – running on real hardware, often inspired by real-world use cases.
  4. Benchmarks and performance regressions – to ensure bug fixes and features don’t slow things down.
  5. Operational scenarios – upgrades & downgrades, backup & restore, scale up & down, reconfiguration, etc.
  6. Extreme situations – Stress & torture tests, abnormal load, misuse and abuse.
  7. Fault injection (a.k.a. Chaos testing) – targeting critical safety and liveness properties.

We do our best to continuously make progress on all these fronts, but it’s a never-ending task: we can always test more and better. Whenever all our tests are green, it’s time to turn up the temperature and add more complex scenarios and nastier failure modes.

Over the last couple of years, Antithesis has become the frosting that wraps our testing cake, helping us seek and destroy exotic bugs that traditional techniques fail to detect (or make impossibly hard to reproduce).

Mo’ durability mo’ problems

One of the most important primitives NATS provides is durable message queues (similar to Kafka, but with a variety of durability and consistency settings). Replicated durable ordered sequences of messages are a fundamental building block of many systems, so the dependability of this component is critical.
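
For readers who haven’t used them, this is roughly what such a queue looks like from the client side with the nats.go client. This is just a sketch: it assumes a JetStream-enabled NATS cluster of at least 3 servers reachable at the default URL, and the stream and subject names are made up.

package main

import (
	"fmt"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		panic(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		panic(err)
	}

	// A durable, ordered stream whose data is replicated across 3 servers.
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:     "ORDERS",
		Subjects: []string{"orders.>"},
		Storage:  nats.FileStorage,
		Replicas: 3,
	}); err != nil {
		panic(err)
	}

	// Every acknowledged publish is a committed entry in the replicated log.
	ack, err := js.Publish("orders.created", []byte("order #42"))
	if err != nil {
		panic(err)
	}
	fmt.Println("committed at stream sequence", ack.Sequence)
}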

NATS replication is based on Raft. Consensus-based replication protocols are notoriously complex even in their naive textbook form. They rapidly become more complex when optimized for high throughput, low latency, and fast failover & recovery.

In practical terms, we want our Raft implementation to behave like a Total Order Broadcast. The durable message queues built on top must also satisfy Atomic Broadcast semantics (depending on the specific settings). Verifying that neither of these subsystems ever violates its (informal) formal specification is something we invest a lot of time in.
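
To make “behaves like a Total Order Broadcast” concrete, here is a minimal sketch of the kind of cross-replica check this boils down to. The types and names are made up for illustration (this is not the actual NATS verification code): every pair of replicas must agree on the common prefix of their delivered logs.

package main

import "fmt"

// Entry is one committed message as observed by a single replica.
type Entry struct {
	Seq  uint64 // sequence assigned by the leader
	Data string
}

// checkTotalOrder enforces the property we care about in practice:
// for any two replicas, their delivered logs must agree on the common
// prefix. No gaps, no reordering, no divergence.
func checkTotalOrder(logs map[string][]Entry) error {
	for a, la := range logs {
		for b, lb := range logs {
			n := len(la)
			if len(lb) < n {
				n = len(lb)
			}
			for i := 0; i < n; i++ {
				if la[i] != lb[i] {
					return fmt.Errorf("replicas %s and %s diverge at index %d: %+v vs %+v",
						a, b, i, la[i], lb[i])
				}
			}
		}
	}
	return nil
}

func main() {
	logs := map[string][]Entry{
		"X": {{Seq: 1, Data: "a"}, {Seq: 2, Data: "b"}}, // lagging, but consistent
		"Y": {{Seq: 1, Data: "a"}, {Seq: 2, Data: "b"}, {Seq: 3, Data: "c"}},
		"Z": {{Seq: 1, Data: "a"}, {Seq: 2, Data: "b"}, {Seq: 3, Data: "c"}},
	}
	fmt.Println(checkTotalOrder(logs)) // <nil>: X is a prefix of Y and Z
}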

Memento (no) more

Now that we’ve set the stage, let’s look at a terrifying bug that hid in plain sight, introduced as the code around it evolved.

The following is a schematic view summarizing logic spread over hundreds of lines of code:

n := newRaftNode(...)
err := n.joinRaftGroup(...)
if err != nil {
  n.cleanup() // Wipes any cruft left on disk
}

At first, this seems reasonable: if a node fails to join a Raft group, then clean up any cruft left behind (e.g. pre-allocated WAL files). However, this same logic becomes problematic when it executes during recovery (for example, after a server restart).

In this case, we have:

  • A Raft node is running fine, doing its thing, writing state to disk.
  • The containing server restarts.
  • The node fails to rejoin the Raft group after a restart.
  • The Raft node state is wiped.

This kind of stable-storage loss is pure nightmare fuel for anyone familiar with consensus-based replication protocols: it breaks the crash-recovery assumption (a node that restarts comes back with its persistent state intact) that Raft and Paxos are built on.
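
One way to guard against this class of mistake is to make the destructive cleanup conditional on whether the node is brand new or is recovering pre-existing state. The following is only a sketch with made-up helpers mirroring the schematic above, not the actual NATS fix:

package main

import (
	"errors"
	"fmt"
	"os"
)

type raftNode struct {
	stateDir   string
	recovering bool // true if pre-existing state was found on disk
}

func newRaftNode(stateDir string) *raftNode {
	_, err := os.Stat(stateDir)
	return &raftNode{stateDir: stateDir, recovering: err == nil}
}

func (n *raftNode) joinRaftGroup() error {
	// Stand-in for the real join/recovery logic.
	return errors.New("join failed")
}

// cleanup wipes any cruft left on disk (e.g. pre-allocated WAL files).
func (n *raftNode) cleanup() { os.RemoveAll(n.stateDir) }

func main() {
	n := newRaftNode("/tmp/raft-node-Y")
	if err := n.joinRaftGroup(); err != nil {
		if n.recovering {
			// Pre-existing state is the replica's durable promise to the
			// group: never wipe it because of a transient join failure.
			fmt.Println("join failed during recovery, keeping on-disk state:", err)
			return
		}
		// A brand-new node that never joined has nothing worth keeping.
		n.cleanup()
		fmt.Println("join failed on a fresh node, cleaned up:", err)
	}
}

The design point is that state which survived a restart is a durable promise to the rest of the group, so only a node that never had any state should ever wipe its directory on a failed join.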

A series of unfortunate events

For replicated durable message queues, data loss is just about the worst case scenario. So we invest considerable effort in making sure this never happens. One of the testing techniques we employ is invariant properties with randomized fault injection, an approach popularized by Jepsen.

In these tests, we inflict all sorts of nasty failures on a running system and closely monitor it to ensure that 1) we can eventually make progress and commit more messages to a stream, and 2) committed messages are eventually available in the order they were committed.
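
Here is a minimal sketch of those two checks. The helpers (publish, readAll) and the in-memory queue are stand-ins so the example runs on its own; the real tests drive an actual NATS cluster while faults are being injected.

package main

import (
	"fmt"
	"time"
)

// committed records every message the workload successfully published,
// in the order it was acknowledged.
var committed []string

func checkInvariants(publish func(string) error, readAll func() []string) error {
	// 1) Liveness: within a bounded time we can commit one more message.
	msg := fmt.Sprintf("msg-%d", len(committed))
	deadline := time.Now().Add(2 * time.Minute)
	for {
		if err := publish(msg); err == nil {
			committed = append(committed, msg)
			break
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("liveness violated: could not commit %q", msg)
		}
		time.Sleep(time.Second)
	}

	// 2) Safety: everything committed so far is still readable, in exactly
	//    the order it was committed (no loss, no reordering).
	got := readAll()
	for i, want := range committed {
		if i >= len(got) || got[i] != want {
			return fmt.Errorf("safety violated at index %d: want %q", i, want)
		}
	}
	return nil
}

func main() {
	// Toy in-memory queue standing in for a replicated NATS stream.
	var queue []string
	publish := func(m string) error { queue = append(queue, m); return nil }
	readAll := func() []string { return queue }

	for i := 0; i < 3; i++ {
		// In the real tests, a random fault (partition, restart, disk
		// fault, ...) would be injected between iterations.
		if err := checkInvariants(publish, readAll); err != nil {
			panic(err)
		}
	}
	fmt.Println("invariants held for:", committed)
}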

After thousands upon thousands of testing hours we were quite confident in NATS’ resilience. Yet our very first Antithesis experiment claimed to have detected data loss!

From the logs in the report, we pieced together the sequence of events that Antithesis strung together to violate our data loss invariant. Here’s one example:

Let X, Y, and Z be the 3 replicas of a durable message queue.
Let Z be the current leader of the group.
Let the queue be non-empty.

  1. Isolate replica X (follower) via network partition, so it starts lagging behind Z (leader) and Y (follower).
  2. Restart the server hosting replica Y (up-to-date follower).
  3. While Y is restarting and recovering, force another restart of the same server.
  4. Because it is shut down while still recovering, the replica fails initialization in a way that triggers the state wipe (as in the snippet above).
  5. Let replica Y restart successfully (its disk state has been fully erased, but it retains its identifier ‘Y’).
  6. Isolate replica Z (leader and only up-to-date node) via network partition.
  7. Restore connectivity between replica X (outdated) and replica Y (wiped).
  8. X (not fully up-to-date) becomes leader (this is a Raft violation made possible only by Y’s memory loss).
  9. X and Y have quorum and “overwrite” some previously committed values (this is not supposed to happen in Raft).

This is a pretty incredible sequence of events. Everything went wrong in exactly the right way leading to the worst possible outcome. Antithesis found this “path to data loss” without knowing anything about Raft or NATS.

To summarize:

  • A replica gets restarted.
  • During recovery, a specific error is encountered leading to a disk-state wipe.
  • After one more restart, the replica joins the group with empty state but retaining the same node identifier (trouble!).
  • The wiped replica is instrumental in electing a leader that is outdated (Raft violation!).
  • The “illegitimate” leader is able to overwrite messages previously committed (data loss!).

Each of these events is rare on its own; combined, they form a one-in-a-million event (a kind of Drake equation for deeply nested bugs). In our in-house chaos-testing setup, various recovery mechanisms would quickly repair the wiped replica and hide the bug, so it went undetected for a long time.

This is a prime example of the difference between our in-house chaos testing and Antithesis: the former may never stumble upon exceedingly unlikely scenarios; the latter is designed from the ground up to search for them!

To boldly go where no test has taken us before

We like to refer to running workloads in Antithesis as “gambling against the fuzzer” because it goes something like this:

Me: Run <workload> against this <system>. Inject reasonable faults. I bet you can’t observe any data loss. You have 1 hour.
Antithesis: Hold my beer…
1 hour later…
Antithesis: I was able to observe data loss. Here’s an example.
proudly shows a trace where data loss happens within 3 minutes of (virtual) execution time
Me: You’re the best and I hate you.

Antithesis has rapidly become an indispensable tool in detecting and addressing deep, hideous bugs in our code. For dependability zealots like us, this is a significant leap forward in bug-hunting technology. It makes development much more linear – when we make progress, it’s much less likely that we have to go back and fix problems.

In addition to our nightly experiments like the one that turned up this incredible bug, we use Antithesis in other ways. We use short runs to reproduce individual flaky CI tests, and we use it to reproduce customer issues observed in remote, inaccessible customer environments: we quickly write a simple workload based on the customer’s description of the symptoms, and Antithesis can usually reproduce the problem on the first attempt.