Dan Gottlieb
Former Senior Staff Engineer, MongoDB

Working with Antithesis at MongoDB

Snouty catching bugs

This is a guest post from an Antithesis customer, writing about their own experience testing with us. If you are an Antithesis customer and interested in talking about your impressions of our platform, please reach out to us!

MongoDB runs lots of tests: Unit tests and integration tests. Tests which run entire tests (they call them “passthroughs”). Tests that idealize the environment and tests that inject faults. Tests of client side behavior. Tests for internal consistency after a test is verified as well as dozens of times during a test. Tests for each MongoDB process in a three node replica set. Tests for each process in a sharded cluster.

The testing matrix goes on and on (and on).

MongoDB is heavily invested into improving productivity through excellent testing. There’s the normal stuff, like running tests in the cloud against a patch and providing back the results in a slick UI. But there’s also cool custom stuff as well: An online test log viewer that can filter and highlight information, complete with shareable links; or there’s the ability to click a button that spawns an appropriate cloud machine complete with a coredump and the binaries/debug information for that exact build to inspect it in gdb (with custom data structure walking scripts); or a background service that compares test failures in a patch run with the known failures.

However, when so many tests are run, there are also lots of test failures—“red boxes.” Many are false positives (bugs in the test itself), others fail to complete in time due to excessive resource usage or a faulty cloud machine. Of course, tests also fail because there’s a real code bug. It can be hard to tease out which category a red box falls into.

Only one failure mode is dead simple to identify: data corruption. Data is duplicated within a node in the form of indexes and the operation log—as well as across nodes in a replica set. Because so much data is derivative, it’s trivial to see when something has gone wrong. An error message saying an index key is missing (or that nodes in a replica set have diverged) is rarely a false positive. Data corruption is a symptom; the root cause problem already happened. It’s that we’re just now observing the mistake. Unfortunately, the tooling for finding the reason for data corruption can only go so far.

Due to the nature of data corruption, all of the fancy tools to inspect test logs and data files provide clues, but not necessarily the answer. Those tools can speed up the process, and remove barriers for gleaning details, but they’ll rarely provide the crucial detail required to fully understand the problem. And until there’s a reproducer to identify the bug and assess its impact, engineers are left investigating a bug that could be infinitely bad—and could also take an infinite amount of resources to find the root cause of.

Investigating with Antithesis

I led multiple data corruption investigations during my tenure at MongoDB, but the most memorable was one that was found by our testing partner, Antithesis.

Since 2021, MongoDB has used Antithesis to rigorously test MongoDB’s application code. This includes testing MongoDB’s Storage Engine, Core Server, Cluster-to-Cluster Syncing Utility, and upgrade/downgrade compatibility. The Antithesis platform places software in a custom, deterministic hypervisor, injects faults, and uses fuzzing-like techniques to explore code and uncover bugs. Antithesis is then integrated into MongoDB at the CI level, to be run as part of their release process. This helps MongoDB find bugs before they are introduced in production and impact customers. The Antithesis simulation environment gives engineers and developers a perfect reproduction of any found issues, making triaging failures far more productive for them.

In this particular test failure, the symptom was an index inconsistency. An index entry existed, but the document it referred to was missing. This can happen for, broadly, two reasons:

  • There can be a bug with index builds. Particularly when the index is being built concurrently with writes to the collection being indexed.
  • There can be a bug in the storage engine. Outside of index builds, index entries are written transactionally with writes to documents. Thus being able to see them out of sync implies the storage engine lost a write.

This bug was the latter case. The index in question was the _id index. In MongoDB, the _id field is the primary key; that index is created along with the collection. Thus, the index build code was exonerated from any wrongdoing.

The other observations about this failure were:

  • It happened on the config.transaction collection.
    • This was notable as config.transaction is an internal collection. The rules about data and indexes being updated in transactions still hold.
    • But, writes to config.transaction happen in different code paths and it has some other peculiar properties. For instance, some updates to documents in the config.transaction collection are elided.
  • The error was observed soon after a replication rollback.
    • A replication rollback is performed when a node learns it has accepted writes that are not in the leader’s history.
    • A replication rollback has its own special code paths in the storage engine and replication layers, as well as all the layers in between.
  • The data files from the end of the test were available, but there were no clues about the missing data. This bug did a good job hiding its tracks.

This test failure came at an inconvenient time, businesswise. The MongoDB server team was preparing for the next release and was rapidly working through loose ends. Still, we immediately reached out and got on a call with Antithesis to see what our options were for getting to the bottom of this.

Antithesis came prepared. They had rerun their simulations, and had not only reproduced the bug, but found the exact time at which it had occurred. How was this possible?

Antithesis doesn’t just run a test from start to finish one time—and it doesn’t just run a test from start to finish multiple times, either. It runs a simulation for some amount of time and takes a “checkpoint.” It will then add entropy and explore multiple paths rooted at that checkpoint. If one path keeps leading to the same error state, it can record that as a problem and spend resources exploring other paths instead.

This testing technique is super powerful for software like MongoDB. The most concerning bugs that will affect customers are those that happen when there’s lots of data. And inserting data not only takes time, it takes time away from testing the operations we want to do on that data. Longer-running tests can find more problems, but reproducing those problems becomes harder as the iteration cycle goes from a couple minutes to an entire workday. Antithesis’s checkpointing and branching algorithm naturally spends more resources exploring branches after the MongoDB database has been running long enough to get itself into more interesting states.

Ultimately, what Antithesis told us was that they had an analysis that summarized the probability of a failure from any given checkpoint. They explained that when the probability spikes between adjacent checkpoints, we know with confidence that the bug happened in that time window.

This was useful but unfortunately, it still wasn’t good enough. Based on the observations I had made from the logs and prior experience of where we’d had these storage engine bugs before, I already knew the issue would be in the storage engine code related to replication rollback—and that was indeed what Antithesis’s analysis was pointing at. So, while I was super impressed with the quality of the results Antithesis’s software produced, I still didn’t know why the bug was occuring, even if we now were sure of when and where.

So what else could we do? Could we add logging to the database, recompile and rerun the experiment? Not exactly. Antithesis reproduces failures by reproducing the exact (simulated) timing of every instruction the CPU processes, such that context switching happens at precisely the same time. Adding or removing instructions would have a butterfly effect on the outcome. Therefore, instrumenting the binary with diagnostics was off the table… What else was there? Hint: the answer wasn’t glamorous.

We asked for core dumps for every 100 milliseconds within a two second range around the bug.

The ask wasn’t crazy for their software, but that didn’t mean it was simple to do. At that time, each run had to be done by hand (and we were heading into the weekend). It took a while, but we got them.

And those core dumps were gold.

Diving into the core dumps

So what information was in these core dumps that would provide a clue? First, we had to identify what to look for, and when to look for it. First, we analyzed the data files left behind after the test exited. The failure we were seeing was an index inconsistency with a collection on the _id index (MongoDB’s primary key). What the error was telling us was that the index entry was referencing a document that was no longer in the system. Looking at the oplog (the operation log, or history of changes) from the leftover data files, we could tell the document was never intentionally deleted. But our first interesting clue was that this document was updated a lot—thousands of times in rapid succession, a state that doesn’t happen often in our normal testing operation. Lots of writes slow down tests; and our test suite is very large and slow to fully execute.

The other information we had before examining the core dumps were the MongoDB logs that the affected node generated before the failure. It had just gone through a replication rollback. Due to the distributed nature of a replica set, data is first committed locally on a primary node, then replicated to other secondaries. When a majority of nodes have locally committed the same write, we say that write is now majority committed. If a failure happens after data is locally committed, but before becoming majority committed, that data is subject to being erased via a replication rollback. The heavy lifting for this is also supported by WiredTiger and is heavily optimized by necessity. Which incurs some correctness risks.

Armed with those observations, we wanted to get core dumps right around that replication rollback event—and we wanted to check the state of the document that went missing. Getting the exact time was challenging but the people at Antithesis did a remarkable job working with us running multiple experiments a day to catch the bug in the act. A core dump that came in a fraction of a second before the replication rollback event looked completely fine. A core dump that came in a fraction of a second after the replication rollback had lost the data and there was no trace where it went.

But, we did get a core dump at just the right time. There was something a bit different. But to understand, we first need to talk about MVCC, multiversion concurrency control.

In a simpler database, when a document is updated one could lock the document, update it in place, then release the lock. But there are two frequent complaints about that strategy:

  • Taking a lock blocks other operations trying to access that document. (There may be older readers in the system that need access to the old version of the document.)
  • Updating documents in place is (broadly) incompatible with offering snapshot isolation.

So, many database storage engines instead do a “copy on write” algorithm for updates to a document. This can be done without requiring locks on the document, so that the previous version of the document is available for older readers that want it.

But every advanced technique tends to introduce new complexity. The new problem MVCC introduces is that the database has to keep all of these old documents around and delete them when they are no longer needed. What if the database process is running out of memory? The answer may seem obvious—we write it to disk of course!

This core dump in the Goldilocks zone caught WiredTiger in this memory reclamation state. It had moved the thousands of versions of this affected document into a structure suitable for writing it to disk. And in that state we made an observation: the historical contents of this document was being split across multiple WiredTiger containers. With these last key ingredients, we were able to create an isolated reproducer.

A solution is found

After some calibration on timing, we had a great lead. We caught the storage engine replication rollback algorithm in a window no wider than 10ms. It had completed its first phase and the data in question was still present and correct. We narrowed the problem down to the second half of the algorithm and we had the exact input that would lead to an error.

The analysis Antithesis had provided us was so useful, that they ended up turning it into a full-fledged product. What they had created with us was a prototype version of what has become their bug report.

Bug likelihood over time

This wasn’t my bug to diagnose; that code wasn’t my expertise. I handed off the information to Hari (a rockstar) in the Sydney office. My JIRA inbox welcomed me Monday morning with a diagnosis, a reproducer and an example fix, confirming the diagnosis. See here for the gory details on this gnarly bug.

Since that investigation, through to my last days at MongoDB, Antithesis was an incredible partner to work with. They continued to improve how their testing explores paths to maximize coverage. They helped onboard more of our tests into their system. They came to the MongoDB office and talked about their technology (which our engineers loved hearing about). But the other (arguably larger) half of what makes Antithesis great is the company’s commitment to their customers’ success.