Accelerating developers at MongoDB

Since 2021, MongoDB has used the Antithesis continuous reliability platform to rigorously test their application code, uncovering critical bugs before they reach production. Issues that previously blocked annual releases and slowed the progress of hundreds of developers are now found autonomously with Antithesis. As a result, MongoDB has more confidence in their software’s quality and reliability, and their developers are more productive.

This strategic partnership ensures MongoDB's continued delivery of high-quality software, while maintaining an exceptional experience for MongoDB's customers and developers.

Background

About Antithesis

Antithesis is a continuous reliability platform that autonomously searches for problems in whole software systems. This search occurs within a simulated environment where every problem found can be perfectly reproduced, allowing for efficient debugging of even the most complex problems. Antithesis was founded by the creators of FoundationDB, an exceptionally reliable distributed database that is the backbone of cloud infrastructure at Apple, Snowflake, and other companies. Antithesis is inspired by the powerful simulation testing framework that made FoundationDB possible.

About MongoDB

MongoDB is an open-source document database built on a horizontal scale-out architecture that uses a flexible schema for storing data. Founded in 2007 as a high-performance NoSQL database, MongoDB now has dozens of products, offering innovative solutions for artificial intelligence, edge computing, and serverless development. In addition to a continually growing worldwide following in the developer community, over 46,000 customers trust MongoDB with their data. From Fortune 500 companies to government agencies, these customers know that with MongoDB, their data is readily available, complete, and correct.

MongoDB ensures that customers and developers are satisfied thanks to an emphasis on software quality. A focus on rigorous testing is a core part of MongoDB’s engineering culture, which has enabled them to reliably meet their customers’ needs while continually innovating.

Testing at MongoDB

Testing operations at MongoDB

MongoDB’s dedication to software quality and comprehensive testing is exemplified by the development of Evergreen, an open-source continuous integration system with robust testing capabilities. In addition to continually testing their code internally, MongoDB trusts Antithesis to ensure that their products are fault-tolerant, resilient, and correct.

Antithesis’s autonomous testing platform helps MongoDB find and fix critical issues ranging from data correctness problems and data corruption to server crashes, data races, and memory leaks. Since 2021, MongoDB has used the Antithesis platform to test their core server database code as well as their WiredTiger storage engine.

Server testing with Antithesis

For their core server testing, MongoDB repurposed their comprehensive JavaScript (JS) test library to serve as a workload within Antithesis. These tests perform a broad array of end-to-end operations that emulate what users do with MongoDB in production. They also confirm the correctness and integrity of these operations.

As MongoDB’s server is designed to be fault-tolerant and resilient, the Antithesis platform tests these properties by injecting the following types of faults:

  • Network faults (partitions, packet delays, drops, and reorderings)
  • Node faults (including hanging, killing, and restarting server processes)
  • Hostile thread scheduling
  • Clock jitters
  • CPU modulation

Instead of being purely random, this fault injection is guided by the Antithesis platform, which seeks out interesting or under-explored corners of MongoDB’s state space.
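
The difference between purely random and guided fault injection can be illustrated with a toy sketch. This is not Antithesis's algorithm, which the text does not describe. It assumes a simple heuristic in which faults that have been tried least often in a given system state are favored. The fault identifiers are shorthand for the categories listed above, and the state name `primary_stepdown` is hypothetical.

```python
import random
from collections import Counter

# Shorthand names for the fault categories listed above (illustrative only).
FAULTS = ["network", "node", "thread_scheduling", "clock_jitter", "cpu_modulation"]

def pick_fault(state, tried, rng):
    """Choose a fault for `state`, favoring faults tried least often there."""
    # Assumed heuristic: weight falls as a (state, fault) pair is exercised more.
    weights = [1.0 / (1 + tried[(state, fault)]) for fault in FAULTS]
    fault = rng.choices(FAULTS, weights=weights, k=1)[0]
    tried[(state, fault)] += 1
    return fault

def make_schedule(seed, state="primary_stepdown", picks=50):
    """Build a reproducible fault schedule for one hypothetical state."""
    rng = random.Random(seed)  # fixed seed makes the schedule repeatable
    tried = Counter()
    return [pick_fault(state, tried, rng) for _ in range(picks)]

schedule = make_schedule(seed=42)
print(Counter(schedule))
```

Because the weights shrink as a fault is reused, the schedule drifts toward whatever has been exercised least, and seeding the generator keeps each schedule repeatable.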

MongoDB puts eight different network topologies under test within Antithesis. Together, these topologies exercise as much of MongoDB’s functionality as possible, from multi-sharded clusters, to version upgrades and downgrades within clusters, to replica sets.

MongoDB builds images of their application code, their workload, and their configuration from their development master branch. These images are then pushed to a dedicated, isolated artifact registry within Antithesis on a nightly basis and continuously tested on the Antithesis platform. Antithesis also supports additional on-demand testing during heavy periods of development leading up to key software life-cycle milestones, such as MongoDB's annual releases.

Tests run at approximately 50x parallelism, so a test that runs for six hours of wall-clock time consumes roughly 300 hours of CPU time (50 × 6). Moreover, the Antithesis platform is able to “accelerate time” (that is, to fast-forward the simulation through periods of idleness) while also injecting faults at a rate much higher than they appear in the real world. Consequently, 300 CPU hours of Antithesis testing can find vastly more issues than running a production cluster for 300 hours.
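
The wall-time to CPU-time conversion above is simple enough to write down as a small sketch. The function names are illustrative, and the 50x figure is the approximate value quoted in the text. The calculation ignores time acceleration and the elevated fault rate, both of which make each CPU hour more productive still.

```python
PARALLELISM = 50  # approximate parallelism quoted in the text

def cpu_hours(wall_hours, parallelism=PARALLELISM):
    """CPU time consumed by a run lasting `wall_hours` at the given parallelism."""
    return wall_hours * parallelism

def wall_hours_needed(target_cpu_hours, parallelism=PARALLELISM):
    """Wall-clock time needed to accumulate `target_cpu_hours` of testing."""
    return target_cpu_hours / parallelism

print(cpu_hours(6))            # 300
print(wall_hours_needed(300))  # 6.0
```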

Finding and fixing MongoDB bugs with Antithesis

Thanks to Antithesis's continuous testing and its guided fault injection, MongoDB is able to find novel and difficult bugs quickly, so the bugs can be triaged and fixed right away. This keeps developers from losing context and improves the overall productivity of the team.

Since this testing cadence began in early 2022, MongoDB has found an average of roughly one bug per 2,500 hours of testing on Antithesis’s platform. Of these bugs, 75% were found only with Antithesis and not through any other MongoDB testing. Most are categorized as having “Major” or “Critical” priority, with roughly 40% of them being release-blocking.

Tests that run on the Antithesis platform generate a comprehensive triage report which is sent directly to the relevant MongoDB team for intake. MongoDB and Antithesis have worked together to develop appropriate and exhaustive test properties, which automatically extract any relevant logs, core dumps, and data directories from the simulation at the moment of the crash.

One benefit of this workflow is that MongoDB can find especially subtle correctness and reliability bugs in their database code within a day of their introduction. In many organizations, these bugs are only found long after deployment to a staging environment, or even in production. This cadence also helps MongoDB resolve the issues quickly: the average time from when an Antithesis bug ticket is opened until its resolution is under three days, which is no small feat given that these bugs are often complex.

While roughly 95% of MongoDB bugs found using Antithesis’s platform can be identified and fixed with the information in the Antithesis triage report, some especially difficult and rare bugs need additional details. For example, when Antithesis found a critical data corruption bug as an annual release was rapidly approaching, MongoDB engineers needed insight and answers fast. Enter the Antithesis bug report.

The bug report uses Antithesis’s deterministic platform to analyze a multiverse of program states, discovering how likely the bug is to arise from various points in the history leading up to the moment the bug is detected. The unique ability of Antithesis to perform this analysis means engineers can not only see the moment a failure was reported, but also work backwards to find when the problem truly occurred.

In the case of this critical data corruption bug, the bug was detected 370 seconds into a particular scenario. However, the figure below shows that its likelihood of occurring rose sharply nearly ten seconds before it manifested, which meant that engineers needed to carefully examine the program state around that time. Antithesis can provide core dumps, log messages, and other information about the program state at that moment, greatly simplifying the debugging effort. When working with instrumented code, this analysis is extended to estimate how likely the bug is to occur when specific functions are executed, and vice versa.

[Figure 1: likelihood of the bug occurring, plotted against time in the scenario]
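
The idea behind this analysis can be sketched with a toy model. This is not the Antithesis implementation. It assumes a system with a single hidden "corrupted" flag, one simulation step per second, a hypothetical root-cause step at 360, and a hypothetical per-step fault probability. From each checkpoint in the recorded run, it replays many alternate futures and reports the fraction that end in the bug.

```python
import random

END_STEP = 370        # step at which the failure was detected (from the text)
CORRUPT_STEP = 360    # hypothetical step at which the state actually went bad
P_CORRUPT = 0.0005    # hypothetical per-step chance of the rare fault

def step(corrupted, rng):
    """Advance the toy system one step; corruption is latent and never heals."""
    return corrupted or rng.random() < P_CORRUPT

def recorded_history():
    """Hidden state at each checkpoint of the one run where the bug appeared."""
    return [t >= CORRUPT_STEP for t in range(END_STEP + 1)]

def bug_likelihood(history, checkpoint, trials=200, seed=0):
    """Fraction of alternate futures from `checkpoint` that end in the bug."""
    hits = 0
    for i in range(trials):
        # A string seed is hashed deterministically, so each branch is repeatable.
        rng = random.Random(f"{seed}-{checkpoint}-{i}")
        corrupted = history[checkpoint]
        for _ in range(checkpoint, END_STEP):
            corrupted = step(corrupted, rng)
        hits += corrupted
    return hits / trials

history = recorded_history()
for t in (300, 350, 359, 360, 365):
    print(t, bug_likelihood(history, t))
```

In this model, every future branched from step 360 or later ends in the bug, while futures branched earlier rarely do. That jump in likelihood is what points engineers at the moment the problem truly occurred, rather than the moment it was reported.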

Without Antithesis’s intelligent exploration capabilities, this bug might not have been found until after the release, perhaps resulting in data corruption for a MongoDB customer. Due to Antithesis’s perfect reproducibility and interactivity, MongoDB was able to quickly qualify and triage this issue, keeping their release on schedule.

A strategic partnership

Collaborations like this one have strengthened the partnership between MongoDB and Antithesis, resulting in many benefits to both companies.

MongoDB has been an instrumental design partner for Antithesis, and has helped to mature the Antithesis offering in countless ways over the past several years. In addition to having deep expertise in distributed computing, consensus protocols, replication, and developer productivity, MongoDB engineers are obsessive about correctness. This combination of domain knowledge and dedication to quality has enabled the dozens of MongoDB engineers who work with Antithesis to provide valuable feedback and recommendations on Antithesis features and functionality.

Antithesis, in turn, has provided MongoDB with critical assistance in keeping to their release schedules. As one Senior Vice President of Engineering at MongoDB put it: “I can say for sure that Antithesis turned around a hard bug with a very high value to us in short order. This is the kind of thing we'd often take a month+ to solve with all hands on deck, getting there in a week improves our release confidence and saves me 3+ weeks of some of my most expensive engineers.” MongoDB’s engineering organization has integrated Antithesis into their testing and qualification tooling, and continues to look for ways to test more parts of their product with Antithesis.

Over the years, the teams at Antithesis and MongoDB have developed a strong partnership. Antithesis especially appreciates MongoDB’s in-house expertise around developer experience and deep knowledge of distributed systems testing. We look forward to maintaining and improving our collaboration in the years to come.

Summary

Over the past three years, MongoDB has tested their application code for hundreds of thousands of hours with Antithesis. MongoDB has built integrations that tie Antithesis directly into their developer workflow, enabling them to find dozens of major and critical bugs that otherwise might not have been found until production. Continuous testing means that MongoDB can find and fix these issues shortly after their introduction, saving developer time on root-cause analyses. Over the course of their multi-year partnership, Antithesis has empowered MongoDB to continue pushing the limits of software quality and of developer experience.


At Antithesis, we want to bring the reliability, confidence and productivity benefits of autonomous testing to your team. Antithesis is a continuous reliability platform that autonomously searches for problems in your software within a simulated environment. Every problem we find can be perfectly reproduced, allowing for efficient debugging of even the most complex problems. Check us out to learn more.

Dive into this public issue if you’d like to understand the bug referenced in this case study in greater detail.