How Antithesis could have prevented the CrowdStrike incident

July 22, 2024

Last week, CrowdStrike put out a release–on a Friday. Within hours, an eerie silence descended across the globe. Millions of screens flickered, then faded to an ominous blue background. In the top left corner of the screen, a simple yet haunting symbol appeared: a sad face emoji, signaling an unexpected, catastrophic failure. The dreaded Blue Screen of Death (BSOD) had returned with a vengeance, bringing the modern world to its knees. Confusion turned to panic as airlines, hospitals, banks, and other services around the world ground to a halt.

Blue Screen of Death

The questions on everyone’s lips: How? What caused it? Could it have been prevented? Is my software safe? This post attempts to answer all of these and more.

The ripple effect of software bugs

Software bugs can have far-reaching consequences that vastly outweigh the effort required to prevent them. The recent CrowdStrike incident vividly demonstrated this, forcing countless operations teams to spend their Friday evening manually taking nodes offline and intervening to recover from the buggy update. The impact of bugs in critical infrastructure products like CrowdStrike extends far beyond major inconveniences and financial costs in the billions—it puts lives at risk.

Before diving any further into the CrowdStrike incident, it’s important to first acknowledge the reality of human errors. Mistakes happen. Being a developer—even one of the renowned 10x developers—doesn’t make you immune to this reality. As developers, we rely on our tools and processes to catch and mitigate errors before they reach production. However, when these fall short, the consequences can be severe.

Classifying the CrowdStrike bug

When news broke that CrowdStrike’s software update was causing widespread system crashes, the developer community sprang into action to understand the root cause. One analysis, from Zach Vorhies, suggested a critical issue: CrowdStrike’s new update included C++ code that attempted to access an invalid memory address. This revelation led to a collective burst of excitement from the Rust community, who exclaimed, “it was a null pointer dereference!” all over ~~Twitter~~ X. However, it’s important to note that others, like Tavis Ormandy, think that Vorhies is wrong, and that while it was a memory safety error, it was unrelated to dereferencing a null pointer. Either explanation raises more questions than it answers. After all, how could a single file have caused such widespread havoc?

At the heart of the issue lies Falcon, CrowdStrike’s flagship product. Falcon’s design relies on installing a software component directly on users’ computers, granting it privileged access to monitor threats. This level of access introduces significant development risks. At this level, even a small coding error can have catastrophic consequences. In the case of the CrowdStrike incident, the operating system detected the failure and initiated protective measures, resulting in the dreaded BSOD.

The ~~billion~~ trillion dollar mistake

Is it really that simple? Could all of this have been stopped with basic memory safety checks? The answer lies in understanding the broader context of safety in software, and in particular of memory safety.

Memory safety refers to how well a program prevents or mitigates errors related to memory access and usage. These errors include null pointer dereferences, buffer overflows, and double-free errors, among others. The impact of these issues is staggering. Research has consistently shown that a significant portion of vulnerabilities in software stem from memory safety issues, particularly in languages like C and C++. A study by Microsoft in 2019 found that approximately 70% of the vulnerabilities they addressed each year were memory safety issues. This means we all could be unknowingly harboring our own CrowdStrike.
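
To make the idea concrete, here is a minimal sketch in Python, a memory-safe language, of what "preventing or mitigating" an invalid access can look like. The helper names `read_field` and `read_field_unchecked` and the sample data are hypothetical and purely illustrative; they are not taken from any real product. The point is only that an out-of-range read becomes either a value the caller can handle or a catchable error, rather than a read of whatever happens to sit at an invalid address.

```python
from typing import Optional, Sequence


def read_field(fields: Sequence[str], index: int) -> Optional[str]:
    """Return fields[index], or None when the index is out of range.

    Hypothetical helper, for illustration only.
    """
    # Explicit bounds check. Negative indices are rejected as well, because
    # Python would otherwise treat -1 as "last element" instead of an error.
    if 0 <= index < len(fields):
        return fields[index]
    return None


def read_field_unchecked(fields: Sequence[str], index: int) -> str:
    """No explicit check: the runtime itself refuses the invalid access."""
    return fields[index]  # raises IndexError when index is out of range


if __name__ == "__main__":
    fields = ["alpha", "beta", "gamma"]
    print(read_field(fields, 1))   # in range
    print(read_field(fields, 7))   # out of range, handled as None
    try:
        read_field_unchecked(fields, 7)
    except IndexError as exc:
        # The failure is a well-defined, catchable error.
        print(f"caught: {exc}")
```

In C or C++, the equivalent unchecked read is undefined behavior, which is why the same mistake can silently corrupt data or crash the process (or, in privileged code, the whole machine).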

Moreover, this isn’t the first time memory safety issues have caused such pain. Tony Hoare, who introduced null references in the ALGOL W programming language in 1965, famously described them as his “billion dollar mistake”. But if you add up the impact of all the outages and breaches caused by memory safety errors, plus all the hours developers have spent debugging and fixing these issues, it comes out to a lot more than a billion dollars. This seemingly simple programming construct has had far-reaching consequences.

Old code, new risks

Recognizing the risks associated with memory-unsafe languages, there’s been a growing push towards adopting safer alternatives. The U.S. government has taken steps to promote the use of memory-safe languages, with the National Security Agency (NSA) releasing guidance in 2022 encouraging the transition from C and C++ to memory-safe languages like Python, Go, and Rust.

However, the transition is not as simple as it might seem. Languages such as C, C++, and even Fortran continue to power critical systems in finance, healthcare, transportation, and more. These codebases often represent decades of accumulated knowledge and optimizations. Rewriting them in newer, memory-safe languages like Rust is not always feasible or desirable. The reality is that these memory-unsafe languages will remain prevalent in many industries for decades to come.

This leaves us with a crucial question: How can we improve the safety and reliability of existing systems written in memory-unsafe languages while the industry gradually transitions to safer alternatives?

Can we find memory safety bugs with Antithesis?

Yes! While it is easy to find a bug if you know what you’re looking for, most incidents stem from unknown unknowns. At Antithesis, we’ve built a platform that autonomously tests your software for exactly these types of bugs by simulating numerous usage scenarios. Once a bug is discovered, it’s consistently reproducible, allowing you to focus on debugging the issue rather than wasting time trying to recreate the right conditions. The result is a new tool that helps developers avoid a situation like CrowdStrike and significantly harden infrastructure products with every release.
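
The two ideas in that paragraph, exploring many randomized scenarios and being able to replay a failure exactly, can be illustrated with a toy example. The sketch below is not how Antithesis works internally and uses no Antithesis API; it is a simplified illustration of seed-driven randomized testing, with a hypothetical `TinyStack` class that contains a deliberately planted bug. Because every random choice flows from a single seed, rerunning the failing seed reproduces the same failure at the same step.

```python
import random
from typing import List, Optional, Tuple


class TinyStack:
    """Hypothetical system under test with a deliberately planted bug."""

    def __init__(self) -> None:
        self.items: List[int] = []

    def push(self, value: int) -> None:
        self.items.append(value)

    def pop(self) -> Optional[int]:
        # Correctly guarded: popping an empty stack returns None.
        return self.items.pop() if self.items else None

    def pop_two(self) -> Optional[Tuple[int, int]]:
        # Planted bug: the guard should be len(self.items) >= 2.
        # With exactly one item, the second pop raises IndexError.
        if self.items:
            return self.items.pop(), self.items.pop()
        return None


def run_scenario(seed: int, steps: int = 50) -> Optional[Tuple[int, str]]:
    """Run one random scenario; return (step, operation) of the first failure."""
    rng = random.Random(seed)  # all randomness is derived from the seed
    stack = TinyStack()
    for step in range(steps):
        op = rng.choice(["push", "pop", "pop_two"])
        try:
            if op == "push":
                stack.push(rng.randint(0, 9))
            elif op == "pop":
                stack.pop()
            else:
                stack.pop_two()
        except IndexError:
            return step, op
    return None


def find_failing_seed(max_seeds: int = 1000) -> Optional[int]:
    """Try seeds in order and return the first one whose scenario fails."""
    for seed in range(max_seeds):
        if run_scenario(seed) is not None:
            return seed
    return None


if __name__ == "__main__":
    seed = find_failing_seed()
    print("failing seed:", seed)
    if seed is not None:
        # Replaying the same seed yields the same failure, step for step.
        print("first run: ", run_scenario(seed))
        print("replay run:", run_scenario(seed))
```

Nobody had to guess in advance that "call `pop_two` with exactly one item" was the interesting case; the random exploration stumbled on it, and the seed makes the failure repeatable for debugging.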

Our platform has already helped developers identify numerous potentially catastrophic issues related to memory safety, including null pointer dereferences, concurrency bugs, and much more. Importantly, these issues are caught before users encounter them, and the process doesn’t require a complete codebase rewrite. If you’re interested in learning how Antithesis can improve your software quality, contact us!
