The worst bug we faced at Antithesis
The Debate
It was early days at Antithesis, and we had a problem: machines kept crashing. The bug was rare. Frustratingly rare. Like, one crash every couple thousand machine-hours rare. Rare enough to trigger a giant internal debate about whether this bug was even worth fixing.
Some of us argued that we should just live with the bug. After all, one of the ironies of being a software testing company is that you usually aren’t in anybody’s critical path, so your own software doesn’t need to be perfectly reliable.1 We could add some retries, make our scheduling and orchestration system resilient to machine failures, and customers would occasionally have slightly delayed test results that they probably wouldn’t even notice.
The other side argued that even if this specific bug didn’t matter, having it out there would impose an invisible tax on everything we ever did in the future.
Remember, this was early in the business, and the lowest levels of Antithesis were under active development. We weren’t able to test Antithesis with Antithesis yet, and we were occasionally introducing real bugs that could manifest as a machine crash.
One of the highest-productivity things your team can do is not have any “mysterious” bugs, so any new symptom that appears will instantly stand out. That way, it can be investigated while the code changes that produced it are still fresh in your mind.
A rare, stochastic, poorly understood machine crash would completely poison our efforts to eradicate mysterious bugs. Any time a machine crashed, we would be tempted to dismiss it with, “Oh, it’s probably that weird rare issue again.” We decided that with this bug in the background, it would be impossible to maintain the discipline of digging into every machine failure and thoroughly characterizing it. Then more and more such issues would creep into our work, slowly degrading the quality of our systems.2
The Investigation
So it was time to investigate the bug. There was just one problem: it was incredibly rare, completely nondeterministic, and we couldn’t feed Antithesis into Antithesis yet. So we did what everybody does in this situation – we added all the logging we could think of around the problem, and then we waited for it to happen again…
The Antithesis hypervisor is a type 2 hypervisor that runs on FreeBSD and communicates with the rest of our platform over a homegrown TCP-based protocol. It’s important to the economics of our business that the hypervisor instances (which are big machines with a lot of CPUs) are kept fully saturated with work, so the protocol needs to be able to handle deep pipelining and multiple concurrent work queues. We’d also just completed a rewrite of the component that speaks the other side of this protocol from our hypervisor, so our first hunch was a logic bug or desynchronization that resulted in giving the hypervisor nonsensical instructions.3 That in turn could result in a kernel panic that would kill the machine.
Consequently, we focused our intrusive logging and instrumentation efforts around the networking component of the hypervisor, and waited a few thousand machine-hours for the bug to strike again. When it finally did, we cracked open our logs and saw… nothing. None of the messages that the hypervisor received appeared to be corrupt, or desynchronized, or anything. One moment it was happily processing work, and the next it was dead. What on Earth?
Our hypervisor is a big C program running partially inside the kernel, and it takes the strategy of shutting down promptly when it encounters any illegal or unexpected behavior. We don’t go quite to the extreme of crash-only software, because corruption in the hypervisor isn’t catastrophic, as it doesn’t sit on any security boundaries.4 Nevertheless, debugging arbitrary hypervisor corruption is a lot worse than debugging a crash, so we deliberately escalate a lot of things into crashes. Could some of this code have gone haywire?
We thought there was a chance, so we made a special build with all this error-checking and crashing behavior disabled and… waited another few thousand machine-hours. When the bug finally happened again, we saw an error staring us in the face from the one place in the network code that we had not instrumented.
You see, we had previously assumed a protocol bug, so we added logging all over the (quite complex) code that reads messages from the network and decodes them. We hadn’t bothered with the one line that handles sending messages, but there it was:
send() at 160 returned 19
This error is weird for a few reasons. First of all, error code 19 was a new one to me. A little googling turned this up:
19 ENODEV Operation not supported by device.
An attempt was made to apply an inappropriate function to a device,
for example, trying to read a write-only device such as a printer.
ENODEV
? Fun. But man send
on FreeBSD does not list that as an errno it can return! More weirdness: experimentation eventually showed that the error was transient. Later sends would just start working again. How could you transiently lack a device? Even weirder: this was a stream socket! Do you ever expect transient send() failures on a stream socket? I don’t. Lastly, how on earth could we be doing something that was affecting the machine in such a way as to generate this ENODEV
? Also nobody else on the internet had ever complained about this.
So, we did the sensible thing and grepped the entire FreeBSD kernel source for ENODEV
. That turned up this tantalizing code fragment in ena_datapath.c
:
if(unlikely((if_getdrvflags(adapter->ifp) & IFF_DRV_RUNNING) == 0))
return (ENODEV);
So, the documentation was wrong or obsolete: there was some way to get send()
to return ENODEV
. Even better: “ENA” turns out to stand for Amazon’s Elastic Network Adapter, and the crashing hypervisor instances were running on Amazon bare metal machines. But many questions remained. What did IFF_DRV_RUNNING
being false mean? How did the driver get into this error condition, and how did it bubble up to us?
The Lucky Break
Debugging is like life and love – sometimes you get something that you don’t deserve. You’re touched by a muse, inspiration strikes, you win the lottery, you catch something out of the corner of your eye. These events cannot be planned for or predicted; the best you can do is make space for them to happen in your life, and give thanks for them afterwards.
In our case, it happened when we were looking at a comparison of all the occasions that this bug had occurred. For some reason (a miracle, a happy accident, a friendly ghost), somebody had included an irrelevant column in the spreadsheet: machine uptime when the bug hit. Even more miraculously, that uptime had been converted to minutes, and the entries in that column all looked like: 68, 158, 98, 278, 98, 758… That’s right, the bug could only hit on an exact multiple of 30 minutes, following an 8 minute offset. This was obviously bizarre – nothing in our system cared about time in the real world, and 30 minutes was not a round number for a computer either.
Between this surprise discovery, and the tantalizing line of code in the FreeBSD kernel source, we decided it was time to call in the big guns. That’s right, it was time to contact Colin Percival.
Calling for Help
Colin is a legendary FreeBSD hacker, and the maintainer of the officially-supported FreeBSD AMIs, so we wrote him an email, and hoped for the best. Colin had been working on getting FreeBSD working in EC2 and fixing bugs ever since 2010, and odds are that if you run into a problem using FreeBSD in EC2 then Colin will have a solution – and if he doesn’t, he’ll definitely want to know about the problem.5
When we mentioned “occasional errors returned by send() system calls” to Colin, it rang a bell and he immediately wrote back: “Can you check /var/log/messages and see if there’s messages every 30 minutes about ENA going down and then back up?”. We hadn’t even mentioned the timing of when the errors occurred to Colin, so the fact that he mentioned “30 minutes” was a very good sign that he was onto something. And indeed, the messages Colin was expecting were right there.
It turned out this was a bug Colin had first fixed in the Xen netfront driver in 2017: When the SIOCSIFMTU ioctl was used to set the interface MTU, the driver would reset itself – even if the new MTU value was the same as the old one! Resetting the driver when the MTU changes isn’t entirely unreasonable (for example, internal buffers may be sized based on the largest packet size which needs to be handled) but clearly this doesn’t need to happen if there is in fact no change being made.6
Why was the MTU being set every 30 minutes? EC2 sets a DHCP lease time of one hour, and halfway through the lease the DHCP client wakes up and asks for a new lease. Along with being told that its IP address hasn’t changed, the response tells the DHCP client that the EC2 network supports jumbograms – at which point the DHCP client sets the interface MTU… every 30 minutes.
Now, you might reasonably wonder why a network interface resetting itself would cause this sort of error; but here we fall back on an observation of Robert Watson: Concurrent systems are actively malicious. The process of a network interface being reset involves two steps: First the interface goes down and the driver cleans up all of its resources; and second the interface comes back up and the driver re-initializes itself. While this all occurs before the SIOCSIFMTU ioctl returns, there is a window of tens of microseconds in between when the interface is down – or in network driver terms, the IFF_DRV_RUNNING
flag isn’t set.
Now, we don’t normally try to send traffic over an interface which is down, but here we have a second case of active maliciousness: Because the network stack is asynchronous for performance reasons, looking up where to send an IP packet is decoupled from actually sending the packet. So we called send() when the network interface was up, the kernel looked up the destination in its routing tables and picked the ENA network interface while it was still up… and then the interface went down before the kernel tried to actually send the packet. Presto, the kernel realizes that it’s trying to send the packet over an interface which is down, and ENODEV
works its way all the way back up to the application where it should never be showing up. It’s all over in less than a millisecond – but without good tools it can take months to figure out what went wrong.
Lessons
What did we learn from this debugging odyssey? The first thing is obvious: debugging slippery and nondeterministic bugs is abject misery. You have to make a guess, wait an incredibly long time for the bug to pop up again, and pray that whatever instrumentation or logging you added doesn’t somehow make the bug vanish. If we had been able to use Antithesis itself to attack this issue, we would have been able to track it down and fix it in a fraction of the time. This sort of pain was why we built Antithesis in the first place!
But there are other, more subtle, lessons. The argument that keeping this bug in our codebase would slowly degrade everything turned out to be totally correct! How do we know? Because when we finally found and fixed this bug, we immediately discovered a host of other rare issues that produced very similar symptoms and which had snuck in while this bug was live.
If not for this bug, we would have noticed those issues as soon as they were introduced and reverted or fixed the code that introduced them. But with this bug lingering in the codebase and in the backs of everybody’s minds as an excuse, it was too easy to sweep new bugs under the rug and assume that they were all the same issue.
Our industry often assumes a false tradeoff between quality and velocity. Fixing bugs quickly, as soon as they’re introduced, prevents them from piling up. That makes it cheaper and faster to diagnose new bugs, creating a virtuous cycle that supercharges productivity.
The other long-term effect of this bug was that we redoubled our determination to make Antithesis “self-hosting” and able to test itself. This sort of weird, timing-dependent heisenbug involving interactions with the OS is exactly the sort of thing we excel at finding and pinning down. So despite the fact that we’re an extremely challenging customer for ourselves,7 we launched a major effort to replace all of our internal testing with Antithesis itself. You can learn more about that effort at a talk that my colleagues are giving at Systems Distributed '24!
The End
Debugging is hard, but it’s easier when you have a little help from your friends. Check out what Antithesis can do for you.