Will Wilson
CEO

Debugging in the Multiverse

Would figuring out your bugs and outages be easier if you had a time machine?
We are now making a time machine directly available to all of our customers.

Bring back my files

Obviously the first feature I want from my time machine is the same one I want whenever I accidentally delete data from my hard drive, install malware, or say something dumb in a sensitive conversation: sleep -5. We can do that! (See video.)

What exactly is happening here? Antithesis simulates a purely deterministic universe. We do that to find bugs faster and to make them perfectly reproducible once found. But if you can perfectly simulate something, then you can also perfectly simulate it up until 5 seconds from the end.[1] Then we can crack open that universe and give you a bash terminal inside of it. The resulting universe is still deterministic, just dependent on what you decided to do.

  1. In practice, we never need to replay the simulation from the beginning, because our hypervisor also supports fast and efficient snapshotting of the state of the guest system. See this talk by Alex Pshenichkin.
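
To make the rewind idea concrete, here is a toy sketch in Python. It is not how the Antithesis hypervisor works; the `Universe` class, its methods, and the seed are all made up for illustration. The only assumption is that every source of randomness comes from one seeded generator, so the same seed always produces the same history, and a copied state can be resumed or changed.

```python
import copy
import random


class Universe:
    """Toy deterministic simulation: all randomness comes from one seeded RNG."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.clock = 0   # simulated seconds elapsed
        self.log = []    # everything that has happened so far

    def step(self):
        """Advance one simulated second and record a pseudo-random event."""
        self.clock += 1
        self.log.append((self.clock, self.rng.randint(0, 999)))

    def run_until(self, t):
        """Step forward until the simulated clock reaches t."""
        while self.clock < t:
            self.step()
        return self

    def snapshot(self):
        """Copy the full state, RNG included, so a branch can resume from here."""
        return copy.deepcopy(self)


# The original history: 60 simulated seconds.
original = Universe(seed=42).run_until(60)

# Replaying with the same seed reproduces it exactly.
replay = Universe(seed=42).run_until(60)

# "Rewind" by simulating up to 5 seconds before the end and taking a snapshot.
checkpoint = Universe(seed=42).run_until(55).snapshot()

# Branch 1: change nothing, and history plays out as before.
untouched = checkpoint.snapshot().run_until(60)

# Branch 2: intervene (a stand-in for typing into the terminal), then continue.
branched = checkpoint.snapshot()
branched.log.append((branched.clock, "user ran a command"))
branched.run_until(60)
```

Both branches share the first 55 seconds with the original run. The second one differs only by what was injected, which is the sense in which the new universe is "still deterministic, just dependent on what you decided to do."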

Information from the past

Let’s get more concrete. Let’s use this to solve a real problem. My server has crashed and its process has exited! No worries, I’ll just rewind time, attach a debugger to the process, and set a breakpoint or capture a thread dump:

Packets from the past

Or you know what? I can’t count the number of times I was trying to figure out where my consensus protocol went wrong and wished I had a dump of all the network traffic. No biggie, I’ll just go back in time and decide I was capturing the traffic all along:

What was slow?

Strange and transient performance problems are a snap. Once Antithesis has found them for me, I can rewind time and enable profiling for the period of interest. I don’t need to worry about figuring out how to trigger the pathology again:

Back to the future

Like any good time machine, we can travel to the future too. The nice thing about a deterministic universe is that if the thing you’re simulating is mostly idle, you can just simulate it faster. Here we are waiting for 10 minutes to pass in just a few seconds. This kind of time compression is very useful when debugging networked systems:
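
If you haven't seen time compression before, the general idea can be sketched with an event-driven virtual clock. This is a minimal illustration of the concept, not a description of the Antithesis hypervisor; `VirtualClock` and the heartbeat workload are hypothetical. When nothing is scheduled, the clock simply jumps to the next event instead of waiting.

```python
import heapq


class VirtualClock:
    """Event-driven clock: time jumps straight to the next scheduled event."""

    def __init__(self):
        self.now = 0.0
        self._queue = []
        self._seq = 0  # tie-breaker so events at the same time keep insertion order

    def call_at(self, when, fn):
        """Schedule fn to run when simulated time reaches `when`."""
        heapq.heappush(self._queue, (when, self._seq, fn))
        self._seq += 1

    def run_until(self, deadline):
        """Run every event due by `deadline`, then set the clock to `deadline`."""
        while self._queue and self._queue[0][0] <= deadline:
            when, _, fn = heapq.heappop(self._queue)
            self.now = when  # skip the idle gap instead of sleeping through it
            fn()
        self.now = deadline


clock = VirtualClock()
fired = []


def heartbeat():
    """A mostly idle workload: wake once per simulated minute."""
    fired.append(clock.now)
    clock.call_at(clock.now + 60.0, heartbeat)


clock.call_at(60.0, heartbeat)
clock.run_until(600.0)  # ten simulated minutes, with no real waiting
```

The loop does work only when an event is due, so ten idle simulated minutes cost ten callbacks rather than ten minutes of wall-clock time.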

Change the past

But let’s be real: if you or I had an actual time machine, we wouldn’t be able to resist the temptation to go back to some historic event, change it, and then return to the present and see what’s different. That turns out to be a pretty useful technique when debugging too! We call it “multiverse debugging.” Let’s rewind time, turn off fault injection on our Kafka cluster, and see if the NullPointerException (NPE) still happens:

Imagine an extreme version of this. You could rewind a second, explore a thousand tiny variations of the past, and compute the proportion of them that still see the bug. Then you could rewind two seconds and do the same thing. Do that enough times and you’ve just recreated the Antithesis bug report. But with this new tool we’re giving you, you could have invented that bug report yourself. What else could you invent?
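
Here is a toy version of that experiment in Python, under loud assumptions: the "system" is just a list of random digits, the "bug" is an arbitrary pattern (two consecutive zeros), and every function name is invented for illustration. None of this is the Antithesis API. It only shows the shape of the computation: fork from progressively earlier points, vary what happens afterwards, and count how often the bug survives.

```python
import random

RUN_LENGTH = 60  # simulated seconds in the toy run


def simulate(prefix, seed):
    """Keep the events in `prefix`, then draw the remaining ones from `seed`."""
    rng = random.Random(seed)
    events = list(prefix)
    while len(events) < RUN_LENGTH:
        events.append(rng.randint(0, 9))
    return events


def has_bug(events):
    """Toy failure condition: two consecutive zeros anywhere in the run."""
    return any(a == 0 and b == 0 for a, b in zip(events, events[1:]))


def bug_rate(original, rewind, branches=1000):
    """Fraction of branches forked `rewind` seconds before the end that still hit the bug."""
    prefix = original[:RUN_LENGTH - rewind]
    hits = sum(has_bug(simulate(prefix, seed)) for seed in range(branches))
    return hits / branches


# Find an original run that hits the bug (a stand-in for a failure found by testing).
original = next(run for run in (simulate([], s) for s in range(10_000)) if has_bug(run))

# Rewind further and further back, measuring how often the bug still appears.
rates = {rewind: bug_rate(original, rewind) for rewind in (0, 1, 5, 20, 60)}
```

Rewinding zero seconds always reproduces the bug, because nothing has been allowed to change. The point where the rate starts to drop as you rewind further is a hint about when the bug became inevitable.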

A reactive multiverse

You may be wondering what’s up with this interface I’m showing you. It’s just a browser-based reactive notebook. But by connecting the notebook to a deterministic hypervisor, we get access to a world that is side-effect-free by definition. This means there’s no room to smuggle state, so we can make it truly reactive, even when it’s running commands on the Linux system running inside the hypervisor.

When you change the text in the notebook it immediately reacts, and the hypervisor reacts too. The UI is just the inevitable result of the notebook text causing a deterministic computation. UI = f(code), even when that code is injecting commands into a distributed system.
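
The UI = f(code) property can be sketched in a few lines of Python. This is only an illustration of the principle and says nothing about how the real notebook is built: `run_in_universe` is a hypothetical stand-in that fakes a deterministic system with a hash, and the cell contents are just strings that are never executed.

```python
import hashlib


def run_in_universe(command, seed=0):
    """Stand-in for a deterministic system: output depends only on its inputs."""
    digest = hashlib.sha256(f"{seed}:{command}".encode()).hexdigest()
    return f"$ {command}\n{digest[:12]}"


def render(notebook_text):
    """UI = f(code): the view is recomputed from the notebook text alone."""
    cells = [line for line in notebook_text.splitlines() if line.strip()]
    return [run_in_universe(cell) for cell in cells]


first = render("ps aux\ntcpdump -c 10")
again = render("ps aux\ntcpdump -c 10")   # same text, same view
edited = render("ps aux\ntcpdump -c 20")  # only the edited cell's output changes
```

Because `render` keeps no hidden state, evaluating the same text twice gives the same view, and editing one cell changes only that cell's output.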

It’s available now

We haven’t even begun to scratch the surface of this capability, but this post is already too long. We are rolling this out to all of our existing customers today.

If you’re already a customer, get started with our new documentation. If you’re not already a customer, contact us. We’ll find your bugs and then give you a universe-hopping time machine to fix them.