An interview with Mark Logan, Tech Lead for Sui Core
Mark Logan is the tech lead for Sui Core at Mysten Labs, the group tasked with maintaining the core Sui protocol and validator software. He’s also a deterministic simulation testing (DST) enthusiast who built a homegrown DST system at Mysten – and then upgraded to Antithesis.
We sat down with Mark to chat about one of the toughest and most exciting projects he’s tackled at Mysten, the hardest bug he’s tracked down with Antithesis, and why he considers DST an absolute necessity for anyone building a blockchain.
These are excerpts from our conversation, lightly edited for clarity and concision. If you’d prefer to listen to the chat, we’ll be sharing the full audio in the coming weeks.
![Mark Logan](/images/people/partners/mysten_mark_logan.jpg)
My name is Mark Logan. I’m the tech lead on what we call Sui Core here at Mysten Labs, the group tasked with maintaining the core Sui protocol and validator software, excluding consensus. We consider the consensus system to be the backbone of our network, and then we build on top of that all the things related to transaction execution and object storage, authentication, things like that.
![TW Lim](/images/people/069_tw.jpg)
We’ve spoken a little bit about some of the big projects that you were willing to take on because you knew that you were testing with Antithesis. Can you tell us about a couple?
![Mark Logan](/images/people/partners/mysten_mark_logan.jpg)
One of the things about Sui that sets it apart from other blockchains is that it’s built from the ground up for as much parallelism as possible. Sui’s object system is basically a way to provide static parallelism where you can look at two transactions, check if they have inputs in common, and if not, let them run in parallel.
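To make the idea concrete, here is a minimal Rust sketch of that conflict check. It is not Sui's actual code: `ObjectId`, `Transaction`, `tx`, and `can_run_in_parallel` are hypothetical names, and the sketch ignores any distinction between reading and writing an input.

```rust
use std::collections::HashSet;

/// Hypothetical object identifier, used only for this illustration.
type ObjectId = u64;

/// A toy transaction that declares up front which objects it touches.
struct Transaction {
    inputs: HashSet<ObjectId>,
}

/// Helper for building a toy transaction from a list of input IDs.
fn tx(ids: &[ObjectId]) -> Transaction {
    Transaction {
        inputs: ids.iter().copied().collect(),
    }
}

/// Two transactions may run in parallel when their declared inputs are disjoint.
/// The decision needs only the declared inputs, not the results of execution.
fn can_run_in_parallel(a: &Transaction, b: &Transaction) -> bool {
    a.inputs.is_disjoint(&b.inputs)
}
```

With these definitions, `can_run_in_parallel(&tx(&[1, 2]), &tx(&[3]))` is `true`, while `can_run_in_parallel(&tx(&[1, 2]), &tx(&[2, 4]))` is `false` because both transactions name object `2`.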
We recently rolled out a project called the writeback cache, a highly specific cache for Sui transaction processing.
The writeback cache system allows transactions executing in parallel to write their outputs into memory safely in parallel. These outputs eventually get flushed to disk once we’ve proved that the validator hasn’t forked from the committee, so we never write potentially forked state to disk. This was the first kind of really tricky piece of multi-threading that we did in the Sui code base.
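The flow described here can be sketched in a few lines of Rust. This is a single-threaded toy under stated assumptions, not the real writeback cache: the type `WritebackCache`, its fields, the per-transaction sequence number, and the `flush_up_to` method are all hypothetical, and the concurrency that made the real system tricky is left out entirely.

```rust
use std::collections::HashMap;

/// Hypothetical object identifier, used only for this illustration.
type ObjectId = u64;

/// Toy writeback cache: outputs stay in memory until they are known to be safe.
struct WritebackCache {
    /// Outputs of executed transactions, oldest first, not yet on disk.
    /// The `u64` is an assumed, increasing sequence number per transaction.
    pending: Vec<(u64, HashMap<ObjectId, Vec<u8>>)>,
    /// Stand-in for the on-disk store.
    disk: HashMap<ObjectId, Vec<u8>>,
}

impl WritebackCache {
    fn new() -> Self {
        WritebackCache {
            pending: Vec::new(),
            disk: HashMap::new(),
        }
    }

    /// Record a transaction's outputs in memory only.
    fn write(&mut self, tx_seq: u64, outputs: HashMap<ObjectId, Vec<u8>>) {
        self.pending.push((tx_seq, outputs));
    }

    /// Reads see the newest pending write first, then fall back to disk.
    fn read(&self, id: ObjectId) -> Option<&Vec<u8>> {
        self.pending
            .iter()
            .rev()
            .find_map(|(_, outputs)| outputs.get(&id))
            .or_else(|| self.disk.get(&id))
    }

    /// Persist every pending write with sequence number <= `verified_seq`,
    /// i.e. everything already shown not to be forked. Later writes stay in memory.
    fn flush_up_to(&mut self, verified_seq: u64) {
        let mut remaining = Vec::new();
        for (seq, outputs) in self.pending.drain(..) {
            if seq <= verified_seq {
                self.disk.extend(outputs);
            } else {
                remaining.push((seq, outputs));
            }
        }
        self.pending = remaining;
    }
}
```

The point of the design is that `disk` only ever receives state that has passed the fork check, while readers still observe the latest unflushed outputs through `read`.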
Most of Sui’s multi-threading is done just by the fact that it’s running on a multi-thread async runtime, so you have a Rust future that can be scheduled on any thread. But there’s not a lot of fine-grained locking and that other kind of tricky stuff.
Obviously nobody can write bug-free multi-threaded code, so we wound up with a number of bugs in the codebase, even though we were testing with our homegrown DST system as well. Antithesis was able to find most of them.
I think if we didn’t have Antithesis I probably at this point would have had to go back and start thinking about, “okay, we need to build our own multi-threaded simulation tester.” So Antithesis was a huge boost there.
And there’s a part 2 of our writeback cache, which is basically a similar system for processing consensus output as opposed to transaction execution, where we keep state in memory until we prove that we’re not forked from the rest of the committee. It makes things faster because you’re going to the database less often.
A lot of times when we release these kinds of complicated features, we enable them with a flag. But this particular feature was very, very hard to put behind a flag without essentially duplicating a large section of the code base.
One of the shortcomings of our DST system is that it’s basically a single process. You build a test target, and it builds inside the simulator framework, and it runs with multiple simulated servers within the process. But they’re all running the same code because it’s just a single binary. So we don’t have a good way in the system to test the interactions between different versions of our software, or for us to test what happens when we upgrade the software from version A to version B.
Since Antithesis is working at the container level, it can do that. We can run two different versions of our code inside Antithesis and make sure that they’re happy to interact with each other.
There’s all kinds of forks that can happen if you accidentally change your execution code, obviously, and if you only test one version at a time, you might not notice that you changed something. With Antithesis we can test multiple versions at a time and make sure that they both agree on the execution results.
![TW Lim](/images/people/069_tw.jpg)
So these are things that you just wouldn’t have felt able to approach without knowing that you could test really complex, multi-threaded processes in a really robust way. You also reached out to us to let us know that you’d recently found a bug in Sui and had been trying to recreate it in Antithesis. Can you talk a little bit about both what that process was like and why you were trying to do it?
![Mark Logan](/images/people/partners/mysten_mark_logan.jpg)
So Antithesis found most of these multi-threading bugs in the writeback cache for us, but there was one that was found in production, but only rarely and under odd circumstances. We had one validator that was just crashing every 20 minutes, and nobody else was hitting the bug. This is a perfect illustration of how hard it can be to test these systems.
If you were testing the software on 105 out of 106 validators, you’d conclude that the software is reliable, and yet there’s this one that crashes all the time. Same code, right? Same configuration.
We eventually tracked it down using your time travel debugging. There’s two really neat features in Antithesis, beyond some of the stuff I’ve already talked about. One of them is this probability analysis, which is a really cool idea.
You take a failing run of the test, and then you replay the simulation. Say 10 seconds into the run, the crash happens. Antithesis can rerun the simulation up to nine seconds, then change the random seed that’s being used internally to drive all the randomness inside the simulation, and see if the bug still happens. You do that many times to basically see, at t=9, how likely is the bug. Then you do the same thing at t=8: if we vary the random state at t=8, how likely is the bug? And then you plot these over time.
![Bug probability from Antithesis report](/img_opt/2EKWrBwp0r-2000.png)
For a lot of bugs, this pinpoints the point in time at which the bug happened, which might not have anything to do with the point at which the system crashed, because your system crashes when it detects that an invariant has been violated, but the invariant could have been violated much earlier.
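The replay-and-reseed idea can be illustrated with a toy deterministic simulation in Rust. Everything here is an assumption made for the sketch and says nothing about how Antithesis is implemented: `Rng` is a small xorshift generator standing in for the simulator's randomness, `run` models a system in which a rare event silently corrupts state and the failure is only observed at the end of the run, and `bug_probability` replays the original seed up to a branch point and then tries many fresh seeds.

```rust
/// Small deterministic PRNG (xorshift64*), so the sketch needs no external crates.
struct Rng(u64);

impl Rng {
    fn new(seed: u64) -> Self {
        // Scramble the seed and force a nonzero state.
        Rng(seed.wrapping_mul(0x9E37_79B9_7F4A_7C15) | 1)
    }

    fn next_u64(&mut self) -> u64 {
        self.0 ^= self.0 >> 12;
        self.0 ^= self.0 << 25;
        self.0 ^= self.0 >> 27;
        self.0.wrapping_mul(0x2545_F491_4F6C_DD1D)
    }
}

/// Runs `steps` ticks driven by `seed`. If `branch` is `Some((at, new_seed))`,
/// the run is identical to the original until tick `at`, then continues with
/// fresh randomness. Returns the tick at which state was corrupted, if any;
/// the resulting "crash" is only noticed once the run ends.
fn run(steps: u32, seed: u64, branch: Option<(u32, u64)>) -> Option<u32> {
    let mut rng = Rng::new(seed);
    let mut corrupted_at = None;
    for t in 0..steps {
        if let Some((at, new_seed)) = branch {
            if t == at {
                rng = Rng::new(new_seed);
            }
        }
        // A rare event (about 1 in 50 per tick) bakes the bug in.
        let roll = rng.next_u64() % 50;
        if corrupted_at.is_none() && roll == 0 {
            corrupted_at = Some(t);
        }
    }
    corrupted_at
}

/// Fraction of reseeded replays, branching at tick `branch_at`, that still fail.
fn bug_probability(steps: u32, seed: u64, branch_at: u32, trials: u64) -> f64 {
    let crashes = (0..trials)
        .filter(|&i| run(steps, seed, Some((branch_at, 1_000_000 + i))).is_some())
        .count();
    crashes as f64 / trials as f64
}
```

Given a failing seed, calling `bug_probability` for each branch point from `0` to `steps` yields the curve described above: replays that branch after the corrupting tick always fail, while replays that branch at or before it fail only as often as the rare event happens to recur.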
What I’ve seen is basically your bug probability goes from 0.001% up to 99% in a very short span of time, because there usually is a specific event that happens that ensures the bug is baked in. So once we had a crashing run that reproduced the bug, we used that analysis to figure out where the bug occurred.
Then we used the time travel debugging system to run up to the point in time just before where we think the bug was getting baked in, even though the crash was much later.
Trace logging produces gigabytes and gigabytes of logs, so we can’t leave it enabled for an entire test run, but now we were able to turn it on for the very short period of time where we really cared about it and get all the information we needed from the trace logs.
And this showed us that the thing causing the bug was these two different threads that were both caching the same object ID at the same time. One of them was a writer thread and one of them was a reader that was just warming the cache to read it from disk. That didn’t directly show the results of the bug, but it proved to me that’s where the bug was.
And this was actually a piece of code that I’d read and re-read like 15 times, and every time I convinced myself that it was correct. But now the proof was staring me in the face. It’s like, “the bug is there, so go back and find it. Don’t convince yourself that it’s not there this time.” And then I was able to go back and actually understand how the bug happened.
This was a really interesting debugging experience that I think would have been quite difficult otherwise. I mean, it was already quite difficult, but we’d have had to add another order of magnitude in there.
![TW Lim](/images/people/069_tw.jpg)
You mentioned earlier that Mysten uses a homegrown DST system as well, because you have this really ambitious standard of shipping zero bugs. Can you talk a little about how that came about, and why, given that you already had a testing setup way more advanced than most other companies out there, you started working with Antithesis as well?
![Mark Logan](/images/people/partners/mysten_mark_logan.jpg)
I’ve been very interested in DST for a very long time, and one of the first things that I did at Mysten was I looked at the testing setup we had, and said, “I don’t think this is going to work. We really need something better.”
So I started trying to retrofit the system into a deterministic simulation tester. It wasn’t actually clear that it was possible, but I kept going and I got lucky, and it was possible, and eventually it started finding bugs. And then it was finding a bug a day. So I want to talk about why Antithesis is different, and why we’re using it in addition to our homegrown system.
You may have a lot of people reading who’ve dabbled or worked extensively in Rust. One of the things you find with Rust is you can just do things that are too scary to do otherwise. Like if you have a really big C++ project and you want to make a big change, it’s scary. You have to go very slowly, very carefully. Really think about every single thing that you’re changing so that you’re not introducing all kinds of undefined behavior or use-after-free, blah, blah, blah, right?
With Rust, you just do whatever you want and focus on your actual application logic, like the semantics of your program. And you trust that Rust is going to catch the data races and the use-after-free and anything like that. You just don’t have to think about it. And so you get a lot of velocity in Rust.
Going to a DST approach has a very similar effect. There are still all the things that Rust (or, if you’re still in C++, sanitizers and things like that) can’t catch for you. These are things where you have incorrect behavior: you don’t have any memory safety issues or undefined behavior or anything like that, but your program produces incorrect results. Obviously, a programming language can’t save you from that because it doesn’t know what “correct” is.
So you have a similar kind of effect with Antithesis where you can make these pretty bold changes that ordinarily you would have to approach so carefully in order to do safely, and you can just kind of write the thing.
I mean, you can’t be stupid. You do still have to think. You have to have an approach that makes sense and that will work in principle. But you don’t have to spend as much time thinking about, “OK, is this specific line safe? Do I have a race condition here?” It just allows you to be a lot more fearless with the kinds of changes that you make or even the kinds of changes that you contemplate.
If you have an old system, like a four or five-year-old system that’s accreted a lot of changes and bug fixes and things, it gets really scary to change that kind of code, unless you really believe in your tests. And for some systems, there are approaches that everybody knows.
If you have a 10-year-old compiler, it’s not going to be easy to change it, but you can have enough of a regression testing suite that you have some confidence in your ability to make larger changes. But for most people, when it comes to distributed systems, or even just transaction processing systems, they don’t have that kind of confidence. And that’s what DST can give you.
To read more about the technology behind DST, check out the talk we gave at FreeBSD.