Depth is all you need: how Antithesis crushes Gradius
In the early days of this company, the intro project we gave to new engineers was to use the Antithesis platform to beat a game for the NES. This might seem like an odd choice, but finding bugs in production code is a lot like beating NES games: victory screens and bugs are both hard-to-find events.
An important difference which makes exploring NES games more approachable than finding bugs is that NES games come with advanced, usability-optimized data visualization software to display the current state (and available state transitions) of the program. It’s much easier to understand progress toward the victory screen of Super Mario Bros. than to reason spatially about a rare bug in a distributed system – but in both cases, to reach the desired state, the player has to traverse the world of the program, making a number of unlikely events happen along the way.
But what about other kinds of interesting states? Consider a UTF-16 to UTF-8 converter with an off-by-one.1 What does it take to discover this bug? How hard is it to find?
In some sense, it’s worse than one in a million: there are 1,114,112 (216 × 17) possible code points in Unicode, but only one can trigger the bug. If you give the converter the right code point, it finds the bug; if you give it any of the 1,114,111 wrong ones, it doesn’t. This is especially challenging for an exploration-based system like ours, since there’s no incremental progress toward the bug: either you find it or you don’t.
In another sense, it’s trivial: for computers, a million is a small number. The converter offers no way to build up history over time – all it does is take in a code point encoded in UTF-16 and output its encoding in UTF-8 – so a computer could enumerate and test every possible input in a second.
Now consider the opposite extreme: the infamous 49-day crash in Windows 95. Windows 95 counted its uptime in milliseconds and stored it in a uint32
; if it ran continuously for 232 milliseconds (49.7 days), this counter would overflow and the whole system would crash. To find this bug, only two choices are needed: the choice not to shut down and, much less trivially, the choice not to crash it some other way. If Windows 95 runs for 49.7 days, it crashes no matter what; if it doesn’t, it doesn’t, and you don’t find the bug.
What these examples show is that software state spaces come in different shapes. The Unicode converter and the 49-day crash are at opposite extremes: one shows breadth without depth, a million options at startup but no way to build history, and the other is a bug where depth – the progression of time – is (almost) all you need. Most programs are somewhere in between. In Super Mario Bros., for example, the victory screen – the rare event that we’re looking for – is a small plot of land at the end of a mostly-linear world. To get to the victory screen, you first have to get to point P, and to get to point P, you first have to get to point P-1, and so on until the beginning of the game. There are different routes between these points, and there are other dimensions in state space (powerup state, time left in the level, enemy positions, etc.), but the most important dimension is Mario’s X coordinate in the game world, which we tell the Antithesis platform to maximize.
But what about Gradius?
Gradius
Gradius (1985) is a side-scrolling shooter for the NES. The player controls a spaceship behind enemy lines, dodging bullets, shooting nasties, and blasting their way through hostile terrain.
data:image/s3,"s3://crabby-images/174ef/174eff1306c36159c6be7499736aafa2c7951d1d" alt="Gradius gameplay"
Some enemies drop power orbs when they’re defeated. The player can collect these power orbs and cash them in for ship upgrades – stronger weapons, more bullets, missiles, speedups, even a shield – which make it easier to thin the waves of enemies and survive, but if the player’s ship dies, the upgrades die with it. The more upgrades you collect, the easier it is to clear screens of enemies, and the fewer bullets you have to dodge – but if you die, your powerups die with you, and you respawn with nothing.
For human players, then, Gradius is a game of reflex, memorization, and upgrade strategy. Which enemies drop power orbs? How many upgrades do I have? How do I manage my orb stockpile to ensure that I can buy a new shield if I’m hit? While it’s possible to beat the game with an unimproved ship, it’s very difficult:2 the ship is glacially slow without speed upgrades, and its only weapon is a peashooter that fires straight ahead, making it hard (in some cases impossible) to hit the many ground turrets.
Beating Gradius the first time
Antithesis’ approach to exploring a state space has two components. Tactics are how we generate new inputs to feed into the system under test, producing states; strategies are how we judge those states, because only some of the states we generate are interesting enough to warrant further exploration.3
So when the time came to beat Gradius, we told our system about powerups, about upgrades, about the different weapons it could have, about the ship’s position on screen and the screen’s position in the level, and – most importantly – how to tell when the death animation was playing. (More on this later.) In other words, we set up a complex strategy to try to make the Antithesis platform play Gradius like an expert human, and we paired this with a simple tactic which generated random bytes.
This quickly ran into a problem: the ground. While a human with a physical NES controller can’t press two opposite direction buttons at the same time, a computer sending random bytes can. Game developers typically didn’t anticipate that their games would be played by a computer, so these combinations often do strange things. In Gradius, pressing both Up and Down makes your ship move down, so random inputs make moving down more likely than moving up. The results are predictable.
An advantage of our two-component approach is that strengths in one component can compensate for deficiencies in the other. What could a strategic approach to the sinking ship problem look like? A reward for staying close to the top of the screen? Then we’d struggle with any point in the game that requires flying under something. An attempt to equally distribute exploration energy across Y positions at each point in the level? Then we’d have to define ‘each point’ and teach the strategy how to determine the ship’s position in relation to the terrain.
A tactical approach seemed easier: we supplemented the random bytes tactic with a random “deltas” tactic, which chose on each frame whether to change the previous state of a button. This didn’t alter the relative probabilities of moving down and moving up, but it made it more likely for the ship to lurch upward, which helped it to avoid the ground.
Solving the sinking ship problem was enough to get us to the first boss – which was the next problem, because our system hadn’t been taught how to aim. This turned out not to matter, though: if the player stays alive for long enough, the boss will get bored and disappear. And it wasn’t just the first boss that did this – each of Gradius’s seven bosses eventually gets bored. It wasn’t until after we’d already beaten Gradius with this complex strategy that we realized the importance of this.
Seeing like a fuzzer
Somewhat unusually, Gradius is an autoscroller: the player has no choice but to progress or die. It’s less like Super Mario Bros. than like the 49-day bug in Windows 95: to find the rare state, all you have to do is decline to stop progressing. If you don’t shut down Windows 95, it’ll eventually reach the uptime overflow; if you don’t die in Gradius, your ship will eventually reach the end of the game.
When we realized this, we decided to see how simple a strategy we could get away with. This wasn’t just curiosity. It’s actually very important that Antithesis can succeed with a minimal strategy. A system that requires a complex strategy is hard to scale, because complex strategies are hard to write – we’d forever be limited by the need to teach our platform about each individual system under test.
So we deleted everything about powerups, everything about the ship’s coordinates, everything about score – and ran again with the much simpler strategy of maximizing depth without dying.4
data:image/s3,"s3://crabby-images/4e30f/4e30fd5450766de3049979196c093a4574d3787f" alt="Somehow not dying."
What does “maximizing depth without dying” mean? Depth just means time since console power-on, so of the words in that sentence, the harder one to define is “dying”. Gradius has a lives counter, but it only ticks down after the death animation has finished, so defining “without dying” as “without lives ever decrementing” would trick our agent (which doesn’t know what a death animation is, or that Gradius is about a spaceship, or that the NES has a display) into thinking some of the most promising states are ones where the death animation has started – the same ‘doomed states’ problem that we would later run into in Castlevania. When we discovered this, we found the memory location that tracks whether or not the death animation is currently playing, and we used that to define “without dying.”
There are two other memory locations we need to track: one to ensure that we’ve started a single-player game, the other to see whether the game is currently paused. From our agent’s perspective, Gradius is a game about maximizing time since console power-on while ensuring that memory byte 0x03 is equal to 64 (we’re in a single-player game), memory byte 0x15 is equal to 0 (we aren’t paused), and memory byte 0x100 has never been equal to 2 (the death animation has never played) – and this is all it needs to know.
Because the agent knows so little about the game, the gameplay is very different from what a human would produce. It burns most of its powerups on making the ship faster, then uses that speed to clip through walls without noticing. It deals with the stage 6 boss by hiding in a corner that bullets can’t reach and waiting for the timeout. And, because we only told it not to roll out from states that are currently paused, it hammers the pause button (more on this in the technical addendum). But it beats the game.
It might seem surprising that such a simple strategy can beat Gradius, but the simplicity of the strategy is a feature and a strength.5 A linear agent (such as a human) has no choice but to use a complex strategy, because it has one shot at the problem. It has to stay alive, so it has to track the positions of enemies, bullets, and damaging terrain, and think about ways to make the job of staying alive easier, like collecting and using powerups. The complexity of the strategy required grows with the complexity of the game you’re trying to play.
But our system doesn’t have to act like a linear agent. The rest of the platform around it is a mechanism for giving an arbitrary program save slots that can be written to and reloaded at arbitrary points. This translates a one-shot problem into a problem on which incremental progress is possible. We’re not sure if there’s a term for such an agent, but internally, we refer to this as “tree fuzzing,” and it enables our system to effectively test complex programs with a comparatively minimal understanding of their internal workings.
Buy somethin’ will ya!
Because it’s our anniversary week, I’ll just add – we were fuzzing Gradius while we were still in stealth, in 2020, when we had a single customer in beta. Today, we have far more customers than we could possibly support if we had to write a custom strategy for each piece of software that we’re testing. Most of our customers onboard (and start finding dangerous bugs) within 2-3 weeks, because Antithesis is effective even with a simple strategy. When we taught our system to play Gradius, we had no idea if this was really feasible. But it is, and you can try it for yourself today.
Technical addendum
The pauses were a little surprising to us, but what we actually told it was not to continue from a paused state. Our tactics generate sequences of inputs, which we call ‘rollouts,’ and our system sends each input in sequence to the system under test. This is partly pipelining for performance, but there are also actions that can only be expressed as sequences of inputs, like holding down the A button to jump over a pipe in Super Mario Bros. (we explain this in the video here). If we generated a random byte, materialized it, evaluated it, and generated a totally new random byte from scratch with no carried-over state, we’d have a 1 in 2N chance of holding down the A button for N frames, and our system would need a term in its strategy metrics for Y position when next to a pipe to have any hope of progressing. Because we generate whole rollouts at once, though, we can produce sequences of correlated inputs, or sequences of inputs that exhibit some higher-level structure. This in turn allows us to overcome local maxima and dips in the strategy/reward landscape.