Solving Zelda with the Antithesis SDK

June 3, 2024

Zelda world

When we began using Antithesis to explore NES games, The Legend of Zelda (1986) was at the top of our list. Developed at the same time as Super Mario Bros., Zelda is in many ways its polar opposite. Mario has highly linear gameplay and a single overwhelmingly important mechanic, which makes it easy for our platform to beat. But Zelda features a non-linear open world, a combat system with multiple types of weapons and enemies, and multiple special items that are necessary to complete the game.

Exploration-heavy games like Zelda are famously difficult for artificial agents to master. To fully explore the game world of Zelda we would have to reach 128 overworld screens and over 230 rooms across nine dungeons. Even Montezuma’s Revenge, a far smaller game (with only three truly distinct levels) for the Atari 2600, posed serious challenges to AI researchers for years after early work showing human-level or better performance on other Atari games like Breakout!

Does autonomously beating Zelda sound like a tall order? In this post I’ll give you a preview of some new features I’ve been working on in our SDK that enable us to totally crush Zelda, and that will enable you to test your software better as well.

`GO TO THE NEXT ROOM.`

The Zelda map is big and complicated and full of enemies. If our platform just blindly fed inputs into Zelda, we would overweight areas of the game close to the start screen, while underweighting areas that are far away. Instead, we want to roughly equalize exploration over all screens and caves.

Our SDK already offers a special type of assertion called a Sometimes assertion. Think of this as the opposite of a “regular” assertion – instead of asserting that something always or never happens, you’re instead asserting that it sometimes can happen. That isn’t much use in a unit test, but it’s VERY useful in a platform like ours, acting both as a sensitive measure of test coverage and as an actual assertion that you didn’t accidentally make some situation unreachable.

The cool thing is each of these assertions also does double-duty as a piece of guidance. When our platform hits a Sometimes assertion, it uses the program state that hits it as a starting point for further exploration. So all we need to do to explore Zelda is write a bunch of Sometimes assertions that Link, the player character, can reach each map location. But that would be tedious, which brings us to the first new SDK feature I want to mention.

We’re adding a function called SOMETIMES_EACH. Think of SOMETIMES_EACH(foo) as a large, potentially infinite family of Sometimes assertions, one for each value of foo that we ever see. We will pay extra attention to distinct values of foo when this assertion is hit. So now, to explore Zelda thoroughly, we just need to call SOMETIMES_EACH() and pass in the current screen identifier. In practice, since some screens have more than one connected component where Link can move, we get more detailed and track all 16 pixel-by-16 pixel regions of the screen.

Link can't reach the blue area from the red area where he's standing

While we’re exploring the game, we’d like to have a full inventory all else being equal. Many rooms in dungeons have doors that need to be unlocked with a key, or walls that need to be bombed to reveal a passage to the next room. Fortunately, SOMETIMES_EACH can take a compound parameter, like SOMETIMES_EACH({room_id, (bool)bombs>0}), which will try exploring each room with and without bombs in the inventory.¹

`IT'S DANGEROUS TO GO ALONE! TAKE THIS.`

Another problem that had to be solved early on was combat. Link comes across various kinds of enemies while exploring the overworld and dungeons. While many of the overworld enemies can be dodged, they drop useful items (like bombs or rupees) when they’re killed. And in the underworld, there are rooms where you have to kill all the enemies to advance – and every dungeon has a final boss that also must be killed.

The first challenge was figuring out how to tell when we were making progress at killing enemies. Zelda has specific variables that store the health of the enemies in the room Link is in. We tried summing them and attempting to minimize that number, but sometimes when you kill an enemy its health value goes to 128 instead of 0! So we took a more complicated approach: we noticed every time an enemy’s health decreased, and added up the values of those decreases. It might be tempting to then throw that number into SOMETIMES_EACH, but the trouble is we want to spend more time in situations where more damage has already been done, rather than equalizing over every damage state.

This brings me to the second set of new SDK features. We’re adding functions like ALWAYS_LESS_THAN(x, 5) and SOMETIMES_GREATER_THAN(y, 3). These are just like ALWAYS(x < 5) or SOMETIMES(y > 3), but the new versions tell our platform that you care about the maximum value of the numeric quantity represented by x or y, and it will explore more from states where those values are higher². In addition, you automatically get graphs in your triage report tracking the highest value from night to night. Adding a SOMETIMES_GREATER_THAN(summed_damage, 0) was enough for us to start clearing rooms.

`ONES WHO DOES NOT HAVE THE TRIFORCE CAN'T GO IN.`

Zelda is a highly nonlinear game, but there are prerequisites that keep you from exploring certain sections before you’ve finished others. Most notably, you have to collect all eight Triforce pieces from the first eight dungeons before you can enter the final dungeon and defeat final boss Ganon. You also have to obtain several items, such as the candle, the bow, and the bait. These are all sub-goals that Antithesis had to achieve as part of the ultimate goal of beating the game.

This brings me to the last of the new SDK features I want to talk about – SOMETIMES_ALL(a,b,c,...). This is an assertion that’s equivalent to SOMETIMES(a && b && c && ...), but once again you’re telling us a bit more about the property you want to check, and that lets us explore more efficiently. In this case, our platform treats each of the terms in SOMETIMES_ALL as a sub-goal, and tries to make progress along the frontier of how many of them have been simultaneously achieved.³

In the case of Zelda, our sub-goals were things like getting the whistle, the Triforce piece, and the heart container out of level 5, or getting the Magical Sword. Our platform would spend most of its time exploring from states where we’d achieved the most sub-goals so far. In all, we ended up defining 25 sub-goals that we had to accomplish before beating the game.

I’d love to be able to say that all we did was throw a SOMETIMES_ALL around our list of high-level goals and then we beat the game, but unfortunately the state space was still too big. There are many possible orders⁴ in which you can complete the eight Zelda dungeons, but we deliberately linearized them and only explored one order. By providing hints about how to finish the game we made the state space smaller and much easier to explore, but in theory we could have missed a bug that only happens when you do dungeon 3 before dungeon 2. A good reminder that you can always make your testing stronger!

Below is a representation of one of the many paths that our platform took through Zelda:

Zelda world

`IT'S A SECRET TO EVERYBODY.`

Although we didn’t set out to find bugs while playing Zelda, the high level of randomness generated by our platform meant we found them anyway! We discovered several glitches that are often exploited by speedrunners, such as block clipping⁵ and khananakey.⁶ We also saw Antithesis clip into the HUD, which is not useful for speedrunning but is pretty funny to watch:

Antithesis was able to find these bugs because the SDK allows us to specify and accomplish high-level goals without ever specifying the low-level implementation. Not to sound grandiose, but one way to think about these new SDK features is as a new programming paradigm; you start with top-level goals, and expect the computer to just figure it out, and you only recurse into sub-goals when it isn’t able to.

This paradigm would obviously save time anywhere, but when writing tests it has two additional benefits. The fact that you’re leaving the low-level details of how to do the thing you asked for unspecified simultaneously makes your tests more powerful and less fragile. They’re more powerful, because rather than just testing one way to achieve the goal you set out, you’re testing many. They’re less fragile, because a low-level detailed set of steps is likely to break whenever your software changes, but a “please figure it out” won’t, so long as we can still figure it out.

`BUY SOMETHIN' WILL YA!`

NES games from the 1980s are very different from the distributed systems that are our bread and butter at Antithesis – but the tools we built to solve one set of problems have often proved useful for the other as well. If you’ve got a bug that seems “Nintendo hard” to track down, or want to talk about how we could explore the state space of your software, contact us.

Oh, and if you were waiting for the full 3-hour video of Antithesis playing Zelda from start to finish, here it is (sped up 8x), you sickos:

`TECHNICAL ADDENDUM (OPTIONAL)`

There are a number of subtle issues that I glossed over in the body of the post, in the interest of making it readable. But you, the true elite, are reading this Technical Addendum, where we will explore some of the nitty-gritty details involved in actually beating Zelda.

First of all, let’s talk about Cartesian products and state explosion. Suppose you have two variables, x and y, which can each take on any integer value from 0 to 9. Superficially, it might seem like SOMETIMES_EACH(x); SOMETIMES_EACH(y) and SOMETIMES_EACH({x,y}) are doing the same thing, but they are actually very different. The former is saying: “please find (and then explore from) each possible value of x, and each possible value of y.” The latter is saying: “please find (and then explore from) each possible combination of values of x and y.”

Is the latter just purely better? Not quite. It’s more sensitive, in the sense that if you need a particular combination of x and y to make progress, the latter expression is guaranteed to spend some time exploring in that situation. This is why we used this approach in the discussion above about exploring rooms with particular items in our inventory. We wanted the combination of “in room 123” AND “have bombs in inventory” to sometimes be explored.

The trouble is that there are now 100 different possible distinguished states (the 100 ordered pairs {x,y}) that our platform will invest energy into searching. As the number of such distinguished states rises, the total search energy is being divided over a larger and larger number of starting states, meaning we do less searching from each one. This isn’t a real problem for our bombs example, because we’re just doubling the state space. It isn’t even a huge problem for SOMETIMES_EACH({x,y}). But throwing more and more fields into the same call to SOMETIMES_EACH will increase the space multiplicatively each time.

To avoid making this a giant footgun, our platform bounds the amount of energy it invests in any given SOMETIMES_EACH assertion. There is nothing you can do with any SOMETIMES_EACH assertion to ruin the overall effectiveness of your Antithesis testing. But it’s totally possible to ruin the assistance provided by a given assertion! In the future, we will be providing more tools for diagnosing and alerting you about this situation. We also have some exciting research in the pipeline that we think will enable us to reduce or even eliminate this tradeoff.

Another subtlety is that we wanted to beat Zelda without heavily or intrusively instrumenting it, but Zelda clears many relevant counters every time you enter and leave a room. This includes things like whether the "Puzzle solved” theme. 🔊⁷ has played, and the counters that track enemy hit points. So we built an EVER_SINCE() combinator that let us write temporal assertions about whether something happened before or after another thing. This one is also coming to the SDK (it’s a lot more useful in distributed systems than it is in Nintendo games), but it’s not coming right away, so consider this a super-extra preview.

Lastly, it’s true that just maximizing damage done in each room is enough to beat Zelda, but progress could be painfully slow. When Link is at full health, swinging his sword⁸ produces “sword beams”, which deal damage at a distance to any enemy that gets in their way. So we wanted to incentivize exploring from states where Link is at full health. But there’s a tradeoff: if we required Link to be at full health to explore from a state, then exploration would be much slower, because we’d be throwing away all the states where we made real progress but lost a bit of health. So we split the baby: we said SOMETIMES Link was at full health, and our platform immediately started doing a judicious amount of its exploration from those states. This produced a good balance, with good combat abilities and freedom to explore.

Wow, you made it to the end of this section. Do you want to work here?

You made it to the end! Grab some stickers

Place them anywhere and watch the compliments compile.

Get free stickers