Michael Victor Zink
Software Engineer, Readyset

Catching a caching bug at Readyset

Readyset is a database scaling platform that sits in front of your MySQL or Postgres database, speeding up queries with zero code changes by caching only the results you need. Their new product, QueryPilot, automatically detects high-impact queries to cache.

This blog post is based on a recorded conversation with Michael Victor Zink, Software Engineer at Readyset.

As the classic line goes, “There are two hard problems in computer science: cache invalidation and naming things.”[1] At Readyset, caching is at the core of what we do, so we think about cache invalidation a lot.

So there’s some irony in the fact that Antithesis found a cache invalidation bug in our system. Not in the core product, but in our new product, QueryPilot, which we wanted to get under test in Antithesis right from the beginning. We wanted to ship something we knew would work, and we also knew that catching issues early in development would actually help us ship faster.

QueryPilot is simpler than our core product, but even so, it was easy to fall into a cache invalidation trap. Our existing test suite and internal manual testing didn’t find the bug. We could have picked it up if we’d written exactly the right tests to force that behavior — but of course, we didn’t know to write those tests until after we found the bug in Antithesis.

The problem was that we had an implicit mental model of how our software would be used. Previously, engineers using Readyset would search for high-impact queries when they were doing performance optimization. Our tests mostly reflected this pattern of use. With QueryPilot we were moving from manual queries to automated use, which broke our assumptions. Antithesis, which doesn’t share those assumptions, found the bug naturally in the course of a test run.

What QueryPilot does

To explain the bug, I should first say a bit about QueryPilot. It’s a proxy that sits between your application and database:

[Figure: QueryPilot architecture. Architecture diagram from the Readyset blog.]

QueryPilot automates the process of query caching. It looks at all the queries going to your upstream database, and runs our EXPLAIN CREATE CACHE command as a “dry run” to see if each query is supported for caching by Readyset. QueryPilot then picks high-load, high-impact queries to send to the caching sidecar, and forwards other queries directly to the database. The caching sidecar receives a replication stream from the database so that it stays up to date.[2]
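The routing decision described above can be sketched in a few lines of Python. This is a minimal illustration, not QueryPilot’s implementation: the post does not say how “high load, high impact” is measured, so the `total_time_ms` metric, the `min_total_time_ms` threshold, and the `dry_run_supported` helper (a stand-in for running EXPLAIN CREATE CACHE against the sidecar) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueryStats:
    sql: str
    total_time_ms: float  # hypothetical load metric, not from the post


def dry_run_supported(sql, supported_queries):
    # Stand-in for EXPLAIN CREATE CACHE: the real check is done by the
    # Readyset sidecar; here it is just a set lookup (assumption).
    return sql in supported_queries


def choose_route(stats, supported_queries, min_total_time_ms=1000.0):
    """Return "sidecar" for heavy, cacheable queries, else "database"."""
    heavy = stats.total_time_ms >= min_total_time_ms
    if heavy and dry_run_supported(stats.sql, supported_queries):
        return "sidecar"
    return "database"


supported = {"SELECT * FROM example"}
hot = QueryStats("SELECT * FROM example", total_time_ms=5000.0)
cold = QueryStats("SELECT * FROM example", total_time_ms=10.0)
odd = QueryStats("SELECT weird_fn(x) FROM example", total_time_ms=5000.0)
routes = [choose_route(q, supported) for q in (hot, cold, odd)]
```

Here `routes` ends up as `["sidecar", "database", "database"]`: only the query that is both heavy and supported goes to the caching sidecar.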

QueryPilot also has its own cache:[3] a metadata cache that records the results of the EXPLAIN commands. This is where the bug comes in…

The bug

As a starting state, let’s say that we have a table called example in the primary database. QueryPilot runs EXPLAIN CREATE CACHE on a query like SELECT * FROM example to check if it’s cacheable. It sends the query to the Readyset caching sidecar, which replies that yes, this query is supported for caching. So far, so good.

However, there’s a short period of time before the “query supported” result gets stored in QueryPilot’s metadata cache. Now imagine that the application deletes the example table during this time with a DROP TABLE example command.

This deletion will get replicated to the caching sidecar, which will remove the example table. However, QueryPilot doesn’t know about this, and happily stores “query supported” in its metadata cache.

If you now try to cache the SELECT * FROM example query for real with CREATE CACHE, you’ll get an error saying that the example table is not found. But the QueryPilot metadata cache is never invalidated, and stays permanently stale.

If QueryPilot thinks this is a very impactful query to cache, it will try to cache it over and over and fail every time, so you’ll see this error constantly.
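The sequence can be replayed deterministically with a toy model. The sketch below is an illustration only: the `Sidecar` and `QueryPilot` classes and their methods are hypothetical stand-ins, not Readyset’s real code or API. The `invalidate_on_error` flag shows one possible mitigation, dropping the cached answer when CREATE CACHE fails; the post does not describe the actual fix, so treat that part as an assumption.

```python
class Sidecar:
    """Toy stand-in for the caching sidecar (not its real API)."""

    def __init__(self, tables):
        self.tables = set(tables)

    def explain_create_cache(self, table):
        # Dry run: is a query on this table supported right now?
        return table in self.tables

    def create_cache(self, table):
        if table not in self.tables:
            raise LookupError(f"table {table!r} not found")

    def drop_table(self, table):
        # Models the DROP TABLE arriving via the replication stream.
        self.tables.discard(table)


class QueryPilot:
    """Toy stand-in holding the metadata cache of dry-run results."""

    def __init__(self, sidecar, invalidate_on_error=False):
        self.sidecar = sidecar
        self.invalidate_on_error = invalidate_on_error  # hypothetical fix
        self.metadata_cache = {}

    def try_cache(self, table):
        if not self.metadata_cache.get(table):
            return "skipped"
        try:
            self.sidecar.create_cache(table)
        except LookupError:
            if self.invalidate_on_error:
                self.metadata_cache.pop(table, None)
            return "error"
        return "cached"


def replay_race(invalidate_on_error=False):
    sidecar = Sidecar({"example"})
    pilot = QueryPilot(sidecar, invalidate_on_error)
    supported = sidecar.explain_create_cache("example")  # 1. dry run says yes
    sidecar.drop_table("example")                        # 2. table dropped
    pilot.metadata_cache["example"] = supported          # 3. stale result stored
    attempts = [pilot.try_cache("example") for _ in range(3)]
    return attempts, pilot.metadata_cache
```

Calling `replay_race()` returns three `"error"` attempts and a cache that still maps `"example"` to `True`, which is the endless retry loop described above. With `replay_race(invalidate_on_error=True)` the first attempt still fails, but the entry is removed and the later attempts are `"skipped"`.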

If you’d rather have your bug stories in diagram form, here’s a sequence diagram showing the order of events:

[Figure: sequence diagram showing the order of events that caused the bug.]

How Antithesis caught it

To explain how Antithesis helped us find this bug, it helps to understand why it slipped past us in the first place. The EXPLAIN CREATE CACHE command was originally created for humans to run, to see if a specific query they wanted to cache was supported. So it ran relatively rarely.

Deleting tables, or running other DDL[4] commands that could trigger the bug, is also pretty rare – the database schema is normally in a steady state. The chances of both running in the same tiny window are low.

The point of QueryPilot, on the other hand, is to automatically find queries to cache. So it was running these EXPLAIN CREATE CACHE statements over and over again. This meant we were much more likely to be running one while some DDL was running, like DROP TABLE example, triggering the race condition.
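A rough back-of-the-envelope calculation shows how much the odds shift. The numbers below are purely illustrative assumptions, not measurements from Readyset: a 5 ms window between the dry run and the metadata cache write, one DDL statement landing at a uniformly random moment in the day, and dry-run windows that do not overlap each other.

```python
SECONDS_PER_DAY = 86_400.0


def overlap_probability(explains_per_day, window_s):
    """Chance that one randomly timed DDL lands inside any dry-run window.

    Simple model (assumption): windows never overlap, so the vulnerable
    fraction of the day is just count * window length, capped at 1.
    """
    return min(1.0, explains_per_day * window_s / SECONDS_PER_DAY)


WINDOW_S = 0.005  # hypothetical 5 ms gap before the result is stored

# A human running EXPLAIN CREATE CACHE once a day.
p_manual = overlap_probability(1, WINDOW_S)

# An automated tool issuing 10 dry runs per second, all day.
p_automated = overlap_probability(10 * SECONDS_PER_DAY, WINDOW_S)

ratio = p_automated / p_manual
```

Under these assumptions `p_manual` is roughly 6 in 100 million per DDL statement, while `p_automated` is about 5%. The window is the same size in both cases; there are simply 864,000 times as many of them.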

We didn’t think of this specific scenario when we tested QueryPilot in Antithesis. But we were looking to test what happens in general if we throw an arbitrary workload at QueryPilot that isn’t tailored for Readyset at all, with a mix of supported and unsupported queries – all kinds of random stuff. To do this we used SQLancer, which generates these sorts of arbitrary queries off the shelf.

We ran this workload in Antithesis and it found the bug easily. Antithesis simulates a “multiverse” of different paths through your software. Each branch tries different combinations of operations in different orders. In our case, Antithesis ran lots of different SQLancer queries along with lots of QueryPilot dry-run and caching attempts. This quickly found the right interleaving of events that triggered the bug – an EXPLAIN CREATE CACHE query running while a DDL command is in flight.
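To see why trying operations in different orders matters, here is a toy sketch that enumerates every ordering of the three events in this story and flags the ones where QueryPilot writes an answer that is already wrong at the moment it is stored. It is only an illustration of the idea of exploring interleavings, and makes no claim about how Antithesis works internally; the step names and the `writes_stale_result` check are hypothetical.

```python
def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def writes_stale_result(order):
    """True if the dry-run answer is out of date when it gets stored."""
    table_exists = True
    pending = None  # dry-run result not yet written to the metadata cache
    for step in order:
        if step == "explain":
            pending = table_exists
        elif step == "drop":
            table_exists = False
        elif step == "store" and pending != table_exists:
            return True
    return False


PILOT_STEPS = ["explain", "store"]  # QueryPilot: dry run, then cache write
APP_STEPS = ["drop"]                # application: DROP TABLE example

all_orders = list(interleavings(PILOT_STEPS, APP_STEPS))
bad_orders = [o for o in all_orders if writes_stale_result(o)]
```

Of the three possible orderings, only `["explain", "drop", "store"]` is flagged. With three events that is easy to spot by hand; with thousands of generated queries and DDL statements in flight, the space of orderings is far too large to reason about without tooling.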

Once we’d hit the bad state, we used the multiverse debugger to understand the sequence of events in that branch. We ran interactive debugging sessions just before and after the bug started, changed the log level, and ran SQL commands against Readyset to see what was going on. This showed us that the cached result of the EXPLAIN query was causing QueryPilot to retry continuously.


After getting into the technical details of this bug, it’s interesting to zoom out and think about it from the developer psychology side. We were trapped by our implicit models of how our software should, or would, be used. The shift from manual to automated queries changes background assumptions in ways that are hard to notice from the inside.

It would be nice if we could anticipate the new problems we’ll run into just by thinking harder, but in practice it works a lot better to throw everything at the database and see what happens. Antithesis doesn’t care about your mental models, which is what lets you find unknown unknowns like this one. So it isn’t just about the time you spend writing tests, or even the time you’d need to spend thinking about them. It’s also about peace of mind: do you want to launch while being pretty sure that there are gaps in your testing?