How Antithesis lets Clément Salaün of Formance sleep soundly at night
Formance is an EU-based provider of open-source financial infrastructure, enabling platforms and fintech companies to build modern financial applications across payments, banking, lending, investing, and insurance. Formance’s agnostic, scalable, and extensible infrastructure allows businesses to build and operate sophisticated fund flows with the financial service providers of their choice, ensuring flexibility, ownership, and faster time-to-market.
This blog post is based on a recorded conversation with Clément Salaün, Co-Founder & CTO of Formance, transcribed and edited for clarity.
At Formance, we have five core services: Ledger, Connectivity, Flows, Wallets, and Reconciliation. Obviously, as a fintech infrastructure company, one of the most critical components is the core ledger model we have. It’s where accounts and balances live and transactions can be tracked. For example, if you’re building a new banking application, a user’s USD balance could be represented in a ledger deployment.
Most of the time, you’re going to be using a sophisticated chart of accounts with many accounts. Each account holds a balance, while transactions record changes in that balance, representing the movement of value between them. Essentially, a transaction object acts as a container for these updates, ensuring accurate tracking of financial activity.
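To make that model concrete, here is a minimal sketch in Python. It is an illustration only, not Formance's implementation: the names `Posting`, `Transaction`, `Ledger`, the `world` source account, and the account naming scheme are all assumptions made for the example. Each transaction is a container of postings, and committing it updates the balances of the accounts involved.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    """One movement of value between two accounts (hypothetical shape)."""
    source: str       # account the value leaves
    destination: str  # account the value enters
    amount: int       # minor units, e.g. cents
    asset: str        # e.g. "USD"


@dataclass(frozen=True)
class Transaction:
    """A container for one or more balance updates."""
    postings: tuple


class Ledger:
    def __init__(self):
        self.balances = defaultdict(int)  # keyed by (account, asset)
        self.log = []                     # committed transactions, in order

    def commit(self, tx):
        # Every posting debits one account and credits another,
        # so the sum of all balances per asset stays at zero.
        for p in tx.postings:
            self.balances[(p.source, p.asset)] -= p.amount
            self.balances[(p.destination, p.asset)] += p.amount
        self.log.append(tx)


ledger = Ledger()
# Fund Alice from an assumed external "world" account, then pay Bob.
ledger.commit(Transaction((Posting("world", "users:alice", 1000, "USD"),)))
ledger.commit(Transaction((Posting("users:alice", "users:bob", 250, "USD"),)))
```

After these two commits, Alice holds 750 and Bob holds 250 (in cents), and the log explains exactly how each balance got there.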
Our users need an easy way to understand how changes to an account are explained in the ledger, even if what’s happening in the background is much more complex. So, we originally opted for a simple approach: transaction identifiers as integers that increase monotonically, with no gaps in the sequence.
We put a lot of effort into the design and implementation of this because it’s such an important function in the system. The microservice responsible for the ledger handles generating an ID when a transaction is committed, and we wrote a bunch of unit and integration tests around it. Because this function had such a small surface area, we couldn’t imagine a scenario where it would fail, so we had full confidence in the system we built.
…And then one day we saw on one of our cloud environments that a customer had a gap in their transaction ID sequence. It went: 17, 18, 19, 20, and then jumped straight to 24. So, it looked like we had missed three transactions. This kind of problem is really not great to see, because even if you find an explanation for why it’s happening, it’s proof that despite your confidence, your system is actually not functional. It throws everything into uncertainty regarding your whole testing strategy.
On top of that, it was impossible to reproduce with our test suite – no matter how hard we tried, we failed to produce another environment with a gap in the transaction ID sequence, so we were faced with this single isolated case from a customer environment. Nothing similar in any other data set, and no way to replicate.
After passing the investigation back and forth between three different engineers for over a week, we finally traced the issue back to a specific way this customer used a dry run parameter. This parameter allowed users to preview transactions without committing them. On its own, the parameter wasn’t sufficient, but the way they combined it with their actual transaction commit flow caused the problem to surface.
But just because you know of one way to produce a bug, doesn’t mean it’s the only way. So now you’re stuck in a situation where you think you’ve fixed the problem, but you’re not sure. You know you’ve tackled one problem out of an uncertain number of other problems, and you still have unknown unknowns.
So for the sake of our sanity, we had to reproduce it.
Now, at this point, we were technically using Antithesis, but we hadn’t covered this part of our code base with it, which is how the bug got by, undetected by our tests. Thankfully, we changed that after we found the bug, and set up a shim specifically targeting the transaction ID sequence and trying to identify gaps in the sequence. We added a bunch of assertions to check whether the ID sequence generated as part of a test run was consistent, and that’s when the fun thing happened:
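The kind of check involved can be sketched in a few lines of Python. This is a hedged illustration, not the actual shim and not the Antithesis SDK: `find_gaps` and `assert_gapless` are hypothetical helper names, and the sketch assumes the test can read back the committed transaction IDs in commit order.

```python
def find_gaps(ids):
    """Return (previous, current) pairs where the next ID is not previous + 1."""
    gaps = []
    for prev, cur in zip(ids, ids[1:]):
        if cur != prev + 1:
            gaps.append((prev, cur))
    return gaps


def assert_gapless(ids):
    """Fail loudly if the committed IDs are not a strict +1 sequence."""
    gaps = find_gaps(ids)
    assert not gaps, f"transaction ID sequence broken at {gaps}"
```

Run against the sequence from the customer environment, `find_gaps([17, 18, 19, 20, 24])` reports the single break between 20 and 24. The value of a check like this comes from running it continuously while the system is exercised under faults, rather than against one hand-written scenario.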
So after hunting down and fixing this bug – which took 3 people more than a week – we tested again with Antithesis, and we found the bug again on the first run. And this time, Antithesis found the bug triggered due to a totally unrelated part of the code base.
Another component was doing some batching at the storage level when we were committing transactions, and on some pretty rare occasions the batching queue would die. Of course, it was designed to withstand dying, but the main process would fail to notice that the queue had died and would keep handing out IDs without resetting the sequence to account for the failure. Same effect, totally different cause, even more chaos.
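A toy model shows how this class of failure produces a gap. Everything here is an assumption made for illustration: the `Allocator` and `BatchQueue` classes, the in-memory `store`, and the idea that a crash drops pending writes are stand-ins, not a description of Formance's actual internals.

```python
class Allocator:
    """Hands out transaction IDs from an in-memory counter (hypothetical)."""

    def __init__(self):
        self.next_id = 1

    def allocate(self):
        tx_id = self.next_id
        self.next_id += 1
        return tx_id

    def reset_to(self, last_persisted):
        # Resume right after the last ID that actually reached storage.
        self.next_id = last_persisted + 1


class BatchQueue:
    """Buffers writes and persists them in batches (hypothetical)."""

    def __init__(self, store):
        self.store = store
        self.pending = []

    def enqueue(self, tx_id):
        self.pending.append(tx_id)

    def flush(self):
        self.store.extend(self.pending)
        self.pending.clear()

    def crash(self):
        # Assumed failure mode: writes that were not yet flushed are lost.
        self.pending.clear()


def run(reset_on_crash):
    store = []
    alloc = Allocator()
    queue = BatchQueue(store)

    for _ in range(3):
        queue.enqueue(alloc.allocate())
    queue.flush()                        # IDs 1, 2, 3 are persisted

    for _ in range(2):
        queue.enqueue(alloc.allocate())  # IDs 4, 5 are only pending
    queue.crash()                        # the queue dies before flushing

    if reset_on_crash:
        alloc.reset_to(store[-1] if store else 0)

    for _ in range(2):
        queue.enqueue(alloc.allocate())
    queue.flush()
    return store
```

In this model, `run(reset_on_crash=False)` persists `[1, 2, 3, 6, 7]`, because the allocator never learns that IDs 4 and 5 were lost, while `run(reset_on_crash=True)` persists `[1, 2, 3, 4, 5]`. The bug is not in either component on its own but in the missing signal between them, which is why tests scoped to the ID generator could not surface it.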
I mean, this is the kind of bug that keeps you up at night, wondering if the castle you’ve built may actually be a house of cards. In my work, I’m always trying to create an atmosphere of certainty with proof points related to why things work, how to protect systems if they fail, etc., just to provide myself with some peace of mind.
This component has a small surface area, so a bug like this is a telltale sign of a bigger design problem somewhere. It’s not like you actually have to start from zero, but it almost feels that way when these kinds of bugs occur.
Ironically, the fix itself was pretty simple; it took just a day to make the code change and redo the testing with Antithesis. It was just really hard to identify in the first place without knowing where to look.
With Antithesis, we know that once we identify a problem, our fix is actually going to solve it. It lets us prove to ourselves that something has been mitigated and will not show up anymore. And this is super valuable because it’s not just the time you spend finding and fixing the problem, it’s the fear and uncertainty that really slow you down. The confidence Antithesis gives us enables us to move so much faster.
Antithesis saves us a ton of time on writing tests, because instead of having to be so explicit and exhaustive in our testing stack, we can rely on it to find those unknown unknowns for us.
So it’s not just about how quickly we find the bugs or how quickly we can debug, though Antithesis makes a significant difference there as well.