- Tests that your software does the right thing,
- Throughout all of your code base, and
- All of this under a wide variety of conditions.
Why you should use reliability properties
Even if you have not given much specific thought to reliability properties, all software bakes in quite a few assumptions about reliability. Reliability properties help you reason about the properties your system already has — some of which you may have unintentionally chosen. Suppose you make an internet-connected lock for people’s homes. Customers surely want to know that if they successfully send a request to lock the doors, then the doors will be locked. This might sound trivial, but this is a consistency property! Suppose that, while using your app to lock their door, a user accidentally hits the “unlock” button before quickly correcting the mistake by hitting the “lock” button. Now these two requests are racing and, very rarely, the unlock request will get to your service after the request to lock. The door is now unlocked because you implemented (a totally reasonable) last-write-wins policy. Reliability properties are relevant for many kinds of software — not just formally specified distributed algorithms. Reliability properties fall into two broad categories: functional properties, or properties which ensure that functions behave identically in all circumstances (including in the presence of faults), and concurrency properties, or properties which ensure that concurrent invocations of functions will be handled properly.Functional properties
Functional properties ensure that a single invocation of a function (or method, etc.) will return the correct value regardless of the circumstances in which it is invoked. In particular, you would like to know that your functions still behave correctly even while the system is restoring itself following a crash, or during a leader election, and so on. Functional properties assert that a given function always has a certain behavior, no matter what else is happening; you can think of them by looking through your existing unit tests and documentation. You likely are familiar with unit tests. Commonly they set up some initial state, pass a single function one particular set of parameters, and assert that the functional output equals the theoretically correct output. We might write this as:Invariants and how to find them
Invariants are properties about your data’s logical consistency. These properties are often documented in your code comments, commit messages, and existing assertions. Good workloads will repeatedly check and re-check these invariants. The best way to accomplish this is with the assertion functionality in the Antithesis SDK, but the native assertions in your programming framework or language runtime will work too. The ideal invariant can be checked often, and will alert on error right away. The ideal check is also cheap to do — and not destructive to a running system. However Antithesis is uniquely capable of working with invariants that are expensive to validate or that are destructive to the test setup, though they may require additional customization.Determinism and purity
Determinism and purity properties are properties which assert that a function’s behavior depends only on its parameters. These properties cannot be written as functional properties because they rely upon comparing all possible states of the program, rather than looking only at a single state. You cannot assert these properties “locally”, but they can be written as properties in Antithesis. In Antithesis you can assert things like: “every time this function ran with these inputs, no matter what the situation was, it always gave the same output.”Differential properties
You can write a determinsm or purity assertion by comparing two different implementations of a function — for instance a tricky and optimized one made for your production system, and a test implementation that’s designed to be straightforward to reason about. Any time you have two different implementations that ought to have theoretically identical output, you can assert that the first implementation should always be equal (along the relevant axis) to the second implementation.Purity properties
Even if you don’t have a second implementation of your function… you can still compare it to itself! A functional point test asserts that some particular parameters give some specific output. No need to hard-code either the input or the output; we are only concerned that the output remains consistent. Be sure to pair functional unit tests with purity properties, however! You need a few unit tests to reassure yourself that it is behaving consistently correctly rather than consistently wrongly.Concurrency properties
Concurrency properties assure correct behavior when a function is called simultaneously by multiple clients. Almost all modern software which serves multiple customers over a network needs to be concerned about this behavior.ACID
ACID properties originate from reasoning about databases and their internal consistency, but can be generalized far beyond this context. Most APIs and other interfaces of modern systems have things which are akin to database transactions, in that they are trying to do something, do it all-or-nothing, with proper concurrency control, and record that they have done it. There are four ACID properties; a single program may have all four. We will illustrate with a generic API call.Atomic
Atomicity asserts that an API call will either be totally successful or totally failed, and will not leave your system in some in-between state.Consistent
Consistency asserts that an API call will not leave data in an inconsistent state. ACID consistency refers to logical consistency or data integrity, as opposed to CAP consistency described below.Isolated
Isolation asserts that if an API is called at the same time as another API call, the two calls will not interfere with one another. The most effective way to test this is to make sure your workload has components that can check the results of a specific API call, and not just global properties.Sometimes people refer to the property of being serializable. This is the strongest form of isolation. It asserts that, when an API is called multiple times simultaneously, it will be as if all transactions were done in some order one-at-a-time. It requires having a strong form of CAP consistency, see below.Not all APIs are serializable. Your API or service might offer a weaker form of isolation even if it isn’t designed to be serializable.
Durable
Durability asserts that a successful API call will never be rolled back. A common way to test durability is for the workload to have some notion of forward progress such as a counter that ticks upwards and never downwards. This way it can check that its work is not undone under fault conditions.CAP
CAP properties originate from reasoning about distributed systems. There are three CAP properties; a single program, at any one moment, may have, at most, two. (Since all distributed systems must have partition tolerance (P), your remaining choice is to choose whether you wish to also have, during a partition, consistency (C) or availability (A). So you may choose one of the following two properties.)Consistency
Consistency asserts that all nodes have data that is consistent with one another. As nodes might update data separately from one another, they need to eventually form a view of timelines that is consistent. A strongly consistent system will, no matter what node is contacted, always return the same result.- Linearizability (atomic consistency): Every operation takes place in some ordering (even one that is only theoretical) that is consistent with the real-time ordering of operations.
- Causal consistency: An operation that saw effects of, or depended on, another operation is sequenced after that operation.
- Eventual consistency: With some resolution function, all operations are eventaully applied (in some order) for all the nodes to see.
Availability
Availability asserts that servers will respond to calls if they are “alive” and reachable from a client.- Total availability: calls to a live server always return, regardless of whether that server can communicate with others.
- Eventual availability: a system should not be made permanently broken by transient faults. This is easy to mess up and very important to get right, since all distributed systems will be exposed to faults.
The CAP theorem states that a distributed system must pick either consistency or availability in the face of a partition. It does not follow that you are guaranteed to have one or the other! Many people have designed systems with relaxed consistency models, but not all such systems are totally available.