Reliability Properties#

All distributed systems need to be concerned with reliability—and nearly all modern software is (to some extent) distributed. But what does it mean to be reliable? Simply put, reliable software behaves as expected under all circumstances. A test suite tests reliability if it:

Tests that your software does the right thing,
Throughout all of your code base, and
All of this under a wide variety of conditions.

Antithesis helps you test your software for reliability using test properties for defining appropriate behavior, a workload for exercising all of your software, and fault injection to apply a wide variety of conditions.

How can you decide, for your software, what is the right behavior? This document lists some desirable properties in the form of a menu. You should consider which properties your system ought to have and then figure out whether you are testing for them. In particular, you should ask if your Antithesis workload, fault injection, and test properties jointly test for the desired property. Sometimes it can be hard to tell whether a given workload would actually alert you to a violation of the properties you care about. A brown M&M is often a good way to tell for sure.

Your software probably has many obligations to fulfill. That could be a lot of tests, right? Often a single test can validate many properties so there need not be a one-to-one correspondence between your test code and these abstract properties or invariants.

Why you should use reliability properties#

Even if you have not given much specific thought to reliability properties, all software bakes in quite a few assumptions about reliability. Reliability properties help you reason about the properties your system already has—some of which you may have unintentionally chosen.

Suppose you make an internet-connected lock for people’s homes. Customers surely want to know that if they successfully send a request to lock the doors, then the doors will be locked. This might sound trivial, but this is a consistency property! Suppose that, while using your app to lock their door, a user accidentally hits the “unlock” button before quickly correcting the mistake by hitting the “lock” button. Now these two requests are racing and, very rarely, the unlock request will get to your service after the request to lock. The door is now unlocked because you implemented (a totally reasonable) last-write-wins policy. Reliability properties are relevant for many kinds of software—not just formally specified distributed algorithms.

Reliability properties fall into two broad categories: functional properties, or properties which ensure that functions behave identically in all circumstances (including in the presence of faults), and concurrency properties, or properties which ensure that concurrent invocations of functions will be handled properly.

Functional properties#

Functional properties ensure that a single invocation of a function (or method, etc.) will return the correct value regardless of the circumstances in which it is invoked. In particular, you would like to know that your functions still behave correctly even while the system is restoring itself following a crash, or during a leader election, and so on.

Functional properties assert that a given function always has a certain behavior, no matter what else is happening; you can think of them by looking through your existing unit tests and documentation.

You likely are familiar with unit tests. Commonly they set up some initial state, pass a single function one particular set of parameters, and assert that the functional output equals the theoretically correct output. We might write this as:

[setup]
assert f(some input) == theoretically correct output
[teardown]

We can think of this as being a property, namely that the function should always give the correct output. The traditional way to test this would be by testing many individual inputs with many possible setups.

You can instead make this more general by writing a slightly broader property:

always f(some input) == theoretically correct output

The unit test has now been converted to an always property. You can now rely on Antithesis to create many possible setups for testing the assertion.

Here are some other ways to think about functional properties:

Invariants and how to find them#

Invariants are properties about your data’s logical consistency. These properties are often documented in your code comments, commit messages, and existing assertions.

Good workloads will repeatedly check and re-check these invariants. The best way to accomplish this is with the assertion functionality in the Antithesis SDK, but the native assertions in your programming framework or language runtime will work too. The ideal invariant can be checked often, and will alert on error right away. The ideal check is also cheap to do—and not destructive to a running system. However Antithesis is uniquely capable of working with invariants that are expensive to validate or that are destructive to the test setup, though they may require additional customization.

Determinism and purity#

Determinism and purity properties are properties which assert that a function’s behavior depends only on its parameters. These properties cannot be written as functional properties because they rely upon comparing all possible states of the program, rather than looking only at a single state. You cannot assert these properties “locally”, but they can be written as properties in Antithesis. In Antithesis you can assert things like: “every time this function ran with these inputs, no matter what the situation was, it always gave the same output.”

Differential properties#

You can write a determinsm or purity assertion by comparing two different implementations of a function – for instance a tricky and optimized one made for your production system, and a test implementation that’s designed to be straightforward to reason about. Any time you have two different implementations that ought to have theoretically identical output, you can assert that the first implementation should always be equal (along the relevant axis) to the second implementation.

Purity properties#

Even if you don’t have a second implementation of your function… you can still compare it to itself! A functional point test asserts that some particular parameters give some specific output. No need to hard-code either the input or the output; we are only concerned that the output remains consistent. Be sure to pair functional unit tests with purity properties, however! You need a few unit tests to reassure yourself that it is behaving consistently correctly rather than consistently wrongly.

Concurrency properties#

Concurrency properties assure correct behavior when a function is called simultaneously by multiple clients. Almost all modern software which serves multiple customers over a network needs to be concerned about this behavior.

ACID#

ACID properties originate from reasoning about databases and their internal consistency, but can be generalized far beyond this context. Most APIs and other interfaces of modern systems have things which are akin to database transactions, in that they are trying to do something, do it all-or-nothing, with proper concurrency control, and record that they have done it.

There are four ACID properties; a single program may have all four. We will illustrate with a generic API call.

Atomic#

Atomicity asserts that an API call will either be totally successful or totally failed, and will not leave your system in some in-between state.

Consistent#

Consistency asserts that an API call will not leave data in an inconsistent state. ACID consistency refers to logical consistency or data integrity, as opposed to CAP consistency described below.

Isolated#

Isolation asserts that if an API is called at the same time as another API call, the two calls will not interfere with one another. The most effective way to test this is to make sure your workload has components that can check the results of a specific API call, and not just global properties.

Note

Sometimes people refer to the property of being serializable. This is the strongest form of isolation. It asserts that, when an API is called multiple times simultaneously, it will be as if all transactions were done in some order one-at-a-time. It requires having a strong form of CAP consistency, see below.

Not all APIs are serializable. Your API or service might offer a weaker form of isolation even if it isn’t designed to be serializable.

Durable#

Durability asserts that a successful API call will never be rolled back. A common way to test durability is for the workload to have some notion of forward progress such as a counter that ticks upwards and never downwards. This way it can check that its work is not undone under fault conditions.

CAP#

CAP properties originate from reasoning about distributed systems.

There are three CAP properties; a single program, at any one moment, may have, at most, two. (Since all distributed systems must have partition tolerance (P), your remaining choice is to choose whether you wish to also have, during a partition, consistency (C) or availability (A). So you may choose one of the following two properties.)

Consistency#

Consistency asserts that all nodes have data that is consistent with one another. As nodes might update data separately from one another, they need to eventually form a view of timelines that is consistent. A strongly consistent system will, no matter what node is contacted, always return the same result.

Linearizability (atomic consistency): Every operation takes place in some ordering (even one that is only theoretical) that is consistent with the real-time ordering of operations.
Causal consistency: An operation that saw effects of, or depended on, another operation is sequenced after that operation.
Eventual consistency: With some resolution function, all operations are eventaully applied (in some order) for all the nodes to see.

Availability#

Availability asserts that servers will respond to calls if they are “alive” and reachable from a client.

Total availability: calls to a live server always return, regardless of whether that server can communicate with others.
Eventual availability: a system should not be made permanently broken by transient faults. This is easy to mess up and very important to get right, since all distributed systems will be exposed to faults.

Note

The CAP theorem states that a distributed system must pick either consistency or availability in the face of a partition. It does not follow that you are guaranteed to have one or the other! Many people have designed systems with relaxed consistency models, but not all such systems are totally available.

Fault model#

Another dimension of reliability properties is what fault model they are expected to hold under. For example, your system might be durable in the presence of normal network errors, but not designed to preserve all data if three or more nodes go down simultaneously. Or your system might be eventually available, but only if a quorum of servers is alive after the faults cease.

Even if your system doesn’t have a formally-specified fault model, you should think about the promises you have made to customers, and what situations you expect your software to be able to handle without problems. You can then look the sorts of fault injection Antithesis performs, and configure your software to test the properties you want, in the situations where they should hold.