A distributed systems reliability glossary
Introduction
This glossary is an overview of the concepts that you’ll need to think about distributed systems reliability. We’re writing chiefly for industry practitioners – software developers who are learning about distributed systems testing at any stage of their careers.
It’s meant as a handy guide, bringing together information that was previously scattered all over the internet – because the concepts here originate in many different disciplines (and naturally everyone’s too shy to talk to people outside their field, us included). To the best of our knowledge, it’s the first resource to do so. At the same time, we hope that simply putting all these ideas together in one place starts to show how they all fit together.
But! And we cannot stress this enough – this is a reference, not required reading!
We’re not saying you need to understand every one of these concepts in order to test a distributed system. Every time you write an integration test, you’re testing a distributed system already! This glossary is here to encourage you to get deeper into a topic that’s increasingly important for every developer committing production code today.
So our goal is to provide intuitive explanations, with pointers to more formal definitions should you need them. We present clear, univalent definitions of terms that are actually messy and contested, like “process”, “repeatable read” or “eventual consistency.” In such cases, we attempt to nod at the diversity of definitions and usages that exists, but our priority is to give a reader something that’s directionally correct and actually useful for a learner.
We’ve also included essential concepts for which there are no formally defined or widely accepted terms in existing literature, like “garbage reads” and “g-nonadjacent.” Maybe the names will stick?
We know it’s incomplete, and if you care about software reliability or distributed systems, we’d love your help!
This glossary is organized as follows:
- Preliminaries: concepts used in defining phenomena and consistency models.
- Consistency models: which define what systems are allowed to do.
- Availability models: which describe different ways systems can be available.
- Phenomena: something a system does which someone, somewhere, thought was a bad idea.
- Faults: fault models to which you might want your system to be resilient.
- Testing techniques: ways to test whether your system actually obeys these models, or experiences these failure modes.
- Further reading: a reading list of key reading lists.
Regardless of whether you’re working on your first distributed system or your fiftieth, we hope this will help you make it more reliable.
Yours, Jepsen & Antithesis
Preliminaries
These concepts are used often in defining phenomena and consistency models.
- Dependency
- In consistency models, a dependency is a relationship between two operations (e.g. transactions). For example, a single process could execute one operation before another: a process dependency. One operation could read data that was written by another: a write-read dependency.
- Definite error
- A definite error is returned by an operation which definitely did not happen. For instance, a transaction abort error is usually a definite error: the state of the system should be as if the transaction never happened at all. By contrast, an indefinite error may mean that the requested operation did or did not happen, or might happen later.
- Indefinite error
- An indefinite error is returned by an operation which may or may not have happened, or might happen later. For instance, a timeout is an indefinite error: the operation may not have been received at all, or it may have taken place without an acknowledgement, or it may be in-flight and execute five minutes later. By contrast, a definite error is known to have not executed.
Distinguishing between definite and indefinite errors is a key challenge in distributed systems design and testing. If one writes a unique value x = 3, receives a definite error, and later reads x = 3, that very likely signals an invariant violation. If the write receives an indefinite error, it is legal to read x = 3 now, at some later time, or never at all. Checkers must account for all possible outcomes.
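For instance, a checker for writes of unique values might reason as follows. This is a minimal sketch in Python, using a made-up record format – not any particular checker’s API:

```python
# A minimal sketch: given the result of a unique-value write, is a
# later read of that value legal?

def read_is_legal(write_result, observed):
    """write_result: "ok", "definite-error", or "indefinite-error".
    observed: True if a later read saw the written value."""
    if write_result == "ok":
        return True          # the write happened; reads may observe it
    if write_result == "definite-error":
        return not observed  # the write never happened; seeing it is a bug
    # Indefinite: the write may or may not have happened (or may happen
    # later), so a checker must accept both outcomes.
    return True

assert not read_is_legal("definite-error", observed=True)   # violation
assert read_is_legal("indefinite-error", observed=True)     # legal
assert read_is_legal("indefinite-error", observed=False)    # also legal
```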
- Object
- In consistency models, a database usually contains a set of distinct objects, also called items. Each object often has a unique identifier, and over time, goes through a series of versions.
- Operation
- Operations are “the things a system does.” Exactly what constitutes an operation depends on the system or formalism.
For instance, the operations performed by a queuing system might be “enqueue” and “dequeue.” A counter system might have “increment” and “read.” A user-registration system might have operations like “register a user,” “log in,” and “change password.” Systems can also have other kinds of operations. For example, a test of a database system might define an operation for “compact old files”, or “add a node to the cluster.” Similarly, faults can be viewed as operations: “kill a process” or “partition the network” are things one can do to a system.
Transaction systems can be interpreted a few ways. Much of the transaction consistency literature uses the term “operation” to refer to (e.g.) an individual read or write of some object. These small operations are then grouped into transactions. On the other hand, it can also be convenient to think of a transaction as a single operation; this makes symmetries between single-object consistency models like Linearizability and multi-object models like Serializability clearer.
- Predicate
- A predicate identifies a set of objects, rather than identifying a single object by primary key. For example “every box containing a cat” is a predicate; so is “all boxes.” A system may offer different levels of safety for operations which access individual items, vs operations which use a predicate.
- Process
- For our purposes, a process (which depending on context, may also be called an agent, actor, node, machine, replica, server, session, etc.) is a logically single-threaded state machine which participates in a distributed system.
We can often interpret a single database client as a process. However, once a client requests an operation which results in an indefinite error, it can perform no further operations – that operation might take place at some later time. Were it to perform another operation, it might execute the two concurrently, and no longer be “logically single-threaded”.
- Process dependency
- A process dependency relates two subsequent operations performed by the same process. If process P performs operation A, then B, we say that B process-depends on A.
It can be convenient to identify a session as a single process, and vice-versa (making session dependencies process dependencies). However, a single process could execute more than one session, and a session could, in some systems, be handed off between two processes – making it useful to distinguish the two types of dependency.
- Real-time dependency
- A real-time dependency relates two operations which were not concurrent in real-time. If operation A ends before operation B begins, B real-time-depends on A. Real-time dependencies are useful for finding violations of real-time consistency models like Strong Serializable.
- Read-write dependency
- A read-write dependency relates two operations A and B, where A reads some version v1 of an object x, and B writes the next version v2 of x. Speaking loosely, we can say A did not see some part of B, and therefore must have executed before it. Read-write dependencies are used to define stronger consistency models like Repeatable Read, Snapshot Isolation, and Serializability.
- Session dependency
- Some systems provide a notion of a session: a sequence of operations performed in total order, usually by a single client or process. A session dependency relates two operations performed within a session: if the session performs operation A, then B, we say that B session-depends on A. Session dependencies can be used to define session consistency models like Strong Session Serializable.
A single process could execute more than one session, and a session could (in some systems) be handed off between two processes. A test harness might choose to view each session as a single process; in this case, session dependencies and process dependencies are equivalent.
- Version
- In consistency models, a version refers to a specific state that some object took on. Formalisms differ, but in broad terms, every write of an object generally produces a new version of that object, and every read generally returns one (or more) versions. Two different versions may have the same value: if one operation sets x = 3, and another also sets x = 3, those are two different versions of x.
- Version order
- A version order is an order over versions of objects in a database, which encodes the sequence of versions each object took on during some execution. In Adya’s formalism, the version order is total over the versions of any single object.
- Write-read dependency
- A write-read dependency relates two operations A and B, where A writes some version v1 of an object x, and B reads v1. Speaking loosely, B observes A’s write. Write-read dependencies help trace the way information flows between operations, and are used to define consistency models like Read Committed.
- Write-write dependency
- A write-write dependency relates two operations A and B, where A writes some version v1 of an object x, and B writes v2: the next version of x in the version order. Speaking informally, B overwrites A’s write. Write-write dependencies help trace the way information flows between operations, and are used to define consistency models like Read Uncommitted.
Consistency models
These properties define what systems are allowed to do. Each consistency model usually proscribes some set of phenomena.
We draw many of our definitions of consistency models and phenomena from papers like Berenson et al’s A Critique of ANSI SQL Isolation Levels, Adya’s 1999 thesis Weak Consistency, and Cerone et al’s A Framework for Transactional Consistency Models with Atomic Visibility. In a broad sense, this line of research is an attempt to fix the incomplete and ambiguous definitions in the ANSI SQL standard. Other models and phenomena are adapted from Fekete et al’s Making Snapshot Isolation Serializable, Cerone & Gotsman’s Analyzing Snapshot Isolation, Bailis et al’s Highly Available Transactions and Scalable Atomic Visibility with RAMP Transactions, and so on.
- Consistency model
- A consistency model is a set of allowed histories. It constrains which operations are permissible, and in which orders. We say that one model is stronger than another if the stronger model’s allowed histories are a strict subset of the weaker model’s allowed histories.
For example, Monotonic Reads is a consistency model which requires that for each process, reads never “go backwards”. Serializability is a consistency model which requires that the outcomes of operations are as if those operations had been executed in some total order.
- Causal consistency
- Causal consistency ensures that processes agree on the order of causally related operations. For example, when a single process (sometimes called a “node” or “session”) performs a series of operations, other processes observe those operations in the same order. A variant of Causal consistency, Real Time Causal, is one of the strongest totally available consistency models.
- Eventual consistency
- Eventual consistency requires that after updates cease, given sufficient time and network messages, every node reaches the same final value. Eventually consistent systems can provide total availability: reads and updates can always be performed locally, without network communication.
- History
- Speaking loosely, a history is a set of operations performed by some system, along with information like “what process performed each operation” and “when did each operation begin and end.” A history is often concurrent: at any given time, several operations may be executing on different processes.
Conceptually, a consistency model is a set of allowable histories. In practice, distributed systems tests often record a history, then try to determine whether it satisfied some properties. To do this, the test may establish a mapping between the recorded history as seen by the test harness, and the abstract history used to define some consistency model.
- Linearizability
- Linearizability is a consistency model which ensures that on a single object, operations appear to execute atomically (i.e. not interleaved with other operations), in a total order, and that order is consistent with the real-time order of operations. If operation A ends before operation B begins, A must appear to execute before B. This holds regardless of which process performs an operation.
Relaxing Linearizability’s real-time constraint gives Sequential Consistency. If one identifies a transaction as an operation, and the entire database as a single object, Linearizability is equivalent to Strong Serializability.
- Monotonic Atomic View
- Monotonic Atomic View is a less-common, totally available transactional consistency model. It strengthens Read Committed by requiring that once a transaction T1 observes any effect of some other transaction T2, T1 must observe all effects of T2. This is particularly helpful for preventing missing foreign keys, or for ensuring indices and materialized views don’t miss their underlying objects.
Monotonic Atomic View is weaker than Read Atomic, Repeatable Read, and Snapshot Isolation.
- Monotonic Reads
- Monotonic Reads is a consistency model which guarantees that each process observes a monotonically advancing view of the system. Once a process observes the effects of some write, it must always observe those effects.
- Monotonic Writes
- Monotonic Writes is a consistency model which guarantees that if a single process performs two writes, all processes observe those two writes in that order. The second write should “overwrite” the first.
- Read Atomic
- Read Atomic is a transactional consistency model which ensures atomic visibility: either all or none of a transaction’s updates are observed by other transactions. Read Atomic prevents Fractured Read, Aborted Read, and Intermediate Read; it is stronger than Monotonic Atomic View in that it prevents Fractured Read. Read Atomic is weaker than Snapshot Isolation.
- Read Committed
- Read Committed is a relatively weak transactional consistency model which strengthens Read Uncommitted. It proscribes Write Cycle, Aborted Read, Intermediate Read, and Cyclic Information Flow. However, it allows all forms of Anti-Dependency Cycle: cycles where one transaction fails to observe another’s effects. Read Committed can be totally available.
Read Committed strengthens Read Uncommitted by proscribing Aborted Read, Intermediate Read, and Cyclic Information Flow. It is weaker than Snapshot Isolation (which proscribes Non-Adjacent Anti-Dependency Cycle) and Repeatable Read (which proscribes Anti-Dependency Cycle).
- Read Uncommitted
- Read Uncommitted is one of the weakest transactional consistency models. Definitions vary, but in Adya’s formalism, Read Uncommitted proscribes G0: transactions should not overwrite each other’s effects. However, Read Uncommitted allows G1, and more. Read Uncommitted can be totally available.
Read Uncommitted is weaker than Read Committed, which, in addition to G0, proscribes G1.
- Read Your Writes
- Read Your Writes is a consistency model which ensures that if a process performs a write, any subsequent read performed by that same process must observe that write’s effects.
- Repeatable Read
- Repeatable Read is a transactional consistency model which strengthens Read Committed. It proscribes G0, G1, and G2-item: cycles between transactions which fail to observe each other’s effects, based on single-object reads.
Repeatable Read is stronger than Read Committed, which allows G2-item. It is weaker than Serializability, which proscribes G2 in general, not just on single-item reads. It follows that Repeatable Read and Serializability are indistinguishable for transactions which interact with objects by primary key; they differ only with respect to predicate phenomena.
Definitions and implementations of Repeatable Read vary widely. We use Adya’s PL-2.99, which is often used in research on consistency models. The ANSI SQL specification offers a much weaker, ambiguous definition which, unlike PL-2.99, allows G0 (write cycle).
- Serializability
- Serializability is a relatively strong transactional consistency model which requires that all transactions appear to execute in some total order. It is stronger than Repeatable Read and Snapshot Isolation, proscribing G0, G1, and G2. It does not require that the apparent order of transactions be consistent with the per-process or real-time order: for that, see Strong Session Serializability and Strong Serializability, respectively.
- Sequential (Consistency model)
- Sequential consistency is a relatively strong consistency model which guarantees that all operations appear to take place in some total order, and that order is consistent with the order on each process. Sequential is stronger than Causal, in that it requires equivalence to a total order of operations. It is weaker than Linearizable, in that it does not require real-time order.
- Snapshot Isolation
- Snapshot Isolation is a transactional consistency model which strengthens Read Committed. It proscribes G0, G1, and G-nonadjacent: a kind of dependency cycle where some transactions fail to observe each other’s effects. Informally, Snapshot Isolation provides each transaction with an isolated snapshot of the entire database. At commit time, the transaction’s writes are applied atomically to produce a new version of the database. The transaction only commits if no other transaction has written the same keys since the transaction’s snapshot was taken.
Snapshot Isolation strengthens Read Committed by prohibiting G-nonadjacent. It is weaker than Serializability, which prohibits G2 in general. It is neither stronger nor weaker than Repeatable Read: Repeatable Read allows some predicate phenomena which Snapshot Isolation doesn’t, and Snapshot Isolation allows some single-item phenomena, like Write Skew, which Repeatable Read doesn’t.
- Strong consistency
- There is no one accepted definition of “strong consistency”, and the term is used to cover a broad range of consistency models. It can refer to many consistency models, including Sequential, Linearizable, Serializable, Strong Session Serializable, Strong Serializable, and so on. In general, “strong” and “weak” are relative terms.
- Strong Serializable
- Strong Serializable is one of the strongest transactional consistency models. It strengthens Serializable by proscribing real-time phenomena: the apparent order of transactions must be consistent with their real-time order. If transaction A commits before transaction B begins, B must appear to execute after A.
Strong Serializability is also stronger than Strong Session Serializability, which allows processes to skew relative to one another. If one considers a transaction as a single operation, and the database as a single object, Strong Serializability is equivalent to Linearizability.
- Strong Session Serializable
- Strong Session Serializable is a relatively strong transactional consistency model. It strengthens Serializable by proscribing session phenomena: the apparent order of transactions must be consistent with the order within each session. If we identify each session as a process, Strong Session Serializable requires that the graph of write-write, write-read, read-write, and session dependencies between transactions is acyclic.
Strong Session Serializable is weaker than Strong Serializable, which ensures that all transactions, regardless of which process performs them, appear to execute in real-time order.
- Strong Session Snapshot Isolation
- Strong Session Snapshot Isolation is a relatively strong transactional consistency model. Just as Strong Session Serializable strengthens Serializable, Strong Session Snapshot Isolation strengthens Snapshot Isolation by proscribing session phenomena: the apparent order of transactions must be consistent with the order within each session. If we identify each session as a process, Strong Session Snapshot Isolation requires that the graph of write-write, write-read, read-write, and session dependencies has no cycles, except for cycles which have adjacent read-write edges.
Strong Session Snapshot Isolation is weaker than Strong Snapshot Isolation, which enforces real-time order between processes, not just within each process.
- Strong Snapshot Isolation
- Strong Snapshot Isolation is a relatively strong transactional consistency model. Just as Strong Serializable strengthens Serializable, Strong Snapshot Isolation strengthens Snapshot Isolation by proscribing real-time phenomena: the apparent order of transactions must be consistent with their real-time order. If transaction A commits before transaction B begins, B’s snapshot must reflect A’s commit. Strong Snapshot Isolation requires that the graph of write-write, write-read, read-write, and real-time dependencies is acyclic, except for cycles which have adjacent read-write edges.
Strong Snapshot Isolation is also stronger than Strong Session Snapshot Isolation, which only enforces order within each process, rather than across all processes.
- Writes Follow Reads
- Writes Follow Reads is a consistency model which ensures that if a process performs a read r observing some write w1, any later write w2 that process executes must take effect after w1. Reading a value “seals” the past.
Availability Models
These are different ways systems can be available.
Model | Operations succeed when… | Achievable consistency models |
---|---|---|
Total | The node is non-faulty | Monotonic Atomic View, Read Committed, Read Uncommitted, Monotonic Reads, Monotonic Writes, Writes Follow Reads |
Sticky | A non-faulty client is connected to a non-faulty server | Causal, Read Your Writes |
Majority | The node is non-faulty and can communicate with a majority of nodes | Sequential, Linearizable, Snapshot Isolation, Repeatable Read, Serializable, Strong Serializable |
- High availability
- Many systems claim to offer high availability, but the term is not well-defined. The Megastore paper uses “high availability” to mean majority availability, but the Dynamo paper uses the term to mean total availability.
One working definition is that a highly available system is available much of the time; this is the definition adopted in, for example, Ladin, Liskov, Shrira, and Ghemawat’s “Providing High Availability Using Lazy Replication”. Another definition is that a highly available system should generally be available more often than a single node.
- Total availability
- Totally available systems ensure that every non-faulty node can execute any operation. Operations continue to execute even when other nodes, or the network between them, fail.
Total availability is particularly useful for mobile apps, geo-replicated systems, and IoT devices with limited power or connectivity. Total availability also has important performance implications. Regardless of node or network failures, a totally available system can process requests without waiting for any network communication. This is particularly useful for latency-sensitive applications.
Consistency models like Read Uncommitted, Read Committed, Monotonic Atomic View, Writes Follow Reads, Monotonic Reads, and Monotonic Writes can all be implemented in totally available systems.
Bailis et al, Abadi, and DeCandia et al’s Dynamo paper all call this model of availability “highly available”. Gilbert & Lynch simply call it “available” in their CAP Theorem paper, as do the PNUTS and COPS papers. We call it “total availability”, in an attempt to disambiguate.
- Majority availability
- Majority available systems ensure that if a majority of non-faulty nodes can communicate with one another, those nodes can execute operations. Operations may fail on faulty nodes, or when nodes cannot exchange messages with a majority. Many systems use “just over half” as their majority, but some Byzantine-fault-tolerant systems use a different fraction, like a two-thirds majority. Some systems can use large quorums for (e.g.) leader election in exchange for smaller quorums when committing operations using a stable leader.
Consistency models like Sequential, Linearizable, Snapshot Isolation, Repeatable Read, Serializable, and their stronger variants can only be, at best, majority available. In practice, systems like Zookeeper, Raft, Consul, MongoDB, and CockroachDB are (broadly speaking) majority available.
The distributed systems literature uses varying names for this model. Bailis et al’s HAT paper calls it “unavailable”, whereas Megastore calls it “highly available”. We opt for “majority available”.
- Sticky availability
- Sticky availability means that if a client executes an operation against a database state that reflects all the client’s past transactions, it eventually receives a response – even in the presence of indefinite node or network failures. Broadly speaking, stateless clients must “stick” to one server, which ensures the consistency of that client’s operations. In the limiting case where every process is a server (or maintains its own state) this reduces to total availability.
Consistency models like Read Your Writes and Causal consistency can be implemented in sticky available systems, or in totally available systems in which all processes retain state about the past.
Phenomena
A phenomenon is something a system does which someone, somewhere, thought was a bad idea. For example, G0 (Write Cycle) describes a set of transactions which all overwrite each other. Since some names for phenomena have multiple interpretations, we try to use the less-ambiguous, short names presented in the literature – e.g. G1a for “aborted read”.
- A5A (Read Skew)
- A5A (also known as Read Skew) is a phenomenon where a transaction observes part, but not all, of another transaction’s writes. A5A is a special case of G-single. It is allowed by Read Committed, but forbidden by Snapshot Isolation and Repeatable Read.
- A5B (Write Skew)
- A5B (also known as Write Skew) is a phenomenon where two transactions write different objects, and each fails to observe the other’s write. A5B is a special case of G2-item; it is allowed by Snapshot Isolation, but is forbidden by Repeatable Read.
- Fractured Read
- Fractured Read is a phenomenon where one transaction writes two objects x and y, and a second transaction reads that version of x, but an older version of y. Fractured Reads are prevented by Read Atomic.
- G0 (Write Cycle)
- G0 (also known as Write Cycle) is a phenomenon where a set of transactions overwrite one another, forming a cycle linked purely by write-write dependencies. G0 is proscribed by Read Uncommitted.
- G1
- G1 comprises three phenomena: G1a (Aborted Read), G1b (Intermediate Read), and G1c (Cyclic Information Flow). G1 is used to define Read Committed.
- G1a (Aborted Read)
- G1a (also known as Aborted Read) is a phenomenon where an aborted transaction’s write is visible to a committed transaction. G1a is prohibited by Read Committed.
Checking for G1a is straightforward: one builds a set of the versions written by every aborted transaction, then scans every committed read. If any read observes a version written by an aborted transaction, one has a direct example of G1a.
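A minimal sketch of such a checker, assuming a simplified history format of our own invention (not any real checker’s API):

```python
# Each transaction is assumed to be a map like
# {"status": "committed", "writes": {obj: version}, "reads": {obj: version}}.

def check_g1a(history):
    """Return committed reads which observed an aborted transaction's write."""
    aborted = {(obj, v)
               for txn in history if txn["status"] == "aborted"
               for obj, v in txn["writes"].items()}
    return [(obj, v)
            for txn in history if txn["status"] == "committed"
            for obj, v in txn["reads"].items()
            if (obj, v) in aborted]
```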
- G1b (Intermediate Read)
- G1b (also known as Intermediate Read) is a phenomenon where a transaction reads a version of some object from the middle of a different transaction; i.e., a version other than that transaction’s last write of that object. G1b is prohibited by Read Committed.
Checking for G1b is straightforward. Start by scanning through every transaction, building up a set of intermediate versions: those versions which were overwritten by writes later in the same transaction. Then scan every committed read: if any observes an intermediate version, it constitutes an example of G1b.
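A sketch of this check, under the same assumed format as above, except that each transaction’s writes are recorded as an ordered list of (object, version) pairs:

```python
def check_g1b(history):
    """Return committed reads of versions overwritten within their own txn."""
    intermediate = set()
    for txn in history:
        last = {}                                   # object -> last version
        for obj, v in txn["writes"]:
            if obj in last:
                intermediate.add((obj, last[obj]))  # overwritten in-txn
            last[obj] = v
    return [(obj, v)
            for txn in history if txn["status"] == "committed"
            for obj, v in txn["reads"].items()
            if (obj, v) in intermediate]
```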
- G1c (Cyclic Information Flow)
- G1c (also known as Cyclic Information Flow) is a phenomenon where a set of transactions all observe or overwrite each other’s writes. G1c is prohibited by Read Committed.
- Garbage Read
- Garbage Read is a phenomenon where a database returns a version of some object that was not the product of any write(s). Garbage reads can occur due to memory corruption, serialization errors, race conditions, and more. We are not aware of a formal name for this phenomenon, but it occurs in a surprising number of real systems.
Checking for garbage reads is relatively simple: collect every version written during the test, then flag any read which returns a version that no operation wrote.
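A minimal sketch, again under the simplified history format used above, and assuming reads of an object’s initial, unwritten state return nil – here, None:

```python
def check_garbage_reads(history):
    """Return reads of versions which no operation ever wrote."""
    written = {(obj, v)
               for txn in history
               for obj, v in txn["writes"].items()}
    return [(obj, v)
            for txn in history
            for obj, v in txn["reads"].items()
            if v is not None and (obj, v) not in written]
```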
- G-single (Single Anti-Dependency Cycle)
- G-single (also known as Single Anti-dependency Cycle) is a phenomenon where a set of transactions are linked in a cycle whose edges are write-write, write-read, or read-write dependencies, and there is exactly one read-write dependency. This is one of the cycles proscribed by Snapshot Isolation.
- G-nonadjacent (Non-Adjacent Anti-Dependency Cycle)
- G-nonadjacent is a phenomenon where a set of transactions are linked in a cycle whose edges consist of write-write, write-read, or read-write dependencies, at least one of which is a read-write dependency, and no two read-write dependencies appear adjacent to each other. This is not a standardized term in the literature, but it is the general kind of cycle proscribed by Snapshot Isolation.
- G2-item (Item Anti-Dependency Cycle)
- G2-item (also known as an Item Anti-Dependency Cycle) is a phenomenon where a set of transactions are linked in a cycle whose edges consist of write-write, write-read, or read-write dependencies purely on single objects, rather than predicates, and at least one edge is a read-write dependency. This is the general kind of cycle proscribed by Repeatable Read.
- G2 (Anti-Dependency Cycle)
- G2 (also known as an Anti-Dependency Cycle) is a phenomenon where a set of transactions are linked in a cycle whose edges consist of write-write, write-read, or read-write dependencies, at least one of which is a read-write dependency. These cycles are proscribed by Serializability.
- Internal Consistency Anomaly
- An Internal Consistency Anomaly occurs when a transaction fails to provide sequential semantics within a single transaction. For read-write registers, an internal violation occurs when a read of object x fails to observe that transaction’s most recent write of x – or, if the transaction has not yet written x, the version of x from the start of the transaction. This comes from Cerone, Bernardi, and Gotsman’s A Framework for Transactional Consistency Models with Atomic Visibility, where it is formalized as the axiom INT. Similar behaviors are required in Adya’s and Alvisi et al’s formalisms.
Internal consistency is relatively easy to verify. Within each transaction independently, one steps through each sub-operation and keeps track of what the database state should have been, given that operation. For instance, once one observes a write of x = 5, any later reads of x should observe 5, until another write of x supervenes.
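A minimal sketch of the within-transaction part of this check, assuming each transaction is recorded as an ordered list of sub-operations:

```python
# Once a transaction writes x, its later reads of x must see that write.
# (Checking reads *before* any write against the transaction's starting
# snapshot requires the surrounding history, and is omitted here.)

def check_internal(txn_ops):
    """txn_ops: ordered list of ("r" | "w", object, value) tuples."""
    expected = {}                        # object -> value reads must now see
    violations = []
    for kind, obj, value in txn_ops:
        if kind == "w":
            expected[obj] = value
        elif obj in expected and value != expected[obj]:
            violations.append(("r", obj, value))
    return violations

assert check_internal([("w", "x", 5), ("r", "x", 4)]) == [("r", "x", 4)]
assert check_internal([("w", "x", 5), ("r", "x", 5)]) == []
```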
- Long Fork
- Long Fork is a phenomenon where two transactions make concurrent updates to separate objects, creating a “fork” in the timeline of the database, and two other transactions each observe one of those forks before they are logically merged.
- Lost Write (Phenomenon)
- At the level of a distributed system, Lost Write is a phenomenon where the system acknowledges a write as complete, but after some time, the effects of that write are no longer visible.
A write could be lost immediately. For instance, a server might internally abort a write transaction, but due to a bug, inform the client that the transaction had actually committed.
A write could also be visible for some time, then vanish. For example, a write might be acknowledged by a server and stored in memory, but not written to disk before a crash. Alternatively, a write could be replicated to, say, two out of five nodes in a cluster, visible in reads on those nodes, but later overwritten by the remaining three nodes.
- P0 (Dirty Write)
- P0 (also known as Dirty Write) is a phenomenon where one transaction overwrites a version written by another transaction before that transaction completes. See also the generalized definition, G0.
- P1 (Dirty Read)
- P1 (also known as Dirty Read) is a phenomenon where one transaction reads data from another transaction before that transaction completes. See also the generalized definition G1.
- P2 (Non-Repeatable Read)
- P2 (also known as Non-Repeatable Read or Fuzzy Read) is a phenomenon where a transaction modifies data that another, ongoing transaction has read. See also the generalized definition G2-item.
- P3 (Phantom)
- P3 (also known as Phantom) is a phenomenon where one transaction modifies data that another, ongoing transaction has read via a predicate. See also the generalized definition G2.
- P4 (Lost Update)
- P4 (also known as Lost Update) is a phenomenon where the effects of a committed transaction are effectively lost due to another transaction’s concurrent write. Specifically, two committed transactions read the same version of some object, and each writes to it. Since neither observed the other’s write, one of their updates is effectively lost. P4 is prohibited by Snapshot Isolation and Repeatable Read.
P4 is relatively straightforward to test for: scan through all transactions in a history, finding any that read and then wrote the same object. Construct a map of object versions to sets of those transactions. If any set has more than one element, it indicates P4.
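A minimal sketch, under the same simplified history format as the earlier checkers:

```python
from collections import defaultdict

def check_p4(history):
    """Map (object, version-read) to committed txns which read, then wrote it;
    any such version read by two writers indicates a lost update."""
    readers_writers = defaultdict(list)
    for txn in history:
        if txn["status"] != "committed":
            continue
        for obj in txn["writes"]:
            if obj in txn["reads"]:
                readers_writers[(obj, txn["reads"][obj])].append(txn)
    # Neither transaction observed the other's write, so one was lost.
    return {k: txns for k, txns in readers_writers.items() if len(txns) > 1}
```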
- Real-Time Phenomena
- Real-Time phenomena are variants of cycle phenomena which involve one or more real-time dependencies between transactions. For instance, in a Serializable system, one process could perform a write, and after that transaction commits, a second process could begin a transaction which fails to observe the write. The two transactions form a cycle involving a read-write and a real-time dependency: a violation of Strong Serializability.
We are not aware of “real-time phenomena” as a term in the literature, but the concept arises in work on Strong Serializability and Strong Snapshot Isolation, along with Adya’s formalism. Real-time phenomena are permitted by most consistency models, but proscribed by Linearizable, Strong Snapshot Isolation, and Strong Serializability.
- Process Phenomena
- Process phenomena (or session phenomena) are variants of cycle phenomena which involve one or more process (session) dependencies between transactions. For instance, a Serializable system could allow a single process to execute a transaction which writes something, then a second transaction which observes the state before its earlier write.
We are not aware of “session phenomena” or “process phenomena” as terms in the literature, but the concept arises in work on Strong Session Serializability and Strong Session Snapshot Isolation, along with Adya’s formalism. Process/session phenomena are allowed by most consistency models, but proscribed by Sequential, Strong Session Serializability, and Strong Session Snapshot Isolation.
- Stale Read
- Stale Read is a kind of Real-Time phenomenon where one operation begins after another ends, but appears to execute before the prior operation. Note that both transactions can execute on any process. Stale reads are allowed by most consistency models, but proscribed by Linearizable, Strong Snapshot Isolation, and Strong Serializability.
Faults
These are things that can go wrong in distributed systems.
- Amnesia
- Amnesia means a node has forgotten something. Total amnesia occurs when a node forgets everything it has received. Most nodes undergo partial amnesia when they crash: information in memory is lost, but some state (depending on sync calls, the filesystem, kernel, and hardware) persists on disk.
- Bit rot
- Bit rot describes a gradual storage corruption where small, sometimes single-bit errors accumulate over time. Causes include cosmic rays, thermal stress, vibration, chemical breakdown in storage media, and electromagnetic noise.
- Byzantine fault
- A Byzantine fault allows a node (or in some cases, the network) to take any action, including malicious actions. For example, a Byzantine node could vote twice in elections, corrupt user data, or impersonate another node in an attempt to steal funds. Byzantine faults are some of the most challenging to protect against, and usually impose higher costs in latency and quorums.
The term “Byzantine” comes from Lamport, Shostak, and Pease’s “The Byzantine Generals Problem”, which posed an allegorical challenge: how should a group of generals, some of whom may be traitors, agree upon a battle plan? Their paper shows that if the generals can communicate only by messenger, more than two-thirds of the generals must be loyal to reach consensus.
- Crash
- A crash fault means just what it sounds like: a node stops executing code until it is restarted.
- Crash-stop
- A crash-stop fault (also known as halt or fail-stop) occurs when a node stops executing code and never recovers.
- Crash-recover
- A crash-recover fault (also known as crash-restart) occurs when a node stops executing code and, after some time, begins again.
- Clock drift
- Clock drift is the difference in the rates at which the clocks on two processes advance. All real clocks have some degree of drift, but when it becomes large enough to interfere with the application, clock drift can be considered a fault.
Clock drift causes skew: as one node’s clock runs faster, its clock grows further and further ahead. Even if the system is insensitive to clock skew, clock drift alone can cause problems – particularly with timeouts. For example, a Raft implementation might use timeout-based leases to allow leaders to respond to read requests without checking with other nodes first. If clocks were to drift, an old leader could believe that it still held a lease while a new leader had actually been elected – resulting in Stale Reads.
- Clock skew
- Clock skew is the instantaneous difference between the clocks on two processes. All clocks have some degree of skew, but when it becomes large enough to interfere with the application, clock skew can be considered a fault.
For example, imagine a system like Cassandra, in which nodes stamp their local clock on each write to some object x, and preserve the version of x which has the highest clock. If node A’s clock were to run three seconds faster than node B’s clock, then node A could write a value two seconds before B, but B’s later write would be discarded in favor of A’s: a Lost Write. If, on the other hand, the two clocks were only one second apart, B’s write would overwrite A’s, and the phenomenon disappears.
Some systems are insensitive to clock skew, because they rely purely on logical clocks for safety. Others, like the Cassandra example above, use wall clocks for safety; these systems may be safe so long as clock skew is small, but unsafe when clock skew is large.
Clock skew can be measured between any pair of nodes in a system, or between a node and some reference clock.
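To make the example above concrete, here is a toy last-write-wins register (a hypothetical sketch, not Cassandra’s actual implementation) in which a three-second skew loses a later write:

```python
# Node A's clock runs three seconds fast. A writes at true time t = 0,
# B writes at t = 2, but A's timestamp is higher: B's write is lost.

def lww_merge(a, b):
    """Keep whichever (timestamp, value) pair has the higher timestamp."""
    return a if a[0] >= b[0] else b

CLOCK_SKEW_A = 3.0                            # node A's clock error, seconds

write_a = (0.0 + CLOCK_SKEW_A, "from A")      # written at true time 0
write_b = (2.0, "from B")                     # written at true time 2

assert lww_merge(write_a, write_b)[1] == "from A"   # Lost Write

# With only one second of skew, B's later write would win instead:
assert lww_merge((0.0 + 1.0, "from A"), write_b)[1] == "from B"
```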
- Latency
- Latency means “how long it takes to do something.” Anything that takes time can have a latency: systems often involve network latency (message delay), disk latency, memory latency, database query latency, queue latency, an overall request latency for whatever the system does as a whole, and so on.
- Latent sector error
- A latent sector error, also known as a block error, is a type of storage fault which occurs when one or more disk sectors have been damaged – for example, due to a head crash, bit rot, or buggy firmware. Storage devices generally detect latent sector errors and either correct or report them to the application.
- Lost write (Storage fault)
- In storage systems, a lost write occurs when some layer of the system informs the layer above it that a write has been performed when, in fact, it has not. For example, a hard disk may store a write in volatile memory on the disk itself, acknowledge the write, and only write it to the durable storage medium after some time.
- Torn write (Storage fault)
- In storage systems, a torn write occurs when the storage system stores part, but not all, of a requested write.
- Memory corruption
- In the context of distributed systems faults, memory corruption (also known as a memory error) means that the data read from some memory location differs from what was written there. These can be due to cosmic rays, electrical noise, hardware bugs, or software errors – for instance, in a hypervisor’s memory management system.
- Message corruption
- Message corruption is a type of fault where the message which arrives at some node differs from the message which was sent. These could be due to degraded cabling, electromagnetic interference, a fault in a network switch, an error in a hypervisor TCP stack, and more. Messages could also be corrupted by a malicious actor, as in a man-in-the-middle attack.
- Message delay
- A message delay (also called “network latency”) is the length of time it takes for a message to travel from sender to recipient. All real-world networks have non-zero latency. In asynchronous networks, latencies are unbounded. If a system assumes that messages arrive in a timely manner, a larger-than-expected delay can induce surprising behavior. In this sense, high network latencies can be interpreted as a fault.
- Message duplication
- Message duplication happens when a process sends a single message, and multiple copies of that message arrive. IP networks may duplicate packets – for instance, when a router has multiple paths available, it may opt to send a message along both of them. Messages may also be duplicated when processes retry, when operations are replayed from a log, or when messages are resent as part of a crash-recovery procedure. Depending on how an application views the network, message duplication may be considered normal, or a fault.
- Message reordering
- In asynchronous networks, variable latency allows two messages to arrive in the opposite order from which they were sent. This is a normal behavior in IP networks, but many systems, for instance those relying on TCP, assume that messages are delivered in the order they are sent. Introducing message reordering in a system which assumes ordered delivery can be a kind of fault.
- Message omission
- A message omission fault (also known as a dropped or lost message) occurs when a message is sent but not received. Most networks drop messages in normal operation – for instance, due to network congestion. Message corruption often manifests as omission – for example, when Ethernet, IP, or TCP checksums detect a corrupted packet, they usually drop the packet altogether. Omission can also be indistinguishable from delay – indeed, a dropped message is equivalent to a message with infinite latency.
- Misdirected write (Storage fault)
- A misdirected write is a type of storage fault where data is written to the wrong location. This overwrites whatever data may have existed at the wrong address, and leaves the intended address with its original contents, as if the write had never occurred.
- Misdirected Read (Storage fault)
- A misdirected read is a type of storage fault where the storage system returns data from the wrong location.
- Network partition
- A network partition is a pattern of message omission: for some period of time, all messages on one or more network links are lost. Partitions may be temporary or permanent. They may completely isolate one or more nodes, or leave some links intact. They may lose messages traveling in one direction, but not the other.
- Pause
- A pause is a kind of timing fault: a process pauses for some time, then resumes processing. In theory, asynchronous systems can pause for arbitrarily long durations. In practice, POSIX processes running on typical x86-64 hardware can pause for a few milliseconds to several minutes – for example, due to garbage collection, the OS scheduler, a recoverable fault in the IO subsystem, a hypervisor migration, and so on.
- Storage corruption
- Storage corruption is a type of storage fault where the storage device silently returns incorrect data for reads of some location. The data may have been written to the wrong location (a misdirected write), read from the wrong location (a misdirected read), damaged by cosmic rays (bit rot), corrupted in transit across the storage bus, and so on.
- Storage fault
- A storage fault occurs when a node’s storage (e.g. a hard disk) fails to store data correctly. Storage faults include corruption and latent sector errors. “Disk fault” is often used as a synonym, but other media (e.g. tape, network storage) can experience the same faults.
Testing techniques
Since this is a glossary, this list focuses only on techniques for which there are well-established names. In reality, effective distributed systems testing methods tend to combine several of these approaches while exercising the system under test in complicated ways for which there aren’t well-established names. Those tactics will be covered in a forthcoming resource – if you’d like to contribute, we’d love your help!
- Concurrent testing
- Distributed systems are fundamentally concurrent: they involve multiple nodes running at the same time. Even single-node systems are often built to execute multiple operations at once. For example, SQL databases typically execute concurrent transactions from multiple clients. To verify the safety of concurrent systems, our tests should usually be concurrent as well.
Concurrent tests execute more than one logical operation at the same time. Ideally these operations are distributed across all the nodes in the system; if only a single node executes requests, it may conceal concurrency control errors.
- Constraint programming
- Constraint programming finds solutions to problems with constraints, or shows that no solution exists. Since distributed systems are fundamentally concurrent and often involve indefinite failures, a system may be in one of many legal states during a test. One can encode the history of operations as a constraint problem, feed it to (e.g.) a SAT solver, and ask it to search for a legal interpretation. If none exists, the system has violated safety.
There are two major challenges with this approach. First, constraint solving is NP-complete; it may only work for short histories. Second, if a constraint problem is unsatisfiable, a solver generally provides no insight as to why. This makes it difficult to localize a safety violation to a specific time, object, or operation.
- Cycle detection
- Many consistency models can be defined in terms of dependency cycles. If the test can infer the dependencies between operations in a history, it can search for cycles in that graph to show whether the consistency model holds. Proving the existence of cycles is a linear-time problem, and when a cycle is found, it localizes the problem to a specific set of transactions (often small). This is the technique behind Elle, Jepsen’s transactional consistency checker.
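A minimal sketch of the graph-search step (not Elle’s actual implementation, which must also infer the dependency graph from the history – the harder problem):

```python
def find_cycle(graph):
    """graph: dict mapping each txn to the set of txns it depends on.
    Returns a list of transactions forming a cycle, or None."""
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / in progress / done
    color = {n: WHITE for n in graph}
    parent = {}
    for start in graph:
        if color[start] != WHITE:
            continue
        color[start] = GRAY
        stack = [(start, iter(graph[start]))]
        while stack:
            node, edges = stack[-1]
            advanced = False
            for nxt in edges:
                if color[nxt] == GRAY:    # back edge: reconstruct the cycle
                    cycle, cur = [nxt], node
                    while cur != nxt:
                        cycle.append(cur)
                        cur = parent[cur]
                    return cycle[::-1]
                if color[nxt] == WHITE:
                    color[nxt], parent[nxt] = GRAY, node
                    stack.append((nxt, iter(graph[nxt])))
                    advanced = True
                    break
            if not advanced:
                color[node] = BLACK       # fully explored, no cycle here
                stack.pop()
    return None

# T1 -> T2 -> T3 -> T1: a dependency cycle, i.e. no serial order exists.
assert find_cycle({"T1": {"T2"}, "T2": {"T3"}, "T3": {"T1"}}) is not None
assert find_cycle({"T1": {"T2"}, "T2": set(), "T3": {"T2"}}) is None
```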
- Deterministic Simulation Testing
- A deterministic simulation test is a simulation test where the simulated layers are deterministic, rather than random. Ideally the entire test becomes deterministic, making it possible to reproduce bugs.
One approach, popularized by FoundationDB, is to design the system under test so that all nondeterministic components are pluggable. For example, rather than using the language’s multithreading facilities directly, one might use a concurrency library with two implementations. In production, it runs code on regular threads and uses the normal, nondeterministic thread schedulers of the operating system and runtime. In simulation mode, it runs everything on a single thread, deterministically switching between concurrent tasks. Similarly, one could use a network library, using TCP/IP in production, and a deterministic, in-memory network in simulation mode. The clock, disk, and entropy sources could all be intercepted as well.
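A minimal sketch of this pluggable style (hypothetical interfaces, not FoundationDB’s actual design): the same system code runs against a real environment in production, and a seeded, replayable one in simulation:

```python
import random
import time

class ProductionEnv:
    def now(self):
        return time.time()                 # real wall clock

    def rand(self):
        return random.random()             # OS-seeded entropy

class SimulatedEnv:
    def __init__(self, seed):
        self._clock = 0.0
        self._rng = random.Random(seed)    # deterministic, replayable

    def now(self):
        self._clock += self._rng.uniform(0.001, 0.05)  # simulated time
        return self._clock

    def rand(self):
        return self._rng.random()

def retry_delay(env, attempt):
    """Jittered exponential backoff: identical logic in both modes."""
    return (2 ** attempt) * env.rand()

# The same seed always yields the same execution, so bugs reproduce.
a, b = SimulatedEnv(seed=42), SimulatedEnv(seed=42)
assert [retry_delay(a, n) for n in range(5)] == \
       [retry_delay(b, n) for n in range(5)]
```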
Another approach is to run regular non-deterministic software inside a deterministic hypervisor, a la Antithesis.
- Example-based testing
- Example-based testing (in contrast with property-based testing) means writing down exactly which inputs should be applied to the system under test, and expecting a specific result. This is the approach taken by most software testing.
For instance, one might test an addition function by asserting that add(1, 2) = 3, add(0, 5) = 5, and add(-2, 1) = -1. These are three examples of how addition should work. By contrast, a generative test might generate pairs of random integers x and y, and verify that add(x, y) = x + y, or check commutativity by validating that add(x, y) = add(y, x).
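The addition examples above, as runnable tests:

```python
import random

def add(x, y):
    return x + y

# Example-based: specific inputs, specific expected outputs.
assert add(1, 2) == 3
assert add(0, 5) == 5
assert add(-2, 1) == -1

# Generative: random inputs, general properties.
for _ in range(1000):
    x = random.randint(-10**6, 10**6)
    y = random.randint(-10**6, 10**6)
    assert add(x, y) == x + y            # agrees with a reference computation
    assert add(x, y) == add(y, x)        # commutativity
```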
- Fault injection
- Distributed systems are characterized by recurrent, partial failure. Moreover, it is difficult to handle failure correctly. Tests of distributed systems should therefore deliberately inject failures into the system during the test, and check to see whether those failures cause the system to violate a claimed invariant.
- Fuzz testing
- Fuzz testing (also known as fuzzing or random testing) involves submitting random inputs to a program and checking that some property holds. Early fuzz testing looked for crashes or hangs in the system under test. More sophisticated fuzzers use oracles (properties) which look for subtle safety violations, timing errors, or differences from a reference implementation.
Fuzzing and generative testing are essentially the same concept, but come from different lines of research and industry work. The term “fuzzing” has historically been used more in tests that focus on crashes or security, whereas “property-based” and “generative” are used more often in tests focusing on logical correctness – often for unit tests, verification of data structures, or end-to-end correctness tests. Today the terms are often used interchangeably.
See also: Guided Search.
- Guided Search
- Generative tests provide random inputs to a system, but subtle bugs may require particular inputs which are unlikely to be generated by a purely random process. A guided search (also called directed fuzzing) tries to choose “interesting” inputs by gathering feedback from the system through an oracle – e.g. through code instrumentation, tracing system calls, or inspecting memory. The oracle may indicate how far the system is, in state space, from encountering a certain codepath or error. Alternatively, it may simply indicate roughly where the system is, allowing the fuzzer to search for inputs that take the system to a meaningfully different state. A related approach is Lineage driven fault injection, which uses traces of correct outcomes to target faults.
- Metamorphic testing
- Metamorphic testing is a generative testing technique which works by taking a test input, transforming it in some way, and showing that the transformed output is somehow related to the original output.
For example, one might test an addition function by taking a list of numbers [a, b, c, …] and summing them to S. Since addition is associative and commutative, one can construct many permutations of that list, such as [b, a, c, …]; these too should all sum to S. Note that we do not have to know whether S is correct in order to test commutativity and associativity. Another example of metamorphic testing is evaluating an optimizer: one can check that for some generated program, the un-optimized and optimized forms produce equivalent results.
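A minimal sketch of the permutation test:

```python
import random

def test_sum_permutation_invariance(trials=1000):
    for _ in range(trials):
        xs = [random.randint(-1000, 1000)
              for _ in range(random.randint(0, 20))]
        s = sum(xs)              # original output; we never check whether
                                 # it is "correct"
        ys = list(xs)
        random.shuffle(ys)       # the metamorphic transformation
        assert sum(ys) == s      # transformed output must agree

test_sum_permutation_invariance()
```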
- Oracle
- An oracle provides information to a test, helping determine whether a given test run was correct. A simple oracle could detect a crash or deadlock. A more complex oracle could take all data from a test run – all inputs and outputs – and decide if the test is valid. An oracle could also be a reference implementation of, for instance, a data structure: the test generates inputs, applies them to the oracle and to the implementation under test, and compares their results.
Another kind of oracle provides information to guide a test. A test harness can apply some random inputs to the system and ask the oracle for insight into the state of the program. Based on that state it can generate further inputs, or go back and try different inputs, in an attempt to reach “interesting” behaviors.
A reverse oracle takes a generated output and returns some inputs which should produce that output; those inputs are then applied to the system under test. This is particularly useful for testing deterministic functions, but can also help test complex systems, like databases. Rather than generate random queries (almost all of which would return nothing), the test harness estimates what data the database currently contains, and works backwards to generate a query which is likely to return results.
- Property-based testing
- Property-based testing (also known as generative testing, or fuzzing) checks correctness by generating random inputs, applying those inputs to the system under test, and verifying that some property holds. By contrast, example-based testing writes down specific inputs and expects specific outcomes. Generative tests are particularly useful in testing distributed systems, where concurrency and indefinite failures make it difficult or impossible to write robust example-based tests. Generative tests also tend to try interesting inputs that a human tester wouldn’t.
Generative tests often incorporate a shrinking system which, having found a large input that breaks a program, seeks to find a shorter version of that input. Their generators may also be guided by instrumentation or static analysis.
- Shrinking
- Shrinking is a property-based testing technique which reduces large sets of failing inputs to smaller – and hopefully more understandable – ones.
Generative tests often rely on a “long-world hypothesis”. There may be a short series of operations which cause the system to fail, but exhaustively searching for short inputs via (e.g.) breadth-first search might take too long. For example, a bug might only manifest after 32 elements have been appended to a list, but it would take far too long to generate all lists of 32 elements or fewer. Instead, one appends millions of elements to a single list, affording many chances to hit the bug.
This creates a needle-in-a-haystack problem: which of those millions of elements caused the bug? A shrinking system generates shorter and shorter lists, searching for ones which still fail the property being tested. Ideally, it finds one just long enough to trigger the bug, aiding a human in fixing the problem.
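A minimal sketch of a greedy shrinker for list-shaped inputs; real shrinking systems (e.g. in QuickCheck-style libraries) are considerably more sophisticated:

```python
def shrink(failing_input, prop):
    """Repeatedly drop chunks of the input, keeping any smaller input
    which still fails prop (prop returns False on failure)."""
    current = failing_input
    chunk = max(1, len(current) // 2)
    while chunk >= 1:
        i, shrunk = 0, False
        while i < len(current):
            candidate = current[:i] + current[i + chunk:]
            if not prop(candidate):      # still fails: keep the smaller input
                current, shrunk = candidate, True
            else:
                i += chunk               # this chunk was needed; move on
        chunk = chunk // 2 if not shrunk else chunk
    return current

# A toy property which "fails" whenever the list contains both 3 and 7.
prop = lambda xs: not (3 in xs and 7 in xs)
print(shrink(list(range(1000)), prop))   # shrinks toward [3, 7]
```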
- Simulation Testing
- Simulation testing simulates some or all of a distributed system under test, rather than running the test on real hardware, networks, operating systems, etc. Simulating the network allows the test harness to drop, duplicate, delay, or reorder packets. Simulating the clock allows the test harness to induce clock drift and clock skew. Simulating the disk allows one to inject disk faults, like misdirected reads. Simulating multiple nodes in a single process allows one to test an entire distributed system, and to explore pathological execution schedules. Deterministic simulation testing makes these simulated components – and ideally the test itself – reproducible.
Further Reading
Many of the links here go to Jepsen.io.
These reading lists survey useful papers and blog posts in the field of distributed systems and testing.
- Christopher Meiklejohn’s Readings in Distributed Systems offers an excellent introduction to consistency, consensus, eventual consistency, and systems design.
- Dan Creswell has a fantastic distributed systems reading list which includes sections on often-cited industry systems, theoretical models, consistency models, consensus, storage, gossip, and peer-to-peer systems.
- Ting Su’s reading list covers property-based and metamorphic testing.
- The jqwik project has a helpful list of property-based testing resources.
- Ivan Yurchenko maintains a list of blog posts and case studies involving deterministic simulation testing.