Types of faults
Antithesis injects faults at a pre-defined unit of isolation. A unit of isolation is sometimes also referred to as a node.
In systems using Docker Compose, the unit of isolation is a container. In systems using Kubernetes, the unit of isolation is a pod. Everything inside a unit shares the same fate. If you put two services in a unit, no network fault will ever fall between them. If you put them in separate units, they can be partitioned, lose their connection, or experience asymmetric loss between them.
There are two exceptions to this:
- Clock jitter affects every isolation unit simultaneously because Antithesis simulates a shared wall clock, called virtual time, for the whole system.
- Thread pausing acts at the thread level inside an isolation unit, and requires Antithesis instrumentation to take effect.
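As a rough illustration of the shared-fate rule, the sketch below lists which pairs of services a network fault could ever separate, given how they are grouped into isolation units. The layout, the unit names, and the `partitionable_pairs` helper are all hypothetical; they are not part of any Antithesis API.

```python
from itertools import combinations

def partitionable_pairs(units):
    """Return the service pairs that live in different isolation units.

    `units` maps a unit name to the services placed inside it. Services in
    the same unit share a fate, so no network fault can fall between them.
    """
    owner = {svc: unit for unit, services in units.items() for svc in services}
    return sorted(
        (a, b)
        for a, b in combinations(sorted(owner), 2)
        if owner[a] != owner[b]
    )

# Hypothetical layout: "api" and "cache" share a unit, "db" has its own.
layout = {
    "unit-1": ["api", "cache"],
    "unit-2": ["db"],
}
print(partitionable_pairs(layout))  # [('api', 'db'), ('cache', 'db')]
```

If you want faults between "api" and "cache" to be explored, this model suggests moving them into separate units.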
Network faults
Network faults relate to problems with packet delivery. Incoming and outgoing packets are treated independently, so all network faults may be asymmetric. One direction may be disrupted while the other works normally.
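One way to picture asymmetric faults is to treat each direction of a link as its own entry. The toy model below is only a sketch of that idea; the `Network` class and its method names are invented for illustration and say nothing about how Antithesis implements faults.

```python
class Network:
    """Toy model in which each direction of a link is disrupted independently."""

    def __init__(self):
        self.blocked = set()  # directed (src, dst) pairs currently disrupted

    def disrupt(self, src, dst):
        self.blocked.add((src, dst))

    def delivers(self, src, dst):
        return (src, dst) not in self.blocked

    def round_trip(self, a, b):
        # A request/response exchange needs both directions to work.
        return self.delivers(a, b) and self.delivers(b, a)

net = Network()
net.disrupt("A", "B")            # only the A -> B direction is disrupted
print(net.delivers("A", "B"))    # False
print(net.delivers("B", "A"))    # True
print(net.round_trip("B", "A"))  # False
```

In this model B can still send to A, but any exchange that needs a reply from A fails, which is the kind of half-open behavior asymmetric faults tend to expose.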
Baseline latency
This fault type simulates the normal variation in packet delivery time and a low baseline packet drop rate for all network traffic that flows through the Antithesis environment.
Partitions
This fault type simulates temporary failure of rack switches and similar hardware, which results in a network partition. Nodes in the network are temporarily separated into groups; nodes may communicate within their group, but communication between different groups is disrupted. The disruption takes one of three forms: Stopped (packets dropped entirely), Slowed (packets delayed with latency), or Jammed (packets queued for later delivery). Partitions can be configured by frequency, symmetry, number of partitioned groups, and duration.
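The three disruption forms can be sketched as a small model of a single link. Everything here is an assumption made for illustration: the class and field names are hypothetical, and in particular the model assumes that Jammed packets are released when the fault ends, which the description above does not spell out.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class Disruption(Enum):
    STOPPED = "Stopped"  # packets dropped entirely
    SLOWED = "Slowed"    # packets delayed with latency
    JAMMED = "Jammed"    # packets queued for later delivery

def crosses_partition(groups, src, dst):
    """True if src and dst sit in different partition groups."""
    group_of = {node: i for i, members in enumerate(groups) for node in members}
    return group_of[src] != group_of[dst]

@dataclass
class Link:
    disruption: Optional[Disruption] = None
    added_latency: float = 0.0
    held: list = field(default_factory=list)

    def send(self, packet, now):
        """Return the (deliver_at, packet) deliveries this send produces."""
        if self.disruption is Disruption.STOPPED:
            return []
        if self.disruption is Disruption.SLOWED:
            return [(now + self.added_latency, packet)]
        if self.disruption is Disruption.JAMMED:
            self.held.append(packet)  # assumption: held until the fault ends
            return []
        return [(now, packet)]

    def restore(self, now):
        """End the disruption and release anything that was held."""
        self.disruption = None
        released, self.held = self.held, []
        return [(now, p) for p in released]

groups = [{"A", "C"}, {"B", "D"}]
print(crosses_partition(groups, "A", "C"))  # False
print(crosses_partition(groups, "A", "B"))  # True

jammed = Link(disruption=Disruption.JAMMED)
print(jammed.send("p1", now=18))  # []
print(jammed.restore(now=28))     # [(28, 'p1')]
```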
Clogs
This fault type simulates failure of the networking stack of one or more machines, resulting in their inability to communicate with other machines. Affected nodes will have their packet delivery disrupted for inbound packets, outbound packets, or both. The disruption takes one of three forms: Stopped (packets dropped entirely), Slowed (packets delayed with latency), or Jammed (packets queued for later delivery).
Restore
This fault type simulates a network recovery event, which stops all ongoing network faults across all network links in the system.
Node faults
Node faults relate to problems with individual nodes in the system, independent of the network links between them.
Node throttling
This fault type limits one or more nodes’ CPU resources or the bandwidth (in microseconds of processor time) they’re permitted to use. This fault surfaces bugs that manifest under load.
Node hang
This fault type simulates a container frozen in place for a temporary period of time. It remains on the network but cannot process anything, so other nodes will see timeouts when trying to communicate with it. The node is restored to proper functioning after a set period of time has elapsed.
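From a client's point of view, a hung peer looks like a connection that opens but never answers. The sketch below imitates that locally with a socket that listens but never accepts or replies. It is only a stand-in for a hung node, not a reproduction of the fault, and `request_with_timeout` is a hypothetical helper.

```python
import socket

def request_with_timeout(addr, payload, timeout_s):
    """Send payload and wait for a reply; return None if the peer stays silent."""
    with socket.create_connection(addr, timeout=timeout_s) as conn:
        conn.sendall(payload)
        try:
            return conn.recv(1024)
        except socket.timeout:
            return None

# Stand-in for a hung node: it listens, so connections complete,
# but it never accepts them or sends anything back.
hung = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
hung.bind(("127.0.0.1", 0))
hung.listen(1)

reply = request_with_timeout(hung.getsockname(), b"ping", timeout_s=0.2)
print(reply)  # None
hung.close()
```

A client without a timeout would block indefinitely here, so explicit timeouts on every blocking call are worth checking before enabling this fault.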
Node termination
This fault type simulates a node terminating either by shutting down gracefully (node stop) or by crashing (node kill). If the node is a container, it’ll be restarted after a set period of time has elapsed. If the node is a pod, Antithesis cannot restart it; its lifecycle is fully managed by Kubernetes.
Restarted nodes may change IP addresses upon rejoining the network. They may keep durably synced filesystem changes, or may lose all of their modified data and boot fresh from their image. If you are testing a data storage system, and want to configure customised durability behavior for particular directories, talk to your forward-deployed engineer.
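Because a restarted node may keep only durably synced changes, code that must survive a kill usually syncs explicitly. The following is a minimal sketch of a common POSIX pattern (write to a temporary file, `fsync`, atomically rename, then `fsync` the directory). It assumes that `fsync` is the relevant durability boundary in your environment, and `durable_write` is a hypothetical helper name.

```python
import os
import tempfile

def durable_write(path, data: bytes):
    """Replace `path` with `data` so that a crash leaves the old or new file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # push file contents to stable storage
        os.replace(tmp, path)      # atomic rename over the target
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    # Sync the directory so the rename itself is durable (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "state.bin")
    durable_write(target, b"committed")
    with open(target, "rb") as f:
        print(f.read())  # b'committed'
```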
Other faults
Thread pausing
This fault type simulates one individual thread taking an unexpectedly long time to execute. Threads are paused for small periods of time. These small pauses cause threads to interleave unexpectedly, and differently in each test run. For thread pausing to work, your system under test must be instrumented. Thread pause events are not recorded in the log to avoid excessive logging.
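To see why a small pause can change the outcome, consider a read-modify-write that is not atomic. The sketch below uses generators as stand-ins for threads so the interleaving can be chosen deterministically. It is a toy model of the effect, not of how Antithesis pauses threads, and all names are hypothetical.

```python
def increment(counter):
    """One 'thread': read, (possible pause), write."""
    value = counter["n"]      # read the shared value
    yield                     # a pause could land here
    counter["n"] = value + 1  # write back a possibly stale value

def run(schedule):
    """Run two increments, stepping the given thread ids in order."""
    counter = {"n": 0}
    threads = {0: increment(counter), 1: increment(counter)}
    for tid in schedule:
        next(threads[tid], None)
    return counter["n"]

print(run([0, 0, 1, 1]))  # 2: each thread finishes before the other starts
print(run([0, 1, 0, 1]))  # 1: both read 0, so one update is lost
```

The second schedule is the kind of interleaving that a well-placed pause makes likely; guarding the read and write with a lock or an atomic operation removes it.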
Clock jitter
This fault type simulates changes to the system clock by moving it forward and backward. It affects all nodes equally. Jumps may be temporary (reversed after a set duration) or permanent, and multiple jumps stack cumulatively. These leaps in time surface bugs in processes that depend on small, consistent changes in system time.
Calls to intrinsics — such as __rdtsc() — will not be affected.
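The stacking behavior of temporary and permanent jumps can be modeled with a clock that adds up whichever offsets are currently active. This is a sketch under stated assumptions: `JitteredClock` and its methods are hypothetical, and the model treats a temporary jump as an offset that simply disappears once its duration has elapsed.

```python
class JitteredClock:
    """Toy clock: reported time = underlying time + all active jump offsets."""

    def __init__(self):
        self.true_time = 0.0
        self.jumps = []  # (offset, expires_at), expires_at is None if permanent

    def advance(self, seconds):
        self.true_time += seconds

    def jump(self, offset, duration=None):
        expires = None if duration is None else self.true_time + duration
        self.jumps.append((offset, expires))

    def now(self):
        active = sum(
            offset
            for offset, expires in self.jumps
            if expires is None or self.true_time < expires
        )
        return self.true_time + active

clock = JitteredClock()
clock.advance(40)
clock.jump(30, duration=1)   # temporary jump forward by 30 s
start = clock.now()
print(start)                 # 70.0
clock.advance(1)             # the temporary jump has now been reversed
elapsed = clock.now() - start
print(elapsed)               # -29.0
```

The negative elapsed time is the point: code that subtracts two readings of the system clock and assumes a non-negative result is a likely candidate for the bugs this fault surfaces.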
CPU modulation
This fault type simulates running your entire cluster on different hardware. The clock speed of the simulated processor can be changed, and the relative performance of different low-level processor operations may also change (for example, changing the speed of particular instructions while keeping the same overall clock speed). This fault surfaces bugs by changing the order in which concurrent threads or processes execute and can trigger some of the same bugs as our thread pausing fault without incurring the overhead of our instrumentation. CPU modulation events are not recorded in the log to avoid excessive logging.
Custom
You can create your own faults — you know your software best! If you write a script or program that simulates some sort of a fault, you can have our fault injector invoke it at random intervals.
Popular custom faults include connecting to administrative APIs and changing system configuration, triggering compaction or garbage-collection events, or forking troublesome background processes. Note that your custom fault can invoke our SDKs and, in particular, can use the SDK to draw random numbers which your custom fault can rely on. Your custom fault might therefore execute some default behavior 90% of the time, but a more pathological behavior 10% of the time.
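The 90%/10% split described above might look like the following sketch. The random source is passed in as a parameter so the example runs on its own with Python's standard library. Inside Antithesis you would presumably swap in the SDK's random-number call, whose exact name is deliberately not assumed here. The two action names are hypothetical placeholders.

```python
import random

def choose_behavior(draw_u64=lambda: random.getrandbits(64)):
    """Pick 'pathological' roughly 10% of the time, 'default' otherwise.

    `draw_u64` is any callable returning a non-negative integer. The default
    uses the standard library; it is a stand-in for an SDK-provided source.
    """
    return "pathological" if draw_u64() % 10 == 0 else "default"

def run_custom_fault(draw_u64=lambda: random.getrandbits(64)):
    # Hypothetical actions; replace them with calls into your own system.
    actions = {
        "default": "trigger_compaction",
        "pathological": "rewrite_cluster_config",
    }
    action = actions[choose_behavior(draw_u64)]
    print(f"custom fault: {action}")
    return action

if __name__ == "__main__":
    run_custom_fault()
```

Keeping the random draw behind a single parameter makes it easy to test both branches and to change the source of randomness in one place.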
If you write a test action that simulates a fault, you could choose to define it as a custom fault, or you could have it be invoked by your test template. A custom fault is more flexible, as you will be able to run it in any one of your isolation units, whereas your workload will be in its own unit and will be less able to affect the rest of the system. However, putting the workload in charge of invoking your custom fault will give you fine-grained control over when it runs. This might be better if you believe that the fault ought to be invoked in very specific circumstances or as part of a custom validation step.
An example test run
Here’s what a 90-second window of a test run with all faults enabled might look like on a single execution branch, from the fault injector’s perspective. Times below use virtual time (vtime), Antithesis’s deterministic global clock. See Fault events in logs for more details.
- vtime 10 - test starts, system under test being set up, no active faults.
- vtime 12 - node hang event targets isolation unit B (max_duration: 8); B becomes unresponsive on the network.
- vtime 18 - partition event splits {A, C} from {B, D}; all cross-partition packets are dropped (disruption_type: Stopped, max_duration: 10).
- vtime 20 - B’s node hang expires.
- vtime 28 - partition expires; cross-partition links restored.
- vtime 40 - skip event jumps the clock forward by 30 s (max_duration: 1).
- vtime 41 - skip event jumps the clock back by 30 s to reverse the offset.
- vtime 75 - your workload runs ANTITHESIS_STOP_FAULTS 10 before a liveness assertion; all faults pause until vtime 85.
- vtime 85 - faults resume.
Every one of those events appears in the JSON log with a virtual timestamp, so you can reconstruct the same picture from the log alone, or view it on the multiverse map by filtering for fault injector events.
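As a sketch of what reconstructing the picture from the log could involve, the snippet below parses a few JSON lines and reports which faults are active at a given vtime. The field names (`vtime`, `fault`, `max_duration`) are hypothetical and are not the real log schema. The values are taken from the timeline above, and the model assumes each fault lasts exactly its `max_duration`.

```python
import json

# Hypothetical log lines; the real schema will differ.
log_lines = [
    '{"vtime": 12, "fault": "node hang", "max_duration": 8}',
    '{"vtime": 18, "fault": "partition", "max_duration": 10}',
    '{"vtime": 40, "fault": "skip", "max_duration": 1}',
]

events = [json.loads(line) for line in log_lines]

def active_faults(events, t):
    """Return the faults whose [start, start + max_duration) window covers t."""
    return [
        e["fault"]
        for e in events
        if e["vtime"] <= t < e["vtime"] + e["max_duration"]
    ]

print(active_faults(events, 19))  # ['node hang', 'partition']
print(active_faults(events, 25))  # ['partition']
print(active_faults(events, 30))  # []
```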