Types of faults

Antithesis injects faults at a pre-defined unit of isolation. A unit of isolation is sometimes also referred to as a node.

In systems using docker compose, the unit of isolation is a container. In systems using Kubernetes, the unit of isolation is a pod. Everything inside a unit shares the same fate. If you put two services in a unit, no network fault will ever fall between them. If you put them in separate units, they can be partitioned, lose their connection, or experience asymmetric loss between them.
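
As a rough illustration of the shared-fate rule, the sketch below maps services to isolation units and reports whether a network fault could ever fall between two of them. The service and unit names are hypothetical, and this is a toy model for reasoning about your layout, not something Antithesis itself runs.

```python
# Hypothetical layout: service name -> isolation unit (container or pod).
UNIT_OF = {
    "api": "unit-a",
    "cache": "unit-a",  # deliberately placed in the same unit as "api"
    "db": "unit-b",
}


def can_be_partitioned(service_x: str, service_y: str) -> bool:
    """Return True if a network fault could fall between the two services.

    Services in the same isolation unit share the same fate, so only
    services placed in different units can be separated.
    """
    return UNIT_OF[service_x] != UNIT_OF[service_y]


print(can_be_partitioned("api", "cache"))  # False: same unit
print(can_be_partitioned("api", "db"))     # True: different units
```

If you want the link between two services to be tested under network faults, this kind of check suggests placing them in separate units.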

There are two exceptions to this:

  • Clock jitter affects every isolation unit simultaneously because Antithesis simulates a shared wall clock, called virtual time, for the whole system.
  • Thread pausing acts at the thread level inside an isolation unit, and requires Antithesis instrumentation to take effect.

Network faults

Network faults relate to problems with packet delivery. Incoming and outgoing packets are treated independently, so all network faults may be asymmetric. One direction may be disrupted while the other works normally.

Baseline latency

This fault type simulates the normal variation in packet delivery time and a low baseline packet drop rate for all network traffic that flows through the Antithesis environment.

Partitions

This fault type simulates temporary failure of rack switches and similar hardware, which results in a network partition. Nodes in the network are temporarily separated into groups; nodes may communicate within their group, but communication between different groups is disrupted. The disruption takes one of three forms: Stopped (packets dropped entirely), Slowed (packets delayed with latency), or Jammed (packets queued for later delivery). Partitions can be configured by frequency, symmetry, number of partitioned groups, and duration.
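
The grouping rule can be sketched as a small model: traffic inside a group is delivered, and traffic between groups receives the configured disruption. This is a simplified illustration only; the function name and return values are hypothetical, and it ignores frequency, symmetry, and duration.

```python
from enum import Enum


class Disruption(Enum):
    STOPPED = "Stopped"  # packets dropped entirely
    SLOWED = "Slowed"    # packets delayed with latency
    JAMMED = "Jammed"    # packets queued for later delivery


def packet_fate(src, dst, groups, disruption):
    """Describe what happens to a packet sent from src to dst.

    groups is a list of sets of node names. Nodes in the same set can
    communicate normally; traffic between different sets is disrupted.
    """
    for group in groups:
        if src in group and dst in group:
            return "delivered"
    return disruption.value


groups = [{"A", "C"}, {"B", "D"}]
print(packet_fate("A", "C", groups, Disruption.STOPPED))  # delivered
print(packet_fate("A", "B", groups, Disruption.STOPPED))  # Stopped
```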

Clogs

This fault type simulates failure of the networking stack of one or more machines, resulting in their inability to communicate with other machines. Affected nodes will have their packet delivery disrupted for inbound packets, outbound packets, or both. The disruption takes one of three forms: Stopped (packets dropped entirely), Slowed (packets delayed with latency), or Jammed (packets queued for later delivery).
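
Because inbound and outbound packets are treated independently, a clog can be one-directional. The sketch below models that with a hypothetical `Clog` record; it is a way to reason about the behavior, not a description of how the fault injector is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clog:
    node: str       # the affected node
    inbound: bool   # disrupt packets arriving at the node
    outbound: bool  # disrupt packets leaving the node


def is_disrupted(src: str, dst: str, clog: Clog) -> bool:
    """Return True if a packet from src to dst is hit by the clog."""
    leaving_clogged_node = clog.outbound and src == clog.node
    arriving_at_clogged_node = clog.inbound and dst == clog.node
    return leaving_clogged_node or arriving_at_clogged_node


inbound_only = Clog(node="B", inbound=True, outbound=False)
print(is_disrupted("A", "B", inbound_only))  # True: B cannot receive
print(is_disrupted("B", "A", inbound_only))  # False: B can still send
```

A one-directional clog like this is worth keeping in mind for request/response protocols: a node may keep sending heartbeats while never seeing the replies.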

Restore

This fault type simulates a network recovery event, which stops all ongoing network faults across all network links in the system.

Node faults

Node faults relate to problems with individual nodes in the system, independent of the network links between them.

Node throttling

This fault type limits the CPU resources of one or more nodes, specifically the CPU bandwidth (in microseconds of processor time) they’re permitted to use. It surfaces bugs that manifest under load.

Node hang

This fault type simulates a container frozen in place for a temporary period of time. It remains on the network but cannot process anything, so other nodes will see timeouts when trying to communicate with it. The node is restored to proper functioning after a set period of time has elapsed.
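
From a peer's point of view, a hung node looks like a connection that opens but never answers. The sketch below shows one way a client might bound its wait using only the Python standard library. The function names and the local listener are hypothetical stand-ins; the listener imitates a hung node by letting the operating system complete the TCP handshake while the program never reads or replies.

```python
import socket


def fetch_with_timeout(host: str, port: int, timeout_s: float):
    """Send a request and return the reply, or None if the peer times out."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            conn.sendall(b"ping\n")
            return conn.recv(1024)
    except socket.timeout:
        # The peer is reachable on the network but is not processing anything.
        return None


def demo_hung_peer():
    """Talk to a listener that never accepts or answers."""
    listener = socket.socket()
    try:
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        listener.listen(1)
        port = listener.getsockname()[1]
        return fetch_with_timeout("127.0.0.1", port, timeout_s=0.2)
    finally:
        listener.close()


print(demo_hung_peer())  # None: the request timed out
```

Without an explicit timeout, the `recv` call would block until the hung node recovers.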

Node termination

This fault type simulates a node terminating either by shutting down gracefully (node stop) or by crashing (node kill). If the node is a container, it’ll be restarted after a set period of time has elapsed. If the node is a pod, Antithesis cannot restart it; the pod’s lifecycle is fully managed by Kubernetes.

Restarted nodes may change IP addresses upon rejoining the network. They may keep durably synced filesystem changes, or may lose all of their modified data and boot fresh from their image. If you are testing a data storage system, and want to configure customised durability behavior for particular directories, talk to your forward-deployed engineer.
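
Since a restarted node may keep only durably synced changes, it helps to be explicit about when your application syncs. The sketch below shows a common application-side pattern on POSIX systems: write to a temporary file, `fsync` it, atomically rename it into place, then `fsync` the directory. The function name is hypothetical, and whether a given restart preserves the data still depends on how durability is configured for your test.

```python
import os
import tempfile


def write_durably(path: str, data: bytes) -> None:
    """Replace the file at path with data and sync it to stable storage."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the OS
            os.fsync(tmp.fileno())  # ask the OS to persist the file contents
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        try:
            os.unlink(tmp_path)     # do not leave a stray temporary file
        except FileNotFoundError:
            pass
        raise
    # Sync the directory so the rename itself is persisted (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A write that skips the `fsync` calls may appear to succeed and still be missing after a node kill, which is the kind of bug this fault is meant to surface.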

Other faults

Thread pausing

This fault type simulates one individual thread taking an unexpectedly long time to execute. Threads are paused for small periods of time. These small pauses cause threads to interleave unexpectedly, and differently in each test run. For thread pausing to work, your system under test must be instrumented. Thread pause events are not recorded in the log to avoid excessive logging.

Clock jitter

This fault type simulates changes to the system clock by moving it forward and backward, and it affects all nodes equally. Jumps may be temporary (reversed after a set duration) or permanent, and multiple jumps stack cumulatively. These leaps in time surface bugs in processes that depend on small, consistent changes in system time.

Calls to intrinsics — such as __rdtsc() — will not be affected.
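
The way temporary and permanent jumps stack can be worked through with a small model. The `Jump` record and `clock_offset` function below are hypothetical, and the units are assumed to be seconds; the sketch only illustrates the arithmetic of cumulative offsets.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Jump:
    at: float                         # time at which the jump is applied
    offset: float                     # positive = forward, negative = backward
    duration: Optional[float] = None  # None means the jump is permanent


def clock_offset(jumps, now: float) -> float:
    """Return the total offset in effect at time now."""
    total = 0.0
    for jump in jumps:
        if now < jump.at:
            continue  # not applied yet
        if jump.duration is None or now < jump.at + jump.duration:
            total += jump.offset  # still active, so it stacks with the others
    return total


jumps = [
    Jump(at=40.0, offset=30.0, duration=1.0),  # temporary, reversed at 41.0
    Jump(at=40.5, offset=10.0, duration=2.0),  # temporary, reversed at 42.5
    Jump(at=50.0, offset=-5.0),                # permanent
]
print(clock_offset(jumps, 40.75))  # 40.0: both temporary jumps are active
print(clock_offset(jumps, 41.5))   # 10.0: the first jump has been reversed
print(clock_offset(jumps, 60.0))   # -5.0: only the permanent jump remains
```

Code that computes an elapsed time by subtracting two wall-clock readings would see these offsets directly, including negative durations when the clock moves backward.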

CPU modulation

This fault type simulates running your entire cluster on different hardware. The clock speed of the simulated processor can be changed, and the relative performance of different low-level processor operations may also change (for example, changing the speed of particular instructions while keeping the same overall clock speed). This fault surfaces bugs by changing the order in which concurrent threads or processes execute and can trigger some of the same bugs as our thread pausing fault without incurring the overhead of our instrumentation. CPU modulation events are not recorded in the log to avoid excessive logging.

Custom

You can create your own faults — you know your software best! If you write a script or program that simulates some sort of a fault, you can have our fault injector invoke it at random intervals.

Popular custom faults include connecting to administrative APIs and changing system configuration, triggering compaction or garbage-collection events, or forking troublesome background processes. Note that your custom fault can invoke our SDKs and, in particular, can use the SDK to draw random numbers which your custom fault can rely on. Your custom fault might therefore execute some default behavior 90% of the time, but a more pathological behavior 10% of the time.
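
The 90/10 split described above might be structured as follows. This is a minimal sketch: the two fault actions are hypothetical placeholders, and `random.random` from the standard library stands in for the random source. In a real custom fault you would presumably pass in a draw backed by the SDK's randomness functions instead, so the function takes the source as a parameter.

```python
import random
from typing import Callable


def trigger_compaction() -> str:
    # Hypothetical placeholder: a real script might call an admin API here.
    return "compaction"


def fork_noisy_background_process() -> str:
    # Hypothetical placeholder for the rarer, more pathological behavior.
    return "noisy-process"


def run_custom_fault(draw: Callable[[], float] = random.random) -> str:
    """Run the default behavior about 90% of the time, the other 10% otherwise.

    draw is assumed to return a float in the range [0.0, 1.0).
    """
    if draw() < 0.9:
        return trigger_compaction()
    return fork_noisy_background_process()


print(run_custom_fault(lambda: 0.5))   # compaction
print(run_custom_fault(lambda: 0.95))  # noisy-process
```

Passing the random source in as an argument also makes the fault script easy to exercise locally with fixed values.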

If you write a test action that simulates a fault, you could choose to define it as a custom fault, or you could have it be invoked by your test template. A custom fault is more flexible, as you will be able to run it in any one of your isolation units, whereas your workload will be in its own unit and will be less able to affect the rest of the system. However, putting the workload in charge of invoking your custom fault will give you fine-grained control over when it runs. This might be better if you believe that the fault ought to be invoked in very specific circumstances or as part of a custom validation step.

An example test run

Here’s what a 90-second window of a test run with all faults enabled might look like on a single execution branch from the fault injector’s perspective. Times below use virtual time (vtime), Antithesis’s deterministic global clock. For more details, see Fault events in logs.

  • vtime 10 - test starts, system under test is being set up, no active faults.
  • vtime 12 - node hang event targets isolation unit B (max_duration: 8); B becomes unresponsive on the network.
  • vtime 18 - partition event splits {A, C} from {B, D}; all cross-partition packets are dropped (disruption_type: Stopped, max_duration: 10).
  • vtime 20 - B’s node hang expires.
  • vtime 28 - partition expires; cross-partition links restored.
  • vtime 40 - skip event jumps the clock forward by 30 s (max_duration: 1).
  • vtime 41 - skip event jumps the clock back by 30 s to reverse the offset.
  • vtime 75 - your workload runs ANTITHESIS_STOP_FAULTS 10 before a liveness assertion; all faults pause until vtime 85.
  • vtime 85 - faults resume.

Every one of those events appears in the JSON log with a virtual timestamp, and you can reconstruct the same picture from the log alone or look at the multiverse map by filtering for fault injector events.
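
To see how the windows in the example overlap, the timeline can be written down as data and queried. The sketch below hard-codes the start and end times from the list above; it does not parse the actual JSON log, and the function names are hypothetical.

```python
# (description, start vtime, end vtime), taken from the example timeline.
FAULT_WINDOWS = [
    ("node hang on B", 12, 20),
    ("partition {A, C} / {B, D}", 18, 28),
    ("clock skip +30 s", 40, 41),
]
# Windows during which the workload has asked for faults to be paused.
PAUSE_WINDOWS = [(75, 85)]


def active_faults(vtime):
    """Return the descriptions of all faults in effect at vtime."""
    return [name for name, start, end in FAULT_WINDOWS if start <= vtime < end]


def faults_paused(vtime):
    """Return True if fault injection is paused at vtime."""
    return any(start <= vtime < end for start, end in PAUSE_WINDOWS)


print(active_faults(19))  # ['node hang on B', 'partition {A, C} / {B, D}']
print(active_faults(30))  # []
print(faults_paused(80))  # True
```

Between vtime 18 and 20, unit B is both hung and partitioned from A and C, which is the kind of overlap that is easy to miss when reading events one at a time.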
