Fault Injection#

Software testing is inherently a race. Can you find and fix bugs before your customers find them, or will embarrassing and expensive bugs be found in production? Unfortunately, developers start from behind: customers use the released product for many more hours than developers will ever spend testing it. Antithesis helps developers regain the lead: it offers parallelism to close this gap between customers and developers. But Antithesis also helps you go faster, through fault injection.

Antithesis can artificially inject real-world faults, such as network errors or machine crashes. But unlike real-world fault injection techniques such as Chaos Testing, Antithesis can perfectly reproduce any bugs that are found. Fault injection will exercise new sections of code, such as those related to error-handling and retry logic. (This is especially relevant for distributed systems.) However—even for the code you are already testing—it intentionally creates rare conditions. This will rapidly find bugs that otherwise have a low probability of surfacing.

Fault injection is useful for all classes of software, as it makes it both simple to find common bugs and possible to find rare bugs. With fault injection, every build will be tested far off the happy-path to immediately find newly-introduced bugs. By contrast, most customers will be using your software in fairly standard and expected ways. This contrast means Antithesis will be testing your software much more deeply than your customers will, which enables you to reliably win the race against customers to find bugs.

Types of faults#

All faults are injected at the pod level unless otherwise stated. That is, pods represent parts of your system that are isolated: faults will affect everything within that isolation unit without necessarily affecting anything else. These isolation units might happen to represent physical machines, virtual machines, AWS Lambda functions, etc. In Antithesis, placing software together in a pod ensures that everything in the pod will share the same fate, whereas placing software in separate pods does not. (It is not necessarily the case that faults will affect everything in the pod equally. For example, you might inject thread-pausing faults into a pod. This fault might happen to pause some threads in the pod but not others.)

A list of the types of faults used by Antithesis follows. Any type may be disabled upon request.

Network faults#

Network faults relate to problems with packet delivery. Modern cloud-deployed systems will see a wide variety of network abnormalities. Antithesis simulates many of the adverse conditions that your applications will see on a daily basis.

Incoming and outgoing packets for each pod are treated independently. All network faults are potentially asymmetric, so faults may be injected in only one direction of packet flow. Under asymmetric faults, a pod can experience problems sending packets while still receiving packets normally.

Important

Our network faults are applied at the pod level, which means that separate pieces of software running inside a single pod will never see network interruptions between them. This also means that when getting started, placing two pieces of software in a single pod is equivalent to assuming that they will never see any faults in their communication with one another. That's a risky bet!

Baseline latency#

Every packet sent across the network in the Antithesis environment experiences some baseline delay. This delay simulates the normal variation in delivery time which you experience with fully-functional networking equipment. Packet latency follows a configurable distribution and can be set up to match typical LAN or WAN conditions. The simulated network will also experience a very low baseline rate of dropped packets.

Network congestion#

The congestion fault type simulates periods of increased packet loss or reduced packet throughput. Packet latencies will be drawn from a special high-latency distribution rather than the baseline distribution described above. Latency is randomly applied to each individual packet, which means that packets are often reordered. Packet loss rates also increase when this fault type is active.
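
To make the reordering effect concrete, here is a minimal sketch (not Antithesis's implementation, and the exponential latency distribution is simply an assumption) of why drawing an independent latency for each packet causes packets to arrive out of order:

```python
import random

def deliver(packets, mean_latency_ms=50.0):
    """Toy model: each packet's arrival time is its send time plus an
    independently drawn latency, so later packets can overtake earlier ones."""
    arrivals = []
    for send_time_ms, payload in packets:
        latency_ms = random.expovariate(1.0 / mean_latency_ms)  # assumed shape
        arrivals.append((send_time_ms + latency_ms, payload))
    arrivals.sort()  # the receiver sees packets in arrival order
    return [payload for _, payload in arrivals]

# Five packets sent 10 ms apart frequently arrive out of order.
packets = [(i * 10.0, f"packet-{i}") for i in range(5)]
print(deliver(packets))
```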

Basic network faults#

Basic network faults simulate temporary degradations or failures of networking equipment within a rack or availability zone, which result in two computers (and thus two pods) momentarily losing their connection. In particular, this fault is applied to a stream of packets, which is a pair-wise connection between pods in a single direction. Within a stream, the fault can slow or drop packets, or intermittently suspend delivery of packets.

Partitions#

The partition fault type simulates temporary failure of rack switches and similar hardware, which results in a network partition. Pods in the network are temporarily separated into groups; pods may communicate within their group, but communication between different groups is disrupted. Partitions can be configured by frequency, symmetry, number of partitioned groups, and duration.
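
As a rough mental model (the pod names are hypothetical, and the sketch shows a symmetric partition, not Antithesis configuration), a partition splits pods into groups and only delivers traffic within a group:

```python
# Hypothetical grouping: two groups of pods, names are illustrative only.
partition = {
    "group-a": {"db-1", "db-2"},
    "group-b": {"db-3", "client"},
}

def can_communicate(src: str, dst: str) -> bool:
    """While the partition is active, only same-group traffic is delivered."""
    return any(src in group and dst in group for group in partition.values())

assert can_communicate("db-1", "db-2")        # same group: delivered
assert not can_communicate("db-1", "client")  # across groups: disrupted
```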

Important

In the event of a partition, other network faults (such as packet latencies) will still be active.

Bad nodes#

The bad nodes fault type simulates failure of the networking stack of one or more machines, which results in their inability to communicate with other machines. Affected pods will have their packet delivery disrupted for inbound packets, outbound packets, or both. The disruption may consist of placing the packet in a queue for later delivery or of dropping it entirely. This fault may affect multiple pods simultaneously.

Other faults#

Node throttling#

The node throttling fault type simulates one or more nodes being overloaded or temporarily less responsive. A targeted pod is limited in the amount of CPU resources it is permitted to use, which limits its responsiveness (as if it were under a heavy load). Pods may be throttled by limiting the total percentage of CPU they are permitted to use or by limiting the bandwidth (in microseconds of processor time) that they are permitted to use.

Throttling turns up bugs that would otherwise be found with load testing or that would be exposed by a DoS attack.
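
For intuition about the two styles of limit, here is a small sketch. The conversion assumes semantics like Linux cgroups v2 CPU bandwidth control (a quota of microseconds of CPU time per scheduling period); Antithesis's internal mechanism may differ.

```python
PERIOD_US = 100_000  # assumed scheduling period of 100 ms

def quota_us_for_percentage(percent: float) -> int:
    """Convert a 'percent of one CPU' limit into a bandwidth quota:
    microseconds of CPU time allowed per scheduling period."""
    return int(PERIOD_US * percent / 100.0)

# Throttling a pod to 25% of a CPU allows 25,000 us of CPU time in every
# 100,000 us window; work beyond the quota is delayed to the next window.
print(quota_us_for_percentage(25))  # 25000
```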

Node hang#

The node hang fault type simulates a node hanging and becoming totally unresponsive. We restore the node to proper functioning after a random period of time has elapsed.

Thread pausing#

The thread pausing fault type simulates one individual thread taking an unexpectedly long time to execute. Threads are paused for small periods of time. These small pauses cause threads to interleave unexpectedly, and differently in each test run. For thread pausing to work, a program must be instrumented.
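
The following toy program (which does not use Antithesis's instrumentation; the pause points are simulated with explicit sleeps) shows how small random pauses change thread interleavings and expose a read-modify-write race that rarely fires otherwise:

```python
import random
import threading
import time

counter = 0

def maybe_pause():
    # Stand-in for an instrumentation-driven pause: occasionally suspend this
    # thread for a few milliseconds, changing how the threads interleave.
    if random.random() < 0.5:
        time.sleep(random.uniform(0.001, 0.005))

def unsafe_increment():
    global counter
    for _ in range(100):
        value = counter       # read
        maybe_pause()         # a pause here widens the read-modify-write race
        counter = value + 1   # write back, possibly clobbering other updates

threads = [threading.Thread(target=unsafe_increment) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # frequently far less than 400 because increments are lost
```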

Clock jitter#

The clock jitter fault type simulates changes to the system clock, such as daylight saving time, leap seconds, and administrators changing the timezone. The system clock can be moved forwards and backwards. This fault affects all pods equally, which is an exception to our general rule of applying faults at the pod level.

These leaps in time surface bugs in processes that depend on small, consistent changes in system time. Calls to intrinsics—such as __rdtsc()—will not be affected.
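
A typical example of such a bug is measuring elapsed time with the wall clock instead of a monotonic clock. A minimal sketch (the sleep is a placeholder for real work):

```python
import time

start_wall = time.time()       # system clock: affected by clock jitter
start_mono = time.monotonic()  # monotonic clock: unaffected by clock changes

time.sleep(0.1)  # placeholder for real work, during which the clock may jump

elapsed_wall = time.time() - start_wall       # can be negative or huge if the
                                              # system clock is stepped mid-run
elapsed_mono = time.monotonic() - start_mono  # always a sensible duration
print(elapsed_wall, elapsed_mono)
```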

CPU modulation#

The CPU modulation fault type simulates running your entire cluster on different hardware. The clock speed of the simulated processor can be changed, and the relative performance of different low-level processor operations may also change (for example, changing the speed of particular instructions while keeping the same overall clock speed). This fault surfaces bugs by changing the order in which concurrent threads or processes execute and can trigger some of the same bugs as our Thread Pausing fault without incurring the overhead of our instrumentation.

Node termination#

The node termination fault type simulates a node terminating, either by shutting down gracefully or by crashing. Cloud-deployed architectures will always need to be robust to machine transience. After either of these events, we restore the node to proper functioning once a random period of time has elapsed.

Nodes which have restarted may change IP addresses upon rejoining the network. They may keep durably synced filesystem changes, or they may lose all of their modified data and boot fresh from their image. If you are testing a data storage system and want to configure more detailed durability behavior for particular directories, contact us and we can customize this functionality.
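
In practice, "durably synced" changes are the ones that were flushed to stable storage before the crash. A minimal sketch (the file name and data are illustrative):

```python
import os

def durable_write(path: str, data: bytes) -> None:
    """Write data so it should survive an abrupt node termination."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's buffer down to the OS
        os.fsync(f.fileno())  # ask the OS to push it to stable storage

durable_write("checkpoint.bin", b"state-42")
# Without flush + fsync, the data may only exist in the OS page cache, and a
# node crash right after the write could leave the file empty or stale on
# restart. (A fully robust version would also fsync the containing directory.)
```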

Custom#

You can create your own faults—you know your software best! The above types are all available to enable you to immediately begin injecting common faults. But you can also write your own script or program that simulates some sort of a fault, and our fault injector will invoke it at random intervals.

Popular usages include connecting to administrative APIs and changing system configuration, triggering compaction or garbage-collection events, or forking troublesome background processes. Note that your custom fault can invoke our SDKs and, in particular, can use the SDK to draw random numbers which your custom fault can rely on. Your custom fault might therefore execute some default behavior 90% of the time, but a more pathological behavior 10% of the time.
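
As a sketch of that pattern (the admin commands are illustrative, and sdk_random() is a placeholder for the Antithesis SDK's random-number call, whose exact name depends on your SDK language):

```python
import random
import subprocess

def sdk_random() -> int:
    # Placeholder: swap this for the Antithesis SDK's random-number call so
    # that Antithesis can steer and reproduce the choice.
    return random.getrandbits(64)

def inject_fault():
    if sdk_random() % 10 == 0:
        # Pathological behavior, roughly 10% of the time: for example, force
        # a compaction through an administrative API (command is illustrative).
        subprocess.run(["mydb-admin", "compact", "--all"], check=False)
    else:
        # Default behavior, roughly 90% of the time: a routine config tweak.
        subprocess.run(["mydb-admin", "set-config", "cache_mb=64"], check=False)

if __name__ == "__main__":
    inject_fault()
```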

If you write a test action that simulates a fault, you could choose to define it as a custom fault, or you could have it invoked by your workload. A custom fault is more flexible, since you will be able to run it in any one of your pods, whereas your workload runs in its own container and is less able to affect the rest of the system. However, putting the workload in charge of invoking your custom fault gives you fine-grained control over when it runs. This might be better if you believe that the fault ought to be invoked in very specific circumstances or as part of a custom validation step.