Antithesis logomark
DOCS

Pausing faults

There are times during a test when you need a window of normal operation, without faults, to test a system’s liveness properties, or check an invariant, or the system’s ability to recover from a failure. Antithesis provides a mechanism to request a “quiet period”.

Antithesis injects a ready-to-use binary on every container or pod path in a test run. A request to stop faults affects all fault types (except clock jitter, which continues normally during a quiet period) and all containers or pods, so the entire system is granted a recovery period, not just select services. You must provide a quiet period duration in seconds.

Invoke it from any test command or script (provided in bash but easily translatable into any programming language):

[ "${ANTITHESIS_STOP_FAULTS}" ] && "${ANTITHESIS_STOP_FAULTS}" <duration_seconds>

When you invoke it:

  1. All fault injection stops and no new faults are scheduled.
  2. The simulated network and killed containers are restored, but just like any container restart operation, restored containers will take some time to be fully operational.
  3. Fault injection will automatically resume after the requested duration_seconds has elapsed.
  4. Any overlapping quiet period requests will be merged to reflect the biggest interval.
Overlapping requests

Pattern: mid-run liveness check

A common workload pattern uses ANTITHESIS_STOP_FAULTS to assert an invariant in the middle of a run without giving up the rest of the test budget:

  1. Run your workload operations while faults are active.
  2. Call ANTITHESIS_STOP_FAULTS <SECONDS> with enough time for the system to stabilize.
  3. Poll for health (retry reads until they succeed, wait for replicas to converge, etc.).
  4. Assert your liveness property — for example, “all replicas converge to the same value”, “queued work eventually drains”, “every committed write is readable”.
  5. Resume the workload. Faults will restart automatically when the quiet period elapses.

This pattern is particularly useful during rolling operations — upgrades, schema migrations, config rollouts — where you want to verify the system is healthy at each step before continuing.

Anti-patterns

  • Don’t use a quiet period to hide flakiness. If a property only passes during quiet periods but fails during normal operation, that’s a real bug.
  • Don’t assume restarted containers are immediately reachable. A quiet period restores killed containers but they need time to come up. Add retry loops to ensure successful restoration before your liveness check.