Pausing faults
There are times during a test when you need a window of normal operation, without faults, to test a system’s liveness properties, or check an invariant, or the system’s ability to recover from a failure. Antithesis provides a mechanism to request a “quiet period”.
Antithesis injects a ready-to-use binary on every container or pod path in a test run. A request to stop faults affects all fault types (except clock jitter, which continues normally during a quiet period) and all containers or pods, so the entire system is granted a recovery period, not just select services. You must provide a quiet period duration in seconds.
Invoke it from any test command or script (provided in bash but easily translatable into any programming language):
[ "${ANTITHESIS_STOP_FAULTS}" ] && "${ANTITHESIS_STOP_FAULTS}" <duration_seconds>When you invoke it:
- All fault injection stops and no new faults are scheduled.
- The simulated network and killed containers are restored, but just like any container restart operation, restored containers will take some time to be fully operational.
- Fault injection will automatically resume after the requested
duration_secondshas elapsed. - Any overlapping quiet period requests will be merged to reflect the biggest interval.
Pattern: mid-run liveness check
A common workload pattern uses ANTITHESIS_STOP_FAULTS to assert an invariant in the middle of a run without giving up the rest of the test budget:
- Run your workload operations while faults are active.
- Call
ANTITHESIS_STOP_FAULTS <SECONDS>with enough time for the system to stabilize. - Poll for health (retry reads until they succeed, wait for replicas to converge, etc.).
- Assert your liveness property — for example, “all replicas converge to the same value”, “queued work eventually drains”, “every committed write is readable”.
- Resume the workload. Faults will restart automatically when the quiet period elapses.
This pattern is particularly useful during rolling operations — upgrades, schema migrations, config rollouts — where you want to verify the system is healthy at each step before continuing.
Anti-patterns
- Don’t use a quiet period to hide flakiness. If a property only passes during quiet periods but fails during normal operation, that’s a real bug.
- Don’t assume restarted containers are immediately reachable. A quiet period restores killed containers but they need time to come up. Add retry loops to ensure successful restoration before your liveness check.