Optimizing for Antithesis

Configure for testing

The best way to run a system in production and the best way to run it in a testing environment can be very different. In production, you want everything to be as efficient and stable as possible. In testing, you want to find bugs quickly, and sometimes the best way to do that is to configure your software in an unusual way. For example:

  • A heartbeat or failure-detection threshold might be set to a value on the order of minutes in production to avoid triggering spuriously during a GC pause. If you hope to test failover code in Antithesis, it makes sense to set this threshold much lower, on the same timescale as the network faults that we inject.

  • A periodic background process, such as data compaction, data validation, or garbage collection, might run every 48 hours in your production environment. Since most Antithesis tests simulate a much shorter real-time duration, it might make sense to run this process every minute in our tests instead.

  • A distributed storage system might split data into shards whenever the volume on a single node exceeds 1TB. If we run the entire test with no more than 1MB of data, the shard-splitting and data-movement logic can never be exercised by our tests, which would be a serious weakness. In this case, you might trigger shard splitting at 1KB in the Antithesis test environment.
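
The three examples above can be sketched as a single set of tuning parameters with a test-specific override. This is a minimal illustration, not a prescribed layout: the `Tuning` class, the `APP_PROFILE` environment variable, and the specific test values (such as the 2-second heartbeat timeout) are all assumptions made for the example.

```python
import os
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Tuning:
    """Hypothetical tuning parameters, with production-style defaults."""
    heartbeat_timeout_s: float = 120.0            # minutes-scale in production
    compaction_interval_s: float = 48 * 3600.0    # every 48 hours
    shard_split_bytes: int = 10**12               # split at roughly 1TB


# Assumed test values: short enough that failover, compaction, and
# shard splitting all have a chance to run within a short test.
TEST_OVERRIDES = {
    "heartbeat_timeout_s": 2.0,
    "compaction_interval_s": 60.0,
    "shard_split_bytes": 1024,
}


def load_tuning(env=None):
    """Return test tuning when the (hypothetical) APP_PROFILE variable asks for it."""
    env = os.environ if env is None else env
    if env.get("APP_PROFILE") == "antithesis":
        return replace(Tuning(), **TEST_OVERRIDES)
    return Tuning()
```

Keeping the overrides in one place makes it easy to see exactly how the test configuration differs from production.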

A great way to find opportunities like this is to create Sometimes Assertions for tricky or sensitive pieces of your code, especially ones that are rarely exercised in production. You can then define custom properties around those assertions and monitor their status in your triage report. Is the code in question exercised in every Antithesis run? How many times is it reached across the whole run? These questions will often guide you to a configuration option or tuning parameter in your software that should be set differently in our tests.

Avoid excessive output

When running in production, there’s a temptation to log everything that could conceivably be useful, just in case some rare issue occurs and is never seen again. But the deterministic replayability of the Antithesis environment makes this kind of caution unnecessary. If Antithesis provokes a rare issue, we can guarantee that it will be reproduced as many times as you need to resolve it. Consequently, the best practice when running within Antithesis is to use moderate amounts of logging, and to selectively enable verbose logging for particular issues where it will be helpful.

This works best if your application supports dynamic logging control. In that case, we can run with moderate logging all the time, and when we find a bug, we can rewind the simulation a couple of seconds and enable verbose logging.
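
As a rough sketch of what dynamic logging control can look like in an application, the snippet below switches a logger between moderate and verbose levels at runtime using Python's standard `logging` module. The logger name `myapp` and the `set_verbosity` helper are assumptions for illustration; how the switch is triggered (an admin endpoint, a signal, a config reload) depends on your application.

```python
import logging

# Hypothetical application logger.
logger = logging.getLogger("myapp")


def set_verbosity(verbose):
    """Switch between moderate (INFO) and verbose (DEBUG) logging at runtime."""
    logger.setLevel(logging.DEBUG if verbose else logging.INFO)


# Start with moderate logging; verbose logging can be enabled later
# without restarting the process.
set_verbosity(False)
```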

If your application does not support dynamic control of logging, an alternative approach is to filter your logs so that verbose information is not captured by Antithesis. Then, when an issue is found, we can rewind the simulation a few seconds and selectively disable the filter. Contact us for more details.

Your triage report will automatically surface information about excessive output. Specifically, under the Setup property group, there is a default property called Output was limited. If this property is failing, it means we detected excessive output from your software, and that the performance of the Antithesis platform is likely being degraded. The details section of this property will list the total output from your program across the duration of the run, the number of test hours consumed in the run, and the ratio of output to test hours. For best results, this ratio should be kept under 200MB/test hour.
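
The ratio described above is simple arithmetic, sketched here for clarity. The function names are invented for this example, and the check only mirrors the 200MB/test hour guidance; it is not the platform's actual implementation of the property.

```python
RECOMMENDED_MB_PER_TEST_HOUR = 200.0  # guidance figure from the text above


def output_ratio(total_output_mb, test_hours):
    """Return output volume per test hour, in MB."""
    if test_hours <= 0:
        raise ValueError("test_hours must be positive")
    return total_output_mb / test_hours


def within_recommended_output(total_output_mb, test_hours):
    """True when the run stays under the recommended output ratio."""
    return output_ratio(total_output_mb, test_hours) < RECOMMENDED_MB_PER_TEST_HOUR
```

For example, 1,500MB of output over 10 test hours gives 150MB/test hour, which is within the guidance, while 5,000MB over the same 10 hours gives 500MB/test hour, which is not.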

Do more with less data

If your test setup is a lift-and-shift of your production environment, or a port of an existing load test, it may involve large data volumes. Using large data volumes in an Antithesis test is often well-intentioned, because there may be rarely exercised pieces of code that only run in the presence of large amounts of data. For optimal test efficiency, we instead recommend adjusting your tuning parameters so that you continue to get good coverage with a smaller volume of data.

Antithesis does not permit you to use real customer data or data including PII/PHI in your tests. If your system is designed to process customer information, please ensure that you are using synthetic data.

You may want to cap the memory used by your system to make errors easier to surface. You can use your triage report to monitor how close you are to the system’s memory limits. Specifically, in the Performance property group, there is a default property called Peak memory usage. The value of this property tracks the highest instantaneous memory utilization of your software, as a percentage of the total memory available to the system. If this value ever goes over 95%, the property will switch to Failing, to alert you to the fact that you are close to the limit. In addition to catching misconfiguration, this property can also catch real bugs, such as memory leaks or potential denial-of-service attacks!
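
The behavior of this property can be pictured with a small sketch: take the highest memory sample, express it as a percentage of total memory, and compare it against the 95% threshold. The function names and the "Passing" label are assumptions for illustration; this is not how the platform itself computes the property.

```python
FAIL_THRESHOLD_PERCENT = 95.0  # threshold described in the text above


def peak_memory_percent(samples_bytes, total_bytes):
    """Highest sampled memory usage as a percentage of total memory."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * max(samples_bytes) / total_bytes


def peak_memory_status(samples_bytes, total_bytes):
    """'Failing' once the peak exceeds the threshold; 'Passing' is an assumed label."""
    peak = peak_memory_percent(samples_bytes, total_bytes)
    return "Failing" if peak > FAIL_THRESHOLD_PERCENT else "Passing"
```

Because the check uses the peak rather than the average, a single brief spike is enough to flag the run, which is what makes it useful for spotting leaks.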

Monitor simulation efficiency

The more efficient your software is, the more quickly Antithesis can simulate it. In the best case, your tests will actually execute faster in Antithesis than they will in the real world, because our simulation can fast-forward through periods of low CPU usage. You can track your simulation efficiency in the triage report. Specifically, in the Test Efficiency property group, there is a default property called Virtual time per wall time. This tracks the ratio of time passing in the simulation to time passing in the real world. If this number is greater than 1, it means we can run your tests very efficiently.

There can be good reasons for this number to be low. Remember that we are simulating all of the services you provided to us simultaneously, plus all of their dependencies. If you provided us with a very large and complex software architecture, it may take a lot of work to simulate all of its components, and that’s fine. A good rule of thumb is that if you multiply the value of this property by the number of containers in your setup, the result should definitely be greater than one.
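
The rule of thumb amounts to one multiplication, shown here as a short sketch. The function name is invented for this example, and the sample figures are made up.

```python
def normalized_efficiency(virtual_per_wall, containers):
    """Scale 'Virtual time per wall time' by the number of containers."""
    if containers < 1:
        raise ValueError("containers must be at least 1")
    return virtual_per_wall * containers


# Made-up example: 8 containers at a ratio of 0.25 normalizes to 2.0,
# which is comfortably above one.
healthy = normalized_efficiency(0.25, 8)

# Made-up example: 8 containers at a ratio of 0.05 normalizes to about 0.4,
# which suggests an efficiency problem worth investigating.
suspect = normalized_efficiency(0.05, 8)
```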

If, after normalizing the value in this way, you are still experiencing poor simulation efficiency, some common causes include:

  • Code that “busy waits” when idling. This prevents us from ever fast-forwarding the simulation, but it’s also a common performance problem in the real world!
  • Heavy usage of non-deterministic CPU instructions. The Antithesis platform must trap and emulate CPU instructions such as RDTSC or RDRAND in order to maintain its deterministic replayability. This incurs some performance overhead, which can be reduced by minimizing the use of these instructions.
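
To make the busy-wait point concrete, the sketch below contrasts a spinning wait with a blocking wait using Python's standard `threading.Event`. The function names are illustrative. The first version keeps the CPU busy the whole time it waits; the second lets the thread sleep until it is signalled, leaving idle time in which nothing needs to run.

```python
import threading


def wait_busy(flag):
    """Anti-pattern: spin until the flag is set, burning CPU the whole time."""
    spins = 0
    while not flag.is_set():
        spins += 1
    return spins


def wait_blocking(flag, timeout_s=None):
    """Preferred: block until the flag is set (or the timeout expires)."""
    return flag.wait(timeout_s)
```

The same principle applies to polling loops in general: prefer blocking primitives, condition variables, or timed sleeps over tight loops that repeatedly check a condition.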

Antithesis can help diagnose simulation-impacting performance issues, or real-world performance bugs that we trigger, with our built-in performance analysis features. Contact us about enabling system profiling for your environment.