Optimizing for Antithesis
The best way to run a system in production and the best way to run it in a testing environment can be very different. In production, you want your system to be stable. In testing, you want to find bugs quickly. This page suggests ways to optimize your system for testing in Antithesis.
Configure for testing
You may want to configure your software in an unusual way. For example:
- Your production environment might have a periodic background process, such as data compaction, data validation, or garbage collection, that runs every 48 hours. Since most Antithesis tests simulate a much shorter realtime duration, it might make sense to run this process every minute in our tests instead.
- Your production code might have a heartbeat or failure-detection threshold that is set to a value on the order of minutes to avoid spuriously triggering during a GC pause. In Antithesis, you may want to set this to a much lower value, on the same timescale as the network faults that we inject, so that you can test your failover code.
- A distributed storage system might split data into shards whenever the volume on a single node exceeds 1TB. If Antithesis runs the entire test with 1MB of data, the shard splitting and data movement logic will never be exercised by our tests. In this case you might enable shard splitting every 1KB in the Antithesis test environment.
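One way to keep such overrides in one place is a small tuning profile selected at startup. The sketch below is illustrative only: the `Tuning` class, the `APP_TEST_PROFILE` environment variable, and the specific heartbeat values are hypothetical names and numbers chosen for the example, not part of Antithesis or of any particular system.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Tuning:
    """Hypothetical tuning parameters mirroring the three examples above."""
    compaction_interval_s: int   # how often the background process runs
    heartbeat_timeout_s: float   # failure-detection threshold
    shard_split_bytes: int       # data volume that triggers a shard split


# Production-style values: 48 hours, a timeout of minutes, 1 TB per shard.
PRODUCTION = Tuning(
    compaction_interval_s=48 * 3600,
    heartbeat_timeout_s=120.0,   # assumed value "on the order of minutes"
    shard_split_bytes=10**12,
)

# Test-style values: every minute, a short timeout, 1 KB per shard.
TESTING = Tuning(
    compaction_interval_s=60,
    heartbeat_timeout_s=2.0,     # assumed value; match your fault timescale
    shard_split_bytes=10**3,
)


def load_tuning(env: Optional[Mapping[str, str]] = None) -> Tuning:
    """Pick the test profile when the (hypothetical) env var asks for it."""
    env = os.environ if env is None else env
    if env.get("APP_TEST_PROFILE") == "antithesis":
        return TESTING
    return PRODUCTION
```

Keeping both profiles side by side makes it easy to see which parameters differ between production and testing, and to add new overrides as you discover rarely exercised code paths.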
A great way to tell if there are opportunities like this for improving your testing is to create Sometimes Assertions for tricky or sensitive pieces of your code, especially ones that are rarely exercised in production. You can then define custom properties around those assertions, and monitor their status in your triage report. Is the code in question exercised in every Antithesis run? How many times is it seen across the whole run? These questions will often guide you to a configuration option or tuning parameter in your software that should be set differently in our tests.
Avoid excessive logging output
When running within Antithesis, it’s best to use moderate amounts of logging, and to selectively enable verbose logging for particular issues where it will be helpful. In production, there’s a temptation to log everything that could conceivably be useful, just in case some rare issue occurs and is never seen again. But the deterministic replayability of the Antithesis environment makes this kind of caution unnecessary. If Antithesis provokes a rare issue, we can guarantee that it will be replicated as many times as you need to solve the issue.
This works best if your application supports dynamic logging control. In that case, we can run with moderate logging all the time and, when we find a bug, rewind the simulation a couple of seconds and enable verbose logging.
If your application does not support dynamic logging control, an alternative approach is to filter your logs so that verbose information is not captured by Antithesis. Then, when an issue is found, we can rewind the simulation a few seconds and selectively disable the filter. Contact us for more details.
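As a rough illustration of what dynamic logging control can look like inside an application, here is a minimal sketch using Python's standard `logging` module. The logger name `myapp` and the `set_verbosity` function are hypothetical. How the function gets called at runtime (an admin endpoint, a signal handler, a config reload) depends on your application and is not shown.

```python
import logging

# Hypothetical application logger; moderate verbosity by default.
logger = logging.getLogger("myapp")
logger.setLevel(logging.INFO)

# Names an operator is allowed to switch between at runtime.
_LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warning": logging.WARNING,
}


def set_verbosity(level_name: str) -> None:
    """Change the logger's level while the process keeps running."""
    try:
        level = _LEVELS[level_name.lower()]
    except KeyError:
        raise ValueError(f"unknown log level: {level_name!r}") from None
    logger.setLevel(level)
```

With a hook like this, a run can stay at `info` by default and switch to `debug` only for the few seconds around an issue under investigation.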
Your triage report will automatically surface information about excessive output to you. Go to the Setup property group and view the Output was limited default property. If this property is failing, it means we detected excessive output from your software, and that the performance of the Antithesis platform is likely being degraded. The details section of this property will list the total output from your program across the duration of the run, the number of test hours consumed in the run, and the ratio of output/test hours. For best results, this ratio should be kept under 200MB/test hour.
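The ratio is simple arithmetic, and it can be useful to estimate it before a run. The sketch below assumes 1 MB means 1,000,000 bytes, which may not match the unit the triage report uses. The function and constant names are hypothetical.

```python
# Recommended ceiling from the text above, in MB per test hour.
RECOMMENDED_MAX_MB_PER_TEST_HOUR = 200.0


def output_mb_per_test_hour(total_output_bytes: int, test_hours: float) -> float:
    """Ratio of total program output to test hours consumed."""
    if test_hours <= 0:
        raise ValueError("test_hours must be positive")
    # Assumption: 1 MB = 1,000,000 bytes.
    return total_output_bytes / 1_000_000 / test_hours


def output_is_excessive(total_output_bytes: int, test_hours: float) -> bool:
    """True when the ratio exceeds the recommended ceiling."""
    ratio = output_mb_per_test_hour(total_output_bytes, test_hours)
    return ratio > RECOMMENDED_MAX_MB_PER_TEST_HOUR
```

For example, 3 GB of output over 10 test hours works out to 300 MB per test hour, which is above the recommended ceiling, while 1 GB over the same 10 hours is 100 MB per test hour and is within it.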
Use smaller data volumes
If your test setup is a lift-and-shift of your production environment, or a port of an existing load test, it may involve large data volumes. When running in Antithesis, we recommend that you use smaller data volumes.
You may be using large data volumes because you have rarely-exercised pieces of code that only run in the presence of large amounts of data. For optimal test efficiency, we recommend instead configuring your tuning parameters differently so that you continue to get good coverage with a smaller volume of data.
Antithesis does not permit you to use real customer data or data including PII/PHI in your tests. If your system is designed to process customer information, please ensure that you are using synthetic data.
Use less memory
You may want to cap the memory used by your system to make errors easier to surface. You can use the triage report to monitor how close you are to the system’s memory limits. Go to the Performance property group and view the Peak memory usage default property. The value of this property tracks the highest instantaneous memory utilization of your software, as a percentage of the total memory available to the system. If this value ever goes over 95%, the property will switch to Failing, to alert you to the fact that you are close to the limit. In addition to catching misconfiguration, this property can also catch real bugs, such as memory leaks or potential denial-of-service attacks!
Monitor simulation efficiency
The more efficient your software is, the more quickly Antithesis can simulate it. In the best case, your tests will actually execute faster in Antithesis than they will in the real world, because our simulation can fast-forward through periods of low CPU usage.
You can track your simulation efficiency in the triage report. Go to the Test Efficiency property group and view the Virtual time per wall time default property. This tracks the ratio of time passing in the simulation to time passing in the real world. If this number is greater than 1, it means we can run your tests very efficiently.
There can be good reasons for this number to be low. Remember that we are simulating all of the services you provided to us simultaneously, plus all of their dependencies. If you provided us with a very large and complex software architecture, it may take a lot of work to simulate all of its components, and that’s fine. A good rule of thumb is to multiply the value of this property by the number of containers in your setup: the result should definitely be greater than one.
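The rule of thumb can be written out as a quick check. This is only a sketch of the arithmetic described above. The function names are hypothetical and the sample numbers are made up for illustration.

```python
def normalized_efficiency(virtual_per_wall: float, containers: int) -> float:
    """Virtual time per wall time, scaled by the number of containers."""
    if containers < 1:
        raise ValueError("containers must be at least 1")
    return virtual_per_wall * containers


def efficiency_looks_healthy(virtual_per_wall: float, containers: int) -> bool:
    """Rule of thumb: the normalized value should be greater than one."""
    return normalized_efficiency(virtual_per_wall, containers) > 1.0
```

For instance, a ratio of 0.25 with 8 containers normalizes to 2.0, which passes the rule of thumb, whereas a ratio of 0.05 with the same 8 containers normalizes to 0.4 and is worth investigating.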
If after normalizing the value in this way, you are still experiencing poor simulation efficiency, some common causes include:
- Code which “busy waits” when idling. This prevents us from ever fast-forwarding the simulation, but it’s also a common performance problem in the real world!
- Heavy usage of non-deterministic CPU instructions. The Antithesis platform must trap and emulate CPU instructions such as RDTSC or RDRAND in order to maintain its deterministic replayability. This incurs some performance overhead, which can be reduced by minimizing the use of these instructions.
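To make the busy-wait item concrete, the sketch below contrasts a spinning loop with a blocking wait, using Python's standard `threading.Event`. The function names are hypothetical, and the same idea applies to condition variables, blocking queues, or blocking I/O in other languages.

```python
import threading
from typing import Optional


def wait_busy(ready: threading.Event) -> None:
    """Anti-pattern: spins at full CPU until the flag is set."""
    while not ready.is_set():
        pass  # burns CPU, so the process never looks idle


def wait_blocking(ready: threading.Event, timeout: Optional[float] = None) -> bool:
    """Preferred: sleeps until the flag is set or the timeout expires."""
    return ready.wait(timeout)


def run_worker_once() -> bool:
    """Start a worker that blocks on the flag, then release it."""
    ready = threading.Event()
    results = []
    worker = threading.Thread(
        target=lambda: results.append(wait_blocking(ready, timeout=5.0))
    )
    worker.start()
    ready.set()      # wake the worker
    worker.join()
    return results[0]
```

The blocking version leaves the thread idle while it waits, which is the kind of low-CPU period the simulation can fast-forward through.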
Antithesis can help diagnose simulation-impacting performance issues, or real-world performance bugs that we trigger, with our built-in performance analysis features. Contact us about enabling system profiling for your environment.