Bug likelihood over time

The Bug Likelihood Over Time section of the bug report graphs the probability of a bug over time. It helps you find bugs that take time to surface while your software is running.

Some particularly painful bugs are triggered early in execution but only cause errors or crash the software later. For example, consider a system that doesn’t validate input for one of its APIs and accepts arbitrarily long inputs. It might receive a very long input and store it successfully, but crash later when a different request queries that data. In a given execution history of this software, whether a test case or a real production scenario, the crash is baked in the moment the long input is stored, but the actual crash may happen much later. This is painful to debug because the logs at the moment of the crash might not reveal the origin of the problem: the later request to query the data was perfectly valid.

Antithesis uses multiverse debugging to estimate, for each point in the execution history, the probability that the bug was already baked in. This guides you to the section of the logs where the bug became inevitable.
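To make the input-validation example concrete, here is a minimal sketch of such a latent bug. Everything in it is hypothetical (the `NameStore` class, the `RESPONSE_LIMIT` constant, and the `RuntimeError` that stands in for the crash); it only illustrates how a request can succeed while baking in a failure that surfaces on a later, perfectly valid request.

```python
RESPONSE_LIMIT = 64  # hypothetical: the largest name a query response can hold


class NameStore:
    """Toy system with no input validation on insert (hypothetical example)."""

    def __init__(self):
        self.names = {}

    def insert(self, key, name):
        # Bug: no length check, so arbitrarily long input is accepted and stored.
        self.names[key] = name

    def query(self, key):
        name = self.names[key]
        if len(name) > RESPONSE_LIMIT:
            # Stands in for the crash: the stored value cannot be returned.
            raise RuntimeError("response buffer overflow")
        return name


store = NameStore()
store.insert("a", "x" * 1000)  # succeeds, but the crash is now baked in
store.insert("b", "Alice")     # unrelated traffic keeps working
store.query("b")

try:
    store.query("a")           # a valid request, yet this is where it fails
except RuntimeError as err:
    print(f"crash on query: {err}")
```

The error appears at the `query` call, so a log line written at that moment would point at a valid request rather than at the earlier `insert` that caused the problem.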

This graph takes a single test case and plots, over time, the probability that the bug has been baked in. It does not address when a bug was committed to the program’s source code; the property history in the triage report helps identify which version control commit introduced a bug. In short, the property history tells you which commit to examine, and the bug likelihood over time tells you which section of the logs to examine.

Consider the example below:

[Figure: time series graph of bug likelihood]

This graph plots the likelihood over time of two different instances of the same bug. In both instances, the probability of the bug rises sharply between 30 and 20 seconds before the bug was detected. At 10 seconds before bug detection, the bug likelihood is 60%: if we rewind to 10 seconds before the bug and use multiverse debugging to try 100 alternate histories from that point onward, the bug happens in 60 of them. At 45 seconds before bug detection, the bug likelihood is only 5%: if we rewind to that point instead and try 100 alternate histories, only 5 of them ultimately hit the bug. We conclude that in the original history, something very important happened 20 to 30 seconds before the bug surfaced, so the logs around that point are probably the most valuable to examine.
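The same reading of the graph can be expressed as a small calculation: look for the largest rise between adjacent samples. The sketch below is only an illustration. `steepest_rise` is a hypothetical helper, not part of Antithesis, and apart from the two values quoted above (5% at 45 seconds and 60% at 10 seconds) the sample points are invented for the example.

```python
def steepest_rise(curve):
    """Return the adjacent pair of samples with the largest likelihood increase.

    `curve` is a list of (seconds_before_detection, likelihood) pairs.
    """
    # Order from the earliest moment (most seconds before detection) to the latest.
    points = sorted(curve, key=lambda point: point[0], reverse=True)
    pairs = zip(points, points[1:])
    return max(pairs, key=lambda pair: pair[1][1] - pair[0][1])


# 45 s -> 5% and 10 s -> 60% come from the text; the other values are made up.
curve = [(45, 0.05), (30, 0.10), (20, 0.55), (10, 0.60)]

earlier, later = steepest_rise(curve)
print(f"Largest jump: between {earlier[0]}s and {later[0]}s before detection")
```

With these illustrative values, the largest jump falls between 30 and 20 seconds before detection, which is the window the paragraph above suggests examining.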

Since this debugging capability is unusual and somewhat counterintuitive, let’s walk through how it works concretely, using the earlier example of a system that fails to validate input lengths.

The events leading up to the bug being detected are as follows:

An API request inserts an excessively long name at 0 seconds into testing.
A different request inserts a normal record at 5 seconds into testing.
A third request creates a new field in a form at 10 seconds into testing.
A final request queries the data inserted by the first request at 15 seconds into testing, and the system crashes.

The bug is baked in from the very first request, but only surfaces at the very end of the test case. Antithesis can rewind time and try variations on the above sequence of events to estimate the likelihood of the bug at various points in that test case.

First, Antithesis rewinds to just before the crash and tries different queries. The crash will still always happen. The bug likelihood at 0 seconds before bug detection is 100%.
Then, Antithesis rewinds further and tries many different requests. The crash still always eventually happens. The bug likelihood at 5 seconds before bug detection – or 10 seconds into testing – is 100%.
Then, Antithesis rewinds still further and tries another large set of possibilities. The crash is already baked in, so it always eventually surfaces. The bug likelihood at 10 seconds before bug detection – or 5 seconds into testing – is 100%.
But when Antithesis rewinds to the beginning, enters a new (possibly shorter) name, and then runs many requests, the crash will not necessarily happen. Now the bug likelihood at 15 seconds before bug detection – the beginning – is lower, possibly 20%.

Thus, we infer that the most important event for understanding the bug happened at the beginning of the test case. Scrolling the logs back to that point shows the excessively long name being inserted.
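The walkthrough above can be sketched as a toy simulation. This is not how Antithesis is implemented; it is a minimal model under stated assumptions: the names `NAME_LIMIT`, `crashes`, `alternate_history`, and `bug_likelihood` are hypothetical, each alternate history keeps the same request types and only re-rolls the name that gets inserted, and a re-rolled name is assumed to be excessively long 20% of the time.

```python
import random

NAME_LIMIT = 64  # hypothetical: stored names longer than this crash a later query

# Original history from the walkthrough: (seconds into testing, request, name length).
ORIGINAL = [
    (0, "insert_name", 1000),  # excessively long name: the bug is baked in here
    (5, "insert_record", 0),
    (10, "create_field", 0),
    (15, "query_names", 0),    # the crash surfaces here
]
DETECTED_AT = 15


def crashes(history):
    """Replay a history and report whether a query reads back an over-long name."""
    stored = []
    for _, request, name_len in history:
        if request == "insert_name":
            stored.append(name_len)  # no input validation: always accepted
        elif request == "query_names" and any(n > NAME_LIMIT for n in stored):
            return True
    return False


def alternate_history(rewind_to, rng, p_long=0.2):
    """Keep events before `rewind_to` and re-roll the names inserted from then on."""
    history = []
    for t, request, name_len in ORIGINAL:
        if t >= rewind_to and request == "insert_name":
            # Assumed distribution: a new name is over-long with probability p_long.
            name_len = 1000 if rng.random() < p_long else 10
        history.append((t, request, name_len))
    return history


def bug_likelihood(rewind_to, trials=1000, seed=0):
    """Fraction of alternate histories from `rewind_to` that end in the crash."""
    rng = random.Random(seed)
    hits = sum(crashes(alternate_history(rewind_to, rng)) for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    for rewind_to in (15, 10, 5, 0):
        before = DETECTED_AT - rewind_to
        print(f"{before:>2}s before detection: {bug_likelihood(rewind_to):.0%}")
```

In this model, rewinding to 15, 10, or 5 seconds into testing leaves the long name already stored, so every alternate history crashes. Only rewinding to the beginning gives the name a chance to change, which is why the estimated likelihood drops there.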

The above example discusses time travel simulation of a single original instance of a bug. In the bug report, Antithesis selects a sample of several instances of a bug for time travel debugging. In the linked report, there are two instances with two separate graphs showing when the bug was probably baked in. Clicking on a different bug instance highlights the other graph, while clicking on a point on the graph scrolls the logs to the point corresponding to that time. In this example, click between 20 and 30 seconds before bug detection to examine the logs, since the bug appears to have been baked in around there.

Remember that you can click at a point on the likelihood graph to go to the corresponding time in the logs.