Gabriel Guerra
Developer Relations Lead

Software reliability, part 1: What is property-based testing?

[Image: anteaters testing in a lab]

As developers, we all strive to build reliable software – software that works correctly even when faults occur. This series aims to educate developers and team leads about our approach to software reliability.

Letting computers do the work

The first step in verifying that your software is reliable is to test it. We write unit tests. We write integration tests. We (sometimes) write end-to-end tests. We come up with examples, like a user submitting a valid input and then an invalid one. We then make a small change to the code. We update the examples to match. We modify the code some more. We update the examples again. Wash, rinse, and repeat. This initial writing and constant updating of examples is boring and repetitive. And when we run into repetitive, boring tasks, what do we say? Computers can do it for us.

Even if you're one of those rare individuals who gets a thrill out of manually maintaining example after example, you can try your best and still fail to imagine all the edge cases. The human brain is just very bad at exhaustively enumerating the negative space of all possible ways a system could be broken. It's actually the opposite of the kind of task we're optimized for! Computers are much better at this – so why not let them do the work?

More or less, that is the philosophy of Property-Based Testing (PBT): hand the burden of coming up with examples off to the computer, which is faster than you and will generate far more of the edge cases that are likely to find bugs. Instead of writing examples, we tell the computer how our software should behave overall by creating properties.
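
To make the idea concrete before we reach a real PBT library, here is a minimal hand-rolled sketch. The property (sorting is idempotent: sorting an already-sorted list changes nothing) and the function name are ours, invented purely for illustration:

import random

def check_sort_is_idempotent(trials=1000):
    # A hand-rolled property check, for illustration only: instead of
    # hand-picking examples, let the computer invent them.
    for _ in range(trials):
        lst = [random.randint(-100, 100) for _ in range(random.randint(0, 20))]
        once = sorted(lst)
        # The property: sorting an already-sorted list must change nothing.
        assert sorted(once) == once, f"Property violated for {lst}"

check_sort_is_idempotent()
print("Sorting was idempotent on 1000 random lists.")

Real PBT libraries do the same thing far more cleverly, generating trickier inputs and shrinking any failure down to a minimal counterexample, as we'll see next.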

Catch me if you can, Snouty

Consider a scenario where you’ve been working on this search function for a while, and you’re feeling pretty good about it.

def search(lst, target):
    """Binary search: return the index of target in sorted lst, or -1 if absent."""
    left, right = 0, len(lst) - 1
    while left < right:
        mid = (left + right) // 2
        guess = lst[mid]
        if guess == target:
            return mid
        elif guess < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1

You glance at the clock. It’s almost noon, and you know if you don’t hurry, you’re not going to have enough time to eat lunch before your next meeting. So you write a bunch of test cases, covering what you think are all the edge cases. “Just one more test,” you mutter to yourself. You run the tests, see all the green checkmarks, and with a satisfied nod, you head off to claim your food.

def test_search():
    assert search([1, 2, 3, 4, 5], 3) == 2, "Test case 1 failed"
    assert search([1, 2, 3, 4, 5], 6) == -1, "Test case 2 failed"
    assert search([1, 2, 3, 4, 5], 1) == 0, "Test case 3 failed"
    assert search([1], 2) == -1, "Test case 4 failed"
    print("All example-based tests passed!")

# Run example-based tests
test_search()

Fast forward an hour. You’re back at your desk, slightly foggy from your food coma, staring at your screen. All tests are still passing! “lgtm,” you think, and without a second thought, you commit your changes and push your PR. Little do you know, lurking in your code is a bug that your drowsy brain missed. A one-line bug that your example-based tests, thorough as they seemed, failed to catch. Can you spot the bug in the first code block above?

Hunting a one-line bug with PBT

This scenario, while only a bit exaggerated, isn’t far from reality. We’ve all been there – thinking we’ve covered all our bases, only to find out later we missed something crucial. This is where PBT can be a real lifesaver.

To illustrate how PBT can help us, we can use the Hypothesis Python library. Here is one way we could write a property-based test for our search function:

from hypothesis import settings, given
from hypothesis.strategies import lists, integers

@settings(max_examples=1000)
@given(lists(integers()), integers())
def test_search_pbt(lst, target):
    lst.sort()  # binary search assumes a sorted list
    result_idx = search(lst, target)
    if result_idx == -1:
        assert target not in lst, f"{target} is in {lst} but wasn't found"
        return
    assert lst[result_idx] == target, f"Found incorrect index for {target} in {lst}"

# Run property-based tests
test_search_pbt()
print("All property-based tests passed!")

What is this code doing? Conceptually, there are only two parts that matter to us. The first part is the function definition with its decorators (@settings and @given). These specify how to generate random inputs, defining both the types of the inputs (e.g., lst is a list of integers) and the number of inputs we want the PBT library to generate for us, in this case, a thousand. The second important part is the assertions we define. These assertions establish the properties we believe should always be true for this search function:

  1. If the function returns -1, the target should not be in the list.
  2. If the function doesn’t return -1, the element at that index should be the target.

With these two parts, we end up with a single test that generates a thousand random list-and-target pairs and checks whether our function ever violates our properties. Running this property-based test, we quickly find that our implementation does indeed fail for certain inputs! Do you see the bug now?

AssertionError: 0 is in [0] but wasn't found
Falsifying example: test_search_pbt(
    lst=[0],
    target=0,
)

It's in the loop condition! The loop condition should include the case where left and right are equal. As written, the loop exits as soon as left == right, so that final position is never checked – the search misses the target whenever it sits exactly where the bounds converge, as in our one-element list whose only element is the target. Fixed!

while left <= right:
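
As an aside, once Hypothesis has handed you a falsifying example, it's worth pinning it so this exact regression stays covered. A sketch of one way to do that, using Hypothesis's @example decorator (which always runs the given input in addition to the random ones):

from hypothesis import example, given, settings
from hypothesis.strategies import lists, integers

@settings(max_examples=1000)
@given(lists(integers()), integers())
@example(lst=[0], target=0)  # always replay the input that exposed the bug
def test_search_pbt(lst, target):
    lst.sort()  # binary search assumes a sorted list
    result_idx = search(lst, target)
    if result_idx == -1:
        assert target not in lst, f"{target} is in {lst} but wasn't found"
        return
    assert lst[result_idx] == target, f"Found incorrect index for {target} in {lst}"

test_search_pbt()  # the pinned example now runs on every invocation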

Great, we've defined our function's behavior using properties instead of examples, which let us automate our test cases. But more importantly, notice how you can now keep changing the implementation of the search function to meet your requirements without rewriting all the tests. PBT not only helped us find our bug but also resulted in less work and more maintainable test code.
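
One refinement worth knowing about (a sketch, not code from this article): instead of sorting inside the test body, a Hypothesis strategy can be transformed with .map so that the generator itself only ever produces sorted lists, leaving the test body to state nothing but the properties:

from hypothesis import given
from hypothesis.strategies import lists, integers

# Transform the strategy: every generated list arrives already sorted.
sorted_lists = lists(integers()).map(sorted)

@given(sorted_lists, integers())
def test_search_pbt_sorted(lst, target):
    result_idx = search(lst, target)
    if result_idx == -1:
        assert target not in lst, f"{target} is in {lst} but wasn't found"
    else:
        assert lst[result_idx] == target, f"Found incorrect index for {target} in {lst}"

test_search_pbt_sorted()

This keeps the sortedness precondition next to the data generation, where it belongs, and avoids mutating the inputs Hypothesis hands us.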

Beyond simple functions

While this example might have seemed a bit contrived at first, it helps illustrate a crucial point for us:

If even in this simple function an obvious bug could escape all of our many test cases, imagine the potential for errors in an entire software application comprising thousands of functions, or worse, in a distributed system composed of multiple microservices.

This explosion of complexity and risk is a challenge we’ve taken to heart. At Antithesis, we take PBT principles and apply them to entire concurrent, stateful, and distributed systems, randomly varying not only the inputs to a software program but also the environment within which it runs.

Admittedly, jumping from a single function to an entire distributed system might seem like a stretch at this point. To better understand how we do this, in the next part of this series, we’ll first explore how to reason about the set of properties that defines the behavior of our software – its specification.

If you’re intrigued by our approach to reliability and want to learn how it can directly benefit your software, don’t hesitate to reach out to us! Till next time.