Catalog of reliability properties for key-value datastores

Key-Value (KV) datastores are non-relational databases that store data as a collection of keys and associated values. Commonly used KV stores include etcd, Memcached, Redis, R2, S3, TiKV, and ZooKeeper.

This is a catalog of reliability properties that apply to most, if not all, non-transactional KV datastores. These high-level, holistic properties ensure that the system as a whole behaves as expected and delivers on all promised guarantees. If you’re designing or working on a KV store, this reference helps you think about what properties your KV store should have. We’d love your help adding more properties to this list. Please reach out to us at support@antithesis.com.

Every property here is accompanied by a pseudocode workload that checks the property. As we go down the list, we extend the workload, so that by the end, the workload implements all properties listed here.

Also check out our reference implementation of this workload for TiKV.

Property catalog

  1. No data corruption (Safety)
  2. No data loss (Safety)
  3. No stale reads (Safety)
  4. CAS is never violated (Safety)
  5. Clients can read and write data (Liveness)

No data corruption (Safety)

Clients never receive corrupt data.

How to test this
  • Define a fixed pool of keys and issue read and write requests against it.
  • In each iteration, choose a random key from the pool and a random operation (read or write).
  • For every write, generate random data, compute its checksum, and store the checksum alongside the data as the value.
  • For every read, recompute the checksum of the returned data and compare it with the stored checksum.
  • Any mismatch indicates data corruption.

FUNCTION write(client, key):
    data = random_value(MIN_DATA_SIZE, MAX_DATA_SIZE)
    checksum = hash(data)
    start_time = current_time
    // Store the checksum next to the data so that any reader can verify it.
    success = client.put(key, {data, checksum})
    IF !success:
        RETURN false
    RETURN true

FUNCTION read(client, key):
    start_time = current_time
    success, data, checksum = client.get(key)
    IF !success:
        RETURN false
    // The stored checksum must match a checksum recomputed from the data.
    IF checksum != hash(data):
        Error("Data corrupted")
    RETURN true

FUNCTION main():
    client = connect(...)
    keys = ["key_0", ..., "key_100"]
    operations = [read, write]
    LOOP forever:
        key = random_choice(keys)
        operation = random_choice(operations)
        success = operation(client, key)
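
The pseudocode leaves open how the checksum travels with the data. One possible approach, sketched here in Go, is to prepend a SHA-256 digest to the payload before writing and to verify it after reading. The helper names encode and decode are hypothetical and not part of any datastore's API, and SHA-256 is an assumption; any stable hash would serve.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"errors"
	"fmt"
)

// encode returns the value to store: a SHA-256 checksum followed by the data.
func encode(data []byte) []byte {
	sum := sha256.Sum256(data)
	return append(sum[:], data...)
}

// decode splits a stored value into checksum and data, and returns an error
// if the checksum recomputed from the data does not match the stored one.
func decode(value []byte) ([]byte, error) {
	if len(value) < sha256.Size {
		return nil, errors.New("data corrupted: value too short to hold a checksum")
	}
	stored, data := value[:sha256.Size], value[sha256.Size:]
	recomputed := sha256.Sum256(data)
	if !bytes.Equal(stored, recomputed[:]) {
		return nil, errors.New("data corrupted: checksum mismatch")
	}
	return data, nil
}

func main() {
	value := encode([]byte("hello"))
	if _, err := decode(value); err != nil {
		panic(err)
	}
	// Flip one bit to simulate corruption somewhere between write and read.
	value[len(value)-1] ^= 0x01
	_, err := decode(value)
	fmt.Println(err)
}
```

Packing the checksum into the value keeps the check self-contained: any client can validate a read without access to the writer's state.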

No data loss (Safety)

Committed data is never lost: once the datastore acknowledges a write, later reads must reflect it.

How to test this
  • After each successful write, record the checksum of the written data in an in-memory hashmap.
  • On every read, compare the returned checksum against the hashmap entry for that key to verify durability.
  • Any mismatch indicates data loss.

FUNCTION write(client, key, local_checksum_storage):
    data = random_value(MIN_DATA_SIZE, MAX_DATA_SIZE)
    checksum = hash(data)
    start_time = current_time
    success = client.put(key, {data, checksum})
    IF !success:
        RETURN false
    // Record only acknowledged writes.
    local_checksum_storage[key] = checksum
    RETURN true

FUNCTION read(client, key, local_checksum_storage):
    start_time = current_time
    success, data, checksum = client.get(key)
    IF !success:
        RETURN false
    IF checksum != hash(data):
        Error("Data corrupted")
    // The value read must be the last value this workload committed.
    IF local_checksum_storage[key] != checksum:
        Error("Data lost")
    RETURN true

FUNCTION main():
    client = connect(...)
    keys = ["key_0", ..., "key_100"]
    local_checksum_storage = {}
    operations = [read, write]
    LOOP forever:
        key = random_choice(keys)
        operation = random_choice(operations)
        success = operation(client, key, local_checksum_storage)
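
The in-memory hashmap can be sketched in Go as a small model that remembers the checksum of the last acknowledged write per key. The type model and its methods recordWrite and checkRead are hypothetical names for illustration, and the sketch assumes this workload is the only writer to these keys. Keys that were never written successfully are skipped rather than reported.

```go
package main

import (
	"crypto/sha256"
	"errors"
	"fmt"
)

// model remembers the checksum of the last acknowledged write for each key.
type model struct {
	checksums map[string][sha256.Size]byte
}

func newModel() *model {
	return &model{checksums: make(map[string][sha256.Size]byte)}
}

// recordWrite must be called only after the datastore acknowledged the write.
func (m *model) recordWrite(key string, data []byte) {
	m.checksums[key] = sha256.Sum256(data)
}

// checkRead returns an error if the data read for key differs from the last
// acknowledged write. Keys with no recorded write are not checked.
func (m *model) checkRead(key string, data []byte) error {
	want, ok := m.checksums[key]
	if !ok {
		return nil
	}
	if sha256.Sum256(data) != want {
		return errors.New("data lost: read does not match last acknowledged write")
	}
	return nil
}

func main() {
	m := newModel()
	m.recordWrite("key_0", []byte("v1"))
	m.recordWrite("key_0", []byte("v2"))
	fmt.Println(m.checkRead("key_0", []byte("v2"))) // matches the latest write
	fmt.Println(m.checkRead("key_0", []byte("v1"))) // an older value came back
}
```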

No stale reads (Safety)

Clients should read data according to the consistency model your database promises.

For simplicity, this catalog tests for ‘No stale reads’ (a weaker form of linearizability): committed data is immediately visible to all clients.

How to test this
  • Create multiple clients in the same process.
  • Write data with one client and read data with a random client that didn’t write the data.
  • Verify that all clients immediately see committed writes.

FUNCTION write_read(clients, key, local_checksum_storage):
    write_client = random_choice(clients)
    write_data = random_value(MIN_DATA_SIZE, MAX_DATA_SIZE)
    write_checksum = hash(write_data)
    start_time = current_time
    write_success = write_client.put(key, {write_data, write_checksum})
    IF !write_success:
        RETURN false
    local_checksum_storage[key] = write_checksum
    // Read back through a client other than the writer (needs NUM_CLIENTS >= 2).
    read_client = random_choice(clients excluding write_client)
    start_time = current_time
    read_success, read_data, read_checksum = read_client.get(key)
    IF !read_success:
        RETURN false
    IF read_checksum != hash(read_data):
        Error("Data corrupted")
    IF local_checksum_storage[key] != read_checksum:
        Error("Data lost")
    IF read_data != write_data:
        Error("Stale read")
    RETURN true

FUNCTION write(clients, key, local_checksum_storage):
    client = random_choice(clients)
    data = random_value(MIN_DATA_SIZE, MAX_DATA_SIZE)
    checksum = hash(data)
    start_time = current_time
    success = client.put(key, {data, checksum})
    IF !success:
        RETURN false
    local_checksum_storage[key] = checksum
    RETURN true

FUNCTION read(clients, key, local_checksum_storage):
    client = random_choice(clients)
    start_time = current_time
    success, data, checksum = client.get(key)
    IF !success:
        RETURN false
    IF checksum != hash(data):
        Error("Data corrupted")
    IF local_checksum_storage[key] != checksum:
        Error("Data lost")
    RETURN true

FUNCTION main():
    clients = array of NUM_CLIENTS clients
    FOR i from 0 to NUM_CLIENTS - 1:
        clients[i] = connect(...)
    keys = ["key_0", ..., "key_100"]
    local_checksum_storage = {}
    operations = [read, write, write_read]
    LOOP forever:
        key = random_choice(keys)
        operation = random_choice(operations)
        success = operation(clients, key, local_checksum_storage)

Clients can read and write data (Liveness)

The database should make progress and not get stuck forever. Client requests should eventually be successful.

How to test this
  • After every successful request, set the last_successful_operation variable to the current time.
  • If the last successful operation was more than LIVENESS_TIMEOUT ago (2 minutes, configurable), the program panics with an error message.

FUNCTION write_read(clients, key, local_checksum_storage):
    write_client = random_choice(clients)
    write_data = random_value(MIN_DATA_SIZE, MAX_DATA_SIZE)
    write_checksum = hash(write_data)
    start_time = current_time
    write_success = write_client.put(key, {write_data, write_checksum})
    IF !write_success:
        RETURN false
    local_checksum_storage[key] = write_checksum
    // Read back through a client other than the writer (needs NUM_CLIENTS >= 2).
    read_client = random_choice(clients excluding write_client)
    start_time = current_time
    read_success, read_data, read_checksum = read_client.get(key)
    IF !read_success:
        RETURN false
    IF read_checksum != hash(read_data):
        Error("Data corrupted")
    IF local_checksum_storage[key] != read_checksum:
        Error("Data lost")
    IF read_data != write_data:
        Error("Stale read")
    RETURN true

FUNCTION write(clients, key, local_checksum_storage):
    client = random_choice(clients)
    data = random_value(MIN_DATA_SIZE, MAX_DATA_SIZE)
    checksum = hash(data)
    start_time = current_time
    success = client.put(key, {data, checksum})
    IF !success:
        RETURN false
    local_checksum_storage[key] = checksum
    RETURN true

FUNCTION read(clients, key, local_checksum_storage):
    client = random_choice(clients)
    start_time = current_time
    success, data, checksum = client.get(key)
    IF !success:
        RETURN false
    IF checksum != hash(data):
        Error("Data corrupted")
    IF local_checksum_storage[key] != checksum:
        Error("Data lost")
    RETURN true

FUNCTION main():
    clients = array of NUM_CLIENTS clients
    FOR i from 0 to NUM_CLIENTS - 1:
        clients[i] = connect(...)
    keys = ["key_0", ..., "key_100"]
    local_checksum_storage = {}
    operations = [read, write, write_read]
    last_successful_operation = current_time
    LOOP forever:
        // Fail if no request has succeeded within the liveness window.
        IF (current_time - last_successful_operation) > LIVENESS_TIMEOUT:
            Error("LIVENESS property violated")
        key = random_choice(keys)
        operation = random_choice(operations)
        success = operation(clients, key, local_checksum_storage)
        IF success:
            last_successful_operation = current_time

For KV stores supporting conditional writes

If your database supports conditional writes, you should test the following property.

CAS is never violated (Safety)

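A compare-and-swap (CAS) write updates a key only if its current value matches the value the client expects, so when several clients race with the same expected value, at most one of them can succeed. CAS generally applies to a single key rather than to multiple keys, although ZooKeeper supports multi-key conditional updates.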

How to test this

The workload for this property is written in Go-style code to make thread-safe code easier to comprehend.

  • Initialize 3 clients and set a key to value 0.
  • In each iteration:
    • Have all 3 clients simultaneously attempt a CAS operation to increment the value.
    • At most one should succeed (CAS should be atomic). Under faults, every attempt may fail.
    • Verify that no more than one client succeeded and that a successful CAS incremented the value by exactly 1.
    • If multiple clients succeed or the value is wrong, panic.

import (
    "fmt"
    "sync"
    "sync/atomic"
    "time"
)

// Client and connect stand in for your datastore's client API.
func cas() {
    // Create multiple clients
    const numClients = 3
    clients := make([]Client, numClients)
    for i := range clients {
        clients[i] = connect(...) // your connection parameters
    }
    key := "key"
    currentValue := 0
    clients[0].put(key, currentValue)
    for iteration := 0; iteration < 100; iteration++ {
        var wg sync.WaitGroup
        var successes int32 // Use atomic counter
        // Barrier ensures all goroutines start together
        startBarrier := make(chan struct{})
        // Launch concurrent CAS operations
        for _, client := range clients {
            wg.Add(1) // Add to wait group
            go func(client Client, expected int) {
                defer wg.Done()
                // Block here until barrier is released
                <-startBarrier
                success, err := client.cas(key, expected, expected+1)
                if err != nil {
                    return
                }
                if success {
                    atomic.AddInt32(&successes, 1) // Thread-safe increment
                }
            }(client, currentValue)
        }
        // Small delay to ensure all goroutines reach the barrier
        time.Sleep(10 * time.Millisecond)
        // Release all goroutines simultaneously
        close(startBarrier)
        wg.Wait()
        // Verify results
        val, err := clients[0].get(key)
        if err != nil {
            return // Cannot verify without reading the current value
        }
        successCount := int(atomic.LoadInt32(&successes))
        expectedValue := currentValue
        if successCount > 0 {
            expectedValue = currentValue + 1 // A successful CAS increments once
        }
        if successCount > 1 || val != expectedValue {
            panic(fmt.Sprintf("CAS violated! Expected value: %d, Got: %d, Successes: %d", expectedValue, val, successCount))
        }
        currentValue = expectedValue
    }
}
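
To try the barrier pattern without a real datastore, the sketch below races goroutines against an in-memory store whose cas is made atomic with a mutex. The names memStore and raceCAS are hypothetical, and the store only imitates the semantics described above; it is not any datastore's client.

```go
package main

import (
	"fmt"
	"sync"
)

// memStore is an in-memory stand-in for a datastore with an atomic CAS.
type memStore struct {
	mu   sync.Mutex
	data map[string]int
}

func newMemStore() *memStore {
	return &memStore{data: make(map[string]int)}
}

func (s *memStore) put(key string, value int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[key] = value
}

func (s *memStore) get(key string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.data[key]
}

// cas sets key to next only if its current value equals old.
func (s *memStore) cas(key string, old, next int) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if current, ok := s.data[key]; !ok || current != old {
		return false
	}
	s.data[key] = next
	return true
}

// raceCAS has n goroutines attempt the same increment at once and returns
// how many of the attempts succeeded.
func raceCAS(s *memStore, key string, old, n int) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	successes := 0
	start := make(chan struct{})
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-start // wait for the barrier
			if s.cas(key, old, old+1) {
				mu.Lock()
				successes++
				mu.Unlock()
			}
		}()
	}
	close(start) // release all goroutines together
	wg.Wait()
	return successes
}

func main() {
	s := newMemStore()
	s.put("key", 0)
	fmt.Println(raceCAS(s, "key", 0, 3), s.get("key")) // prints "1 1"
}
```

Because the compare and the write happen under one lock, exactly one of the racing attempts wins here. A store that compared and wrote in separate steps could let several attempts succeed, which is the violation the workload is looking for.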