Catalog of reliability properties for key-value datastores
Key-Value (KV) datastores are non-relational databases that store data as a collection of keys and associated values. Commonly used KV stores include etcd, Memcached, Redis, R2, S3, TiKV, and ZooKeeper.
This is a catalog of reliability properties applicable to most, if not all, non-transactional KV datastores. These high-level, holistic properties ensure the system as a whole behaves as expected and delivers on all promised guarantees. If you’re designing or working on a KV store, this reference helps you think about what properties your KV store should have. We’d love your help adding more properties to this list; please reach out to us at support@antithesis.com.
Every property here is accompanied by a pseudocode workload that checks the property. As we go down the list, we extend the workload, so that by the end, the workload implements all properties listed here.
Also check out our reference implementation of this workload for TiKV.
Property catalog
- No data corruption (Safety)
- No data loss (Safety)
- No stale reads (Safety)
- CAS is never violated (Safety)
- Clients can read and write data (Liveness)
No data corruption (Safety)
Clients never receive corrupt data.
How to test this
- Make read and write requests to the datastore using a predefined pool of keys.
- Choose a key from the pool, choose either a read or write request.
- For every write request, generate random data for the key and send the request.
- For every read request, send the request.
- For every write request, store the checksum of the data along with the value.
- During a read, calculate the checksum of the read data and compare it with the stored checksum.
- Any mismatch indicates data corruption.
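The `hash` and `random_value` helpers used in the pseudocode below are left abstract; as one concrete reading, here is a minimal Go sketch where SHA-256 stands in for `hash` (the function names and size bounds are assumptions, not part of any store's API):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"math/rand"
)

// randomValue plays the role of random_value(MIN_DATA_SIZE, MAX_DATA_SIZE):
// a payload of random bytes whose length falls in [min, max].
func randomValue(min, max int) []byte {
	data := make([]byte, min+rand.Intn(max-min+1))
	rand.Read(data)
	return data
}

// checksum plays the role of hash(data) in the pseudocode.
func checksum(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

func main() {
	data := randomValue(16, 64)
	stored := checksum(data) // kept alongside the value at write time

	// On read, recompute the checksum and compare; a mismatch
	// means the store returned corrupt data.
	if checksum(data) != stored {
		panic("Data corrupted")
	}
	fmt.Println("checksum verified")
}
```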
FUNCTION write(client, key):
data = random_value(MIN_DATA_SIZE, MAX_DATA_SIZE)
checksum = hash(data)
start_time = current_time
success = client.put(key, {data, checksum})
IF !success:
RETURN false
RETURN true
FUNCTION read(client, key):
start_time = current_time
success, data, checksum = client.get(key)
IF !success:
RETURN false
IF checksum != hash(data):
Error("Data corrupted")
RETURN true
FUNCTION main():
client = connect(...)
keys = "key_0", ..., "key_100"
operations = [read, write]
LOOP forever:
key = random_choice(keys)
operation = random_choice(operations)
success = operation(client, key)
No data loss (Safety)
Committed data should not get lost.
How to test this
- During a write, store the successful requests in an in-memory hashmap.
- Compare read responses against the in-memory storage to verify durability.
- Any mismatch indicates data loss.
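If the workload later runs clients on multiple goroutines, the in-memory hashmap needs synchronization. A minimal sketch of the `local_checksum_storage` from the pseudocode, with a mutex wrapper added as an implementation choice (the type and method names are ours):

```go
package main

import (
	"fmt"
	"sync"
)

// ChecksumStore is the local_checksum_storage from the pseudocode,
// wrapped in a mutex so concurrent clients can share it safely.
type ChecksumStore struct {
	mu sync.Mutex
	m  map[string]string
}

func NewChecksumStore() *ChecksumStore {
	return &ChecksumStore{m: make(map[string]string)}
}

// Record is called after a successful write.
func (s *ChecksumStore) Record(key, checksum string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[key] = checksum
}

// Check is called after a successful read. ok is false when the key
// was never written, in which case no durability claim can be made;
// lost is true when the stored checksum does not match the read one.
func (s *ChecksumStore) Check(key, checksum string) (lost bool, ok bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	want, ok := s.m[key]
	if !ok {
		return false, false
	}
	return want != checksum, true
}

func main() {
	store := NewChecksumStore()
	store.Record("key_0", "abc123")
	lost, ok := store.Check("key_0", "abc123")
	fmt.Println(lost, ok) // data present and intact
}
```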
FUNCTION write(client, key, local_checksum_storage):
data = random_value(MIN_DATA_SIZE, MAX_DATA_SIZE)
checksum = hash(data)
start_time = current_time
success = client.put(key, {data, checksum})
IF !success:
RETURN false
local_checksum_storage[key] = checksum
RETURN true
FUNCTION read(client, key, local_checksum_storage):
start_time = current_time
success, data, checksum = client.get(key)
IF !success:
RETURN false
IF checksum != hash(data):
Error("Data corrupted")
IF local_checksum_storage[key] != checksum:
Error("Data lost")
RETURN true
FUNCTION main():
client = connect(...)
keys = "key_0", ..., "key_100"
local_checksum_storage = {}
operations = [read, write]
LOOP forever:
key = random_choice(keys)
operation = random_choice(operations)
success = operation(client, key, local_checksum_storage)
No stale reads (Safety)
Based on your database’s consistency guarantee, clients should read data according to the promised consistency model.
For simplicity, this catalog tests for ‘No stale reads’ (a weaker form of linearizability): committed data is immediately visible to all clients.
How to test this
- Create multiple clients in the same process.
- Write data with one client and read data with a random client that didn’t write the data.
- Verify that all clients immediately see committed writes.
FUNCTION write_read(clients, key, local_checksum_storage):
write_client = random_choice(clients)
write_data = random_value(MIN_DATA_SIZE, MAX_DATA_SIZE)
write_checksum = hash(write_data)
start_time = current_time
write_success = write_client.put(key, {write_data, write_checksum})
IF !write_success:
RETURN false
local_checksum_storage[key] = write_checksum
read_client = random_choice(clients)
start_time = current_time
read_success, read_data, read_checksum = read_client.get(key)
IF !read_success:
RETURN false
IF read_checksum != hash(read_data):
Error("Data corrupted")
IF local_checksum_storage[key] != read_checksum:
Error("Data lost")
IF read_data != write_data:
Error("Stale read")
RETURN true
FUNCTION write(clients, key, local_checksum_storage):
client = random_choice(clients)
data = random_value(MIN_DATA_SIZE, MAX_DATA_SIZE)
checksum = hash(data)
start_time = current_time
success = client.put(key, {data, checksum})
IF !success:
RETURN false
local_checksum_storage[key] = checksum
RETURN true
FUNCTION read(clients, key, local_checksum_storage):
client = random_choice(clients)
start_time = current_time
success, data, checksum = client.get(key)
IF !success:
RETURN false
IF checksum != hash(data):
Error("Data corrupted")
IF local_checksum_storage[key] != checksum:
Error("Data lost")
RETURN true
FUNCTION main():
clients = array of NUM_CLIENTS clients
FOR i from 0 to NUM_CLIENTS:
clients[i] = connect(...)
keys = "key_0", ..., "key_100"
local_checksum_storage = {}
operations = [read, write, write_read]
LOOP forever:
key = random_choice(keys)
operation = random_choice(operations)
success = operation(clients, key, local_checksum_storage)
Clients can read and write data (Liveness)
The database should make progress and not get stuck forever. Client requests should eventually be successful.
How to test this
- For every successful request, update the last_successful_operation variable.
- If the last successful operation was more than 2 minutes ago (configurable), the program panics with an error message.
FUNCTION write_read(clients, key, local_checksum_storage):
write_client = random_choice(clients)
write_data = random_value(MIN_DATA_SIZE, MAX_DATA_SIZE)
write_checksum = hash(write_data)
start_time = current_time
write_success = write_client.put(key, {write_data, write_checksum})
IF !write_success:
RETURN false
local_checksum_storage[key] = write_checksum
read_client = random_choice(clients)
start_time = current_time
read_success, read_data, read_checksum = read_client.get(key)
IF !read_success:
RETURN false
IF read_checksum != hash(read_data):
Error("Data corrupted")
IF local_checksum_storage[key] != read_checksum:
Error("Data lost")
IF read_data != write_data:
Error("Stale read")
RETURN true
FUNCTION write(clients, key, local_checksum_storage):
client = random_choice(clients)
data = random_value(MIN_DATA_SIZE, MAX_DATA_SIZE)
checksum = hash(data)
start_time = current_time
success = client.put(key, {data, checksum})
IF !success:
RETURN false
local_checksum_storage[key] = checksum
RETURN true
FUNCTION read(clients, key, local_checksum_storage):
client = random_choice(clients)
start_time = current_time
success, data, checksum = client.get(key)
IF !success:
RETURN false
IF checksum != hash(data):
Error("Data corrupted")
IF local_checksum_storage[key] != checksum:
Error("Data lost")
RETURN true
FUNCTION main():
clients = array of NUM_CLIENTS clients
FOR i from 0 to NUM_CLIENTS:
clients[i] = connect(...)
keys = "key_0", ..., "key_100"
local_checksum_storage = {}
operations = [read, write, write_read]
last_successful_operation = current_time
LOOP forever:
IF (current_time - last_successful_operation) > LIVENESS_TIMEOUT:
Error ("LIVENESS property violated")
key = random_choice(keys)
operation = random_choice(operations)
success = operation(clients, key, local_checksum_storage)
IF success:
last_successful_operation = current_time
For KV stores supporting conditional writes
If your database supports conditional writes, you should test the following property.
CAS is never violated (Safety)
Compare-and-swap (CAS) should apply atomically: when several clients race to CAS the same key, at most one attempt should succeed. CAS generally applies to a single key rather than multiple keys, though ZooKeeper supports multi-key conditional updates.
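The semantics under test can be stated as a toy in-memory CAS (a sketch of the operation itself, not of any particular store's API; the type and method names are ours): the swap applies only when the current value equals the expected value, and the comparison and write happen as one atomic step.

```go
package main

import (
	"fmt"
	"sync"
)

// toyStore implements compare-and-swap over single keys,
// mirroring the client.cas(key, expected, new) call used below.
type toyStore struct {
	mu sync.Mutex
	m  map[string]int
}

// cas atomically sets key to newVal only if its current value
// equals expected; it reports whether the swap happened.
func (s *toyStore) cas(key string, expected, newVal int) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.m[key] != expected {
		return false
	}
	s.m[key] = newVal
	return true
}

func main() {
	s := &toyStore{m: map[string]int{"key": 0}}
	fmt.Println(s.cas("key", 0, 1)) // succeeds: value was 0
	fmt.Println(s.cas("key", 0, 1)) // fails: value is now 1
}
```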
How to test this
The workload for this property is written in Go-style code to make thread-safe code easier to comprehend.
- Initialize 3 clients and set a key to value 0.
- In each iteration:
- Have all 3 clients simultaneously attempt a CAS operation to increment the value.
- Only one should succeed (CAS should be atomic).
- Verify that exactly one succeeded and the value incremented by 1.
- If multiple clients succeed or the value is wrong, panic.
func cas() {
// Create multiple clients
clients := make([]Client, 3)
for i := 0; i < 3; i++ {
clients[i] = connect(...) // your connection parameters
}
key := "key"
current_value := 0
clients[0].put(key, current_value)
for iteration := 0; iteration < 100; iteration++ {
var wg sync.WaitGroup
var successes int32 // Use atomic counter
// Barrier ensures all goroutines start together
startBarrier := make(chan struct{})
// Launch concurrent CAS operations
for _, client := range clients {
wg.Add(1) // Add to wait group
go func(client Client) {
defer wg.Done()
// Block here until barrier is released
<-startBarrier
success, err := client.cas(key, current_value, current_value+1)
if err != nil {
return
}
if success {
atomic.AddInt32(&successes, 1) // Thread-safe increment
}
}(client)
}
// Small delay to ensure all goroutines reach the barrier
time.Sleep(10 * time.Millisecond)
// Release all goroutines simultaneously
close(startBarrier)
wg.Wait()
// Verify results
expected_value := current_value + 1
val := clients[0].get(key)
success_count := atomic.LoadInt32(&successes)
if success_count > 1 || (success_count == 1 && val != expected_value) {
panic(fmt.Sprintf("CAS violated! Expected value: %d, Got: %d, Successes: %d", expected_value, val, success_count))
}
if success_count == 1 {
// Exactly one CAS succeeded, so the value advanced by 1
current_value = expected_value
}
}
}