Cookbook notebook

View the logs from the future

Peer 2 seconds into the future of a buggy moment and view the events leading up to it.

// PRE-REQ: you have a 'moment' and 'environment' (likely from our boilerplate)

// spawn a branch where you'll play the simulation forward
branch = moment.branch()
branch.wait(Time.seconds(2))

// view the events leading up to this future
print(environment.events.up_to(branch))

Run a diagnostic command

Run a bash command (e.g. netstat) in a container you’re investigating.

// PRE-REQ: you have a 'moment' and 'environment' (likely from our boilerplate)

// list all running container names
client_names = environment.containers.list({moment}).map(x => x.name)
print(client_names)

// select the container name for the container you want to run netstat in
client_name = client_names.at(0)

// spawn a branch and run the command, printing the resulting process output
branch = moment.branch()
process = bash`netstat`.run({container: client_name, branch})
print(process)

If you’re here to debug a crashed process, get the pid of that process from the logs. Container processes have two pids: one on the host machine and another inside the container. The pid obtained from the logs is the host pid, but for most debugging use cases you’ll want the container pid.

Map host pid to its container pid

To map a host pid to its container pid, follow the instructions below.

// PRE-REQ: you have a 'moment' and 'environment' (likely from our boilerplate)
// PRE-REQ: you have the host pid of the crashed process

// Rewind to a moment before the crash
pre_crash_moment = moment.rewind(Time.seconds(0.8))
branch = pre_crash_moment.branch()

// Let the host pid = 3653
hpid = 3653

// Get all pids of the target process in the namespaces it participates in
// This command is run on the host machine
print(bash`grep NSpid /proc/${hpid}/status`.run({container: environment.host, branch}))

// Output
// 82.108  NSpid:  3653    103
// The first value (3653) is the host pid; the last value (103) is the container pid
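
The NSpid line lists the process's pid in each PID namespace it belongs to, outermost first, so the host pid comes first and the container pid comes last. Below is a minimal JavaScript sketch for pulling those values out of the output text. `parseNSpid` is a hypothetical helper, not part of the notebook API, and it assumes you already have the command's output as a plain string.

```javascript
// Hypothetical helper (not part of the notebook API):
// extract every pid that follows "NSpid:" in the given text.
function parseNSpid(outputText) {
  const match = outputText.match(/NSpid:\s+([\d\s]+)/);
  if (!match) return [];
  return match[1].trim().split(/\s+/).map(Number);
}

// Sample line copied from the output shown above
const pids = parseNSpid("82.108  NSpid:  3653    103");
const hostPid = pids[0];                    // outermost namespace
const containerPid = pids[pids.length - 1]; // innermost namespace
console.log(hostPid, containerPid); // 3653 103
```

Taking the last entry rather than the second keeps the helper correct if the process sits in more than two nested namespaces.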

Get process id of a known process

If you’re here to proactively debug an event and want the pid of a specific process in a specific container, follow the instructions below.

// PRE-REQ: you have a 'moment' and 'environment' (likely from our boilerplate)

// grab the container you want to debug, similar to in our diagnostic command example
environment.containers.list({moment})
// container_1
// container_2
// container_3

// you want to debug container_2
chosen_container = 'container_2'

// rewind time if you're interested in debugging just before the current moment
debug_moment = moment.rewind(Time.seconds(0.8))

// if you don't know the pid of the program you want to debug, you can use a ps command to see the list of running processes
print(bash`ps aux`.run({branch: debug_moment.branch(), container: chosen_container}))

// if you know the process name
process_name = "slirp4netns"
ps_command = bash`ps --format pid --no-headers -C ${process_name} | head -n 1`.run({branch: debug_moment.branch(), container: chosen_container})

// If you have `pgrep` inside your target container, it may be more ergonomic
ps_command = bash`pgrep ${process_name}`.run({branch: debug_moment.branch(), container: chosen_container})

// then create a variable of the pid of interest
pid_to_debug = ps_command.stdout_text
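
Both `ps` and `pgrep` print text, and `pgrep` prints one pid per line when several processes match. The sketch below turns that text into a single number. `firstPid` is a hypothetical helper, and it assumes `stdout_text` is a plain string, which the recipe above suggests but does not state.

```javascript
// Hypothetical helper (not part of the notebook API):
// return the first pid found in ps/pgrep output, or null if there is none.
function firstPid(stdoutText) {
  const firstLine = stdoutText.trim().split("\n")[0] ?? "";
  const pid = Number.parseInt(firstLine, 10);
  return Number.isNaN(pid) ? null : pid;
}

console.log(firstPid("  4821\n")); // 4821
console.log(firstPid("17\n42\n")); // 17
console.log(firstPid(""));         // null
```

Returning null for empty output makes it easy to notice that the process was not running at the chosen moment before passing the pid on to another command.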

Get the container id from a host pid

If you’re here to investigate a process crash, here’s how to get the container id that was running the crashed process.

Grab the host pid of the crashed process from the logs.

// PRE-REQ: you have a 'moment' and 'environment' (likely from our boilerplate)
// PRE-REQ: you have the host pid of the crashed process

// Rewind to a moment before the crash
pre_crash_moment = moment.rewind(Time.seconds(0.8))
branch = pre_crash_moment.branch()

// Let the host pid = 3653
hpid = 3653

// Containers are placed into distinct cgroups, so you can get the container id from the process's cgroup information
print(bash`cat /proc/${hpid}/cgroup`.run({branch, container: environment.host}))

// The output will look similar to
// 0::/machine.slice/libpod-<container-id>.scope
// The important part is: "libpod-<container-id>.scope" to get the container id

// You can also use this command to extract the container id
print(bash`grep -o 'libpod-[^.]*' /proc/${hpid}/cgroup | sed 's/libpod-//'`.run({container: environment.host, branch}))

// Inspect the container to find its image name and container name
// (replace <container-id> with the id you extracted above)
print(bash`podman inspect <container-id>`.run({container: environment.host, branch}))
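
If you would rather extract the container id in the notebook than in the shell, the same pattern can be matched in JavaScript. This is a sketch only: `containerIdFromCgroup` is a hypothetical helper, the sample id is made up, and it assumes the cgroup path follows the `libpod-<container-id>.scope` form shown above.

```javascript
// Hypothetical helper (not part of the notebook API):
// pull the container id out of the contents of /proc/<pid>/cgroup.
function containerIdFromCgroup(cgroupText) {
  const match = cgroupText.match(/libpod-([^.\s]+)\.scope/);
  return match ? match[1] : null;
}

// The id below is an invented placeholder, not real output
const sample = "0::/machine.slice/libpod-abc123def456.scope";
console.log(containerIdFromCgroup(sample));          // abc123def456
console.log(containerIdFromCgroup("0::/init.scope")); // null
```

A null result means the process was not running inside a libpod-managed cgroup, for example because it is a host process.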

Pull a core dump

Cause a process in a container to exit and extract a core dump. To find the pid of the target process, follow the instructions in “Get process id of a known process” above.

// PRE-REQ: you have a 'moment' and 'environment' (likely from our boilerplate)

// grab a container, similar to in our diagnostic command example
client_name = environment.containers.list({moment}).at(0).name

// If you're here because your target process has crashed, rewind to before the crash occurred
coredump_moment = moment.rewind(Time.seconds(0.8))

// If you don't know the process id (pid) of the program you'd like a core dump for, follow the steps in the "Get process id of a known process" example
// Then create a variable holding the pid you want a core dump of.
pid_to_kill = 1234

file = environment.core_dump_by_pid({moment: coredump_moment, pid: pid_to_kill, container: client_name})

print(file)

Run the profiler

Runs the profiler for ten seconds and prints its results.

// Start the profiler on some branch (for example, branch = moment.branch())
background_profiler = environment.profiler.start({branch})

// You can optionally supply a PID if you want to look at a particular process.
// background_profiler = environment.profiler.start({branch, pid: 1})

// Advance time on the branch
// Note that instead of waiting, you could run commands here.
// This is especially helpful for investigating the performance of a series of commands
branch.wait(Time.seconds(10))

// Stop the profiler on the branch
environment.profiler.stop({branch, background_profiler})

// View the results as of the end of the branch
print(environment.profiler.report({moment: branch.end}))

Ask a counterfactual

If you have a bug that you believe is not sensitive to small changes in CPU timings, you can ask counterfactual questions like “if I turn this feature flag off, does the bug still occur?”

// PRE-REQ: you have a 'moment' and 'environment' (likely from our boilerplate)
// PRE-REQ: you have something you want to tweak that can be controlled by a container in your system under test

// define what you consider a bug as an eventset
bugs_event_set = environment.events.filter(ev => ev.output_text != null && ev.output_text.includes("FATAL"))

// check that the bug occurred in this moment's history
print(bugs_event_set.up_to(moment))

// grab a container, similar to in our diagnostic command example
feature_flag_client = environment.containers.list({moment}).at(0)?.name

// Rewind to where you want to tweak history
alternative_timeline = moment.rewind(Time.seconds(0.8)).branch()

// use your feature-flag-service (eg. StatSig) to flip a feature flag
bash`siggy gates update my-feature-flag '{ "type": "public" }'`.run({branch: alternative_timeline, container: feature_flag_client})

// wait until a bug occurs, or 10 simulated seconds pass, whichever is sooner
alternative_timeline.wait_until({until: bugs_event_set, timeout: Time.seconds(10)})

// see if the bug did occur
print(bugs_event_set.up_to(alternative_timeline))
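
Before running the counterfactual, it can help to check that the bug definition matches what you expect. The sketch below applies the same predicate used in `environment.events.filter` above to a few hand-written objects. The sample events are hypothetical, and it assumes events expose `output_text` as a string or null, as the filter above implies.

```javascript
// Same predicate as the filter in the recipe above
const isFatal = ev => ev.output_text != null && ev.output_text.includes("FATAL");

// Hypothetical sample events, for illustration only
const sampleEvents = [
  { output_text: "INFO starting up" },
  { output_text: null },
  { output_text: "FATAL: connection pool exhausted" },
  {}, // an event with no output_text at all
];

const bugs = sampleEvents.filter(isFatal);
console.log(bugs.length); // 1
```

The `!= null` check also covers a missing `output_text`, so events without output are skipped rather than raising an error.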

Using language-specific debuggers

Multiverse debugging can be used to operate command-line debuggers inside the simulation. If you’re used to a GUI this may be awkward at first, but we encourage you to give it a shot. Being able to undo, view your command history, or parallelize commands offers a lot of power and usability.

If there are particular tools or abilities you’d like to be made more ergonomic, please contact us at support@antithesis.com or join our Discord.

Example

Imagine you found a core dump error in a test run and you’re trying to debug it. With the help of command-line debuggers, you can determine that the core dump was caused by, say, memory corruption at <memory_address>.

But you can’t know what caused the memory corruption in that specific test run without travelling back in time and monitoring the affected area of code. Antithesis’ deterministic replayability allows you to rewind time before the core dump, set a watchpoint on the memory address that’ll be corrupted and debug what caused it and how it happened.

Below is an example workflow of how you can investigate memory corruption in the simulation using GDB or LLDB.

The example shows the commands to set a watchpoint on a memory address, set a breakpoint on a function, and read a memory address, all sent through a named pipe. Alternatively, you can run these commands in batch mode.

Using GDB

// PRE-REQ: you have a 'moment' and a 'pid'

chosen_container = "container_2"
branch = moment.branch()

pre_moment = branch.end.rewind(1)
pre_branch = pre_moment.branch()

// Notice the `pre_branch.branch()`
// This creates a quick branch to inspect the pids without moving `pre_branch` forward
print(bash`ps aux`.run({branch: pre_branch.branch(), container: chosen_container}))

// Creates the pipe
print(bash`mkfifo /dev/gdb_pipe`.run({branch: pre_branch, container: chosen_container}))

// Keeps the pipe open
print(bash`sleep infinity > /dev/gdb_pipe`.run_in_background({branch: pre_branch, container: chosen_container}))

// Attach to the pid (e.g. 36)
print(bash`gdb -p 36 < /dev/gdb_pipe`.run_in_background({branch: pre_branch, container: chosen_container}))

// Set a watchpoint
print(bash`echo "watch *(long*)<memory_address>" > /dev/gdb_pipe`.run({branch: pre_branch, container: chosen_container}))

// Set a breakpoint
print(bash`echo "b <function>" > /dev/gdb_pipe`.run({branch: pre_branch, container: chosen_container}))

// Examine a memory address
print(bash`echo "x/2wu <memory_address>" > /dev/gdb_pipe`.run({branch: pre_branch, container: chosen_container}))

print(bash`echo "c" > /dev/gdb_pipe`.run({branch: pre_branch, container: chosen_container}))

// Move the `pre_branch` forward to hit the breakpoint
pre_branch.wait({duration: Time.seconds(3)})

// you can also run a batch of gdb commands (failing_pid is the pid from the PRE-REQ above)
gdbout = bash`gdb -p ${failing_pid.toString()} -ex "watch *(long*)<memory_address>" -ex "b <function>" -ex "x/2wu <memory_address>"`.run({ branch: pre_branch, container: chosen_container})

print(gdbout)

download(gdbout)

Using LLDB

// PRE-REQ: you have a 'moment' and a 'pid'

chosen_container = "container_2"
branch = moment.branch()

pre_moment = branch.end.rewind(1)
pre_branch = pre_moment.branch()

// Notice the `pre_branch.branch()`
// This creates a quick branch to inspect the pids without moving `pre_branch` forward
print(bash`ps aux`.run({branch: pre_branch.branch(), container: chosen_container}))

// Creates the pipe
print(bash`mkfifo /dev/lldb_pipe`.run({branch: pre_branch, container: chosen_container}))

// Keeps the pipe open
print(bash`sleep infinity > /dev/lldb_pipe`.run_in_background({branch: pre_branch, container: chosen_container}))

// Attach to pid (e.g. 36)
print(bash`lldb -p 36 < /dev/lldb_pipe`.run_in_background({branch: pre_branch, container: chosen_container}))

// Set a watchpoint
print(bash`echo "watchpoint set expression <memory_address>" > /dev/lldb_pipe`.run({branch: pre_branch, container: chosen_container}))

// Set a breakpoint
print(bash`echo "b -n <function>" > /dev/lldb_pipe`.run({branch: pre_branch, container: chosen_container}))

// Read memory
print(bash`echo "memory read --size 2 --format x --count 2 <memory_address>" > /dev/lldb_pipe`.run({branch: pre_branch, container: chosen_container}))

print(bash`echo "continue" > /dev/lldb_pipe`.run({branch: pre_branch, container: chosen_container}))

// Move the `pre_branch` forward to hit the breakpoint
pre_branch.wait({duration: Time.seconds(3)})

// you can also run a batch of lldb commands (failing_pid is the pid from the PRE-REQ above)
lldbout = bash`lldb -p ${failing_pid.toString()} --batch -o "watchpoint set expression <memory_address>" -o "b -n <function>" -o "memory read --size 2 --format x --count 2 <memory_address>" -o "continue"`.run({ branch: pre_branch, container: chosen_container})

print(lldbout)

download(lldbout)

Using JDB

You must have JDWP enabled to use JDB. To do this, set the JAVA_TOOL_OPTIONS environment variable as follows:

JAVA_TOOL_OPTIONS: -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=7896

The port that JDWP is enabled on (7896 in this case) just needs to be an available port. This must be set in the relevant containers, at or before the time of invocation.

// PRE-REQ: you have a 'moment' (likely a bug moment from a triage report)

// rewind and spawn a new branch starting shortly before your moment
debug_branch = moment.rewind(.3).branch()

// because bash shells are ephemeral, we write jdb commands to a named pipe...
print(bash`mkfifo /dev/jdb_pipe`.run({branch: debug_branch, container: 'kafka-2'}))

// ...which jdb reads as its input
// In this case, JDWP is enabled on port 7896
print(bash`jdb -attach 7896 < /dev/jdb_pipe`.run_in_background({branch: debug_branch, container: 'kafka-2'}))

// hold the pipe open with a long-running writer
print(bash`sleep infinity > /dev/jdb_pipe`.run_in_background({branch: debug_branch, container: 'kafka-2'}))

// stop whenever a NullPointerException is thrown
print(bash`echo 'catch java.lang.NullPointerException' > /dev/jdb_pipe`.run({branch: debug_branch, container: 'kafka-2'}))

// advance time to hit the issue
debug_branch.wait({duration: Time.seconds(.5)})

// get all local variable output
print(bash`echo 'locals' > /dev/jdb_pipe`.run({branch: debug_branch, container: 'kafka-2'}))

// access object information, field by field
print(bash`echo 'dump getDataRequest.ctx' > /dev/jdb_pipe`.run({branch: debug_branch, container: 'kafka-2'}))

If you want to start with a Java Heap Dump instead

Generating a Java Heap Dump is done using the jcmd command-line tool, which is included in the JDK.

// PRE-REQ: you have a 'moment' (likely a bug moment from a triage report)

// spawn a branch where you'll play the simulation forward
branch = moment.branch()

// list all processes running in the kafka-2 container along with their pids
print(bash`ps -elf`.run({
   branch: branch,
   container: 'kafka-2'
}))

// since the process we want is the container entrypoint we're using pid = 1, but you can use ps to find the relevant pid, as we do above
heap_dump_pid = 1

// This will generate a heap dump file...
bash`jcmd ${heap_dump_pid} GC.heap_dump /tmp/heap_dump.hprof`.run({branch: branch, container: 'kafka-2'})

//...which you can download
download(environment.extract_file({
    moment: branch.end,
    path: `/tmp/heap_dump.hprof`,
    container: 'kafka-2'
}));

Using Delve or another Go debugger

In this example, we’re using Delve. You must install Delve on your container image for these commands to work.

// PRE-REQ: you have a 'moment' (likely a bug moment from a triage report)
// PRE-REQ: you have a 'container' variable holding the name of the target container

// rewind and spawn a new branch starting shortly before your moment
delve_branch = moment.rewind(Time.seconds(2)).branch()

// create a pipe for delve input/output
print(bash`mkfifo /dev/dlv_pipe && ls -l /dev/dlv_pipe`.run({branch: delve_branch, container }))

// define the pid of the process you're debugging
delve_pid = 1

// attach delve to the process, reading its commands from the pipe
print((pipe = bash`dlv attach ${delve_pid} --allow-non-terminal-interactive < /dev/dlv_pipe`.run_in_background({ branch: delve_branch, container })))

// hold the pipe open with a long-running writer
bash`sleep infinity > /dev/dlv_pipe`.run_in_background({branch: delve_branch, container})

// create a breakpoint named "bp" and continue until it is hit
bash`echo "break bp path/to/file/file.go:406" > /dev/dlv_pipe`.run({branch: delve_branch, container })
bash`echo "continue" > /dev/dlv_pipe`.run({branch: delve_branch, container })

delve_branch.wait({duration:Time.seconds(2)})

From here, you can:

// print a stack trace
print(bash`echo "stack" > /dev/dlv_pipe`.run({branch: delve_branch, container }))

// move the current frame up
print(bash`echo "up 3" > /dev/dlv_pipe`.run({branch: delve_branch, container }))

// print local variable output
print(bash`echo "locals" > /dev/dlv_pipe`.run({branch: delve_branch, container }))

All available Delve CLI commands are listed in the Delve CLI documentation.