Cookbook

Simple: View the logs from the future

Peer 2 seconds into the future of a buggy moment and view its events.

// PRE-REQ: you have a 'moment' and 'environment' (likely from our boilerplate)

// spawn a branch where you'll play the simulation forward
branch = moment.branch()
branch.wait(Time.seconds(2))

// view the events leading up to this future
print(environment.events.up_to(branch))
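
// you can also filter the events first, as in the counterfactual recipe below
// (filtering on "FATAL" here is just an example)
print(environment.events.filter(ev => ev.output_text != null && ev.output_text.includes("FATAL")).up_to(branch))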

Simple: Run a diagnostic command

Run a bash command (e.g. netstat) in a container you’re investigating.

// PRE-REQ: you have a 'moment' and 'environment' (likely from our boilerplate)

// list all running container names
client_names = environment.containers.list({moment}).map(x => x.name)
print(client_names)

// select the container name for the container you want to run netstat in
client_name = client_names.at(0)

// spawn a branch and run the command, printing the resulting process output
branch = moment.branch()
process = bash`netstat`.run({container: client_name, branch})
print(process)
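
// the process object also exposes the command's text output directly
// (the core dump recipe below reads it this way)
print(process.stdout_text)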

Pull a core dump

Cause a process in a container to exit and extract a core dump.

// PRE-REQ: you have a 'moment' and 'environment' (likely from our boilerplate)

// grab a container, similar to in our diagnostic command example
client_name = environment.containers.list({moment}).at(0).name

// If you're here because your target process has crashed, rewind to before the crash occurred
coredump_moment = moment.rewind(Time.seconds(0.8))

// If you don't know the process id (pid) of the program you'd like a core dump for, you can use a ps command to see the list of running processes
process_name = "slirp4netns"
ps_command = bash`ps --format pid --no-headers -C ${process_name} | head -n 1`.run({branch: coredump_moment.branch(), container: client_name})

// If you have `pgrep` inside your target container, that could be more ergonomic
ps_command = bash`pgrep ${process_name}`.run({branch: coredump_moment.branch(), container: client_name})

// Then store the pid you want a core dump of.
pid_to_kill = ps_command.stdout_text

file = environment.core_dump_by_pid({moment: coredump_moment, pid: pid_to_kill, container: client_name})

print(file)
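
// if you want the dump on your machine, you can try downloading it as in the heap dump example below
// (this assumes core_dump_by_pid returns the same kind of file handle as extract_file)
download(file)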

Run the profiler

Run the profiler for ten seconds and print its results.

// PRE-REQ: you have an 'environment' and a 'branch' (e.g. branch = moment.branch())

// Start the profiler on some branch
background_profiler = environment.profiler.start({branch})

// You can optionally supply a PID if you want to look at a particular process.
// background_profiler = environment.profiler.start({branch, pid: 1})

// Advance time on the branch
// Note that instead of waiting, you could run commands here.
// This is especially helpful for investigating the performance of a series of commands
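// For example (commented out; assumes a 'client_name' as in the diagnostic command example):
// bash`netstat`.run({branch, container: client_name})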
branch.wait(Time.seconds(10))

// Stop the profiler on the branch
environment.profiler.stop({branch, background_profiler})

// View the results as of the end of the branch
print(environment.profiler.report({moment: branch.end}))

Ask a counterfactual

If you have a bug you believe is not sensitive to small changes in CPU timing, you can ask counterfactual questions like “if I turn this feature flag off, does the bug still occur?”

// PRE-REQ: you have a 'moment' and 'environment' (likely from our boilerplate)
// PRE-REQ: you have something you want to tweak that can be controlled by a container in your system under test 

// define what you consider a bug as an eventset
bugs_event_set = environment.events.filter(ev => ev.output_text != null && ev.output_text.includes("FATAL"))

// check that the bug occurred in this moment's history
print(bugs_event_set.up_to(moment))

// grab a container, similar to in our diagnostic command example
feature_flag_client = environment.containers.list({moment}).at(0)?.name

// Rewind to where you want to tweak history
alternative_timeline = moment.rewind(Time.seconds(0.8)).branch()

// use your feature flag service (e.g. StatSig) to flip a feature flag
bash`siggy gates update my-feature-flag '{ "type": "public" }'`.run({branch: alternative_timeline, container: feature_flag_client})

// wait until a bug occurs, or 10 simulated seconds pass, whichever is sooner
alternative_timeline.wait_until({until: bugs_event_set, timeout: Time.seconds(10)})

// see if the bug did occur
print(bugs_event_set.up_to(alternative_timeline))

Using language-specific debuggers

Multiverse debugging can be used to operate command-line debuggers inside the simulation. If you’re used to a GUI, this may be awkward at first, but we encourage you to give it a shot. Being able to undo, view your command history, or parallelize commands offers a lot of power and usability.

That said, if there are particular tools or abilities you’d like to be made more ergonomic please let us know.

Getting the relevant process id

First, you’ll need to find the right process id (pid) to pass to your external debugger. These steps are almost identical to the ones used to pull a core dump. Usually, the pid you’ll want is that of the failing container’s entrypoint, but you can use this code to search for any pid you need.

// PRE-REQ: you have a 'moment' and 'environment' (likely from our boilerplate)

// grab a container, similar to in our diagnostic command example
client_name = environment.containers.list({moment}).at(0).name

// if you're here because your target process has crashed, rewind to before the crash occurred
debug_moment = moment.rewind(Time.seconds(0.8))

// if you don't know the pid of the program you want to debug, you can use a ps command to see the list of running processes
process_name = "slirp4netns"
ps_command = bash`ps --format pid --no-headers -C ${process_name} | head -n 1`.run({branch: debug_moment.branch(), container: client_name})

// if you have `pgrep` inside your target container, that could be more ergonomic
ps_command = bash`pgrep ${process_name}`.run({branch: debug_moment.branch(), container: client_name})

// then store the pid of interest
pid_to_debug = ps_command.stdout_text

Using JDB

You must have JDWP enabled to use JDB. To do this, set the JAVA_TOOL_OPTIONS environment variable to -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=7896. (The port that JDWP is enabled on, in this case 7896, just needs to be an available port.) This must be set in the relevant containers, at or before the time of invocation.

// PRE-REQ: you have a 'moment' (likely a bug moment from a triage report)

// rewind and spawn a new branch starting shortly before your moment
debug_branch = moment.rewind(Time.seconds(0.3)).branch()
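
// OPTIONAL: sanity-check that the JVM is listening on the JDWP port before attaching
// (assumes netstat is available in the container, as in the diagnostic command recipe)
print(bash`netstat -tln | grep 7896`.run({branch: debug_branch, container: 'kafka-2'}))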

// because bash shells are ephemeral, we need to write jdb commands to a named pipe...
print(bash`mkfifo /dev/jdb_pipe`.run({branch: debug_branch, container: 'kafka-2'}))

// ...whose contents are fed to jdb as its input
// In this case, JDWP is enabled on port 7896
print(bash`jdb -attach 7896 < /dev/jdb_pipe`.run_in_background({branch: debug_branch, container: 'kafka-2'}))

// redirect a long-running sleep into the pipe to keep it open
print(bash`sleep infinity > /dev/jdb_pipe`.run_in_background({branch: debug_branch, container: 'kafka-2'}))

// set a breakpoint on NullPointerException
print(bash`echo 'catch java.lang.NullPointerException' > /dev/jdb_pipe`.run({branch: debug_branch, container: 'kafka-2'}))

// advance time to hit the issue
debug_branch.wait({duration: Time.seconds(0.5)})

// get all local variable output
print(bash`echo 'locals' > /dev/jdb_pipe`.run({branch: debug_branch, container: 'kafka-2'}))

// access object information, field by field
print(bash`echo 'dump getDataRequest.ctx' > /dev/jdb_pipe`.run({branch: debug_branch, container: 'kafka-2'}))

If you want to start with a Java Heap Dump instead

Generating a Java Heap Dump is done using the jcmd command-line tool, which is included in the JDK.

// PRE-REQ: you have a 'moment' (likely a bug moment from a triage report) and an 'environment'

// spawn a branch where you'll play the simulation forward
branch = moment.branch()

// list all processes running in the kafka-2 container along with their pids
print(bash`ps -elf`.run({
   branch: branch,
   container: 'kafka-2'
}))

// since the process we want is the container entrypoint, we're using pid 1; otherwise, use the ps output above to find the relevant pid
heap_dump_pid = 1

// This will generate a heap dump file...
bash`jcmd ${heap_dump_pid} GC.heap_dump /tmp/heap_dump.hprof`.run({branch: branch, container: 'kafka-2'})

// ...which you can download
download(environment.extract_file({
    moment: branch.end,
    path: `/tmp/heap_dump.hprof`,
    container: 'kafka-2'
}));

Using Delve or another Go debugger

In this example, we’re using Delve. You must install Delve on your container image for these commands to work.

// PRE-REQ: you have a 'moment' (likely a bug moment from a triage report) and an 'environment'

// grab a container, similar to in our diagnostic command example
container = environment.containers.list({moment}).at(0).name

// rewind and spawn a new branch starting shortly before your moment
delve_branch = moment.rewind(Time.seconds(2)).branch()
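
// OPTIONAL: confirm Delve is actually installed in the container image before going further
// (assumes the dlv binary is on the container's PATH)
print(bash`dlv version`.run({branch: delve_branch, container }))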

// create a pipe for delve input/output
print(bash`mkfifo /dev/dlv_pipe && ls -l /dev/dlv_pipe`.run({branch: delve_branch, container }))

// define the PID of the process you're debugging
delve_pid = 1

// attach delve to the process in the background
print(bash`dlv attach ${delve_pid} --allow-non-terminal-interactive < /dev/dlv_pipe`.run_in_background({ branch: delve_branch, container }))

// redirect a long-running sleep into the pipe to keep it open
bash`sleep infinity > /dev/dlv_pipe`.run_in_background({branch: delve_branch, container})

// create a breakpoint and continue
bash`echo "break bp path/to/file/file.go:406" > /dev/dlv_pipe`.run({branch: delve_branch, container })
bash`echo "continue" > /dev/dlv_pipe`.run({branch: delve_branch, container })

delve_branch.wait({duration: Time.seconds(2)})

From here, you can:

// print a stack trace
print(bash`echo "stack" > /dev/dlv_pipe`.run({branch: delve_branch, container }))

// move the current frame up
print(bash`echo "up 3" > /dev/dlv_pipe`.run({branch: delve_branch, container }))

// print local variable output
print(bash`echo "locals" > /dev/dlv_pipe`.run({branch: delve_branch, container }))

All available Delve CLI commands can be found here.