Debugging
Debugging parallel programs can be challenging. Here are some possible approaches when working with vt.
rr is a lightweight tool for recording, replaying and debugging the execution of applications, built on top of gdb. It automatically records all processes forked by the initial process (e.g. when mpiexec is used). Originally developed at Mozilla, rr is a mature open-source solution.
Example:
- record an execution
$ rr record mpiexec -n 2 ../build_vt/vt/examples/hello_world/hello_world
rr: Saving execution to trace directory `/home/cz4rs/.local/share/rr/mpiexec-2'.
vt: Runtime Initializing: mode: single-thread per rank
(...)
- list all processes recorded
$ rr ps
PID PPID EXIT CMD
139799 -- 0 mpiexec -n 2 ../build_vt/vt/examples/hello_world/hello_world
139800 139799 0 /usr/bin/hydra_pmi_proxy --control-port rysy:39927 --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
139801 139800 0 ../build_vt/vt/examples/hello_world/hello_world
139802 139800 0 ../build_vt/vt/examples/hello_world/hello_world
- replay and debug selected process
$ rr replay -p 139802
- we land in a familiar gdb environment; let's set a breakpoint
Reading symbols from /home/cz4rs/.local/share/rr/mpiexec-2/mmap_hardlink_112_hello_world...
(...)
0x00007f6f56736100 in ?? () from /lib64/ld-linux-x86-64.so.2
(rr) b hello_world.cc:55
Breakpoint 1 at 0x55b5faf9826e: file /home/cz4rs/nga/vt/examples/hello_world/hello_world.cc, line 55.
(rr) c
Continuing.
vt: Runtime Initializing: mode: single-thread per rank
(...)
Breakpoint 1, hello_world (msg=0x55b5fd8d8868)
at /home/cz4rs/nga/vt/examples/hello_world/hello_world.cc:55
55 fmt::print("{}: Hello from node {}\n", this_node, msg->from);
- all basic gdb commands work as expected
(rr) p this_node
$1 = 1
(rr) watch -l this_node
Hardware watchpoint 2: -location this_node
- and we can use reverse execution to track down problems efficiently
(rr) reverse-continue
Continuing.
Hardware watchpoint 2: -location this_node
Old value = 1
New value = 0
hello_world (msg=0x55b5fd8d8868) at /home/cz4rs/nga/vt/examples/hello_world/hello_world.cc:54
54 vt::NodeType this_node = vt::theContext()->getNode();
(rr) ...
See more at the Usage page.
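When a trace contains launcher processes as well as application ranks, picking the right PID out of the `rr ps` table by hand gets tedious. The sketch below filters that table by executable name. It assumes the column layout shown in the listing above (PID, PPID, EXIT, CMD); `rank_pids` is a hypothetical helper name, not part of rr.

```shell
#!/bin/bash
# Hypothetical helper (not part of rr): read `rr ps` output on stdin and print
# the PIDs of recorded processes whose executable has the given name.
# Assumes the PID / PPID / EXIT / CMD column layout shown above.
rank_pids() {
  local exe="$1"
  awk -v exe="$exe" '
    NR > 1 {                      # skip the header row
      n = split($4, parts, "/")   # $4 is the first word of CMD (the executable path)
      if (parts[n] == exe) print $1
    }'
}

# Usage, assuming a trace recorded as in the example above:
#   rr ps | rank_pids hello_world
#   rr replay -p <one-of-the-printed-pids>
```

Matching on the last path component keeps the launcher line out of the result, even though the application path also appears there as an argument to mpiexec.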
The script launches debugger instances in separate xterm windows and pauses execution of the program using --vt_pause.
If your bug is only reproducible at scale on an HPC system, you might try running gdb non-interactively. Your command line would look something like:
mpiexec <mpiexec-args> <wrapio-script> gdb -x <gdb-command-file> -batch <app-executable>
Here, <wrapio-script> is a script that writes the output of each MPI process to a separate file. For OpenMPI, it would be:
#!/bin/bash
file="output-${OMPI_COMM_WORLD_SIZE}-${OMPI_COMM_WORLD_RANK}.log"
# for stdout to be line buffered
stdbuf -oL "$@" &> "${file}"
For Slurm, it would be:
#!/bin/bash
file="output-${SLURM_NPROCS}-${SLURM_PROCID}.log"
# for stdout to be line buffered
stdbuf -oL "$@" &> "${file}"
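If the same job is launched with both OpenMPI and Slurm, the two wrappers can be folded into one. The following is a minimal sketch under that assumption; it consults only the environment variables already used above, and the helper name `rank_log_name` and the PID-based fallback file name are illustrative choices rather than anything vt provides.

```shell
#!/bin/bash
# Sketch of a launcher-agnostic variant of the wrapper scripts above.
# Only the environment variables used in those scripts are consulted;
# other launchers would need their own branch.
rank_log_name() {
  if [ -n "${OMPI_COMM_WORLD_RANK:-}" ]; then
    echo "output-${OMPI_COMM_WORLD_SIZE}-${OMPI_COMM_WORLD_RANK}.log"
  elif [ -n "${SLURM_PROCID:-}" ]; then
    echo "output-${SLURM_NPROCS}-${SLURM_PROCID}.log"
  else
    # Assumed fallback when neither launcher is detected: key the file on the PID.
    echo "output-unknown-$$.log"
  fi
}

# Run the wrapped command with line-buffered stdout, sending stdout and
# stderr to this rank's log file. Does nothing when no command is given.
if [ "$#" -gt 0 ]; then
  stdbuf -oL "$@" > "$(rank_log_name)" 2>&1
fi
```

Quoting `"$@"` keeps arguments that contain spaces intact when they are passed through to the wrapped command.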
Your <gdb-command-file> lists the gdb commands that you want to run, e.g.:
catch throw
commands
backtrace
list
continue
end
catch signal SIGABRT
commands
backtrace
list
end
catch signal SIGSEGV
commands
backtrace
list
end
run <app-args>
where <app-args> are the command-line arguments for your executable.
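With one log file per rank, a useful first pass is to find which ranks hit a catchpoint at all. The sketch below assumes the logs were written by a wrapper script that names them `output-<size>-<rank>.log`, as in the examples above; the helper name `ranks_matching` and the example pattern are illustrative.

```shell
#!/bin/bash
# Hypothetical helper: list the per-rank log files (output-*.log, as named by
# the wrapper scripts above) in a directory that contain a given pattern.
# The directory defaults to the current one.
ranks_matching() {
  local pattern="$1" dir="${2:-.}"
  # -l prints only the names of matching files; -E enables extended regexes.
  grep -lE -- "$pattern" "$dir"/output-*.log 2>/dev/null | sort
}

# Usage, e.g. to see which ranks mention a segmentation fault or an abort:
#   ranks_matching 'SIGSEGV|SIGABRT'
```

The backtrace for each listed rank can then be read from its log file directly.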
If you need to print a backtrace for a thrown exception in a parallel unit test on a Mac laptop, first create an lldb command file like the one below, where <app-args> can be something like --gtest_filter="TestLoadBalancerNoWork.test_load_balancer_no_work".
break set -n __cxa_throw
break command add
bt
DONE
run <app-args>
q
Then, run with a command line like the one below:
mpirun <mpirun-args> lldb -s <lldb-command-file> <app-executable>
You will get output from all processes in your terminal simultaneously. A wrapper script like the one in the previous section could probably be used to send the output of each process to a separate file.
You can use the vt_asan_enabled (CMake), VT_ASAN_ENABLED (build script) or VT_ASAN (Docker) configuration variables to enable building with AddressSanitizer.
heaptrack is a simple heap memory profiler with low time and memory overhead.
- see checkpoint#180 for actual use (diagnosing a bug in checkpoint)
- comparison with Valgrind's massif