
Debugging

Nicole Slattengren edited this page Aug 16, 2024 · 11 revisions

Debugging parallel programs can be a significant challenge. Here are some possible approaches when working with vt.

rr

rr is a lightweight tool for recording, replaying and debugging the execution of applications, using gdb as its debugging front end. It automatically records all processes forked by the initial process (e.g. when mpiexec is used). Originally developed at Mozilla, rr is a mature open source solution.

Example:

  • record an execution
$ rr record mpiexec -n 2 ../build_vt/vt/examples/hello_world/hello_world
rr: Saving execution to trace directory `/home/cz4rs/.local/share/rr/mpiexec-2'.
vt: Runtime Initializing: mode: single-thread per rank
(...)
  • list all processes recorded
$ rr ps
PID	PPID	EXIT	CMD
139799	--	0	mpiexec -n 2 ../build_vt/vt/examples/hello_world/hello_world
139800	139799	0	/usr/bin/hydra_pmi_proxy --control-port rysy:39927 --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
139801	139800	0	../build_vt/vt/examples/hello_world/hello_world
139802	139800	0	../build_vt/vt/examples/hello_world/hello_world
  • replay and debug selected process
$ rr replay -p 139802
  • we land in a familiar gdb environment, let's set a breakpoint
Reading symbols from /home/cz4rs/.local/share/rr/mpiexec-2/mmap_hardlink_112_hello_world...
(...)
0x00007f6f56736100 in ?? () from /lib64/ld-linux-x86-64.so.2
(rr) b hello_world.cc:55
Breakpoint 1 at 0x55b5faf9826e: file /home/cz4rs/nga/vt/examples/hello_world/hello_world.cc, line 55.
(rr) c
Continuing.
vt: Runtime Initializing: mode: single-thread per rank
(...)

Breakpoint 1, hello_world (msg=0x55b5fd8d8868)
    at /home/cz4rs/nga/vt/examples/hello_world/hello_world.cc:55
55	  fmt::print("{}: Hello from node {}\n", this_node, msg->from);
  • all basic gdb commands work as expected
(rr) p this_node
$1 = 1
(rr) watch -l this_node
Hardware watchpoint 2: -location this_node
  • and we can use reverse execution to track down problems efficiently
(rr) reverse-continue 
Continuing.

Hardware watchpoint 2: -location this_node

Old value = 1
New value = 0
hello_world (msg=0x55b5fd8d8868) at /home/cz4rs/nga/vt/examples/hello_world/hello_world.cc:54
54	  vt::NodeType this_node = vt::theContext()->getNode();
(rr) ...

See more at the Usage page.
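
With many ranks, picking the application PIDs out of the `rr ps` listing by hand gets tedious. The following is a minimal sketch that filters the listing by executable path. The helper name `rr_pids_for` is hypothetical, and the sketch assumes the listing is tab-separated with the columns shown above (PID, PPID, EXIT, CMD).

```shell
# Print the PIDs of recorded processes whose command starts with the given
# executable path. Reads `rr ps` output from stdin.
# Assumption: columns are tab-separated in the order PID, PPID, EXIT, CMD.
rr_pids_for() {
  awk -F'\t' -v exe="$1" '
    NR > 1 {                      # skip the header row
      split($4, words, " ")       # the first word of CMD is the executable
      if (words[1] == exe) print $1
    }'
}

# Usage, after a trace has been recorded:
#   rr ps | rr_pids_for ../build_vt/vt/examples/hello_world/hello_world
#   rr replay -p <one of the printed PIDs>
```

The launcher processes (mpiexec and its proxy) are skipped because their command starts with a different executable.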

run_vt.pl script

The script launches debugger instances in separate xterm windows and pauses execution of the program using --vt_pause.

--vt_pause argument

Non-interactive gdb at scale

If your bug only reproduces at scale on an HPC system, you might try running gdb non-interactively. Your command line would look something like:

mpiexec <mpiexec-args> <wrapio-script> gdb -x <gdb-command-file> -batch <app-executable>

Here, <wrapio-script> is a script that redirects the output of each MPI process to its own file. For Open MPI, it would be:

#!/bin/bash
file="output-${OMPI_COMM_WORLD_SIZE}-${OMPI_COMM_WORLD_RANK}.log"
# for stdout to be line buffered
stdbuf -oL "$@" &>"${file}"

For Slurm, it would be:

#!/bin/bash
file="output-${SLURM_NPROCS}-${SLURM_PROCID}.log"
# for stdout to be line buffered
stdbuf -oL "$@" &>"${file}"
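
If the same wrapper has to work under both launchers, the two variants can be merged. This is only a sketch: the function names `rank_log_name` and `wrapio` and the `unknown` fallback are choices made for this example, while the environment variables are the ones used in the scripts above.

```shell
#!/bin/bash
# Build a per-rank log file name from whichever launcher variables are set.
# Open MPI's variables are tried first, then Slurm's. "unknown" is a
# placeholder used by this sketch when neither launcher set them.
rank_log_name() (
  size="${OMPI_COMM_WORLD_SIZE:-${SLURM_NPROCS:-unknown}}"
  rank="${OMPI_COMM_WORLD_RANK:-${SLURM_PROCID:-unknown}}"
  echo "output-${size}-${rank}.log"
)

# Run the wrapped command with line-buffered stdout, sending stdout and
# stderr to this rank's log file.
wrapio() {
  stdbuf -oL "$@" >"$(rank_log_name)" 2>&1
}

# When used as the wrapper script, forward the script's arguments.
if [ "$#" -gt 0 ]; then
  wrapio "$@"
fi
```

Quoting `"$@"` keeps arguments that contain spaces intact when they are passed on to the wrapped command.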

Your <gdb-command-file> lists the gdb commands that you want to run, e.g.:

catch throw
commands
backtrace
list
continue
end
catch signal SIGABRT
commands
backtrace
list
end
catch signal SIGSEGV
commands
backtrace
list
end
run <app-args>

where <app-args> are the command-line arguments for your executable.
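
After a large run, most of the per-rank logs will contain nothing of interest. One way to find the ranks where a catchpoint fired is to look for backtrace frames, which gdb prints on lines starting with `#0`. The sketch below assumes the `output-<size>-<rank>.log` naming from the wrapper scripts above; the helper name `logs_with_backtrace` is hypothetical.

```shell
# List the per-rank log files that contain at least one gdb backtrace.
# Takes the directory holding the logs as an optional argument
# (defaults to the current directory).
logs_with_backtrace() {
  # Frame 0 of every gdb backtrace is printed on a line starting with "#0".
  grep -l '^#0 ' "${1:-.}"/output-*.log 2>/dev/null
}

# Usage:
#   logs_with_backtrace /path/to/run/directory
```

An application that itself prints lines starting with `#0` would produce false positives, so treat the result as a starting point only.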

Non-interactive lldb in parallel

To print a backtrace for a thrown exception in a parallel unit test on a Mac laptop, first create a command file like the one below, where <app-args> can be something like --gtest_filter="TestLoadBalancerNoWork.test_load_balancer_no_work".

break set -n __cxa_throw
break command add
bt
DONE
run <app-args>
q

Then, run with a command-line like the one below:

mpirun <mpirun-args> lldb -s <lldb-command-file> <app-executable>

All processes write to your terminal simultaneously, so their output is interleaved. A wrapper script like the one in the previous section could probably be used to send the output of each process to a separate file.

Valgrind

address sanitizer

To build with AddressSanitizer (ASan), set one of the following configuration variables: vt_asan_enabled (CMake), VT_ASAN_ENABLED (build script) or VT_ASAN (Docker).
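
As a rough sketch of how this might be used with several ranks: the CMake line assumes the variable is passed as an ordinary `-D` cache entry, and `<vt-source-dir>` is a placeholder. The helper `with_asan_logs` is hypothetical. It relies on the sanitizer runtime option `log_path`, which makes each process write its report to `<prefix>.<pid>` instead of stderr, so reports from different ranks do not interleave.

```shell
# Configure a build with AddressSanitizer (assumed invocation):
#   cmake -Dvt_asan_enabled=ON <vt-source-dir>

# Run a command with ASan reports written to per-process files named
# asan.<pid>. The "asan" prefix is this sketch's choice. Options already
# present in ASAN_OPTIONS are kept and appended after log_path.
with_asan_logs() {
  ASAN_OPTIONS="log_path=asan${ASAN_OPTIONS:+:$ASAN_OPTIONS}" "$@"
}

# Usage:
#   with_asan_logs mpiexec -n 2 <app-executable> <app-args>
```

Whether the variable reaches every rank depends on how the launcher propagates the environment, so check for the `asan.<pid>` files after a run.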

TotalView

Arm DDT (Allinea DDT)

heaptrack

heaptrack is a simple heap memory profiler with low time and memory overhead.

