Add example PBS watcher script for Aurora hangs, possibly in GettingStarted repo? #783

felker · 2025-03-12T03:27:07Z

See https://docs.alcf.anl.gov/aurora/known-issues/#hangs

There are multiple failure modes that can lead to jobs hanging. For known hardware or low-level software issues such as ping failures as discussed above, just restart the job.

To avoid a hung job running out all the requested wallclock time on all its nodes, we suggest devising ways to monitor job progress. For example, if your application regularly writes small output to a logfile, then you could launch a “watcher” script that looks for that expected output and collects a stack trace and kills the job if it's been too long since progress was made. Please engage your Catalyst POC if you are interested in evaluating this for your application.
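As a starting point for discussion, here is a minimal, application-agnostic sketch of the watcher idea quoted above. It is not the Intel script. It assumes the application touches a single logfile at regular intervals and that the watcher runs inside the PBS job, so `PBS_JOBID` is set and `qdel` is on the path. The function names (`seconds_since_update`, `is_stalled`, `watch`) and the `on_stall` hook are hypothetical, and stack-trace collection is left to whatever command the user supplies.

```python
import os
import subprocess
import time


def seconds_since_update(path, now=None):
    """Return seconds since `path` was last modified, or None if it does not exist yet."""
    now = time.time() if now is None else now
    try:
        return now - os.path.getmtime(path)
    except FileNotFoundError:
        return None


def is_stalled(path, timeout_s, started_at, now=None):
    """Return True if the logfile has gone more than `timeout_s` seconds without an update.

    If the logfile does not exist yet, the time elapsed since `started_at` is used instead.
    """
    now = time.time() if now is None else now
    age = seconds_since_update(path, now)
    if age is None:
        age = now - started_at
    return age > timeout_s


def watch(path, timeout_s, poll_s=60, on_stall=None):
    """Poll `path` until it stalls, then run the optional hook and delete the PBS job."""
    started_at = time.time()
    while not is_stalled(path, timeout_s, started_at):
        time.sleep(poll_s)
    if on_stall:
        # Hypothetical hook: a user-supplied shell command, e.g. one that
        # gathers stack traces from the application ranks before the job dies.
        subprocess.run(on_stall, shell=True, check=False)
    job_id = os.environ.get("PBS_JOBID")  # set by PBS inside a running job
    if job_id:
        subprocess.run(["qdel", job_id], check=False)
```

The job script would start this in the background before launching the application, for example by calling `watch("app.log", timeout_s=1800)` from a small driver. The timeout needs to be comfortably longer than the slowest legitimate gap between log writes, otherwise a healthy job gets killed.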

I have received such an example script from Intel, but it is not generic. I eventually plan to modularize it and make it application-agnostic, but if someone else already has such an example, it would be good to share it.

felker added the `content` label (Improvements or additions to documentation content) on Mar 13, 2025
