There are multiple failure modes that can lead to jobs hanging. For known hardware or low-level software issues such as ping failures as discussed above, just restart the job.
To keep a hung job from consuming all of its requested wallclock time on every node, we suggest devising ways to monitor job progress. For example, if your application regularly writes small output to a logfile, you could launch a “watcher” script that looks for that expected output and, if too much time has passed since progress was last made, collects a stack trace and kills the job. Please engage your Catalyst POC if you are interested in evaluating this for your application.
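As a starting point, here is a minimal, application-agnostic sketch of the watcher idea described above. It is not the Intel script and has not been validated on Aurora. The function names (`seconds_since_update`, `collect_stack_trace`, `kill_job`, `watch`) are hypothetical, and it assumes that progress shows up as a changing modification time on the logfile, that `gdb` is available and permitted to attach to the application process, and that the job runs under PBS so that `qdel` can remove it.

```python
import os
import subprocess
import time


def seconds_since_update(logfile, now=None):
    """Return seconds since `logfile` was last modified (inf if it is missing)."""
    now = time.time() if now is None else now
    try:
        return now - os.path.getmtime(logfile)
    except OSError:
        return float("inf")


def collect_stack_trace(pid):
    """Assumption: gdb is installed and allowed to attach to process `pid`."""
    result = subprocess.run(
        ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def kill_job(job_id):
    """Assumption: the job runs under PBS, so qdel removes it."""
    subprocess.run(["qdel", job_id], check=False)


def watch(logfile, timeout_s, on_stall, poll_s=60,
          sleep=time.sleep, clock=time.time):
    """Poll `logfile` and call `on_stall()` once it has been idle too long.

    Idle time is capped by how long the watcher itself has been running,
    so a logfile that is missing or stale at startup gets a grace period
    of `timeout_s` before the job is declared hung.
    """
    started = clock()
    while True:
        now = clock()
        idle = min(seconds_since_update(logfile, now), now - started)
        if idle > timeout_s:
            break
        sleep(poll_s)
    on_stall()
```

One way to use it is to start the watcher in the background from the job script and pass an `on_stall` callback that first saves the output of `collect_stack_trace(pid)` to a file and then calls `kill_job(os.environ["PBS_JOBID"])`. The timeout should be comfortably longer than the slowest expected gap between log writes, since a watcher that fires too early kills a healthy job.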
I have received such an example script from Intel, but it is not generic. I plan to eventually modularize it and make it application-agnostic, but if anyone else has such an example, it would be good to share it.
See https://docs.alcf.anl.gov/aurora/known-issues/#hangs