Add example PBS watcher script for Aurora hangs, possibly in GettingStarted repo? #783

Open
felker opened this issue Mar 12, 2025 · 0 comments
Labels
content Improvements or additions to documentation content

felker commented Mar 12, 2025

See https://docs.alcf.anl.gov/aurora/known-issues/#hangs

> There are multiple failure modes that can lead to jobs hanging. For known hardware or low-level software issues such as ping failures as discussed above, just restart the job.
>
> To avoid a hung job running out all the requested wallclock time on all its nodes, we suggest devising ways to monitor job progress. For example, if your application regularly writes small output to a logfile, then you could launch a “watcher” script that looks for that expected output and collects a stack trace and kills the job if it's been too long since progress was made. Please engage your Catalyst POC if you are interested in evaluating this for your application.

I have received such an example script from Intel, but it is not generic. I plan to eventually modularize it and make it application-agnostic, but if someone else already has such an example, it would be good to share it.
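
To make the request concrete, here is a minimal, application-agnostic sketch of the watcher idea described in the docs. It is not the Intel script and has not been tested on Aurora. It assumes the application touches a single logfile whenever it makes progress, that the watcher runs inside the PBS job (so `PBS_JOBID` is set and `qdel` is on the path), and that the stack-trace step is supplied by the user as a command. The function names (`seconds_since_update`, `watch`, `handle_hang`, `main`) and the default timeout values are hypothetical.

```python
import os
import subprocess
import time


def seconds_since_update(path, now=None):
    """Return seconds since `path` was last modified, or None if it is missing."""
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return None
    now = time.time() if now is None else now
    return max(0.0, now - mtime)


def watch(logfile, timeout_s, poll_s, on_hang,
          is_done=lambda: False, sleep=time.sleep, clock=time.time):
    """Poll `logfile` and call `on_hang()` once if it goes stale.

    Returns the observed staleness in seconds when a hang is declared,
    or None if `is_done()` reports that the application finished first.
    """
    start = clock()
    while True:
        if is_done():
            return None
        age = seconds_since_update(logfile, now=clock())
        if age is None:
            # Logfile not created yet: measure from when the watcher started.
            age = clock() - start
        if age > timeout_s:
            on_hang()
            return age
        sleep(poll_s)


def handle_hang(trace_cmd, kill_cmd, run=subprocess.run):
    """Optionally collect a stack trace, then kill the job."""
    if trace_cmd:
        # Assumption: the user supplies whatever trace command suits the
        # application; a failure here must not prevent the kill below.
        run(trace_cmd, check=False)
    run(kill_cmd, check=False)


def main(logfile="app.log", timeout_s=1800, poll_s=60, trace_cmd=None):
    """Watch `logfile` from inside a PBS job and qdel the job on a hang."""
    job_id = os.environ["PBS_JOBID"]  # set by PBS inside a running job
    return watch(logfile, timeout_s, poll_s,
                 on_hang=lambda: handle_hang(trace_cmd, ["qdel", job_id]))
```

One possible way to use this is to start a small driver that calls `main("app.log")` in the background from the job script, just before launching the application. The timeout should be set comfortably above the longest expected gap between log writes, so that slow but healthy phases such as initialization or checkpointing are not mistaken for a hang.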

felker added the `content` label (Improvements or additions to documentation content) on Mar 13, 2025