Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow certain exit codes at user's discretion #67

Open
jackbaker1001 opened this issue May 30, 2023 · 6 comments
Open

Allow certain exit codes at user's discretion #67

jackbaker1001 opened this issue May 30, 2023 · 6 comments
Labels
feature New feature or functionality

Comments

@jackbaker1001
Copy link

jackbaker1001 commented May 30, 2023

What should we add?

For a number of reasons, some code you are running with srun may exit with an exit code other than 0 (0 indicates "success", non-zero indicates certain categories of failure) although from the standpoint of the user, nothing actually went wrong, and usable output was produced. This can happen, for example, in quantum espresso (pw.x) if self-consistency isn't met. In some select situations, we may not care about this and wish to carry on anyways.

Presently, however, this non-zero exit code is caught by the slurm executor from STDERR and the calculation crashes.

I suggest we implement "allowed exit codes" other than just 0. This is intended for advanced users and we should consider throwing a "are you sure you know what you're doing?" warning if this input variable is set as it can lead to unexpected beahviour.

In bash, to ignore a certain exit code (say exit code 3), we can use:

bash -c '(srun pw.x -npool %d -ndiag 1 -input PREFIX.pwi 2>stderr.log 1>PREFIX.pwo || exitcode=$?) && [[ $exitcode -eq 3 ]] && exitcode=0

if [[ $exitcode -eq 3 ]]; then
  >&2 cat stderr.log
fi

exit $exitcode'

where the above example is on Perlmutter running the GPU version of quantum expresso (pw.x).

We can implement something along these lines for srun options.

One thing that isn't clear to me is whether or not this is a slurm executor addition or a core covalent addition.

Describe alternatives you've considered.

You can set use_srun=False and manually define your srun behaviour as the above script.

@jackbaker1001 jackbaker1001 added the feature New feature or functionality label May 30, 2023
@Andrew-S-Rosen
Copy link
Contributor

Andrew-S-Rosen commented May 30, 2023

Interesting. I could see the use case although I personally haven't run into this myself because I am almost always running quantum chemistry codes with Python wrappers that automatically capture the exit code and then decide if an error should be raised (so, one abstraction layer higher if that makes sense).

In the end, if the user is launching the executable with a Python subprocess, this never really has to be an issue. But if they are calling the executable directly in the Slurm script I could see it coming up.

@jackbaker1001
Copy link
Author

Where I ran into this is using ASE with quantum espresso (although I have seen it in other codes also).

Typically, in ASE, one would have an environment variable set:

os.environ["ASE_ESPRESSO_COMMAND"] = "srun pw.x -npool %d -ndiag 1 -input PREFIX.pwi > PREFIX.pwo" % num_nodes

Finishing this command with an exit code other then 0 will then cause Covalent to crash out. You can get around this by doing the below instead:

script = r'''bash -c '(srun pw.x -npool %d -ndiag 1 -input PREFIX.pwi 2>stderr.log 1>PREFIX.pwo || exitcode=$?) && [[ $exitcode -eq 3 ]] && exitcode=0

if [[ $exitcode -eq 3 ]]; then
  >&2 cat stderr.log
fi

exit $exitcode'
''' % (num_nodes)

os.environ["ASE_ESPRESSO_COMMAND"] = script

but it's a a bit of a hack. The error codes could be handled in the ASE internals but the standard user isn't going to want to do that. From a UX standpoint, we may wish to do something about all of this.

@Andrew-S-Rosen
Copy link
Contributor

Ah! You said the magic words. I totally get what you mean now! I'm 100% on board.

@Andrew-S-Rosen
Copy link
Contributor

As a side-comment: does that exit code issue not make the ASE calculator not useful as-is? Perhaps put another way, would things work fine without Covalent but not with Covalent? I'm just curious.

@jackbaker1001
Copy link
Author

The default behaviour of ASE is to allow non-zero exit codes for the wrapped DFT binaries. So, it works fine without covalent but not with!

@Andrew-S-Rosen
Copy link
Contributor

Andrew-S-Rosen commented Jun 28, 2023

Ah, I realized another related scenario where this comes up (for me at least). A lot of the workflows I run have some sort of error-handling routine associated with them, such that if the executable errors out, the error-handling routine will try to fix it and re-launch the executable rather than cancel the Slurm job entirely. That's the basis behind the Custodian package we use with VASP. For the kinds of calculations I do, I don't want Covalent to care about the exit code in a subprocess.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New feature or functionality
Projects
None yet
Development

No branches or pull requests

2 participants