Determine dynamic resources depending on retry reason? #65
Comments
Indeed. And then, such messages can be misleading: SLURM occasionally does not recognize node failures, but reports time-outs or similar instead. How do we discriminate between these reasons here? I am open to suggestions. Anything that would be a simple improvement is most welcome!
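A minimal sketch of the kind of lookup this discrimination would build on, assuming the previous attempt's job ID is known (the helper function is made up; `sacct` and its options are standard SLURM, though the reported state is only as reliable as SLURM's own failure detection):

```python
# Sketch only: ask SLURM's accounting database which end state it recorded
# for a job (e.g. TIMEOUT, OUT_OF_MEMORY, NODE_FAIL, FAILED). The function
# is hypothetical and not part of this plugin.
import subprocess

def slurm_end_state(jobid: str) -> str:
    out = subprocess.run(
        ["sacct", "-j", jobid, "--format=State", "--noheader", "--parsable2"],
        capture_output=True, text=True, check=True,
    ).stdout
    lines = [line.strip() for line in out.splitlines() if line.strip()]
    # The first record is the job itself; .batch/.extern steps follow.
    return lines[0] if lines else "UNKNOWN"
```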
Regarding increments, I think it makes sense for the user to determine this? One idea: introducing a variable for the previous attempt's failure reason that can be used in the workflow profile. Perhaps this would require updates to Snakemake as a whole rather than just this plugin? Anyway, users could then design the retry logic around whichever system they are dealing with. Right now, what I've been doing is running my pipeline once, manually re-running it with a longer requested run-time to catch the maybe 5% of TIMEOUT jobs, and then manually re-running it once more on large-memory nodes to get the handful of memory-intensive jobs sorted. It's an okay workflow, but it would be neater if this logic could live in my workflow profile instead. Then again, maybe introducing this adds complexity that won't save people time relative to what it costs to implement and maintain in workflows.
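A purely hypothetical mock-up of what that could look like from the rule side (nothing here exists today; the `retry_reason` name, the rule, the paths, and all values are assumptions of this proposal, not current Snakemake syntax):

```python
# Hypothetical: 'retry_reason' would carry the scheduler state of the
# previous attempt, e.g. "TIMEOUT" or "OUT_OF_MEMORY". Snakemake does not
# currently pass such an argument to resource callables.
rule heavy_step:
    output:
        "results/{sample}.txt",
    resources:
        runtime=lambda wc, attempt, retry_reason=None: (
            1440 if retry_reason == "TIMEOUT" else 240          # minutes
        ),
        mem_mb=lambda wc, attempt, retry_reason=None: (
            64000 if retry_reason == "OUT_OF_MEMORY" else 16000
        ),
    shell:
        "some_tool > {output}"  # placeholder command
```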
Hm, worth a thought! That is, of course, unacceptable. The question is, though, whether there are issues at hand which can be prevented. It would merit further investigation.
I too would be interested in having access to such a variable.
Example of what I'm currently doing in one of my workflows:
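Roughly along these lines, as a sketch (assuming Snakemake ≥ 8 with this SLURM executor plugin; the partition name and numbers are placeholders):

```yaml
# Workflow profile, e.g. profiles/default/config.yaml (illustrative values)
executor: slurm
retries: 2                     # re-submit failed jobs up to two more times
default-resources:
  slurm_partition: "standard"  # placeholder partition name
  runtime: 120                 # minutes
  mem_mb: 8000
```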
Then, in a rule that tends to fail for two types of issues, depending on what part of the genome it is currently dealing with (out of memory or timeout):
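Something like this, using the standard attempt-aware resource callables (rule name, paths, numbers, and the command are placeholders):

```python
rule genotype_region:
    input:
        "mapped/{region}.bam",
    output:
        "calls/{region}.vcf.gz",
    resources:
        # Both limits grow with every retry, even though any given failure
        # usually only needs one of them increased.
        runtime=lambda wildcards, attempt: 240 * attempt,   # minutes
        mem_mb=lambda wildcards, attempt: 16000 * attempt,
    shell:
        "some_caller {input} > {output}"  # placeholder command
```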
I need to increase both the time and memory requirements, which is a pity when the failure is due to memory, because the job then ends up requesting lower-priority QOS and partitions. And when the cause is a timeout, it ends up (under-)using large amounts of memory for a long time, which is a pity for the other cluster users.
Currently it is possible to adjust resources based on which `attempt` of a job is being run. Jobs fail for different reasons, and timing out requires a different solution than running out of memory. A neat feature would be to allow adjusting resources based on whether previous attempts failed with `TIMEOUT` or `OUT_OF_MEMORY`.