Determine dynamic resources depending on retry reason? #65

Open
nikostr opened this issue Apr 11, 2024 · 5 comments

Labels: discussion (ongoing discussion, give your input)

Comments

nikostr commented Apr 11, 2024

Currently it is possible to adjust resources based on which attempt of a job is being run. Jobs fail for different reasons, and timing out calls for a different remedy than running out of memory. A neat feature would be the ability to adjust resources based on whether previous attempts failed with TIMEOUT or OUT_OF_MEMORY.

cmeesters added the discussion label on Apr 11, 2024
cmeesters (Member) commented

Indeed. Yet while OUT_OF_MEMORY seems easy to handle (just allow for more memory), it is not that easy: what is a sensible increment? For whole-node jobs, for instance, it might make sense to request the next bigger node type (memory-wise) in the cluster. Which one is that, and how can Snakemake get this information?

For TIMEOUT the difficulty is elsewhere. It can be caused by

  • not making use of the fs-plugin, hence I/O contention
  • asking for too few threads
  • inappropriate parameterization of an application
  • in the case of user code, a lack of optimization

And then, such messages can be misleading: SLURM occasionally does not recognize node failures but reports time-outs or similar. How do we discriminate between these causes?

I am open to suggestions here. Anything that would be a simple improvement is most welcome!
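
For what it is worth, the reason SLURM itself reports for a finished job can at least be read back with sacct. A rough sketch of that (not part of the plugin; the helper name is made up):

import subprocess

def slurm_job_state(jobid: str) -> str:
    """Return the State sacct reports for the main record of a job,
    e.g. TIMEOUT, OUT_OF_MEMORY, NODE_FAIL, FAILED, COMPLETED.
    """
    result = subprocess.run(
        ["sacct", "-j", jobid, "-X", "--noheader", "--parsable2", "--format=State"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip().splitlines()[0]

Whether such a state can be trusted in the node-failure cases mentioned above is exactly the open question.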

nikostr (Author) commented Apr 12, 2024

Regarding increments, I think it makes sense for the user to determine this. One idea: exposing a variable retry_reason to users for dynamic resource allocation could allow for something like

constraint: 'fat' if retry_reason=="oom" else ''

in the workflow profile. Perhaps this would require changes to Snakemake as a whole rather than just this plugin? Users could then design the retry logic for whichever system they are dealing with.

Right now what I've been doing is running my pipeline once, manually re-running it with a longer requested runtime to catch the maybe 5% TIMEOUT jobs, and then manually re-running it once more requesting large-memory nodes to get the handful of memory-intensive jobs sorted. It's an okay workflow, but it would be neater if this logic could live in my workflow profile instead, haha. But yeah, maybe introducing this just adds complexity that won't save people more time than it costs to implement and maintain in workflows.
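
For illustration, a resource callable along those lines could look roughly like this in a Snakefile. This is only a sketch: retry_reason is a hypothetical keyword that Snakemake does not pass to resource callables today (with the default of None the branches below simply never trigger), and 'fat' is a made-up constraint name:

rule heavy_step:
    output:
        "results/{sample}.done"
    resources:
        # Sketch only: retry_reason is hypothetical; Snakemake currently passes
        # wildcards, input, threads and attempt to resource callables.
        constraint=lambda wildcards, attempt, retry_reason=None: (
            "fat" if retry_reason == "oom" else ""
        ),
        runtime=lambda wildcards, attempt, retry_reason=None: (
            60 * 24 if retry_reason == "timeout" else 120
        ),
    shell:
        "touch {output}"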

cmeesters (Member) commented Apr 12, 2024

Hm, worth a thought!

That

... once, manually re-running ... to catch the maybe 5% TIMEOUT jobs, and manually re-running it once again ...

is, of course, unacceptable. The question, though, is whether the underlying issues can be prevented. This would merit further investigation.

blaiseli commented Sep 23, 2024

I too would be interested in having access to a retry_reason variable in the functions I use to adapt resources to what happened during previous attempts. I can increase memory or runtime based on the number of attempts, but it is a waste of resources (and of time spent in the queue) if my job asks for more memory after a TIMEOUT.

blaiseli commented

Example of what I'm currently doing in one of my workflows:

longpart = config.get("longpart", "long")

# [...]

def make_resource_setter(*res_values):
    """
    Make a function that sets a resource according to an attempt number.

    `res_values` should be the values to use for attempts 1, 2, 3, etc.
    in that order.
    If the attempt number is higher than the number of values provided,
    the last one will be used.

    The returned function will take wildcards, attempt and threads as
    arguments, in that order, so that it can be used in the resources
    section of a snakemake rule.
    """
    attempt2resource = dict(enumerate(res_values, start=1))
    def set_resource(wildcards, attempt, threads):
        return attempt2resource.get(attempt, res_values[-1])
    return set_resource

Then, in a rule that tends to fail for two types of issues (out of memory or timeout), depending on which part of the genome it is currently processing:

    resources:
        # QOS long is only available on partition long, and can be used for jobs with runtime up to 365d
        # QOS normal can be used for jobs with runtime up to 1d (1440m)
        # QOS fast can be used for jobs with runtime up to 2h (120m)
        # QOS ultrafast can be used for jobs with runtime up to 5m
        mem_mb=make_resource_setter(1024 * 4, 1024 * 16, 1024 * 32, 1024 * 64),
        runtime=make_resource_setter(120, 60 * 24, 60 * 24 * 60, 60 * 24 * 365),
        slurm_extra=make_resource_setter("'--qos=fast'", "'--qos=normal'", f"'--qos=long' '--partition={longpart}'", f"'--qos=long' '--partition={longpart}'"),

I need to increase both the time and the memory requirement, which is a pity when the failure is due to memory, because the job ends up requesting a lower-priority QOS and partition. And when the cause is a timeout, the job ends up (under-)using large amounts of memory for a long time, which is a pity for the other cluster users.
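
If some form of retry_reason were exposed to these callables, the setter above could be split so that only the relevant resource grows. A sketch along those lines, with retry_reason again being purely hypothetical:

def make_reason_aware_setter(base, on_oom, on_timeout):
    """
    Like make_resource_setter, but keyed on a (hypothetical) failure reason
    instead of the attempt number.
    """
    def set_resource(wildcards, attempt, threads, retry_reason=None):
        # retry_reason is not a real Snakemake argument today; with the
        # default of None this always returns the base value.
        if retry_reason == "oom":
            return on_oom
        if retry_reason == "timeout":
            return on_timeout
        return base
    return set_resource

# Only grow memory after an OOM, only grow runtime after a TIMEOUT:
#     mem_mb=make_reason_aware_setter(1024 * 4, 1024 * 64, 1024 * 4),
#     runtime=make_reason_aware_setter(120, 120, 60 * 24 * 60),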
