Add resubmission function #203

Open
mschubert opened this issue Jun 25, 2020 · 2 comments
@mschubert (Owner)

Originally posted in #153 by @nick-youngblut:

I'm a big fan of snakemake, which allows automatic resubmission of failed jobs with increased resources (e.g., mem = lambda wildcards, threads, attempt: attempt * 8 # GB of memory, scaled by the attempt number). It would be really awesome to have that feature in clustermq. For example, one could provide a function instead of a value in the template:

tmpl = list(job_mem = function(attempt) attempt * 8)
fx = function(x, y) x * 2 + y
Q(fx, x=1:10, const=list(y=10), n_jobs=10, job_size=1, template=tmpl)

One would also need a max_attempts parameter for Q().
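
For concreteness, the call above might then become something like this (max_attempts being the proposed, not yet existing, argument):

# hypothetical API: job_mem is re-evaluated with the current attempt number,
# and each failed call is resubmitted up to max_attempts times
Q(fx, x=1:10, const=list(y=10), n_jobs=10, job_size=1, template=tmpl, max_attempts=3)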

@nick-youngblut

As a simpler approach, clustermq could keep a log of all jobs that completed successfully (or just of those that failed), and the user could then set a parameter in Q() to re-run only the previously failed jobs (e.g., Q(just.failed=TRUE)). The user could then wrap Q() in a loop that increases the resources in template on each iteration.
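
Something like this sketch of the loop idea should already be possible against the current API, assuming fail_on_error=FALSE returns the error conditions in place of the failed results (just.failed itself doesn't exist):

# sketch only: re-run failed calls with more memory on each pass
fx = function(x, y) x * 2 + y
xs = 1:10
results = vector("list", length(xs))
todo = seq_along(xs)
for (attempt in 1:3) {
    res = Q(fx, x=xs[todo], const=list(y=10), n_jobs=10, job_size=1,
            fail_on_error=FALSE,
            memory=attempt * 8 * 1024)  # MB in the default templates
    failed = vapply(res, inherits, logical(1), what="error")  # assumed failure class
    results[todo[!failed]] = res[!failed]
    todo = todo[failed]
    if (length(todo) == 0) break
}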

@nick-youngblut

It appears that clustermq array jobs suffer from the issue that if one of the N parallel cluster jobs (e.g., n_jobs=20) dies, the remaining jobs keep going, but clustermq does not start replacements to keep the total at 20 (unlike snakemake).

In my case, this means I currently have only 4 of 20 (n_jobs=20) jobs running, since 16 of the cluster jobs have died for one reason or another. Running 4 jobs isn't efficient, nor what I intended, and these 4 remaining jobs have been running for more than a day, so I don't want to kill them and lose all of that computation.

What do others do when some of their hundreds or thousands of jobs fail? Do they always have to figure out which ones failed and then re-run just those? That's potentially a lot of extra code just to identify the failed jobs and re-run only them (or else re-run everything).
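
For concreteness, the kind of extra code I mean is roughly this (a sketch reusing fx and xs from above, and again assuming fail_on_error=FALSE returns error conditions for the failed calls):

# one-off resubmission of only the failed calls, with more memory
res = Q(fx, x=xs, const=list(y=10), n_jobs=20, fail_on_error=FALSE)
failed = which(vapply(res, inherits, logical(1), what="error"))
res[failed] = Q(fx, x=xs[failed], const=list(y=10),
                n_jobs=min(20, length(failed)), memory=16384)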
