Add resubmission function #203

Open
mschubert opened this issue Jun 25, 2020 · 2 comments
@mschubert (Owner)

Originally posted in #153 by @nick-youngblut:

I'm a big fan of snakemake, which allows automatic resubmission of failed jobs with increased resources (e.g., mem = lambda wildcards, threads, attempt: attempt * 8 # GB of memory, scaled by the attempt number). It would be really awesome to have that feature in clustermq. For example, one could provide a function instead of a value in the template:

tmpl = list(job_mem = function(attempt) attempt * 8)
fx = function(x, y) x * 2 + y
Q(fx, x=1:10, const=list(y=10), n_jobs=10, job_size=1, template=tmpl)

One would also need a max_attempts parameter for Q().
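
For concreteness, the call above might then become something like this (max_attempts being the proposed, not yet existing, argument):

# hypothetical API: job_mem is re-evaluated with the current attempt number,
# and each failed call is resubmitted up to max_attempts times
Q(fx, x=1:10, const=list(y=10), n_jobs=10, job_size=1, template=tmpl, max_attempts=3)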

@nick-youngblut

As a simpler approach, clustermq could keep a log of all jobs that completed successfully (or just of those that failed), and the user could then set a parameter in Q() to re-run only the previously failed jobs (e.g., Q(just.failed=TRUE)). The user could then wrap Q() in a loop that increases the resources in template on each iteration.
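
Something like this sketch of the loop idea should already be possible against the current API, assuming fail_on_error=FALSE returns the error conditions in place of the failed results (just.failed itself doesn't exist):

# sketch only: re-run failed calls with more memory on each pass
fx = function(x, y) x * 2 + y
xs = 1:10
results = vector("list", length(xs))
todo = seq_along(xs)
for (attempt in 1:3) {
    res = Q(fx, x=xs[todo], const=list(y=10), n_jobs=10, job_size=1,
            fail_on_error=FALSE,
            memory=attempt * 8 * 1024)  # MB in the default templates
    failed = vapply(res, inherits, logical(1), what="error")  # assumed failure class
    results[todo[!failed]] = res[!failed]
    todo = todo[failed]
    if (length(todo) == 0) break
}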

@nick-youngblut

It appears that clustermq array jobs suffer from the issue that if one of the N parallel cluster jobs (e.g., n_jobs=20) dies, the remaining jobs keep going, but clustermq does not start replacements to keep the total at 20 (unlike snakemake).

In my case, this means I currently have only 4 of 20 (n_jobs=20) jobs running, since 16 of the cluster jobs have died for one reason or another. Running 4 jobs isn't efficient, nor what I intended, and these 4 remaining jobs have been running for more than a day, so I don't want to kill them and lose all of that computation.

What do others do when some of their hundreds or thousands of jobs fail? Do they always have to figure out which ones failed and then re-run just those? That's potentially a lot of extra code just to identify the failed jobs and re-run only them (or else re-run everything).
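
For concreteness, the kind of extra code I mean is roughly this (a sketch reusing fx and xs from above, and again assuming fail_on_error=FALSE returns error conditions for the failed calls):

# one-off resubmission of only the failed calls, with more memory
res = Q(fx, x=xs, const=list(y=10), n_jobs=20, fail_on_error=FALSE)
failed = which(vapply(res, inherits, logical(1), what="error"))
res[failed] = Q(fx, x=xs[failed], const=list(y=10),
                n_jobs=min(20, length(failed)), memory=16384)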
