batchtools_slurm futures block (Was: stdout/stderr to log file only when using slurm?) #68

Open
privefl opened this issue Jan 9, 2021 · 18 comments

@privefl commented Jan 9, 2021

I have a loop like this:

library(future.batchtools)
plan(batchtools_slurm)  # slurm template as linked further below

for (i in 14:11) {
  f <- future({
    warning("Sleeping ", i, " seconds")
    Sys.sleep(i)
    saveRDS(i^2, paste0("tmp-data/res", i))
  })
}

and I would like the warning to be output in the log files only.
I've seen the new option split mentioned in a blog post for doing both (but I didn't find it in the documentation).
The problem is that capturing this output and returning it to the main session usually blocks my session for a long time (sometimes minutes), and I would like to avoid this.

Basically, is there any way to use future/future.batchtools just to submit jobs on a slurm cluster, without returning anything to the session that submitted them?

@HenrikBengtsson (Collaborator) commented

future(..., conditions = NULL) should do it; the future won't relay any condition classes, except errors, which are always relayed.
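
For illustration, a minimal sketch of that suggestion applied to the loop above (the body of the future is copied from the original example):

f <- future({
  warning("Sleeping ", i, " seconds")
  Sys.sleep(i)
  saveRDS(i^2, paste0("tmp-data/res", i))
}, conditions = NULL)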

@privefl commented Jan 9, 2021

Thanks for the quick response.
Now I do get the output in the log file, but my session still gets blocked.
Basically, once the loop has finished submitting jobs, I can run things in my R session for, say, 30 seconds, and then I can't run anything anymore for a few minutes (or I just kill the session).

@HenrikBengtsson (Collaborator) commented

Hmm... I don't see how the above for loop finishes and you get a prompt back, which then all of a sudden blocks. There must be something else you're doing for the latter to occur. Do you have a reproducible example?

@privefl commented Jan 9, 2021

This is exactly the code I tried on the slurm cluster:
[screenshot of code: code-future]

The template file I use is there.

@HenrikBengtsson (Collaborator) commented Jan 9, 2021

Ok. The only thing I can see happening, if it just hangs later while you're doing other things, is that the garbage collector runs, which in turn triggers garbage collection of the 10 futures you created above. Garbage collecting a batchtools future involves trying to get its result, which requires waiting for the scheduler to process it.

First of all, it's not clear to me whether you're trying future.batchtools for the very first time and it doesn't work, or whether you've got it to work in the past and now it doesn't. I recommend that you first make sure that you can create a basic future and then get its value, e.g.

library(future.batchtools)
plan(batchtools_slurm, ...)
f <- future(42)
v <- value(f)

Does that also hang? If it does, you can set options(future.debug = TRUE) to see at what point it hangs. There's nothing magical happening in the future framework, so if it hangs I think the problem comes from batchtools or below (but I've been wrong before).
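
A minimal sketch of such a debug run (the comments describe roughly what the log covers; exact messages vary):

options(future.debug = TRUE)
f <- future(42)   # verbose log of each step, e.g. identifying globals, submitting the job
v <- value(f)     # verbose log of polling the job and retrieving its result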

Also, spawning future():s and then just ignoring them without getting their value():s is not really how they're meant to be used.

PS. Please don't post screenshots of plain code.

@privefl commented Jan 9, 2021

Yeah, everything is working, and I find {future} and {future.batchtools} very useful for my purposes (very convenient for running things on the cluster). It is just that my use might not be the intended purpose of the packages. I'm just running things and saving the results in rds files, so I don't really need to return the results. And I don't want to, because I don't want it to block my R session, so that I can try other things in the meantime. If you tell me this is not really the purpose of the packages, then I'll look for something else when I have time.

PS: Sorry about the screenshot, but I don't have a choice as I can't really get things out of the cluster easily.

@privefl commented Jan 11, 2021

I think you are right about the gc(), this is what blocks my session.

Also note that my session is not blocked if the loop finishes submitting all jobs before any of them finishes.
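
A hypothetical illustration of that mechanism (the rm()/gc() calls are only for demonstration):

f <- future({ Sys.sleep(60); 42 })
rm(f)  # the future object becomes unreachable
gc()   # its finalizer runs here and waits for the job's result, blocking the session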

@HenrikBengtsson HenrikBengtsson changed the title stdout/stderr to log file only when using slurm? batchtools_slurm futures block (Was: stdout/stderr to log file only when using slurm?) Jan 21, 2021
@HenrikBengtsson (Collaborator) commented

> I think you are right about the gc(), this is what blocks my session.

Yes. You should be able to see the finalizer being called if you set options(future.debug = TRUE).

I'll try to give more suggestions soon-ish - need to find time to do a few test runs.

@privefl commented Jan 21, 2021

Thanks Henrik!

@HenrikBengtsson (Collaborator) commented

Here's how you can disable the finalizer of a batchtools future:

library(future.batchtools)
plan(batchtools_slurm, workers = Inf, finalize = FALSE)
# Warning message:
# In tweak.future(function (expr, envir = parent.frame(), substitute = TRUE,  :
#   Detected 1 unknown future arguments: 'finalize'

Unfortunately, you're gonna get that annoying warning (an oversight by me), but it does work.

With this, the finalizer, which attempts to collect the future's results and then delete them from disk, will not run. Since there is an infinite number of workers (the default is 100), you will also never hit an upper limit on the number of concurrently running futures. If you did hit that limit, the next future would wait until one of the running futures has been resolved, which would require collecting its results (exactly what you are trying to avoid).


It works, but is this a good idea? I'm not sure. I say this mostly because this type of Future API use case, where you just use future() without value(), is not really on my radar when I'm expanding the future framework. For example, I'm working on improvements that could potentially make lazy = TRUE the better default. I'm not saying this switch will take place, but if it did, your code would just sit there producing lots of lazy batchtools futures that are never submitted to the scheduler. BTW, you could of course guard against this with an explicit:

f <- future(..., lazy = FALSE)

Having said all this, my gut feeling is that your approach should be good for a very long time.

HenrikBengtsson added a commit that referenced this issue Jan 22, 2021
@privefl commented Jan 22, 2021

Thanks for this.

I don't think my use case is very uncommon; when you schedule jobs on a cluster, you usually have them run independently, each at its own pace, storing some results and releasing the computing resources when finished.

For scheduling many jobs using e.g. a loop, {future.batchtools} is SO useful. It is just that I'm not really interested in getting back the results/stdout, because I'm already storing them to disk. I'll try the workers = Inf, finalize = FALSE option.

@HenrikBengtsson (Collaborator) commented

> I don't think my use case is very uncommon ...

Sorry, I wasn't clear enough. I meant within the Future API ecosystem. First, the use case where future() is used without a corresponding value() does not pass the "future sniff test" of being able to run the exact same R code with a different plan(). For instance, would your code work the way you want it to with plan(sequential) or plan(multisession, workers = 2)?

I agree, there are definitely cases where you want to launch HPC jobs from R and then leave R, leave it to the user to manually poll the queue, and then continue the analysis of the produced results elsewhere. We already have batchtools::submitJobs(), which does exactly that. Maybe there's a need for a batchtools::submitOneJob()? I guess you also find the automatic handling of globals and packages that comes with future convenient - but one could imagine batchtools::submitOneJob() inheriting those skills too. My read on this is that you're using plan(batchtools_slurm) + future(...) because there's no batchtools::submitOneJob().
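
For reference, a minimal sketch of that submit-and-leave workflow in plain batchtools (the registry path and template name are illustrative):

library(batchtools)
reg <- makeRegistry(file.dir = "my-registry", make.default = FALSE)
reg$cluster.functions <- makeClusterFunctionsSlurm(template = "slurm.tmpl")
batchMap(function(i) saveRDS(i^2, paste0("tmp-data/res", i)), i = 14:11, reg = reg)
submitJobs(reg = reg)  # returns right after submission; results are never collected here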

Now, if we peek into the future (pun intended), there might be a day when we have a queueing framework for futures in R - a queue that does not necessarily run an HPC scheduler in the background. Very, very roughly, something like:

q <- future_queue()
f <- future(..., lazy = TRUE)
future_submit(q, f)
v <- value(f)

Maybe your use case of not caring about value() will become more popular then. What adds to this is support for being able to quit R and at a later time come back and pick up an existing future queue in another R session. FWIW, I'm working toward making these things possible.

@privefl commented Jan 22, 2021

Yes, the export of the globals is very convenient.
If a submitOneJob() becomes available, it will be very useful to me, and at least to others on my team.

@johanneskoch94 commented

Hi @privefl,
I was wondering if you managed to solve the issue with workers = Inf, finalize = FALSE? I'm having the same difficulties you describe above, but haven't yet managed to get it to work on my side...

@privefl commented May 10, 2021

@johanneskoch94 For now, I'm just making sure I stop the loop after all jobs have been submitted, but before getting any result back.

@johanneskoch94 commented

Thanks @privefl.
I managed to solve my issue. In the end, it was indeed adding workers = Inf, finalize = FALSE to the plan that did the trick. My problem was that these options didn't seem to hold when using furrr::future_pwalk(). I don't know why I didn't think of it earlier, but I finally changed my code from future_pwalk(my_tibble, my_function) to for (i in 1:nrow(my_tibble)) future({ my_function(...) }) and it works! (See the sketch below.)
So thanks for raising the issue, and thanks of course @HenrikBengtsson for the solution and this awesome package!
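
A minimal sketch of that rewrite, assuming my_function takes the columns of my_tibble as named arguments (only my_tibble, my_function, and the plan() arguments come from this thread; the rest is illustrative):

library(future.batchtools)
plan(batchtools_slurm, workers = Inf, finalize = FALSE)

# Before (blocks while furrr collects the results):
# furrr::future_pwalk(my_tibble, my_function)

# After (one fire-and-forget future per row):
for (i in 1:nrow(my_tibble)) {
  args <- as.list(my_tibble[i, ])
  f <- future(do.call(my_function, args))
}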

@HenrikBengtsson (Collaborator) commented

Good to hear.

> In the end it was indeed adding workers = Inf, finalize = FALSE to the plan that did the trick.

Careful with workers = Inf, though. You might end up submitting hundreds or thousands of jobs to your scheduler by mistake. I suggest that you specify a large-enough finite value instead, to save yourself from an angry email from your sysadmins.
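
For example (the value 500 is an arbitrary illustration, not a recommendation):

plan(batchtools_slurm, workers = 500, finalize = FALSE)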

@johanneskoch94 commented

True, I will do as you suggest. Thanks!
