Error when using envir & globals parameters in futures that are not using the multisession plan #280
NOTE: It also appears that using the envir & globals parameters produces:

```
Error in packageVersion("future") : could not find function "packageVersion"
```

Using the package wahani/modules there is a fix to build a minimal required environment:

```r
library(future)

e <- new.env(parent = emptyenv())
e$test <- 123
modules::import(base, where = e)   # exclude if using baseenv() as parent
modules::import(utils, where = e)

f <- future({ paste("hello", test) }, envir = e, globals = ls(e), lazy = TRUE)
value(f)
# RESULT: [1] "hello 123"
```
---

Thanks for reporting. I have this one on a (private) backburner issue tracker. The simple explanation is that the environment that local/sequential futures are evaluated in becomes the same as the environment specified to identify globals (here argument envir). Since that environment has emptyenv() as its parent, not even the base functions can be found:

```r
> e <- new.env(parent = emptyenv())
> eval(quote({ 42 }), envir = e)
Error in { : could not find function "{"
```

The fix for this pothole, which is on the todo list, is to always copy globals over to the evaluation environment. This is done for all external workers, but not for sequential ones. It will come with some extra overhead, but it will help avoid discrepancies like this one. The upside is that it will also solve some other corner-case scenarios.

Q. I'm curious, and it'll also help me understand in what cases this causes a problem: what is the underlying use case where you bumped into this issue?
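For illustration, here is a minimal sketch of what "copying globals over to the evaluation environment" could look like. This is a hypothetical reconstruction, not the package's actual internals:

```r
# Hypothetical sketch of the proposed fix: copy the identified globals into a
# fresh evaluation environment whose parent can resolve base functions.
e <- new.env(parent = emptyenv())
e$test <- 123

globals <- as.list(e)                     # globals identified from 'e'
evalenv <- new.env(parent = globalenv())  # parent provides the search path
for (name in names(globals)) {
  assign(name, globals[[name]], envir = evalenv)
}

eval(quote({ paste("hello", test) }), envir = evalenv)
# [1] "hello 123"
```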
---

To answer your question: we're working with some HUGE (1-5 GB) shapefiles, which are being sectioned and processed in parallel. To avoid the parallel processes either a) loading the shapefiles themselves and consuming the memory in no time at all, or b) trying to copy the large origin environment across to the futures (the default future behaviour, which also consumes memory), we are creating new environments, each containing a section of the whole, and forcing the futures to use these new environments.

One of the ways we can work around this (sort of...) is to create the data slices and save them to file, close the splitting R session, then have a new R session which creates and runs the 5000 futures, each future getting a filename in its environment to load & process.
---

Rather than copy all globals across, would it not be better to use a minimal environment and assume either …? Additionally, this could mean that a future could be as simple as …

EDIT: This would also help in debugging what you want the future to do, if you can call …
---

It's possible I misunderstand you, but note that the default behavior is to automatically identify the objects ("globals") and packages needed to evaluate the future expression. It does not export all of the calling environment - only those globals that are needed. The packages required are assumed to be available on the worker, so those are loaded on the worker prior to evaluation.

It might be that you're misinterpreting what argument envir does.

To assert that a future can be evaluated "anywhere", I often use:

```r
plan(cluster, workers = "localhost")
```

That will set up a single background worker process, and any future evaluated will have any globals it requires exported to the worker. If that fails, then you'll get an error.

From your description of your use case, can't you do something like:

```r
fs <- list()
for (ii in seq_along(shapefiles)) {
  data <- my_read(shapefiles[ii])
  fs[[ii]] <- future({ do_something(data, ...) })
}
vs <- values(fs)
```

I use that model in a lot of my large-scale genomics analyses, processing 100-1000's of files, each 50+ GiB.
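As a self-contained illustration of that model - the temporary files and the mean() call below are stand-ins for the shapefile slices and for my_read()/do_something(), not part of the original thread:

```r
library(future)
plan(cluster, workers = "localhost")  # single background worker

# Stand-in data slices written to disk, mimicking the pre-split shapefiles
files <- replicate(3, tempfile(fileext = ".rds"))
for (f in files) saveRDS(rnorm(1e4), f)

fs <- list()
for (ii in seq_along(files)) {
  data <- readRDS(files[ii])           # load one slice in the main session
  fs[[ii]] <- future({ mean(data) })   # only 'data' is exported to the worker
}
vs <- values(fs)  # values() as in the future version discussed in this thread
str(vs)
```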
---

Ah, this may be where I was falling foul at one point. Through bad coincidence, the large unsplit object outside the future had the same name as a variable inside the future, so I'm guessing the algorithm detecting the requirements picked up the large outer object and tried to pass it into ALL of the futures (the size error kicked off), which I didn't want and ended up trying to solve with a custom envir/globals.

Controlling the evaluation environment is essentially what I am kind of doing by …

It doesn't help that throughout all this I have one package that loads its own global "hidden" environment (i.e. one whose name starts with ".") for some of its functions, which are then called as default parameters to a public function... causing mayhem.

EDIT1: In addition, I wonder if you have also noticed that running a list of futures with resolve() does not seem to always call the progress function?

```r
library(anytime)  # provides anytime() and rfc3339()

progressPrint <- function(done, total) {
  time <- paste0("[", rfc3339(anytime(Sys.time())), "]")
  counts <- paste0("(", done, " of ", total, ")")
  percent <- paste0(round(done / total * 100, 1), "%")
  cat(time, "Resolving", counts, percent, "\n")
}
```
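For context, a usage sketch of the pattern being described, assuming resolve()'s (since-removed) progress argument accepted a callback like progressPrint above:

```r
library(future)
plan(multisession)

fs <- lapply(1:5, function(ii) future({ Sys.sleep(0.2); ii }))
resolve(fs, progress = progressPrint)  # 'progress' argument later removed (Issue #282)
```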
---

Thanks for this. I won't troubleshoot this one, because that teeny progress-bar feature is a remnant from the early days that should not really be used and will be removed (hence Issue #282). The goal is to introduce a proper, generic mechanism for tracking the internal progress of futures, which will probably be based on some kind of hook functions (Roadmap Issue #172 (comment)).
---

See snippet below to produce the error.
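A minimal snippet of the kind described, reconstructed from the workaround earlier in the thread (the exact original code is an assumption):

```r
library(future)
e <- new.env(parent = emptyenv())
e$test <- 123
f <- future({ paste("hello", test) }, envir = e, globals = ls(e), lazy = TRUE)  # line 4
value(f)
# Error in packageVersion("future") : could not find function "packageVersion"
```

With a non-multisession plan such as plan(sequential), the future is evaluated inside e, whose emptyenv() parent cannot resolve even base functions - hence the error.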
My guess is that there is some difference in the way the multisession and the sequential/multicore etc. plans are set up in terms of pre-loading libraries. Without digging into the source, I would guess that the multisession sessions are set up with certain libraries loaded, whereas the other plans are not. Perhaps when using the globals and envir parameters, the future library (or is it another library?) is injected.

NOTE: lazy = TRUE is not necessary; the error still occurs, just on line 4 instead.