makeClusterPSOCK(..., rscript_envs = ...) - more clever #8

Open
HenrikBengtsson opened this issue Mar 23, 2020 · 8 comments

Comments

@HenrikBengtsson
Collaborator

HenrikBengtsson commented Mar 23, 2020

  • makeClusterPSOCK() gained argument 'rscript_envs' for setting environment variables in workers on startup, e.g. rscript_envs = c(FOO = "3.14", "BAR").

Instead of doing this via -e "Sys.setenv('<name>'='<value>')" options, can't we do:

> Sys.setenv(FOO="bar")
> system2("Rscript", args = c("-e", shQuote("Sys.getenv('FOO')")), stdout=TRUE)
[1] "[1] \"bar\""
> my_undo_env_fcn() 

This way we can set environment variables that must be set very early in the R startup process in order to take effect, e.g. TMPDIR.

I've verified that the above works on Linux and Windows. It may be worth adding an internal with_env() to make sure the environment variables are properly restored in the main R session.
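The with_env() idea above could be sketched roughly as follows. This is a minimal illustration, not the package's implementation: the name with_env() comes from the comment, but its signature and behavior here are assumptions. It records the current values, sets the new ones, evaluates the expression, and restores the previous state on exit, including unsetting variables that did not exist before.

```r
# Hypothetical helper (signature assumed, not parallelly's API): temporarily
# set environment variables, evaluate 'expr', then restore the previous state.
with_env <- function(envs, expr) {
  stopifnot(is.character(envs), !is.null(names(envs)), all(nzchar(names(envs))))

  # Record current values; NA marks variables that are currently unset
  old <- Sys.getenv(names(envs), unset = NA_character_, names = TRUE)

  on.exit({
    was_unset <- is.na(old)
    if (any(was_unset)) Sys.unsetenv(names(old)[was_unset])
    if (any(!was_unset)) do.call(Sys.setenv, as.list(old[!was_unset]))
  }, add = TRUE)

  do.call(Sys.setenv, as.list(envs))

  # 'expr' is a promise; it is evaluated here, after the variables are set
  expr
}

# Usage: the child Rscript process inherits FOO from the main session
out <- with_env(c(FOO = "bar"), {
  system2(
    file.path(R.home("bin"), "Rscript"),
    args = c("-e", shQuote("cat(Sys.getenv('FOO'))")),
    stdout = TRUE
  )
})
```

Because the restore step runs via on.exit(), the main session is cleaned up even if launching the worker fails with an error.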

@HenrikBengtsson
Collaborator Author

This will work for the local machine. But, what about remote sessions over, say, SSH?

@HenrikBengtsson
Collaborator Author

Ideally, R should support this, cf. HenrikBengtsson/Wishlist-for-R#110

@HenrikBengtsson HenrikBengtsson changed the title from "future: makeClusterPSOCK(..., rscript_envs = ...)" to "future: makeClusterPSOCK(..., rscript_envs = ...) - more clever" on Apr 10, 2020
@HenrikBengtsson
Collaborator Author

HenrikBengtsson commented Jul 17, 2020

Per futureverse/future#392, we now support:

cl <- makeClusterPSOCK(..., rscript = c("LD_LIBRARY_PATH=/path/to", "Rscript"))

EDIT: Note that this does not work on MS Windows.
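As a rough illustration of the rscript form above, the NAME=value prefix could be generated from a named character vector. The helper name env_prefix() is hypothetical and not part of parallelly. It assumes a POSIX shell on the worker side, which is consistent with the note that this form does not work on MS Windows.

```r
# Hypothetical helper (not part of parallelly): turn a named character vector
# into shell-style NAME='value' assignments for a POSIX shell.
env_prefix <- function(envs) {
  stopifnot(is.character(envs), !is.null(names(envs)), all(nzchar(names(envs))))
  # Quote only the values; the variable names must remain unquoted
  paste0(names(envs), "=", shQuote(unname(envs), type = "sh"))
}

# Build the 'rscript' vector: assignments first, then the executable
rscript <- c(env_prefix(c(LD_LIBRARY_PATH = "/path/to")), "Rscript")
print(rscript)
```

The resulting vector can then be passed as makeClusterPSOCK(..., rscript = rscript), ideally first with dryrun = TRUE to inspect the generated command.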

@HenrikBengtsson HenrikBengtsson transferred this issue from another repository Oct 20, 2020
@HenrikBengtsson HenrikBengtsson changed the title from "future: makeClusterPSOCK(..., rscript_envs = ...) - more clever" to "makeClusterPSOCK(..., rscript_envs = ...) - more clever" on Oct 20, 2020
@HenrikBengtsson
Collaborator Author

HenrikBengtsson commented Nov 24, 2021

Regarding not being able to pass environment variables sooner in the R startup process:

So, Rscript ... expands to R --no-echo --no-restore ..., and, unlike Rscript, R accepts environment variables on its command line, as in R PI="3.14" ... --args .... We could do this rewriting internally in makeNodePSOCK(). Since R doesn't accept the option --default-packages=<pkgs>, we would need to pass the default packages via R_DEFAULT_PACKAGES=..., and we would need to make sure to inject an --args too.

Example: Local worker

Instead of:

> cl <- parallelly::makeClusterPSOCK(1L, rscript_envs = c(PI="3.14"), dryrun = TRUE)
----------------------------------------------------------------------
Manually, start worker #1 on local machine 'localhost' with:

  "C:/PROGRA~1/R/R-41~1.0/bin/x64/Rscript" --default-packages=datasets,utils,grDevices,graphics,stats,methods -e "options(socketOptions = \"no-delay\")" -e "Sys.setenv(\"PI\"=\"3.14\")" -e "workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()" MASTER=localhost PORT=11876 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential

we could have it do:

> cl <- parallelly::makeClusterPSOCK(1L, rscript_envs = c(PI="3.14"), dryrun = TRUE)
----------------------------------------------------------------------
Manually, start worker #1 on local machine 'localhost' with:

  "C:/PROGRA~1/R/R-41~1.0/bin/x64/R" --no-echo --no-restore R_DEFAULT_PACKAGES="datasets,utils,grDevices,graphics,stats,methods" PI="3.14" -e "options(socketOptions = \"no-delay\")" -e "workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()" --args MASTER=localhost PORT=11876 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential

Example: Remote worker

Instead of:

> cl <- parallelly::makeClusterPSOCK("remote.example.org", rscript_envs = c(PI="3.14"), dryrun = TRUE)

----------------------------------------------------------------------
Manually, (i) login into external machine 'remote.example.org':

  '/usr/bin/ssh' -R 11121:localhost:11121 remote.example.org

and (ii) start worker #1 from there:

  'Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'options(socketOptions = "no-delay")' -e 'Sys.setenv("PI"="3.14")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' MASTER=localhost PORT=11121 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential

Alternatively, start worker #1 from the local machine by combining both step in a single call:

  '/usr/bin/ssh' -R 11121:localhost:11121 remote.example.org "'Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'options(socketOptions = \"no-delay\")' -e 'Sys.setenv(\"PI\"=\"3.14\")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' MASTER=localhost PORT=11121 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential"

we could do:

> cl <- parallelly::makeClusterPSOCK("remote.example.org", rscript_envs = c(PI="3.14"), dryrun = TRUE)

----------------------------------------------------------------------
Manually, (i) login into external machine 'remote.example.org':

  '/usr/bin/ssh' -R 11121:localhost:11121 remote.example.org

and (ii) start worker #1 from there:

  'R' --no-echo --no-restore R_DEFAULT_PACKAGES='datasets,utils,grDevices,graphics,stats,methods' PI='3.14' -e 'options(socketOptions = "no-delay")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' --args MASTER=localhost PORT=11121 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential

Alternatively, start worker #1 from the local machine by combining both step in a single call:

  '/usr/bin/ssh' -R 11121:localhost:11121 remote.example.org "'R' --no-echo --no-restore R_DEFAULT_PACKAGES='datasets,utils,grDevices,graphics,stats,methods' PI='3.14' -e 'options(socketOptions = \"no-delay\")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' --args MASTER=localhost PORT=11121 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential"

Note that the above R PI="3.14" ... form is for MS Windows. On all other platforms, we need to use PI="3.14" R ..., which means we can equally well use PI="3.14" Rscript ... there.
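The platform-dependent placement of the assignments described above could be sketched as below. The function build_r_call() and its arguments are hypothetical, made up for illustration; this is not how makeNodePSOCK() is implemented. It only shows the ordering: after the options on MS Windows, before the executable elsewhere.

```r
# Hypothetical sketch (not parallelly's implementation): assemble an 'R'
# launch command, placing NAME=value assignments per platform convention.
build_r_call <- function(envs, rexec = "R", code = "workRSOCK()",
                         worker_args = character(0),
                         windows = (.Platform$OS.type == "windows")) {
  stopifnot(is.character(envs), !is.null(names(envs)))
  type <- if (windows) "cmd" else "sh"

  assignments <- paste0(names(envs), "=", shQuote(unname(envs), type = type))
  opts <- c("--no-echo", "--no-restore")
  expr <- c("-e", shQuote(code, type = type))

  parts <- if (windows) {
    # MS Windows: R <options> NAME="value" -e ... --args ...
    c(shQuote(rexec, type = type), opts, assignments, expr, "--args", worker_args)
  } else {
    # Elsewhere: NAME='value' R <options> -e ... --args ...
    c(assignments, shQuote(rexec, type = type), opts, expr, "--args", worker_args)
  }
  paste(parts, collapse = " ")
}

cat(build_r_call(c(PI = "3.14"), worker_args = "MASTER=localhost", windows = FALSE), "\n")
cat(build_r_call(c(PI = "3.14"), worker_args = "MASTER=localhost", windows = TRUE), "\n")
```

Passing windows explicitly, rather than relying on the local platform, matters for remote workers, where the worker's operating system may differ from that of the main session.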

@HenrikBengtsson
Collaborator Author

In parallelly (>= 1.29.0-9003), we can now do (Issue #75):

> cl <- parallelly::makeClusterPSOCK(1L, rscript = file.path(R.home("bin"), "R"), rscript_args = c("--no-echo", "--no-restore", "*", "--args"), dryrun = TRUE)
----------------------------------------------------------------------
Manually, start worker #1 on local machine 'localhost' with:

  '/home/hb/software/R-devel/R-4-1-branch/lib/R/bin/R' --no-echo --no-restore --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'options(socketOptions = "no-delay")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' --args MASTER=localhost PORT=11920 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential

Now, unlike Rscript, R does not support --default-packages=..., so that option is ignored and we get a warning:

> cl <- parallelly::makeClusterPSOCK(1L, rscript = file.path(R.home("bin"), "R"), rscript_args = c("--no-echo", "--no-restore", "*", "--args"))
WARNING: unknown option '--default-packages=datasets,utils,grDevices,graphics,stats,methods'

> cl
Socket cluster with 1 nodes where 1 node is on host 'localhost' (R version 4.1.2 Patched (2021-11-01 r81123), platform x86_64-pc-linux-gnu)

@HenrikBengtsson
Collaborator Author

In the develop version (commit 2299389), default packages are now set via R_DEFAULT_PACKAGES when Rscript is not used, e.g.

cl <- parallelly::makeClusterPSOCK(1L, rscript = file.path(R.home("bin"), "R"), rscript_args = c("--no-echo", "--no-restore", "*", "--args"), dryrun = TRUE)
----------------------------------------------------------------------
Manually, start worker #1 on local machine 'localhost' with:

  R_DEFAULT_PACKAGES=datasets,utils,grDevices,graphics,stats,methods '/home/hb/software/R-devel/R-4-1-branch/lib/R/bin/R' --no-echo --no-restore -e 'options(socketOptions = "no-delay")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' --args MASTER=localhost PORT=11606 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential

This avoids the above warning.

Currently, this R_DEFAULT_PACKAGES workaround is only applied for locally launched cluster nodes. For remote workers, we'll get a warning that it's not supported.
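To see the effect of R_DEFAULT_PACKAGES in isolation, a local check along the following lines may help. It is only a sketch and assumes R (>= 4.0.0, for --no-echo) is available via R.home("bin"). It launches R with the variable set through the env argument of system2() and prints the default-packages option as seen by the child process.

```r
# Sketch: launch 'R' (not 'Rscript') with R_DEFAULT_PACKAGES set for the
# child process only, and report which default packages it picked up.
r_bin <- file.path(R.home("bin"), "R")

out <- system2(
  r_bin,
  args = c(
    "--no-echo", "--no-restore",
    "-e", shQuote("cat(getOption('defaultPackages'))")
  ),
  env = "R_DEFAULT_PACKAGES=datasets,utils",  # applies to the child only
  stdout = TRUE
)

print(out)
```

Using the env argument keeps the variable out of the main session's environment, so nothing needs to be undone afterwards.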

@HenrikBengtsson
Collaborator Author

> Currently, this R_DEFAULT_PACKAGES workaround is only applied for locally launched cluster nodes. For remote workers, we'll get a warning that it's not supported.

Update: New argument rscript_sh is used to infer whether a cluster node is launched on MS Windows or not. This allowed me to rely on R_DEFAULT_PACKAGES also for remote workers.

@HenrikBengtsson
Collaborator Author

Argh... so, on MS Windows, R does not escape quotes at the CLI the way Rscript and Rterm do, cf. https://stat.ethz.ch/pipermail/r-devel/2021-December/081371.html.

So, on MS Windows, the above R workaround has to use Rterm instead.
