Delayed progress updates to file due to file-system buffering [Was: No near-live progress update when running on a SLURM cluster] #149
First of all, it could be that you're actually only running with one parallel worker instead of ten, because, I think, the line

#SBATCH --ntasks 10

should be

#SBATCH --ntasks=10

I suggest you add:

plan(cluster)
message("Number of parallel workers: ", nbrOfWorkers())

and make sure that you get 10 in the Slurm log file. That said, even with a single parallel worker, you should eventually get progress updates; it will just take longer. Are you saying you're not getting any updates even if you run it to the very end? |
I already confirmed that I got 10 workers from the message in the Slurm log file.

Update: When I switch to

Further update: Using |
Thanks for confirming. FWIW, I cannot reproduce this using R 4.2.2 with progressr 0.12.0 running on CentOS 7.9 and a Slurm 21.08.4 scheduler. The output file updates near-live for me:

$ sbatch test.sh
Submitted batch job 1065784
$ tail -f slurm.out
Number of workers: 10
37/100 [==========>------------------] 37% in 14s, ETA: 24s

Troubleshooting suggestions

Have you verified that you get output when you run the script directly, outside of Slurm? For example:

$ chmod ugo+x test.sh
$ R_PARALLELLY_AVAILABLECORES_SYSTEM=10 ./test.sh
Number of workers: 10
21/100 [=====>-----------------------] 21% in 9s, ETA: 35s

Session details

$ cat /etc/centos-release
CentOS Linux release 7.9.2009 (Core)
$ sbatch --version
slurm 21.08.4

> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS: /software/c4/cbi/software/R-4.2.2-gcc10/lib64/R/lib/libRblas.so
LAPACK: /software/c4/cbi/software/R-4.2.2-gcc10/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] progressr_0.12.0 furrr_0.3.1 future_1.30.0
loaded via a namespace (and not attached):
[1] compiler_4.2.2 parallelly_1.33.0 magrittr_2.0.3 cli_3.4.1
[5] tools_4.2.2 parallel_4.2.2 listenv_0.9.0 codetools_0.2-18
[9] digest_0.6.31 globals_0.16.2 rlang_1.0.6 purrr_0.3.5 |
One more troubleshooting suggestion: When using only:

#SBATCH --ntasks=10

you might end up getting a job that can run on multiple hosts. To rule that problem out, make sure to ask for a single compute node:

#SBATCH --nodes=1
#SBATCH --ntasks=10

Then, use:

plan(multisession)

With these two changes, it'll parallelize on one compute node, on which your job/script got access to 10 CPU cores by Slurm. See if you get progress updates in that case. |
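For illustration, a sketch of the single-node variant on the R side is shown here; the SBATCH flags are only restated as comments, and the worker-count message is an assumption rather than part of the original scripts:

# Assumed Slurm request: #SBATCH --nodes=1 and #SBATCH --ntasks=10
library(future)

plan(multisession)   # parallelize within the single node allotted by Slurm
message("Number of parallel workers: ", nbrOfWorkers())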
Thanks for the suggestions. I'm also running R 4.2.2 and progressr 0.12.0 on CentOS 7.9, but my Slurm scheduler is 22.05.2. I tried the suggestions, but the issue persists whether the cluster plan runs on one or multiple nodes. Ultimately, I need multiple nodes, because I need >200 cores for my computation. Do you have any other ideas what I could try?

Session details

-bash-4.2$ cat /etc/centos-release
CentOS Linux release 7.9.2009 (Core)
-bash-4.2$ sbatch --version
slurm 22.05.2

─ Session info ───────────────────
setting value
version R version 4.2.2 (2022-10-31)
os CentOS Linux 7 (Core)
system x86_64, linux-gnu
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/New_York
date 2022-12-20
pandoc 1.12.3.1 @ /usr/bin/pandoc
─ Packages ────────────────────
package * version date (UTC) lib source
cli 3.4.1 2022-09-23 [1] CRAN (R 4.2.2)
codetools 0.2-18 2020-11-04 [1] CRAN (R 4.2.1)
crayon 1.5.2 2022-09-29 [1] CRAN (R 4.2.2)
digest 0.6.31 2022-12-11 [1] CRAN (R 4.2.2)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.2)
furrr * 0.3.1 2022-08-15 [1] CRAN (R 4.2.2)
future * 1.30.0 2022-12-16 [1] CRAN (R 4.2.2)
globals 0.16.2 2022-11-21 [1] CRAN (R 4.2.2)
hms 1.1.2 2022-08-19 [1] CRAN (R 4.2.2)
lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.2)
listenv 0.9.0 2022-12-16 [1] CRAN (R 4.2.2)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.2)
parallelly 1.33.0 2022-12-14 [1] CRAN (R 4.2.2)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.2)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.2.2)
progress 1.2.2 2019-05-16 [1] CRAN (R 4.2.2)
progressr * 0.12.0 2022-12-13 [1] CRAN (R 4.2.2)
purrr 0.3.5 2022-10-06 [1] CRAN (R 4.2.2)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.2)
rlang 1.0.6 2022-09-24 [1] CRAN (R 4.2.2)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.2)
vctrs 0.5.1 2022-11-16 [1] CRAN (R 4.2.2)
────────────────────────── |
Interesting and "odd" problem. Have you confirmed that the script actually completes? For example:

result <- local({
n <- 100
p <- progressor(n)
future_map(1:n, function(i){
p(class = "sticky")
f(i)
})
})
str(result)
print(sessionInfo())

Because, given that you say you get:

0/100 [--------------------------------------------------] 0% in 0s, ETA: ?

but nothing else, that tells me that at least the first progress update makes it through. Another troubleshooting help is:

handlers("debug")

That outputs all progress update signals captured, without involving "fancy" progress bar rendering (erase, reprint, erase, etc.). See https://progressr.futureverse.org/#debugging for example output. If it only shows that first "initiate" progress signal, then there's something blocking anything following from appearing (which is really odd to me right now).

Regarding:

"Ultimately, I need multiple nodes, because I need >200 cores for my computation."

Note that R has a built-in limitation where you cannot run more than ~120 parallel workers, cf. https://parallelly.futureverse.org/reference/availableConnections.html. To get more, you'll have to tweak the R code in one place and build from source. If you want to help on that end, please consider reaching out to R-devel about this, so that limitation gets more attention - I've done it in the past, but I only have so much time to push that question. |
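As a hedged aside, the parallelly functions documented on the page linked above can be used to check this limit up front; this sketch is not from the thread:

library(parallelly)

# Total number of connections this R build supports, and how many are
# currently free to be used for parallel workers.
message("Available connections: ", availableConnections())
message("Free connections: ", freeConnections())

# Cap a worker request at what the current R build can actually serve.
n_workers <- min(200, freeConnections())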
The script actually completes, i.e. I put the session info after the loop as you suggested and it ends up in slurm.out. Switching to handlers("debug") shows all the progress signals.

The interesting thing is that the first 'initiate' line gets written immediately, then nothing happens for a while, and then all the update lines are written simultaneously. I have the feeling that although progress updates are recorded internally, they don't get through until the loop is finished. |
That confirms that all the progress signals go through. Looks like there's a delay in how frequently the output file is updated. What happens if you skip R and instead run the following in your job script:

for i in {1..100}; do
  echo $i
  sleep 0.3
done

If you tail the file, do you see it being updated immediately (i.e. 3 times a second), or is it buffered too? |
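A related check, not suggested in the thread but sketched here as an assumption, is to run the same kind of loop from R itself, to separate R's own output buffering from file-system buffering:

# Write one line roughly every 0.3 s, flushing R's output explicitly each time.
for (i in 1:100) {
  cat(i, "\n", sep = "")
  flush(stdout())
  Sys.sleep(0.3)
}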
I see the updates in your example immediately, so everything is fine there as well. |
Thanks for confirming. I know you mentioned 'sequential' before, but can you also show me the output of:

library("future")
library("furrr")
library("progressr")
options(progressr.enable = TRUE)
handlers(global = TRUE)
handlers("debug")
plan(sequential)
f <- function(i) {
Sys.sleep(0.3)
i
}
local({
n <- 100
p <- progressor(n)
result <- future_map(1:n, function(i){
p(class = "sticky")
f(i)
})
})

This will tell me if output is being buffered in your R session. FYI, the information outputted by the 'debug' handler reveals when a progress update was signaled and how much later it was received by the handler. For example, one update was signaled 3.455 s after the very first one, but it took another 1.976 s for it to be received. That's more than expected, but most of them are delayed < 0.1 s; for example, another was signaled at 21.295 s in, and received and outputted 0.085 s later. Thus, all those 'debug' lines should be visible to you within a few seconds. Note that the whole process takes 30.559 s to complete, which means they're received and outputted way before that. So, this still points to output being buffered somewhere. |
Here's the output, so it seems there is no notable buffering.
But I got the live update to work (🥳) by running your script with more delay and iterations:

library("future")
library("furrr")
library("progressr")
options(progressr.enable = TRUE)
handlers(global = TRUE)
handlers(handler_progress(
format = ":current/:total [:bar] :percent in :elapsed, ETA: :eta",
width = 60,
complete = "="
))
plan(cluster)
f <- function(i) {
Sys.sleep(3)
i
}
local({
n <- 500
p <- progressor(n)
result <- future_map(1:n, function(i){
p(class = "sticky")
f(i)
})
})

Interestingly, the progress updates come in at roughly multiples of 100, i.e. the first update I saw was |
Great. Definitely sounds like buffering is happening somewhere, and eventually it's flushed. It could be that the global file system buffers the output. OTOH, then I'd expect to see the same behavior when running sequentially. If you add more output, will you get progress updates more frequently? You can do this by, for instance, emitting more output per iteration.

BTW, there's no need for you to use |
BTW, you still need to use |
Actually, to force the extra output to be more instant, we can use those sticky progress messages, e.g.

result <- future_map(1:n, function(i) {
p(c("Some letters: ", letters), class = "sticky", amount = 0)
p(c("Some digits: ", 1:100), class = "sticky", amount = 0)
p(c("More letters: ", LETTERS), class = "sticky", amount = 0)
p(sprintf("Iteration #%d", i))
f(i)
})

That might flush the buffer sooner. If that's the case, then we know for sure that buffering is taking place and that the buffer has a size that needs to be filled up before it is flushed. |
The trick with the sticky progress messages works, and I even get a working progress bar with the original |
Thanks for confirming.
I'm pretty sure it triggers any buffered output to be flushed. The buffering can be due to the file system caching writes to the file before actually pushing them out, so they're seen throughout the cluster. Whatever the buffer is, file cache or not, it typically has a maximum size, and when it's filled up, it flushes the cache out to make room for the next one. This makes file writing much more efficient overall, but it obviously introduces a delay when monitored like here.

Regarding buffering: Searching online, there's a slight chance that using:

stdbuf -o0 -e0 Rscript test.R

will make a difference; it avoids buffering output from R. If it's the file system that's buffering, I doubt this makes a difference, but it's worth a shot.

Moreover, because I suspect file-system buffering is happening it might not work, but you can also use the 'filesize' progress handler to report on the progress via the size of a custom file, e.g.

handlers(list(
bar = handler_progress(
format = ":current/:total [:bar] :percent in :elapsed, ETA: :eta",
width = 60,
complete = "="
),
file = handler_filesize(
file = "slurm.progress",
intrusiveness = 0
)
))

Comment: the names ('bar' and 'file') are just labels. This way you can monitor progress using, for instance:

$ ls -l slurm.progress
-rw-rw-r-- 1 hb hb 68 Dec 21 12:13 slurm.progress
^
+-- = 68% progress |
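Since handler_filesize() reports progress via the size of that file (68 bytes corresponding to 68% in the example above), one could also poll it from a login node with a small R loop; this is a sketch, and the file name and polling interval are assumptions:

# Poll the progress file every 5 seconds until it reports 100%.
repeat {
  sz <- file.size("slurm.progress")   # NA until the file exists
  if (!is.na(sz)) message(Sys.time(), " - progress: ", sz, "%")
  if (!is.na(sz) && sz >= 100) break
  Sys.sleep(5)
}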
The |
Original issue

I want to run a loop in parallel on a SLURM cluster and use progressr to output near-live updates. Here's my setup: a test.R script and a test.sh job script. The issue is that when the job is running, cat slurm.out always shows

0/100 [--------------------------------------------------] 0% in 0s, ETA: ?

and no live progress is reported. How can I get it to show the current progress? I have read this discussion on getting a progress bar using future.batchtools and that thread on printing messages to the console when they are generated, but I can't get it to work. Is this a bug?
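For context, here is a rough sketch of the kind of setup being described; it is an illustration based on the scripts shown in the thread above, not the actual test.R:

library(future)
library(furrr)
library(progressr)

handlers(global = TRUE)
handlers(handler_progress(
  format = ":current/:total [:bar] :percent in :elapsed, ETA: :eta"
))

plan(cluster)   # workers taken from the Slurm allocation

f <- function(i) {
  Sys.sleep(0.3)
  i
}

local({
  n <- 100
  p <- progressor(n)
  result <- future_map(1:n, function(i) {
    p()      # signal one progress update per iteration
    f(i)
  })
})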