Error in unserialize(node$con) : MultisessionFuture (future_lapply-4) failed to receive results from cluster RichSOCKnode #4 (PID 436932 on localhost ‘localhost’). The reason reported was ‘error reading from connection’. Post-mortem diagnostic: No process exists with this PID, i.e. the localhost worker is no longer alive. #685

Yunuuuu · 2023-05-20T05:42:42Z


Hi, thanks for your great R package future, which makes it really convenient to run long tasks.

Describe the bug


options(future.globals.onReference = "error")
[R]> future::plan("multisession", workers = 10L)
[R]> rrho_res <- biomisc::run_rrho(
         bca_diff_res,
         gender_diff_res,
         stepsize = 100L
     )
[R]> rrho_perm_res <- biomisc::rrho_correct_pval(
         rrho_res,
         method = "permutation", perm = 200L
     )
Error in unserialize(node$con) :                                                 
  MultisessionFuture (future_lapply-4) failed to receive results from cluster RichSOCKnode #4 (PID 436932 on localhost ‘localhost’). The reason reported was ‘error reading from connection’. Post-mortem diagnostic: No process exists with this PID, i.e. the localhost worker is no longer alive. The total size of the 14 globals exported is 15.76 MiB. The three largest globals are ‘rrho_obj’ (15.47 MiB of class ‘list’), ‘rrho_hyper_overlap’ (143.94 KiB of class ‘function’) and ‘progression’ (91.02 KiB of class ‘function’)

I first ran this function with future::plan("multicore", workers = 10L), which gave a similar error (shown below), so I tried multisession as above, as indicated in #474.

options(future.globals.onReference = "error")
[R]> future::plan("multicore", workers = 10L)
[R]> rrho_res <- biomisc::run_rrho(
         bca_diff_res,
         gender_diff_res,
         stepsize = 100L
     )
[R]> rrho_perm_res <- biomisc::rrho_correct_pval(
         rrho_res,
         method = "permutation", perm = 200L
     )
Error: Failed to retrieve the result of MulticoreFuture (future_lapply-1) from the forked worker (on localhost; PID 452943). Post-mortem diagnostic: No process exists with this PID, i.e. the forked localhost worker is no longer alive. The total size of the 14 globals exported is 15.76 MiB. The three largest globals are ‘rrho_obj’ (15.47 MiB of class ‘list’), ‘rrho_hyper_overlap’ (143.94 KiB of class ‘function’) and ‘progression’ (91.02 KiB of class ‘function’)
In addition: Warning message:
In mccollect(jobs = jobs, wait = TRUE) :
  1 parallel job did not deliver a result

rrho_correct_pval is a long function; its source is at https://github.com/Yunuuuu/biomisc/blob/81948d2e5e2bab5a4cf76fd76e8ab4a096192efd/R/run_rrho.R#L787

The main future call is:

        p <- progressr::progressor(steps = perm)
        perm_hyper_metric <- future.apply::future_lapply(
            seq_len(perm), function(i) {
                hyper_res <- rrho_hyper_overlap(
                    names(rrho_obj$rrho_data$list1)[
                        sample.int(length(rrho_obj$rrho_data$list1), replace = FALSE)
                    ],
                    names(rrho_obj$rrho_data$list2)[
                        sample.int(length(rrho_obj$rrho_data$list2), replace = FALSE)
                    ],
                    stepsize = rrho_obj$stepsize,
                    .parallel = FALSE
                )
                p(message = sprintf("Permuatating %d times", i))
                rrho_metrics(hyper_res, log_base = rrho_obj$log_base)
            },
            future.globals = TRUE,
            future.seed = TRUE
        )

Reproduce example

Actually, biomisc::run_rrho also uses future_lapply, but it rarely gives an error (it does fail randomly, and I cannot reproduce the biomisc::run_rrho error message). Running biomisc::rrho_correct_pval after biomisc::run_rrho, however, often reproduces the error above (using my own data: gene expression array data).

With artificial data I cannot make the error occur, so I can't provide robust example code to reproduce it.

Expected behavior

Run without error

Session information


> sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/libmkl_rt.so;  LAPACK version 3.8.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=zh_CN.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=zh_CN.UTF-8 LC_IDENTIFICATION=C       

time zone: Asia/Shanghai
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] compiler_4.3.0        parallelly_1.35.0     cli_3.6.1            
 [4] tools_4.3.0           parallel_4.3.0        future.apply_1.10.0  
 [7] listenv_0.9.0         Rcpp_1.0.10           codetools_0.2-19     
[10] progressr_0.13.0-9002 data.table_1.14.9     biomisc_0.0.0.9000   
[13] jsonlite_1.8.4        digest_0.6.31         globals_0.16.2       
[16] rlang_1.1.1           future_1.32.0   
…

…
> future::futureSessionInfo()
*** Package versions
future 1.32.0, parallelly 1.35.0, parallel 4.3.0, globals 0.16.2, listenv 0.9.0

*** Allocations
availableCores():
system  nproc 
    24     24 
availableWorkers():
$nproc
 [1] "localhost" "localhost" "localhost" "localhost" "localhost" "localhost"
 [7] "localhost" "localhost" "localhost" "localhost" "localhost" "localhost"
[13] "localhost" "localhost" "localhost" "localhost" "localhost" "localhost"
[19] "localhost" "localhost" "localhost" "localhost" "localhost" "localhost"

$system
 [1] "localhost" "localhost" "localhost" "localhost" "localhost" "localhost"
 [7] "localhost" "localhost" "localhost" "localhost" "localhost" "localhost"
[13] "localhost" "localhost" "localhost" "localhost" "localhost" "localhost"
[19] "localhost" "localhost" "localhost" "localhost" "localhost" "localhost"


*** Settings
- future.plan=<not set>
- future.fork.multithreading.enable=<not set>
- future.globals.maxSize=<not set>
- future.globals.onReference=‘error’
- future.resolve.recursive=<not set>
- future.rng.onMisuse=<not set>
- future.wait.timeout=<not set>
- future.wait.interval=<not set>
- future.wait.alpha=<not set>
- future.startup.script=<not set>

*** Backends
Number of workers: 10
List of future strategies:
1. multicore:
   - args: function (..., workers = 10L, envir = parent.frame())
   - tweaked: TRUE
   - call: future::plan("multicore", workers = 10L)

*** Basic tests
Main R session details:
     pid     r sysname           release
1 455700 4.3.0   Linux 5.19.0-41-generic
                                                           version nodename
1 #42~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 18 17:40:00 UTC 2  host001
  machine   login    user effective_user
1  x86_64 user001 user001        user001
Worker R session details:
   worker    pid     r sysname           release
1       1 472740 4.3.0   Linux 5.19.0-41-generic
2       2 472741 4.3.0   Linux 5.19.0-41-generic
3       3 472742 4.3.0   Linux 5.19.0-41-generic
4       4 472743 4.3.0   Linux 5.19.0-41-generic
5       5 472744 4.3.0   Linux 5.19.0-41-generic
6       6 472745 4.3.0   Linux 5.19.0-41-generic
7       7 472746 4.3.0   Linux 5.19.0-41-generic
8       8 472747 4.3.0   Linux 5.19.0-41-generic
9       9 472748 4.3.0   Linux 5.19.0-41-generic
10     10 472749 4.3.0   Linux 5.19.0-41-generic
                                                            version nodename
1  #42~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 18 17:40:00 UTC 2  host001
2  #42~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 18 17:40:00 UTC 2  host001
3  #42~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 18 17:40:00 UTC 2  host001
4  #42~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 18 17:40:00 UTC 2  host001
5  #42~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 18 17:40:00 UTC 2  host001
6  #42~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 18 17:40:00 UTC 2  host001
7  #42~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 18 17:40:00 UTC 2  host001
8  #42~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 18 17:40:00 UTC 2  host001
9  #42~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 18 17:40:00 UTC 2  host001
10 #42~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 18 17:40:00 UTC 2  host001
   machine   login    user effective_user
1   x86_64 user001 user001        user001
2   x86_64 user001 user001        user001
3   x86_64 user001 user001        user001
4   x86_64 user001 user001        user001
5   x86_64 user001 user001        user001
6   x86_64 user001 user001        user001
7   x86_64 user001 user001        user001
8   x86_64 user001 user001        user001
9   x86_64 user001 user001        user001
10  x86_64 user001 user001        user001
Number of unique worker PIDs: 10 (as expected)

HenrikBengtsson (Maintainer) · 2023-05-21T00:09:51Z

Hello. It's clear that something causes the parallel workers to completely crash, regardless of you running 'multicore' or 'multisession' workers. Since it happens to both backends, it's likely to be independent of the parallel backend. If you tried with plan(future.callr::callr), you'd probably see the same there.

Depending on your Linux setup, it could be that you're running out of memory, and Linux decides to terminate your workers. If you've seen some error from Linux like "Out of Memory (OOM) killer", then that's the reason.
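As a rough complement to watching the system monitor, each task can report its own peak memory from inside the worker. The following is only a sketch under stated assumptions: `peak_mb()` is a hypothetical helper, the workload is a placeholder, and `gc()` only accounts for memory managed by R itself, so allocations made by native code would not show up here.

```r
library(future)
library(future.apply)

plan(multisession, workers = 2L)

# Hypothetical helper: peak R-managed memory (in MB) since the last gc(reset = TRUE).
peak_mb <- function() {
  g <- gc()
  col <- which(colnames(g) == "max used") + 1L  # the "(Mb)" column next to "max used"
  sum(g[, col])
}

res <- future_lapply(seq_len(4L), function(i) {
  gc(reset = TRUE)          # reset the peak counters before this task starts
  x <- numeric(1e6 * i)     # placeholder workload; replace with the real task
  list(task = i, pid = Sys.getpid(), peak_mb = peak_mb())
}, future.seed = TRUE)

plan(sequential)            # shut down the background workers

# One row per task: which worker ran it and how much memory it peaked at.
print(do.call(rbind, lapply(res, as.data.frame)))
```

If one task stands out with a much larger peak than the others, that task (or its input) is a good candidate for closer inspection.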

Regardless, I'd suggest that you try with fewer workers to see if you can reproduce the crash. Try with workers = 2, or even a single one, which you can get by workers = I(1) [note the I()]. If that works for you, try to increase the number of workers until you experience the problem. Pay attention to the system's memory usage, e.g. look at top or htop.

If it still crashes with workers = I(1), then see if it even works when running sequentially, i.e. plan(sequential). It could be that it crashes then too. You could even check that case first.
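The escalation described above could be scripted roughly as follows. This is a minimal sketch, assuming a placeholder task (`one_task()` is hypothetical and stands in for one unit of the real workload); it runs the same job under progressively more parallel backends and records which ones finish.

```r
library(future)
library(future.apply)

# Hypothetical stand-in for one unit of the real workload.
one_task <- function(i) {
  x <- sample.int(1000L)
  sum(x) + i
}

# Backends ordered from least to most parallel.
# Note: if the task crashes R itself, the sequential step takes down the main
# session, so run this in a fresh R process.
backends <- list(
  sequential  = function() plan(sequential),
  one_worker  = function() plan(multisession, workers = I(1)),  # I() forces a real background worker
  two_workers = function() plan(multisession, workers = 2L)
)

status <- vapply(names(backends), function(name) {
  backends[[name]]()
  tryCatch({
    future_lapply(seq_len(4L), one_task, future.seed = TRUE)
    TRUE
  }, error = function(e) {
    message(name, " failed: ", conditionMessage(e))
    FALSE
  })
}, logical(1))

plan(sequential)  # shut down any background workers
print(status)
```

The first backend that reports FALSE (or kills the session) narrows down whether the crash depends on parallelism at all.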

spono · 2023-06-08T09:25:28Z


Hi,
Same issue on Windows 10 using future::plan("multisession", workers = 4L), while with the sequential approach the RStudio session crashes. [I think] it can't be a memory issue because I have 32 GB and, monitoring the processes, only one peaked at approx. 4 GB of usage.
The error was also present in the previous future version, but it happened randomly. I'm going to roll back to earlier versions and report back.

[ Sorry, I didn't know where else to post this. I didn't consider opening another bug report after seeing that you converted @Yunuuuu's to this discussion. ]

Error and traceback() report right after the stop:

Error in unserialize(node$con) : 
  MultisessionFuture (<none>) failed to receive results from cluster RichSOCKnode #4 (PID 12624 on localhost ‘localhost’). The reason reported was ‘error reading from connection’. Post-mortem diagnostic: No process exists with this PID, i.e. the localhost worker is no longer alive. The total size of the 22 globals exported is 941.90 KiB. The three largest globals are ‘drivers’ (456.91 KiB of class ‘list’), ‘structure_metrics’ (212.12 KiB of class ‘function’) and ‘is’ (158.55 KiB of class ‘function’)

> traceback()
17: stop(ex)
16: receiveMessageFromWorker(x)
15: resolved.ClusterFuture(future, run = FALSE, .signalEarly = FALSE)
14: resolved(future, run = FALSE, .signalEarly = FALSE)
13: collectValues(where, futures = futures, firstOnly = TRUE)
12: FutureRegistry(reg, action = "collect-first", earlySignal = TRUE)
11: await()
10: requestNode(await = function() {
        FutureRegistry(reg, action = "collect-first", earlySignal = TRUE)
    }, workers = workers)
9: run.ClusterFuture(future)
8: run(future)
7: run.Future(future)
6: run(future)
5: future({
       setThreads(threads)
       options(lidR.progress = FALSE)
       options(lidR.verbose = FALSE)
       options(lidR.raster.default = raster.default)
       y <- NULL
       if (.AUTOREAD == FALSE) {
           y <- do.call(.FUN, params)
       }
       else if (.AUTOREAD == TRUE & .AUTOCROP == FALSE) {
           bbox <- st_bbox(chunk)
           las <- readLAS(chunk)
           y <- NULL
           if (!is.empty(las)) {
               params[[first_p]] <- las
               params[[second_p]] <- bbox
               y <- do.call(.FUN, params)
           }
       }
       else if (.AUTOREAD == TRUE & .AUTOCROP == TRUE) {
    ...
4: engine_apply(clusters, FUN, ctg@processing_options, ctg@output_options, 
       opt[["globals"]], opt[["autoread"]], opt[["autocrop"]], ...)
3: catalog_apply(las, pixel_metrics, func = func, res = res, start = start, 
       ..., .options = options)
2: pixel_metrics.LAScatalog(ctg, .structure_metrics, res = gridRes)
1: pixel_metrics(ctg, .structure_metrics, res = gridRes)

Here is the session info after the error:

> sessionInfo()
R version 4.2.3 (2023-03-15 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=Italian_Italy.utf8  LC_CTYPE=Italian_Italy.utf8    LC_MONETARY=Italian_Italy.utf8 LC_NUMERIC=C                  
[5] LC_TIME=Italian_Italy.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_3.3.6 dplyr_1.0.10  terra_1.7-23  sf_1.0-13     lidR_4.0.3   

loaded via a namespace (and not attached):
  [1] colorspace_2.0-3     ellipsis_0.3.2       class_7.3-21         gld_2.6.5            httpcode_0.3.0       rstudioapi_0.14     
  [7] proxy_0.4-27         listenv_0.8.0        rlas_1.6.3           prodlim_2019.11.13   fansi_1.0.3          mvtnorm_1.1-3       
 [13] lubridate_1.8.0      codetools_0.2-19     splines_4.2.3        rootSolve_1.8.2.3    geojsonlint_0.4.0    jsonlite_1.8.4      
 [19] pROC_1.18.0          caret_6.0-93         cluster_2.1.4        compiler_4.2.3       httr_1.4.4           Matrix_1.5-3        
 [25] lazyeval_0.2.2       cli_3.3.0            prettyunits_1.1.1    tools_4.2.3          gtable_0.3.1         glue_1.6.2          
 [31] lmom_2.9             reshape2_1.4.4       V8_4.2.1             Rcpp_1.0.10          carData_3.0-5        cellranger_1.1.0    
 [37] raster_3.6-20        smoothr_1.0.1        vctrs_0.4.1          writexl_1.4.2        crul_1.2.0           nlme_3.1-162        
 [43] iterators_1.0.14     CAST_0.8.1           timeDate_4021.104    lwgeom_0.2-13        gower_1.0.1          stringr_1.4.1       
 [49] globals_0.16.2       lifecycle_1.0.3      future_1.32.0        MASS_7.3-58.2        scales_1.2.0         ipred_0.9-13        
 [55] hms_1.1.2            parallel_4.2.3       expm_0.999-6         sgsR_1.4.2           curl_4.3.3           Exact_3.1           
 [61] gridExtra_2.3        rpart_4.1.19         stringi_1.7.12       jsonvalidate_1.3.2   foreach_1.5.2        e1071_1.7-13        
 [67] permute_0.9-7        hardhat_1.2.0        boot_1.3-28.1        lava_1.6.10          geometry_0.4.7       rlang_1.0.6         
 [73] pkgconfig_2.0.3      lattice_0.20-45      purrr_0.3.5          recipes_1.0.1        tidyselect_1.2.0     parallelly_1.36.0   
 [79] plyr_1.8.7           magrittr_2.0.3       R6_2.5.1             DescTools_0.99.48    generics_0.1.3       rEO_0.1.0           
 [85] DBI_1.1.3            pillar_1.8.1         withr_2.5.0          mgcv_1.8-42          units_0.8-0          stars_0.5-6         
 [91] survival_3.5-3       abind_1.4-5          sp_1.5-1             nnet_7.3-18          tibble_3.1.8         future.apply_1.11.0 
 [97] ROSE_0.0-4           crayon_1.5.1         car_3.1-0            rmapshaper_0.4.6     KernSmooth_2.23-20   utf8_1.2.3          
[103] viridis_0.6.2        progress_1.2.2       grid_4.2.3           readxl_1.4.2         data.table_1.14.8    vegan_2.6-4         
[109] ModelMetrics_1.2.2.2 digest_0.6.29        classInt_0.4-7       tidyr_1.2.1          stats4_4.2.3         munsell_0.5.0       
[115] viridisLite_0.4.0    magic_1.6-1


spono · 2023-06-08T18:09:37Z

I dug more into the data and found the guilty file, which returns the following error:

>   future::plan( "multisession", workers = 4L )
>   r = pixel_metrics( ctg, .structure_metrics, res = gridRes)
Processing [======>------------------------------------------------------------------------------------------]   7% (1/14) eta:  2m
Error: object 'paths' not found

It looks like the paths object is not exported to the worker session. paths is certainly not part of the .structure_metrics function, but I don't think the error refers to that.
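For context, this is how a missing global can arise with future in general, and how it can be exported by name. It is only an illustrative sketch: the `paths` vector and its file names are made up, and nothing here confirms that this is what happens inside lidR. The `globals` argument of `future()` accepts a character vector of object names to export.

```r
library(future)

plan(multisession, workers = 2L)

paths <- c("tile_1.las", "tile_2.las")  # hypothetical file names, for illustration only

# `paths` is looked up by name at run time, so the static code inspection that
# future uses to find globals cannot see it, and the worker has no such object.
missing_msg <- tryCatch(
  value(future(length(get("paths")))),
  error = function(e) conditionMessage(e)
)
print(missing_msg)

# Naming the global explicitly ships it to the worker.
n_paths <- value(future(length(get("paths")), globals = "paths"))
print(n_paths)

plan(sequential)  # shut down the background workers
```

Setting options(future.globals.onMissing = "error") before creating futures is another way to surface such problems early, at the cost of possible false positives.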

As a cross-check, I then tried the following [with no positive results]:

using the option gc = TRUE within plan(),
reducing the columns from 16 to 6 (using the lidR catalog option select='xyzcar', which allows subsetting the input file),
reducing the number of workers.

You were right: it seems that a particular file caused a function to greatly increase memory usage, driving a session/worker to crash, although I'm only guessing from its behaviour on a data subset. I believe this might also explain why 'paths' was not found (?).

[Nevertheless, I still can't understand why the same process worked fine a couple of weeks ago with the same lidR version]
