Description
Issue
If the user hits Ctrl-C (which signals a user interrupt) while the main R session and a worker are communicating data, the communication ends up in an unrecoverable, corrupt state. The only solution is to restart with a new cluster while waiting for the old cluster node to time out (30 days?).
Suggestion
In R (>= 3.5.0), we have suspendInterrupts(expr), which suspends interrupts while evaluating an expression. Could we wrap all communication calls, i.e. all serialize()/unserialize() calls, in suspendInterrupts()?
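A minimal sketch of what such wrapping could look like, assuming R (>= 3.5.0). The helper names send_protected() and recv_protected() are hypothetical and not part of parallelly or parallel, and a temporary file stands in for the socket connection to a worker:

```r
# Hypothetical helpers (illustrative names only): evaluate the
# serialize()/unserialize() call with user interrupts suspended, so a
# Ctrl-C cannot cut a message in half. Requires R (>= 3.5.0).
send_protected <- function(con, value) {
  suspendInterrupts(serialize(value, connection = con))
}

recv_protected <- function(con) {
  suspendInterrupts(unserialize(con))
}

# Round trip through a temporary file that stands in for a worker socket
path <- tempfile(fileext = ".bin")

con <- file(path, open = "wb")
send_protected(con, list(a = 1:3, b = "hello"))
close(con)

con <- file(path, open = "rb")
received <- recv_protected(con)
close(con)

unlink(path)
```

An interrupt that arrives while the wrapped call is running should be held back until suspendInterrupts() returns, so it would surface between messages rather than in the middle of one.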
There should be no need to do this on workers. That way, a worker can still be terminated by the operating system or a job scheduler by sending it a nicer interrupt signal.
It should probably also be sufficient to protect interactive R sessions. When running R in batch mode, hitting Ctrl-C often means we want the whole R process to terminate. OTOH, with proper interrupt handling (e.g. protecting communication as above and then capturing user interrupts outside), our R process could terminate nicely, which here means calling stopCluster() etc.
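As a rough sketch of capturing user interrupts outside the protected communication, assuming the cleanup step is something like parallel::stopCluster(cl). The function run_with_cleanup() is a hypothetical name used for illustration only:

```r
# Hypothetical wrapper (illustrative name only): run a task and, if the
# user interrupts it, run a cleanup function before returning.
run_with_cleanup <- function(task, cleanup) {
  tryCatch(
    task(),
    interrupt = function(cond) {
      # A user interrupt reached us outside any suspendInterrupts() call
      cleanup()
      invisible(NULL)
    }
  )
}

# Intended use with a real cluster (assumption; not run here):
# cl <- parallel::makeCluster(2)
# run_with_cleanup(
#   task    = function() parallel::parLapply(cl, 1:10, sqrt),
#   cleanup = function() parallel::stopCluster(cl)
# )
```

With the communication itself protected, the interrupt would be delivered only between messages, so the handler can shut the cluster down while the connections are still in a consistent state.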
Actions
Investigate exactly which types of interrupt signals are suspended.
Protect what can be protected in the existing parallelly code.
Document that Ctrl-\ can be used to kill R if the above gets stuck. (What happens in RStudio?)