Description
Pipeline execution on our SGE cluster often fails with errors like this:
Failed to read_int(<file_to_read>) (reason 1 of 1): "Future timed out after [60 seconds]."
This happens with Cromwell 87 and 88.
It affects a range of different tasks, but only intermittently; pipelines that throw this error often run fine. Even given the same input data, a pipeline might hit this error on one attempt and run through without issue on the next. It probably correlates with disk traffic on our system, but I'm hoping there's some way to get Cromwell to wait longer for disk reads, or to queue up disk reads more slowly if that's the cause.
We've tried adjusting `system.io.command-backpressure-staleness` quite a bit because of information in issue #4057, but that hasn't helped. We've also tried adjusting `system.io.throttle.number-of-requests` (per 100 seconds) downward in hopes of lowering the I/O burden, and ramping `system.io.timeout.default` and `system.io.timeout.copy` way up. All to no avail. I also started experimenting with adding parameters to change various akka toolkit timeouts in the `akka` stanza of the config, but since they're not among the things specified in https://github.com/broadinstitute/cromwell/blob/develop/core/src/main/resources/reference.conf I'm not sure whether those are even picked up by Cromwell.
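For concreteness, the overrides described above sit in our config roughly as sketched below. The values are illustrative placeholders only, not the exact ones we tried and not recommendations. The `system.io` key paths are the ones named above; the `SGE` provider name under `backend.providers` is an assumption about how the backend is named in the config.

```hocon
# Sketch only: every value here is a placeholder, not a tested or recommended setting.
system.io {
  # Adjusted following the discussion in issue #4057
  command-backpressure-staleness = 30 seconds

  throttle {
    # Lowered to try to reduce the I/O burden (counted per 100-second window)
    number-of-requests = 1000
  }

  timeout {
    # Raised well above what we started with
    default = 10 minutes
    copy = 2 hours
  }
}

# Assumed provider name "SGE"; this is the job limit that currently avoids the error
backend.providers.SGE.config {
  concurrent-job-limit = 10
}
```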
Anyway, is there anything else you can suggest? For the adjustments we've already tried, like `system.io.command-backpressure-staleness`, are there some values you'd recommend? Our only working solution so far has been to set an unreasonably low `concurrent-job-limit` for SGE in our Cromwell config; that avoids this error by intrinsically reducing the number of possible I/O operations, but it also dramatically reduces the speed of our pipelines.