Read failures: Future timed out after [60 seconds]

Pipeline execution on our SGE cluster often fails with errors like this:
`Failed to read_int(<file_to_read>) (reason 1 of 1): "Future timed out after [60 seconds]."`

This happens with Cromwell 87 and 88.

This happens across a range of different tasks, but only intermittently; pipelines that throw this error often run fine. Even given the same input data, a pipeline might hit this error on one attempt and run through without issue on the next. There's probably some correlation with disk traffic on our system, but I'm hoping there's a way to get Cromwell to wait longer for disk reads, or to queue disk reads more slowly if that's the cause.

We've tried adjusting `system.io.command-backpressure-staleness` quite a bit based on the information in issue #4057, but that hasn't helped. We've also tried lowering `system.io.throttle.number-of-requests` (per 100 seconds) in hopes of reducing the I/O burden, and raising `system.io.timeout.default` and `system.io.timeout.copy` way up, all to no avail. I also started experimenting with adding parameters to the `akka` stanza of the config to change various Akka toolkit timeouts, but since they're not among the settings specified in https://github.com/broadinstitute/cromwell/blob/develop/core/src/main/resources/reference.conf I'm not sure whether Cromwell even picks them up.
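For reference, here is a rough sketch of how the `system.io` overrides named above could be written in the config. The key paths are copied as written in this post; every value is an illustrative placeholder, not a setting we found to work and not a recommendation.

```hocon
# Sketch only: key paths as named in this issue, values are placeholders.
# None of these changes resolved the "Future timed out" read failures for us.

# Adjusted based on the discussion in issue #4057
system.io.command-backpressure-staleness = 30 seconds

# Lowered in hopes of reducing the overall I/O burden
system.io.throttle.number-of-requests = 1000

# Raised well above whatever the defaults are
system.io.timeout.default = 30 minutes
system.io.timeout.copy = 6 hours
```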

Is there anything else you can suggest? For the adjustments we've already tried, like `system.io.command-backpressure-staleness`, are there values you'd recommend? The only workaround that has worked so far is setting an unreasonably low `concurrent-job-limit` for SGE in our Cromwell config. That avoids the error by reducing the number of possible concurrent I/O operations, but it also dramatically slows down our pipelines.
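The workaround looks roughly like the sketch below. The provider name `SGE` and the limit of `5` are assumptions for illustration; the provider name is whatever the backend is called in your own config, and the rest of the backend stanza is omitted.

```hocon
# Sketch only: provider name "SGE" and the limit value are illustrative.
backend {
  providers {
    SGE {
      config {
        # Caps how many jobs this backend runs at once. A low value avoids
        # the read timeouts but slows the pipelines considerably.
        concurrent-job-limit = 5
      }
    }
  }
}
```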


Read failures: Future timed out after [60 seconds] #7726
