ssh: Prevent timeout from deadlock #1265

JoshuaWatt · 2023-09-07T15:03:55Z

Using Popen.wait() on a process that has output sent to a pipe can potentially deadlock if the process produces enough output to fill the pipe, since it will stall and never terminate waiting for the pipe to have more space.

Instead, use Popen.communicate() with the timeout parameter. This will consume all output until EOF (preventing the process from stalling due to a full pipe), and then check the return code. In the event of a timeout error, Popen.communicate() doesn't loose any data, so it's safe to call it again after the Popen.kill() in the exception handler.

This likely was done this way because the timeout parameter was new in Python 3.3, but this shouldn't be a concern anymore

Description

Checklist

Documentation for the feature
Tests for the feature

The arguments and description in doc/configuration.rst have been updated

Add a section on how to use the feature to doc/usage.rst

Add a section on how to use the feature to doc/development.rst

PR has been tested

Man pages have been regenerated

Using Popen.wait() on a process that has output sent to a pipe can potentially deadlock if the process produces enough output to fill the pipe, since it will stall and never terminate waiting for the pipe to have more space. Instead, use Popen.communicate() with the timeout parameter. This will consume all output until EOF (preventing the process from stalling due to a full pipe), and then check the return code. In the event of a timeout error, Popen.communicate() doesn't loose any data, so it's safe to call it again after the Popen.kill() in the exception handler. This likely was done this way because the timeout parameter was new in Python 3.3, but this shouldn't be a concern anymore Signed-off-by: Joshua Watt <[email protected]>

codecov · 2023-09-07T16:53:03Z

Codecov Report

Patch coverage: 66.6% and project coverage change: -0.1% ⚠️

Comparison is base (2dcc631) 63.0% compared to head (b8f059f) 63.0%.

Additional details and impacted files

@@           Coverage Diff            @@
##           master   #1265     +/-   ##
========================================
- Coverage    63.0%   63.0%   -0.1%     
========================================
  Files         160     160             
  Lines       11848   11849      +1     
========================================
  Hits         7472    7472             
- Misses       4376    4377      +1

Files Changed	Coverage Δ
labgrid/driver/sshdriver.py	`56.1% <50.0%> (-0.2%)`	⬇️
labgrid/util/ssh.py	`91.4% <100.0%> (ø)`

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Bastian-Krause

Looks good, thanks!

jluebbe · 2023-10-13T13:06:08Z

@JoshuaWatt Did you actually experience the deadlock?

JoshuaWatt · 2023-10-13T14:06:00Z

We have seen timeouts and the speculation was that it was the deadlock. I don't think we've seen any since this patch merged..... but ya we also encountered the SSH <8.5 bug unfortunately.

JoshuaWatt · 2023-10-13T14:07:31Z

And unfortunately, the default SSH on Ubuntu 20.04 has the problem where it doesn't close stdout and there isn't an easy work-around, so it's hard to avoid :(

jluebbe · 2023-10-13T14:26:40Z

I've never seen SSH control master startup produce significant output (i.e. more than 8k), so I'm skeptical if a pipe deadlock is the cause.

In b8f059f you speculate:

This likely was done this way because the timeout parameter was new in Python 3.3, but this shouldn't be a concern anymore

A wait() with timeout was used since the initial implementation: e4862fa#diff-9d05d7cbdae8b8ebd1d2d5d387c295c966943c5133258be844cb4d38aefd90e8R41

We wanted to make sure that we don't depend on the state of the pipes, so use of wait() was intentional, as it directly covers the interesting cases:

immediate abort (due to a config error or similar)
normal startup (parent process exits after fork)
hang (via wait timeout)

A way to keep this behavior while handling a full pipe would be to use communicate with a very short timeout to start the reader threads and then using wait() to wait for the fork (instead of waiting for pipe closure).

JoshuaWatt · 2023-10-13T14:41:28Z

I've never seen SSH control master startup produce significant output (i.e. more than 8k), so I'm skeptical if a pipe deadlock is the cause.

In b8f059f you speculate:

This likely was done this way because the timeout parameter was new in Python 3.3, but this shouldn't be a concern anymore

A wait() with timeout was used since the initial implementation: e4862fa#diff-9d05d7cbdae8b8ebd1d2d5d387c295c966943c5133258be844cb4d38aefd90e8R41

We wanted to make sure that we don't depend on the state of the pipes, so use of wait() was intentional, as it directly covers the interesting cases:
* immediate abort (due to a config error or similar)

* normal startup (parent process exits after fork)

* hang (via wait timeout)
A way to keep this behavior while handling a full pipe would be to use communicate with a very short timeout to start the reader threads and then using wait() to wait for the fork (instead of waiting for pipe closure).

It feels like that's relying on internal implementation details (that the threads stick reading data after communicate() times out) a little too much. Not sure if that worse than the SSH version check though. The version check at least is pretty clear when it can be dropped.

jluebbe · 2023-10-13T15:03:57Z

OK. Still, what we want here is not "read until pipes are closed" but "read until the process has forked (or failed)"

So how about a helper communicate_until_fork, which loops over wait() with a short timeout and then communicate() with another short timeout until .returncode is no longer None.

JoshuaWatt · 2023-10-13T15:11:23Z

Ya, that seems reasonable

jluebbe · 2023-10-13T15:11:27Z

Or, alternatively, I think adding a timeout argument to ProcessWrapper.check_output() would do what we need here. But that's an even bigger change...

This reverts commit b8f059f. OpenSSH < 8.5 called with -f waits for the stderr pipe to be closed before forking. [1] fixes this behavior. The labgrid commit to be reverted calls communicate(), relying on a timely fork. Affected OpenSSH versions now run into a TimeoutExpired exception on communicate() leading to an ExecutionError being raised in labgrid. A wait() with timeout was used since the initial implementation [2]. We wanted to make sure that we don't depend on the state of the pipes, so use of wait() was intentional, as it directly covers the interesting cases: - immediate abort (due to a config error or similar) - normal startup (parent process exits after fork) - hang (via wait() timeout) Reverting the problematic commit avoids the complexity of having to maintain two different ways to start SSH as suggested in [3]. If timeouts still occur after the revert, we should implement this suggestion [4]. Discussion that lead to this patch can be found in the PR introducing the problematic commit [5] as well as in [3]. The other commit [6] in the PR [5] has a similar approach for `labgrid.util.SSHConnection._start_own_master()`. It shouldn't be problematic though: The ssh process is not expected to fork as it is not called with -f. [1] OpenSSH: 396d32f3a ("upstream: There are lots of place where we want to redirect stdin,") [2] e4862fa ("SSHDriver: ControlMaster & Driver") [3] labgrid-project#1278 [4] labgrid-project#1265 (comment) [5] labgrid-project#1265 [6] f9ca024 ("ssh: Prevent timeout from deadlock") Signed-off-by: Bastian Krause <[email protected]>

JoshuaWatt force-pushed the ssh-deadlock-fix branch from f2247f2 to bd0ce2b Compare September 7, 2023 15:47

JoshuaWatt force-pushed the ssh-deadlock-fix branch from bd0ce2b to b8f059f Compare September 7, 2023 16:44

Bastian-Krause approved these changes Sep 18, 2023

View reviewed changes

Bastian-Krause added the fix label Sep 18, 2023

Emantor approved these changes Sep 19, 2023

View reviewed changes

Emantor merged commit 342dc90 into labgrid-project:master Sep 19, 2023

Bastian-Krause mentioned this pull request Oct 9, 2023

sshdriver: fix communicate hang if ssh version < 8.5 #1278

Closed

Bastian-Krause mentioned this pull request Oct 20, 2023

Revert "sshdriver: Prevent timeout from deadlock" #1288

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ssh: Prevent timeout from deadlock #1265

ssh: Prevent timeout from deadlock #1265

Uh oh!

JoshuaWatt commented Sep 7, 2023

Uh oh!

codecov bot commented Sep 7, 2023 •

edited

Loading

Uh oh!

Bastian-Krause left a comment

Uh oh!

jluebbe commented Oct 13, 2023

Uh oh!

JoshuaWatt commented Oct 13, 2023

Uh oh!

JoshuaWatt commented Oct 13, 2023

Uh oh!

jluebbe commented Oct 13, 2023

Uh oh!

JoshuaWatt commented Oct 13, 2023

Uh oh!

jluebbe commented Oct 13, 2023

Uh oh!

JoshuaWatt commented Oct 13, 2023

Uh oh!

jluebbe commented Oct 13, 2023 •

edited

Loading

Uh oh!

Uh oh!

ssh: Prevent timeout from deadlock #1265

ssh: Prevent timeout from deadlock #1265

Uh oh!

Conversation

JoshuaWatt commented Sep 7, 2023

Uh oh!

codecov bot commented Sep 7, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Bastian-Krause left a comment

Choose a reason for hiding this comment

Uh oh!

jluebbe commented Oct 13, 2023

Uh oh!

JoshuaWatt commented Oct 13, 2023

Uh oh!

JoshuaWatt commented Oct 13, 2023

Uh oh!

jluebbe commented Oct 13, 2023

Uh oh!

JoshuaWatt commented Oct 13, 2023

Uh oh!

jluebbe commented Oct 13, 2023

Uh oh!

JoshuaWatt commented Oct 13, 2023

Uh oh!

jluebbe commented Oct 13, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

codecov bot commented Sep 7, 2023 •

edited

Loading

jluebbe commented Oct 13, 2023 •

edited

Loading