Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ssh: Prevent timeout from deadlock #1265

Merged
merged 2 commits into from
Sep 19, 2023

Conversation

JoshuaWatt
Copy link
Contributor

Using Popen.wait() on a process that has output sent to a pipe can potentially deadlock if the process produces enough output to fill the pipe, since it will stall and never terminate waiting for the pipe to have more space.

Instead, use Popen.communicate() with the timeout parameter. This will consume all output until EOF (preventing the process from stalling due to a full pipe), and then check the return code. In the event of a timeout error, Popen.communicate() doesn't loose any data, so it's safe to call it again after the Popen.kill() in the exception handler.

This likely was done this way because the timeout parameter was new in Python 3.3, but this shouldn't be a concern anymore

Description

Checklist

  • Documentation for the feature
  • Tests for the feature
  • The arguments and description in doc/configuration.rst have been updated
  • Add a section on how to use the feature to doc/usage.rst
  • Add a section on how to use the feature to doc/development.rst
  • PR has been tested
  • Man pages have been regenerated

Using Popen.wait() on a process that has output sent to a pipe can
potentially deadlock if the process produces enough output to fill the
pipe, since it will stall and never terminate waiting for the pipe to
have more space.

Instead, use Popen.communicate() with the timeout parameter. This will
consume all output until EOF (preventing the process from stalling due
to a full pipe), and then check the return code. In the event of a
timeout error, Popen.communicate() doesn't loose any data, so it's safe
to call it again after the Popen.kill() in the exception handler.

This likely was done this way because the timeout parameter was new in
Python 3.3, but this shouldn't be a concern anymore

Signed-off-by: Joshua Watt <[email protected]>
Using Popen.wait() on a process that has output sent to a pipe can
potentially deadlock if the process produces enough output to fill the
pipe, since it will stall and never terminate waiting for the pipe to
have more space.

Instead, use Popen.communicate() with the timeout parameter. This will
consume all output until EOF (preventing the process from stalling due
to a full pipe), and then check the return code. In the event of a
timeout error, Popen.communicate() doesn't loose any data, so it's safe
to call it again after the Popen.kill() in the exception handler.

This likely was done this way because the timeout parameter was new in
Python 3.3, but this shouldn't be a concern anymore

Signed-off-by: Joshua Watt <[email protected]>
@codecov
Copy link

codecov bot commented Sep 7, 2023

Codecov Report

Patch coverage: 66.6% and project coverage change: -0.1% ⚠️

Comparison is base (2dcc631) 63.0% compared to head (b8f059f) 63.0%.

Additional details and impacted files
@@           Coverage Diff            @@
##           master   #1265     +/-   ##
========================================
- Coverage    63.0%   63.0%   -0.1%     
========================================
  Files         160     160             
  Lines       11848   11849      +1     
========================================
  Hits         7472    7472             
- Misses       4376    4377      +1     
Files Changed Coverage Δ
labgrid/driver/sshdriver.py 56.1% <50.0%> (-0.2%) ⬇️
labgrid/util/ssh.py 91.4% <100.0%> (ø)

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@Bastian-Krause Bastian-Krause left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks!

@Emantor Emantor merged commit 342dc90 into labgrid-project:master Sep 19, 2023
@jluebbe
Copy link
Member

jluebbe commented Oct 13, 2023

@JoshuaWatt Did you actually experience the deadlock?

@JoshuaWatt
Copy link
Contributor Author

We have seen timeouts and the speculation was that it was the deadlock. I don't think we've seen any since this patch merged..... but ya we also encountered the SSH <8.5 bug unfortunately.

@JoshuaWatt
Copy link
Contributor Author

And unfortunately, the default SSH on Ubuntu 20.04 has the problem where it doesn't close stdout and there isn't an easy work-around, so it's hard to avoid :(

@jluebbe
Copy link
Member

jluebbe commented Oct 13, 2023

I've never seen SSH control master startup produce significant output (i.e. more than 8k), so I'm skeptical if a pipe deadlock is the cause.

In b8f059f you speculate:

This likely was done this way because the timeout parameter was new in Python 3.3, but this shouldn't be a concern anymore

A wait() with timeout was used since the initial implementation: e4862fa#diff-9d05d7cbdae8b8ebd1d2d5d387c295c966943c5133258be844cb4d38aefd90e8R41

We wanted to make sure that we don't depend on the state of the pipes, so use of wait() was intentional, as it directly covers the interesting cases:

  • immediate abort (due to a config error or similar)
  • normal startup (parent process exits after fork)
  • hang (via wait timeout)

A way to keep this behavior while handling a full pipe would be to use communicate with a very short timeout to start the reader threads and then using wait() to wait for the fork (instead of waiting for pipe closure).

@JoshuaWatt
Copy link
Contributor Author

I've never seen SSH control master startup produce significant output (i.e. more than 8k), so I'm skeptical if a pipe deadlock is the cause.

In b8f059f you speculate:

This likely was done this way because the timeout parameter was new in Python 3.3, but this shouldn't be a concern anymore

A wait() with timeout was used since the initial implementation: e4862fa#diff-9d05d7cbdae8b8ebd1d2d5d387c295c966943c5133258be844cb4d38aefd90e8R41

We wanted to make sure that we don't depend on the state of the pipes, so use of wait() was intentional, as it directly covers the interesting cases:

* immediate abort (due to a config error or similar)

* normal startup (parent process exits after fork)

* hang (via wait timeout)

A way to keep this behavior while handling a full pipe would be to use communicate with a very short timeout to start the reader threads and then using wait() to wait for the fork (instead of waiting for pipe closure).

It feels like that's relying on internal implementation details (that the threads stick reading data after communicate() times out) a little too much. Not sure if that worse than the SSH version check though. The version check at least is pretty clear when it can be dropped.

@jluebbe
Copy link
Member

jluebbe commented Oct 13, 2023

OK. Still, what we want here is not "read until pipes are closed" but "read until the process has forked (or failed)"

So how about a helper communicate_until_fork, which loops over wait() with a short timeout and then communicate() with another short timeout until .returncode is no longer None.

@JoshuaWatt
Copy link
Contributor Author

Ya, that seems reasonable

@jluebbe
Copy link
Member

jluebbe commented Oct 13, 2023

Or, alternatively, I think adding a timeout argument to ProcessWrapper.check_output() would do what we need here. But that's an even bigger change...

Bastian-Krause added a commit to Bastian-Krause/labgrid that referenced this pull request Oct 20, 2023
This reverts commit b8f059f.

OpenSSH < 8.5 called with -f waits for the stderr pipe to be closed
before forking. [1] fixes this behavior.
The labgrid commit to be reverted calls communicate(), relying on a
timely fork. Affected OpenSSH versions now run into a TimeoutExpired
exception on communicate() leading to an ExecutionError being raised in
labgrid.

A wait() with timeout was used since the initial implementation [2].
We wanted to make sure that we don't depend on the state of the pipes, so
use of wait() was intentional, as it directly covers the interesting cases:

- immediate abort (due to a config error or similar)
- normal startup (parent process exits after fork)
- hang (via wait() timeout)

Reverting the problematic commit avoids the complexity of having to
maintain two different ways to start SSH as suggested in [3].

If timeouts still occur after the revert, we should implement this
suggestion [4].

Discussion that lead to this patch can be found in the PR introducing
the problematic commit [5] as well as in [3].

The other commit [6] in the PR [5] has a similar approach for
`labgrid.util.SSHConnection._start_own_master()`. It shouldn't
be problematic though: The ssh process is not expected to fork as it is
not called with -f.

[1] OpenSSH: 396d32f3a ("upstream: There are lots of place where we want to redirect stdin,")
[2] e4862fa ("SSHDriver: ControlMaster & Driver")
[3] labgrid-project#1278
[4] labgrid-project#1265 (comment)
[5] labgrid-project#1265
[6] f9ca024 ("ssh: Prevent timeout from deadlock")

Signed-off-by: Bastian Krause <[email protected]>
rpoisel pushed a commit to honeytreelabs/labgrid that referenced this pull request Oct 24, 2023
This reverts commit b8f059f.

OpenSSH < 8.5 called with -f waits for the stderr pipe to be closed
before forking. [1] fixes this behavior.
The labgrid commit to be reverted calls communicate(), relying on a
timely fork. Affected OpenSSH versions now run into a TimeoutExpired
exception on communicate() leading to an ExecutionError being raised in
labgrid.

A wait() with timeout was used since the initial implementation [2].
We wanted to make sure that we don't depend on the state of the pipes, so
use of wait() was intentional, as it directly covers the interesting cases:

- immediate abort (due to a config error or similar)
- normal startup (parent process exits after fork)
- hang (via wait() timeout)

Reverting the problematic commit avoids the complexity of having to
maintain two different ways to start SSH as suggested in [3].

If timeouts still occur after the revert, we should implement this
suggestion [4].

Discussion that lead to this patch can be found in the PR introducing
the problematic commit [5] as well as in [3].

The other commit [6] in the PR [5] has a similar approach for
`labgrid.util.SSHConnection._start_own_master()`. It shouldn't
be problematic though: The ssh process is not expected to fork as it is
not called with -f.

[1] OpenSSH: 396d32f3a ("upstream: There are lots of place where we want to redirect stdin,")
[2] e4862fa ("SSHDriver: ControlMaster & Driver")
[3] labgrid-project#1278
[4] labgrid-project#1265 (comment)
[5] labgrid-project#1265
[6] f9ca024 ("ssh: Prevent timeout from deadlock")

Signed-off-by: Bastian Krause <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants