ssh: Prevent timeout from deadlock #1265
Conversation
Using Popen.wait() on a process whose output is sent to a pipe can deadlock if the process produces enough output to fill the pipe: it will stall and never terminate, waiting for the pipe to have more space. Instead, use Popen.communicate() with the timeout parameter. This consumes all output until EOF (preventing the process from stalling on a full pipe) and then checks the return code. In the event of a timeout error, Popen.communicate() doesn't lose any data, so it's safe to call it again after the Popen.kill() in the exception handler. This was likely done this way because the timeout parameter was new in Python 3.3, but that shouldn't be a concern anymore.

Signed-off-by: Joshua Watt <[email protected]>
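The pattern described in the commit message can be sketched as follows. This is a minimal, illustrative version (the function name, command, and timeout value are placeholders, not labgrid's actual code):

```python
import subprocess

def run_with_timeout(cmd, timeout=30):
    """Run cmd, draining its pipes so it can never stall on a full
    pipe, and enforce a timeout (illustrative sketch)."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    try:
        # communicate() reads both pipes until EOF, so the child can
        # never block on a full pipe, then reaps the process.
        stdout, stderr = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        # communicate() does not lose data already read before the
        # timeout, so calling it again after kill() is safe.
        stdout, stderr = proc.communicate()
    return proc.returncode, stdout, stderr
```

This is the pattern recommended by the Python documentation for `Popen.communicate()` with a timeout; the return code is then checked by the caller.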
Force-pushed from f2247f2 to bd0ce2b.
Force-pushed from bd0ce2b to b8f059f.
Codecov Report

Patch coverage: additional details and impacted files

@@ Coverage Diff @@
## master #1265 +/- ##
========================================
- Coverage 63.0% 63.0% -0.1%
========================================
Files 160 160
Lines 11848 11849 +1
========================================
Hits 7472 7472
- Misses 4376 4377 +1
========================================

☔ View full report in Codecov by Sentry.
Looks good, thanks!
@JoshuaWatt Did you actually experience the deadlock?
We have seen timeouts, and the speculation was that it was the deadlock. I don't think we've seen any since this patch merged... but yeah, we also encountered the SSH <8.5 bug, unfortunately.
And unfortunately, the default SSH on Ubuntu 20.04 has the problem where it doesn't close stdout, and there isn't an easy workaround, so it's hard to avoid :(
I've never seen SSH control master startup produce significant output (i.e. more than 8k), so I'm skeptical that a pipe deadlock is the cause. In b8f059f you speculate:
A wait() with timeout was used since the initial implementation: e4862fa#diff-9d05d7cbdae8b8ebd1d2d5d387c295c966943c5133258be844cb4d38aefd90e8R41 We wanted to make sure that we don't depend on the state of the pipes, so use of wait() was intentional, as it directly covers the interesting cases:

- immediate abort (due to a config error or similar)
- normal startup (parent process exits after fork)
- hang (via wait() timeout)
A way to keep this behavior while handling a full pipe would be to use communicate() with a very short timeout to start the reader threads, and then use wait() to wait for the fork (instead of waiting for pipe closure).
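A more explicit way to get "drain the pipes, but wait on the process" without relying on communicate() internals is to drain each pipe in a background thread and call wait() on the process directly. This is a hedged sketch of that idea (names are illustrative, not labgrid's implementation):

```python
import subprocess
import threading

def start_and_wait(cmd, timeout=30):
    """Drain stdout/stderr in daemon threads so a full pipe can never
    stall the child, while wait() observes process termination
    directly, independent of pipe state (illustrative sketch)."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    chunks = {"stdout": b"", "stderr": b""}

    def drain(name, pipe):
        # read() blocks until EOF, keeping the pipe empty even if the
        # caller only cares about the return code.
        chunks[name] = pipe.read()

    threads = [
        threading.Thread(target=drain, args=(n, p), daemon=True)
        for n, p in (("stdout", proc.stdout), ("stderr", proc.stderr))
    ]
    for t in threads:
        t.start()
    # wait() cares only about termination, not about pipe closure.
    rc = proc.wait(timeout=timeout)
    for t in threads:
        t.join(timeout=1)
    return rc, chunks["stdout"], chunks["stderr"]
```

Unlike the communicate()-with-short-timeout trick, the reader threads here are explicit, so the behavior does not depend on CPython implementation details.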
It feels like that's relying on internal implementation details (that the reader threads keep consuming data after communicate() times out) a little too much. Not sure if that's worse than the SSH version check, though. The version check at least makes it pretty clear when it can be dropped.
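For reference, the version check mentioned here could look roughly like the following. The 8.5 threshold comes from the OpenSSH stderr-pipe fix discussed above; the function names and parsing are assumptions for illustration:

```python
import re
import subprocess

def openssh_version():
    """Parse the OpenSSH version from `ssh -V`, which prints a line
    like 'OpenSSH_8.9p1 Ubuntu-3ubuntu0.1, ...' on stderr.
    Returns a (major, minor) tuple, or None if ssh is missing or the
    output is unrecognized."""
    try:
        out = subprocess.run(
            ["ssh", "-V"], capture_output=True, text=True
        ).stderr
    except FileNotFoundError:
        return None
    match = re.match(r"OpenSSH_(\d+)\.(\d+)", out)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def has_stderr_fork_bug():
    # OpenSSH < 8.5 called with -f waits for the stderr pipe to be
    # closed before forking (see discussion above).
    version = openssh_version()
    return version is not None and version < (8, 5)
```

As noted, such a check is easy to drop once OpenSSH < 8.5 no longer needs to be supported.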
OK. Still, what we want here is not "read until the pipes are closed" but "read until the process has forked (or failed)". So how about a helper?
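Such a helper could exploit the fact that with `ssh -f` the parent process exits once the control master has forked into the background, so polling for termination answers "has it forked (or failed)?" directly. A hypothetical sketch (the helper name and parameters are made up for illustration):

```python
import subprocess
import time

def wait_for_fork(proc, timeout=30.0, poll_interval=0.1):
    """Hypothetical helper: wait until a `ssh -f`-style parent
    process exits, covering the interesting cases directly:
    immediate abort (nonzero exit), normal startup (parent exits
    after fork), and hang (timeout)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        returncode = proc.poll()
        if returncode is not None:
            # Parent exited: 0 means the fork succeeded, anything
            # else is an immediate startup failure.
            return returncode == 0
        time.sleep(poll_interval)
    # Hang: give up, kill, and reap the process.
    proc.kill()
    proc.wait()
    return False
```

Note this sketch does not drain the pipes; per the observation above that control master startup rarely produces more than a pipe buffer (~8k) of output, that is usually safe, but a real implementation might still want to read them afterwards.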
Ya, that seems reasonable.
Or, alternatively, I think adding a timeout argument to |
This reverts commit b8f059f.

OpenSSH < 8.5 called with -f waits for the stderr pipe to be closed before forking. [1] fixes this behavior. The labgrid commit to be reverted calls communicate(), relying on a timely fork. Affected OpenSSH versions now run into a TimeoutExpired exception on communicate(), leading to an ExecutionError being raised in labgrid.

A wait() with timeout was used since the initial implementation [2]. We wanted to make sure that we don't depend on the state of the pipes, so use of wait() was intentional, as it directly covers the interesting cases:

- immediate abort (due to a config error or similar)
- normal startup (parent process exits after fork)
- hang (via wait() timeout)

Reverting the problematic commit avoids the complexity of having to maintain two different ways to start SSH, as suggested in [3]. If timeouts still occur after the revert, we should implement this suggestion [4].

Discussion that led to this patch can be found in the PR introducing the problematic commit [5] as well as in [3]. The other commit [6] in the PR [5] takes a similar approach for `labgrid.util.SSHConnection._start_own_master()`. It shouldn't be problematic, though: the ssh process is not expected to fork, as it is not called with -f.

[1] OpenSSH: 396d32f3a ("upstream: There are lots of place where we want to redirect stdin,")
[2] e4862fa ("SSHDriver: ControlMaster & Driver")
[3] labgrid-project#1278
[4] labgrid-project#1265 (comment)
[5] labgrid-project#1265
[6] f9ca024 ("ssh: Prevent timeout from deadlock")

Signed-off-by: Bastian Krause <[email protected]>