-
Notifications
You must be signed in to change notification settings - Fork 195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sshdriver: fix communicate hang if ssh version < 8.5 #1278
Conversation
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## master #1278 +/- ##
======================================
Coverage 63.0% 63.0%
======================================
Files 160 160
Lines 11865 11875 +10
======================================
+ Hits 7479 7491 +12
+ Misses 4386 4384 -2
☔ View full report in Codecov by Sentry. |
Thanks for the PR! We've noticed this, too. While the commit openssh/openssh-portable@396d32f sounds a lot like the problem, we're also seeing this on OpenSSH 8.4 (Debian bullseye). I've bisected the problem to openssh/openssh-portable@396d32f. Could you verify that we are in fact seeing the same issue? Should be as simple as running By the way, the problem is also going away when omitting the |
d278576
to
157707e
Compare
You are correct, the 8.4 did not fix that issue, it indeed be fixed in 8.5. And yes, |
Signed-off-by: Larry Shen <[email protected]>
157707e
to
7ec4c26
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest one small simplification, rest looks good, thanks!
Signed-off-by: Larry Shen <[email protected]>
Signed-off-by: Larry Shen <[email protected]>
7ec4c26
to
f94bc01
Compare
Successfully tested with:
|
FTR: I have also modified the version check with #1281 (using |
I think the simpler fix is to revert #1265. It avoids the complexity of having to maintain two different ways to start SSH. If timeouts still occur after the revert, we should implement #1265 (comment). |
That's ok for me as I never saw issue before that commit, feel free to close this MR after your revert. |
This reverts commit b8f059f. OpenSSH < 8.5 called with -f waits for the stderr pipe to be closed before forking. [1] fixes this behavior. The labgrid commit to be reverted calls communicate(), relying on a timely fork. Affected OpenSSH versions now run into a TimeoutExpired exception on communicate() leading to an ExecutionError being raised in labgrid. A wait() with timeout was used since the initial implementation [2]. We wanted to make sure that we don't depend on the state of the pipes, so use of wait() was intentional, as it directly covers the interesting cases: - immediate abort (due to a config error or similar) - normal startup (parent process exits after fork) - hang (via wait() timeout) Reverting the problematic commit avoids the complexity of having to maintain two different ways to start SSH as suggested in [3]. If timeouts still occur after the revert, we should implement this suggestion [4]. Discussion that lead to this patch can be found in the PR introducing the problematic commit [5] as well as in [3]. The other commit [6] in the PR [5] has a similar approach for `labgrid.util.SSHConnection._start_own_master()`. It shouldn't be problematic though: The ssh process is not expected to fork as it is not called with -f. [1] OpenSSH: 396d32f3a ("upstream: There are lots of place where we want to redirect stdin,") [2] e4862fa ("SSHDriver: ControlMaster & Driver") [3] labgrid-project#1278 [4] labgrid-project#1265 (comment) [5] labgrid-project#1265 [6] f9ca024 ("ssh: Prevent timeout from deadlock") Signed-off-by: Bastian Krause <[email protected]>
This reverts commit b8f059f. OpenSSH < 8.5 called with -f waits for the stderr pipe to be closed before forking. [1] fixes this behavior. The labgrid commit to be reverted calls communicate(), relying on a timely fork. Affected OpenSSH versions now run into a TimeoutExpired exception on communicate() leading to an ExecutionError being raised in labgrid. A wait() with timeout was used since the initial implementation [2]. We wanted to make sure that we don't depend on the state of the pipes, so use of wait() was intentional, as it directly covers the interesting cases: - immediate abort (due to a config error or similar) - normal startup (parent process exits after fork) - hang (via wait() timeout) Reverting the problematic commit avoids the complexity of having to maintain two different ways to start SSH as suggested in [3]. If timeouts still occur after the revert, we should implement this suggestion [4]. Discussion that lead to this patch can be found in the PR introducing the problematic commit [5] as well as in [3]. The other commit [6] in the PR [5] has a similar approach for `labgrid.util.SSHConnection._start_own_master()`. It shouldn't be problematic though: The ssh process is not expected to fork as it is not called with -f. [1] OpenSSH: 396d32f3a ("upstream: There are lots of place where we want to redirect stdin,") [2] e4862fa ("SSHDriver: ControlMaster & Driver") [3] labgrid-project#1278 [4] labgrid-project#1265 (comment) [5] labgrid-project#1265 [6] f9ca024 ("ssh: Prevent timeout from deadlock") Signed-off-by: Bastian Krause <[email protected]>
Description
It looks there is some change to use
subprocess.communicate
to replacewait
in ssh driver recently, but it indeed makes sshdriver hang if ssh version < 8.5, then break our usage.You can quickly check it with docker example which use sshdriver, it will finally fails and print traceback:
I use next example to describe the issue:
test.py:
Above example will hang at
process.communicate
when setup master ssh connection if ssh < 8.4, while it's ok if ssh is newer.This means the ssh -f did not close stderr when put the ssh to background, so communicate can't read a EOF of stderr then hangs as next:
After check ssh source tree, I found this commit: openssh/openssh-portable@d14fe25 which fixed from V8.4P1. So, for ssh version <8.4, we shouldn't use
communicate
whenssh -f
put itself to background.As a result, this PR distinguish ssh version, for >=8.5, we use
communicate
; while for <8.4 we still back to old ways to usewait
(This may block if ssh produce lots of information, but for a master connection which did nothing just a communication setup, there is low chance to have any issue. And sooner or later ssh>8.4 will be the main stream. Otherwise, currently totally block for <8.5 user.Checklist
Fixes #1265 (for OpenSSH <8.5)