Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sshdriver: fix communicate hang if ssh version < 8.5 #1278

Closed
wants to merge 3 commits into from

Conversation

atline
Copy link
Contributor

@atline atline commented Oct 6, 2023

Description
It looks there is some change to use subprocess.communicate to replace wait in ssh driver recently, but it indeed makes sshdriver hang if ssh version < 8.5, then break our usage.

You can quickly check it with docker example which use sshdriver, it will finally fails and print traceback:

try:
    subprocess_timeout = timeout + 5
> stdout, _ = self.process.communicate(timeout=subprocess_timeout)

../../labgrid/driver/sshdriver.py:139:

I use next example to describe the issue:

test.py:

import subprocess
import os
import sys
import stat

ssh_user = "my_user"       # change to your account
ssh_passwd = "my_password" # change to your password

try:
    os.remove("/tmp/control-channel")
except:
    pass

# prepare passwd file
file = open("./passwd","w")
passwd_content = f"#!/bin/sh\necho {ssh_passwd}"
file.write(passwd_content)
file.close()
os.chmod("./passwd", stat.S_IRWXU)

# setup shared ssh socket, put it in background
env = {'SSH_ASKPASS': "./passwd", 'DISPLAY':'', 'SSH_ASKPASS_REQUIRE':'force'}
args = ['ssh', '-f', '-o', 'LogLevel=ERROR', '-x', '-o', 'ConnectTimeout=30', '-o', 'ControlPersist=300', '-o', 'UserKnownHostsFile=/dev/null', '-o', 'StrictHostKeyChecking=no', '-o', 'ServerAliveInterval=15', '-MN', '-S', '/tmp/control-channel', '-p', '22', '-l', ssh_user, 'localhost']
process = subprocess.Popen(args, env=env,
        #stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        #stdin=subprocess.DEVNULL,
        start_new_session=True)
sout, serr = process.communicate()  # hang here!
print(sout)
print(serr)

# use shared socket
args2 = ['ssh', '-o', 'LogLevel=ERROR', '-o', 'ControlPath=/tmp/control-channel', '-p', '22', '-l', ssh_user, 'localhost', 'uname -a']
process2 = subprocess.Popen(args2,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        stdin=subprocess.DEVNULL)
content, _ = process2.communicate()
print(content)

Above example will hang at process.communicate when setup master ssh connection if ssh < 8.4, while it's ok if ssh is newer.

  1. I see it stuck as next:
$ strace python3 test.py
lseek(3, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
fstat(3, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
read(3, 0x1744f00, 8192)                = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=26779, si_uid=1001, si_status=0, si_utime=1, si_stime=0} ---
read(3,
$ sudo lsof -a -c python -d 3
COMMAND   PID     USER   FD   TYPE DEVICE SIZE/OFF    NODE NAME
python3 26778 nxa13855    3r  FIFO   0,12      0t0 9186593 pipe
ls -l "/proc/26778/fd/3"
lr-x------ 1 nxa13855 atg 64 Oct  5 20:22 /proc/26778/fd/3 -> 'pipe:[9186593]'
$ lsof | grep 9186593
python3   26778                   nxa13855    3r     FIFO               0,12      0t0    9186593 pipe
ssh       26783                   nxa13855    2w     FIFO               0,12      0t0    9186593 pipe

This means the ssh -f did not close stderr when put the ssh to background, so communicate can't read a EOF of stderr then hangs as next:

subprocess.py", line 929, in communicate
stderr = self.stderr.read()

After check ssh source tree, I found this commit: openssh/openssh-portable@d14fe25 which fixed from V8.4P1. So, for ssh version <8.4, we shouldn't use communicate when ssh -f put itself to background.

As a result, this PR distinguish ssh version, for >=8.5, we use communicate; while for <8.4 we still back to old ways to use wait (This may block if ssh produce lots of information, but for a master connection which did nothing just a communication setup, there is low chance to have any issue. And sooner or later ssh>8.4 will be the main stream. Otherwise, currently totally block for <8.5 user.

Checklist

  • [✓] Tests for the feature
  • [✓] PR has been tested

Fixes #1265 (for OpenSSH <8.5)

@codecov
Copy link

codecov bot commented Oct 6, 2023

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (d8adccd) 63.0% compared to head (f94bc01) 63.0%.
Report is 2 commits behind head on master.

Additional details and impacted files
@@          Coverage Diff           @@
##           master   #1278   +/-   ##
======================================
  Coverage    63.0%   63.0%           
======================================
  Files         160     160           
  Lines       11865   11875   +10     
======================================
+ Hits         7479    7491   +12     
+ Misses       4386    4384    -2     
Files Coverage Δ
labgrid/driver/sshdriver.py 57.5% <76.9%> (+1.4%) ⬆️

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Bastian-Krause
Copy link
Member

Thanks for the PR! We've noticed this, too.

While the commit openssh/openssh-portable@396d32f sounds a lot like the problem, we're also seeing this on OpenSSH 8.4 (Debian bullseye). I've bisected the problem to openssh/openssh-portable@396d32f. Could you verify that we are in fact seeing the same issue? Should be as simple as running autoreconf -ivf && ./configure && make -j200 in cloned OpenSSH and replacing the "ssh" with the absolute path to the compiled ssh binary.

By the way, the problem is also going away when omitting the stderr=subprocess.STDOUTin Popen().

@atline atline force-pushed the fix_ssh_driver_hang branch from d278576 to 157707e Compare October 6, 2023 11:09
@atline
Copy link
Contributor Author

atline commented Oct 6, 2023

Thanks for the PR! We've noticed this, too.

While the commit openssh/openssh-portable@396d32f sounds a lot like the problem, we're also seeing this on OpenSSH 8.4 (Debian bullseye). I've bisected the problem to openssh/openssh-portable@396d32f. Could you verify that we are in fact seeing the same issue? Should be as simple as running autoreconf -ivf && ./configure && make -j200 in cloned OpenSSH and replacing the "ssh" with the absolute path to the compiled ssh binary.

By the way, the problem is also going away when omitting the stderr=subprocess.STDOUTin Popen().

You are correct, the 8.4 did not fix that issue, it indeed be fixed in 8.5. And yes, remove stderr or stderr=subprocess.DEVNULL could fix it, as subprocess will not read stderr then. But that way, we won't see any error if the ssh command failed before put itself to background, E.g. any configure error in .ssh/config or other error.

@atline atline changed the title sshdriver: fix communicate hang if ssh version < 8.4 sshdriver: fix communicate hang if ssh version < 8.5 Oct 6, 2023
@atline atline force-pushed the fix_ssh_driver_hang branch from 157707e to 7ec4c26 Compare October 10, 2023 05:40
@atline atline requested a review from Bastian-Krause October 10, 2023 06:00
Copy link
Member

@Bastian-Krause Bastian-Krause left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest one small simplification, rest looks good, thanks!

@Bastian-Krause Bastian-Krause self-assigned this Oct 10, 2023
@atline atline force-pushed the fix_ssh_driver_hang branch from 7ec4c26 to f94bc01 Compare October 10, 2023 12:28
@atline atline requested a review from Bastian-Krause October 10, 2023 13:24
@Bastian-Krause Bastian-Krause removed their assignment Oct 10, 2023
@Bastian-Krause
Copy link
Member

Successfully tested with:

  • Debian 11.8 (bullseye): OpenSSH_8.4p1 Debian-5+deb11u2, OpenSSL 1.1.1w 11 Sep 2023
  • Debian 12.2 (bookworm): OpenSSH_9.2p1 Debian-2+deb12u1, OpenSSL 3.0.11 19 Sep 2023

@ejoerns
Copy link
Contributor

ejoerns commented Oct 12, 2023

FTR: I have also modified the version check with #1281 (using @cached_property) since I wasn't aware that this had been done here, too. Maintainers might choose which variant to prefer 😏 I am fine with rebasing my work on this anyway.

@jluebbe
Copy link
Member

jluebbe commented Oct 13, 2023

I think the simpler fix is to revert #1265. It avoids the complexity of having to maintain two different ways to start SSH.

If timeouts still occur after the revert, we should implement #1265 (comment).

@atline
Copy link
Contributor Author

atline commented Oct 14, 2023

I think the simpler fix is to revert #1265. It avoids the complexity of having to maintain two different ways to start SSH.

If timeouts still occur after the revert, we should implement #1265 (comment).

That's ok for me as I never saw issue before that commit, feel free to close this MR after your revert.

Bastian-Krause added a commit to Bastian-Krause/labgrid that referenced this pull request Oct 20, 2023
This reverts commit b8f059f.

OpenSSH < 8.5 called with -f waits for the stderr pipe to be closed
before forking. [1] fixes this behavior.
The labgrid commit to be reverted calls communicate(), relying on a
timely fork. Affected OpenSSH versions now run into a TimeoutExpired
exception on communicate() leading to an ExecutionError being raised in
labgrid.

A wait() with timeout was used since the initial implementation [2].
We wanted to make sure that we don't depend on the state of the pipes, so
use of wait() was intentional, as it directly covers the interesting cases:

- immediate abort (due to a config error or similar)
- normal startup (parent process exits after fork)
- hang (via wait() timeout)

Reverting the problematic commit avoids the complexity of having to
maintain two different ways to start SSH as suggested in [3].

If timeouts still occur after the revert, we should implement this
suggestion [4].

Discussion that lead to this patch can be found in the PR introducing
the problematic commit [5] as well as in [3].

The other commit [6] in the PR [5] has a similar approach for
`labgrid.util.SSHConnection._start_own_master()`. It shouldn't
be problematic though: The ssh process is not expected to fork as it is
not called with -f.

[1] OpenSSH: 396d32f3a ("upstream: There are lots of place where we want to redirect stdin,")
[2] e4862fa ("SSHDriver: ControlMaster & Driver")
[3] labgrid-project#1278
[4] labgrid-project#1265 (comment)
[5] labgrid-project#1265
[6] f9ca024 ("ssh: Prevent timeout from deadlock")

Signed-off-by: Bastian Krause <[email protected]>
rpoisel pushed a commit to honeytreelabs/labgrid that referenced this pull request Oct 24, 2023
This reverts commit b8f059f.

OpenSSH < 8.5 called with -f waits for the stderr pipe to be closed
before forking. [1] fixes this behavior.
The labgrid commit to be reverted calls communicate(), relying on a
timely fork. Affected OpenSSH versions now run into a TimeoutExpired
exception on communicate() leading to an ExecutionError being raised in
labgrid.

A wait() with timeout was used since the initial implementation [2].
We wanted to make sure that we don't depend on the state of the pipes, so
use of wait() was intentional, as it directly covers the interesting cases:

- immediate abort (due to a config error or similar)
- normal startup (parent process exits after fork)
- hang (via wait() timeout)

Reverting the problematic commit avoids the complexity of having to
maintain two different ways to start SSH as suggested in [3].

If timeouts still occur after the revert, we should implement this
suggestion [4].

Discussion that lead to this patch can be found in the PR introducing
the problematic commit [5] as well as in [3].

The other commit [6] in the PR [5] has a similar approach for
`labgrid.util.SSHConnection._start_own_master()`. It shouldn't
be problematic though: The ssh process is not expected to fork as it is
not called with -f.

[1] OpenSSH: 396d32f3a ("upstream: There are lots of place where we want to redirect stdin,")
[2] e4862fa ("SSHDriver: ControlMaster & Driver")
[3] labgrid-project#1278
[4] labgrid-project#1265 (comment)
[5] labgrid-project#1265
[6] f9ca024 ("ssh: Prevent timeout from deadlock")

Signed-off-by: Bastian Krause <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants