slow sherlock response leads to port forwarding failure #41

mckenziephagen · 2022-03-24T18:09:25Z

In the past day or so, me and a couple colleagues have been having a difficult time connecting to Sherlock using forward. When the script gets to setup_port_forwarding, it fails with either this error:

mux_client_forward: forwarding request failed: Port forwarding failed
muxclient: master forward request failed

or this error:

Access denied by pam_slurm_adopt: you have no active jobs on this node
Authentication failed.

I believe this is because Sherlock seems to be connecting slowly, meaning that the port isn't ready when the script gets to that line. Adding sleep 30 to the start.sh script right before setup_port_forwarding seems to have fixed it.

Should this be a part of the main script, or could it be added to the Debugging part of the README? It's not always necessary, just when Sherlock is "acting up", and it does add time to the setup process, which isn't ideal when Sherlock isn't lagging.

vsoch · 2022-03-24T18:20:25Z

Hey @mckenziephagen! That is definitely something that can happen with a busy cluster. I don't think we should make it the default, but how about adding a generated param that lets you specify a wait time? So this would come down to:

Add a question to the setup script to say something like "How many seconds would you like to wait for a connection? (Ideal for slower clusters; defaults to 0 seconds for no wait.)"
The default (as stated above) would be 0, but you would choose 30, for example. This would write a variable (something like CONNECTION_WAIT_SECONDS) to params.sh.
Then in start.sh and start-node.sh you could add a sleep using this variable, as you already did!
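
For anyone following along, here is a minimal sketch of what those steps could look like. It assumes the variable name CONNECTION_WAIT_SECONDS suggested above; the helper name wait_for_connection is made up for illustration and is not part of the current scripts.

```shell
# Hypothetical sketch: CONNECTION_WAIT_SECONDS follows the proposal above and
# would normally be written to params.sh by the setup script (0 = no wait).
CONNECTION_WAIT_SECONDS="${CONNECTION_WAIT_SECONDS:-0}"

# Pause before port forwarding only when a positive wait was configured.
wait_for_connection() {
    local seconds="${1:-0}"
    # Reject anything that is not a non-negative integer.
    case "$seconds" in
        ''|*[!0-9]*) echo "invalid wait: $seconds" >&2; return 1 ;;
    esac
    if [ "$seconds" -gt 0 ]; then
        echo "== Waiting ${seconds}s for the cluster to respond =="
        sleep "$seconds"
    fi
}

# In start.sh this would sit right before the existing forwarding step:
# wait_for_connection "$CONNECTION_WAIT_SECONDS"
# setup_port_forwarding
```

With the default of 0 the function returns immediately, so users on a responsive cluster would not pay any extra setup time.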

Let me know what you think!

gokceneraslan · 2022-05-23T19:49:52Z

I have a password-timeout issue that might be related. Although I can ssh to the Sherlock login node without a password just fine, the jupyter-gpu script asks for a password during port forwarding and fails.

Here is the output of bash start.sh jupyter-gpu:

== Waiting for job to start, using exponential backoff ==
Attempt 0: not ready yet... retrying in 1..
Attempt 1: not ready yet... retrying in 2..
Attempt 2: not ready yet... retrying in 4..
Attempt 3: not ready yet... retrying in 8..
Attempt 4: not ready yet... retrying in 16..
Attempt 5: not ready yet... retrying in 32..
Attempt 6: resources allocated to xxx!..
xxx
xxx
notebook running on xxx

== Setting up port forwarding ==
ssh -L 55555:localhost:55555 sherlock ssh -L 55555:localhost:55555 -N xxx &
❯ Permission denied, please try again.
[email protected]'s password:
Permission denied, please try again.
[email protected]'s password:
[email protected]: Permission denied (gssapi-with-mic,password).

== Connecting to notebook ==
mux_client_forward: forwarding request failed: Port forwarding failed
muxclient: master forward request failed
[email protected]'s password: -------------------------------------------------------------------------------
The following dependent module(s) are not currently loaded: cuda/10.1 (required by: cudnn/7.6.4)
-------------------------------------------------------------------------------

The following have been reloaded with a version change:
  1) cuda/10.1.168 => cuda/11.5.0

[I 2022-05-23 12:39:54.612 ServerApp] jupyterlab | extension was successfully linked.
[W 2022-05-23 12:39:54.618 NotebookApp] 'password' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[I 2022-05-23 12:39:54.626 ServerApp] nbclassic | extension was successfully linked.
[I 2022-05-23 12:39:55.575 ServerApp] notebook_shim | extension was successfully linked.
[I 2022-05-23 12:39:55.708 ServerApp] notebook_shim | extension was successfully loaded.
[I 2022-05-23 12:39:55.710 LabApp] JupyterLab extension loaded from /home/users/user/.miniconda/lib/python3.9/site-packages/jupyterlab
[I 2022-05-23 12:39:55.710 LabApp] JupyterLab application directory is /home/users/user/.miniconda/share/jupyter/lab
[I 2022-05-23 12:39:55.719 ServerApp] jupyterlab | extension was successfully loaded.
[I 2022-05-23 12:39:55.741 ServerApp] nbclassic | extension was successfully loaded.
[I 2022-05-23 12:39:55.745 ServerApp] Serving notebooks from local directory: /scratch/users/user
[I 2022-05-23 12:39:55.745 ServerApp] Jupyter Server 1.17.0 is running at:
[I 2022-05-23 12:39:55.745 ServerApp] http://localhost:55555/lab
[I 2022-05-23 12:39:55.746 ServerApp]  or http://127.0.0.1:55555/lab
[I 2022-05-23 12:39:55.746 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).


== View logs in separate terminal ==
ssh sherlock cat /home/users/user/forward-util/jupyter-gpu.sbatch.out
ssh sherlock cat /home/users/user/forward-util/jupyter-gpu.sbatch.err

== Instructions ==
1. Password, output, and error printed to this terminal? Look at logs (see instruction above)
2. Browser: http://xxx:55555/ -> http://localhost:55555/...
3. To end session: bash end.sh jupyter-gpu

Running ssh -L 55555:localhost:55555 sherlock ssh -L 55555:localhost:55555 -N xxx manually asks for password too. Any ideas?

PS: I have the following bits in my .ssh/config:

Host sherlock
    User user
    Hostname login.sherlock.stanford.edu
    GSSAPIDelegateCredentials yes
    GSSAPIAuthentication yes
    ControlMaster auto
    ControlPersist yes
    ControlPath ~/.ssh/%C

vsoch · 2022-05-23T19:53:09Z

ping @akkornel I think he might be able to help here - I'm not sure what is changed / different with respect to these configs, and I'm not at Stanford so I can't test for ya!

gokceneraslan · 2022-05-23T20:08:59Z

OK, now it doesn't ask for a password any more \o/ No idea what happened. Maybe the passwordless login setup takes some time to take full effect.

akkornel · 2022-05-23T20:11:03Z

Hello! It looks like there are a number of things happening here. And @gokceneraslan, I think your issue is different from @mckenziephagen's issue. But first, I should open with some general notes.

If you want to use Jupyter on Sherlock, I suggest checking out Sherlock OnDemand! This is a web platform that (via the Interactive Apps) lets you start a Jupyter Notebook job on a Sherlock compute node, and connect to it, all without needing to set up an SSH tunnel!

If you haven't checked out Sherlock OnDemand, please take a look! If it does not meet your needs, I encourage you to email a support request; even if we can't help you immediately, we do make notes of what people want, and we try to meet those needs once we have the time/people/funding to do so.

I'll answer some of the specific comments in separate posts!

vsoch · 2022-05-23T20:13:24Z

Thanks @akkornel ! 🥳

akkornel · 2022-05-23T20:18:41Z

Hi @mckenziephagen!

> In the past day or so, me and a couple colleagues have been having a difficult time connecting to Sherlock using forward. when the script gets to the setup_port_forwarding, it fails with this error
>
> Access denied by pam_slurm_adopt: you have no active jobs on this node Authentication failed.

This message comes from the compute node you're trying to SSH to: If SLURM (the job scheduler) has identified the node for your job, but the job is still being set up, you could get this error.

> mux_client_forward: forwarding request failed: Port forwarding failed muxclient: master forward request failed

Do you get this error at the same time as the pam_slurm_adopt error? If yes, that's not a surprise: You're not being allowed to connect to the compute node yet, so things like port forwarding on the compute node will fail.

One way to avoid this is to ensure the job is in the RUNNING state before trying to connect to the node. If the job is in a different state, even when a compute node has been identified, you won't be able to connect yet.
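
A rough sketch of that check, assuming it runs somewhere squeue is available (for example on a login node; from a laptop the call would need to be wrapped in ssh). The function names job_state and wait_until_running are hypothetical, and the doubling delay simply mirrors the exponential backoff the script already prints.

```shell
# Sketch: poll SLURM until the job reports RUNNING, doubling the delay
# between attempts. Function names here are illustrative only.

# Print the job's state (e.g. PENDING, CONFIGURING, RUNNING) via squeue.
job_state() {
    squeue -h -j "$1" -o "%T" 2>/dev/null
}

# Usage: wait_until_running JOB_ID [MAX_ATTEMPTS] [FIRST_DELAY_SECONDS]
wait_until_running() {
    local job_id="$1"
    local max_attempts="${2:-8}"
    local delay="${3:-1}"
    local attempt=0
    local state
    while [ "$attempt" -lt "$max_attempts" ]; do
        state="$(job_state "$job_id")"
        if [ "$state" = "RUNNING" ]; then
            echo "Attempt ${attempt}: job ${job_id} is RUNNING"
            return 0
        fi
        echo "Attempt ${attempt}: state is ${state:-unknown}, retrying in ${delay}s"
        sleep "$delay"
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
    echo "job ${job_id} did not reach RUNNING" >&2
    return 1
}
```

Only once wait_until_running succeeds would the script go on to open the SSH connection to the compute node, which should avoid the pam_slurm_adopt rejection described above.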

If you are getting the mux_client_forward error without the pam_slurm_adopt error, that's an issue worth reporting!

gokceneraslan · 2022-05-23T20:42:49Z

Thanks for the super fast response @akkornel ! Yes, I read about Sherlock OnDemand, and it's awesome. But I need jupyterlab, not jupyter/jupyterhub. That's the only reason for my suffering, otherwise I would have simply used it. It would be AMAZING to have this checkbox (https://discourse.openondemand.org/t/jupyterlab-installed/594/2) in Sherlock OnDemand too btw!

Also, thanks for the suggestions. The error went away without me really doing anything. Maybe the passwordless login setup in .ssh/config took some time to fully work.

vsoch · 2022-05-23T20:53:54Z

I feel your pain @gokceneraslan - it's too common that we use a word like "suffering" when we talk about HPC. 😆

akkornel · 2022-05-23T22:54:54Z

> ssh -L 55555:localhost:55555 sherlock ssh -L 55555:localhost:55555 -N xxx &
> ❯ Permission denied, please try again.
> [email protected]'s password:
> Permission denied, please try again.
> [email protected]'s password:
> [email protected]: Permission denied (gssapi-with-mic,password).

It doesn't look like your username is set correctly. Notice how it says [email protected]; the string user should not be there. Instead, it should be using your SUNetID. So, there's probably a configuration issue, in your script or elsewhere, where the string user needs to be replaced with your SUNetID.
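
One way to check which username SSH will actually use is to ask ssh for its resolved configuration. This is only a sketch: it assumes a Host alias named sherlock in ~/.ssh/config, and effective_ssh_setting is a made-up helper name.

```shell
# Sketch: show the value ssh resolves for one setting of a Host alias.
# effective_ssh_setting is a hypothetical helper, not part of forward.
effective_ssh_setting() {
    local host="$1"
    local key="$2"
    # `ssh -G` prints the resolved configuration without connecting;
    # keys are printed in lowercase, one "key value" pair per line.
    ssh -G "$host" | awk -v key="$key" '$1 == key { print $2; exit }'
}

# Example calls (output depends on your own ~/.ssh/config):
# effective_ssh_setting sherlock user
# effective_ssh_setting sherlock hostname
```

If the first call prints something other than your SUNetID, the User line in the matching Host block is the place to look.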

> OK, now it doesn't ask for a password any more \o/ No idea what happened. Maybe passwordless login thing takes some time to take full action.

Huh. OK! I don't know enough about forward to know what might have been changed, or how it interacts with other things, but I'm glad that it works now.

> But I need jupyterlab, not jupyter/jupyterhub. That's the only reason of my suffering, otherwise I would have simply used it. It would be AMAZING to have this checkbox (https://discourse.openondemand.org/t/jupyterlab-installed/594/2) in Sherlock OnDemand too btw!

That's good to know, but I need to reiterate something I said in a previous comment: Please please please email in a support request asking for this! I don't work on or support Sherlock, so telling me your need in this GitHub issue won't get your request to the right people in the right way, and I don't want you to feel like your needs are not being recognized.

Think of it like a funding grant: There are funds available, but the only people who get them are those who ask for them through the correct pathway. In this case, instead of monetary funds, the resource is the time of the Sherlock sysadmins. The Sherlock sysadmins have asked that people submit their requests via email as described at https://www.sherlock.stanford.edu/docs/#support, and lots of requests do come in! (I hope it's an easier pathway than most grant application websites…) Unless requests come in for Jupyter Lab support, the time is going to be spent on things that people did ask for via email (like an updated CP2K, or an updated R Studio).

So, I do think it's worth it to send in a request for Jupyter Lab support. When you do, it would help to explain what Jupyter Lab provides you over the existing Jupyter Hub. And the OSC URL you had would also be good to include. Thanks!

(BTW, I'm not just writing this for you, I'm writing it for others who might come across this ticket in the future. Hello future people!)

@vsoch, I think at this point, we should wait to hear back from @mckenziephagen, as I want to make sure McKenzie has the opportunity to reply back and say if there are any more issues. Then, I think this is OK to close!

gokceneraslan · 2022-05-27T19:52:58Z

Thanks so much @akkornel ! I emailed the Sherlock team and added the OSC link too (SRCC #58605). Let's see, I hope they give it a try.

PS: Re username: I replaced my real username with user intentionally :)

mckenziephagen mentioned this issue Mar 24, 2022

add wait time for slow cluster response #42

Merged
