slow sherlock response leads to port forwarding failure #41
Comments
Hey @mckenziephagen! That is definitely something that can happen with a busy cluster. I don't think we should make it the default, but how about adding a generated param that lets you specify a timeout? So this would come down to:
Let me know what you think! |
I have a password-timeout issue that might be related. Although I can ssh into the Sherlock login node without a password, the jupyter-gpu script asks for a password during port forwarding and fails. Here is the output of
Running PS: I have the following bits in my
|
ping @akkornel I think he might be able to help here - I'm not sure what is changed / different with respect to these configs, and I'm not at Stanford so I can't test for ya! |
OK, now it doesn't ask for a password any more \o/ No idea what happened. Maybe the passwordless login takes some time to take full effect. |
Hello! It looks like there are a number of things happening here. And @gokceneraslan, I think your issue is different from @mckenziephagen's issue. But first, I should open with some general notes. If you want to use Jupyter on Sherlock, I suggest checking out Sherlock OnDemand! This is a web platform that (via the Interactive Apps) lets you start a Jupyter Notebook job on a Sherlock compute node, and connect to it, all without needing to set up an SSH tunnel! If you haven't checked out Sherlock OnDemand, please take a look! If it does not meet your needs, I encourage you to email a support request; even if we can't help you immediately, we do make notes of what people want, and we try to meet those needs once we have the time/people/funding to do so. I'll answer some of the specific comments in separate posts! |
Thanks @akkornel ! 🥳 |
Hi @mckenziephagen!
This message comes from the compute node you're trying to SSH to: If SLURM (the job scheduler) has identified the node for your job, but the job is still being set up, you could get this error.
Do you get this error at the same time as the pam_slurm_adopt error? If yes, that's not a surprise: You're not being allowed to connect to the compute node yet, so things like port forwarding on the compute node will fail. One way to avoid this is to ensure the job is in the RUNNING state before trying to connect to the node. If the job is in a different state, even when a compute node has been identified, you won't be able to connect yet. If you are getting the |
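The "wait until the job is RUNNING" suggestion above could be sketched roughly as follows. This is only an illustration, not code from the forward scripts: the function name wait_for_running and its timeout and interval arguments are hypothetical, and it assumes squeue is available on the machine where the script runs.

```shell
# Sketch: poll SLURM until a job reaches the RUNNING state, or give up.
# wait_for_running and its arguments are hypothetical names, not part of forward.
wait_for_running() {
    local job_id="$1"
    local timeout="${2:-300}"   # maximum seconds to wait (assumed default)
    local interval="${3:-5}"    # seconds between polls (assumed default)
    local waited=0
    local state=""

    while [ "$waited" -le "$timeout" ]; do
        # -h drops the header line; -o %T prints the long-form job state
        state="$(squeue -h -j "$job_id" -o %T 2>/dev/null)"
        if [ "$state" = "RUNNING" ]; then
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done

    echo "Job $job_id not RUNNING after ${timeout}s (last state: ${state:-unknown})" >&2
    return 1
}
```

A script could call something like wait_for_running "$JOB_ID" 120 5 before attempting to connect to the compute node, where JOB_ID is a placeholder for however the job ID is obtained.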
Thanks for the super fast response @akkornel ! Yes, I read about Sherlock OnDemand, and it's awesome. But I need jupyterlab, not jupyter/jupyterhub. That's the only reason for my suffering, otherwise I would have simply used it. It would be AMAZING to have this checkbox (https://discourse.openondemand.org/t/jupyterlab-installed/594/2) in Sherlock OnDemand too btw! Also thanks for the suggestions. The error went away without me really doing anything. Maybe passwordless login thing in |
I feel your pain @gokceneraslan - it's too common that we use a word like "suffering" when we talk about HPC. 😆 |
It doesn't look like your username is set correctly. Notice how it says
Huh. OK! I don't know enough about forward to know what might have been changed, or how it interacts with other things, but I'm glad that it works now.
That's good to know, but I need to reiterate something I said in a previous comment: Please please please email in a support request asking for this! I don't work on or support Sherlock, so telling me your need in this GitHub issue won't get your request to the right people in the right way, and I don't want you to feel like your needs are not being recognized.
Think of it like a funding grant: There are funds available, but the only people who get them are those who ask for them through the correct pathway. In this case, instead of monetary funds, the resource is the time of the Sherlock sysadmins. The Sherlock sysadmins have asked that people submit their requests via email as described at https://www.sherlock.stanford.edu/docs/#support, and lots of requests do come in! (I hope it's an easier pathway than most grant application websites…) Unless requests come in for Jupyter Lab support, the time is going to be spent on things that people did ask for via email (like an updated CP2K, or an updated R Studio).
So, I do think it's worth it to send in a request for Jupyter Lab support. When you do, it would help to explain what Jupyter Lab provides you over the existing Jupyter Hub. And the OSC URL you had would also be good to include. Thanks! (BTW, I'm not just writing this for you, I'm writing it for others who might come across this ticket in the future. Hello future people!)
@vsoch, I think at this point, we should wait to hear back from @mckenziephagen, as I want to make sure McKenzie has the opportunity to reply back and say if there are any more issues. Then, I think this is OK to close! |
Thanks so much @akkornel ! I emailed the Sherlock team and added the link to OSC too (SRCC #58605). Let's see, I hope they give it a try. PS: Re username: I replaced my real username with |
In the past day or so, a couple of colleagues and I have been having a difficult time connecting to Sherlock using forward. When the script gets to setup_port_forwarding, it fails with this error:
mux_client_forward: forwarding request failed: Port forwarding failed
muxclient: master forward request failed
or this error:
Access denied by pam_slurm_adopt: you have no active jobs on this node
Authentication failed.
I believe this is because Sherlock seems to be responding slowly, meaning that the port isn't ready when the script gets to that line. Adding
sleep 30
to the start.sh script right before setup_port_forwarding seems to have fixed it.
Should this be part of the main script, or could it be added to the Debugging part of the README? It's not always necessary, just when Sherlock is "acting up", and it does add time to the setup process, which isn't ideal when Sherlock isn't lagging.
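As one possible alternative to a fixed sleep 30, the wait could be made conditional by retrying until a timeout expires, so no time is lost when Sherlock responds quickly. The sketch below is an assumption-laden illustration rather than the project's actual code: retry_with_timeout is a hypothetical helper, and it only helps if setup_port_forwarding returns a non-zero exit status on failure, which is not confirmed here.

```shell
# Sketch: retry a command until it succeeds or a timeout elapses.
# retry_with_timeout is a hypothetical helper, not part of the forward scripts.
retry_with_timeout() {
    local timeout="$1"    # total seconds to keep trying
    local interval="$2"   # seconds to sleep between attempts
    shift 2
    local waited=0

    # Run the remaining arguments as a command until it exits with status 0
    until "$@"; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "Giving up after ${waited}s: $*" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# Hypothetical usage in start.sh, in place of a fixed "sleep 30":
#   retry_with_timeout 60 5 setup_port_forwarding
```

With this shape, the timeout could be exposed as an optional parameter, so that the default behavior stays unchanged for users who never hit the delay.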