User/client website hangs indefinitely when client-lab crashes after server "waiting to connect" #230

Open
afrankra opened this issue Jan 20, 2022 · 3 comments


Bug description

This happens in a batch system environment where the hub uses a batchspawner of any sort to start client-labs.
The hub waits for the batchspawner to report that the batch job containing the client-lab has started.
After this report has been received, the lab still has to connect, and:
The JupyterHub interface for a user will hang if the client-lab crashes before the connection can be established (after the web page reports "waiting to connect", i.e. after the client-lab job has started).
In this circumstance the hub (.base) keeps waiting for the client-lab until the timeout expires and redirects the user to the waiting page, even though the client-lab has already crashed before the connection could be established.
If the client-lab crashes after the connection is established, the user can just try again.
If the client-lab crashes before the connection is established, the user is stuck waiting until the timeout expires.
Note that handling this case is not the job of the existing timeout. That timeout may be very long to give the batch system time to schedule the client-lab.
If the "waiting to connect" event happens and no connection can be made because of a crash, the user has to wait out the global timeout (this global timeout is meant to allow time for the client-lab to be started, which will never happen after a crash).
A second timeout could be used to determine how long to wait for a connection after the client-lab has started.
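The two-timeout idea can be illustrated with a small, self-contained asyncio sketch. This is not JupyterHub or batchspawner code: the function names, the `SpawnFailed` exception, and the two `asyncio.Event` objects are hypothetical stand-ins for "batch job is running" and "lab has connected". It only shows how a short, separate connect timeout would let the hub give up soon after a crash instead of waiting out the long start timeout.

```python
import asyncio


class SpawnFailed(Exception):
    """Raised when a spawn phase does not finish within its timeout."""


async def wait_for_lab(job_started, lab_connected, start_timeout, connect_timeout):
    """Wait for the batch job, then for the lab, each with its own timeout.

    job_started and lab_connected are asyncio.Event objects (hypothetical
    stand-ins for "batch job is running" and "lab is reachable").
    """
    try:
        # Phase 1: long timeout, covers queueing time in the batch system.
        await asyncio.wait_for(job_started.wait(), timeout=start_timeout)
    except asyncio.TimeoutError:
        raise SpawnFailed("batch job did not start in time")
    try:
        # Phase 2: short timeout, covers only lab startup after the job runs.
        await asyncio.wait_for(lab_connected.wait(), timeout=connect_timeout)
    except asyncio.TimeoutError:
        raise SpawnFailed("lab did not connect after the job started")
    return "connected"


async def simulate_crash_before_connect():
    """The job starts, then the lab crashes and never connects."""
    job_started = asyncio.Event()
    lab_connected = asyncio.Event()
    job_started.set()  # the batch job has been scheduled and is running
    try:
        return await wait_for_lab(
            job_started, lab_connected, start_timeout=1200, connect_timeout=0.1
        )
    except SpawnFailed as exc:
        # Here the pending spawn could be cleared so the user can try again.
        return f"failed: {exc}"


result = asyncio.run(simulate_crash_before_connect())
print(result)
```

In this simulation the failure is reported after the 0.1 second connect timeout rather than after the 1200 second start timeout, which is the behaviour the report asks for.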

Expected behaviour

If the client-lab crashes after the connection is established, the user can just try again.
If the client-lab server crashes before the connection is established but after the waiting phase, the GUI should not hang. The user's proxy redirect should be removed if the connection cannot be established within a timeout (not c.Spawner.start_timeout but a separate timeout).

Actual behaviour

The JupyterHub interface for a user will hang if the client-lab crashes before the connection can be established (after the web page reports "waiting to connect", i.e. after the client-lab job has started).

How to reproduce

Run JupyterHub so that it starts JupyterLab instances with a batchspawner.
Force the batch script to exit before the lab starts, using exit 1, return 1, etc.
Set c.Spawner.start_timeout large enough for the job to be scheduled and started.
Start a lab for a user using the web GUI.
The batch job will eventually start, but the lab will not, due to the crash simulated with exit 1 or return 1.
The GUI will change from waiting to start to waiting to connect, but the connection will never happen.
The user cannot try again to start the lab due to the redirects.
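
One possible way to set up the reproduction is sketched below as a jupyterhub_config.py fragment. It is an assumption, not the reporter's actual configuration: the report used jupyterhub_moss, whereas this sketch uses batchspawner's SlurmSpawner directly and relies on its req_prologue option to make the batch script exit before the lab is launched.

```python
# jupyterhub_config.py (fragment; `c` is provided by JupyterHub when it loads this file)
c.JupyterHub.spawner_class = "batchspawner.SlurmSpawner"

# Long enough for the batch system to schedule and start the job.
c.Spawner.start_timeout = 1200

# Assumption: the prologue runs inside the batch script before the lab is
# launched, so exiting here simulates a crash before the connection.
c.SlurmSpawner.req_prologue = "exit 1"
```

Editing the batch script template itself to insert exit 1 before the lab command should have the same effect.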

Your personal set up

Slurm
CentOS 7

Full environment
alembic==1.7.5
anyio==3.4.0
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
async-generator==1.10
attrs==21.4.0
Babel==2.9.1
backcall==0.2.0
batchspawner==1.1.0
beautifulsoup4==4.10.0
bleach==4.1.0
bs4==0.0.1
certifi==2021.10.8
certipy==0.1.3
cffi==1.15.0
charset-normalizer==2.0.10
colorama==0.4.4
commonmark==0.9.1
contextvars==2.4
cryptography==36.0.1
dataclasses==0.8
decorator==5.1.0
defusedxml==0.7.1
entrypoints==0.3
greenlet==1.1.2
idna==3.3
immutables==0.16
importlib-metadata==4.8.3
importlib-resources==5.4.0
ipykernel==5.5.6
ipython==7.16.2
ipython-genutils==0.2.0
jedi==0.17.2
Jinja2==3.0.3
json5==0.9.6
jsonschema==3.2.0
jupyter-client==7.1.0
jupyter-core==4.9.1
jupyter-server==1.13.1
jupyter-telemetry==0.1.0
jupyterhub==2.0.1
jupyterhub-moss==1.1.1
jupyterlab==3.2.5
jupyterlab-pygments==0.1.2
jupyterlab-server==2.10.2
Mako==1.1.6
MarkupSafe==2.0.1
mistune==0.8.4
nbclassic==0.3.4
nbclient==0.5.9
nbconvert==6.0.7
nbformat==5.1.3
nest-asyncio==1.5.4
nodeenv==1.6.0
notebook==6.4.6
oauthenticator==14.2.0
oauthlib==3.1.1
packaging==21.3
pamela==1.0.0
pandocfilters==1.5.0
parso==0.7.1
pexpect==4.8.0
pickleshare==0.7.5
pip-search==0.0.10
prometheus-client==0.12.0
prompt-toolkit==3.0.24
ptyprocess==0.7.0
pycparser==2.21
Pygments==2.11.1
pyOpenSSL==21.0.0
pyparsing==3.0.6
pyrsistent==0.18.0
python-dateutil==2.8.2
python-json-logger==2.0.2
pytz==2021.3
pyzmq==22.3.0
requests==2.27.0
rich==10.16.2
ruamel.yaml==0.17.20
ruamel.yaml.clib==0.2.6
Send2Trash==1.8.0
six==1.16.0
sniffio==1.2.0
soupsieve==2.3.1
SQLAlchemy==1.4.29
sudospawner==0.5.2
terminado==0.12.1
testpath==0.5.0
tornado==6.1
traitlets==4.3.3
typing_extensions==4.0.1
urllib3==1.26.7
wcwidth==0.2.5
webencodings==0.5.1
websocket-client==1.2.3
wrapspawner==1.0.0
zipp==3.6.0
Configuration
import batchspawner
import jupyterhub_moss

c.Spawner.start_timeout = 1200
c.JupyterHub.log_level = 'DEBUG' 
c.Spawner.debug = True

welcome bot commented Jan 20, 2022

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! 🤗

If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively.
You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! 👋

Welcome to the Jupyter community! 🎉

@minrk minrk transferred this issue from jupyterhub/jupyterhub Jan 20, 2022
Member

minrk commented Jan 20, 2022

Thanks for the report! I've migrated the issue to the batchspawner repo, which should be responsible for handling fault tolerance talking to batch systems.

@afrankra
Author

FYI Sadly this issue will not be solvable at the spawner level.
