User/client website hangs indefinitely when client-lab crashes after server "waiting to connect" #230

afrankra · 2022-01-20T11:58:34Z

Bug description

This occurs in a batch system environment where the hub uses a batchspawner of any sort to start client-labs.
The hub waits for the batchspawner to report that the batch job containing the client-lab has started.
After this report has been received, the hub waits to connect to the lab, and this is where the problem occurs:
The JupyterHub interface for a user hangs if the client-lab crashes before the connection can be established (that is, after the web page reports "waiting to connect" because the client-lab job has started).
In this circumstance the hub keeps waiting for the client-lab until the timeout expires, redirecting the user to the waiting page, even though the client-lab has already crashed.
If the client-lab crashes after the connection is established, the user can just try again.
If the client-lab crashes before the connection is established, the user is stuck waiting until the timeout expires.
Note that the existing timeout is not meant to cover this case. It can be very long, to give the batch system time to schedule the client-lab.
If the "waiting to connect" event happens and no connection can be made because of a crash, the user has to wait for the global timeout (this global timeout is meant to allow the client-lab to be started, which will never happen after a crash).
A second timeout could be used to determine how long to wait for a connection after the client-lab has started.
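
To illustrate the idea of a second timeout, here is a minimal, self-contained sketch using plain asyncio. It is not JupyterHub or batchspawner code: `wait_for_lab`, `ConnectTimeout`, `started`, `is_reachable` and `connect_timeout` are hypothetical names chosen for this example. The long `start_timeout` covers only the scheduling phase, and a separate, short `connect_timeout` covers the phase after the job has reported that it started.

```python
import asyncio
import time


class ConnectTimeout(Exception):
    """The job started, but its server never became reachable."""


async def wait_for_lab(started, is_reachable, start_timeout, connect_timeout,
                       poll_interval=0.05):
    """Wait for a batch job to start, then for its server to accept connections.

    started:      asyncio.Event that is set when the batch job reports a start.
    is_reachable: async callable returning True once the server answers.
    Both are stand-ins (assumptions) for whatever the real system provides.
    """
    # Phase 1: may be long, because the batch system has to schedule the job.
    await asyncio.wait_for(started.wait(), timeout=start_timeout)

    # Phase 2: should be short, because the process is already supposed to run.
    deadline = time.monotonic() + connect_timeout
    while time.monotonic() < deadline:
        if await is_reachable():
            return
        await asyncio.sleep(poll_interval)
    raise ConnectTimeout(
        f"no connection within {connect_timeout} s after the job started"
    )
```

With this split, a lab that crashes right after the job starts is reported after `connect_timeout` seconds instead of after the full `start_timeout`.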

Expected behaviour

If the client-lab crashes after the connection is established, the user can just try again.
If the client-lab crashes before the connection is established but after the "waiting to connect" state has been reached, the GUI should not hang. The user's proxy redirect should be removed if the connection cannot be established within a timeout (not c.Spawner.start_timeout but a different timeout).
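
One possible way to keep the GUI from hanging is to check, while waiting for the connection, whether the job is still alive and to give up as soon as it is not. The sketch below only illustrates that loop under stated assumptions. The `poll` callable mimics the convention of JupyterHub's `Spawner.poll()` (returns `None` while running, an exit status otherwise). `wait_until_connected`, `LabExited` and `is_reachable` are hypothetical names, and whether the hub can actually obtain this information from a batch job is exactly what this issue is about.

```python
import asyncio
import time


class LabExited(Exception):
    """The job ended before a connection was established."""


async def wait_until_connected(poll, is_reachable, connect_timeout,
                               interval=0.05):
    """Wait for a started lab to accept connections, failing fast on a crash.

    poll:         async callable, None while the job runs, exit status after.
    is_reachable: async callable returning True once the server answers.
    """
    deadline = time.monotonic() + connect_timeout
    while time.monotonic() < deadline:
        status = await poll()
        if status is not None:
            # The job is gone: stop waiting so the user can retry.
            raise LabExited(f"client-lab exited with status {status}")
        if await is_reachable():
            return
        await asyncio.sleep(interval)
    raise asyncio.TimeoutError("client-lab did not accept connections in time")
```

A caller that catches `LabExited` or the timeout could then clear the pending state and the redirect, so that the user is able to start the lab again.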

Actual behaviour

The JupyterHub interface for a user hangs if the client-lab crashes before the connection can be established (that is, after the web page reports "waiting to connect" because the client-lab job has started).

How to reproduce

Run JupyterHub so that it starts JupyterLab instances through a batchspawner.
Force the batch script to exit before the lab starts, using exit 1 / return 1, etc.
Set c.Spawner.start_timeout high enough for the job to be scheduled and started.
Start a lab for a user using the web GUI.
The batch job will eventually start, but the lab will not, because of the crash simulated with 'exit 1' or 'return 1'.
The GUI will change from waiting to start to waiting to connect, but the connection will never be established.
The user cannot try again to start the lab because of the redirects.

Your personal set up

Slurm
Centos7

Full environment

alembic==1.7.5
anyio==3.4.0
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
async-generator==1.10
attrs==21.4.0
Babel==2.9.1
backcall==0.2.0
batchspawner==1.1.0
beautifulsoup4==4.10.0
bleach==4.1.0
bs4==0.0.1
certifi==2021.10.8
certipy==0.1.3
cffi==1.15.0
charset-normalizer==2.0.10
colorama==0.4.4
commonmark==0.9.1
contextvars==2.4
cryptography==36.0.1
dataclasses==0.8
decorator==5.1.0
defusedxml==0.7.1
entrypoints==0.3
greenlet==1.1.2
idna==3.3
immutables==0.16
importlib-metadata==4.8.3
importlib-resources==5.4.0
ipykernel==5.5.6
ipython==7.16.2
ipython-genutils==0.2.0
jedi==0.17.2
Jinja2==3.0.3
json5==0.9.6
jsonschema==3.2.0
jupyter-client==7.1.0
jupyter-core==4.9.1
jupyter-server==1.13.1
jupyter-telemetry==0.1.0
jupyterhub==2.0.1
jupyterhub-moss==1.1.1
jupyterlab==3.2.5
jupyterlab-pygments==0.1.2
jupyterlab-server==2.10.2
Mako==1.1.6
MarkupSafe==2.0.1
mistune==0.8.4
nbclassic==0.3.4
nbclient==0.5.9
nbconvert==6.0.7
nbformat==5.1.3
nest-asyncio==1.5.4
nodeenv==1.6.0
notebook==6.4.6
oauthenticator==14.2.0
oauthlib==3.1.1
packaging==21.3
pamela==1.0.0
pandocfilters==1.5.0
parso==0.7.1
pexpect==4.8.0
pickleshare==0.7.5
pip-search==0.0.10
prometheus-client==0.12.0
prompt-toolkit==3.0.24
ptyprocess==0.7.0
pycparser==2.21
Pygments==2.11.1
pyOpenSSL==21.0.0
pyparsing==3.0.6
pyrsistent==0.18.0
python-dateutil==2.8.2
python-json-logger==2.0.2
pytz==2021.3
pyzmq==22.3.0
requests==2.27.0
rich==10.16.2
ruamel.yaml==0.17.20
ruamel.yaml.clib==0.2.6
Send2Trash==1.8.0
six==1.16.0
sniffio==1.2.0
soupsieve==2.3.1
SQLAlchemy==1.4.29
sudospawner==0.5.2
terminado==0.12.1
testpath==0.5.0
tornado==6.1
traitlets==4.3.3
typing_extensions==4.0.1
urllib3==1.26.7
wcwidth==0.2.5
webencodings==0.5.1
websocket-client==1.2.3
wrapspawner==1.0.0
zipp==3.6.0

Configuration

import batchspawner
import jupyterhub_moss

c.Spawner.start_timeout = 1200
c.JupyterHub.log_level = 'DEBUG' 
c.Spawner.debug = True


welcome · 2022-01-20T11:58:36Z

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! 🤗

If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively.

You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! 👋

Welcome to the Jupyter community! 🎉

minrk · 2022-01-20T12:20:57Z

Thanks for the report! I've migrated the issue to the batchspawner repo, which should be responsible for handling fault tolerance talking to batch systems.

afrankra · 2022-01-20T12:44:03Z

FYI Sadly this issue will not be solvable at the spawner level.

minrk transferred this issue from jupyterhub/jupyterhub Jan 20, 2022
