In a batch system environment, the hub uses a batchspawner of some sort to start client-labs.
The hub waits for the batchspawner to report that the batch job running the client-lab has started.
After this report has been received, the hub waits for the connection to the lab, and:
The JupyterHub interface for a user will hang if the client (lab) crashes before the connection can be established (after the web page reports "waiting to connect" [the client-lab started]).
In this circumstance the .base keeps waiting for the client-lab until the timeout expires, redirecting the user to the waiting page, even though the client-lab already crashed before the connection could be established.
If the client-lab crashes after the connection is established, the user can just try again.
If the client-lab crashes before the connection is established, the user is stuck waiting until the timeout expires.
Note that this is not the responsibility of the timeout. The timeout can be very long, to allow the batch system to schedule the client-lab.
If the "waiting to connect" event happens and no connection can be made because of a crash, the user has to wait for the global timeout (this global timeout is meant to cover the time until the client-lab starts, which will never happen after a crash).
A second timeout could be used to determine how long to wait for a connection after the client-lab has started.
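The second timeout suggested above could look roughly like the following sketch. It is not batchspawner or JupyterHub code: `wait_for_connection`, `ConnectTimeout`, `connect_timeout`, and the two callbacks are hypothetical names used only for illustration. The idea is that once the job is reported as started, a separate and much shorter deadline applies, and the wait also ends early if the job is no longer alive.

```python
import asyncio


class ConnectTimeout(Exception):
    """Raised when a started job never becomes connectable (hypothetical name)."""


async def wait_for_connection(is_connected, job_alive, connect_timeout,
                              poll_interval=0.01):
    """Wait for the lab to connect after the batch job has started.

    is_connected / job_alive are async callables returning bool; both are
    assumptions standing in for whatever checks a spawner really performs.
    connect_timeout is independent of the (long) scheduling timeout.
    """
    loop = asyncio.get_running_loop()
    deadline = loop.time() + connect_timeout
    while loop.time() < deadline:
        if await is_connected():
            return True
        # Fail fast: a job that has exited will never connect.
        if not await job_alive():
            raise ConnectTimeout(
                "job exited before the connection was established")
        await asyncio.sleep(poll_interval)
    raise ConnectTimeout(
        f"no connection within {connect_timeout} s of job start")
```

With something like this, a crashed job would surface as an error right after the crash is detected instead of after the full scheduling timeout, and the user could be sent back to a page that allows a new start attempt.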
Expected behaviour
If the client-lab crashes after the connection is established, the user can just try again.
If the client-lab server crashes before the connection is established but after waiting has begun, the GUI should not hang. The user's proxy redirect should be removed if the connection cannot be established within a timeout (not c.Spawner.start_timeout but a different timeout).
Actual behaviour
The JupyterHub interface for a user will hang if the client (lab) crashes before the connection can be established (after the web page reports "waiting to connect" [the client-lab started]).
How to reproduce
Run JupyterHub so that it starts JupyterLab instances with a batchspawner.
Force the batch script to exit before the lab starts, using exit 1 / return 1, etc.
Set c.Spawner.start_timeout large enough for the job to be scheduled and started.
Start a lab for a user using the web GUI.
The batch job will eventually start, but the lab will not, because of the crash simulated with exit 1 or return 1.
The GUI will change from waiting to start to waiting to connect, but the connection will never be established.
The user cannot try again to start the lab because of the redirects.
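For reference, the steps above might correspond to a jupyterhub_config.py fragment like the one below. This is only a sketch: the report does not say how the early exit was injected, so using the `req_prologue` option of batchspawner's SlurmSpawner for that is an assumption, and the timeout value is an arbitrary example.

```python
# jupyterhub_config.py (fragment; `c` is provided by JupyterHub when it
# loads this file, so this does not run as a standalone script)

# Start single-user labs through Slurm via batchspawner.
c.JupyterHub.spawner_class = "batchspawner.SlurmSpawner"

# Long timeout so the batch system has time to schedule the job.
# 3600 s is an arbitrary example value, not taken from the report.
c.Spawner.start_timeout = 3600

# Assumption: simulate the crash by exiting in the job prologue, so the
# batch job starts but the lab itself never does.
c.SlurmSpawner.req_prologue = "exit 1"
```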
Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! 🤗
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively.
You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! 👋
Welcome to the Jupyter community! 🎉
minrk
transferred this issue from jupyterhub/jupyterhub
Jan 20, 2022
Thanks for the report! I've migrated the issue to the batchspawner repo, which should be responsible for handling fault tolerance talking to batch systems.
Your personal set up
Slurm
Centos7
Full environment
Configuration