deadlocked accounts - jhub removes route but does not delete server? #314
We've observed something similar with a different batchspawner derivative (Slurm), and the fix is to delete the user and restart the hub. We've been unable to trigger it reliably enough to reproduce it, so for now we just monitor for it and clear it when we run into it; sometimes it occurs after problems on the system. Our setup includes a layer of wrapspawner on top, and what we've seen is that when users are in this situation they have an empty child spawner state (no job ID, for instance), so there is no way to verify that the job is running. Tracing through the code I came to much the same conclusion you did, that something odd is happening somewhere in that path.

When you have a user in this state, have you looked at the server table entry in the database for that user to see what it looks like?
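If it helps, a rough way to peek at that is below; it is only a sketch assuming the default SQLite database (jupyterhub.sqlite) and the users/spawners/servers tables, and column names may differ between JupyterHub versions.

```python
# Rough sketch: inspect the hub database for one user's spawner/server rows.
# Assumes the default SQLite database (jupyterhub.sqlite) and the
# users/spawners/servers tables; column names may vary between versions.
import sqlite3

db = sqlite3.connect("jupyterhub.sqlite")
db.row_factory = sqlite3.Row

username = "someuser"  # placeholder for the affected user

rows = db.execute(
    """
    SELECT u.name, sp.name AS server_name, sp.state, sv.ip, sv.port
    FROM users u
    LEFT JOIN spawners sp ON sp.user_id = u.id
    LEFT JOIN servers  sv ON sv.id = sp.server_id
    WHERE u.name = ?
    """,
    (username,),
).fetchall()

for row in rows:
    print(dict(row))
```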
Looking through the current users I found another user with a route entry in the proxy but no running job in HTCondor. That user's spawner has a job ID and a job poll indicating that the job is running on the node set in the route, which means the poll cannot actually be running anymore (the job was removed about two hours ago). The spawner also has a server associated with it, with the same node and additionally the port. That is not surprising, since the periodic route check (every 5 minutes) would otherwise have deleted the route. This is very similar to the other case, except that the ConfigProxy never logged that it actually deleted the route (which I had left out of my log summary). The route seems to have stayed in the proxy the whole time, but the poll hit the same timeout exception and did not continue, so when the HTCondor job was removed, the hub did not notice.
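For reference, the routes the proxy is actually holding can be listed through the configurable-http-proxy REST API and compared with what is really running in the batch system. A sketch, assuming the default API port 8001 and CONFIGPROXY_AUTH_TOKEN set in the environment:

```python
# Rough sketch: list the routes currently registered in configurable-http-proxy.
# Assumes the default API endpoint on port 8001 and the proxy auth token in
# the CONFIGPROXY_AUTH_TOKEN environment variable.
import json
import os
import urllib.request

api_url = "http://127.0.0.1:8001/api/routes"
token = os.environ["CONFIGPROXY_AUTH_TOKEN"]

req = urllib.request.Request(api_url, headers={"Authorization": f"token {token}"})
with urllib.request.urlopen(req) as resp:
    routes = json.load(resp)

for path, route in routes.items():
    print(path, "->", route.get("target"))
```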
I see something similar periodically; however, restarting the hub is sufficient to work around the issue. I do not need to delete the hub user.
Restarting the hub works here as well, as the route targets will be checked for responsiveness before the routes are recreated during startup. In our case removing the user is simpler, as the user entry is just recreated on next login. The user's spawner options get reset to defaults, but no one has complained yet.
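For what it's worth, deleting the user does not require touching the database directly; here is a sketch using the JupyterHub REST API (the hub URL, token variable, and username are placeholders for our setup):

```python
# Rough sketch: remove a deadlocked user via the JupyterHub REST API.
# The user record (and its stale route/server state) is recreated on next login.
# Assumes the default hub API URL and an admin-scoped API token (placeholders).
import os
import urllib.request

hub_api = "http://127.0.0.1:8081/hub/api"
token = os.environ["JUPYTERHUB_ADMIN_TOKEN"]  # hypothetical env var for an admin token
username = "someuser"                          # the affected account

req = urllib.request.Request(
    f"{hub_api}/users/{username}",
    method="DELETE",
    headers={"Authorization": f"token {token}"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)  # expect 204 No Content on success
```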
I might have found something in the code. Searching for the "server stopped" output, I found it in two "async def user_stopped" definitions. One is in app.py, in init_spawners(), where it is added as a callback to the poll once a server has been determined to be running; it deletes the route and then the server when the poll task terminates. The other is in handlers/base.py, in spawn_single_user, where it is also added as a callback to the poll after the route for a newly spawned server has been added. That callback removes the route first, but the removal can raise an exception, which prevents the server removal that is called afterwards.

This seems to match my log exactly: the TimeoutError exception appears to be thrown in the poll callback, so the server removal is never reached. The route removal seems to fail with a timeout because the route is already gone for some reason? Anyway, if the reraise is replaced with a log message, this might improve things; the worst case is that the route remains while the server entry is gone, and that should be corrected by the periodic route check, I think.
In handlers/base.py I patched as follows:
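Roughly, the idea is to catch the exception from the route removal and log it instead of reraising, so the server removal afterwards still runs. A minimal sketch of that kind of change, assuming a user_stopped callback that removes the route and then stops the server (names are illustrative, not the exact upstream code):

```python
import logging

log = logging.getLogger(__name__)

# Rough sketch of the idea, not the exact handlers/base.py code: a failure
# while removing the proxy route must not skip the server cleanup.
async def user_stopped(proxy, user, server_name):
    status = await user.spawners[server_name].poll()
    log.warning("User %s server stopped with exit code: %s", user.name, status)
    try:
        # Route removal can time out (e.g. if the route is already gone);
        # log instead of reraising so the cleanup below still runs.
        await proxy.delete_user(user, server_name)
    except Exception:
        log.exception(
            "Failed to remove %s from the proxy; the periodic route check "
            "should clean up any leftover route", user.name
        )
    # Always remove the server entry, otherwise the account stays deadlocked
    # with a stale server record in the database.
    await user.stop(server_name)
```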
and will observe the behavior now. I will report once I have new info... |
Just a short follow-up: since I applied the patch, I have neither run into another problem of that kind nor seen my new log output triggered. So it is not conclusive so far; at least the patch "works" in the sense that it causes no other problems...
Finally my log entry was reached, so I can at least be sure that it is in there (I was starting to doubt that). Unfortunately, the exception occurred during a server restart, when the proxy was already shut down (connection refused), so it is somewhat unrelated to what we are interested in. But at least there have been no deadlocked accounts since adding the patch, even if it is not for that reason.
Hi,
so we have a JupyterHub instance (v5.0.0) which uses an HTCondor batch cluster to spawn the server jobs on.
Occasionally we run into deadlocked user accounts which have a server route although the server job is gone and the hub does not notice; currently the only solution is to delete the user account completely to get rid of the route.
Today we had another case, and going through the logs I understand that it is caused (in this instance) by the hub somehow detecting a server stop and removing the route, but re-adding it shortly thereafter because it is "missing", which suggests that the server entry is still in the database even after the route is removed.
This feels like a bug to me, maybe a timing issue? If routes and servers were removed together, it seems like the deadlock would be avoided. Hoping for more insights...
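To spell out what I suspect is happening, here is a simplified sketch of the periodic reconciliation (illustrative names, not the actual JupyterHub code): as long as the server record is still in the database and considered running, the check treats its route as missing and re-adds it.

```python
# Simplified sketch of the suspected mechanism, not the real check_routes
# implementation: routes are re-added for every server the hub still believes
# is running, so a stale server record keeps resurrecting its route.
async def check_routes(proxy, running_servers):
    routes = await proxy.get_all_routes()
    for server in running_servers:       # server records still in the hub DB
        if server.routespec not in routes:
            # The batch job may be long gone, but the server entry was never
            # deleted, so the "missing" route keeps coming back.
            await proxy.add_route(server.routespec, server.target)
```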
Best
Kruno
Here are the logs (JupyterHub + HTCondor), boiled down to the relevant parts, with some comments added by me: