deadlocked accounts - jhub removes route but does not delete server? #314

Open
lodott72 opened this issue Nov 15, 2024 · 8 comments

@lodott72

Hi,

So we have a JupyterHub instance (v5.0.0) which uses an HTCondor batch cluster to spawn server jobs.

Occasionally we run into deadlocked user accounts: the account still has a server route in the proxy, but the server job is gone and the hub does not notice. Currently the only solution is to delete the user account completely to get rid of the route.

Today we had another case, and going through the logs I understand that this is caused (in this case) by the hub somehow detecting a server stop and removing the route, but re-adding it shortly thereafter because it is "missing", meaning the server entry is probably still in the db even after the route is removed.

This feels like a bug to me, maybe a timing issue? If routes and servers were removed together, it seems the deadlock would be avoided. Hoping for more insights...

Best
Kruno

Here are the logs (jupyterhub + htcondor), boiled down to the relevant parts, with some comments added by me:

# last poll for the job
[D 2024-11-15 15:26:50.332 JupyterHub batchspawner:314] Spawner querying job: sudo -E -u lodott /var/lib/condor/util/condor_q.sh 1612
# server stop detected by jhub - don't know how/why
# there should have been two more polls here (every 30secs)
[W 2024-11-15 15:28:17.258 JupyterHub base:1290] User lodott server stopped, with exit code: 1
# jhub request route removes from proxy
[I 2024-11-15 15:28:17.258 JupyterHub proxy:356] Removing user lodott from proxy (/user/lodott/)
# user requests are ongoing but eventually run into a Timeout exception
# I think the exception might be responsible for killing the poll here
[E 2024-11-15 15:28:55.991 JupyterHub spawner:1459] Unhandled error in poll callback for <CondorOptionsSpawner object at 0x7fbe07e0bf50>
...
TimeoutError: Repeated api_request to proxy path "/user/lodott" failed.
### lots of client requests with 302 followed by 500, because there is no route
# periodic route check puts the route back because it's "missing" - the server entry in the db was apparently not deleted
[D 2024-11-15 15:31:16.190 JupyterHub proxy:392] Checking routes
...
[W 2024-11-15 15:31:16.208 JupyterHub proxy:416] Adding missing route for /user/lodott/ (Server(url=http://batchj004:40002/user/lodott/, bind_url=http://batchj004:40002/user/lodott/))
### client requests with 302 which are now routed again but server state is unclear
# htcondor actually removes the notebook job due to:
11/15/24 17:39:11 (1612.0) (1799193): Job 1612.0 is being removed: Job removed by SYSTEM_PERIODIC_REMOVE due to Job runtime longer than reserved.
# at this stage the htcondor job is gone but since there is no polling, the jhub never notices - deadlock
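To make that last step concrete, here is a minimal sketch (not JupyterHub's actual code, all names are illustrative) of what the periodic "Checking routes" pass appears to do: as long as a server entry survives in the db, a removed route counts as "missing" and gets restored.

# Minimal sketch of the periodic route check, assuming the hub derives the
# "expected" routes from the server entries still present in its db.
# All names here are illustrative, not JupyterHub's real API.

def check_routes(proxy_routes: dict, db_running_servers: dict) -> None:
    """Re-add a proxy route for every server the hub db still marks as running.

    proxy_routes:       {routespec: target_url} as reported by the proxy
    db_running_servers: {routespec: server_url} derived from the hub db
    """
    for routespec, server_url in db_running_servers.items():
        if routespec not in proxy_routes:
            # The route was removed earlier, but the server row was never
            # deleted, so the route is considered "missing" and restored,
            # even though the batch job is already gone.
            print(f"Adding missing route for {routespec} ({server_url})")
            proxy_routes[routespec] = server_url

if __name__ == "__main__":
    # Reproduces the situation in the log: the route was removed, but
    # /user/lodott/ still has a server entry in the db.
    routes = {}
    db_state = {"/user/lodott/": "http://batchj004:40002/user/lodott/"}
    check_routes(routes, db_state)
    print(routes)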
@manics manics transferred this issue from jupyterhub/jupyterhub Nov 15, 2024
@rcthomas
Contributor

We've observed something similar with a different batchspawner derivative (Slurm), and the fix is to delete the user and restart the hub. We've been unable to reproduce it reliably, so for now we just monitor for it and clear it when we run into it. Sometimes it occurs after issues on the system.
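In case it helps others doing this kind of monitoring, here is a small sketch that lists every user the hub believes has a running default server, so the list can be cross-checked against the batch queue (condor_q / squeue). The hub URL and token are placeholders; it only relies on GET /hub/api/users and the "server" field of the user model.

import requests

HUB_API = "http://127.0.0.1:8081/hub/api"   # adjust to your deployment
API_TOKEN = "REPLACE_WITH_ADMIN_TOKEN"      # an admin-scoped API token

def users_with_running_servers():
    # Ask the hub which users it thinks have a running (default) server.
    resp = requests.get(
        f"{HUB_API}/users",
        headers={"Authorization": f"token {API_TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    return [u["name"] for u in resp.json() if u.get("server")]

if __name__ == "__main__":
    # Any name printed here that has no matching job in the batch system is
    # a candidate for the deadlocked state described in this issue.
    for name in users_with_running_servers():
        print(name)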

Our setup includes a layer of wrapspawner on top, and what we've seen is that when users are in this situation they have an empty child spawner state (no Job ID, for instance), so there is no way to verify the job is running. Tracing through the code I came to much the same conclusion you did: something weird happening during stop().

When you have a user in this state, have you looked at the server table entry in the database for that user to see what it looks like?
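For reference, one quick way to look at that entry (a sketch assuming the default sqlite backend and the usual users / spawners / servers tables, with spawners.server_id pointing at servers.id; column names may differ between JupyterHub versions, so adjust as needed):

import sqlite3

DB_PATH = "jupyterhub.sqlite"   # path to the hub database
USERNAME = "lodott"             # the affected user

with sqlite3.connect(DB_PATH) as conn:
    rows = conn.execute(
        """
        SELECT u.name, sp.name, sp.state, srv.ip, srv.port, srv.base_url
        FROM users u
        JOIN spawners sp ON sp.user_id = u.id
        LEFT JOIN servers srv ON sp.server_id = srv.id
        WHERE u.name = ?
        """,
        (USERNAME,),
    ).fetchall()

for row in rows:
    # A surviving servers row (non-NULL ip/port) while the batch job is gone
    # matches the "route keeps coming back" state described above.
    print(row)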

@lodott72
Author

Looking through the current users I found another user with a route entry in the proxy but no running job in htcondor.

The user's spawner has a job id and a job poll result indicating that the job is running at the node set in the route, which means that the poll cannot still be running (the job was removed about two hours ago).

The spawner also has a server associated with it, with the same node and additionally the port. This is not surprising, since otherwise the periodic route check (every 5 minutes) would have deleted the route.

This is very similar to the other case, except that the ConfigProxy never logged that it actually deleted the route (which I had left out of my log summary above). The route seems to have stayed in the proxy the whole time, but the poll also hit the timeout exception and did not continue. When the htcondor job was removed, the hub did not notice.

@ryanlovett
Contributor

I see something similar periodically; however, restarting the hub is sufficient to work around the issue. I do not need to delete the hub user.

@lodott72
Author

Restarting the hub works here as well, as the route targets are checked for responsiveness before the routes are recreated during startup.

In our case removing the user is simpler, as the user entry is simply recreated on the next login. The user's spawner options are reset to defaults, but no one has complained yet.
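If anyone wants to script that workaround, here is a sketch using the JupyterHub REST API (hub URL and token are placeholders); DELETE /hub/api/users/{name} removes the user record, which is then recreated on the next login as described above.

import requests

HUB_API = "http://127.0.0.1:8081/hub/api"
API_TOKEN = "REPLACE_WITH_ADMIN_TOKEN"   # an admin-scoped API token

def delete_user(name: str) -> None:
    # Remove the user record (and with it the stale route); the user entry
    # is recreated automatically on the next login.
    resp = requests.delete(
        f"{HUB_API}/users/{name}",
        headers={"Authorization": f"token {API_TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()

if __name__ == "__main__":
    delete_user("lodott")   # the affected user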

@lodott72
Author

I might have found something in the code. Searching for the "server stopped" output, I found it in two "async def user_stopped" definitions.

One is in app.py in init_spawners(), where it is added as a callback to the poll once a server has been determined to be running. When the poll task terminates, it deletes the route and then the server.

The other is in handlers/base.py in spawn_single_user, where it is also added as a callback to the poll, after the route for a newly spawned server has been added. This callback removes the route first, but that call can raise an exception, which prevents the server removal that comes afterwards.

This seems to match my log exactly: the TimeoutError appears to be raised in the poll callback, so the server removal is never reached. The route removal seems to fail with a timeout because the route is already gone for some reason?
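To illustrate the ordering problem, here is a minimal sketch (not the real JupyterHub code, all names are illustrative): the callback removes the route first, and if that call raises, the server/db cleanup below it never runs.

import asyncio

class FakeProxy:
    async def delete_user(self, name):
        # Stands in for the failing repeated api_request in the log above.
        raise TimeoutError(
            f'Repeated api_request to proxy path "/user/{name}" failed.'
        )

async def user_stopped(proxy, db_servers, name):
    await proxy.delete_user(name)   # 1) remove route - raises here
    db_servers.pop(name, None)      # 2) remove server entry - never reached

async def main():
    db_servers = {"lodott": "http://batchj004:40002/user/lodott/"}
    try:
        await user_stopped(FakeProxy(), db_servers, "lodott")
    except TimeoutError as e:
        print("poll callback died:", e)
    # The stale server entry survives, so the periodic route check keeps
    # re-adding the route even though the batch job is gone.
    print("server entries left in db:", db_servers)

asyncio.run(main())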

Anyway, if the re-raise is replaced with a log message, this might improve things; the worst case is that the route remains while the server entry is gone. That should be corrected by the periodic route check, I think.

@lodott72
Author

In handlers/base.py I patched it as follows:

    except Exception as e:
        PROXY_DELETE_DURATION_SECONDS.labels(
            status=ProxyDeleteStatus.failure
        ).observe(time.perf_counter() - proxy_deletion_start_time)
        self.log.warning(
            "Route removal encountered an exception for user %s - %s", user.name, e
        )
        # raise

and will observe the behavior now. I will report once I have new info...

@lodott72
Author

Just a short follow-up: since I applied the patch, I have neither encountered another problem of that kind nor has the log message I added been reached.

So it is not conclusive so far; at least the patch "works" in that it causes no other problems...

@lodott72
Author

lodott72 commented Dec 5, 2024

Finally my log message was reached, so I can at least be sure that the code path is hit; I was starting to doubt that.

Unfortunately, the exception occurred during a server restart, when the proxy was already shut down (connection refused), so it is somewhat unrelated to what we are interested in.

But at least there have been no deadlocked accounts since adding the patch, even if that is not the reason.
