Description
We are deploying jupyter-ai on a JupyterHub server at UC Berkeley that supports 120 students. Our datahub team is consistently seeing unexpected behavior that appears to be linked to how jupyter-ai handles proxy ports: available ports are slowly exhausted until the entire hub crashes.
I'll try to provide as much detail as I can, but the issue is not straightforward to reproduce. I'd be happy to follow up offline or connect you to the datahub team at Berkeley more directly, as the detailed logs come from student activity that we can't post here.
The datahub team describes the issue as follows:
From the logs, we observed that after a user server is culled (i.e., the route /user/ is removed by both the proxy and the hub), the proxy continues to attempt connections to endpoints like /user//api/ai and /user//api/collaboration/room. This behavior can be clearly seen in the logs I provided.
Because these routes are stale, the proxy repeatedly tries and fails to reach them, opening new ephemeral ports with each attempt. Over time, this leads to exhaustion of available ephemeral ports. This issue appears to be isolated to this particular hub; we have not seen similar behavior in other hubs.
When the proxy runs out of ephemeral ports, it cannot establish any new connections — which directly caused the outages we've observed over the past week. The crashes experienced by students during your class are a side effect of this issue. With thousands of ephemeral ports in use, the proxy consumes a significant amount of memory and eventually reaches a point where it can no longer handle the connection load.
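For anyone trying to confirm the same pattern on their own hub, one way to watch ephemeral ports accumulate is to count sockets by TCP state on the proxy host. The following is a minimal sketch, not part of jupyter-ai or the proxy: it assumes a Linux host where the proxy's sockets show up in `/proc/net/tcp`, and it assumes the default ephemeral range of 32768-60999 (check `net.ipv4.ip_local_port_range` on the actual machine). The function name `count_ephemeral_sockets` is hypothetical.

```python
from collections import Counter

# TCP state codes as they appear in the "st" column of /proc/net/tcp (Linux).
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

# Assumption: default Linux ephemeral port range; the real range is set by
# the net.ipv4.ip_local_port_range sysctl and may differ on your host.
EPHEMERAL_RANGE = (32768, 60999)


def count_ephemeral_sockets(proc_net_tcp_text, port_range=EPHEMERAL_RANGE):
    """Count sockets whose local port is in the ephemeral range, by TCP state.

    `proc_net_tcp_text` is the raw text of /proc/net/tcp.
    """
    counts = Counter()
    low, high = port_range
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        # local_address looks like "0100007F:9C40" (hex IP : hex port)
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if low <= local_port <= high:
            state = TCP_STATES.get(fields[3].upper(), fields[3])
            counts[state] += 1
    return counts
```

Reading `/proc/net/tcp` (and `/proc/net/tcp6` for IPv6 sockets) at intervals and passing the text to this function should show whether the total keeps climbing after user servers are culled. Note that on a Kubernetes deployment the file has to be read inside the proxy pod's network namespace.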
Clearly this is a significant challenge for deploying jupyter-ai; it's fine for us to disable it for the time being, but I wanted to share the report. I know there are probably further specific details you might need to help track this down, so please let me know if there's anything our team can do to help you (and us) figure out what is going wrong here. Appreciate all you do.
cc @balajialg (UC Berkeley DataHub)