
Large number of pending pods for GESIS server #2995

Closed
rgaiacs opened this issue Jun 5, 2024 · 11 comments

rgaiacs commented Jun 5, 2024

This started around 8:00 am CEST on 5 June 2024.

OVH


Screenshot 2024-06-05 at 09-31-42 Pod Activity - Dashboards - Grafana

GESIS


Screenshot 2024-06-05 at 09-31-11 Pod Activity - Dashboards - Grafana

kubectl get -n gesis pods --sort-by='{.metadata.creationTimestamp}' | grep Terminating produces

jupyter-deepchecks-2ddeepchecks-2dgeaqun29                       0/1     Terminating       0               38m
jupyter-psuastro528-2dtltdemo-2dstart-2dr2ww36mg                 0/1     Terminating       0               37m
jupyter-ohbm-2dcomcom-2d202-2dommunity-2dsurvey-2d9l4zs8t3       0/1     Terminating       0               37m
jupyter-intermine-2dinter-2d-2dws-2dpython-2ddocs-2dsngwm720     0/1     Terminating       0               37m
jupyter-deltares-2dra2ce-2df4pmreer                              0/1     Terminating       0               37m
jupyter-china-2dluoyaxiong-2dkeshihua50-2de8ah20zg               0/1     Terminating       0               36m
jupyter-py4basics-2dpy4basics-2egithub-2eio-2d16sqkroj           0/1     Terminating       0               36m
jupyter-wjbmattingly-2dpython-5ffor-5fdh-2d7yufgbtt              0/1     Terminating       0               34m
jupyter-defeo-2dcimpa-2drabat-2dp85f236l                         0/1     Terminating       0               34m
jupyter-deltares-2dra2ce-2de71ih71s                              0/1     Terminating       0               34m
jupyter-codingalzi-2dalgopy-2dke3xmcz4                           0/1     Terminating       0               34m
jupyter-ianhi-2dmpl-2dinteractions-2dktn2iy4l                    0/1     Terminating       0               32m
jupyter-apache-2dsedona-2di24qd44h                               0/1     Terminating       0               31m
jupyter-sbrg-2dssbio-2dhvfq3bjw                                  0/1     Terminating       0               30m
jupyter-jrzaurin-2dinfinitoml-2dq26uboq7                         0/1     Terminating       0               29m
jupyter-guiwitz-2dyrd2024-2da9qhs327                             0/1     Terminating       0               29m
jupyter-tadpoleloop-2didle-5fapocalypse-2dzr81a9ag               0/1     Terminating       0               28m
jupyter-trustimaging-2dstride-2dbghm32h5                         0/1     Terminating       0               28m
jupyter-tadpoleloop-2didle-5fapocalypse-2d9nzc4g29               0/1     Terminating       0               27m
jupyter-tadpoleloop-2didle-5fapocalypse-2dyw17cs7c               0/1     Terminating       0               27m
jupyter-guiwitz-2dyrd2024-2d608uvoqc                             0/1     Terminating       0               27m
jupyter-gbrunkhorst-2dwebsite-2d42o8wdbv                         0/1     Terminating       0               27m
jupyter-trustimaging-2dstride-2da60i1kvc                         0/1     Terminating       0               27m
jupyter-deltares-2dra2ce-2djifu02j0                              0/1     Terminating       0               27m
jupyter-deltares-2dra2ce-2dest1glcr                              0/1     Terminating       0               27m
jupyter-joegreiner-2dbind-2danalysissandbox-2dtwk8wst9           0/1     Terminating       0               26m
jupyter-joegreiner-2dbind-2danalysissandbox-2dtfi260rg           0/1     Terminating       0               25m
jupyter-tadpoleloop-2didle-5fapocalypse-2dc1lal175               0/1     Terminating       0               25m
jupyter-guiwitz-2dyrd2024-2dm6xnrr63                             0/1     Terminating       0               25m
jupyter-tadpoleloop-2didle-5fapocalypse-2dmhnyqouy               0/1     Terminating       0               25m
jupyter-apache-2dsedona-2dm6ix15ks                               0/1     Terminating       0               25m
jupyter-frederic-2djunier-2dpremiere-2d9vnocr4c                  0/1     Terminating       0               22m
jupyter-gesiscss-2dorc2-2dtest-2dbuild-2do0cu52nn                0/1     Terminating       0               22m
jupyter-saphety-2ddevelopers-2dapi-2dnotebooks-2dp7b7sv02        0/1     Terminating       0               19m
jupyter-mdtanker-2dinvert4geom-2dkr1uwerf                        0/1     Terminating       0               15m
jupyter-guiwitz-2dyrd2024-2drrt8j13z                             0/1     Terminating       0               15m
jupyter-joegreiner-2dbind-2danalysissandbox-2dwf3qn2vo           0/1     Terminating       0               15m
jupyter-ploixs-2denergycommunitymodel-2d4oq74dkr                 0/1     Terminating       0               15m
jupyter-nelsonfrz-2dgauss-2dalgorithm-2dp43r3p5b                 0/1     Terminating       0               15m
jupyter-lucasjamar-2dpandas-2dor-2djfp3g4py                      0/1     Terminating       0               14m
jupyter-lucasjamar-2dpandas-2dor-2dvernw7nz                      0/1     Terminating       0               13m
jupyter-lucasjamar-2dpandas-2dor-2dimh47u9f                      0/1     Terminating       0               12m
jupyter-dynamicslab-2dpysindy-2dg5pje18e                         0/1     Terminating       0               12m
jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dkvt8gbxl                0/1     Terminating       0               12m
jupyter-meg-2dsimula-2din-2ds-2dbrain-2dtransport-2dhrdln0yz     0/1     Terminating       0               12m
jupyter-harpaf13-2ddata-2ecopper-2dphosphides-2driehw043         0/1     Terminating       0               11m
jupyter-apache-2dspark-2d0rjq4kie                                0/1     Terminating       0               10m
jupyter-meg-2dsimula-2din-2ds-2dbrain-2dtransport-2d9aztyszv     0/1     Terminating       0               9m11s
jupyter-anonymous-2d4383682-2d6fqnuwd9                           0/1     Terminating       0               8m33s
jupyter-theinze-2dmachinelearningdemo-2dftg5cews                 0/1     Terminating       0               7m5s
jupyter-meg-2dsimula-2din-2ds-2dbrain-2dtransport-2dtactif8v     0/1     Terminating       0               7m1s
jupyter-release-2dcandidate-2dlunazodiaco-2d7yl93v78             0/1     Terminating       0               5m58s
jupyter-meg-2dsimula-2din-2ds-2dbrain-2dtransport-2dvf6jyalm     0/1     Terminating       0               5m18s
jupyter-theinze-2dmachinelearningdemo-2d3b6168kp                 0/1     Terminating       0               5m18s
jupyter-joegreiner-2dbind-2danalysissandbox-2d9t3u66oz           0/1     Terminating       0               5m16s
jupyter-ploixs-2denergycommunitymodel-2db0unxfj4                 0/1     Terminating       0               5m11s
jupyter-bogdanhel-2dqpl-2dqi-2dsdk-2d3e3i3oqp                    0/1     Terminating       0               5m5s
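
For a quick headcount while this is ongoing, a one-liner such as the following should do (just a sketch; it counts the lines of the same listing):

$ kubectl get -n gesis pods --no-headers | grep -c Terminating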

This is because the GESIS server is not able to download the Docker images fast enough.


Screenshot 2024-06-05 at 10-05-25 Node Activity - Dashboards - Grafana
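
One way to confirm the slow pulls from the cluster itself (a sketch, assuming the relevant events are still within the event retention window) is to look at the Pulling/Pulled events, which include the pull duration:

$ kubectl get events -n gesis --sort-by=.lastTimestamp | grep -E 'Pulling|Pulled'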

CurveNote

Not affected yet.

Screenshot 2024-06-05 at 09-34-42 Pod Activity - Dashboards - Grafana

@rgaiacs rgaiacs self-assigned this Jun 5, 2024

rgaiacs commented Jun 6, 2024

@arnim my feeling is that something has been different since the machine reboot, but I can't see a correlation.


Screenshot 2024-06-06 at 09-44-18 View panel - Pod Activity - Dashboards - Grafana


Screenshot 2024-06-06 at 09-46-24 View panel - Node Activity - Dashboards - Grafana


Screenshot 2024-06-06 at 09-48-08 View panel - Node Activity - Dashboards - Grafana

We cleaned the image cache a few times, but only some of those cleanups show a corresponding spike in pulls.


rgaiacs commented Jun 7, 2024

cc @arnim

The number of pending pods on the GESIS server has gone down. In the last 24 hours, we had:

Screenshot 2024-06-07 at 08-35-09 View panel - Pod Activity - Dashboards - Grafana

My impression is that the pending pods are waiting for an image to become available, because the image needs to be built on our server, pushed to Docker Hub, and then pulled back to our server.

The peak of pending pods has a time correlation with the peak of build pods.

Screenshot 2024-06-07 at 08-35-18 View panel - Pod Activity - Dashboards - Grafana
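
To double-check that correlation outside Grafana, a rough sketch (assuming the build pods follow the usual BinderHub build-… naming and live in the same namespace) is to compare the two counts directly:

$ kubectl get pods -n gesis --no-headers --field-selector=status.phase=Pending | grep -c '^jupyter-'
$ kubectl get pods -n gesis --no-headers | grep -c '^build-'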


manics commented Jun 7, 2024

Do you think we need to add an additional limit to BinderHub for the number of pending spawns, to prevent too many pods being created or queued up?


rgaiacs commented Jun 7, 2024

Do you think we need to add an additional limit to BinderHub for the number of pending spawns, to prevent too many pods being created or queued up?

My hypothesis is that BinderHub receives a launch request and allocates a new Kubernetes pod for the launch. Because the image required by the pod does not exist yet, the pod goes into the Pending phase while BinderHub adds a build request to the queue. During some periods, the number of new image builds exceeds the GESIS server's capacity, and the pending pods start to accumulate.

Screenshot 2024-06-07 at 15-56-58 Pod Activity - Dashboards - Grafana

Screenshot 2024-06-07 at 15-57-07 Pod Activity - Dashboards - Grafana
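
If this hypothesis is right, the pending pods should pile up in creation order while the builds catch up. A sketch for watching that directly:

$ kubectl get pods -n gesis --field-selector=status.phase=Pending --sort-by=.metadata.creationTimestamp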


rgaiacs commented Aug 1, 2024

I'm still puzzled about why we have such big peaks of pending pods.

Screenshot 2024-08-01 at 10-15-17 Pod Activity - Dashbo

I can understand a few (fewer than 10) pods pending because of the network, for example when an image is larger than usual. I don't understand almost 40 pods pending at the same time.

I checked

  • number of builds
  • number of pulls
  • latency
  • disk available

and I did not find any correlation.

@manics @sgibson91 do you have any clue where I should look for a correlation? Thanks!


manics commented Aug 1, 2024

Is a particular spike of pods related to the same repository?

Assuming your Prometheus labels are the same as on Curvenote, try this Prometheus query for a time window around the spike: sum(label_replace(kube_pod_status_phase{phase="Pending",pod=~"jupyter-.*"}, "repo", "$1", "pod", "jupyter-(.+)-[^-]+")) by (repo)
E.g. https://prometheus.binder.curvenote.dev/graph?g0.expr=sum(label_replace(kube_pod_status_phase%7Bphase%3D%22Pending%22%2Cpod%3D~%22jupyter-.*%22%7D%2C%20%22repo%22%2C%20%22%241%22%2C%20%22pod%22%2C%20%22jupyter-(.%2B)-%5B%5E-%5D%2B%22))%20by%20(repo)&g0.tab=0&g0.display_mode=lines&g0.show_exemplars=0&g0.range_input=30m&g0.end_input=2024-08-01%2010%3A30%3A00&g0.moment_input=2024-08-01%2010%3A30%3A00
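
If you want to run the same query against the GESIS Prometheus directly, a rough sketch via the HTTP API (the service name and namespace below are assumptions, adjust them to your deployment):

$ kubectl -n monitoring port-forward svc/prometheus 9090:9090 &
$ curl -sG 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=sum(label_replace(kube_pod_status_phase{phase="Pending",pod=~"jupyter-.*"}, "repo", "$1", "pod", "jupyter-(.+)-[^-]+")) by (repo)'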


rgaiacs commented Aug 1, 2024

Thanks @manics.

Is a particular spike of pods related to the same repository?

Yes, it is.

Screenshot 2024-08-01 at 14-57-15 Prometheus Time Serie

A large number of the pending pods were for the same repository. My assumption is that someone is giving a lecture or course: 10+ learners access mybinder.org at the same time, and because the server does not have the Docker image cached, all 10+ users stay in the pending state until the Docker image is downloaded.

I'm closing this as "won't fix".

cc @arnim


rgaiacs commented Aug 5, 2024

@arnim This is my current understanding of the problem:

  1. A user visits https://mybinder.org/ and requests a launch.
  2. mybinder.org assigns the request to the GESIS server.
  3. The GESIS server allocates a Kubernetes pod to fulfil the request.
  4. Kubernetes waits for the image to become available to the pod.
  5. BinderHub's wait times out, and BinderHub sends Kubernetes a request to terminate the pod.
  6. Kubernetes can only kill the pod once it is running, but the pod can only run when the image is available, so Kubernetes ends up waiting for a pod to start only to kill it (see the sketch below).

This short video highlights one pod in the scenario described above: https://github.com/user-attachments/assets/131cda9f-c9c6-4195-b219-0d7a48e217ff
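
A sketch for spotting pods stuck in this state, i.e. pods that already have a deletion timestamp but are still waiting on their image (assumes jq is available on the machine running kubectl):

$ kubectl get pods -n gesis -o json | jq -r '.items[]
    | select(.metadata.deletionTimestamp != null)
    | [.metadata.name, .metadata.deletionTimestamp] | @tsv'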


manics commented Aug 5, 2024

You're right... easily reproducible:

$ kubectl run xxx --image=quay.io/jupyter/datascience-notebook
pod/xxx created

$ kubectl get pod
NAME   READY   STATUS              RESTARTS   AGE
xxx    0/1     ContainerCreating   0          7s

$ kubectl delete pod xxx
# Takes ages.... 'kubectl get events --watch' shows the image is pulled before it's deleted
pod "xxx" deleted

If the image doesn't exist, you should get Error: ImagePullBackOff and it should be possible to delete the pod straight away.
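
For comparison, a sketch of the non-existent-image case (the image name below is deliberately made up):

$ kubectl run yyy --image=quay.io/jupyter/this-image-does-not-exist
$ kubectl get pod yyy     # ends up in ErrImagePull / ImagePullBackOff
$ kubectl delete pod yyy  # deletes promptly, no long wait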

I found this related issue: kubernetes/kubernetes#121435


rgaiacs commented Aug 7, 2024

OVH and GESIS are observing a large number of pending pods. Curvenote looks OK.

OVH

Screenshot 2024-08-07 at 16-59-26 Pod Activity - Dashbo

GESIS

Screenshot 2024-08-07 at 17-01-28 Pod Activity - Dashbo

It is getting worse and worse.

Curvenote

Screenshot 2024-08-07 at 17-00-21 Pod Activity - Dashbo


rgaiacs commented Aug 27, 2024

I'm closing this in favour of #3056.

@rgaiacs rgaiacs closed this as completed Aug 27, 2024