Configure some CPU/memory requests for hub and proxy pods in basehub #2127
I think looking at observed usage metrics and setting appropriate requests and limits is a good idea! We don't want them to be too high (especially in shared clusters) as that might increase overall cost, but we already have data for this so I leave it to you to figure out a decent number and get it there! I agree that the current situation has to change.
Ref https://2i2c.freshdesk.com/a/tickets/414 We should figure out better defaults in 2i2c-org#2127, but as LEAP is getting close to publication on some stuff, this will help us with stabilizing the infrastructure.
It's a long running connection kept open, serving progressbar responses via [EventSource](https://developer.mozilla.org/en-US/docs/Web/API/EventSource). So it can't be treated as a regular HTTP request / response. Getting rid of this unmasks more real problems in hub response latency by removing this noise. Ref 2i2c-org/infrastructure#2127 (comment)
@consideRatio good catch, I opened jupyterhub/grafana-dashboards#59 as a 'fix' on grafana.
Currently, the hub and proxy pods request very little CPU/memory, while various other pods already request 100m by default. This could starve the hub/proxy pods of CPU. For the sake of stability, I think we should grant the hub pod 1 full CPU, and let these pods request enough memory that we can be confident they won't get evicted or OOMKilled.
It also seems that the hub/proxy pods' current 128 MB memory request isn't covering the actual need. It would be good to request more memory than we typically use so that we don't risk eviction or an OOMKill.
This could have been relevant to the incident in #2126; even if it wasn't, it would be good to rule it out by increasing these requests.
Config in basehub
If we provide a 10m request while other pods on the node have 100m requests and are going full throttle, they each get a ten times larger share of CPU than the hub pod, since contended CPU time is divided in proportion to requests. On core nodes with 4 CPUs, that means our hub pod could end up with only about 0.4 CPU.
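A rough sketch of that proportional-share arithmetic (my own numbers: assuming a single competing pod with a 100m request, both pods fully busy on a 4 CPU node, and contended CPU split in proportion to requests):

$$
\text{hub CPU} \approx \frac{10\,\mathrm{m}}{10\,\mathrm{m} + 100\,\mathrm{m}} \times 4 \approx 0.36\ \text{CPU}
$$

which is roughly the 0.4 CPU figure above; with more busy neighbours the hub's slice shrinks further.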
My understanding is that the hub pod can benefit from up to 1 full CPU from time to time, but I'm a bit unsure about that. I recall a Grafana dashboard I've seen in the past that presented metrics in a way that fails to capture the peaks properly unless you zoom in.
@yuvipanda I think we could put 50m or 100m in the requests here for the hub pod, to reduce the risk of it getting throttled well before 1 CPU when competing with other pods. What do you think?
Hub pod
infrastructure/helm-charts/basehub/values.yaml
Lines 433 to 439 in a9d8816
Proxy pod
infrastructure/helm-charts/basehub/values.yaml
Lines 139 to 146 in a9d8816
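For illustration only, bumping these requests in basehub's values.yaml could look roughly like the sketch below. The numbers are placeholders to be replaced with values derived from our Grafana usage data, and I'm assuming the usual z2jh keys (`hub.resources` and `proxy.chp.resources`) nested under the chart's `jupyterhub` block:

```yaml
# Illustrative sketch, not a tested change: pick the numbers from observed usage.
jupyterhub:
  hub:
    resources:
      requests:
        cpu: 100m      # enough scheduling weight to not be starved by 100m neighbours
        memory: 512Mi  # comfortably above typical usage to avoid eviction/OOMKill
      limits:
        memory: 1Gi
  proxy:
    chp:
      resources:
        requests:
          cpu: 100m
          memory: 256Mi
        limits:
          memory: 512Mi
```

Leaving out CPU limits keeps the hub free to burst toward a full CPU when the node has headroom, while the requests still guarantee it a minimum share.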