[Incident] leap clusters prod hub - massive node scale up, hub pod restarted for unknown reason #2126
Grafana reports
Observations
The ingress-nginx pod and the proxy pod had both run reliably for a long time. The hub pod, however, had been restarted. I didn't get to verify what it was related to. My theory is that the hub pod was running on a node where prometheus-server hogged almost all of the memory, which made the hub pod get evicted when the node ran low on memory.
So why did that happen just then? It appears that ~100+ dask-worker nodes were added as a consequence of someone using dask-gateway.
Hmmm, but it seems that the hub pod stayed at a consistent memory level, yet became unresponsive. Maybe it was restarted by the livenessProbe failing 30 times in a row - as would happen if the hub was unresponsive for five minutes...
I wonder if we, with CPU requests of 10 milli-cores (0.01 CPU), also ended up CPU starved? I'm not sure. We should probably have at least 100m for these pods, as they could otherwise be outcompeted to less than 1 CPU by pods with higher requests. Several pods have 100m requests and are therefore allocated 10x more CPU than the hub/proxy pods. I opened #2127. A sketch of what such a bump could look like follows the Grafana panels below.
[Grafana panels: Core node 1 / 4, Core node 2 / 4]
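As a concrete reference for the request sizes discussed above, here is a minimal sketch of bumping the hub and proxy CPU requests to 100m, written as z2jh-style chart values. The `jupyterhub:` wrapper, the exact key paths, and the numbers are assumptions for illustration, not necessarily what #2127 changed. The livenessProbe settings shown are the chart-default-style values that produce the five-minute window mentioned above (30 consecutive failures at a 10-second period = 300 s).

```yaml
# Hypothetical z2jh-style values sketch; key paths and numbers are
# illustrative assumptions, not necessarily what #2127 changed.
jupyterhub:
  hub:
    resources:
      requests:
        cpu: 100m            # was 10m; a 10m pod can be outcompeted 10:1 by 100m pods
    livenessProbe:
      enabled: true
      periodSeconds: 10      # probe every 10 seconds
      failureThreshold: 30   # 30 consecutive failures ~= 5 minutes of unresponsiveness
      timeoutSeconds: 3
  proxy:
    chp:
      resources:
        requests:
          cpu: 100m          # same reasoning as for the hub pod
```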
100+ nodes
In a matter of a minute, 100 nodes were added, probably along with 100+ dask worker pods.
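Relatedly, on the memory-pressure theory above: under node memory pressure the kubelet tends to evict pods whose memory usage exceeds their requests first, so a prometheus-server pod without a memory bound can push co-located pods such as the hub off the node. Below is a minimal sketch of bounding it, assuming the upstream prometheus chart's `server.resources` values key; the key path in 2i2c's support chart and the numbers are illustrative assumptions, offered as a possible follow-up rather than something done during this incident.

```yaml
# Hypothetical sketch for the prometheus-server pod; key path and
# numbers are illustrative, not taken from the incident.
prometheus:
  server:
    resources:
      requests:
        memory: 4Gi   # reserve roughly what it normally uses
      limits:
        memory: 8Gi   # keep it from consuming nearly all of a core node's memory
```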
Reference actions already taken. Then we can write the incident report.
Triage: Is there any update for the community member who reported the support ticket, or should it be closed? https://2i2c.freshdesk.com/a/tickets/414
I think the ticket should be closed (cc @consideRatio who was involved in the incident).
2023-02-01 Heavy use of dask-gateway induced critical pod evictions
Timeline
All times in UTC+1
What went wrong
Follow-up improvements
Incident report PR in 2i2c-org/incident-reports#4
Summary
Impact on users
Important information
Tasks and updates
After-action report template