Add incident report: 2023-02-01 leap hub, dask-gateway induced outage #4
Conversation
I've left some comments. In general an incident report should contain more than just links back to the issue.
@@ -0,0 +1,20 @@
# 2023-02-01 Heavy use of dask-gateway induced critical pod evictions
Please add a summary statement and impact.
All times in UTC+1

- 2023-02-01 - [Summary of issue updated between ~8-9 PM](https://github.com/2i2c-org/infrastructure/issues/2126#issuecomment-1412554908)
Generally I'd expect to see a full timeline here, not a link - e.g. https://sre.google/sre-book/example-postmortem/
However, looking at the other incident reports in this repo, it doesn't seem like we have a clear shared expectation of what should be here.
## What went wrong

- I believe various critical pods on core nodes got evicted when prometheus started scraping metrics exporters from ~200 nodes
Can we do better than belief? Is there not evidence to support your hypothesis here? That's what I'd expect the incident report/postmortem to reveal through analysis of the contributing factors/root causes/triggers.
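For reference, one way such eviction evidence could be gathered after the fact - a minimal sketch, assuming kubeconfig access to the affected cluster and the official `kubernetes` Python client; nothing here is taken from the incident itself:

```python
# Hypothetical evidence-gathering sketch, not part of the incident report:
# evicted pods linger in phase "Failed" with status.reason == "Evicted"
# until they are garbage-collected, so listing them shortly after the
# incident can support or rule out the eviction hypothesis.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    if pod.status.reason == "Evicted":
        print(pod.metadata.namespace, pod.metadata.name, pod.status.message)
```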
## Follow-up improvements

- #2127
Please make these full markdown links with a summary line
## What went wrong

- I believe various critical pods on core nodes got evicted when prometheus started scraping metrics exporters from ~200 nodes
- I think it's likely, but I can't say for sure, that the dask scheduler pod would also run into resource limitations with this number of workers
Is this what went wrong? I'm not sure what this speculation is for in terms of analyzing the incident. Is this a lesson learned?
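If scheduler resource pressure does turn out to be a contributing factor, one possible follow-up would be giving scheduler pods more headroom in the gateway configuration. A sketch only, assuming dask-gateway's Kubernetes backend (`KubeClusterConfig`) and purely illustrative values, not this hub's actual settings:

```python
# Hypothetical dask-gateway server config sketch (illustrative values):
# raise the scheduler pod's requests/limits so clusters with ~200 workers
# are less likely to hit scheduler resource limits.
c.KubeClusterConfig.scheduler_cores = 1
c.KubeClusterConfig.scheduler_cores_limit = 2
c.KubeClusterConfig.scheduler_memory = "4G"
c.KubeClusterConfig.scheduler_memory_limit = "8G"
```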
@pnasrat thank you for the review! These are all relevant points; I'll leave them unanswered for now since I'm feeling time pressure to focus on other things.
This is not going to receive any updates after 8 months, so let's get it merged as is.
Marked as draft pending time available to refine this further based on feedback!
Related

- hub - massive node scale up, hub pod restarted for unknown reason infrastructure#2126