Add incident report: 2023-02-01 leap hub, dask-gateway induced outage
consideRatio committed Mar 16, 2023
1 parent cc52c15 commit b77f88b
Showing 1 changed file with 20 additions and 0 deletions.
20 changes: 20 additions & 0 deletions reports/2023-02-01-daskhub-induced-evictions.md
@@ -0,0 +1,20 @@
# 2023-02-01 Heavy use of dask-gateway induced critical pod evictions

## Timeline

All times in UTC+1

- 2023-02-01 - [Summary of issue updated between ~8-9 PM](https://github.com/2i2c-org/infrastructure/issues/2126#issuecomment-1412554908)

## What went wrong

- I believe various critical pods on core nodes got evicted when Prometheus started scraping the metrics exporters on ~200 nodes
- I think it's likely, but I can't say for sure, that the dask scheduler pod would also have run into resource limitations with this many workers

## Follow-up improvements

- #2127
- #2213
- #2209
- #2324
- #2366
