Commit 2115af1

Blog post about our incident report (#483)
1 parent 88ad971 commit 2115af1

6 files changed: +73 -2 lines changed

content/authors/admin/avatar.jpg

-12.6 KB
Binary file not shown.

content/authors/admin/avatar.png

11.8 KB
Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+---
+title: "Incident report: UC Merced user throttling during class startup"
+date: "2025-08-29"
+category: incident-report
+tags:
+- cloud
+- open-source
+---
+
+On August 29, 2025, our cloud infrastructure team experienced an incident with the UC Merced community hub when students tried to log in simultaneously at the start of class. For more technical detail, see our [full incident report](https://github.com/2i2c-org/incident-reports/blob/main/reports/2025-08-29-ucmerced-too-many-users-throttled.pdf).
+
+## What happened
+
+- Students had trouble logging in to the hub when many of them tried at the same time at the start of class.
+- The concurrent spawn limit was reached quickly because a large number of users started servers simultaneously.
+- The autoscaler had to bring up new nodes, which took roughly 10 minutes from start to finish.
+- Users who retried after 1 minute weren't guaranteed to get their servers started immediately, since new nodes were still spinning up.
+- This was an "expected" scale-up event, but the lack of clear messaging led users to interpret it as instability.
+
+## What we learned
+
+- We need better communication so users can tell when infrastructure slowness is "expected" rather than a sign of instability.
+- We need better alerting for throttling of concurrent user startups: we found out about this issue from users rather than from automated monitoring.
+- We learned that JupyterHub's metrics don't properly expose `429` status codes in our dashboards.
+- This will happen again if we don't have proper scaling limits and node provisioning strategies for sudden influxes of users.
+
+## Resolution
+
+We implemented several fixes:
+- Increased the concurrent spawn limit from 64 to 100.
+- Put UC Merced users on larger nodes to reduce the number of node spin-ups needed. This will cost more in cloud spend but result in fewer scale-up events.
+- Created action items to improve logging, alerting, and monitoring for similar incidents.
+
+## Acknowledgements
+
+- Thanks to UC Merced students and instructors for reporting the issue through our support system.
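
The throttling described in this post can be illustrated with a rough model. The sketch below is not part of the commit. It assumes every student requests a server at the same instant, that requests beyond the concurrent spawn limit are rejected with a `429` response, and that rejected users retry in later waves. The function name `spawn_waves` and the class size of 150 are hypothetical, chosen only for illustration; the limits 64 and 100 come from the Resolution section.

```python
import math


def spawn_waves(n_users: int, concurrent_spawn_limit: int) -> dict:
    """Split simultaneous spawn requests into accepted vs. throttled ones.

    Simplified model (an assumption, not JupyterHub's actual scheduler):
    the first `concurrent_spawn_limit` requests are accepted, the rest are
    rejected with HTTP 429 and must retry in a later wave.
    """
    if n_users < 0 or concurrent_spawn_limit <= 0:
        raise ValueError("n_users must be >= 0 and the limit must be > 0")
    accepted = min(n_users, concurrent_spawn_limit)
    throttled = n_users - accepted
    # Number of retry rounds needed before everyone has been accepted.
    waves = math.ceil(n_users / concurrent_spawn_limit)
    return {"accepted": accepted, "throttled": throttled, "waves": waves}


# Hypothetical class of 150 students, old limit vs. new limit.
for limit in (64, 100):
    print(limit, spawn_waves(150, limit))
```

This model ignores node provisioning time, which the post identifies as taking roughly 10 minutes, so it only shows why a higher limit reduces rejected logins, not how long users actually waited.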

content/blog/README.md

Lines changed: 8 additions & 2 deletions
@@ -25,9 +25,15 @@ Follow these steps:
 
 Here's an example from a recent talk that Yuvi gave:
 
-- Here's the GitHub issue: https://github.com/2i2c-org/2i2c-org.github.io/issues/470
-- Here's the blog post that followed: ../../content/blog/2025/doepy-yuvi/index.md
+- Example GitHub issue: https://github.com/2i2c-org/2i2c-org.github.io/issues/470
+- The blog post that followed: ../../content/blog/2025/doepy-yuvi/index.md
 
+### Example incident report
+
+Incident reports are a way to provide transparency about what happened and what we learned as we run infrastructure. We create incident reports at https://github.com/2i2c-org/incident-reports so blog post GitHub issues don't need to have any content other than a link to the report.
+
+- Example incident report issue: https://github.com/2i2c-org/2i2c-org.github.io/issues/482
+- The blog post that followed: ../../content/blog/2025/incident-ucmerced-throttling
 ## How to write our Hugo directives
 
 Here are a few example Hugo directives for quick reference.
File renamed without changes.
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+---
+# THIS IS A TEMPLATE FOR COPY-PASTING TO MAKE IT EASIER TO CREATE BLOG POSTS
+# folder name: incident-short-title/
+title: "Incident report: Brief description of the incident"
+date: "2999-01-01"
+category: incident-report
+tags:
+- cloud
+---
+
+On MMMM DD, YYYY our cloud infrastructure team experienced an incident with the XXXXX community hub. [See this issue for the full report](LINK TO ISSUE IN 2i2c-org/incident-reports).
+
+## What happened
+
+- ___ thing happened.
+- This resulted in ___.
+- It happened because ___.
+- It was resolved by ___.
+
+## What we learned
+
+- We need to ___.
+- We learned that ___.
+- This will happen again if ___.
+
+## Acknowledgements
+
+- Thanks to ___ for helping us identify the problem.
+- Thanks to ___ for helping us implement a solution.
