Commit

Add links for PagerDuty and Alertmanager and explain some variables
sgibson91 committed Jan 24, 2025
1 parent 3e050a1 commit f158ee3
Showing 1 changed file with 23 additions and 1 deletion.
24 changes: 23 additions & 1 deletion docs/howto/features/storage-quota.md
@@ -126,8 +126,12 @@ Once this is deployed, the hub will automatically enforce the storage quota for
## Enabling alerting through Prometheus Alertmanager

Once we have enabled storage quotas, we want to be alerted when the disk usage of the NFS server exceeds a certain threshold so that we can take appropriate action.

To do this, we create a Prometheus alerting rule that fires when the NFS server's disk usage crosses that threshold.
Alertmanager then forwards the alert to PagerDuty.

```{note}
Use these resources to learn more about [PagerDuty's Prometheus integration](https://www.pagerduty.com/docs/guides/prometheus-integration-guide/) and [Prometheus' Alertmanager configuration](https://prometheus.io/docs/alerting/latest/configuration/).
```

First, we need to enable Alertmanager in the hub's support values file (for example, [here's the one for the `nasa-veda` cluster](https://github.com/2i2c-org/infrastructure/blob/main/config/clusters/nasa-veda/support.values.yaml)).

@@ -158,6 +162,13 @@ prometheus:
summary: "jupyterhub-home-nfs EBS volume full in namespace {{ $labels.namespace }}"
```

```{note}
The important variables to note here are:

- `expr`: The PromQL expression that Prometheus evaluates to decide whether the alert condition is met
- `for`: How long `expr` must remain continuously true before the alert fires; until then, the alert is only pending
```
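
To illustrate how `expr` and `for` interact, here is a minimal sketch of an alerting rule in the standard Prometheus rule-file layout. The group name, alert name, mountpoint, and 10% threshold are assumptions made up for this example and are not the values used in the rule above.

```yaml
# Hypothetical rule for illustration only; names and thresholds are assumptions.
groups:
  - name: example-disk-usage
    rules:
      - alert: ExampleDiskAlmostFull
        # Condition checked at every rule evaluation:
        # less than 10% of the filesystem is still available.
        expr: node_filesystem_avail_bytes{mountpoint="/example"} / node_filesystem_size_bytes{mountpoint="/example"} < 0.1
        # The condition must hold continuously for 5 minutes before the alert fires.
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Example volume almost full in namespace {{ $labels.namespace }}"
```

With `for: 5m`, a short dip below the threshold that recovers within five minutes leaves the alert pending, so nothing is sent to Alertmanager.
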
And finally, we need to configure Alertmanager to send alerts to PagerDuty.
```yaml
@@ -179,6 +190,17 @@ prometheus:
namespace: <hub_name>
```
```{note}
The important variables to understand here are:

- `group_wait`: How long Alertmanager waits before sending the first notification to PagerDuty for a new group of alerts, so that alerts arriving close together are batched
- `group_interval`: How long Alertmanager waits before notifying PagerDuty about new alerts added to a group for which an initial notification has already been sent
- `repeat_interval`: How long Alertmanager waits before re-sending a notification to PagerDuty for alerts that are still firing after a successful notification
- `match`: The label values an alert must carry to be handled by this route; this is how we manage separate incidents per hub per cluster in PagerDuty

[Read more about these configuration options.](https://prometheus.io/docs/alerting/latest/configuration/#route)
```
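
As a rough sketch of how these options fit together, the following Alertmanager configuration routes alerts from one hub to a PagerDuty receiver. The receiver names, label values, and durations are assumptions chosen for illustration, and `<pagerduty_integration_key>` is a placeholder rather than a real key.

```yaml
# Hypothetical Alertmanager configuration for illustration only.
route:
  # Fallback receiver for alerts that match no child route.
  receiver: example-default
  # Wait 10s before the first notification for a new group of alerts.
  group_wait: 10s
  # Wait 5m before notifying about new alerts added to an already-notified group.
  group_interval: 5m
  # Re-send a notification every 3h while the alerts keep firing.
  repeat_interval: 3h
  routes:
    # Only alerts carrying both of these labels are sent to PagerDuty.
    - receiver: example-pagerduty
      match:
        cluster: example-cluster
        namespace: example-hub
receivers:
  - name: example-default
  - name: example-pagerduty
    pagerduty_configs:
      - routing_key: <pagerduty_integration_key>
```

Child routes inherit the timing options from the top-level `route` unless they override them, so a per-hub route usually only needs its `match` labels and a receiver.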

## Increasing the size of the volume used by the NFS server

If the volume used by the NFS server is close to being full, we may need to increase the size of the volume. This can be done by following the instructions in the [Increase the size of an AWS EBS volume](howto:increase-size-aws-ebs) guide.