Add links for PagerDuty and Alertmanager and explain some variables

2i2c-org · Jan 24, 2025 · f158ee3
1 parent 3e050a1
commit f158ee3
Showing 1 changed file with 23 additions and 1 deletion.
diff --git a/docs/howto/features/storage-quota.md b/docs/howto/features/storage-quota.md
@@ -126,8 +126,12 @@ Once this is deployed, the hub will automatically enforce the storage quota for
 ## Enabling alerting through Prometheus Alertmanager
 
 Once we have enabled storage quotas, we want to be alerted when the disk usage of the NFS server exceeds a certain threshold so that we can take appropriate action.
-
 To do this, we need to create a Prometheus rule that will alert us when the disk usage of the NFS server exceeds a certain threshold using Alertmanager.
+We will then forward Alertmanager's alert to PagerDuty.
+
+```{note}
+Use these resources to learn more about [PagerDuty's Prometheus integration](https://www.pagerduty.com/docs/guides/prometheus-integration-guide/) and [Prometheus' Alertmanager configuration](https://prometheus.io/docs/alerting/latest/configuration/).
+```
 
 First, we need to enable Alertmanager in the hub's support values file (for example, [here's the one for the `nasa-veda` cluster](https://github.com/2i2c-org/infrastructure/blob/main/config/clusters/nasa-veda/support.values.yaml)).
 
@@ -158,6 +162,13 @@ prometheus:
                 summary: "jupyterhub-home-nfs EBS volume full in namespace {{ $labels.namespace }}"
 ```
 
+```{note}
+The important variables to note here are:
+
+- `expr`: The PromQL expression that Prometheus evaluates; the alert condition is met whenever the expression returns a result
+- `for`: How long `expr` must hold continuously before the alert moves from pending to firing
+```
+
 And finally, we need to configure Alertmanager to send alerts to PagerDuty.
 
 ```yaml
@@ -179,6 +190,17 @@ prometheus:
               namespace: <hub_name>
 ```
 
+```{note}
+The important variables to understand here are:
+
+- `group_wait`: How long Alertmanager waits before sending the first notification to PagerDuty for a new group of alerts, so that alerts arriving close together are batched
+- `group_interval`: How long Alertmanager waits before notifying PagerDuty about new alerts added to a group for which an initial notification has already been sent
+- `repeat_interval`: How long Alertmanager waits before re-sending a notification to PagerDuty for alerts that are still firing and have already been notified successfully
+- `match`: The labels an alert must carry to be handled by this route; this is how we manage separate incidents per hub per cluster in PagerDuty
+
+[Read more about these configuration options.](https://prometheus.io/docs/alerting/latest/configuration/#route)
+```
+
 ## Increasing the size of the volume used by the NFS server
 
 If the volume used by the NFS server is close to being full, we may need to increase the size of the volume. This can be done by following the instructions in the [Increase the size of an AWS EBS volume](howto:increase-size-aws-ebs) guide.
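
As a reading aid for the notes added in this commit, here is a minimal, standalone sketch of where `expr`, `for`, `group_wait`, `group_interval`, `repeat_interval`, and `match` sit in a plain Prometheus rule file and a plain Alertmanager configuration. It is not the configuration from `support.values.yaml`: the group name, alert name, mountpoint, threshold, durations, hub name, and routing key placeholder are all assumptions chosen for illustration, and the nesting under the `prometheus:` key in the values file is left out.

```yaml
# Illustrative only: every name, threshold, and duration below is an
# assumption, not a value taken from the 2i2c configuration.

# Part 1: a standalone Prometheus rule file.
groups:
  - name: example-disk-usage            # hypothetical group name
    rules:
      - alert: ExampleVolumeAlmostFull  # hypothetical alert name
        # Condition: less than 10% of the filesystem is still available.
        # The mountpoint is a placeholder.
        expr: |
          node_filesystem_avail_bytes{mountpoint="/example"}
            / node_filesystem_size_bytes{mountpoint="/example"} < 0.1
        # The condition must hold for 15 minutes without interruption
        # before the alert changes from pending to firing.
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Example volume almost full on {{ $labels.mountpoint }}"
---
# Part 2: a standalone Alertmanager configuration.
route:
  receiver: pagerduty
  group_wait: 30s       # delay before the first notification for a new group
  group_interval: 5m    # delay before notifying about alerts added to a group
  repeat_interval: 4h   # delay before repeating a notification that was sent
  routes:
    - receiver: pagerduty
      # Only alerts carrying this label are handled by this child route.
      match:
        namespace: example-hub          # hypothetical hub name
receivers:
  - name: pagerduty
    pagerduty_configs:
      - routing_key: <pagerduty_integration_key>
```

If the two parts are saved as separate files (split at the `---` separator), `promtool check rules` and `amtool check-config` should be able to validate them before anything is deployed.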