From f158ee3a70e433a75e112ce50df2ac6adbc9b07a Mon Sep 17 00:00:00 2001
From: Sarah Gibson
Date: Fri, 24 Jan 2025 14:15:16 +0000
Subject: [PATCH] Add links for PagerDuty and Alertmanager and explain some variables

---
 docs/howto/features/storage-quota.md | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/docs/howto/features/storage-quota.md b/docs/howto/features/storage-quota.md
index 226521ffb..11bfc7c0b 100644
--- a/docs/howto/features/storage-quota.md
+++ b/docs/howto/features/storage-quota.md
@@ -126,8 +126,12 @@ Once this is deployed, the hub will automatically enforce the storage quota for
 
 ## Enabling alerting through Prometheus Alertmanager
 
 Once we have enabled storage quotas, we want to be alerted when the disk usage of the NFS server exceeds a certain threshold so that we can take appropriate action.
-
 To do this, we need to create a Prometheus rule that will alert us when the disk usage of the NFS server exceeds a certain threshold using Alertmanager.
+We will then forward Alertmanager's alerts to PagerDuty.
+
+```{note}
+Use these resources to learn more about [PagerDuty's Prometheus integration](https://www.pagerduty.com/docs/guides/prometheus-integration-guide/) and [Prometheus' Alertmanager configuration](https://prometheus.io/docs/alerting/latest/configuration/).
+```
 
 First, we need to enable Alertmanager in the hub's support values file (for example, [here's the one for the `nasa-veda` cluster](https://github.com/2i2c-org/infrastructure/blob/main/config/clusters/nasa-veda/support.values.yaml)).
@@ -158,6 +162,13 @@ prometheus:
                 summary: "jupyterhub-home-nfs EBS volume full in namespace {{ $labels.namespace }}"
 ```
 
+```{note}
+The important variables to note here are:
+
+- `expr`: The PromQL expression that Prometheus will evaluate
+- `for`: How long `expr` must remain true before the alert fires
+```
+
 And finally, we need to configure Alertmanager to send alerts to PagerDuty.
 
 ```yaml
 prometheus:
   alertmanager:
     enabled: true
     config:
       route:
         group_wait: 10s
         group_interval: 5m
         receiver: pagerduty
         repeat_interval: 3h
         routes:
           - receiver: pagerduty
             match:
@@ -179,6 +190,17 @@
               channel: pagerduty
               namespace:
 ```
 
+```{note}
+The important variables to understand here are:
+
+- `group_wait`: How long Alertmanager initially waits before sending a notification to PagerDuty for a new group of alerts
+- `group_interval`: How long Alertmanager waits before notifying PagerDuty about new alerts added to a group for which an initial notification has already been sent
+- `repeat_interval`: How long Alertmanager waits before re-sending a notification to PagerDuty for an alert that has already been successfully notified
+- `match`: The labels used to match fired alerts to this route, which is how we manage separate incidents per hub and per cluster in PagerDuty
+
+[Read more about these configuration options.](https://prometheus.io/docs/alerting/latest/configuration/#route)
+```
+
 ## Increasing the size of the volume used by the NFS server
 
 If the volume used by the NFS server is close to being full, we may need to increase the size of the volume. This can be done by following the instructions in the [Increase the size of an AWS EBS volume](howto:increase-size-aws-ebs) guide.
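
For readers who want to see how `expr` and `for` fit together, the sketch below shows a minimal, hypothetical Prometheus alerting rule for a nearly-full filesystem. It is not the rule used in this repository: the group name, alert name, label selectors, and the 10% threshold are illustrative assumptions only; the real rule lives in the support values file referenced in the diff.

```yaml
# Hypothetical sketch only -- not the rule shipped in this repository.
# Illustrates how `expr` and `for` interact in a Prometheus alerting rule.
groups:
  - name: nfs-disk-usage            # assumed group name
    rules:
      - alert: NFSVolumeAlmostFull  # assumed alert name
        # `expr` is the PromQL expression evaluated on every rule evaluation cycle.
        # Here: trigger when less than 10% of the filesystem is still available.
        expr: node_filesystem_avail_bytes{mountpoint="/shared-volume"} / node_filesystem_size_bytes{mountpoint="/shared-volume"} < 0.1
        # `for` requires the expression to stay true for this long before the alert
        # fires, which avoids paging on short-lived spikes.
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Filesystem has less than 10% free space"
```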
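
The route shown in the diff points at a receiver named `pagerduty`, but the receiver definition itself sits outside the changed hunks. As rough context, an Alertmanager receiver wired to PagerDuty might look like the sketch below; the nesting mirrors the `prometheus.alertmanager.config` structure used in the diff, and the integration key is a placeholder that would normally come from a secret rather than being written in plain text.

```yaml
# Rough sketch of a PagerDuty receiver definition -- placeholder values,
# not the configuration used in this repository.
prometheus:
  alertmanager:
    config:
      receivers:
        - name: pagerduty
          pagerduty_configs:
            # Events API integration key generated for the service in PagerDuty
            - service_key: PAGERDUTY-INTEGRATION-KEY-PLACEHOLDER
              # Notify PagerDuty when the alert resolves so the incident can auto-resolve
              send_resolved: true
```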