Add monitoring documentation (#326)
* Add monitoring documentation

* Address PR review feedback

* Address PR review feedback (2)

---------

Co-authored-by: Ismail Alidzhikov <[email protected]>
dimitar-kostadinov and ialidzhikov authored Jan 28, 2025
1 parent bcdabe0 commit 39693a4
Showing 3 changed files with 82 additions and 1 deletion.
1 change: 1 addition & 0 deletions README.md
@@ -10,6 +10,7 @@ Gardener extension controller which deploys pull-through caches for container re

- [Configuring the Registry Cache Extension](docs/usage/registry-cache/configuration.md) - learn what is the use-case for a pull-through cache, how to enable it and configure it
- [How to provide credentials for upstream repository?](docs/usage/registry-cache/upstream-credentials.md)
- [Registry Cache Observability](docs/usage/registry-cache/observability.md) - learn what metrics and alerts are exposed and how to view the registry cache logs
- [Configuring the Registry Mirror Extension](docs/usage/registry-mirror/configuration.md) - learn what is the use-case for a registry mirror, how to enable and configure it

## Local Setup and Development
2 changes: 1 addition & 1 deletion docs/usage/registry-cache/configuration.md
@@ -84,7 +84,7 @@ The registry-cache extension deploys a StatefulSet with a volume claim template.

The `providerConfig.caches[].volume.size` field is the size of the registry cache volume. Defaults to `10Gi`. The size must be a positive quantity (greater than 0).
This field is immutable. See [Increase the cache disk size](#increase-the-cache-disk-size) on how to resize the disk.
The extension defines [alerts](https://github.com/gardener/gardener-extension-registry-cache/blob/v0.10.0/pkg/component/registrycaches/monitoring.go#L40-L105) for the volume. See [Alerting for Users](https://github.com/gardener/gardener/blob/master/docs/monitoring/alerting.md#alerting-for-users) on how to enable notifications for Shoot cluster alerts.
The extension defines alerts for the volume. More information about the registry cache alerts and how to enable notifications for them can be found in the [alerts documentation](observability.md#alerts).

The `providerConfig.caches[].volume.storageClassName` field is the name of the StorageClass used by the registry cache volume.
This field is immutable. If the field is not specified, then the [default StorageClass](https://kubernetes.io/docs/concepts/storage/storage-classes/#default-storageclass) will be used.
80 changes: 80 additions & 0 deletions docs/usage/registry-cache/observability.md
@@ -0,0 +1,80 @@
# Registry Cache Observability

The `registry-cache` extension exposes metrics for the registry caches running in the Shoot cluster so that they can be easily viewed by cluster owners and operators in the Shoot's Prometheus and Plutono instances. The exposed monitoring data provides an overview of the performance of the pull-through caches, including hit rate and network traffic data.

## Metrics

A registry cache serves [several metrics](https://github.com/distribution/distribution/blob/v3.0.0-rc.2/registry/proxy/proxymetrics.go#L12-L21). The metrics are scraped by the [Shoot's Prometheus instance](https://github.com/gardener/gardener/blob/master/docs/monitoring/README.md#shoot-prometheus).

The `Registry Caches` dashboard in the Shoot's Plutono instance contains several panels which are built using the registry cache metrics. From the `Registry` dropdown menu you can select the upstream for which you wish the metrics to be displayed (by default, metrics are summed for all upstream registries).

The following is a list of all exposed registry cache metrics. The `upstream_host` label can be used to determine the upstream host to which the metrics are related, while the `type` label can be used to determine whether the metric is for an image `blob` or an image `manifest`:

#### registry_proxy_requests_total

The total number of incoming requests received.
- Type: Counter
- Labels: `upstream_host` `type`

#### registry_proxy_hits_total

The total number of cache hits, i.e. the requested content exists in the registry cache's image store and is served from there (the upstream is not contacted at all for serving the requested content).
- Type: Counter
- Labels: `upstream_host` `type`

#### registry_proxy_misses_total

The total number of cache misses, i.e. the requested content does not exist in the registry cache's image store and is fetched from the upstream.
- Type: Counter
- Labels: `upstream_host` `type`

#### registry_proxy_pulled_bytes_total

The total number of bytes that the registry cache has pulled from the upstream.
- Type: Counter
- Labels: `upstream_host` `type`

#### registry_proxy_pushed_bytes_total

The total number of bytes pushed to the registry cache's clients.
- Type: Counter
- Labels: `upstream_host` `type`
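
All of the metrics above are cumulative counters, so a figure such as the cache hit ratio for a time window is usually derived from the difference between two scrapes. The following is a minimal Python sketch of that arithmetic. The helper name `hit_ratio`, the label values, and the counter readings are hypothetical and are not part of the extension.

```python
from typing import Dict, Tuple

# Key: (upstream_host, type); value: cumulative counter reading.
Sample = Dict[Tuple[str, str], float]


def hit_ratio(hits_start: Sample, hits_end: Sample,
              requests_start: Sample, requests_end: Sample) -> Dict[Tuple[str, str], float]:
    """Return the cache hit ratio per (upstream_host, type) between two scrapes."""
    ratios = {}
    for key, req_end in requests_end.items():
        delta_requests = req_end - requests_start.get(key, 0.0)
        delta_hits = hits_end.get(key, 0.0) - hits_start.get(key, 0.0)
        if delta_requests <= 0 or delta_hits < 0:
            # No traffic in the window, or a counter reset: skip the series.
            continue
        ratios[key] = delta_hits / delta_requests
    return ratios


# Hypothetical readings of registry_proxy_requests_total and
# registry_proxy_hits_total at two points in time (illustration only).
requests_t0 = {("docker.io", "blob"): 100.0, ("docker.io", "manifest"): 40.0}
requests_t1 = {("docker.io", "blob"): 180.0, ("docker.io", "manifest"): 60.0}
hits_t0 = {("docker.io", "blob"): 50.0, ("docker.io", "manifest"): 10.0}
hits_t1 = {("docker.io", "blob"): 110.0, ("docker.io", "manifest"): 25.0}

ratios = hit_ratio(hits_t0, hits_t1, requests_t0, requests_t1)
```

Keeping the `(upstream_host, type)` pair as the key mirrors the two labels carried by every metric, so blob and manifest traffic can be inspected separately per upstream.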

## Alerts

There are two alerts defined for the registry cache `PersistentVolume` in the Shoot's Prometheus instance:

#### RegistryCachePersistentVolumeUsageCritical

This indicates that the registry cache `PersistentVolume` is almost full and less than 5% is free. When there is no available disk space, no new images will be cached. However, image pull operations are not affected. An alert is fired when the following expression evaluates to true:

```
100 * (
kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"^cache-volume-registry-.+$"}
/
kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"^cache-volume-registry-.+$"}
) < 5
```
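
To make the threshold concrete, the same arithmetic can be sketched in Python. This is only an illustration of the expression above, not code from the extension. The function names and the byte values are assumptions chosen for the example.

```python
def free_percent(available_bytes: int, capacity_bytes: int) -> float:
    """Percentage of the volume that is still free."""
    return 100 * (available_bytes / capacity_bytes)


def is_usage_critical(available_bytes: int, capacity_bytes: int,
                      threshold_percent: float = 5.0) -> bool:
    """Mirror of the alert condition: free space strictly below the threshold."""
    return free_percent(available_bytes, capacity_bytes) < threshold_percent


MIB = 1024 ** 2
capacity = 10 * 1024 * MIB  # hypothetical 10Gi cache volume

critical = is_usage_critical(256 * MIB, capacity)   # 2.5% free
healthy = is_usage_critical(1024 * MIB, capacity)   # 10% free
```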

#### RegistryCachePersistentVolumeFullInFourDays

This indicates that the registry cache `PersistentVolume` is expected to fill up within four days based on recent sampling. An alert is fired when the following expression evaluates to true:

```
100 * (
kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"^cache-volume-registry-.+$"}
/
kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"^cache-volume-registry-.+$"}
) < 15
and
predict_linear(kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"^cache-volume-registry-.+$"}[30m], 4 * 24 * 3600) <= 0
```
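
The expression combines two conditions: less than 15% of the volume is free, and a linear extrapolation of the last 30 minutes of samples reaches zero within four days. The Python sketch below illustrates that idea with an ordinary least-squares fit. It is not the Prometheus implementation of `predict_linear`; the function names and the sample values are assumptions made for the example.

```python
from typing import List, Tuple

FOUR_DAYS_SECONDS = 4 * 24 * 3600


def predict_linear(samples: List[Tuple[float, float]],
                   eval_time: float, horizon_seconds: float) -> float:
    """Least-squares linear extrapolation of (timestamp, value) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("at least two samples are required")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must have distinct timestamps")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (eval_time + horizon_seconds)


def full_in_four_days(samples: List[Tuple[float, float]],
                      capacity_bytes: float, eval_time: float) -> bool:
    """Both parts of the alert expression must hold."""
    latest_available = samples[-1][1]
    low_on_space = 100 * (latest_available / capacity_bytes) < 15
    predicted = predict_linear(samples, eval_time, FOUR_DAYS_SECONDS)
    return low_on_space and predicted <= 0


MIB = 1024 ** 2
capacity = 10 * 1024 * MIB  # hypothetical 10Gi cache volume

# Hypothetical available-bytes samples over 30 minutes, shrinking by 10 MiB
# every 10 minutes.
shrinking = [(0, 1400 * MIB), (600, 1390 * MIB), (1200, 1380 * MIB), (1800, 1370 * MIB)]
# Same free space, but no growth in usage.
steady = [(t, 1370 * MIB) for t in (0, 600, 1200, 1800)]

fires = full_in_four_days(shrinking, capacity, eval_time=1800)
quiet = full_in_four_days(steady, capacity, eval_time=1800)
```

Requiring both conditions avoids alerting on a volume that is filling quickly but still has plenty of free space, as well as on a nearly full volume whose usage has stopped growing.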

Users can subscribe to these alerts by following the Gardener [alerting guide](https://github.com/gardener/gardener/blob/master/docs/monitoring/alerting.md#alerting-for-users).

## Logging

To view the registry cache logs in Plutono, navigate to the `Explore` tab and select `vali` from the `Explore` dropdown menu. Afterwards, enter one of the following `vali` queries:

- `{container_name="registry-cache"}` to view the logs for all registries.
- `{pod_name=~"registry-<upstream_host>.+"}` to view the logs for a specific upstream registry.
