-
Notifications
You must be signed in to change notification settings - Fork 569
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alertmanager: Initialize skipped Grafana Alertmanagers receiving requests #10691
base: main
Are you sure you want to change the base?
Alertmanager: Initialize skipped Grafana Alertmanagers receiving requests #10691
Conversation
💻 Deploy preview available: https://deploy-preview-mimir-10691-zb444pucvq-vp.a.run.app/docs/mimir/latest/ |
pkg/alertmanager/alertmanager.go
Outdated
@@ -104,7 +104,6 @@ type Config struct { | |||
PersisterConfig PersisterConfig | |||
|
|||
GrafanaAlertmanagerCompatibility bool | |||
GrafanaAlertmanagerTenantSuffix string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated fix, this was not being used here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docs look great! Thank you!
…mestamp for tenants not yet in the receivingRequests map
…nandezc/initialize_skipped_grafana_alertmanagers_receiving_requests
@@ -727,12 +743,35 @@ func (am *MultitenantAlertmanager) computeConfig(cfgs alertspb.AlertConfigDescs) | |||
AlertConfigDesc: cfgs.Mimir, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concern: This method is doing a lot and appears to have code that doesn't match our current -grafana
tenant strategy. For example, it's not clear to me what the following is meant to be doing anymore and why we don't need it in the new startAlertmanager
code:
if cfgs.Mimir.RawConfig == am.fallbackConfig || cfgs.Mimir.RawConfig == "" {
level.Debug(am.logger).Log("msg", "using grafana config with the default globals", "user", cfgs.Mimir.User)
cfg, err := am.createUsableGrafanaConfig(cfgs.Grafana, am.fallbackConfig)
return cfg, true, err
}
If it is in fact, not needed in startAlertmanager
we might want to start cleaning up some of the redundant code that doesn't fit the current strategy. Or at least extract it somewhere where it is clear it's not part of the functional flow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That bit of code gets executed after we've checked for usable (promoted, non-default, non-empty) Grafana configuration. If we reach this far, the tenant has a Grafana configuration we can use to start their Alertmanager.
It does indeed not match our current approach. It's part of the original one, where Grafana and non-Grafana tenants were the same, and they couldn't be distinguished by a suffix.
It's not necessary to add computeConfig()
to the startAlertmanager()
function as it will only be called for skipped Grafana Alertmanagers. If a Grafana Alertmanager was skipped, it had no usable configuration, so we can just use a default config.
pkg/alertmanager/multitenant.go
Outdated
@@ -1025,6 +1084,33 @@ func (am *MultitenantAlertmanager) serveRequest(w http.ResponseWriter, req *http | |||
http.Error(w, "the Alertmanager is not configured", http.StatusPreconditionFailed) | |||
} | |||
|
|||
// startAlertmanager will start the Alertmanager for a tenant, using the fallback configuration if no config is found. | |||
func (am *MultitenantAlertmanager) startAlertmanager(ctx context.Context, userID string) (*Alertmanager, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this method can be combined with alertmanagerFromFallbackConfig
, they are effectively doing the same thing. Probably keep the startAlertmanager
name and make sure the correct errors are returned from the combined method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't want to change the behavior of Mimir Alertmanager tenants, at least not yet. I was planning on doing that in a future PR, when applying strict initialization to all tenants.
pkg/alertmanager/multitenant.go
Outdated
|
||
// If the Alertmanager initialization was skipped, start the Alertmanager. | ||
if _, ok := am.receivingRequests.Load(userID); ok { | ||
userAM, err = am.startAlertmanager(req.Context(), userID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a more interesting metric to collect: When a skipped config was initialized because of a request. This will give us a feel for flappiness and how effectively our idle timeout value is.
…nandezc/initialize_skipped_grafana_alertmanagers_receiving_requests
85a54d5
to
79c89f8
Compare
bf0cee2
to
5ba975f
Compare
…nandezc/initialize_skipped_grafana_alertmanagers_receiving_requests
…ways use the default config in startAlertmanager()
845c08c
to
8e88697
Compare
Description
If the
grafana_alertmanager_conditionally_skip_tenant_suffix
option is configured and a Grafana Alertmanager tenant doesn't have a promoted, non-default, non-empty configuration, we skip initializing it.The problem is that clients making requests against an uninitialized Alertmanager get
406
s. Unless a Grafana Alertmanager has a "usable" configuration, users won't be able to test templates, test and get receivers, create silences, etc.This PR makes the multi-tenant Alertmanager start per-tenant Grafana Alertmanagers on incoming requests. This way, requests can be handled even if tenants were initially skipped.
It also adds a grace period for idle Alertmanagers. Whenever a skipped Alertmanager gets a request, we start the Alertmanager and keep a timestamp indicating when this request was received. After the grace period elapses, we shut down the Alertmanager.
Testing
I tested this PR by spinning up two Alertmanagers (read-write mode) with:
multitenancy_enabled: true
grafana_alertmanager_compatibility_enabled: true
grafana_alertmanager_conditionally_skip_tenant_suffix: -grafana
I then created 200 Alertmanager tenants with empty configurations:
The 99 tenants with empty configuration and matching the suffix were initially skipped. I then sent test alerts for each tenant matching the suffix. Alertmanagers for each of them were started.
After the default grace period passed, all Alertmanagers for tenants matching the configured suffix were stopped, except the one using a promoted, non-default, non-empty configuration.
Query for tenants skipped per instance
Query for active Alertmanagers by type (Mimir/Grafana)