
Alertmanager: Initialize skipped Grafana Alertmanagers receiving requests #10691

Open · santihernandezc wants to merge 22 commits into main from santihernandezc/initialize_skipped_grafana_alertmanagers_receiving_requests
Conversation

@santihernandezc (Contributor) commented Feb 19, 2025

Description

If the grafana_alertmanager_conditionally_skip_tenant_suffix option is configured and a Grafana Alertmanager tenant doesn't have a promoted, non-default, non-empty configuration, we skip initializing it.
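For illustration, the skip decision boils down to a suffix match plus the "usable configuration" check. The sketch below is a hedged paraphrase with hypothetical names, not the actual Mimir code:

package alertmanager

import "strings"

// shouldSkipInitialization sketches the rule described above: a tenant whose
// ID matches the configured suffix and who has no promoted, non-default,
// non-empty Grafana configuration is not initialized. Illustrative only, not
// Mimir's actual implementation.
func shouldSkipInitialization(userID, skipSuffix string, hasUsableGrafanaConfig bool) bool {
	return skipSuffix != "" && strings.HasSuffix(userID, skipSuffix) && !hasUsableGrafanaConfig
}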

The problem is that clients making requests against an uninitialized Alertmanager get 406 responses. Unless a Grafana Alertmanager has a "usable" configuration, users won't be able to test templates, test or get receivers, create silences, etc.

This PR makes the multi-tenant Alertmanager start per-tenant Grafana Alertmanagers on incoming requests. This way, requests can be handled even if tenants were initially skipped.

It also adds a grace period for idle Alertmanagers. Whenever a skipped Alertmanager receives a request, we start it and record a timestamp of when the request was received. After the grace period elapses, we shut the Alertmanager down again.
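To make the idle handling concrete, the bookkeeping can be pictured roughly as in the sketch below. This is a simplified illustration, not the PR's code: receivingRequests shows up in the diff excerpts further down, but idleTracker, gracePeriod, noteRequest, and cleanupIdle are hypothetical names.

package alertmanager

import (
	"sync"
	"time"
)

// idleTracker is an illustrative sketch of the grace-period bookkeeping for
// on-demand Alertmanagers; it is not the type used in this PR.
type idleTracker struct {
	receivingRequests sync.Map      // userID -> time of the most recent request
	gracePeriod       time.Duration // how long an on-demand Alertmanager may stay idle
}

// noteRequest records that an initially skipped tenant received a request.
func (t *idleTracker) noteRequest(userID string) {
	t.receivingRequests.Store(userID, time.Now())
}

// cleanupIdle shuts down Alertmanagers that have been idle for longer than
// the grace period; the stop callback stands in for the real shutdown path.
func (t *idleTracker) cleanupIdle(stop func(userID string)) {
	now := time.Now()
	t.receivingRequests.Range(func(key, value any) bool {
		userID, lastRequest := key.(string), value.(time.Time)
		if now.Sub(lastRequest) > t.gracePeriod {
			stop(userID)
			t.receivingRequests.Delete(userID)
		}
		return true
	})
}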

Testing

I tested this PR by spinning up two Alertmanagers (read-write mode) with:

  • multitenancy_enabled: true
  • grafana_alertmanager_compatibility_enabled: true
  • grafana_alertmanager_conditionally_skip_tenant_suffix: -grafana

I then created 200 Alertmanager tenants:

  • 100 of them not matching the configured suffix and using an empty configuration
  • 99 of them matching the suffix and using an empty configuration
  • 1 of them matching the suffix and using a promoted, non-default, non-empty config

The 99 tenants matching the suffix and using an empty configuration were initially skipped. I then sent test alerts for each tenant matching the suffix, and the Alertmanagers for all of them were started.

After the default grace period passed, all Alertmanagers for tenants matching the configured suffix were stopped, except the one using a promoted, non-default, non-empty configuration.

Query for tenants skipped per instance:

[screenshot]

Query for active Alertmanagers by type (Mimir/Grafana):

[screenshot]

github-actions bot commented Feb 20, 2025

@@ -104,7 +104,6 @@ type Config struct {
PersisterConfig PersisterConfig

GrafanaAlertmanagerCompatibility bool
GrafanaAlertmanagerTenantSuffix string
@santihernandezc (Contributor, Author) commented:

Unrelated fix, this was not being used here.

@santihernandezc changed the title from "(WIP) Alertmanager: Initialize skipped Grafana Alertmanagers receiving requests" to "Alertmanager: Initialize skipped Grafana Alertmanagers receiving requests" on Feb 20, 2025
@santihernandezc marked this pull request as ready for review on February 20, 2025 16:32
@santihernandezc requested review from a team and tacole02 as code owners on February 20, 2025 16:33
@tacole02 (Contributor) left a comment:

Docs look great! Thank you!

@@ -727,12 +743,35 @@ func (am *MultitenantAlertmanager) computeConfig(cfgs alertspb.AlertConfigDescs)
AlertConfigDesc: cfgs.Mimir,
A reviewer (Member) commented:
Concern: This method is doing a lot and appears to have code that doesn't match our current -grafana tenant strategy. For example, it's not clear to me what the following is meant to be doing anymore and why we don't need it in the new startAlertmanager code:

	if cfgs.Mimir.RawConfig == am.fallbackConfig || cfgs.Mimir.RawConfig == "" {
		level.Debug(am.logger).Log("msg", "using grafana config with the default globals", "user", cfgs.Mimir.User)
		cfg, err := am.createUsableGrafanaConfig(cfgs.Grafana, am.fallbackConfig)
		return cfg, true, err
	}

If it is, in fact, not needed in startAlertmanager, we might want to start cleaning up some of the redundant code that doesn't fit the current strategy. Or at least extract it somewhere where it's clear it's not part of the functional flow.

@santihernandezc (Contributor, Author) replied:

That bit of code gets executed after we've checked for a usable (promoted, non-default, non-empty) Grafana configuration. If we reach this far, the tenant has a Grafana configuration we can use to start their Alertmanager.

It indeed doesn't match our current approach. It's part of the original one, where Grafana and non-Grafana tenants were the same and couldn't be distinguished by a suffix.

It's not necessary to add computeConfig() to the startAlertmanager() function, as the latter will only be called for skipped Grafana Alertmanagers. If a Grafana Alertmanager was skipped, it had no usable configuration, so we can just use a default config.

@@ -1025,6 +1084,33 @@ func (am *MultitenantAlertmanager) serveRequest(w http.ResponseWriter, req *http
http.Error(w, "the Alertmanager is not configured", http.StatusPreconditionFailed)
}

// startAlertmanager will start the Alertmanager for a tenant, using the fallback configuration if no config is found.
func (am *MultitenantAlertmanager) startAlertmanager(ctx context.Context, userID string) (*Alertmanager, error) {
A reviewer (Member) commented:
I think this method can be combined with alertmanagerFromFallbackConfig; they are effectively doing the same thing. Probably keep the startAlertmanager name and make sure the correct errors are returned from the combined method.

@santihernandezc (Contributor, Author) replied on Feb 28, 2025:
I didn't want to change the behavior of Mimir Alertmanager tenants, at least not yet. I was planning on doing that in a future PR, when applying strict initialization to all tenants.
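For readers following this thread, the on-demand start path can be pictured roughly as in the sketch below. It is a hedged paraphrase, not the code in this PR: loadGrafanaConfig, newAlertmanager, and tenantStarter are hypothetical stand-ins, and error handling is reduced to the essentials.

package alertmanager

import (
	"context"
	"errors"
)

// Hypothetical stand-ins for Mimir's real types and helpers, included only to
// keep this sketch self-contained.
type alertmanager struct{}

type grafanaConfig struct {
	raw    string
	usable bool // promoted, non-default, non-empty
}

type tenantStarter struct {
	fallbackConfig    string
	loadGrafanaConfig func(ctx context.Context, userID string) (grafanaConfig, error)
	newAlertmanager   func(userID, rawConfig string) (*alertmanager, error)
}

// startAlertmanager sketches the on-demand start: use the tenant's usable
// Grafana configuration if one exists, otherwise fall back to the default.
// Skipped tenants by definition had no usable configuration, so in practice
// this path almost always starts with the fallback config.
func (s *tenantStarter) startAlertmanager(ctx context.Context, userID string) (*alertmanager, error) {
	cfg, err := s.loadGrafanaConfig(ctx, userID)
	if err != nil {
		return nil, err
	}
	raw := s.fallbackConfig
	if cfg.usable {
		raw = cfg.raw
	}
	if raw == "" {
		return nil, errors.New("no configuration available for tenant " + userID)
	}
	return s.newAlertmanager(userID, raw)
}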


// If the Alertmanager initialization was skipped, start the Alertmanager.
if _, ok := am.receivingRequests.Load(userID); ok {
userAM, err = am.startAlertmanager(req.Context(), userID)
A reviewer (Member) commented:
This seems like a more interesting metric to collect: when a skipped config was initialized because of a request. This will give us a feel for flappiness and how effective our idle timeout value is.
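If such a metric were added, a plain counter incremented wherever the request path starts a skipped tenant's Alertmanager would likely be enough. The sketch below uses a hypothetical metric name; it is not something this PR defines.

package alertmanager

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical counter for the suggestion above: how often an initially
// skipped tenant's Alertmanager was started because of an incoming request.
// The metric name is an assumption and is not defined by this PR.
var skippedTenantsStartedOnRequest = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "cortex_alertmanager_skipped_tenants_started_on_request_total",
	Help: "Total number of initially skipped Alertmanagers started on demand by an incoming request.",
})

// registerSkippedTenantMetrics wires the counter into a registry; the request
// path would then call skippedTenantsStartedOnRequest.Inc() right where
// startAlertmanager is invoked for a skipped tenant.
func registerSkippedTenantMetrics(reg prometheus.Registerer) {
	reg.MustRegister(skippedTenantsStartedOnRequest)
}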

@santihernandezc force-pushed the santihernandezc/initialize_skipped_grafana_alertmanagers_receiving_requests branch from 85a54d5 to 79c89f8 on February 28, 2025 14:30
@santihernandezc force-pushed the santihernandezc/initialize_skipped_grafana_alertmanagers_receiving_requests branch from bf0cee2 to 5ba975f on March 3, 2025 14:49
@santihernandezc force-pushed the santihernandezc/initialize_skipped_grafana_alertmanagers_receiving_requests branch from 845c08c to 8e88697 on March 5, 2025 13:28