
Alertmanager: Initialize skipped Grafana Alertmanagers receiving requests #10691

Open · santihernandezc wants to merge 22 commits into main from santihernandezc/initialize_skipped_grafana_alertmanagers_receiving_requests
Conversation

@santihernandezc (Contributor) commented Feb 19, 2025

Description

If the grafana_alertmanager_conditionally_skip_tenant_suffix option is configured and a Grafana Alertmanager tenant doesn't have a promoted, non-default, non-empty configuration, we skip initializing it.
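For illustration, the skip decision boils down to a suffix match plus the "usable configuration" check. The sketch below is a hedged paraphrase with hypothetical names, not the actual Mimir code:

package alertmanager

import "strings"

// shouldSkipInitialization sketches the rule described above: a tenant whose
// ID matches the configured suffix and who has no promoted, non-default,
// non-empty Grafana configuration is not initialized. Illustrative only, not
// Mimir's actual implementation.
func shouldSkipInitialization(userID, skipSuffix string, hasUsableGrafanaConfig bool) bool {
	return skipSuffix != "" && strings.HasSuffix(userID, skipSuffix) && !hasUsableGrafanaConfig
}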

The problem is that clients making requests against an uninitialized Alertmanager get 406 responses. Unless a Grafana Alertmanager has a "usable" configuration, users won't be able to test templates, test or get receivers, create silences, etc.

This PR makes the multi-tenant Alertmanager start per-tenant Grafana Alertmanagers on incoming requests. This way, requests can be handled even if tenants were initially skipped.

It also adds a grace period for idle Alertmanagers. Whenever a skipped Alertmanager receives a request, we start it and record a timestamp of when the request was received. After the grace period elapses, we shut the Alertmanager down again.
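To make the idle handling concrete, the bookkeeping can be pictured roughly as in the sketch below. This is a simplified illustration, not the PR's code: receivingRequests shows up in the diff excerpts further down, but idleTracker, gracePeriod, noteRequest, and cleanupIdle are hypothetical names.

package alertmanager

import (
	"sync"
	"time"
)

// idleTracker is an illustrative sketch of the grace-period bookkeeping for
// on-demand Alertmanagers; it is not the type used in this PR.
type idleTracker struct {
	receivingRequests sync.Map      // userID -> time of the most recent request
	gracePeriod       time.Duration // how long an on-demand Alertmanager may stay idle
}

// noteRequest records that an initially skipped tenant received a request.
func (t *idleTracker) noteRequest(userID string) {
	t.receivingRequests.Store(userID, time.Now())
}

// cleanupIdle shuts down Alertmanagers that have been idle for longer than
// the grace period; the stop callback stands in for the real shutdown path.
func (t *idleTracker) cleanupIdle(stop func(userID string)) {
	now := time.Now()
	t.receivingRequests.Range(func(key, value any) bool {
		userID, lastRequest := key.(string), value.(time.Time)
		if now.Sub(lastRequest) > t.gracePeriod {
			stop(userID)
			t.receivingRequests.Delete(userID)
		}
		return true
	})
}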

Testing

I tested this PR by spinning up two Alertmanagers (read-write mode) with:

  • multitenancy_enabled: true
  • grafana_alertmanager_compatibility_enabled: true
  • grafana_alertmanager_conditionally_skip_tenant_suffix: -grafana

I then created 200 Alertmanager tenants:

  • 100 of them not matching the configured suffix and using an empty configuration
  • 99 of them matching the suffix and using an empty configuration
  • 1 of them matching the suffix and using a promoted, non-default, non-empty config

The 99 tenants matching the suffix and using an empty configuration were initially skipped. I then sent test alerts for each tenant matching the suffix, and the Alertmanagers for all of them were started.

After the default grace period passed, all Alertmanagers for tenants matching the configured suffix were stopped, except the one using a promoted, non-default, non-empty configuration.

Query for tenants skipped per instance:

[screenshot]

Query for active Alertmanagers by type (Mimir/Grafana):

[screenshot]

github-actions bot commented Feb 20, 2025

@@ -104,7 +104,6 @@ type Config struct {
PersisterConfig PersisterConfig

GrafanaAlertmanagerCompatibility bool
GrafanaAlertmanagerTenantSuffix string
@santihernandezc (Contributor, Author) commented:

Unrelated fix, this was not being used here.

@santihernandezc changed the title from "(WIP) Alertmanager: Initialize skipped Grafana Alertmanagers receiving requests" to "Alertmanager: Initialize skipped Grafana Alertmanagers receiving requests" on Feb 20, 2025
@santihernandezc marked this pull request as ready for review on February 20, 2025 16:32
@santihernandezc requested review from a team and tacole02 as code owners on February 20, 2025 16:33
@tacole02 (Contributor) left a comment:

Docs look great! Thank you!

@@ -727,12 +743,35 @@ func (am *MultitenantAlertmanager) computeConfig(cfgs alertspb.AlertConfigDescs)
AlertConfigDesc: cfgs.Mimir,
A reviewer (Member) commented:
Concern: This method is doing a lot and appears to have code that doesn't match our current -grafana tenant strategy. For example, it's not clear to me what the following is meant to be doing anymore and why we don't need it in the new startAlertmanager code:

	if cfgs.Mimir.RawConfig == am.fallbackConfig || cfgs.Mimir.RawConfig == "" {
		level.Debug(am.logger).Log("msg", "using grafana config with the default globals", "user", cfgs.Mimir.User)
		cfg, err := am.createUsableGrafanaConfig(cfgs.Grafana, am.fallbackConfig)
		return cfg, true, err
	}

If it is, in fact, not needed in startAlertmanager, we might want to start cleaning up some of the redundant code that doesn't fit the current strategy. Or at least extract it somewhere where it's clear it's not part of the functional flow.

@santihernandezc (Contributor, Author) replied:

That bit of code gets executed after we've checked for a usable (promoted, non-default, non-empty) Grafana configuration. If we reach this far, the tenant has a Grafana configuration we can use to start their Alertmanager.

It indeed doesn't match our current approach. It's part of the original one, where Grafana and non-Grafana tenants were the same and couldn't be distinguished by a suffix.

It's not necessary to add computeConfig() to the startAlertmanager() function, as the latter will only be called for skipped Grafana Alertmanagers. If a Grafana Alertmanager was skipped, it had no usable configuration, so we can just use a default config.

@@ -1025,6 +1084,33 @@ func (am *MultitenantAlertmanager) serveRequest(w http.ResponseWriter, req *http
http.Error(w, "the Alertmanager is not configured", http.StatusPreconditionFailed)
}

// startAlertmanager will start the Alertmanager for a tenant, using the fallback configuration if no config is found.
func (am *MultitenantAlertmanager) startAlertmanager(ctx context.Context, userID string) (*Alertmanager, error) {
A reviewer (Member) commented:
I think this method can be combined with alertmanagerFromFallbackConfig; they are effectively doing the same thing. Probably keep the startAlertmanager name and make sure the correct errors are returned from the combined method.

@santihernandezc (Contributor, Author) replied on Feb 28, 2025:
I didn't want to change the behavior of Mimir Alertmanager tenants, at least not yet. I was planning on doing that in a future PR, when applying strict initialization to all tenants.
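For readers following this thread, the on-demand start path can be pictured roughly as in the sketch below. It is a hedged paraphrase, not the code in this PR: loadGrafanaConfig, newAlertmanager, and tenantStarter are hypothetical stand-ins, and error handling is reduced to the essentials.

package alertmanager

import (
	"context"
	"errors"
)

// Hypothetical stand-ins for Mimir's real types and helpers, included only to
// keep this sketch self-contained.
type alertmanager struct{}

type grafanaConfig struct {
	raw    string
	usable bool // promoted, non-default, non-empty
}

type tenantStarter struct {
	fallbackConfig    string
	loadGrafanaConfig func(ctx context.Context, userID string) (grafanaConfig, error)
	newAlertmanager   func(userID, rawConfig string) (*alertmanager, error)
}

// startAlertmanager sketches the on-demand start: use the tenant's usable
// Grafana configuration if one exists, otherwise fall back to the default.
// Skipped tenants by definition had no usable configuration, so in practice
// this path almost always starts with the fallback config.
func (s *tenantStarter) startAlertmanager(ctx context.Context, userID string) (*alertmanager, error) {
	cfg, err := s.loadGrafanaConfig(ctx, userID)
	if err != nil {
		return nil, err
	}
	raw := s.fallbackConfig
	if cfg.usable {
		raw = cfg.raw
	}
	if raw == "" {
		return nil, errors.New("no configuration available for tenant " + userID)
	}
	return s.newAlertmanager(userID, raw)
}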


// If the Alertmanager initialization was skipped, start the Alertmanager.
if _, ok := am.receivingRequests.Load(userID); ok {
userAM, err = am.startAlertmanager(req.Context(), userID)
A reviewer (Member) commented:
This seems like a more interesting metric to collect: when a skipped config was initialized because of a request. This will give us a feel for flappiness and how effective our idle timeout value is.
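If such a metric were added, a plain counter incremented wherever the request path starts a skipped tenant's Alertmanager would likely be enough. The sketch below uses a hypothetical metric name; it is not something this PR defines.

package alertmanager

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical counter for the suggestion above: how often an initially
// skipped tenant's Alertmanager was started because of an incoming request.
// The metric name is an assumption and is not defined by this PR.
var skippedTenantsStartedOnRequest = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "cortex_alertmanager_skipped_tenants_started_on_request_total",
	Help: "Total number of initially skipped Alertmanagers started on demand by an incoming request.",
})

// registerSkippedTenantMetrics wires the counter into a registry; the request
// path would then call skippedTenantsStartedOnRequest.Inc() right where
// startAlertmanager is invoked for a skipped tenant.
func registerSkippedTenantMetrics(reg prometheus.Registerer) {
	reg.MustRegister(skippedTenantsStartedOnRequest)
}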

@santihernandezc force-pushed the santihernandezc/initialize_skipped_grafana_alertmanagers_receiving_requests branch from 85a54d5 to 79c89f8 on February 28, 2025 14:30
@santihernandezc force-pushed the santihernandezc/initialize_skipped_grafana_alertmanagers_receiving_requests branch from bf0cee2 to 5ba975f on March 3, 2025 14:49
@santihernandezc force-pushed the santihernandezc/initialize_skipped_grafana_alertmanagers_receiving_requests branch from 845c08c to 8e88697 on March 5, 2025 13:28