Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Different aggregation groups can share the same nflog #3808

Open
grobinson-grafana opened this issue Apr 15, 2024 · 1 comment
Open

Different aggregation groups can share the same nflog #3808

grobinson-grafana opened this issue Apr 15, 2024 · 1 comment

Comments

@grobinson-grafana
Copy link
Contributor

What did you do?

In certain cases, it is possible for two (or more) different aggregation groups to share an nflog. The conditions for this to happen are the following:

  1. There must be at least two routes with the same matchers and group labels (group_by).
  2. At least one of the routes must have continue: true.
  3. Both routes must use the same receivers.

For example, take the following configuration:

receivers:
  - name: test1
    webhook_configs:
      - url: http://127.0.0.1:8080/test1
  - name: test2
    webhook_configs:
      - url: http://127.0.0.1:8080/test2
route:
  receiver: test1
  group_wait: 15s
  group_interval: 30s
  routes:
    - continue: true
      receiver: test1
      matchers:
      - foo=bar
    - continue: true
      receiver: test1
      matchers:
      - foo=bar

This configuration meets all of the requirements to share an nflog.

When an Alertmanager is run with this configuration, you can see from the flushing alerts and Notify success lines that two aggregation groups are created and then flushed:

ts=2024-04-15T08:33:35.749Z caller=dispatch.go:164 level=debug component=dispatcher msg="Received alert" alert=[3fff2c2][active]
ts=2024-04-15T08:33:50.750Z caller=dispatch.go:516 level=debug component=dispatcher aggrGroup="{}/{foo=\"bar\"}:{}" msg=flushing alerts=[[3fff2c2][active]]
ts=2024-04-15T08:33:50.749Z caller=dispatch.go:516 level=debug component=dispatcher aggrGroup="{}/{foo=\"bar\"}:{}" msg=flushing alerts=[[3fff2c2][active]]
ts=2024-04-15T08:33:50.753Z caller=notify.go:863 level=debug component=dispatcher receiver=test1 integration=webhook[0] aggrGroup="{}/{foo=\"bar\"}:{}" alerts=[[3fff2c2][active]] msg="Notify success" attempts=1 duration=2.743458ms
ts=2024-04-15T08:33:50.753Z caller=notify.go:863 level=debug component=dispatcher receiver=test1 integration=webhook[0] aggrGroup="{}/{foo=\"bar\"}:{}" alerts=[[3fff2c2][active]] msg="Notify success" attempts=1 duration=2.85275ms

However, reading the nflog file (after shutdown) shows just one nflog entry on disk:

N
>
{}/{foo="bar"}:{}
test1webhook*
            ����˗�2	����ދ��
                               ���˗�%

However, when each route uses a different receiver then there are two entries on disk:

N
>
{}/{foo="bar"}:{}
test1webhook*
            ������2	����ދ��
                               �����N
>
{}/{foo="bar"}:{}
test2webhook*
            ���в�2	����ދ��
                               ��в�%

I expect this to create even more issues if the routes have different timers (i.e. group_wait, group_interval and repeat_interval) or active and mute time intervals:

route:  
  receiver: test1  
  group_wait: 15s  
  group_interval: 30s  
  routes:  
    - continue: true  
      receiver: test1  
      matchers:  
      - foo=bar  
    - continue: true  
      receiver: test1  
      matchers:  
      - foo=bar  
      group_wait: 30s  
      group_interval: 5m  
      mute_time_intervals:  
        - evenings

What did you expect to see?

I expected to see a separate entry for each aggregation group. Here is the nflog from the first example, but instead running the code in this branch:

P
@
{}/{foo="bar"}/1:{}
test1webhook*
            ���ػ��2	����ދ��
                               Ԯ�ػ��P
@
{}/{foo="bar"}/0:{}
test1webhook*
            ������2	����ދ��
                               Ԯ����%
grobinson-grafana added a commit to grobinson-grafana/alertmanager that referenced this issue Apr 15, 2024
This commit replaces the code in route.Key() with that of route.ID(),
and removes route.ID(). The motivation behind this change is to fix
a number of bugs caused by conflicting group keys such as
"Different aggregation groups can share the same nflog" (prometheus#3808)
and also prevent an issue where groups are incorrectly marked as muted
when they are not.

Signed-off-by: George Robinson <[email protected]>
@grobinson-grafana
Copy link
Contributor Author

This is related to #3817.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant