
IF event's attribute (environment) value filter not working in Issue Alert #83685

Open
realkosty opened this issue Jan 17, 2025 · 20 comments

@realkosty
Contributor

realkosty commented Jan 17, 2025

Environment

SaaS (https://sentry.io/)

Steps to Reproduce

  1. Create two alerts (see customer case) that are identical except that one uses the event's {attribute="environment"} value {match="equals"} {value="beta"} filter, while the other has the same environment selected in the top-level environment dropdown instead (see the sketch after this list).
  2. Send events to an issue (see customer case) where ALL events match all other conditions in both alerts and are 100% from the same matching environment.
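The sketch below is a minimal illustration (plain Python, not Sentry's actual rule engine) of why the two configurations should be equivalent; the attribute name and the "beta" value come from the report above, while the function names and event shape are made up for illustration.

```python
# Illustrative only: a toy model of the two predicates, not Sentry code.

def matches_attribute_filter(event: dict) -> bool:
    # Alert 1: IF the event's "environment" attribute equals "beta"
    return event.get("environment") == "beta"

def matches_top_level_environment(event: dict, selected_environment: str = "beta") -> bool:
    # Alert 2: the same environment chosen in the alert's top-level dropdown;
    # conceptually the same predicate applied to the same event field.
    return event.get("environment") == selected_environment

event = {"environment": "beta", "message": "example error"}
# Both predicates agree, so both alerts are expected to fire on this event.
assert matches_attribute_filter(event) and matches_top_level_environment(event)
```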

[Image attached]

Expected Result

Both alerts fire.

Actual Result

Only the alert with the top-level environment filter fires.

Product Area

Alerts

Link

No response

DSN

No response

Version

No response

@getsantry
Contributor

getsantry bot commented Jan 17, 2025

Assigning to @getsentry/support for routing ⏲️

@getsantry
Contributor

getsantry bot commented Jan 21, 2025

Routing to @getsentry/product-owners-alerts for triage ⏲️

@rachrwang

@ceorourke - can you help take a look at this one? Thanks!

@ceorourke
Member

I have spent a fair amount of time looking into this without being able to reproduce it.

First I tried to reproduce it with the simplest setup: I made a rule using the environment picker and a rule using the environment filter. Both rules fired.

Then I wrote a test for the same scenario so I could trace what might be happening, but both rules fired there as well. We then added in the additional filters from the customer's rule and still didn't encounter any problems.

Next we dug through the rule processing pipeline and how we evaluate the environment in the two different ways shown in the rules, but couldn't find any problem. We dug through the logs and tried to figure out where it may have gone wrong, but could not find anything.

Our best guess for now is that because the rules have the "Number of events in an issue is more than 100 in 1w" condition, they go through our delayed processing pipeline (a buffer that's flushed every minute). Perhaps the two rules were put into different buckets, so when we queried the number of events the two queries ran at slightly different timestamps and one of them fell below the threshold. We can't find any problem with the environment options.
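To make that theory concrete, here is a toy sketch (an illustration of the hypothesis, not Sentry's delayed-processing code): a count sitting right at the threshold can clear it in one evaluation window and miss it in a window shifted by one minute.

```python
from datetime import datetime, timedelta

def count_events(event_times, end, window):
    """Count events inside the sliding window [end - window, end]."""
    start = end - window
    return sum(1 for t in event_times if start <= t <= end)

window = timedelta(weeks=1)
threshold = 100  # "more than 100 in 1w"

# 101 events: one right at the trailing edge of the window, plus 100 recent ones.
evaluated_at = datetime(2025, 1, 17, 12, 0)
event_times = [evaluated_at - window + timedelta(seconds=30)]
event_times += [evaluated_at - timedelta(hours=i) for i in range(100)]

fires_now = count_events(event_times, evaluated_at, window) > threshold
fires_one_bucket_later = count_events(event_times, evaluated_at + timedelta(minutes=1), window) > threshold
print(fires_now, fires_one_bucket_later)  # True False: the oldest event slid out of the later window
```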

I did notice that the rules were created one minute apart, as if the creator already had a problem with a separate rule. Is that the case? It seems unlikely that someone would otherwise choose to test the two different environment options. Maybe we can look deeper into the original problem rule, if it exists.

@JParisFerrer

Really appreciate the detailed notes @ceorourke!

> I did notice that the rules were created one minute apart, as if the creator already had a problem with a separate rule. Is that the case? It seems unlikely that someone would otherwise choose to test the two different environment options. Maybe we can look deeper into the original problem rule, if it exists.

Yes, this is the case. @realkosty may have linked it in the customer case you have; it will have been invalidated since then, and I could provide him the latest link. We observed a false negative in the slightly more complicated alert rule (it mainly includes more actions, and the tag filter uses in instead of eq), which prompted Kosty to suggest this simplified experiment, which also produced a false negative.

> Our best guess for now is that because the rules have the "Number of events in an issue is more than 100 in 1w" condition, they go through our delayed processing pipeline (a buffer that's flushed every minute). Perhaps the two rules were put into different buckets, so when we queried the number of events the two queries ran at slightly different timestamps and one of them fell below the threshold.

This is an interesting theory, but I'm not sure it applies. Based on your description, it sounds like after 1 minute this bucket mismatch should no longer be a concern? But the issues that are triggering one alert and not the other have >120 events, and receive them spread over hours, not seconds. Does that disprove the theory, or did I misunderstand the delayed pipeline?

@ceorourke
Member

For the delayed pipeline, it'd be like this: if the time window were 5 minutes, the buckets could be, say, 1:05 - 1:10 and then 1:06 - 1:11, so the total number of events in each bucket may be different.

I don't see an alert link for the original rule but I'll ask about it.

@JParisFerrer

> For the delayed pipeline, it'd be like this: if the time window were 5 minutes, the buckets could be, say, 1:05 - 1:10 and then 1:06 - 1:11, so the total number of events in each bucket may be different.

Got it -- the alert window here is 7 days, the threshold is 100, and these issues had volumes of 120, 140, etc. It sounds like this theory doesn't hold up, then?

I'm going to adjust the two debugging alerts to remove this filter and see what we can learn.

@mitsuyuki418

I am facing the same issue.

The event attribute filter doesn't work for environment, but I found that the tag filter for environment does work.

@mifu67
Contributor

mifu67 commented Feb 5, 2025

Hi @mitsuyuki418,

We are investigating the issue. Do you mind sharing the other conditions on your rule?

In the meantime, please continue to use the tag filter for environment as a workaround.

@mitsuyuki418

@mifu67 I'm sorry, but I found that in my case it's only about the preview.

When I use the environment tag, the preview shows items, but with the environment attribute it shows nothing, like this:

[Image attached]

But an alert was triggered for both the tag and the attribute, so there's no problem with the alert triggers in my case. (Still, it would be nice if the preview showed correctly for the env attribute :) )

Thanks,

@mifu67
Contributor

mifu67 commented Feb 6, 2025

@mitsuyuki418,

Thank you for the additional information!

@realkosty
Contributor Author

@ceorourke @mifu67 we made progress investigating it with the customer:

  1. consistently reproducible in their project
  2. can be repro'd without oldest adopted release associated... filter
  3. can be repro'd with just 2 events (if count filter set to >1 event in 1w)
  4. affects both minidumps (original report) and handled errors

@ceorourke
Member

Are you able to reproduce it outside of their project, or figure out what's different about their events that would explain why we can't reproduce it anywhere else?

@realkosty
Contributor Author

@ceorourke yes! We're working hard on a proper repro by capturing event envelopes so we can replay them in a test org, so that you can then run it locally in a debugger.

Question: how does our alerting behave with respect to very delayed events? I.e., should we fudge timestamps to make them fresh when creating a repro?

@ceorourke
Member

Alerts are evaluated just after event ingestion, as a post-processing step.

@realkosty
Contributor Author

@ceorourke do you know if the alert evaluation step discards late events or treats them as if they just occurred (i.e., only looks at the received timestamp)?

@ceorourke
Member

When we make the query to determine if the event frequency filter passes, we use the current time and the duration set on the alert to determine the window of time we're looking at:
https://github.com/getsentry/sentry/blob/master/src/sentry/rules/conditions/event_frequency.py#L283-L284
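Paraphrased, the idea is roughly the sketch below; the function name and structure are illustrative, not the actual code in event_frequency.py.

```python
from datetime import timedelta
from django.utils import timezone

def query_window(interval: timedelta):
    # The window ends at "now" (the time the rule is evaluated) and reaches
    # back by the duration configured on the alert condition, e.g. 1w.
    end = timezone.now()
    start = end - interval
    return start, end

start, end = query_window(timedelta(weeks=1))
# The event count for the issue is then queried from Snuba over this window.
```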

@realkosty
Contributor Author

@ceorourke gotcha, thanks! And the triggering event itself: even if it's a time capsule from a month ago, we will still alert?

@ceorourke
Member

That might be a question better directed towards SNS or whichever team manages event ingestion; I don't know if delayed events retain the original timestamp somewhere. If they did, I could imagine that might change the results of the Snuba query. As far as alerting is concerned, we run through the logic to determine whether the alert should fire just after ingestion.
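To spell out the open question (a hypothesis sketch, not confirmed Sentry behavior): whether a very late event is counted can depend on which timestamp the count query filters on.

```python
from datetime import datetime, timedelta, timezone

now = datetime(2025, 2, 25, tzinfo=timezone.utc)  # evaluation time, just after ingestion
window_start = now - timedelta(weeks=1)           # the alert's 1w window

delayed_event = {
    "timestamp": now - timedelta(days=30),  # the "time capsule" original timestamp
    "received": now,                        # ingested just now
}

counted_by_original_timestamp = window_start <= delayed_event["timestamp"] <= now  # False: outside the window
counted_by_received_time = window_start <= delayed_event["received"] <= now        # True: inside the window
print(counted_by_original_timestamp, counted_by_received_time)
```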
