#3513: Mark muted alerts #3793
dispatch/dispatch.go
Outdated
@@ -182,6 +184,7 @@ func (d *Dispatcher) run(it provider.AlertIterator) {
for _, ag := range groups {
	if ag.empty() {
		ag.stop()
		d.marker.DeleteByGroupKey(ag.GroupKey())
We need to delete the marker when the aggregation group is deleted. There is a similar mechanism in provider/mem/mem.go:
alertmanager/provider/mem/mem.go
Line 108 in 14cbe63
m.Delete(alert.Fingerprint())
This commit updates TimeMuteStage and TimeActiveStage to mark groups as muted when their alerts are muted by an active or mute time interval, and to remove any existing markers when outside all active and mute time intervals.
Signed-off-by: George Robinson <[email protected]>
@@ -107,7 +108,7 @@ func NewDispatcher(
	ap provider.Alerts,
	r *Route,
	s notify.Stage,
	mk types.AlertMarker,
mk was never used, so here I am replacing it with types.GroupMarker.
@@ -145,8 +147,8 @@ func (d *Dispatcher) Run() {
}

func (d *Dispatcher) run(it provider.AlertIterator) {
	cleanup := time.NewTicker(30 * time.Second)
	defer cleanup.Stop()
	maintenance := time.NewTicker(30 * time.Second)
I renamed this because all other "cleanup" functions in Alertmanager are called maintenance.
@@ -175,28 +177,30 @@ func (d *Dispatcher) run(it provider.AlertIterator) {
}
d.metrics.processingDuration.Observe(time.Since(now).Seconds())

case <-cleanup.C:
I moved this logic to a method called doMaintenance. The main purpose of this was to make it testable.
for _, ag := range groups {
	if ag.empty() {
		ag.stop()
		d.marker.DeleteByGroupKey(ag.routeID, ag.GroupKey())
We delete the marker for the aggregation group when the aggregation group is deleted itself.
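As a rough illustration of the cleanup described above, here is a minimal, self-contained sketch. The `groupMarker` and `group` types are hypothetical stand-ins written for this example, not the real `types.GroupMarker` or `aggrGroup`; only the method name `DeleteByGroupKey` and its `(routeID, groupKey)` arguments are taken from the diff.

```go
package main

import (
	"fmt"
	"sync"
)

// groupMarker is a hypothetical stand-in for types.GroupMarker. It records
// the names of the time intervals muting a group, keyed by route ID and
// group key.
type groupMarker struct {
	mtx   sync.RWMutex
	muted map[string][]string
}

func newGroupMarker() *groupMarker {
	return &groupMarker{muted: make(map[string][]string)}
}

// markerKey is an assumption of this sketch: route ID and group key are
// simply joined with a colon.
func markerKey(routeID, groupKey string) string {
	return routeID + ":" + groupKey
}

func (m *groupMarker) SetMuted(routeID, groupKey string, names []string) {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	m.muted[markerKey(routeID, groupKey)] = names
}

func (m *groupMarker) Muted(routeID, groupKey string) ([]string, bool) {
	m.mtx.RLock()
	defer m.mtx.RUnlock()
	names := m.muted[markerKey(routeID, groupKey)]
	return names, len(names) > 0
}

func (m *groupMarker) DeleteByGroupKey(routeID, groupKey string) {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	delete(m.muted, markerKey(routeID, groupKey))
}

// group is a simplified aggregation group that only tracks an alert count.
type group struct {
	routeID string
	key     string
	alerts  int
}

// doMaintenance removes every empty group and, at the same time, deletes
// its marker so that no stale "muted" state is left behind.
func doMaintenance(groups map[string]*group, m *groupMarker) {
	for k, g := range groups {
		if g.alerts == 0 {
			delete(groups, k) // deleting during range is safe in Go
			m.DeleteByGroupKey(g.routeID, g.key)
		}
	}
}

func main() {
	m := newGroupMarker()
	m.SetMuted("route1", "group1", []string{"weekends"})
	groups := map[string]*group{
		"group1": {routeID: "route1", key: "group1", alerts: 0},
	}
	doMaintenance(groups, m)
	_, muted := m.Muted("route1", "group1")
	fmt.Println(len(groups), muted)
}
```

Pulling the loop out into its own function is what makes it possible to call it directly from a test instead of waiting for the 30-second ticker.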
@@ -374,6 +378,7 @@ type aggrGroup struct {
	labels model.LabelSet
	opts *RouteOpts
	logger log.Logger
	routeID string
We needed to add routeID to aggrGroup because it's used in d.marker.DeleteByGroupKey(ag.routeID, ag.GroupKey()).
@@ -447,6 +453,7 @@ func (ag *aggrGroup) run(nf notifyFunc) {
ctx = notify.WithRepeatInterval(ctx, ag.opts.RepeatInterval)
ctx = notify.WithMuteTimeIntervals(ctx, ag.opts.MuteTimeIntervals)
ctx = notify.WithActiveTimeIntervals(ctx, ag.opts.ActiveTimeIntervals)
ctx = notify.WithRouteID(ctx, ag.routeID)
We also need to add it to the context so it can be extracted in TimeMuteStage and TimeActiveStage.
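For readers unfamiliar with the pattern, passing the route ID through the context might look roughly like the sketch below. The names `WithRouteID` and `RouteID` mirror the ones in the diff, but the bodies, the `ctxKey` type, and the `newGroupContext` helper are assumptions made for this example rather than the actual code in the notify package.

```go
package main

import (
	"context"
	"fmt"
)

// ctxKey is an unexported key type, so values stored under it cannot
// collide with keys defined by other packages.
type ctxKey int

const keyRouteID ctxKey = iota

// WithRouteID returns a copy of ctx that carries the route ID.
func WithRouteID(ctx context.Context, routeID string) context.Context {
	return context.WithValue(ctx, keyRouteID, routeID)
}

// RouteID extracts the route ID from ctx. The second return value is
// false if no route ID was stored.
func RouteID(ctx context.Context) (string, bool) {
	v, ok := ctx.Value(keyRouteID).(string)
	return v, ok
}

// newGroupContext is a hypothetical helper that plays the role of
// aggrGroup.run: it builds the context later handed to the stages.
func newGroupContext(routeID string) context.Context {
	return WithRouteID(context.Background(), routeID)
}

func main() {
	// A stage receiving this context can look the route ID up again.
	id, ok := RouteID(newGroupContext("route1"))
	fmt.Println(id, ok)

	// Without WithRouteID the lookup reports that nothing was stored.
	_, ok = RouteID(context.Background())
	fmt.Println(ok)
}
```

A stage would typically treat a missing route ID as an error, since without it the stage cannot tell which group marker to update.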
@@ -691,3 +691,48 @@ type limits struct {
func (l limits) MaxNumberOfAggregationGroups() int {
	return l.groups
}

func TestDispatcher_DoMaintenance(t *testing.T) {
Here I am testing that the marker is deleted when the aggregation group is garbage collected.
@@ -827,12 +828,6 @@ func TestTimeMuteStage(t *testing.T) {
}
eveningsAndWeekends := map[string][]timeinterval.TimeInterval{
	"evenings": {{
		Weekdays: []timeinterval.WeekdayRange{{
I removed Weekdays so I could test a group being muted by both evenings and weekends at the same time.

// Get the names of all time intervals for the context.
muteTimeIntervalNames := make([]string, 0, len(test.intervals))
for name := range test.intervals {
	muteTimeIntervalNames = append(muteTimeIntervalNames, name)
}
// Sort the names so we can compare mutedBy with test.mutedBy.
sort.Strings(muteTimeIntervalNames)
Needs to be sorted because test.intervals is a map.
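The reason for sorting is that Go deliberately randomizes map iteration order, so a slice built by ranging over a map can differ between runs. A small sketch of the idea, using a hypothetical generic helper `sortedKeys` that is not part of the pull request:

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys returns the keys of m in ascending order, giving a stable
// result regardless of Go's randomized map iteration order.
func sortedKeys[V any](m map[string]V) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	// The values are irrelevant here; only the interval names matter.
	intervals := map[string]struct{}{
		"weekends": {},
		"evenings": {},
	}
	fmt.Println(sortedKeys(intervals))
}
```

With the names sorted, the test can compare the result against an expected slice with a plain equality assertion.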
require.NoError(t, prom_testutil.GatherAndCompare(r, strings.NewReader(`
# HELP alertmanager_marked_alerts How many alerts by state are currently marked in the Alertmanager regardless of their expiry.
# TYPE alertmanager_marked_alerts gauge
alertmanager_marked_alerts{state="active"} 0
Do I need to add metrics for marked groups? @gotjosh
I don't think so - I wouldn't really understand their historical use case.
LGTM
// RouteID extracts a RouteID from the context. Iff none exists, the
// // second argument is false.
- // RouteID extracts a RouteID from the context. Iff none exists, the
- // // second argument is false.
+ // RouteID extracts a RouteID from the context. If none exists, the
+ // second argument is false.
The other comments have the same typo of iff
Oh iff is not a spelling mistake, it refers to if and only if.
Today I learned 🤯
Signed-off-by: George Robinson <[email protected]>
This pull request updates TimeMuteStage and TimeActiveStage to mark groups as muted when their alerts are muted by an active or mute time interval, and to remove any existing markers when outside all active and mute time intervals.
It is based on #3792, #3794 and #3795.
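To make the marking behaviour in the description concrete, here is a simplified sketch of the mute case only. Everything in it is hypothetical: `interval` reduces a time interval to a range of hours, `marker` is a plain map instead of `types.GroupMarker`, and `muteStage` is a free function rather than the real `TimeMuteStage`. The active-interval case would be the inverse (a group is muted when the current time is outside every active interval) and is not shown.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// interval is a hypothetical, simplified time interval: the half-open
// range of hours [startHour, endHour) in UTC.
type interval struct{ startHour, endHour int }

func (i interval) contains(t time.Time) bool {
	h := t.UTC().Hour()
	return h >= i.startHour && h < i.endHour
}

// marker maps "routeID:groupKey" to the names of the intervals that are
// currently muting the group.
type marker map[string][]string

// muteStage checks every named interval against now. If at least one
// contains now, the group is marked as muted by those intervals and true
// is returned. Otherwise any existing marker is removed.
func muteStage(m marker, routeID, groupKey string, intervals map[string]interval, now time.Time) bool {
	var mutedBy []string
	for name, in := range intervals {
		if in.contains(now) {
			mutedBy = append(mutedBy, name)
		}
	}
	sort.Strings(mutedBy) // map iteration order is random
	key := routeID + ":" + groupKey
	if len(mutedBy) > 0 {
		m[key] = mutedBy
		return true
	}
	delete(m, key)
	return false
}

// atHour is a helper for the demo: an arbitrary fixed date at hour h UTC.
func atHour(h int) time.Time {
	return time.Date(2024, time.January, 1, h, 0, 0, 0, time.UTC)
}

func main() {
	m := marker{}
	intervals := map[string]interval{
		"evenings": {18, 24},
		"nights":   {22, 24},
	}
	fmt.Println(muteStage(m, "route1", "group1", intervals, atHour(23)), m)
	fmt.Println(muteStage(m, "route1", "group1", intervals, atHour(9)), m)
}
```

The second call shows the part that is easy to forget: once the group is outside all intervals, the earlier marker has to be cleared, otherwise the group would keep being reported as muted.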