How alert rule evaluation works in Grafana
In Grafana, alert rules must be evaluated at some interval to determine whether they are firing. Evaluation must then be repeated at that interval because the conditions that caused the alert to fire in the first place might no longer be present, in which case the alert is resolved.
In this document we look at how alert evaluation works, including an introduction to data frames, the different kinds of data frames, how data frames are translated into states, and state management.
Each alert rule contains one or more queries and expressions: it can contain just queries, just expressions, or a combination of the two, provided the output is a single floating point number. This floating point number is called the condition, and if it is greater than zero the alert rule is said to be firing, otherwise it is said to be normal. This is why it is possible to use an Instant vector, a Reduce expression, or a Math expression with a boolean comparison as the condition.
Queries or expressions that return time series data, that is a sequence of floating point numbers over time, cannot be used as the condition because there is no single floating point number. Instead, time series data must be reduced to a single floating point number via either a selector function such as `first` or `last`, or an aggregation function such as `avg`, `min` or `max`.
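As a rough illustration, the following is a minimal Go sketch, not Grafana's actual reducer or expression code, of reducing a series to a single number and turning it into a condition value; the function names and the threshold are invented for the example.

```go
package main

import "fmt"

// Series values are pointers so that missing samples can be represented as
// nil, mirroring the []*float64 fields shown in the data frames below.

// reduceLast mimics a "last" selector: the last non-nil value in the series.
func reduceLast(series []*float64) (float64, bool) {
	for i := len(series) - 1; i >= 0; i-- {
		if series[i] != nil {
			return *series[i], true
		}
	}
	return 0, false
}

// reduceAvg mimics an "avg" aggregation: the mean of all non-nil values.
func reduceAvg(series []*float64) (float64, bool) {
	sum, n := 0.0, 0
	for _, v := range series {
		if v != nil {
			sum += *v
			n++
		}
	}
	if n == 0 {
		return 0, false
	}
	return sum / float64(n), true
}

func main() {
	one, three := 1.0, 3.0
	series := []*float64{&one, nil, &three}

	last, _ := reduceLast(series) // 3
	avg, _ := reduceAvg(series)   // 2

	// A Math expression such as "$B > 2" turns the reduced value into the
	// condition: a single float that is non-zero when the comparison holds.
	condition := 0.0
	if last > 2 {
		condition = 1
	}
	fmt.Println(last, avg, condition > 0) // 3 2 true
}
```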
Evaluation starts in the eval package, which can be found in `pkg/services/ngalert/eval`. This package contains an interface called `ConditionEvaluator`, which has two methods (a simplified sketch follows the list):
- `EvaluateRaw`, which returns the raw data frames from the expression service
- `Evaluate`, which separates the raw data frames from `EvaluateRaw` into data frames for the condition, data frames for individual queries and expressions, and an error
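The sketch below shows roughly what this interface looks like. The exact signatures in `pkg/services/ngalert/eval` differ, so treat the parameter and return types here as illustrative assumptions rather than the real API.

```go
// Package eval mirrors the real package's name only; the declarations below
// are a simplified sketch, not the real code.
package eval

import (
	"context"
	"time"

	"github.com/grafana/grafana-plugin-sdk-go/backend"
)

// Results holds one entry per instance; its element type is sketched after
// the next paragraph.
type Results []Result

type Result struct{} // fields elided in this sketch

// ConditionEvaluator is a sketch of the interface described above.
type ConditionEvaluator interface {
	// EvaluateRaw runs the rule's data pipeline through the expression
	// service and returns the raw data frames it produced.
	EvaluateRaw(ctx context.Context, now time.Time) (*backend.QueryDataResponse, error)

	// Evaluate calls EvaluateRaw and separates the raw frames into data
	// frames for the condition and data frames for the individual queries
	// and expressions, returning one result per instance or an error.
	Evaluate(ctx context.Context, now time.Time) (Results, error)
}
```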
When an alert rule is evaluated, the `Evaluate` method is called. Each alert rule has its own instance of `conditionEvaluator`, which contains the data pipeline, a reference to the expression service, the condition, and a timeout. The value returned is a slice of results, each containing the data frames, values, and state of an instance, with an instance being a distinct label set; a sketch of such a result is shown below. This slice of results is then passed to the state manager, where the state of each instance is determined (firing, normal, and so on) and an alert is either sent to or resolved in the Alertmanager.
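A hedged sketch of what one of these results carries, with simplified field names and types; the real `Result` struct in `pkg/services/ngalert/eval` has more fields (evaluation time, duration, an evaluation string, and so on).

```go
// A simplified, stand-alone sketch; field names and types are assumptions.
package eval

import (
	"github.com/grafana/grafana-plugin-sdk-go/data"
)

// State is the per-instance evaluation state.
type State int

const (
	Normal State = iota
	Alerting
	Pending
	NoData
	Error
)

// Result sketches one element of the slice returned by Evaluate: the
// evaluation of a single instance, identified by its distinct label set.
type Result struct {
	Instance data.Labels        // the distinct label set, e.g. {"host": "foo"}
	State    State              // Normal, Alerting, Pending, NoData or Error
	Values   map[string]float64 // reduced values keyed by Ref ID, e.g. {"B": 1}
	Frames   []*data.Frame      // the data frames behind those values
	Error    error              // set when State is Error
}
```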
Data frames are the intermediate representation of data between a datasource and Grafana. The datasource plugin is responsible for querying the datasource and transforming the result into a series of data frames for Grafana.
A data frame is a column-oriented table structure, which means it stores data by column rather than by row. Each frame has a name (which appears to be optional, as it is not set in Prometheus) and zero or more fields, each containing a name, a type (e.g. `time.Time`, `[]*float64`), and a length representing the number of rows.
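For example, a frame with a time field and a nullable value field can be constructed with the `data` package from `grafana-plugin-sdk-go`; the frame name, field names and labels below are made up for illustration.

```go
package main

import (
	"fmt"
	"time"

	"github.com/grafana/grafana-plugin-sdk-go/data"
)

func main() {
	v := 1.0
	// A frame with two fields: a time column and a nullable value column.
	frame := data.NewFrame("example",
		data.NewField("Time", nil, []time.Time{time.Date(2023, 1, 9, 14, 32, 30, 0, time.UTC)}),
		data.NewField("Value", data.Labels{"host": "foo"}, []*float64{&v}),
	)

	fmt.Println(frame.Name)        // example
	fmt.Println(len(frame.Fields)) // 2
	fmt.Println(frame.Rows())      // 1
}
```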
A data frame can take one of a number of different logical kinds, such as `Time series`, `Numeric` and `Heatmap`, and each logical kind can be in a number of different formats, such as `Wide`, `Long` and `Multi`.
A data type definition, or data type declaration, represents both the kind and the format. For example, `TimeSeriesWide` has the kind `Time series` and the format `Wide`.
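In the plugin SDK this kind-and-format pair is declared as a frame type on the frame's metadata. The following is a minimal sketch, assuming the `data.FrameMeta` field and the `data.FrameTypeTimeSeriesWide` constant behave as in current versions of `grafana-plugin-sdk-go`:

```go
package main

import (
	"fmt"

	"github.com/grafana/grafana-plugin-sdk-go/data"
)

func main() {
	frame := data.NewFrame("example")

	// Declare the data type definition: kind "Time series", format "Wide".
	frame.Meta = &data.FrameMeta{Type: data.FrameTypeTimeSeriesWide}

	fmt.Println(frame.Meta.Type) // "timeseries-wide"
}
```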
The following is an example of a `TimeSeriesWide` and `NumericWide` for a single time series with a single value:
```
Data frames for Ref ID: A, num frames: 1
  Fields for frame: 0, num fields: 2
    Name: Time, Type: []time.Time, Rows: 1
      2023-01-09 14:32:30 +0000 UTC
    Name: Value, Type: []*float64, Rows: 1
      1.000000
Data frames for Ref ID: B, num frames: 1
  Fields for frame: 0, num fields: 1
    Name: B, Type: []*float64, Rows: 1
      1.000000
```
The following is an example of a `TimeSeriesWide` and `NumericWide` for a single time series with multiple values:
```
Data frames for Ref ID: A, num frames: 1
  Fields for frame: 0, num fields: 2
    Name: Time, Type: []time.Time, Rows: 4
      2023-01-09 14:32:30 +0000 UTC
      2023-01-09 14:32:45 +0000 UTC
      2023-01-09 14:33:00 +0000 UTC
      2023-01-09 14:33:15 +0000 UTC
    Name: Value, Type: []*float64, Rows: 4
      1.000000
      2.000000
      3.000000
      4.000000
Data frames for Ref ID: B, num frames: 1
  Fields for frame: 0, num fields: 1
    Name: B, Type: []*float64, Rows: 1
      1.000000
```
The following is an example of a `TimeSeriesWide` and `NumericWide` for two time series with multiple values:
```
Data frames for Ref ID: A, num frames: 2
  Fields for frame: 0, num fields: 2
    Name: Time, Type: []time.Time, Rows: 4
      2023-01-09 14:32:30 +0000 UTC
      2023-01-09 14:32:45 +0000 UTC
      2023-01-09 14:33:00 +0000 UTC
      2023-01-09 14:33:15 +0000 UTC
    Name: Value, Type: []*float64, Rows: 4
      1.000000
      2.000000
      3.000000
      4.000000
  Fields for frame: 1, num fields: 2
    Name: Time, Type: []time.Time, Rows: 4
      2023-01-09 14:34:00 +0000 UTC
      2023-01-09 14:34:15 +0000 UTC
      2023-01-09 14:34:30 +0000 UTC
      2023-01-09 14:34:45 +0000 UTC
    Name: Value, Type: []*float64, Rows: 4
      5.000000
      6.000000
      7.000000
      8.000000
Data frames for Ref ID: B, num frames: 2
  Fields for frame: 0, num fields: 1
    Name: B, Type: []*float64, Rows: 1
      1.000000
  Fields for frame: 1, num fields: 1
    Name: B, Type: []*float64, Rows: 1
      5.000000
```
The following is an example of a `TimeSeriesWide` and `NumericWide` for a PostgreSQL table with two columns, `value` and `ts`:
```
Data frames for Ref ID: B, num frames: 1
  Fields for frame: 0, num fields: 1
    Name: B, Type: []*float64, Rows: 1
      1.000000
Data frames for Ref ID: A, num frames: 1
  Fields for frame: 0, num fields: 2
    Name: ts, Type: []time.Time, Rows: 4
      2023-01-09 16:22:54.055134 +0000 +0000
      2023-01-09 16:23:03.385304 +0000 +0000
      2023-01-09 16:23:06.971062 +0000 +0000
      2023-01-09 16:23:09.340637 +0000 +0000
    Name: value, Type: []*float64, Rows: 4
      1.000000
      1.000000
      1.000000
      1.000000
```
The following is an example of a `TimeSeriesWide` and `NumericWide` for a PostgreSQL table with three columns, `host`, `value` and `ts`, containing two rows with different timestamps:
```
SELECT * FROM test;
 host | value | ts
------+-------+----------------------------
 foo  |     1 | 2023-01-09 16:39:17.722849
 bar  |     1 | 2023-01-09 16:39:20.90213
(2 rows)
```
You can see here that each frame contains both timestamps, and where `foo` or `bar` has no value at a timestamp the value has been replaced with `nil`:
```
Data frames for Ref ID: B, num frames: 2
  Fields for frame: 0, num fields: 1
    Name: B, Type: []*float64, Rows: 1
      0.000000
  Fields for frame: 1, num fields: 1
    Name: B, Type: []*float64, Rows: 1
      0.000000
Data frames for Ref ID: A, num frames: 2
  Fields for frame: 0, num fields: 2
    Name: Time, Type: []time.Time, Rows: 2
      2023-01-09 16:39:17.722849 +0000 +0000
      2023-01-09 16:39:20.90213 +0000 +0000
    Name: value, Type: []*float64, Rows: 2
      nil
      1.000000
  Fields for frame: 1, num fields: 2
    Name: Time, Type: []time.Time, Rows: 2
      2023-01-09 16:39:17.722849 +0000 +0000
      2023-01-09 16:39:20.90213 +0000 +0000
    Name: value, Type: []*float64, Rows: 2
      1.000000
      nil
```
The state manager accepts results from `Evaluate` via `ProcessEvalResults`. This method iterates over each result in the slice (each containing the data frames, values, and state of an instance, with an instance being a distinct label set) and calls `setNextState`. The `setNextState` method either creates a new state or updates the existing state for the instance, appending the result to the list of past results for this instance. It then uses the information it has from the previous evaluation to decide if the state is `Normal`, `Alerting`, `Pending`, `NoData` or `Error`. Here an optional screenshot can also be taken, and both custom annotations and labels are expanded.
For each state that is `Alerting`, `NoData`, `Error` or resolved, an alert is created and sent to the Alertmanager.
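To make this concrete, here is a heavily simplified, hypothetical sketch of the kind of decision `setNextState` makes per instance. The real implementation in `pkg/services/ngalert/state` also handles the pending period as a duration, `NoData` and `Error` handling options, resolved and stale states, and more, so the types and logic below are illustrative assumptions only.

```go
package main

import "fmt"

type State string

const (
	Normal   State = "Normal"
	Pending  State = "Pending"
	Alerting State = "Alerting"
)

// nextState is a toy version of the per-instance decision: "breached" says
// whether the condition was non-zero for this evaluation, and "forEvals" is
// how many consecutive breached evaluations are required before Pending
// becomes Alerting (a stand-in for the rule's pending period).
func nextState(breached bool, breachedEvals, forEvals int) State {
	switch {
	case !breached:
		return Normal // condition no longer met: the alert resolves
	case breachedEvals >= forEvals:
		return Alerting // breached for long enough: the alert fires
	default:
		return Pending // breached, but not yet for long enough
	}
}

func main() {
	breachedEvals := 0
	for _, breached := range []bool{false, true, true, true, false} {
		if breached {
			breachedEvals++
		} else {
			breachedEvals = 0
		}
		fmt.Println(nextState(breached, breachedEvals, 2))
	}
	// Output: Normal, Pending, Alerting, Alerting, Normal
}
```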