How alert rule evaluation works in Grafana

George Robinson edited this page Jan 9, 2023 · 6 revisions

Introduction

In Grafana, each alert rule is evaluated at a regular interval to determine whether it is firing. Evaluation must continue at that interval after the rule fires, because the conditions that caused the alert might no longer be present, in which case the alert is resolved.

In this document we will look at how alert evaluation works, including an introduction to data frames, the different kinds of data frames, translating data frames into states, and state management.

Queries, expressions and conditions

Each alert rule can contain one or more queries or expressions. An alert rule can contain just queries, just expressions, or a combination of the two, provided the output is a single floating point number. This floating point number is called the condition. If it is greater than zero the alert rule is said to be firing, otherwise it is said to be normal. This is why an Instant vector, a Reduce expression, or a Math expression with a boolean comparison can each be used as the condition.

Queries and expressions that return time series data, that is, a sequence of floating point numbers over time, cannot be used as the condition because they do not produce a single floating point number. Instead, time series data must be reduced to a single floating point number using either a selector function such as first or last, or an aggregation function such as avg, min or max.
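
To make the reduction step concrete, here is a minimal sketch in Go. It is not Grafana's implementation: the function names reduce, greaterThan and isFiring are hypothetical, and the threshold of 2 is made up for illustration. It reduces a series to one number, applies a boolean comparison that yields 1 or 0, and then applies the firing rule as stated above.

```go
package main

import (
	"errors"
	"fmt"
)

// reduce collapses a series into a single number. The function names mirror
// the selector and aggregation functions named in the text; the code itself
// is illustrative and not taken from Grafana.
func reduce(fn string, values []float64) (float64, error) {
	if len(values) == 0 {
		return 0, errors.New("cannot reduce an empty series")
	}
	switch fn {
	case "first":
		return values[0], nil
	case "last":
		return values[len(values)-1], nil
	case "min", "max":
		out := values[0]
		for _, v := range values[1:] {
			if (fn == "min" && v < out) || (fn == "max" && v > out) {
				out = v
			}
		}
		return out, nil
	case "avg":
		sum := 0.0
		for _, v := range values {
			sum += v
		}
		return sum / float64(len(values)), nil
	}
	return 0, fmt.Errorf("unknown reducer %q", fn)
}

// greaterThan mimics a Math expression with a boolean comparison:
// it returns 1 when the comparison holds and 0 otherwise.
func greaterThan(value, threshold float64) float64 {
	if value > threshold {
		return 1
	}
	return 0
}

// isFiring applies the rule as stated in the text: a condition greater
// than zero means the rule is firing.
func isFiring(condition float64) bool {
	return condition > 0
}

func main() {
	series := []float64{1, 2, 3, 4}
	for _, fn := range []string{"first", "last", "avg", "min", "max"} {
		reduced, err := reduce(fn, series)
		if err != nil {
			panic(err)
		}
		condition := greaterThan(reduced, 2) // hypothetical threshold
		fmt.Printf("%-5s reduced=%v condition=%v firing=%t\n",
			fn, reduced, condition, isFiring(condition))
	}
}
```

The choice of reducer changes the outcome for the same series: with these sample values, first and min stay at or below the threshold while last, avg and max exceed it.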

How does evaluation work?

Evaluation starts in the eval package, which can be found in pkg/services/ngalert/eval. This package defines an interface called ConditionEvaluator, which has two methods:

  1. EvaluateRaw, which returns the raw data frames from the expression service
  2. Evaluate, which separates the raw data frames from EvaluateRaw into data frames for the condition, data frames for individual queries and expressions, and an error

When an alert rule is evaluated, the Evaluate method is called. Each alert rule has its own instance of conditionEvaluator, which contains the data pipeline, a reference to the expression service, the condition, and a timeout. The value returned is a slice of results, each containing the data frames, values, and state of an instance, where an instance is a distinct label set. This slice of results is then passed to the state manager, which determines the state of each instance (firing, normal, etc.) and either sends or resolves an alert in the Alertmanager.
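
The flow above can be sketched roughly as follows. This is a simplified model, not the code in pkg/services/ngalert/eval: the method signatures, the Frame and Result types, and staticEvaluator are all hypothetical, and mapping a missing value to NoData is an assumption made for the example.

```go
package main

import "fmt"

// Frame is a hypothetical stand-in for one reduced data frame:
// a label set plus a single value that may be missing.
type Frame struct {
	Labels map[string]string
	Value  *float64
}

// Result is a hypothetical per-instance result.
type Result struct {
	Instance map[string]string // the distinct label set
	Value    *float64
	State    string
}

// ConditionEvaluator mirrors the two-method shape described in the text.
// The real signatures differ; these are simplified for illustration.
type ConditionEvaluator interface {
	EvaluateRaw() (map[string][]Frame, error)
	Evaluate() ([]Result, error)
}

// staticEvaluator returns canned frames instead of calling an expression service.
type staticEvaluator struct {
	condition string             // Ref ID of the condition, e.g. "B"
	frames    map[string][]Frame // frames keyed by Ref ID
}

func (e staticEvaluator) EvaluateRaw() (map[string][]Frame, error) {
	return e.frames, nil
}

// Evaluate picks out the frames for the condition and produces one
// result per instance.
func (e staticEvaluator) Evaluate() ([]Result, error) {
	raw, err := e.EvaluateRaw()
	if err != nil {
		return nil, err
	}
	frames, ok := raw[e.condition]
	if !ok {
		return nil, fmt.Errorf("condition %q not found in results", e.condition)
	}
	results := make([]Result, 0, len(frames))
	for _, f := range frames {
		r := Result{Instance: f.Labels, Value: f.Value}
		switch {
		case f.Value == nil:
			r.State = "NoData" // assumption: a missing value means no data
		case *f.Value > 0:
			r.State = "Alerting" // rule as stated in the text
		default:
			r.State = "Normal"
		}
		results = append(results, r)
	}
	return results, nil
}

func main() {
	one, zero := 1.0, 0.0
	var ev ConditionEvaluator = staticEvaluator{
		condition: "B",
		frames: map[string][]Frame{
			"B": {
				{Labels: map[string]string{"host": "foo"}, Value: &one},
				{Labels: map[string]string{"host": "bar"}, Value: &zero},
			},
		},
	}
	results, err := ev.Evaluate()
	if err != nil {
		panic(err)
	}
	for _, r := range results {
		fmt.Printf("%v -> %s\n", r.Instance, r.State)
	}
}
```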

What do data frames look like?

Data frames are the intermediate representation of data between a datasource and Grafana. The datasource plugin is responsible for querying the datasource and transforming the result into a series of data frames for Grafana.

A data frame is a column-oriented table structure, which means it stores data by column rather than by row. Each frame has a name (which appears to be optional, as it is not set for Prometheus) and zero or more fields, each with a name, a type (e.g. []time.Time or []*float64), and a length representing the number of rows.
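
As a rough mental model, a frame can be pictured as a list of named, typed columns. The sketch below uses hypothetical Frame and Field types defined only for this example; it is not the type Grafana itself uses. Its Describe method prints a field in the same style as the dumps in the Examples section.

```go
package main

import (
	"fmt"
	"time"
)

// Field is one column: a name plus a typed slice holding one value per row.
// This is a simplified, hypothetical model.
type Field struct {
	Name   string
	Values interface{} // []time.Time or []*float64 in this sketch
}

// Len returns the number of rows stored in the column.
func (f Field) Len() int {
	switch v := f.Values.(type) {
	case []time.Time:
		return len(v)
	case []*float64:
		return len(v)
	}
	return 0
}

// Describe summarises the field's name, type and row count.
func (f Field) Describe() string {
	return fmt.Sprintf("Name: %s, Type: %T, Rows: %d", f.Name, f.Values, f.Len())
}

// Frame is an optionally named collection of columns.
type Frame struct {
	Name   string
	Fields []Field
}

// exampleFrame builds a frame with one timestamp and one value.
func exampleFrame() Frame {
	v := 1.0
	return Frame{Fields: []Field{
		{Name: "Time", Values: []time.Time{time.Date(2023, 1, 9, 14, 32, 30, 0, time.UTC)}},
		{Name: "Value", Values: []*float64{&v}},
	}}
}

func main() {
	frame := exampleFrame()
	fmt.Printf("num fields: %d\n", len(frame.Fields))
	for _, f := range frame.Fields {
		fmt.Println(f.Describe())
	}
}
```

Using pointers for the values ([]*float64) is what lets a column represent a missing value as nil.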

A data frame can take one of a number of different logical kinds, such as Time series, Numeric and Heatmap, and each logical kind can be in one of a number of different formats, such as Wide, Long and Multi.

A data type definition, or data type declaration, represents both the kind and the format. For example, TimeSeriesWide has the kind Time series and the format Wide.

Examples

Prometheus

The following is an example of a TimeSeriesWide and NumericWide for a single time series with a single value:

Data frames for Ref ID: A, num frames: 1
	Fields for frame: 0, num fields: 2
		Name: Time, Type: []time.Time, Rows: 1
			2023-01-09 14:32:30 +0000 UTC
		Name: Value, Type: []*float64, Rows: 1
			1.000000
Data frames for Ref ID: B, num frames: 1
	Fields for frame: 0, num fields: 1
		Name: B, Type: []*float64, Rows: 1
			1.000000

The following is an example of a TimeSeriesWide and NumericWide for a single time series with multiple values:

Data frames for Ref ID: A, num frames: 1
	Fields for frame: 0, num fields: 2
		Name: Time, Type: []time.Time, Rows: 4
			2023-01-09 14:32:30 +0000 UTC
			2023-01-09 14:32:45 +0000 UTC
			2023-01-09 14:33:00 +0000 UTC
			2023-01-09 14:33:15 +0000 UTC
		Name: Value, Type: []*float64, Rows: 4
			1.000000
			2.000000
			3.000000
			4.000000
Data frames for Ref ID: B, num frames: 1
	Fields for frame: 0, num fields: 1
		Name: B, Type: []*float64, Rows: 1
			1.000000

The following is an example of a TimeSeriesWide and NumericWide for two time series with multiple values:

Data frames for Ref ID: A, num frames: 2
	Fields for frame: 0, num fields: 2
		Name: Time, Type: []time.Time, Rows: 4
			2023-01-09 14:32:30 +0000 UTC
			2023-01-09 14:32:45 +0000 UTC
			2023-01-09 14:33:00 +0000 UTC
			2023-01-09 14:33:15 +0000 UTC
		Name: Value, Type: []*float64, Rows: 4
			1.000000
			2.000000
			3.000000
			4.000000
	Fields for frame: 1, num fields: 2
		Name: Time, Type: []time.Time, Rows: 4
			2023-01-09 14:34:00 +0000 UTC
			2023-01-09 14:34:15 +0000 UTC
			2023-01-09 14:34:30 +0000 UTC
			2023-01-09 14:34:45 +0000 UTC
		Name: Value, Type: []*float64, Rows: 4
			5.000000
			6.000000
			7.000000
			8.000000
Data frames for Ref ID: B, num frames: 2
	Fields for frame: 0, num fields: 1
		Name: B, Type: []*float64, Rows: 1
			1.000000
	Fields for frame: 1, num fields: 1
		Name: B, Type: []*float64, Rows: 1
			5.000000

PostgreSQL

The following is an example of a TimeSeriesWide and NumericWide for a PostgreSQL table with two columns, value and ts:

Data frames for Ref ID: B, num frames: 1
	Fields for frame: 0, num fields: 1
		Name: B, Type: []*float64, Rows: 1
			1.000000
Data frames for Ref ID: A, num frames: 1
	Fields for frame: 0, num fields: 2
		Name: ts, Type: []time.Time, Rows: 4
			2023-01-09 16:22:54.055134 +0000 +0000
			2023-01-09 16:23:03.385304 +0000 +0000
			2023-01-09 16:23:06.971062 +0000 +0000
			2023-01-09 16:23:09.340637 +0000 +0000
		Name: value, Type: []*float64, Rows: 4
			1.000000
			1.000000
			1.000000
			1.000000

The following is an example of a TimeSeriesWide and NumericWide for a PostgreSQL table with three columns (host, value and ts) containing two rows with different timestamps:

SELECT * FROM test;
 host | value |             ts
------+-------+----------------------------
 foo  |     1 | 2023-01-09 16:39:17.722849
 bar  |     1 | 2023-01-09 16:39:20.90213
(2 rows)

You can see here that each frame contains both timestamps, and that the value is nil at the timestamp for which that host has no row:

Data frames for Ref ID: B, num frames: 2
	Fields for frame: 0, num fields: 1
		Name: B, Type: []*float64, Rows: 1
			0.000000
	Fields for frame: 1, num fields: 1
		Name: B, Type: []*float64, Rows: 1
			0.000000
Data frames for Ref ID: A, num frames: 2
	Fields for frame: 0, num fields: 2
		Name: Time, Type: []time.Time, Rows: 2
			2023-01-09 16:39:17.722849 +0000 +0000
			2023-01-09 16:39:20.90213 +0000 +0000
		Name: value, Type: []*float64, Rows: 2
			nil
			1.000000
	Fields for frame: 1, num fields: 2
		Name: Time, Type: []time.Time, Rows: 2
			2023-01-09 16:39:17.722849 +0000 +0000
			2023-01-09 16:39:20.90213 +0000 +0000
		Name: value, Type: []*float64, Rows: 2
			1.000000
			nil
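
The nil filling seen above can be reproduced with a small sketch. This is not how Grafana or its SQL datasources perform the conversion; the Row and Series types and the toSeries function are hypothetical. It builds one series per host over the union of all timestamps and leaves a nil wherever a host has no row at a timestamp.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Row is one table row, as in the SELECT output above.
type Row struct {
	Host  string
	Value float64
	TS    time.Time
}

// Series is a hypothetical per-host series sharing a common time axis.
type Series struct {
	Host   string
	Times  []time.Time
	Values []*float64 // nil marks a timestamp with no row for this host
}

// toSeries groups rows by host and aligns every series on the sorted
// union of timestamps, filling gaps with nil.
func toSeries(rows []Row) []Series {
	seen := map[int64]bool{}
	var times []time.Time
	byHost := map[string]map[int64]float64{}
	var hosts []string
	for _, r := range rows {
		k := r.TS.UnixNano()
		if !seen[k] {
			seen[k] = true
			times = append(times, r.TS)
		}
		if byHost[r.Host] == nil {
			byHost[r.Host] = map[int64]float64{}
			hosts = append(hosts, r.Host)
		}
		byHost[r.Host][k] = r.Value
	}
	sort.Slice(times, func(i, j int) bool { return times[i].Before(times[j]) })
	sort.Strings(hosts) // deterministic order for the example only

	out := make([]Series, 0, len(hosts))
	for _, h := range hosts {
		s := Series{Host: h, Times: times, Values: make([]*float64, len(times))}
		for i, t := range times {
			if v, ok := byHost[h][t.UnixNano()]; ok {
				val := v
				s.Values[i] = &val
			}
		}
		out = append(out, s)
	}
	return out
}

// exampleRows returns the two rows from the table above.
func exampleRows() []Row {
	return []Row{
		{Host: "foo", Value: 1, TS: time.Date(2023, 1, 9, 16, 39, 17, 722849000, time.UTC)},
		{Host: "bar", Value: 1, TS: time.Date(2023, 1, 9, 16, 39, 20, 902130000, time.UTC)},
	}
}

func main() {
	for _, s := range toSeries(exampleRows()) {
		fmt.Printf("host=%s\n", s.Host)
		for i, t := range s.Times {
			if s.Values[i] == nil {
				fmt.Printf("\t%s nil\n", t.Format(time.RFC3339Nano))
			} else {
				fmt.Printf("\t%s %f\n", t.Format(time.RFC3339Nano), *s.Values[i])
			}
		}
	}
}
```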

How does state management work?

The state manager accepts results from Evaluate via ProcessEvalResults. This method iterates over the slice of results, one per instance, and calls setNextState for each. The setNextState method either creates a new state or updates the existing state for the instance, appending the result to the list of past results for this instance. It then uses the information it has from the previous evaluation to decide if the state is Normal, Alerting, Pending, NoData or Error. At this point an optional screenshot can also be taken, and both custom annotations and labels are expanded.

For each state that is Alerting, NoData, Error or resolved, an alert is created and sent to the Alertmanager.
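
The overall shape of this loop can be sketched as follows. It is a heavily simplified model and not the state manager's actual code: the Manager, Result and Alert types are hypothetical, instances are keyed by a flattened label string, and "resolved" is assumed to mean a transition from Alerting to Normal. Pending, screenshots, and annotation and label expansion are left out.

```go
package main

import "fmt"

type State string

const (
	Normal   State = "Normal"
	Alerting State = "Alerting"
	NoData   State = "NoData"
	Error    State = "Error"
)

// Result is a hypothetical evaluation result for one instance.
type Result struct {
	Instance string // flattened label set, e.g. "host=foo"
	State    State
}

// instanceState tracks the current state and past results of one instance.
type instanceState struct {
	Current  State
	Resolved bool
	Results  []Result
}

// Alert is what this sketch "sends" to the Alertmanager.
type Alert struct {
	Instance string
	State    State
	Resolved bool
}

type Manager struct {
	states map[string]*instanceState
}

func NewManager() *Manager {
	return &Manager{states: map[string]*instanceState{}}
}

// setNextState creates or updates the state for the instance and appends
// the result to its history.
func (m *Manager) setNextState(r Result) *instanceState {
	s, ok := m.states[r.Instance]
	if !ok {
		s = &instanceState{Current: Normal}
		m.states[r.Instance] = s
	}
	previous := s.Current
	s.Results = append(s.Results, r)
	s.Current = r.State
	// Assumption: an instance is resolved when it goes from Alerting to Normal.
	s.Resolved = previous == Alerting && r.State == Normal
	return s
}

// ProcessEvalResults updates every instance and returns the alerts to send:
// one for each state that is Alerting, NoData, Error or resolved.
func (m *Manager) ProcessEvalResults(results []Result) []Alert {
	var alerts []Alert
	for _, r := range results {
		s := m.setNextState(r)
		if s.Current == Alerting || s.Current == NoData || s.Current == Error || s.Resolved {
			alerts = append(alerts, Alert{Instance: r.Instance, State: s.Current, Resolved: s.Resolved})
		}
	}
	return alerts
}

func main() {
	m := NewManager()
	evaluations := [][]Result{
		{{"host=foo", Alerting}, {"host=bar", Normal}},
		{{"host=foo", Normal}, {"host=bar", Normal}},
	}
	for i, results := range evaluations {
		for _, a := range m.ProcessEvalResults(results) {
			fmt.Printf("evaluation %d: %s state=%s resolved=%t\n", i+1, a.Instance, a.State, a.Resolved)
		}
	}
}
```

Keeping the previous state per instance is what allows the second evaluation to recognise host=foo as resolved rather than simply Normal.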