Commit 6776694

Add SLA reporting docs

1 parent 5dbea87 commit 6776694

File tree: 4 files changed, +232 -0 lines changed

doc/10-Sla-Reporting.md

Lines changed: 232 additions & 0 deletions

# SLA Reporting

A Service Level Agreement (SLA) is a legally binding contract between a service provider and a customer.
Its purpose is to define the level of service that the supplier promises to deliver to the customer.

Icinga DB is designed to automatically identify and record the most relevant checkable events exclusively in a separate
table. By default, these events are retained forever unless you have set the
[`sla-days` retention option](03-Configuration.md#retention). It is important to note that Icinga DB records the raw
events in the database without any interpretation. In order to generate and visualise SLA reports of specific hosts and
services based on the accumulated events over time,
[Icinga Reporting](https://icinga.com/docs/icinga-reporting/latest/doc/02-Installation/) is the optimal complement,
facilitating comprehensive SLA report generation within a specific timeframe.

## Technical Description

!!! info

    This documentation provides a detailed technical explanation of how Icinga DB fulfils all the
    necessary requirements for the generation of an accurate service level agreement (SLA).

Icinga DB provides built-in support for automatically storing the relevant events of your hosts and services without
any manual action. Generally, these events are every **hard** state change a particular checkable encounters and all the
downtimes scheduled for that checkable throughout its entire lifetime. It is important to note that the aforementioned
events are not analogous to those utilised by [Icinga DB Web](https://icinga.com/docs/icinga-db-web/latest/) for
visualising host and service states.

!!! info

    [Acknowledgements](https://icinga.com/docs/icinga-2/latest/doc/08-advanced-topics/#acknowledgements) are special
    events that mainly control the Icinga 2 notification logic behaviour, and acknowledging a host problem will not
    have an impact on the SLA result for that host.

In case of a hard state change of a monitored host or service, Icinga DB records the precise temporal occurrence of
that state change in milliseconds within the `sla_history_state` table. The following image serves as a visual
illustration of the relational and representational aspects of state change events.

![SLA history state](images/sla_history_state.png)

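As a rough illustration of what ends up in this table, a query along the following lines lists the recorded state
events of a single host within a time window. This is only a sketch: the `host_id` column name and the
`:host_id`/`:from`/`:to` placeholders are assumptions, so consult the actual schema before relying on them.

```sql
-- Sketch: hard state change events of one host between two millisecond timestamps.
SELECT event_time, hard_state, previous_hard_state
FROM sla_history_state
WHERE host_id = :host_id
  AND event_time BETWEEN :from AND :to
ORDER BY event_time;
```
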
In contrast, two timestamps are retained for downtimes, one indicating the commencement of the downtime as
`downtime_start` and the other denoting its end time, designated as `downtime_end`, within the `sla_history_downtime`
table. For the sake of completeness, the following image also provides a visual representation of the
`sla_history_downtime` table and its relations.

![SLA history downtime](images/sla_history_downtime.png)

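Analogously, the downtimes that overlap a given timeframe could be inspected with a query like the following, again a
sketch with an assumed `host_id` column and named placeholders:

```sql
-- Sketch: downtimes of one host that overlap the timeframe :from..:to.
SELECT downtime_start, downtime_end
FROM sla_history_downtime
WHERE host_id = :host_id
  AND downtime_start < :to
  AND downtime_end > :from
ORDER BY downtime_start;
```
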
In certain circumstances, namely when a checkable is created and subsequently never deleted, this approach has been
empirically [demonstrated](#computing-sla-ok-percent) to be sufficient. Nevertheless, in the case of a host being
deleted and then recreated a couple of days later, the generation of SLA reports in
[Icinga Reporting](https://icinga.com/docs/icinga-reporting/latest/doc/02-Installation/) for that host at
the end of the week may yield disparate results, depending on the host state prior to its deletion.

In order to generate SLA reports with the greatest possible accuracy, we have decided to supplement the existing data
with information regarding the **creation** and **deletion** of hosts and services in a new `sla_lifecycle` table,
introduced in Icinga DB **1.3.0**. This new table has a composite primary key consisting of two columns: `id` and
`delete_time`. In this way, the `delete_time` is used to indicate whether a specific checkable has been deleted,
with a `0` value denoting a non-deleted object. In a perfect world, we would use `NULL` instead of `0`, but due to the
[primary key constraints](https://dev.mysql.com/doc/refman/8.4/en/create-table.html#create-table-indexes-keys),
this is not possible. The `id` column represents either the service or host ID for which that particular SLA lifecycle
entry is being generated.

![SLA lifecycle](images/sla_lifecycle.png)

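To make the semantics of the composite key more tangible, here is a deliberately simplified sketch of the idea behind
the table. It is not the actual Icinga DB schema, which contains additional columns such as environment and
host/service references:

```sql
-- Simplified sketch, not the real Icinga DB schema.
CREATE TABLE sla_lifecycle_sketch (
  id          BINARY(20)      NOT NULL,           -- host or service ID
  create_time BIGINT UNSIGNED NOT NULL,           -- creation time in milliseconds
  delete_time BIGINT UNSIGNED NOT NULL DEFAULT 0, -- 0 denotes "not deleted"
  PRIMARY KEY (id, delete_time)                   -- NULL is not allowed here, hence 0
);

-- All checkables that currently exist, i.e. have not been deleted:
SELECT id, create_time FROM sla_lifecycle_sketch WHERE delete_time = 0;
```
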
The upgrade script for `1.3.0` generates a `create_time` SLA lifecycle entry for all existing hosts and services.
However, since that script has no knowledge of the creation time of these existing objects, the timestamp for them is
produced in the following manner: As previously outlined, Icinga DB has the capability to store timestamps for both
hard state changes and downtimes since its first stable release. This enables the upgrade script to identify the
earliest event timestamp of a given checkable from the `sla_history_state` and `sla_history_downtime` tables. In cases
where no timestamps can be obtained from the aforementioned tables, it simply falls back to `now`, i.e. in such
situations, the creation time of the checkable in question is set to the current timestamp.

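Conceptually, the timestamp derived for a single checkable resembles the following query. This is only a sketch: the
`host_id` column name is an assumption, and the actual upgrade script processes all hosts and services in bulk.

```sql
-- Earliest known SLA event of one checkable, falling back to the current time
-- in milliseconds if neither history table contains an entry for it.
SELECT COALESCE(MIN(ts), UNIX_TIMESTAMP() * 1000) AS create_time
FROM (
  SELECT MIN(event_time) AS ts FROM sla_history_state    WHERE host_id = :host_id
  UNION ALL
  SELECT MIN(downtime_start)   FROM sla_history_downtime WHERE host_id = :host_id
) AS earliest;
```
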
### Events Processing

It is noteworthy that Icinga DB does not record checkable **soft** state changes in the regular `service_state`
and `host_state` tables. However, for the `state_history`, for instance, all state changes are retrieved from Redis®
and persisted to the database, irrespective of whether Icinga DB is running in High Availability (HA) or
single-instance mode. Each time Icinga DB processes a state change, it checks whether it is a hard state change and
generates the corresponding SLA state event in the `sla_history_state` table. Similarly, Icinga DB generates a
corresponding SLA history downtime (`sla_history_downtime`) each time it receives a downtime-triggered or
ended/cancelled event from Icinga 2. Notably, if Icinga DB is operating in HA mode, this process takes place in
parallel, i.e. both instances concurrently write their respective histories to the database; consequently, the
aforementioned tables also have to track the endpoint IDs in their `endpoint_id` column.

However, there is no need to be concerned about potential duplicate entries in the database: as long as the events
from both Icinga DB instances have the same timestamp, there will be no duplicates. For instance, downtimes have unique
and deterministic IDs, allowing the second instance to detect if that very same downtime has already been recorded by
the other instance.

The checkable events of the types **created** and **deleted**, on the other hand, are special and represent the life
cycle of the checkables. These two events are always written by a single Icinga DB instance at a time (if in HA mode)
to the `sla_lifecycle` table, denoting the creation and deletion of a checkable in Icinga 2. Unfortunately, Icinga 2
lacks the capability to accurately determine the deletion time of an object. Likewise, Icinga DB is incapable of
identifying the precise timestamp of an object's deletion. Instead, it simply records the time at which the deletion
event for that particular checkable occurred and populates the `sla_lifecycle` table accordingly. Consequently, if a
checkable is deleted while Icinga DB is stopped or not in an operational state, the events that Icinga DB would
otherwise record once it is restarted will not reflect the actual deletion or creation time.

#### Initial Config Sync

Each time either the Icinga DB or Icinga 2 service is reloaded, Icinga DB computes something called a config
`delta`. The config delta determines which objects need to be deleted from the database, to be updated, or are new and
need to be inserted. After successfully computing and dumping the config, Icinga DB then performs simple SQL
`INSERT INTO sla_lifecycle` statements for all checkables that don't already have a **created** SLA event with
`delete_time` set to `0`, and sets their `create_time` to `now`. Additionally, it updates the `delete_time` column
of each existing SLA lifecycle entry whose checkable ID cannot be found in the `host`/`service` tables.

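Ignoring the additional columns of the real schema, the two operations can be pictured roughly as follows. This is a
sketch of the idea, not the statements Icinga DB actually issues, and `:now` stands for the current time in
milliseconds:

```sql
-- Insert a "created" lifecycle entry for every host that does not yet have one
-- with delete_time = 0 (an analogous statement would be needed for services).
INSERT INTO sla_lifecycle (id, create_time, delete_time)
SELECT host.id, :now, 0
FROM host
WHERE NOT EXISTS (
  SELECT 1 FROM sla_lifecycle
  WHERE sla_lifecycle.id = host.id AND sla_lifecycle.delete_time = 0
);

-- Mark lifecycle entries as deleted if their checkable no longer exists.
UPDATE sla_lifecycle
SET delete_time = :now
WHERE delete_time = 0
  AND NOT EXISTS (SELECT 1 FROM host    WHERE host.id    = sla_lifecycle.id)
  AND NOT EXISTS (SELECT 1 FROM service WHERE service.id = sla_lifecycle.id);
```
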
#### Runtime Updates

When a host or service is created or deleted at runtime, either using
[Icinga Director](https://icinga.com/docs/icinga-director/latest/doc/01-Introduction/) or the plain `/v1/objects` API
endpoint, Icinga DB automatically generates an SLA lifecycle entry denoting the checkable creation or deletion time.
For all runtime *created* checkables, the SLA lifecycle entries are inserted using a slightly more sophisticated SQL
`INSERT` statement with an ignore-on-error mechanism, i.e. if it encounters a `duplicate key` error, it simply
suppresses the error and discards the query. In contrast, for runtime *deleted* checkables, it assumes that there is an
SLA lifecycle **created** event for these checkables, and uses a simple `UPDATE` statement setting their deletion time
to `now`. Consequently, should there be no corresponding **created** event for these checkables, the update statement
becomes a no-op, as the [initial config dump](#initial-config-sync) should have created the necessary entries for all
existing objects.

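For illustration, the two runtime operations roughly correspond to the statements below. This is a MySQL-flavoured
sketch with simplified columns: `INSERT IGNORE` merely stands in for the ignore-on-error mechanism described above, and
the placeholders are assumptions.

```sql
-- Runtime creation: insert a "created" lifecycle entry and silently ignore the
-- duplicate key error if one with delete_time = 0 already exists.
INSERT IGNORE INTO sla_lifecycle (id, create_time, delete_time)
VALUES (:checkable_id, :now, 0);

-- Runtime deletion: mark the existing "created" entry as deleted.
-- If no such entry exists, this update simply affects zero rows (a no-op).
UPDATE sla_lifecycle
SET delete_time = :now
WHERE id = :checkable_id AND delete_time = 0;
```
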
### Computing SLA OK percent

The following is a simplified explanation of the current (Icinga DB `1.3.0`) methodology behind the `get_sla_ok_percent`
SQL procedure, used to calculate the SLA OK percent. A fundamental characteristic of Icinga Reporting is that it only
generates reports covering a specific timeframe. Accordingly, the `get_sla_ok_percent` SQL procedure requires the start
and end of the timeframe within which the SLA is to be calculated as input.

First, it is necessary to identify the latest [`hard_state`](#hard-state-vs-previous-hard-state) of the service or host
that occurred at or prior to the timeline start date, and mark it as the initial one. In case the first query fails
to determine a `hard_state` entry, it proceeds to search for a [`previous_hard_state`](#hard-state-vs-previous-hard-state)
entry in the `sla_history_state` table that has been recorded after the start of the timeline. If this approach also
fails to retrieve the desired outcome, the regular non-historical `host_state` or `service_state` table is then
examined for the current state. Should this also produce no results, then `OK` is used as the initial state.

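The first of these lookups can be pictured as follows; this sketch is limited to host events and assumes a `host_id`
column and named placeholders:

```sql
-- Latest hard state recorded at or before the start of the timeframe.
SELECT hard_state
FROM sla_history_state
WHERE host_id = :host_id
  AND event_time <= :timeline_start
ORDER BY event_time DESC
LIMIT 1;
```
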
Next, we need to get the total time of the specified timeframe, expressed in milliseconds, for which we're going to
compute the SLA OK percent (`total_time = timeline_end - timeline_start`).

Afterward, it traverses all state and downtime events within the provided timeframe, performing a series of
simple arithmetic operations. The complete algorithmic process is illustrated in the following pseudocode.

```
total_time := timeline_end - timeline_start

// Mark the timeline start date as our last event time for now.
last_event_time := timeline_start

// The problem time of a given host or service is initially set to zero.
problem_time := 0

// The previous_hard_state is determined dynamically as described above, however,
// for the purposes of this analysis, we'll just set it to 'OK'.
previous_hard_state := OK

// Loop through all the state and downtime events within the provided timeframe ordered by their timestamp.
for event in (sla_history_state, sla_history_downtime) do
    if (event.previous_hard_state is PENDING) then
        // A PENDING state event indicates that the host or service in question has not yet had a check result that
        // clearly identifies its state. Consequently, such events become irrelevant for the purposes of calculating
        // the SLA and we must exclude the duration of that PENDING state from the total time.
        total_time = total_time - (event.event_time - last_event_time)
    else if (previous_hard_state is greater than OK/UP
            AND previous_hard_state is not PENDING
            AND checkable is not in DOWNTIME) then
        // If the previous_hard_state is set to a non-OK state and the host or service in question was not in downtime,
        // we consider that time slot to be problematic and add the duration to the problem time.
        problem_time = problem_time + (event.event_time - last_event_time)
    endif

    // Set the "last_event_time" to the timestamp of the event being currently processed.
    last_event_time = event.event_time

    if (event.type is "state change event") then
        // If the event being currently processed is a state change event, we mark its
        // latest hard state as the previous one for the next iteration.
        previous_hard_state = event.hard_state
    endif
endloop
```

At this point, we have computed the problem time of a particular host or service for a given time frame. The final
step is to determine the percentage of the remaining total time. In other words, we want to find out how much of the
total time is taken up by the problem time, so that we can obtain our final SLA OK percentage result.

```
sla_ok_percent := 100 * (total_time - problem_time) / total_time
```

The following example illustrates the practical implications of this concept. Suppose we have the following SLA events:

```json
{
    "state": [
        {"event_time": 1200, "hard_state": 2, "previous_hard_state": 0},
        {"event_time": 1500, "hard_state": 0, "previous_hard_state": 2}
    ],
    "downtime": [
        {"downtime_start": 1100, "downtime_end": 1300},
        {"downtime_start": 1400, "downtime_end": 1600}
    ]
}
```

We would now like to calculate the SLA OK percent for the timeframe from `1000` to `2000`.

```
total_time := 2000 - 1000
problem_time := 0

- 1000..1200 // in OK state (see the previous_hard_state of the first state event), so nothing to do.
- 1200..1300 // in Critical state, but was also set in downtime from (1100..1300), so here also nothing to do.
- 1300..1400 // still in Critical state and not in downtime, so we count that time slot as problem time.
  problem_time = problem_time + 1400 - 1300

- 1400..1500 // still in Critical state, but we have an active downtime during this period again, so nothing to do.
- 1500..2000 // in OK state

// So, this indicates that our host was either not in a problem state or was set
// to a downtime for 90% of the period from 1000 to 2000.
sla_ok_percent := 100 * (total_time - problem_time) / total_time
```

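If you want to reproduce such a calculation yourself, `get_sla_ok_percent` can be invoked directly on the database. The
invocation shown here is only a sketch: whether it is exposed as a callable function or a stored procedure, as well as
the exact parameter list (host ID, service ID, start and end in milliseconds), depends on your schema version.

```sql
-- Hypothetical invocation: SLA OK percentage of one host (no service) for the
-- timeframe 1000..2000, which would yield 90 for the example events above.
SELECT get_sla_ok_percent(:host_id, NULL, 1000, 2000) AS sla_ok_percent;
```
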
## Appendix

### Hard State vs. Previous Hard State

The `hard_state` column denotes the most recent hard state of the host or service.
Conversely, the `previous_hard_state` column indicates the preceding hard state that was formerly stored in the
`hard_state` column prior to the host or service transitioning to a new hard state. Please refer to the table below
for an illustration of this relationship.

| previous_hard_state           | hard_state |
|-------------------------------|------------|
| PENDING (no check result yet) | OK         |
| OK                            | Warning    |
| Warning                       | Critical   |
| Critical                      | OK         |

doc/images/sla_history_downtime.png (68.3 KB)

doc/images/sla_history_state.png (42 KB)

doc/images/sla_lifecycle.png (35.4 KB)
