|
# SLA Reporting

A Service Level Agreement (SLA) is a legally binding contract between a service provider and a customer.
Its purpose is to define the level of service that the provider promises to deliver to the customer.

Icinga DB is designed to automatically identify and record the most relevant checkable events in dedicated tables.
By default, these events are retained forever unless the retention
[`sla-days` option](03-Configuration.md#retention) is set. It is important to note that Icinga DB records the raw
events in the database without any interpretation. In order to generate and visualise SLA reports of specific hosts
and services based on the accumulated events over time,
[Icinga Reporting](https://icinga.com/docs/icinga-reporting/latest/doc/02-Installation/)
is the optimal complement, facilitating comprehensive SLA report generation within a specific timeframe.

## Technical Description

!!! info

    This documentation provides a detailed technical explanation of how Icinga DB fulfils all the
    necessary requirements for the generation of an accurate service level agreement (SLA).

Icinga DB provides built-in support for automatically storing the relevant events of your hosts and services without
any manual intervention. Generally, these events are every **hard** state change a particular checkable encounters and
all the downtimes scheduled for that checkable throughout its entire lifetime. It is important to note that the
aforementioned events are not analogous to those utilised by [Icinga DB Web](https://icinga.com/docs/icinga-db-web/latest/)
for visualising host and service states.

!!! info

    [Acknowledgements](https://icinga.com/docs/icinga-2/latest/doc/08-advanced-topics/#acknowledgements) are special
    events that mainly control the Icinga 2 notification logic behaviour, and acknowledging a host problem will not
    have an impact on the SLA result for that host.

In case of a hard state change of a monitored host or service, Icinga DB records the precise temporal occurrence of
that state change in milliseconds within the `sla_history_state` table. The following image serves as a visual
illustration of the relational and representational aspects of state change events.

In contrast, two timestamps are retained for downtimes, one indicating the commencement of the downtime as
`downtime_start` and the other denoting its end time, designated as `downtime_end`, within the `sla_history_downtime`
table. For the sake of completeness, the following image also provides a visual representation of the
`sla_history_downtime` table and its relations.

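
For illustration, the raw events backing an SLA report can also be inspected directly. The queries below are a
minimal sketch only: the `:service_id`, `:timeline_start` and `:timeline_end` placeholders are hypothetical, the
`service_id` column is an assumption for this example, and the real tables contain additional columns.

```sql
-- Hard state change events of one service within a timeframe (timestamps in milliseconds).
SELECT event_time, hard_state, previous_hard_state
FROM sla_history_state
WHERE service_id = :service_id
  AND event_time BETWEEN :timeline_start AND :timeline_end
ORDER BY event_time;

-- Downtime periods of the same service overlapping that timeframe.
SELECT downtime_start, downtime_end
FROM sla_history_downtime
WHERE service_id = :service_id
  AND downtime_start < :timeline_end
  AND downtime_end > :timeline_start
ORDER BY downtime_start;
```
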
In certain circumstances, namely when a checkable is created and subsequently never deleted, this approach has been
empirically [demonstrated](#computing-sla-ok-percent) to be sufficient. Nevertheless, in the case of a host being
deleted and then recreated a couple of days later, the generation of SLA reports in
[Icinga Reporting](https://icinga.com/docs/icinga-reporting/latest/doc/02-Installation/) for that host at
the end of the week may yield disparate results, depending on the host state prior to its deletion.

In order to generate SLA reports with the greatest possible accuracy, we have decided to supplement the existing data
with information regarding the **creation** and **deletion** of hosts and services in a new `sla_lifecycle` table,
introduced in Icinga DB **1.3.0**. This new table has a composite primary key consisting of two columns: `id` and
`delete_time`. The `delete_time` column indicates whether a specific checkable has been deleted, with a value of `0`
denoting a non-deleted object. In a perfect world, we would use `NULL` instead of `0`, but due to the
[primary key constraints](https://dev.mysql.com/doc/refman/8.4/en/create-table.html#create-table-indexes-keys),
this is not possible. The `id` column represents the host or service ID for which that particular SLA lifecycle entry
is being generated.

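
As a minimal sketch of how this table can be read (the `:checkable_id` placeholder is hypothetical), the
`delete_time = 0` convention makes it straightforward to distinguish live objects from deleted ones:

```sql
-- Lifecycle entries of one checkable: delete_time = 0 means the object still exists,
-- any other value is the millisecond timestamp at which its deletion was recorded.
SELECT create_time, delete_time
FROM sla_lifecycle
WHERE id = :checkable_id
ORDER BY create_time;
```
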
The upgrade script for `1.3.0` generates a `create_time` SLA lifecycle entry for all existing hosts and services.
However, since that script has no knowledge of the creation time of these existing objects, the timestamp for them is
produced in the following manner: As previously outlined, Icinga DB has been able to store timestamps for both hard
state changes and downtimes since its first stable release. This enables the upgrade script to identify the earliest
event timestamp of a given checkable from the `sla_history_state` and `sla_history_downtime` tables. In cases where no
timestamps can be obtained from the aforementioned tables, it simply falls back to `now`, i.e. in such situations,
the creation time of the checkable in question is set to the current timestamp.
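
The following query sketches that lookup for a single checkable. It is not the literal upgrade script; the
`service_id` column and the `:checkable_id`/`:now` placeholders are assumptions made purely for illustration.

```sql
-- Earliest known SLA event of a checkable, falling back to :now if neither table has an entry.
SELECT COALESCE(
    (SELECT MIN(t) FROM (
        SELECT MIN(event_time)     AS t FROM sla_history_state    WHERE service_id = :checkable_id
        UNION ALL
        SELECT MIN(downtime_start) AS t FROM sla_history_downtime WHERE service_id = :checkable_id
    ) AS earliest),
    :now
) AS create_time;
```
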

### Events Processing

It is noteworthy that Icinga DB does not record checkable **soft** state changes in the regular `service_state`
and `host_state` tables. However, for the `state_history`, for instance, all state changes are retrieved from Redis®
and persisted to the database, irrespective of whether Icinga DB is running in High Availability (HA) or
single-instance mode. Each time Icinga DB processes a state change, it checks whether it is a hard state change and,
if so, generates the corresponding SLA state event in the `sla_history_state` table. Similarly, Icinga DB generates a
corresponding SLA history downtime entry (`sla_history_downtime`) each time it receives a downtime-triggered or
ended/cancelled event from Icinga 2. Notably, if Icinga DB is operating in HA mode, this process takes place in
parallel, i.e. both instances concurrently write their respective histories to the database. Consequently, the
aforementioned tables also have to track the endpoint IDs in their `endpoint_id` column.

However, there is no need to worry about potential duplicate entries in the database: as long as the events from both
Icinga DB instances have the same timestamp, there will be no duplicates. For instance, downtimes have unique and
deterministic IDs, allowing the second instance to detect that the very same downtime has already been recorded by
the other instance.
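
As a rough illustration of that idempotency (columns reduced for brevity, the `downtime_id` column name is an
assumption, and this is not the literal statement Icinga DB executes), an insert keyed on the deterministic downtime
ID simply becomes a no-op when the other HA instance has already written the same row:

```sql
-- The duplicate key error is suppressed if the other instance already recorded this downtime.
INSERT IGNORE INTO sla_history_downtime (downtime_id, downtime_start, downtime_end)
VALUES (:downtime_id, :downtime_start, :downtime_end);
```
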

The checkable events of the types **created** and **deleted**, on the other hand, are special and represent the life
cycle of the checkables. These two events are always written by a single Icinga DB instance at a time (if in HA mode)
to the `sla_lifecycle` table, denoting the creation and deletion of a checkable in Icinga 2. Unfortunately, Icinga 2
lacks the capability to accurately determine the deletion time of an object, and Icinga DB is likewise incapable of
identifying the precise timestamp of an object's deletion. Instead, it simply records the time at which the deletion
event for that particular checkable occurred and populates the `sla_lifecycle` table accordingly. Consequently, if a
checkable is deleted while Icinga DB is stopped or not in an operational state, the events that it records once it is
restarted will not reflect the actual deletion or creation time.

#### Initial Config Sync

Each time either the Icinga DB or the Icinga 2 service is reloaded, Icinga DB computes a so-called config `delta`.
The config delta determines which objects need to be deleted from the database, which need to be updated, and which
are new and need to be inserted. After successfully computing and dumping the config, Icinga DB then performs a
simple SQL `INSERT INTO sla_lifecycle` statement for all checkables that don't already have a **created** SLA event
with `delete_time` set to `0`, and sets their `create_time` to `now`. Additionally, it also updates the `delete_time`
column of each existing SLA lifecycle entry whose checkable ID cannot be found in the `host`/`service` tables.
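
Conceptually, this boils down to two statements per object type (shown here for hosts only). The sketch below is a
simplified paraphrase rather than the exact SQL Icinga DB issues, and the `:now` placeholder stands in for the current
millisecond timestamp:

```sql
-- Create a lifecycle entry for every host that doesn't yet have a "created" event with delete_time = 0.
INSERT IGNORE INTO sla_lifecycle (id, delete_time, create_time)
SELECT host.id, 0, :now
FROM host;

-- Mark lifecycle entries as deleted whose checkable ID can no longer be found in the host table.
UPDATE sla_lifecycle
SET delete_time = :now
WHERE delete_time = 0
  AND NOT EXISTS (SELECT 1 FROM host WHERE host.id = sla_lifecycle.id);
```
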

#### Runtime Updates

When a host or service is created or deleted at runtime, either using
[Icinga Director](https://icinga.com/docs/icinga-director/latest/doc/01-Introduction/) or the plain `/v1/objects` API
endpoint, Icinga DB automatically generates an SLA lifecycle entry denoting the checkable creation or deletion time.
For all runtime *created* checkables, the SLA lifecycle entries are inserted using a slightly more sophisticated SQL
`INSERT` statement with an ignore-on-error mechanism, i.e. if it encounters a `duplicate key` error, it simply
suppresses the error and discards the query. In contrast, for runtime *deleted* checkables, it assumes that there is
already an SLA lifecycle **created** event for these checkables, and uses a simple `UPDATE` statement setting their
deletion time to `now`. Should there be no corresponding **created** event for such a checkable, the update statement
becomes a no-op; however, the [initial config dump](#initial-config-sync) should have created the necessary entries
for all existing objects, so this situation is not expected in practice.
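
In simplified form, the two cases might look roughly as follows; again a sketch rather than the literal statements,
with `:checkable_id` and `:now` as hypothetical placeholders:

```sql
-- Runtime creation: a duplicate key error is silently ignored if an entry already exists.
INSERT IGNORE INTO sla_lifecycle (id, delete_time, create_time)
VALUES (:checkable_id, 0, :now);

-- Runtime deletion: becomes a no-op if no matching "created" entry exists.
UPDATE sla_lifecycle
SET delete_time = :now
WHERE id = :checkable_id
  AND delete_time = 0;
```
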

### Computing SLA OK percent

The following is a simplified explanation of the current (Icinga DB `1.3.0`) methodology behind the
`get_sla_ok_percent` SQL procedure, which is used to calculate the SLA OK percent. Icinga Reporting only ever
generates reports covering a specific timeframe. Accordingly, the `get_sla_ok_percent` SQL procedure requires the
start and end of the timeframe within which the SLA is to be calculated.
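
For illustration, a host report for a given timeframe might be obtained with a call along the following lines. The
exact argument list (host ID, service ID, start and end of the timeframe in milliseconds) is an assumption derived
from the description above and may differ between database backends and schema versions:

```sql
-- SLA OK percentage of a host (NULL service ID) between two millisecond timestamps.
SELECT get_sla_ok_percent(:host_id, NULL, :timeline_start, :timeline_end);
```
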

First, it is necessary to identify the latest [`hard_state`](#hard-state-vs-previous-hard-state) of the host or
service that occurred at or prior to the start of the timeframe, and to mark it as the initial one. In case this
first query fails to determine a `hard_state` entry, the procedure proceeds to search for a
[`previous_hard_state`](#hard-state-vs-previous-hard-state) entry in the `sla_history_state` table that was recorded
after the start of the timeframe. If this approach also fails to retrieve the desired outcome, the regular
non-historical `host_state` or `service_state` table is then examined for the current state. Should this also produce
no results, `OK` is used as the initial state.
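
Expressed as plain queries, the first two lookup steps might look roughly like this; a sketch for a single service,
with the `service_id` column and the placeholders being assumptions for illustration:

```sql
-- Step 1: the latest hard state recorded at or before the start of the timeframe.
SELECT hard_state
FROM sla_history_state
WHERE service_id = :service_id
  AND event_time <= :timeline_start
ORDER BY event_time DESC
LIMIT 1;

-- Step 2 (fallback): the earliest previous_hard_state recorded after the start of the timeframe.
SELECT previous_hard_state
FROM sla_history_state
WHERE service_id = :service_id
  AND event_time > :timeline_start
ORDER BY event_time ASC
LIMIT 1;
```
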

Next, we need to determine the total duration of the specified timeframe, expressed in milliseconds, for which we're
going to compute the SLA OK percent (`total_time = timeline_end - timeline_start`).

Afterwards, the procedure traverses all state and downtime events within the provided timeframe, performing a series
of simple arithmetic operations. The complete algorithmic process is illustrated in the following pseudocode.

```
total_time := timeline_end - timeline_start

// Mark the timeline start date as our last event time for now.
last_event_time := timeline_start

// The problem time of a given host or service is initially set to zero.
problem_time := 0

// The previous_hard_state is determined dynamically as described above, however,
// for the purposes of this analysis, we'll just set it to 'OK'.
previous_hard_state := OK

// Loop through all the state and downtime events within the provided timeframe ordered by their timestamp.
for event in (sla_history_state, sla_history_downtime) do
    if (event.previous_hard_state is PENDING) then
        // A PENDING state event indicates that the host or service in question has not yet had a check result that
        // clearly identifies its state. Consequently, such events become irrelevant for the purposes of calculating
        // the SLA and we must exclude the duration of that PENDING state from the total time.
        total_time = total_time - (event.event_time - last_event_time)
    else if (previous_hard_state is greater than OK/UP
            AND previous_hard_state is not PENDING
            AND checkable is not in DOWNTIME) then
        // If the previous_hard_state is set to a non-OK state and the host or service in question was not in downtime,
        // we consider that time slot to be problematic and add the duration to the problem time.
        problem_time = problem_time + (event.event_time - last_event_time)
    endif

    // Set the "last_event_time" to the timestamp of the event being currently processed.
    last_event_time = event.event_time

    if (event.type is "state change event") then
        // If the event being currently processed is a state change event, we mark its
        // latest hard state as the previous one for the next iteration.
        previous_hard_state = event.hard_state
    endif
endloop
```

At this point, we have computed the problem time of a particular host or service for a given timeframe. The final
step is to determine what percentage of the total time remains. In other words, we want to find out how much of the
total time is taken up by the problem time, so that we can obtain our final SLA OK percentage result.

```
sla_ok_percent := 100 * (total_time - problem_time) / total_time
```

The following example illustrates the practical implications of this concept. Suppose we have the following SLA events:

```json
{
  "state": [
    {"event_time": 1200, "hard_state": 2, "previous_hard_state": 0},
    {"event_time": 1500, "hard_state": 0, "previous_hard_state": 2}
  ],
  "downtime": [
    {"downtime_start": 1100, "downtime_end": 1300},
    {"downtime_start": 1400, "downtime_end": 1600}
  ]
}
```

We would now like to calculate the SLA OK percent for the timeframe from `1000` to `2000`.

```
total_time := 2000 - 1000
problem_time := 0

- 1000..1200 // In OK state (see the previous_hard_state of the first state event), so nothing to do.
- 1200..1300 // In Critical state, but also in downtime (1100..1300), so nothing to do here either.
- 1300..1400 // Still in Critical state and not in downtime, so we count that time slot as problem time.
problem_time = problem_time + 1400 - 1300

- 1400..1500 // Still in Critical state, but we have an active downtime during this period again, so nothing to do.
- 1500..2000 // In OK state.

// So, this indicates that our host was either not in a problem state or was in
// downtime for 90% of the period from 1000 to 2000.
sla_ok_percent := 100 * (total_time - problem_time) / total_time // = 100 * (1000 - 100) / 1000 = 90
```

## Appendix

### Hard State vs. Previous Hard State

The `hard_state` column denotes the most recent hard state of the host or service.
Conversely, the `previous_hard_state` column indicates the preceding hard state that was stored in the
`hard_state` column before the host or service transitioned to a new hard state. Please refer to the table
below for an illustration of this relationship.

| previous_hard_state           | hard_state |
|-------------------------------|------------|
| PENDING (no check result yet) | OK         |
| OK                            | Warning    |
| Warning                       | Critical   |
| Critical                      | OK         |