
Bug: No reset of metrics after each check run #133

Closed
1 task done
y-eight opened this issue Apr 16, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@y-eight
Contributor

y-eight commented Apr 16, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Currently, the sparrow exposes metrics for each check. In this example I reference the latency check, but the issue applies to all checks.

The latency check provides metrics with the status code as a label on the metric itself, e.g.:
sparrow_latency_duration_seconds{status="200",target="https://xyz.telekom.de/"} 0.1750766

If in one of the previous runs/check intervals the status was not 200 (e.g. 500), the metric is exposed with status="500".
If the current run gets a 200 status code, the metric is exposed with status="200", but the metric with status="500" is still present. The problem is that the sparrow is not exposing only the last state. This is misleading, since status 500 is not the last state.
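To make the failure mode concrete, here is a minimal sketch assuming the check backs its metrics with a client_golang GaugeVec (an assumption about sparrow's internals, not confirmed from its source): every new label combination creates a child series that stays registered until it is explicitly deleted.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	latency := prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "sparrow_latency_duration_seconds",
		Help: "Latency of the last check run in seconds.",
	}, []string{"status", "target"})

	// Run 1: the target answered with a 500.
	latency.WithLabelValues("500", "https://xyz.telekom.de/").Set(0.9)
	// Run 2: the target answered with a 200. The status="500" child
	// series is not removed, so both series are exposed side by side.
	latency.WithLabelValues("200", "https://xyz.telekom.de/").Set(0.1750766)

	ch := make(chan prometheus.Metric, 10)
	latency.Collect(ch)
	close(ch)
	fmt.Println("series exposed:", len(ch)) // prints 2, not 1
}
```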

Expected Behavior

The Prometheus metrics should represent only the last state of the check.
Therefore, the old metrics should be cleared before being updated with the current state.
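A minimal sketch of that clearing step, again assuming a client_golang GaugeVec; recordLatency is a hypothetical helper, not sparrow's actual code:

```go
package checks

import "github.com/prometheus/client_golang/prometheus"

// recordLatency is a hypothetical helper: it drops the stale series
// for a target before writing the current run, so only the last
// state remains exposed.
func recordLatency(latency *prometheus.GaugeVec, status, target string, seconds float64) {
	// Delete every previously exposed series for this target, whatever
	// status label it carried; Reset() would instead clear all targets.
	latency.DeletePartialMatch(prometheus.Labels{"target": target})
	latency.WithLabelValues(status, target).Set(seconds)
}
```

DeletePartialMatch requires client_golang v1.12 or newer.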

Steps To Reproduce

No response

Relevant logs and/or screenshots, environment information, etc.

[screenshot of the exposed metrics]

Who can address the issue?

No response

Anything else?

No response

y-eight added the bug label Apr 16, 2024
@lvlcn-t
Member

lvlcn-t commented Apr 16, 2024

@y-eight While this bug is not exactly the same as #77, it's quite similar and I'd see it as one bigger issue that should be addressed in the near future.

@puffitos
Member

puffitos commented May 17, 2024

The problem here is a different one: we are probably misusing Prometheus metrics. A more appropriate solution would be to use a counter with status and url labels (to count how many requests were successful or not) and a plain latency_duration_seconds gauge, which only reports how much time the last call needed to finish.

This way, we always return the core information of the check (the latency) and also expose some additional information.
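A minimal sketch of that split, assuming client_golang; the metric and helper names are illustrative, not sparrow's actual ones:

```go
package checks

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	// Counter: only ever increments, so an earlier 500 shows up as
	// history in the rate rather than as a stale "current" series.
	checkTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "sparrow_latency_checks_total",
		Help: "Latency check runs, partitioned by HTTP status and URL.",
	}, []string{"status", "url"})

	// Gauge without a status label: always holds exactly one value
	// per URL, the duration of the most recent run.
	lastLatency = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "sparrow_latency_duration_seconds",
		Help: "Duration in seconds of the last latency check run.",
	}, []string{"url"})
)

// observeRun records one check run: count the outcome, overwrite the
// last-run duration.
func observeRun(status, url string, elapsed time.Duration) {
	checkTotal.WithLabelValues(status, url).Inc()
	lastLatency.WithLabelValues(url).Set(elapsed.Seconds())
}
```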

EDIT:

Just for completeness' sake: after the chat with @y-eight, we decided that dragging the status of the response along doesn't make much sense for a latency check.

The latency check should only measure how long it took to execute a request. Everything else may be relevant information, but the critical information the check should offer is just the total duration in seconds a request took to complete.
