
Bug: No reset of metrics after each check run #133

Closed
1 task done
y-eight opened this issue Apr 16, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@y-eight
Contributor

y-eight commented Apr 16, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Currently, the sparrow exposes metrics for each check. In this example I reference the latency check, but the issue applies to all checks.

The latency check provides metrics with the status code as a label on the metric itself, e.g.:
sparrow_latency_duration_seconds{status="200",target="https://xyz.telekom.de/"} 0.1750766

If in one of the previous runs/check intervals the status was not 200 (e.g. 500), the metric is exposed with status="500".
If the current run gets a 200 status code, the metric is exposed with status="200", but the metric with status="500" is still present. The problem is that the sparrow is not exposing only the last state. This is misleading, since status 500 is not the last state.
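To make the failure mode concrete, here is a minimal sketch assuming the check backs its metrics with a client_golang GaugeVec (an assumption about sparrow's internals, not confirmed from its source): every new label combination creates a child series that stays registered until it is explicitly deleted.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	latency := prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "sparrow_latency_duration_seconds",
		Help: "Latency of the last check run in seconds.",
	}, []string{"status", "target"})

	// Run 1: the target answered with a 500.
	latency.WithLabelValues("500", "https://xyz.telekom.de/").Set(0.9)
	// Run 2: the target answered with a 200. The status="500" child
	// series is not removed, so both series are exposed side by side.
	latency.WithLabelValues("200", "https://xyz.telekom.de/").Set(0.1750766)

	ch := make(chan prometheus.Metric, 10)
	latency.Collect(ch)
	close(ch)
	fmt.Println("series exposed:", len(ch)) // prints 2, not 1
}
```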

Expected Behavior

The Prometheus metrics should represent only the last state of the check.
Therefore, the old metrics should be cleared before being updated with the current state.
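A minimal sketch of that clearing step, again assuming a client_golang GaugeVec; recordLatency is a hypothetical helper, not sparrow's actual code:

```go
package checks

import "github.com/prometheus/client_golang/prometheus"

// recordLatency is a hypothetical helper: it drops the stale series
// for a target before writing the current run, so only the last
// state remains exposed.
func recordLatency(latency *prometheus.GaugeVec, status, target string, seconds float64) {
	// Delete every previously exposed series for this target, whatever
	// status label it carried; Reset() would instead clear all targets.
	latency.DeletePartialMatch(prometheus.Labels{"target": target})
	latency.WithLabelValues(status, target).Set(seconds)
}
```

DeletePartialMatch requires client_golang v1.12 or newer.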

Steps To Reproduce

No response

Relevant logs and/or screenshots, environment information, etc.

[screenshot of the exposed metrics]

Who can address the issue?

No response

Anything else?

No response

y-eight added the bug label Apr 16, 2024
@lvlcn-t
Member

lvlcn-t commented Apr 16, 2024

@y-eight While this bug is not exactly the same as #77, it's quite similar and I'd see it as one bigger issue that should be addressed in the near future.

@puffitos
Member

puffitos commented May 17, 2024

The problem here is a different one: we are probably misusing Prometheus metrics. A more appropriate solution would be to use a counter with status and url labels (to count how many requests were successful or not) and a plain latency_duration_seconds gauge, which only reports how much time the last call needed to finish.

This way, we always return the core information of the check (the latency) and also expose some additional information.
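A minimal sketch of that split, assuming client_golang; the metric and helper names are illustrative, not sparrow's actual ones:

```go
package checks

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	// Counter: only ever increments, so an earlier 500 shows up as
	// history in the rate rather than as a stale "current" series.
	checkTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "sparrow_latency_checks_total",
		Help: "Latency check runs, partitioned by HTTP status and URL.",
	}, []string{"status", "url"})

	// Gauge without a status label: always holds exactly one value
	// per URL, the duration of the most recent run.
	lastLatency = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "sparrow_latency_duration_seconds",
		Help: "Duration in seconds of the last latency check run.",
	}, []string{"url"})
)

// observeRun records one check run: count the outcome, overwrite the
// last-run duration.
func observeRun(status, url string, elapsed time.Duration) {
	checkTotal.WithLabelValues(status, url).Inc()
	lastLatency.WithLabelValues(url).Set(elapsed.Seconds())
}
```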

EDIT:

Just for completeness' sake: after the chat with @y-eight, we decided that dragging the status of the response along doesn't make much sense for a latency check.

The latency check should only measure how long it took to execute a request. Everything else may be relevant information, but the critical information the check should offer is just the total duration in seconds a request took to complete.
