Bug: No reset of metrics after each check run #133
Comments
The problem here is a different one: we are probably misusing Prometheus metrics. A more appropriate solution would be a counter with status and url labels (to count how many requests succeeded or failed) plus a single latency_duration_seconds metric that reports how long the last call took to finish. This way, we always return the core information of the check (the latency) and also expose some additional information. EDIT: For completeness, after a chat with @y-eight, we decided that carrying the response status along doesn't make much sense for a latency check. The latency check should only measure how long a request took to execute. Everything else may be relevant, but the critical information the check should offer is the total duration in seconds that a request took to complete.
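A rough model of the idea in this comment, written in plain Python rather than against sparrow's actual code or a Prometheus client library. The class name `LatencyMetrics` and its fields are hypothetical. The gauge is keyed by target only, so each target has exactly one latency series that is overwritten on every run, while the optional counter from the first proposal keeps per-status request counts separately.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class LatencyMetrics:
    """Hypothetical in-memory stand-in for the two proposed metrics."""

    # Gauge-like: one value per target, overwritten on every check run.
    latency_duration_seconds: Dict[str, float] = field(default_factory=dict)
    # Counter-like: number of requests per (status, url) pair.
    request_total: Counter = field(default_factory=Counter)

    def observe(self, url: str, status: str, seconds: float) -> None:
        # The status is not part of the gauge key, so no stale series can remain.
        self.latency_duration_seconds[url] = seconds
        self.request_total[(status, url)] += 1


metrics = LatencyMetrics()
url = "https://xyz.telekom.de/"
metrics.observe(url, "500", 0.9)        # illustrative value, not from the issue
metrics.observe(url, "200", 0.1750766)  # value taken from the issue's example
print(metrics.latency_duration_seconds)
print(dict(metrics.request_total))
```

With this shape the status history lives in the counter, and the latency metric always reflects the most recent run for a target.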
Is there an existing issue for this?
Current Behavior
Currently the sparrow exposes metrics for each check. In this example I reference the latency check, but the issue applies to all checks.
The latency check provides metrics with the status code as a label on the metric itself, e.g.:
sparrow_latency_duration_seconds{status="200",target="https://xyz.telekom.de/"} 0.1750766
If a previous run or check interval returned a status other than 200 (e.g. 500), the metric is exposed with status="500". If the current run then returns a 200 status code, the metric is exposed with status="200", but the metric with status="500" is still present. The problem is that the sparrow is not exposing only the last state. This is misleading, since status 500 is no longer the current state.
Expected Behavior
The Prometheus metrics should represent the last state of the check.
Therefore, the old metrics should be cleared before updating with the current state.
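The difference between the current and the expected behavior can be sketched with a small dictionary-based model. This is not sparrow's implementation and does not use a Prometheus client; the function names and the 0.9 latency value are made up for illustration. Each key is a (status, target) label set, mirroring the labels in the example metric above.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a labeled gauge: (status, target) -> seconds.
Series = Dict[Tuple[str, str], float]


def record_without_reset(series: Series, status: str, target: str, seconds: float) -> None:
    """Mimics the reported behavior: a new label set adds a series, old ones stay."""
    series[(status, target)] = seconds


def record_with_reset(series: Series, status: str, target: str, seconds: float) -> None:
    """Drops every earlier series for this target, then stores the latest result."""
    for key in [k for k in series if k[1] == target]:
        del series[key]
    series[(status, target)] = seconds


def render(series: Series) -> List[str]:
    """Formats the series roughly like the Prometheus text exposition."""
    return [
        f'sparrow_latency_duration_seconds{{status="{s}",target="{t}"}} {v}'
        for (s, t), v in sorted(series.items())
    ]


target = "https://xyz.telekom.de/"

stale: Series = {}
record_without_reset(stale, "500", target, 0.9)        # earlier failing run
record_without_reset(stale, "200", target, 0.1750766)  # current successful run
print("\n".join(render(stale)))  # both the 500 and the 200 series are listed

fresh: Series = {}
record_with_reset(fresh, "500", target, 0.9)
record_with_reset(fresh, "200", target, 0.1750766)
print("\n".join(render(fresh)))  # only the latest series is listed
```

Clearing per target, rather than wiping everything, keeps the series of other targets intact while a single check result is being updated.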
Steps To Reproduce
No response
Relevant logs and/or screenshots, environment information, etc.
Who can address the issue?
No response
Anything else?
No response