Set scrape timeout based on Prometheus header #60
base: main
Conversation
@jdlafaye Please sign the Contributor License Agreement! Click here to manually synchronize the status of this Pull Request. See the FAQ for frequently asked questions.
@jdlafaye Thank you for signing the Contributor License Agreement!
@shakuzen Anything I need to do on my end of the PR regarding CircleCI or otherwise?
@@ -48,4 +57,16 @@ public int getWebsocketPort() {
  public void setWebsocketPort(int websocketPort) {
    this.websocketPort = websocketPort;
  }

  public Duration getTimeoutOffset() {
Why is this needed? Shouldn't we set it to whatever value the client wants?
If the timeout were set to the exact value of the Prometheus scrape timeout (given by the header), any one app instance that is too slow to respond (and thus times out) would also cause us to hit the Prometheus scrape timeout (since the two are the same in this hypothetical). As a result of a single instance timing out, no metrics would be successfully scraped from any of the other connected instances, because Prometheus hit its scrape timeout and closed the connection.
So a small offset is needed to make the individual instance scrape timeout slightly less than the Prometheus scrape timeout, allowing time for the metrics to be scraped from all connected instances, concatenated, and returned over the network before Prometheus itself times out.
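To make the intent concrete, here is a minimal sketch of the calculation, assuming the header value is the scrape timeout in (possibly fractional) seconds; the class and method names are illustrative, not the actual code in this PR:

```java
import java.time.Duration;

class ScrapeTimeoutCalculator {
  // Effective per-instance timeout: the Prometheus scrape timeout (taken from the
  // X-Prometheus-Scrape-Timeout-Seconds header) minus a configurable offset, leaving
  // the proxy time to aggregate and return results before Prometheus itself times out.
  static Duration effectiveTimeout(String headerValue, Duration offset) {
    // Prometheus sends the value in seconds, possibly fractional, e.g. "10.000000".
    Duration scrapeTimeout = Duration.ofMillis((long) (Double.parseDouble(headerValue) * 1000));
    Duration effective = scrapeTimeout.minus(offset);
    // Guard against an offset that is misconfigured to be larger than the scrape timeout.
    return effective.isNegative() ? scrapeTimeout : effective;
  }
}
```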
If this is needed for the reason you describe, shouldn't we set the default to a small but nonzero value?
It depends. Defaulting to a zero offset maintains backwards compatibility with the old behavior but doesn't give the benefit of the offset from a timeout perspective.
On the other hand, defaulting to a nonzero offset gives the timeout benefit described above but is technically a breaking change. For example, consider a Prometheus setup with a 10s scrape timeout and assume we chose an offset of 1s. In this case, the RSocket timeout is 9s. Let's say the user running this Prometheus setup previously had an RSocket scrape duration of about 9.5 seconds. Prior to this change, they'd be running their setup without issue since there's no explicit RSocket timeout and 9.5 seconds is lower than the Prometheus 10s timeout. However, with the introduction of this change, suddenly this user's scrapes are timing out on the RSocket side because of the 9-second timeout, and they no longer get the metrics they were getting previously.
I don't think this is necessarily a likely scenario, and the odds of it happening decrease as the default offset decreases, but it's definitely a consideration.
Thoughts?
I really appreciate you thinking about backward compatibility and the potential negative effects of this. It's important for us to take these things seriously and avoid as much pain for users as possible. That said, if we change this behavior in a minor release (as opposed to a patch release), note it in the release notes, and keep it configurable so that people negatively affected can effectively undo the behavior change, I think it is worthwhile to move things in a better direction. That's how I feel anyway. @jonatan-ivanov What do you think?
I looked into this and it makes sense: Prometheus sends the `scrape_timeout` value in the `X-Prometheus-Scrape-Timeout-Seconds` header, and I did not find a way to account for latency on the Prometheus server side. I find this unfortunate since:
- If you have multiple Prometheus servers in different network locations with different latencies (let's pretend the difference is significant enough), the scraped app needs to account for the higher network latency. This feels somewhat wrong because the scraped app should not be aware of this.
- Or if you move a Prometheus server from one network location to another (let's pretend the latency change is significant enough), you need to change the offset for every scraped app.
But if the offset were on the Prometheus side, the issue would be somewhat similar:
- Scraped apps can be in multiple network locations, so Prometheus would have to account for the highest network latency, though configuring this per job could solve it.
- Or if you move an app that is scraped, you need to change all Prometheus instances, though you usually have fewer Prometheus instances than apps.
So I guess this is what we have, it makes sense to me.
I think I agree with @shakuzen about having a small but non-zero default value in a milestone release. To me, the interesting part is what that value should be.
It should be small enough that the negative effect is minimal and big enough that it can account for the network delay, which includes transfer cost, which in turn depends on the response size. My first guess is around 100-500ms.
What do you think, @jdlafaye, about having a small default offset?
I think it's reasonable to set a small default offset as it will provide an immediate benefit in a number of setups. I'd probably lean towards 500ms as that meshes well with the default Prometheus scrape timeout of 10s and also happens to be the same as what Blackbox Exporter uses.
Let me know your thoughts and I can make that change.
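For reference, this is roughly how such a default could be expressed on the properties bean shown in the diff above; the class name, setter, and 500ms default here are illustrative assumptions, not the PR's actual code:

```java
import java.time.Duration;

// Sketch of a properties-style bean holding the offset, following the same
// getter/setter pattern as the existing websocketPort property in the diff.
public class ProxyProperties {
  // Hypothetical default of 500ms as discussed above; 0 would preserve the old behavior.
  private Duration timeoutOffset = Duration.ofMillis(500);

  public Duration getTimeoutOffset() {
    return timeoutOffset;
  }

  public void setTimeoutOffset(Duration timeoutOffset) {
    this.timeoutOffset = timeoutOffset;
  }
}
```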
I think the CircleCI issue was a temporary one and it should be ok on the next push (for some reason I can't re-trigger the job for this PR but was able to re-trigger
What will it look like when this scenario happens and a subset of scrape targets times out? Will it be clear in the logs or metrics which service's scrape is taking too long? It'd be nice if detection and troubleshooting of this problem were easy. I don't think we need to block merging this PR on having that, though.
Currently there is very little to no ability to determine which scrape targets (if any) are timing out. From a metrics perspective, the
From a logs perspective, we get no indication of any request timeouts that are occurring, let alone the specific instances that are timing out. I also believe this would be greatly helpful in debugging these sorts of issues. On its face it seems like this would be easy to implement. However, as far as I can tell, the scrapable apps are currently just identified by a key pair. While we could log the public key to identify an instance timing out, this isn't particularly helpful for debugging. To provide a helpful log, I think we'd need to track at least the source IP & port and potentially some metadata/tags that identify each scrapable app. Then when one of them times out, it's easy to log those details alongside the message.
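As a rough illustration of what that logging could look like (not part of this PR; the instance-address lookup and method names here are hypothetical):

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import reactor.core.publisher.Mono;

class ScrapeLoggingSketch {
  private static final Logger log = LoggerFactory.getLogger(ScrapeLoggingSketch.class);

  // Wrap a single instance's scrape with a timeout and log identifying details
  // (e.g. its remote address) when it times out, instead of failing silently.
  static Mono<String> scrapeWithLogging(Mono<String> scrape, String remoteAddress, Duration timeout) {
    return scrape
        .timeout(timeout)
        .doOnError(TimeoutException.class,
            e -> log.warn("Scrape timed out after {} for instance {}", timeout, remoteAddress));
  }
}
```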
Thank you for the analysis. It sounds like it'll be better to break this out into a separate issue from this pull request. Would you mind opening an issue to track that? Could you also rebase or otherwise trigger a new build of the pull request? The last build seems to have failed and we cannot retrigger it for some reason.
Opened issue #66 for the follow-up improvement regarding logging. I also merged the latest commits from main into my branch but am still having pipeline issues. Not sure if starting a branch directly in this project (instead of the fork) would help; open to suggestions.
The CI build being broken is very strange. I don't know why that is. You can ignore it for now. I guess we'll have to manually build the PR before merging to check it.
This PR sets the proxy timeout for individual instances based on the value of the `X-Prometheus-Scrape-Timeout-Seconds` header (set by Prometheus since `v1.6.0`) minus a configurable offset.

Currently, a single connected instance failing to respond before the Prometheus scrape timeout (due to very high CPU or some other issue) will cause metrics not to be scraped from any of the connected instances. Thus, setting an instance scrape timeout slightly less than the Prometheus scrape timeout allows for dealing with these types of situations, which are exceedingly common in large environments with hundreds to thousands of connected instances.

Note that I set the default timeout offset to 0 to maintain backwards compatibility, but I'd be happy to set it to a value like `1s` that's beneficial by default.
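For readers who want a feel for the mechanism, here is a rough, self-contained sketch of how an endpoint could read the header and derive the per-instance timeout; the path, fallback value, and class name are illustrative assumptions, and the real PrometheusController differs in its details:

```java
import java.time.Duration;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestHeader;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Mono;

@RestController
class ScrapeEndpointSketch {
  private final Duration timeoutOffset = Duration.ofSeconds(1); // hypothetical configured offset

  @GetMapping("/metrics/connected") // hypothetical path
  Mono<String> scrape(
      @RequestHeader(name = "X-Prometheus-Scrape-Timeout-Seconds", required = false) Double timeoutSeconds) {
    Duration perInstanceTimeout = timeoutSeconds == null
        ? Duration.ofSeconds(10) // hypothetical fallback when the header is absent
        : Duration.ofMillis((long) (timeoutSeconds * 1000)).minus(timeoutOffset);
    return scrapeAllInstances(perInstanceTimeout);
  }

  private Mono<String> scrapeAllInstances(Duration perInstanceTimeout) {
    // Placeholder for the proxy's aggregation of metrics from all connected instances,
    // each scraped with the derived per-instance timeout.
    return Mono.just("");
  }
}
```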