Set scrape timeout based on Prometheus header #60

Open
wants to merge 4 commits into main

Conversation

jdlafaye

This PR sets the proxy timeout for individual instances based on the value of the X-Prometheus-Scrape-Timeout-Seconds header (set by Prometheus since v1.6.0) minus a configurable offset.

Currently, a single connected instance failing to respond before the Prometheus scrape timeout (due to very high CPU or some other issue) causes metrics not to be scraped from any of the connected instances. Setting an instance scrape timeout slightly lower than the Prometheus scrape timeout handles this type of situation, which is exceedingly common in large environments with hundreds to thousands of connected instances.

Note that I set the default timeout offset to 0 to maintain backwards compatibility, but I'd be happy to set it to a value like 1s that's beneficial by default.
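
To make the mechanics clearer, here is a minimal sketch of the idea, assuming hypothetical class and method names (this is not the exact code in this PR):

import java.time.Duration;

class ScrapeTimeouts {

    /**
     * Derive the per-instance proxy timeout from the X-Prometheus-Scrape-Timeout-Seconds
     * header minus a configurable offset. Illustrative sketch only.
     */
    static Duration instanceTimeout(String headerValue, Duration timeoutOffset, Duration fallback) {
        if (headerValue == null || headerValue.isBlank()) {
            return fallback; // header absent: Prometheus older than v1.6.0, or a non-Prometheus client
        }
        // Prometheus sends the scrape timeout as a decimal number of seconds, e.g. "10".
        Duration prometheusTimeout = Duration.ofMillis((long) (Double.parseDouble(headerValue) * 1000));
        Duration adjusted = prometheusTimeout.minus(timeoutOffset);
        // Guard against an offset larger than (or equal to) the scrape timeout itself.
        return (adjusted.isNegative() || adjusted.isZero()) ? prometheusTimeout : adjusted;
    }
}

With the default 10s Prometheus scrape timeout and a 1s offset, this yields a 9s per-instance timeout.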

@pivotal-cla

@jdlafaye Please sign the Contributor License Agreement!

@pivotal-cla

@jdlafaye Thank you for signing the Contributor License Agreement!

@jdlafaye
Author

@shakuzen Anything I need to do on my end of the PR regarding CircleCI or otherwise?

@@ -48,4 +57,16 @@ public int getWebsocketPort() {
    public void setWebsocketPort(int websocketPort) {
        this.websocketPort = websocketPort;
    }

    public Duration getTimeoutOffset() {
Member

Why is this needed? Shouldn't we set it to whatever value the client wants?

Author

If the timeout were set to the exact value of the Prometheus scrape timeout (given by the header), any one app instance that is too slow to respond (thus timing out) would also cause us to hit the Prometheus scrape timeout, since the two are the same in this hypothetical. As a result of a single instance timing out, no metrics would be successfully scraped from any of the other connected instances, because Prometheus hit its scrape timeout and closed the connection.

So a small offset is needed to make the individual instance scrape timeout slightly less than the Prometheus scrape timeout, allowing time for the metrics to be scraped from all connected instances, concatenated, and returned over the network before Prometheus itself times out.
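
To make that concrete, here is a minimal Reactor-style sketch, assuming each instance scrape is exposed as a Mono and the results are merged into one response (illustrative names, not the proxy's actual code):

import java.time.Duration;
import java.util.List;
import java.util.concurrent.TimeoutException;

import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

class ScrapeFanOut {

    // Illustrative sketch: scrape every connected instance with the offset-adjusted
    // timeout and drop only the instances that are too slow, so the combined response
    // still reaches Prometheus before its own scrape timeout fires.
    static Mono<String> scrapeAll(List<Mono<String>> instanceScrapes, Duration perInstanceTimeout) {
        return Flux.fromIterable(instanceScrapes)
                .flatMap(scrape -> scrape
                        .timeout(perInstanceTimeout)
                        .onErrorResume(TimeoutException.class, e -> Mono.empty()))
                .collectList()
                .map(bodies -> String.join("\n", bodies));
    }
}

The key point is that a timeout on one instance only removes that instance's data; without the per-instance timeout, the slow instance stalls the whole combined response past the Prometheus scrape timeout.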

Member

If this is needed for the reason you describe, shouldn't we set the default to a small but nonzero value?

Author

It depends. Defaulting to a zero offset maintains backwards compatibility with the old behavior but doesn't give the benefit of the offset from a timeout perspective.

On the other hand, defaulting to a nonzero offset gives the timeout benefit described above but is technically a breaking change. For example, consider a Prometheus setup with a 10s scrape timeout and assume we chose an offset of 1s; the RSocket timeout is then 9s. Say the user running this setup previously had an RSocket scrape duration of about 9.5 seconds. Prior to this change, their setup ran without issue, since there was no explicit RSocket timeout and 9.5 seconds is below Prometheus's 10s timeout. With this change, however, their scrapes suddenly time out on the RSocket side because of the 9-second timeout, and they no longer get the metrics they were getting previously.

I don't think this is necessarily a likely scenario, and the odds of it happening decrease as the default offset decreases, but it's definitely a consideration.

Thoughts?

Member

I really appreciate you thinking about backward compatibility and the potential negative effects of this. It's important for us to take these things seriously and avoid as much pain for users as possible. That said, if we change this behavior in a minor release (as opposed to a patch release), note it in the release notes, and keep it configurable so that people negatively affected can effectively undo the behavior change, I think it is worthwhile to move things in a better direction. That's how I feel anyway. @jonatan-ivanov What do you think?

Member

I looked into this and it makes sense: Prometheus sends the scrape_timeout value in X-Prometheus-Scrape-Timeout-Seconds, and I did not find a way to account for latency on the Prometheus server side. I find this unfortunate since:

  1. If you have multiple Prometheus servers in different network locations with different latencies (let's pretend the difference is significant), the scraped app needs to account for the highest network latency. This feels somewhat wrong because the scraped app should not need to be aware of this.
  2. Or if you move a Prometheus server from one network location to another (again, let's pretend the latency difference is significant), you need to change the offset for every scraped app.

But if the offset were on the Prometheus side, the issue would be somewhat similar:

  1. Scraped apps can be in multiple network locations, so Prometheus would need to account for the highest network latency, though configuring this per job could solve it.
  2. Or if you move an app that is scraped, you need to change all Prometheus instances, though you usually have fewer Prometheus instances than apps.

So I guess this is what we have; it makes sense to me.

I think I agree with @shakuzen about having a small but non-zero default value in a milestone release. To me the interesting part is what that value should be.
It should be small enough that the negative effect is minimal and big enough to account for the network delay, which includes transfer cost and therefore depends on the response size. My first guess is around 100-500ms.

Member

What do you think, @jdlafaye, about having a small default offset?

Author

I think it's reasonable to set a small default offset as it will provide an immediate benefit in a number of setups. I'd probably lean towards 500ms as that meshes well with the default Prometheus scrape timeout of 10s and also happens to be the same as what Blackbox Exporter uses.

Let me know your thoughts and I can make that change.
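
For instance, reusing the illustrative helper sketched in the PR description, a 500ms offset against the Prometheus default scrape_timeout of 10s would work out as follows (hypothetical usage, not real project code):

import java.time.Duration;

class DefaultOffsetExample {
    public static void main(String[] args) {
        // "10" is what Prometheus sends in the header for its default scrape_timeout of 10s.
        Duration timeout = ScrapeTimeouts.instanceTimeout("10", Duration.ofMillis(500), Duration.ofSeconds(10));
        // Prints PT9.5S: the proxy gives up on a slow instance half a second
        // before Prometheus abandons the whole scrape.
        System.out.println(timeout);
    }
}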

@jonatan-ivanov
Member

I think the CircleCI issue was a temporary one and it should be ok on the next push (for some reason I can't re-trigger the job for this PR but was able to re-trigger main).

@shakuzen
Member

What will it look like when this scenario happens and a subset of scrape targets times out? Will it be clear in the logs or metrics which service's scrape is taking too long? It'd be nice if detecting and troubleshooting this problem were easy. I don't think we need to block merging this PR on having that, though.

@jdlafaye
Author

Currently there is little to no ability to determine which scrape targets (if any) are timing out.

From a metrics perspective, the prometheus_proxy_scrape_seconds_count metric tracks the total number of requests timing out via the exception label (see the full metric example below). However, it only tracks the total number of timeouts, not the specific instances contributing to that total. This is workable, since you could write an alert based on this metric to notify you when timeouts are frequent; ideally, the logs could then tell you which specific instances are having issues.

prometheus_proxy_scrape_seconds_count{exception="Did not observe any item or terminal signal within 9000ms in 'source(MonoDefer)' (and no fallback has been configured)", outcome="error"}

From a logs perspective, we get no indication of any request timeouts that are occurring, let alone the specific instances that are timing out. I also believe this would be greatly helpful in debugging these sorts of issues. On its face it seems like this would be easy to implement. However, as far as I can tell, the scrapable apps are currently just identified by a key pair. While we could log the public key to identify an instance timing out, this isn't particularly helpful for debugging. To provide a helpful log, I think we'd need to track at least the source IP & port and potentially some metadata/tags that identify each scrapable app. Then when one of them times out, it's easy to log those details alongside the message.
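
As a rough illustration of what that follow-up could look like, here is a hypothetical sketch; the remote parameter stands in for whatever identifying details (source IP & port, metadata/tags) the proxy would need to start tracking per connection:

import java.net.InetSocketAddress;
import java.time.Duration;
import java.util.concurrent.TimeoutException;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import reactor.core.publisher.Mono;

class InstanceScrapeLogging {

    private static final Logger log = LoggerFactory.getLogger(InstanceScrapeLogging.class);

    // Hypothetical: assumes the proxy keeps the remote address (and possibly tags)
    // alongside each connection, which is exactly what the follow-up would add.
    static Mono<String> scrapeWithLogging(Mono<String> scrape, InetSocketAddress remote, Duration timeout) {
        return scrape
                .timeout(timeout)
                .doOnError(TimeoutException.class,
                        e -> log.warn("Scrape of instance {} timed out after {}", remote, timeout));
    }
}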

@shakuzen
Member

To provide a helpful log, I think we'd need to track at least the source IP & port and potentially some metadata/tags that identify each scrapable app. Then when one of them times out, it's easy to log those details alongside the message.

Thank you for the analysis. It sounds like it'll be better to break this out to a separate issue from this pull request. Would you mind opening an issue to track that?

Could you also rebase or otherwise trigger a new build of the pull request? The last build seems to have failed and we cannot retrigger it for some reason.

@jdlafaye
Author

jdlafaye commented Mar 2, 2023

Opened issue #66 for the follow-up improvement regarding logging. Also merged the latest commits from main into my branch, but I'm still having pipeline issues. Not sure whether starting a branch directly in this project (instead of the fork) would help; open to suggestions.

@shakuzen
Member

shakuzen commented Mar 3, 2023

The CI build being broken is very strange. I don't know why that is. You can ignore it for now. I guess we'll have to manually build the PR before merging to check it.
