Set scrape timeout based on Prometheus header #60

Open
wants to merge 4 commits into main

Conversation

jdlafaye

This PR sets the proxy timeout for individual instances based on the value of the X-Prometheus-Scrape-Timeout-Seconds header (set by Prometheus since v1.6.0) minus a configurable offset.

Currently, a single connected instance failing to respond before the Prometheus scrape timeout (due to very high CPU or some other issue) causes metrics not to be scraped from any of the connected instances. Setting an instance scrape timeout slightly lower than the Prometheus scrape timeout handles this type of situation, which is exceedingly common in large environments with hundreds to thousands of connected instances.

Note that I set the default timeout offset to 0 to maintain backwards compatibility, but I'd be happy to set it to a value like 1s that's beneficial by default.
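
To make the mechanics clearer, here is a minimal sketch of the idea, assuming hypothetical class and method names (this is not the exact code in this PR):

import java.time.Duration;

class ScrapeTimeouts {

    /**
     * Derive the per-instance proxy timeout from the X-Prometheus-Scrape-Timeout-Seconds
     * header minus a configurable offset. Illustrative sketch only.
     */
    static Duration instanceTimeout(String headerValue, Duration timeoutOffset, Duration fallback) {
        if (headerValue == null || headerValue.isBlank()) {
            return fallback; // header absent: Prometheus older than v1.6.0, or a non-Prometheus client
        }
        // Prometheus sends the scrape timeout as a decimal number of seconds, e.g. "10".
        Duration prometheusTimeout = Duration.ofMillis((long) (Double.parseDouble(headerValue) * 1000));
        Duration adjusted = prometheusTimeout.minus(timeoutOffset);
        // Guard against an offset larger than (or equal to) the scrape timeout itself.
        return (adjusted.isNegative() || adjusted.isZero()) ? prometheusTimeout : adjusted;
    }
}

With the default 10s Prometheus scrape timeout and a 1s offset, this yields a 9s per-instance timeout.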

@pivotal-cla

@jdlafaye Please sign the Contributor License Agreement!

@pivotal-cla

@jdlafaye Thank you for signing the Contributor License Agreement!

@jdlafaye
Author

@shakuzen Anything I need to do on my end of the PR regarding CircleCI or otherwise?

@@ -48,4 +57,16 @@ public int getWebsocketPort() {
    public void setWebsocketPort(int websocketPort) {
        this.websocketPort = websocketPort;
    }

    public Duration getTimeoutOffset() {
Member

Why is this needed? Shouldn't we set it to whatever value the client wants?

Author

If the timeout were set to the exact value of the Prometheus scrape timeout (given by the header), any one app instance that is too slow to respond (thus timing out) would also cause us to hit the Prometheus scrape timeout, since the two are the same in this hypothetical. As a result of a single instance timing out, no metrics would be successfully scraped from any of the other connected instances, because Prometheus hit its scrape timeout and closed the connection.

So a small offset is needed to make the individual instance scrape timeout slightly less than the Prometheus scrape timeout, allowing time for the metrics to be scraped from all connected instances, concatenated, and returned over the network before Prometheus itself times out.
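
To make that concrete, here is a minimal Reactor-style sketch, assuming each instance scrape is exposed as a Mono and the results are merged into one response (illustrative names, not the proxy's actual code):

import java.time.Duration;
import java.util.List;
import java.util.concurrent.TimeoutException;

import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

class ScrapeFanOut {

    // Illustrative sketch: scrape every connected instance with the offset-adjusted
    // timeout and drop only the instances that are too slow, so the combined response
    // still reaches Prometheus before its own scrape timeout fires.
    static Mono<String> scrapeAll(List<Mono<String>> instanceScrapes, Duration perInstanceTimeout) {
        return Flux.fromIterable(instanceScrapes)
                .flatMap(scrape -> scrape
                        .timeout(perInstanceTimeout)
                        .onErrorResume(TimeoutException.class, e -> Mono.empty()))
                .collectList()
                .map(bodies -> String.join("\n", bodies));
    }
}

The key point is that a timeout on one instance only removes that instance's data; without the per-instance timeout, the slow instance stalls the whole combined response past the Prometheus scrape timeout.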

Member

If this is needed for the reason you describe, shouldn't we set the default to a small but nonzero value?

Author

It depends. Defaulting to a zero offset maintains backwards compatibility with the old behavior but doesn't give the benefit of the offset from a timeout perspective.

On the other hand, defaulting to a nonzero offset gives the timeout benefit described above but is technically a breaking change. For example, consider a Prometheus setup with a 10s scrape timeout and assume we chose an offset of 1s; the RSocket timeout is then 9s. Say the user running this setup previously had an RSocket scrape duration of about 9.5 seconds. Prior to this change, their setup ran without issue, since there was no explicit RSocket timeout and 9.5 seconds is below Prometheus's 10s timeout. With this change, however, their scrapes suddenly time out on the RSocket side because of the 9-second timeout, and they no longer get the metrics they were getting previously.

I don't think this is necessarily a likely scenario, and the odds of it happening decrease as the default offset decreases, but it's definitely a consideration.

Thoughts?

Member

I really appreciate you thinking about backward compatibility and the potential negative effects of this. It's important for us to take these things seriously and avoid as much pain for users as possible. That said, if we change this behavior in a minor release (as opposed to a patch release), note it in the release notes, and keep it configurable so that people negatively affected can effectively undo the behavior change, I think it is worthwhile to move things in a better direction. That's how I feel anyway. @jonatan-ivanov What do you think?

Member

I looked into this and it makes sense: Prometheus sends the scrape_timeout value in X-Prometheus-Scrape-Timeout-Seconds, and I did not find a way to account for latency on the Prometheus server side. I find this unfortunate since:

  1. If you have multiple Prometheus servers in different network locations with different latencies (let's pretend the difference is significant), the scraped app needs to account for the highest network latency. This feels somewhat wrong because the scraped app should not need to be aware of this.
  2. Or if you move a Prometheus server from one network location to another (again, let's pretend the latency difference is significant), you need to change the offset for every scraped app.

But if the offset were on the Prometheus side, the issue would be somewhat similar:

  1. Scraped apps can be in multiple network locations, so Prometheus would need to account for the highest network latency, though configuring this per job could solve it.
  2. Or if you move an app that is scraped, you need to change all Prometheus instances, though you usually have fewer Prometheus instances than apps.

So I guess this is what we have; it makes sense to me.

I think I agree with @shakuzen about having a small but non-zero default value in a milestone release. To me the interesting part is what that value should be.
It should be small enough that the negative effect is minimal and big enough to account for the network delay, which includes transfer cost and therefore depends on the response size. My first guess is around 100-500ms.

Member

What do you think, @jdlafaye, about having a small default offset?

Author

I think it's reasonable to set a small default offset as it will provide an immediate benefit in a number of setups. I'd probably lean towards 500ms as that meshes well with the default Prometheus scrape timeout of 10s and also happens to be the same as what Blackbox Exporter uses.

Let me know your thoughts and I can make that change.
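
For instance, reusing the illustrative helper sketched in the PR description, a 500ms offset against the Prometheus default scrape_timeout of 10s would work out as follows (hypothetical usage, not real project code):

import java.time.Duration;

class DefaultOffsetExample {
    public static void main(String[] args) {
        // "10" is what Prometheus sends in the header for its default scrape_timeout of 10s.
        Duration timeout = ScrapeTimeouts.instanceTimeout("10", Duration.ofMillis(500), Duration.ofSeconds(10));
        // Prints PT9.5S: the proxy gives up on a slow instance half a second
        // before Prometheus abandons the whole scrape.
        System.out.println(timeout);
    }
}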

@jonatan-ivanov
Member

I think the CircleCI issue was a temporary one and it should be ok on the next push (for some reason I can't re-trigger the job for this PR but was able to re-trigger main).

@shakuzen
Member

What will it look like when this scenario happens and a subset of scrape targets times out? Will it be clear in the logs or metrics which service's scrape is taking too long? It'd be nice if detecting and troubleshooting this problem were easy. I don't think we need to block merging this PR on having that, though.

@jdlafaye
Author

Currently there is little to no ability to determine which scrape targets (if any) are timing out.

From a metrics perspective, the prometheus_proxy_scrape_seconds_count metric tracks the total number of requests timing out via the exception label (see the full metric example below). However, it only tracks the total number of timeouts, not the specific instances contributing to that total. This is workable, since you could write an alert based on this metric to notify you when timeouts are frequent; ideally, the logs could then tell you which specific instances are having issues.

prometheus_proxy_scrape_seconds_count{exception="Did not observe any item or terminal signal within 9000ms in 'source(MonoDefer)' (and no fallback has been configured)", outcome="error"}

From a logs perspective, we get no indication of any request timeouts that are occurring, let alone the specific instances that are timing out. I also believe this would be greatly helpful in debugging these sorts of issues. On its face it seems like this would be easy to implement. However, as far as I can tell, the scrapable apps are currently just identified by a key pair. While we could log the public key to identify an instance timing out, this isn't particularly helpful for debugging. To provide a helpful log, I think we'd need to track at least the source IP & port and potentially some metadata/tags that identify each scrapable app. Then when one of them times out, it's easy to log those details alongside the message.
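
As a rough illustration of what that follow-up could look like, here is a hypothetical sketch; the remote parameter stands in for whatever identifying details (source IP & port, metadata/tags) the proxy would need to start tracking per connection:

import java.net.InetSocketAddress;
import java.time.Duration;
import java.util.concurrent.TimeoutException;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import reactor.core.publisher.Mono;

class InstanceScrapeLogging {

    private static final Logger log = LoggerFactory.getLogger(InstanceScrapeLogging.class);

    // Hypothetical: assumes the proxy keeps the remote address (and possibly tags)
    // alongside each connection, which is exactly what the follow-up would add.
    static Mono<String> scrapeWithLogging(Mono<String> scrape, InetSocketAddress remote, Duration timeout) {
        return scrape
                .timeout(timeout)
                .doOnError(TimeoutException.class,
                        e -> log.warn("Scrape of instance {} timed out after {}", remote, timeout));
    }
}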

@shakuzen
Member

To provide a helpful log, I think we'd need to track at least the source IP & port and potentially some metadata/tags that identify each scrapable app. Then when one of them times out, it's easy to log those details alongside the message.

Thank you for the analysis. It sounds like it'll be better to break this out to a separate issue from this pull request. Would you mind opening an issue to track that?

Could you also rebase or otherwise trigger a new build of the pull request? The last build seems to have failed and we cannot retrigger it for some reason.

@jdlafaye
Author

jdlafaye commented Mar 2, 2023

Opened issue #66 for the follow-up improvement regarding logging. Also merged the latest commits from main into my branch, but I'm still having pipeline issues. Not sure whether starting a branch directly in this project (instead of the fork) would help; open to suggestions.

@shakuzen
Member

shakuzen commented Mar 3, 2023

The CI build being broken is very strange. I don't know why that is. You can ignore it for now. I guess we'll have to manually build the PR before merging to check it.
