
[native] Update prometheus metrics asynchronously #24716

Open

jaystarshot wants to merge 1 commit into master from jay-oss-prom

Conversation

jaystarshot
Member

@jaystarshot jaystarshot commented Mar 12, 2025

Description

Make metric updates asynchronous to rule out any slowness in the main thread caused by extensive metric use; we had previously seen such an issue with histograms.
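
At a high level, the change hands each metric update off to a small dedicated executor instead of applying it on the caller's thread. The sketch below shows the shape of that approach; it is not the actual patch, and folly::CPUThreadPoolExecutor, the class and member names, and the counter-only API are assumptions.

```cpp
#include <folly/executors/CPUThreadPoolExecutor.h>
#include <prometheus/counter.h>
#include <prometheus/family.h>
#include <prometheus/registry.h>

#include <memory>
#include <string>

// Sketch only: metric updates are queued on a small thread pool so the
// calling (query) thread never blocks on prometheus-cpp bookkeeping.
class AsyncReporterSketch {
 public:
  explicit AsyncReporterSketch(int numThreads)
      : executor_(std::make_unique<folly::CPUThreadPoolExecutor>(numThreads)),
        registry_(std::make_shared<prometheus::Registry>()) {}

  void addCounterValue(const std::string& name, double value) {
    // Capture by value and defer the actual registry update to the pool.
    executor_->add([this, name, value]() {
      auto& family =
          prometheus::BuildCounter().Name(name).Register(*registry_);
      family.Add({}).Increment(value);
    });
  }

  void waitForCompletion() {
    // Drains everything queued so far. folly's join() also stops the pool,
    // so in this sketch it is only safe to call once (tests / shutdown).
    executor_->join();
  }

 private:
  std::unique_ptr<folly::CPUThreadPoolExecutor> executor_;
  std::shared_ptr<prometheus::Registry> registry_;
};
```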

Motivation and Context

Impact

Test Plan

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with their default values), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== NO RELEASE NOTE ==

@jaystarshot jaystarshot force-pushed the jay-oss-prom branch 4 times, most recently from 73f3402 to b0c75a3 Compare March 12, 2025 18:59
@jaystarshot jaystarshot marked this pull request as ready for review March 12, 2025 19:15
@jaystarshot jaystarshot requested a review from a team as a code owner March 12, 2025 19:15
@jaystarshot
Member Author

@aditi-pandit @majetideepak Can you please help review? Thanks

@majetideepak majetideepak changed the title Make prometheus metric updation async [native] Update prometheus metrics asynchronously Mar 17, 2025
Collaborator

@majetideepak majetideepak left a comment

@jaystarshot Can you share some data on the evaluation of this change? What are the benefits and what are the cons?

@@ -20,7 +20,7 @@ namespace facebook::presto::prometheus {
 class PrometheusReporterTest : public testing::Test {
  public:
   void SetUp() override {
-    reporter = std::make_shared<PrometheusStatsReporter>(testLabels);
+    reporter = std::make_shared<PrometheusStatsReporter>(testLabels, 1);
Collaborator

Why not test with multiple threads?

Member Author

The problem is that for a gauge metric it's the last value that stays, so the testing has to be single-threaded for it to be deterministic.

Contributor

@jaystarshot: Could we have a single test without gauge metrics but with multi-threading? Without that there is no signal that this code is working.

Member Author

Done
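
For reference, the counter-only multi-threaded test discussed above could look roughly like the following sketch. It is illustrative only: the include paths, the registerMetricExportType/addMetricValue signatures, and the serialized output it checks are assumptions rather than the test that was actually added.

```cpp
#include <gtest/gtest.h>

#include <map>
#include <memory>
#include <string>
#include <thread>
#include <vector>

#include "presto_cpp/main/PrometheusStatsReporter.h" // include path assumed
#include "velox/common/base/StatsReporter.h"

using facebook::presto::prometheus::PrometheusStatsReporter;

TEST(PrometheusReporterAsyncSketchTest, multiThreadedCounter) {
  const std::map<std::string, std::string> testLabels{{"cluster", "test"}};
  // Four worker threads in the reporter's pool; counter increments commute,
  // so the final value is deterministic even if updates land out of order.
  auto reporter = std::make_shared<PrometheusStatsReporter>(testLabels, 4);
  reporter->registerMetricExportType(
      "num_test_updates", facebook::velox::StatType::COUNT);

  std::vector<std::thread> producers;
  for (int t = 0; t < 4; ++t) {
    producers.emplace_back([&]() {
      for (int i = 0; i < 100; ++i) {
        reporter->addMetricValue("num_test_updates", 1);
      }
    });
  }
  for (auto& producer : producers) {
    producer.join();
  }

  // Drain the reporter's executor before looking at the collected value.
  reporter->waitForCompletion();
  const std::string metrics = reporter->fetchMetrics();
  EXPECT_NE(metrics.find("num_test_updates"), std::string::npos);
  EXPECT_NE(metrics.find("400"), std::string::npos);
}
```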

   explicit PrometheusStatsReporter(
-      const std::map<std::string, std::string>& labels);
+      const std::map<std::string, std::string>& labels, int numThreads = 2);
Collaborator

Why default to 2? Should we make this configurable?

Member Author

Yes, we can make it configurable.

@majetideepak
Collaborator

The StatsReporter already runs on a separate thread currently?

@jaystarshot
Member Author

jaystarshot commented Mar 17, 2025

@majetideepak The updates to in-memory metric entities can currently run within the same thread, depending on where it's called from. #23338

This is designed to prevent any query-level slowdown, especially when handling high cardinality metrics. We previously encountered an issue where updating histograms caused latency degradation due to extremely high cardinality.
The background is that I want to be able to add new metrics without worrying about whether they are high cardinality or not.

@jaystarshot jaystarshot force-pushed the jay-oss-prom branch 5 times, most recently from 19cf1fe to 99979d6 Compare March 18, 2025 03:07
@majetideepak
Collaborator

@jaystarshot #23338 got resolved with #23357

This is designed to prevent any query-level slowdown, especially when handling high cardinality metrics.

Did this PR solve the above issue from your experiments? This PR is only moving the work from one thread to another.
In a busy system, the query can still be slow if the thread updating the metrics takes long.

I am also curious if the updates can now happen out of order. Will this lead to a less precise history?

@jaystarshot
Member Author

jaystarshot commented Mar 18, 2025

@majetideepak this issue wasn't in a busy system; it was just slowing the main thread. Also, having a limited thread pool for metric updates means that even when we are busy, we can contain any degradation to the metric updates themselves (limited threads).
No, accuracy won't be impacted, since there is no guarantee when these metrics will be pulled by the backend.

@jaystarshot
Member Author

Also, I haven't tested whether it fixed that issue since that is legacy, but the current way of populating metrics is a ticking time bomb if someone creates a high-cardinality metric.

majetideepak
majetideepak previously approved these changes Mar 18, 2025
Collaborator

@majetideepak majetideepak left a comment

Thanks, @jaystarshot

Contributor

@aditi-pandit aditi-pandit left a comment

Thanks @jaystarshot. Had one minor question.

@@ -241,4 +246,8 @@ std::string PrometheusStatsReporter::fetchMetrics() {
   return serializer.Serialize(impl_->registry->Collect());
 }

+void PrometheusStatsReporter::waitForCompletion() const {
Contributor

Does this need to be called in the destructor as well? That would ensure we don't lose metrics at shutdown.

Member Author

Nope, we don't need it in the destructor. Prometheus is pull-based, and unless the metrics are pulled by the backend before shutdown, that would be useless.

return resultOpt.value();
}
// default
return 2;
Contributor

Nit: Abstract this with a named constant.
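
A sketch of one way to address the nit, with hypothetical names (the helper function and the idea of the value arriving as an optional config property are assumptions, not the patch's code):

```cpp
#include <folly/Optional.h>

namespace {
// Named default instead of a bare literal, per the review nit.
constexpr int kDefaultStatsReporterThreads = 2;
} // namespace

// Hypothetical helper: 'numThreadsFromConfig' stands in for however the
// patch reads the property; the point is only that the fallback has a name.
int statsReporterThreadCount(const folly::Optional<int>& numThreadsFromConfig) {
  if (numThreadsFromConfig.hasValue()) {
    return numThreadsFromConfig.value();
  }
  return kDefaultStatsReporterThreads;
}
```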

Contributor

@aditi-pandit aditi-pandit left a comment

Thanks @jaystarshot
