Metrics collect stress test #2247
02be567
to
1fc6d80
Compare
@fraillt Thanks for the continued interest in making metrics better! I have two suggestions around sending PRs that I would like you to consider. These are mainly directed towards making the PR review process easy and simple (which would in turn result in PRs getting merged faster).
I'll try to be more careful regarding force-pushes in the future. Basically, what I'm trying to say is that I'm afraid that if I create a PR just for refactoring, it might look worthless on its own... All changes made in this PR relate to one file that brings new functionality.
I think we could still have a simpler split up of PRs. Something like this:
1fc6d80
to
b4fffe6
Compare
I force-pushed because it basically deleted everything I did except for one file, so I guess it is OK :).
barrier.wait();
let now = Instant::now();
let mut count = 0;
while is_collecting.load(Ordering::Acquire) {
I think we could have a much simpler and more effective setup here. If we want to know whether running collect stops the world, it's better to spawn a thread that keeps calling `reader.collect` in a loop, and have other threads record measurements simultaneously.
I did that initially. The problem was that the collection phase was not realistic, as it had 0 measurements and basically held the lock in a loop. So I decided to make it more realistic by generating some measurements so that the collection wouldn't be empty.
> The problem was that the collection phase was not realistic, as it had 0 measurements and basically held the lock in a loop

I don't follow. You could always record some measurements before you call collect. My concern is around the way this test is set up. The test is using some custom "iterations", which seems unnecessary. You could simply do this instead:
1. Emit some measurements.
2. Spawn a thread that runs collect in a loop.
3. Spawn additional threads that record measurements.
4. Calculate throughput for the measurements recorded in step 3.
If doing the above leads to zero measurements being recorded, then we have our answer: collect is going to "stop the world" for as long as it runs.
As I mentioned in this comment, unless we plan to use a more efficient synchronization mechanism in `ValueMap`, this stress test would not be adding any value.
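The four steps above can be sketched roughly as follows. This is a minimal, std-only illustration, not the actual stress test: `Store`, `record`, `collect`, and `run_stress` are hypothetical stand-ins, with a plain `RwLock<HashMap>` playing the role of `ValueMap` and a write-locking `collect` mimicking the behavior described in this thread.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::RwLock;
use std::thread;
use std::time::Duration;

// Hypothetical stand-in for ValueMap: attribute-set name -> running sum.
type Store = RwLock<HashMap<&'static str, u64>>;

// Stand-in for recording one measurement.
fn record(store: &Store, key: &'static str, value: u64) {
    *store.write().unwrap().entry(key).or_insert(0) += value;
}

// Stand-in for `reader.collect`: holds the write lock while reading every entry.
fn collect(store: &Store) -> u64 {
    let guard = store.write().unwrap();
    guard.values().sum()
}

// Returns (measurements recorded in step 3, final sum seen by collect).
fn run_stress(num_recorders: usize, duration: Duration) -> (u64, u64) {
    let store: Store = RwLock::new(HashMap::new());
    let stop = AtomicBool::new(false);
    let recorded = AtomicU64::new(0);

    // Step 1: emit some measurements so that collect is never empty.
    for key in ["a", "b", "c"] {
        record(&store, key, 1);
    }

    thread::scope(|s| {
        // Step 2: one thread runs collect in a loop.
        s.spawn(|| {
            while !stop.load(Ordering::Relaxed) {
                collect(&store);
            }
        });
        // Step 3: additional threads record measurements concurrently.
        for _ in 0..num_recorders {
            s.spawn(|| {
                let mut local = 0u64;
                while !stop.load(Ordering::Relaxed) {
                    record(&store, "a", 1);
                    local += 1;
                }
                recorded.fetch_add(local, Ordering::Relaxed);
            });
        }
        thread::sleep(duration);
        stop.store(true, Ordering::Relaxed);
    });

    // Step 4: the caller divides the first value by `duration` for throughput.
    (recorded.load(Ordering::Relaxed), collect(&store))
}
```

Dividing the first returned value by the run duration gives the update throughput while collect is running; a value at or near zero would suggest that collect effectively stops the world in this setup.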
First of all, this test is adding value, as it is able to measure the difference between different "collect" implementations.
Second, running collect in a loop doesn't work with delta temporality, because after the first run there will be zero measurements and all other iterations will basically hold the write lock in a loop, while in reality no one runs collect in a loop: usually there are many measurements, and it's important that the write lock is held for as short a time as possible (which is not the case with the current implementation, hence I was able to improve it, and this test actually measures it).
I could probably have different tests for delta and cumulative temporality, but I decided to emulate a realistic environment.
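To illustrate why a collect loop behaves differently for the two temporalities, here is a small std-only sketch. The `Store` type and both collect functions are hypothetical simplifications, not the SDK's code: the delta variant drains the map under the write lock, so a second back-to-back collect sees nothing, while the cumulative variant keeps its state.

```rust
use std::collections::HashMap;
use std::mem;
use std::sync::RwLock;

// Hypothetical stand-in for the per-instrument storage.
type Store = RwLock<HashMap<&'static str, u64>>;

fn record(store: &Store, key: &'static str, value: u64) {
    *store.write().unwrap().entry(key).or_insert(0) += value;
}

// Delta: take the whole map out under the write lock, leaving it empty.
fn collect_delta(store: &Store) -> HashMap<&'static str, u64> {
    let mut guard = store.write().unwrap();
    mem::take(&mut *guard)
}

// Cumulative: nothing is reset, so a read lock and a copy are enough here.
fn collect_cumulative(store: &Store) -> HashMap<&'static str, u64> {
    store.read().unwrap().clone()
}
```

Calling `collect_delta` twice in a row with no `record` in between returns an empty map the second time, which is the "zero measurements" situation described above; `collect_cumulative` returns the same entries on every call.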
> While in reality no one runs collect in a loop

In reality, no one runs collect the way it's run in this stress test either. Collect would usually be run periodically at 10/30/60-second intervals.

> First of all, this test is adding value, as it is able to measure the difference between different "collect" implementations.

> usually there are many measurements, and it's important that the write lock is held for as short a time as possible (which is not the case with the current implementation, hence I was able to improve it, and this test actually measures it).

This test claims to measure update throughput "while collect is running". The actual implementation of this test, however, relies on squeezing in some updates before collect runs. That's not the same as testing "while collect is running".
I would love to improve collect to take the write lock for the shortest possible time, but that's better tested using a benchmark.
> The actual implementation of this test, however, relies on squeezing in some updates before collect runs.

There's a lot of truth to this statement, but nonetheless it appears to measure something useful :)
| Temporality | Type of change | Before change (measurements/ms) | After change (measurements/ms) |
|---|---|---|---|
| Cumulative | changed write lock to read lock on the hashmap | 9 | 840 |
| Delta | reduced the amount of time the write lock is held | 17 | 56 |
Ideally, I would like to "catch" changes to how the attribute-set hashmap is locked for both kinds of measurements: those with existing attribute sets and those with new attribute-set combinations.
Maybe we need to test these things separately... I don't know, any ideas are welcome :)
let barrier = Barrier::new(num_threads + 1);
std::thread::scope(|s| {
// first create bunch of measurements,
// so that collection phase wouldn't be "empty"
This can be done without spawning new threads (that is, before you spawn the threads for running collect and recording measurements).
};
is_collecting.store(true, Ordering::Release);
barrier.wait();
reader.collect(&mut rm).unwrap();
In the PR description, you mentioned that you saw a major improvement when running this with the changes for #2145 locally. I'm curious, the code changes for #2145 are mostly related to reuse of collect code across instruments. How is that improving the perf numbers?
Throughput perf here should mostly depend on the thread synchronization mechanism used for collect and update. Currently, it's using a `RwLock` in `ValueMap`, and the collect operation requires a write lock, which would "stop the world" for as long as collect holds the write lock. Unless you change that, I wouldn't expect any significant improvement in throughput here.
For delta temporality, I was able to reduce the amount of time the write lock is held while collecting.
For cumulative temporality, I simply changed to a read lock (currently there is a write lock, although cumulative temporality doesn't modify anything, and IMO locking individual measurements is sufficient and much better than the "stop the world until all measurements are read" approach). Although a read lock is better, measurements with new attribute sets will still be blocked. I see two more alternatives:
- clone the entire attribute-set hashmap (acquire read lock, clone hashmap, release lock, iterate individual measurements)
- implement some sort of sharding (more throughput at the cost of single-measurement performance)

In any case, there are ways to improve it :) so I created this PR so that we could actually measure different ideas.
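The first alternative (clone the hashmap under a read lock, then read the individual measurements after releasing it) could look roughly like this. It is a sketch under assumptions, not the SDK's implementation: `Store`, `record`, and `collect_cumulative` are hypothetical names, and each measurement is modeled as an `Arc<AtomicU64>` so that cloning the map only copies cheap handles.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, RwLock};

// Hypothetical storage: attribute-set key -> shared atomic sum.
type Store = RwLock<HashMap<String, Arc<AtomicU64>>>;

fn record(store: &Store, key: &str, value: u64) {
    // Fast path: an existing attribute set only needs the read lock.
    if let Some(cell) = store.read().unwrap().get(key) {
        cell.fetch_add(value, Ordering::Relaxed);
        return;
    }
    // Slow path: a new attribute set needs the write lock to insert.
    store
        .write()
        .unwrap()
        .entry(key.to_owned())
        .or_insert_with(|| Arc::new(AtomicU64::new(0)))
        .fetch_add(value, Ordering::Relaxed);
}

fn collect_cumulative(store: &Store) -> Vec<(String, u64)> {
    // The read guard is a temporary, so the lock is released at the end of
    // this statement, before any individual measurement is read.
    let snapshot: HashMap<String, Arc<AtomicU64>> = store.read().unwrap().clone();

    let mut out: Vec<(String, u64)> = snapshot
        .into_iter()
        .map(|(key, cell)| (key, cell.load(Ordering::Relaxed)))
        .collect();
    out.sort(); // deterministic order, for the example only
    out
}
```

With this shape, new attribute sets are blocked only for the duration of the clone rather than for the whole read-out; the cost is an extra allocation and copy of the map on every collect.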
These are good ideas!
> For cumulative temporality, I simply changed to a read lock (currently there is a write lock, although cumulative temporality doesn't modify anything,

Great! In that case, we should be able to test update throughput while `collect` runs in a loop for `Cumulative` and use a simple benchmark for `Delta`. 🙂
> implement some sort of sharding (more throughput at the cost of single-measurement performance)
I did try sharding approach earlier in #1564 - to reduce the write-lock duration by locking only the specific shard of measurements during collection. However, I encountered some concurrency issues. It’s still an approach worth revisiting at some point.
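For readers unfamiliar with the idea, a sharded store might be sketched as below. This is purely illustrative and is not the approach from #1564: `ShardedStore` and its methods are hypothetical, and the sketch uses one `Mutex<HashMap>` per shard, selected by hashing the attribute-set key.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::mem;
use std::sync::Mutex;

// Hypothetical sharded storage: each shard has its own lock.
struct ShardedStore {
    shards: Vec<Mutex<HashMap<String, u64>>>,
}

impl ShardedStore {
    fn new(num_shards: usize) -> Self {
        assert!(num_shards > 0, "at least one shard is required");
        let shards = (0..num_shards).map(|_| Mutex::new(HashMap::new())).collect();
        Self { shards }
    }

    // Pick the shard for a key by hashing it.
    fn shard_for(&self, key: &str) -> &Mutex<HashMap<String, u64>> {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        &self.shards[(hasher.finish() as usize) % self.shards.len()]
    }

    // An update only contends with work on the same shard.
    fn record(&self, key: &str, value: u64) {
        let mut shard = self.shard_for(key).lock().unwrap();
        *shard.entry(key.to_owned()).or_insert(0) += value;
    }

    // Delta-style collect: drain one shard at a time, so only that shard is
    // locked while its map is swapped out.
    fn collect_delta(&self) -> HashMap<String, u64> {
        let mut out = HashMap::new();
        for shard in &self.shards {
            let drained = mem::take(&mut *shard.lock().unwrap());
            out.extend(drained);
        }
        out
    }
}
```

Note that in this sketch the collected result is not a single point-in-time snapshot across shards, since updates can land in one shard while another is being drained.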
> Great! In that case, we should be able to test update throughput while collect runs in a loop for Cumulative and use a simple benchmark for Delta. 🙂
+1
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #2247 +/- ##
=====================================
Coverage 79.3% 79.3%
=====================================
Files 121 121
Lines 20968 20968
=====================================
Hits 16646 16646
Misses 4322 4322
Changes
Added two tests (delta and cumulative temporality) to measure how much the metrics collection phase impacts measurement throughput.
The implementation consists of running the collection phase in a loop while simultaneously testing measurement throughput.
The main concern isn't the speed of the collection phase itself, but ensuring that OpenTelemetry Metrics doesn't significantly contribute to p99 latency.
I didn't customize it for different metric types; the idea is to eventually land #2117, so that I could unify the collection phase for other metrics (#2145).
Additionally, significantly improved code reuse for the `stress` crate.
Here are some results on my machine:
Out of curiosity, I have also implemented #2117 and #2145 locally, and the results are as follows:
Merge requirement checklist
`CHANGELOG.md` files updated for non-trivial, user-facing changes