Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve average performance of long task timer for out-of-order stopping #5591

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

fogninid
Copy link

The current implementation of DefaultLongTaskTimer optimizes for O(1) task starting, but performs poorly when stopping tasks that are not at the beginning of its internal queue (the oldest ones).
At the worst case, when calling stop immediately after starting, the stop call is currently expected to require O(N) operations.
Depending on the distribution of task lifetimes, the average case would be O(1) only for applications that stop tasks in exactly the same order as they were started; applications completing out-of-order and with unbiased lifetime would experience O(N) average.

Task stopping should not have any intrinsic difference to starting: both action are expected to be performed on application threads, and for a well-functioning application (that is not leaking of piling-up tasks) every call to start is matched by exactly one call to stop.

@pivotal-cla
Copy link

@fogninid Please sign the Contributor License Agreement!

Click here to manually synchronize the status of this Pull Request.

See the FAQ for frequently asked questions.

@pivotal-cla
Copy link

@fogninid Thank you for signing the Contributor License Agreement!

@fogninid fogninid force-pushed the long_task_timer_performance branch 2 times, most recently from aba71f6 to 295d518 Compare October 14, 2024 14:55
@shakuzen shakuzen added enhancement A general enhancement performance Issues related to general performance module: micrometer-core An issue that is related to our core module labels Oct 15, 2024
@shakuzen shakuzen added this to the 1.15.0-M1 milestone Oct 16, 2024
Copy link
Member

@shakuzen shakuzen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pull request. What you wrote makes sense. Still, I wanted to verify with some JMH benchmarks so we can put some numbers behind it and have them in place for checking any future changes that would affect performance around this. I made #5595 to add JMH benchmarks.

@@ -92,7 +96,9 @@ public DefaultLongTaskTimer(Id id, Clock clock, TimeUnit baseTimeUnit,
@Override
public Sample start() {
SampleImpl sample = new SampleImpl();
activeTasks.add(sample);
if (!activeTasks.add(sample)) {
throw new IllegalStateException();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't throw exceptions when recording metrics, generally. In what case would this happen?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a leftover from testing. See the implementation of SampleImpl#compareTo.
I was a bit surprised that the skip-list considers the comparator returning "0" as sufficient test for equality, and does not check Object#equals.
I am intentially leaving equality to be only reference identity (obviously two samples are never the same), but the naive comparator can not distinguish if two samples are created at exactly the same startTime. Putting hashCode in the comparison makes it "almost always" a total order, but it can theoretically still fail for hash collisions.

You can try removing the hashCode check from the comparator, and see some unit tests will fail.

A fail-proof solution would be to use a strictly increasing counter (for example an AtomicLong) instead of the time based one, but it does not look worth to increase the heap memory usage of SampleImpl.

I tend to say it is better to remove the exception that you mention here and accept that some samples can be lost (not accounted) if bothe the monotonic time does not increase and there are hash collisions.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shakuzen how do you suggest me to proceed?

Should I change the DefaultLongTaskTimer.SampleImpl class to use more heap memory and store an AtomicInteger to be always correct even in those very rare corner cases?

Or do you have another suggestions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late response. Now that 1.14 GA is released, I'll get back to reviewing this so we can hopefully get it merged. I'll take a look tomorrow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want to add an AtomicInteger field or similar; everyone would have to pay for that extra memory cost even though two samples having the same start time in nanoseconds and the same hash code is incredibly rare. I'll try to look into it more tomorrow to see what we can do.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[..] everyone would have to pay [..]

Good point, what about making only the collisions pay for storing more memory?
d8dcd3b

The additional functions and indirect calls should also almost never happen, so the only disadvantage I see would be in the slightly non-obvious code

@shakuzen
Copy link
Member

Sharing results from my MacBook Pro M1 with the benchmarks in the linked PR with 10,000 active samples and stopping a random sample. As expected, start is slower but the overall time to start and stop on average (with a random sample) is better.

Before

Benchmark                                   Mode  Cnt   Score   Error  Units
DefaultLongTaskTimerBenchmark.start           ss  200   0.495 ± 0.064  us/op
DefaultLongTaskTimerBenchmark.startAndStop    ss  200  15.351 ± 2.508  us/op
DefaultLongTaskTimerBenchmark.stopRandom      ss  200  14.784 ± 2.584  us/op

After

Benchmark                                   Mode  Cnt  Score   Error  Units
DefaultLongTaskTimerBenchmark.start           ss  200  1.002 ± 0.116  us/op
DefaultLongTaskTimerBenchmark.startAndStop    ss  200  6.338 ± 0.577  us/op
DefaultLongTaskTimerBenchmark.stopRandom      ss  200  5.154 ± 0.608  us/op

@fogninid fogninid force-pushed the long_task_timer_performance branch from 295d518 to 44f147a Compare October 16, 2024 12:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A general enhancement module: micrometer-core An issue that is related to our core module performance Issues related to general performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants