-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
improve average performance of long task timer for out-of-order stopping #5591
base: main
Are you sure you want to change the base?
improve average performance of long task timer for out-of-order stopping #5591
Conversation
@fogninid Please sign the Contributor License Agreement! Click here to manually synchronize the status of this Pull Request. See the FAQ for frequently asked questions. |
@fogninid Thank you for signing the Contributor License Agreement! |
aba71f6
to
295d518
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the pull request. What you wrote makes sense. Still, I wanted to verify with some JMH benchmarks so we can put some numbers behind it and have them in place for checking any future changes that would affect performance around this. I made #5595 to add JMH benchmarks.
@@ -92,7 +96,9 @@ public DefaultLongTaskTimer(Id id, Clock clock, TimeUnit baseTimeUnit, | |||
@Override | |||
public Sample start() { | |||
SampleImpl sample = new SampleImpl(); | |||
activeTasks.add(sample); | |||
if (!activeTasks.add(sample)) { | |||
throw new IllegalStateException(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't throw exceptions when recording metrics, generally. In what case would this happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a leftover from testing. See the implementation of SampleImpl#compareTo.
I was a bit surprised that the skip-list considers the comparator returning "0" as sufficient test for equality, and does not check Object#equals.
I am intentially leaving equality to be only reference identity (obviously two samples are never the same), but the naive comparator can not distinguish if two samples are created at exactly the same startTime. Putting hashCode in the comparison makes it "almost always" a total order, but it can theoretically still fail for hash collisions.
You can try removing the hashCode check from the comparator, and see some unit tests will fail.
A fail-proof solution would be to use a strictly increasing counter (for example an AtomicLong) instead of the time based one, but it does not look worth to increase the heap memory usage of SampleImpl.
I tend to say it is better to remove the exception that you mention here and accept that some samples can be lost (not accounted) if bothe the monotonic time does not increase and there are hash collisions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shakuzen how do you suggest me to proceed?
Should I change the DefaultLongTaskTimer.SampleImpl
class to use more heap memory and store an AtomicInteger
to be always correct even in those very rare corner cases?
Or do you have another suggestions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late response. Now that 1.14 GA is released, I'll get back to reviewing this so we can hopefully get it merged. I'll take a look tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want to add an AtomicInteger field or similar; everyone would have to pay for that extra memory cost even though two samples having the same start time in nanoseconds and the same hash code is incredibly rare. I'll try to look into it more tomorrow to see what we can do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[..] everyone would have to pay [..]
Good point, what about making only the collisions pay for storing more memory?
d8dcd3b
The additional functions and indirect calls should also almost never happen, so the only disadvantage I see would be in the slightly non-obvious code
Sharing results from my MacBook Pro M1 with the benchmarks in the linked PR with 10,000 active samples and stopping a random sample. As expected, start is slower but the overall time to start and stop on average (with a random sample) is better. Before
After
|
…/stop of many tasks
…/stop of many tasks
295d518
to
44f147a
Compare
The current implementation of DefaultLongTaskTimer optimizes for O(1) task starting, but performs poorly when stopping tasks that are not at the beginning of its internal queue (the oldest ones).
At the worst case, when calling stop immediately after starting, the stop call is currently expected to require O(N) operations.
Depending on the distribution of task lifetimes, the average case would be O(1) only for applications that stop tasks in exactly the same order as they were started; applications completing out-of-order and with unbiased lifetime would experience O(N) average.
Task stopping should not have any intrinsic difference to starting: both action are expected to be performed on application threads, and for a well-functioning application (that is not leaking of piling-up tasks) every call to start is matched by exactly one call to stop.