Investigate slowdown observed in PR #24344 #547
Investigation seems to show that the slowdown is caused by excessive cache misses in ossl_rcu_read_lock. Starting from the base numbers @mattcaswell supplied, I did some perf runs and found that 13% of our cumulative cache miss samples, and 22% of our cache cycle samples, were in get_hold_current_qp, specifically on the atomic load instruction in which we validate that a writer hasn't updated our quiescent point. That shouldn't happen more than once during a read lock cycle, but it was clearly happening much more often.

Investigation of the code indicated that we failed to call ossl_synchronize_rcu after a write lock was dropped. Failing to do this means that the current quiescent point will be reused, causing both significant cache line bouncing between CPUs and multiple iterations through get_hold_current_qp on the read side, which is likely where the additional time samples are accumulating.

By adding ossl_synchronize_rcu after ossl_rcu_write_unlock at the added call sites, I get this performance comparison with the EVP fetch test, running 50 iterations of the test at each thread count and averaging them:
So this appears significantly better. That said, I have two remaining concerns:
How many cores do you have on the machine?
Where do you see mutation under a read lock occurring?
Ah! Mea Culpa! Should we integrate the
Answering my own question: yes, in the event that we are rolling back the
I updated openssl/openssl#24344 to add the missing
@mattcaswell yes, in that case you don't want to immediately call synchronize. You also may not want to call it if you plan to take the write lock again shortly (i.e. you can batch synchronization if you like). @t8m my system has 20 Xeon E5 cores, hyper-threaded, so effectively 40 cores. It's super odd that we slow down so much at 16 threads, then get faster again.
@mattcaswell actually, I take back what I said about being concerned about improper protections. I was concerned that when you did a shallow dup of the algs array, the member elements were going to point to the same memory (i.e. the IMPLEMENTATION stack was going to still be shared), but I see now that you dup that as well, so I think we're actually ok here. Nice work!
Yeah, that's weird. I would expect the slowdown to happen when you use more than the number of physical cores. Could it be due to cache contention? I.e. 16 threads start to be starved of cache -> up to 40 is still starved but still doing things in parallel thanks to the number of cores -> more than 40 brings a further slowdown due to threads fighting over cores.
It certainly could be, but perf isn't showing that (at least not clearly). It does seem to show a spike in scheduler context switches in both cases, but I can't explain why. I could understand it if RCU was spinning on a write lock waiting for read-side completion, but it's happening without RCU as well, so it's a mystery to me so far. Investigation continues...
See PR openssl/openssl#24344