
Investigate slowdown observed in PR #24344 #547

Open
Tracked by #357
nhorman opened this issue May 13, 2024 · 10 comments
@nhorman (Contributor) commented May 13, 2024

See PR openssl/openssl#24344

@nhorman (Contributor, Author) commented May 16, 2024

Investigation suggests that the slowdown is being caused by excessive cache misses in ossl_rcu_read_lock.

From the baseline numbers @mattcaswell supplied:
openssl/openssl#24344 (comment)

I did some perf runs and found that 13% of our cumulative cache-miss samples, and 22% of our cache-cycle samples, were in get_hold_current_qp, specifically on the atomic load instruction in which we validate that a writer hasn't updated our quiescent point. That shouldn't happen more than once during a read-lock cycle, but it was clearly happening much more often.

Investigation of the code indicated that we failed to call ossl_synchronize_rcu after a write lock was dropped. Failing to do this means that the current quiescent point will be reused, causing significant cache-line bouncing between CPUs as well as multiple iterations through get_hold_current_qp on the read side, which is likely where the additional time samples are accumulating.
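
For illustration, a minimal sketch of the write-side sequence with the missing call added. The STORE/ALG_TABLE types are made up for the sketch; only the ossl_rcu_* calls come from the internal rcu.h API:

```c
#include <openssl/crypto.h>
#include "internal/rcu.h"   /* internal OpenSSL header providing ossl_rcu_* */

/* Hypothetical stand-ins for the provider-store data being protected */
typedef struct alg_table_st ALG_TABLE;
typedef struct store_st {
    CRYPTO_RCU_LOCK *lock;
    ALG_TABLE *algs;    /* readers fetch this via ossl_rcu_deref() */
} STORE;

/* Write side: publish a fully built replacement, then wait out old readers */
static void store_replace_algs(STORE *store, ALG_TABLE *new_algs)
{
    ALG_TABLE *old;

    ossl_rcu_write_lock(store->lock);
    old = store->algs;
    ossl_rcu_assign_ptr(&store->algs, &new_algs);  /* publish to readers */
    ossl_rcu_write_unlock(store->lock);

    /*
     * The step that was missing: without it the current quiescent point
     * keeps being reused, which shows up as cache-line bouncing in
     * get_hold_current_qp() on the read side.
     */
    ossl_synchronize_rcu(store->lock);

    OPENSSL_free(old);  /* no reader can still hold the old table */
}
```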

By adding ossl_synchronize_rcu after ossl_rcu_write_unlock at the added call sites, I get the following performance comparison with the EVP fetch test, running 50 iterations of the test at each thread count and averaging them:

| Threads | usecs (master) | std dev (master) | usecs (rcu) | std dev (rcu) |
| ------- | -------------- | ---------------- | ----------- | ------------- |
| 1  | 0.192005  | 0.012845 | 0.204523  | 0.011090 |
| 2  | 0.748949  | 0.116107 | 0.65589   | 0.136209 |
| 4  | 1.678048  | 0.259730 | 1.421918  | 0.192387 |
| 8  | 5.075333  | 1.640971 | 4.444663  | 1.552035 |
| 16 | 16.247526 | 4.170233 | 15.467907 | 4.560269 |
| 32 | 14.044161 | 1.969850 | 13.906076 | 3.311382 |
| 64 | 18.926412 | 2.265276 | 15.628595 | 1.973126 |

So this appears to be significantly better.

That said, I have two remaining concerns:

  1. We have a really weird pessimal case at 16 threads. It occurs consistently, regardless of which test is run (baseline or rcu). I would like to investigate that further.

  2. I'm concerned that the changes in this patch aren't correct. Having investigated the code, I can see that it correctly protects the algorithm sparse array, preventing an update to the SA from being published before it is fully updated, which is good. However, the elements in the sparse array (ALGORITHM structs) each contain members that can be, and are, mutated at various times while under an rcu read lock. That isn't permissible. I think we need to look more closely at how those structures are managed if rcu is to be successful here; currently I think it's working by good fortune and timing.

@t8m (Member) commented May 17, 2024

> 1. We have a really weird pessimal case at 16 threads. It occurs consistently, regardless of which test is run (baseline or rcu). I would like to investigate that further.

How many cores do you have on the machine?

@mattcaswell (Member):

> can be, and are, mutated at various times while under an rcu read lock

Where do you see mutation under a read lock occurring?

@mattcaswell (Member):

> Investigation of the code indicated that we failed to call ossl_synchronize_rcu after a write lock was dropped

Ah! Mea culpa! Should we integrate the ossl_synchronize_rcu call into the write unlock? Is there any case where you wouldn't want this to happen?

@mattcaswell (Member):

> Should we integrate the ossl_synchronize_rcu call into the write unlock? Is there any case where you wouldn't want this to happen?

Answering my own question: yes, in the event that we are rolling back, the ossl_synchronize_rcu is unnecessary.
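
A minimal sketch of why, reusing the made-up STORE/ALG_TABLE types from the sketch above (alg_table_dup is likewise a hypothetical helper): on the rollback path nothing new is ever published, so no reader can be holding a pointer that needs a grace period.

```c
/* Sketch: only a published update needs a grace period (hypothetical types) */
ALG_TABLE *alg_table_dup(const ALG_TABLE *src);    /* hypothetical helper */

static int store_try_update(STORE *store)
{
    ALG_TABLE *old, *new_algs;

    ossl_rcu_write_lock(store->lock);
    old = store->algs;
    new_algs = alg_table_dup(old);
    if (new_algs == NULL) {
        ossl_rcu_write_unlock(store->lock);     /* roll back: nothing published, */
        return 0;                               /* so no ossl_synchronize_rcu() */
    }
    ossl_rcu_assign_ptr(&store->algs, &new_algs);
    ossl_rcu_write_unlock(store->lock);

    ossl_synchronize_rcu(store->lock);          /* needed only after publishing */
    OPENSSL_free(old);
    return 1;
}
```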

@mattcaswell (Member):

I updated openssl/openssl#24344 to add the missing ossl_synchronize_rcu calls.

@nhorman (Contributor, Author) commented May 17, 2024

@mattcaswell yes, in that case you don't want to call synchronize immediately. You also may not want to call it if you plan to take the write lock again shortly (i.e. you can batch synchronization if you like), as in the sketch below.
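
A minimal sketch of that batching idea, again with the made-up STORE/ALG_TABLE types from earlier rather than the actual PR code: the retired pointers must not be freed until after the single ossl_synchronize_rcu call.

```c
/* Sketch: batch several updates under one grace period (hypothetical types) */
static void store_replace_many(STORE *store, ALG_TABLE **tables, size_t n)
{
    ALG_TABLE **retired = OPENSSL_malloc(n * sizeof(*retired));
    size_t i;

    if (retired == NULL)
        return;

    for (i = 0; i < n; i++) {
        ossl_rcu_write_lock(store->lock);
        retired[i] = store->algs;
        ossl_rcu_assign_ptr(&store->algs, &tables[i]);
        ossl_rcu_write_unlock(store->lock);
    }

    ossl_synchronize_rcu(store->lock);  /* one grace period covers the batch */

    for (i = 0; i < n; i++)
        OPENSSL_free(retired[i]);
    OPENSSL_free(retired);
}
```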

@t8m my system has 20 Xeon E5 cores, hyper-threaded, so effectively 40 cores. It's super odd that we slow down so much at 16 threads, then get faster again.

@nhorman (Contributor, Author) commented May 17, 2024

@mattcaswell actually, I take back what I said about being concerned about improper protections. I was concerned that when you did a shallow dup of the algs array, the member elements were going to point to the same memory (i.e. the IMPLEMENTATION stack was still going to be shared), but I see now that you dup that as well, so I think we're actually ok here.

Nice work!
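
A minimal sketch of the property being relied on here, with made-up ALG/IMPL_STACK types and an impl_stack_dup helper standing in for the real duplication: the copy that gets published must not share mutable sub-structures with the copy readers may still be traversing.

```c
#include <openssl/crypto.h>

/* Sketch: the published copy must not alias mutable state readers still see */
typedef struct impl_stack_st IMPL_STACK;            /* hypothetical nested state */
typedef struct alg_st {
    IMPL_STACK *impls;
} ALG;

IMPL_STACK *impl_stack_dup(const IMPL_STACK *src);  /* hypothetical helper */

static ALG *alg_dup_for_update(const ALG *src)
{
    ALG *dst = OPENSSL_zalloc(sizeof(*dst));

    if (dst == NULL)
        return NULL;
    /* A shallow "*dst = *src" would leave impls shared with live readers */
    dst->impls = impl_stack_dup(src->impls);
    if (dst->impls == NULL) {
        OPENSSL_free(dst);
        return NULL;
    }
    return dst;
}
```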

@t8m (Member) commented May 17, 2024

> @t8m my system has 20 Xeon E5 cores, hyper-threaded, so effectively 40 cores. It's super odd that we slow down so much at 16 threads, then get faster again.

Yeah, that's weird. I would expect the slowdown to happen when you use more than the number of physical cores.

Could it be due to cache contention?

I.e. at 16 threads they start to be starved of cache -> up to 40 threads they are still starved but still doing things in parallel thanks to the number of cores -> beyond 40, a further slowdown due to threads fighting over cores.

@nhorman (Contributor, Author) commented May 17, 2024

It certainly could be, but perf isn't showing that (at least not clearly). It does seem to show a spike in scheduler context switches in both cases there, but I can't explain why that is. I could understand it if rcu were spinning on a write lock waiting for read-side completion, but it's happening without rcu as well, so it's a mystery to me so far. Investigation continues...
