Investigate slowdown observed in PR #24344 #547
Investigation seems to show that the slowdown is caused by excessive cache misses in ossl_rcu_read_lock. Starting from the base numbers @mattcaswell supplied, I did some perf runs and found that 13% of our cumulative cache miss samples, and 22% of our cache cycle samples, were in get_hold_current_qp, specifically on the atomic load instruction in which we validate that a writer hasn't updated our quiescent point. That shouldn't happen more than once during a read lock cycle, but it was clearly happening much more often.

Investigation of the code indicated that we failed to call ossl_synchronize_rcu after a write lock was dropped. Failing to do this means that the current quiescent point will be reused, causing both significant cache line bouncing between CPUs and multiple iterations through get_hold_current_qp on the read side, which is likely where the additional time samples are accumulating.

By adding ossl_synchronize_rcu after ossl_rcu_write_unlock at the added call sites, I get this performance comparison with the EVP fetch test, running 50 iterations of the test at each thread count and averaging them:
So this appears significantly better. That said, I have two remaining concerns:
How many cores do you have on the machine?
Where do you see mutation under a read lock occurring?
Ah! Mea Culpa! Should we integrate the
Answering my own question: yes, in the event that we are rolling back the
I updated openssl/openssl#24344 to add the missing
@mattcaswell yes, in that case you don't want to immediately call synchronize. You also may not want to call it if you plan to take the write lock again shortly (i.e. you can batch synchronization if you like). @t8m my system has 20 Xeon E5 cores, hyper-threaded, so effectively 40 cores. It's super odd that we slow down so much at 16 threads, then get faster again.
@mattcaswell actually, I take back what I said about being concerned about improper protections. I was concerned that when you did a shallow dup of the algs array, the member elements were going to point to the same memory (i.e. the IMPLEMENTATION stack was going to still be shared), but I see now that you dup that as well, so I think we're actually ok here. Nice work!
Yeah, that's weird. I would expect the slowdown to happen when you use more than the number of physical cores. Could it be due to cache contention? I.e. 16 threads start to be starved of cache -> up to 40 is still starved but still doing things in parallel thanks to the number of cores -> more than 40 brings a further slowdown due to threads fighting over cores.
It certainly could be, but perf isn't showing that (at least not clearly). It does seem to show a spike in scheduler context switches in both cases, but I can't explain why. I could understand it if RCU was spinning on a write lock waiting for read-side completion, but it's happening without RCU as well, so it's a mystery to me so far. Investigation continues...
See PR openssl/openssl#24344