
Enable cache Optimization #13481

Open
malik672 opened this issue Dec 20, 2024 · 11 comments
Labels
C-perf A change motivated by improving speed, memory usage or disk footprint

Comments

@malik672
Contributor

malik672 commented Dec 20, 2024

Describe the feature

ref: #12842 (comment)
ref: #12842
[Image: allocation profile screenshot from the referenced #12842 comment]

The main reason this consumes so much storage, or rather allocation, is that the code is not cache-optimized: the whole of fetcher.rs doesn't seem to take advantage of the CPU cache, so reads keep going out to main memory instead. Common examples:

  1. Hash lookups that are not needed (this might seem efficient with O(1) constant-time indexing, but it isn't)
  2. A hash set with a maximum size of 8 that should be replaced with a Vec; this might not seem logical for indexing, but for small data structures it's all about being contiguous in memory (see the sketch below)
  3. Scattered memory access, which leads to more cache misses

All in all, the reason for this much allocation and latency is simply that we are not using the cache efficiently.
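
To make point 2 concrete, here is a minimal sketch (the element type, the helper names, and the size of 8 are illustrative only, not taken from fetcher.rs): for a handful of elements, a linear scan over a contiguous Vec touches only a cache line or two, while a HashSet pays for hashing and a scattered heap layout to answer the same membership question.

```rust
use std::collections::HashSet;

// Hypothetical helpers: de-duplicating a handful of ids. The u64 ids and
// the capacity of 8 are made up for illustration.
fn dedup_insert_vec(seen: &mut Vec<u64>, id: u64) -> bool {
    if seen.contains(&id) {
        return false; // already present; one linear scan over contiguous memory
    }
    seen.push(id);
    true
}

fn dedup_insert_set(seen: &mut HashSet<u64>, id: u64) -> bool {
    seen.insert(id) // hashes the key and probes the table
}

fn main() {
    // Pre-allocate for the expected small size so neither grows in the hot path.
    let mut vec_seen: Vec<u64> = Vec::with_capacity(8);
    let mut set_seen: HashSet<u64> = HashSet::with_capacity(8);
    for id in [1u64, 2, 2, 3, 1, 4] {
        dedup_insert_vec(&mut vec_seen, id);
        dedup_insert_set(&mut set_seen, id);
    }
    assert_eq!(vec_seen.len(), set_seen.len());
}
```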

@hai-rise @mattsse

Additional context

I don't think any amount of code removal or algorithmic reduction will make this more efficient, apart from using the cache efficiently.

@malik672 malik672 added C-enhancement New feature or request S-needs-triage This issue needs to be labelled labels Dec 20, 2024
@hai-rise
Contributor

hai-rise commented Dec 20, 2024

I'm not following the bigger context, but CPU cache misses "only" re-fetch values from RAM into the CPU without allocating more, while my referenced comment focused on aggressive memory allocations, which is not necessarily related.

I don't think any amount of code removal or algorithmic reduction will make this more efficient, apart from using the cache efficiently.

I'm unfamiliar with this specific module, but this statement is generally off:

  • The main selling point of CPU caches is that reading from main memory is very slow. If reading from memory is slow, then more complicated operations like allocations, deallocations, or moving data in memory are even slower.
  • I've heard several engineers sell 100~1000x speedup for efficient CPU caches but never saw such scale in practice or realistic benchmarks. Meanwhile, I've seen countless scenarios where non-contiguous O(logN) smokes contiguous O(N).

My point is that while efficient CPU caches are always desired, they shouldn't devalue other critical aspects like efficient data structures and memory (de)allocations. They are not mutually exclusive!

@malik672
Contributor Author

malik672 commented Dec 20, 2024

You are wrong; they are related, actually very, very related:

  1. Think of Cache hits as batching

I've heard several engineers sell 100~1000x speedup for efficient CPU caches but never saw such scale in practice or realistic benchmarks. Meanwhile, I've seen countless scenarios where non-contiguous O(logN) smokes contiguous O(N).

This is technically right or wrong depending on the length of the data structure, so I can't really say. But from what I saw, the data structure length was 8 and a hash set was used, which made no sense to me tbh. For a larger size, non-contiguous is better; at the end of the day it's all about the size.

My point is that while efficient CPU caches are always desired, they shouldn't devalue other critical aspects like efficient data structures and memory (de)allocations. They are not mutually exclusive!

This again is wrong; you can draw a straight line from cache hits to efficient data structures and hence to memory (de)allocations.

@malik672
Contributor Author

My point is that while efficient CPU caches are always desired, they shouldn't devalue other critical aspects like efficient data structures and memory (de)allocations. They are not mutually exclusive!

They are very very very very very very very very very very very very very mutually exclusive

@hai-rise
Contributor

hai-rise commented Dec 21, 2024

Nah, let's look at a simple example of Vec<CachePadded<Data>> that tries to get the best of CPU caches:

  • Vec puts all the Datas next to each other in memory so iterating them is CPU cache-efficient.
  • Each Data is CachePadded to fit a CPU cache line, so a CPU (running a thread) writing to an element wouldn't invalidate adjacent elements in the cache line of other CPUs.

This is not mutually exclusive with efficient memory (de)allocations, as one can do efficient pre-allocations with with_capacity and reserve, use clear to reuse allocated memory instead of std::mem::take, etc. On the other hand, a push over the capacity would allocate and move the Vec per push. How much to pre-allocate to avoid this deadly cost depends on the problem at hand and has nothing to do with the CPU cache.

This is not mutually exclusive with efficient data structures either: when Data is ordered and one needs, say, the biggest element, one would have to iterate the whole Vec to find it. When the number of elements is big, a tree implementation with sloppy memory alignment and no efficient pre-allocation would still beat the Vec.
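
For concreteness, a minimal sketch of that layout (assuming the crossbeam-utils crate for CachePadded; Data is just a stand-in type, not anything from the fetcher module):

```rust
use crossbeam_utils::CachePadded;

// Stand-in payload for illustration only.
#[derive(Default)]
struct Data {
    counter: u64,
}

fn main() {
    let n = 64;
    // Pre-allocate once so the pushes below never reallocate or move the Vec.
    let mut slots: Vec<CachePadded<Data>> = Vec::with_capacity(n);
    for _ in 0..n {
        slots.push(CachePadded::new(Data::default()));
    }

    // The contiguous layout keeps iteration cache-friendly, while the padding
    // gives each element its own cache line so a writer doesn't invalidate
    // its neighbours in other cores' caches.
    for slot in slots.iter_mut() {
        slot.counter += 1;
    }

    // Reuse the existing allocation instead of dropping and re-allocating.
    slots.clear();
    assert!(slots.capacity() >= n);
}
```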

Your argument is that engineers and programmers only need to focus on the CPU cache and don't even need to care for other things like data structures, memory allocators, or the actual problem and data size a module is solving. I cannot find any context where this makes sense, unfortunately.

A more productive approach would be to propose clear design and code changes that address the actual problem in #12842. "Enable cache optimization" is as unclear as "enable CPU optimization" and "require fewer computer operations". And with any performance work, profiling data and benchmark numbers are always more helpful than pure theory, so the issue would be more convincing if you have them.

@malik672
Contributor Author

Your argument is that engineers and programmers only need to focus on the CPU cache and don't even need to care for other things like data structures, memory allocators, or the actual problem and data size a module is solving. I cannot find any context where this makes sense, unfortunately.

I never said this. Caring for the cache is the same (in a way) as caring for data structures and memory allocators; they are interdependent, and you can draw a straight line between them: efficient data structure for small length -> cache efficiency and minimal memory allocation.

This is not mutually exclusive with efficient data structures either: when Data is ordered and one needs, say, the biggest element, one would have to iterate the whole Vec to find it. When the number of elements is big, a tree implementation with sloppy memory alignment and no efficient pre-allocation would still beat the Vec.

I said this previously, using an example I saw in the code: this is technically right or wrong depending on the length of the data structure, so I can't really say. But from what I saw, the data structure length was 8 and a hash set was used, which made no sense to me tbh. For a larger size, non-contiguous is better; at the end of the day it's all about the size.

A more productive approach would be to propose clear design and code changes that address the actual problem in #12842. "Enable cache optimization" is as unclear as "enable CPU optimization" and "require fewer computer operations". And with any performance work, profiling data and benchmark numbers are always more helpful than pure theory, so the issue would be more convincing if you have them.

I agree, a benchmark is always better than theory. My theory is just me observing the code and changing some parts in it; I would love to do that fully, but I don't really have the time.

@hai-rise
Contributor

hai-rise commented Dec 21, 2024

they are interdependent, and you can draw a straight line between them: efficient data structure for small length -> cache efficiency and minimal memory allocation.

My simple Vec example disputed this. For this same "efficient data structure for small length" that you claimed, one would allocate only once if they know the capacity at construction time. If not, one may need to allocate per push. This difference between O(1) and O(n) (not even counting the cost of moving the Vec) depends on whether we know the expected capacity which is tied to the problem at hand and has nothing to do with CPU caches.

My referenced screenshot showed allocations at 100 MB/s with an average lifetime of only 25.77 μs. This is unhealthy, and likely related to (pre)allocating or cloning more than needed regardless of whether the underlying data are contiguous in memory.
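
As a generic illustration of that kind of pattern (nothing here is the actual fetcher code), compare allocating a short-lived buffer on every iteration of a hot loop with reusing a single pre-allocated buffer:

```rust
// Illustrative only: the data and sizes are made up.

// Allocates a fresh, short-lived Vec on every iteration.
fn per_iteration_alloc(batches: &[&[u8]]) -> usize {
    let mut total = 0;
    for &batch in batches {
        let copy: Vec<u8> = batch.to_vec(); // heap allocation that lives microseconds
        total += copy.len();
    }
    total
}

// Reuses one pre-allocated buffer across iterations.
fn reuse_buffer(batches: &[&[u8]]) -> usize {
    let mut total = 0;
    let mut buf: Vec<u8> = Vec::with_capacity(1024);
    for &batch in batches {
        buf.clear(); // keeps the allocation, drops the contents
        buf.extend_from_slice(batch);
        total += buf.len();
    }
    total
}

fn main() {
    let (a, b) = ([1u8, 2, 3], [4u8, 5, 6]);
    let batches: Vec<&[u8]> = vec![&a[..], &b[..]];
    assert_eq!(per_iteration_alloc(&batches), reuse_buffer(&batches));
}
```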

My whole point in this thread can be summarized as:

  • Performance contributions should either be obvious (removing an unneeded clone, allocation, etc.), or
  • Be paired with profiling data or benchmarks to back a claim or proposal, and
  • Ideally, be very actionable.

I'm writing this many words because I like your energy (in both Reth and our pevm), but you seem to have a habit of underplaying the challenge of performance work (like over-sticking to a single assumption instead of looking at the bigger picture) and have a rather low issue & PR conversion rate. A change in approach may help you improve at performance tuning, see more of your ideas merged, and help all these open-source projects get better :).

@gakonst
Member

gakonst commented Dec 21, 2024

@malik672 i think it's worth understanding Hai's broad point -- don't do crazy optimizations, always profile

@malik672
Contributor Author

@malik672 i think it's worth understanding Hai's broad point -- don't do crazy optimizations, always profile

But I agreed to this ...... :) :)

@malik672
Contributor Author

My simple Vec example disputed this. For this same "efficient data structure for small length" that you claimed, one would allocate only once if they know the capacity at construction time. If not, one may need to allocate per push. This difference between O(1) and O(n) (not even counting the cost of moving the Vec) depends on whether we know the expected capacity which is tied to the problem at hand and has nothing to do with CPU caches.

If we have an estimate, it's good enough; in the code it was 8, and at worst it should be +5, which is still good for a Vec.
I definitely do agree with some of your points (especially the benchmark part); I just refuse to agree that the cache is not interdependent with memory allocation/deallocation.

I'll try to do a PR with benchmarks; not sure I have the time, but I'll try.

@hai-rise
Contributor

hai-rise commented Dec 22, 2024

Nice. Some profiling data with Cachegrind to confirm the cache issue or benchmarks to show your solution is better would be very helpful. And feel free to continue in #13219 for collaboration with @Elvis339 in a more focused context 🙏.

@DaniPopes DaniPopes added C-perf A change motivated by improving speed, memory usage or disk footprint and removed C-enhancement New feature or request S-needs-triage This issue needs to be labelled labels Dec 22, 2024
@Elvis339
Contributor

  1. Hash lookups that are not needed (this might seem efficient with O(1) constant-time indexing, but it isn't)
  2. A hash set with a maximum size of 8 that should be replaced with a Vec; this might not seem logical for indexing, but for small data structures it's all about being contiguous in memory
  3. Scattered memory access, which leads to more cache misses

Re: point 2 about HashSet vs Vec:
The HashSet is currently serving a role in de-duplication. While we could potentially achieve the same with a Vec + custom logic, this would:

  • Add complexity to the codebase
  • Require additional space/time complexity tradeoffs
  • Need concrete benchmarks to prove it's actually faster than the current HashSet implementation

One important correction: the with_capacity(n) in the code isn't setting a maximum size; it's just pre-allocating space for n elements to avoid initial reallocations (see the short sketch after the list below). We'd need production metrics to understand:

  • Actual typical collection sizes
  • Reallocation frequency
  • Whether this capacity is optimal for real-world usage patterns
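
A tiny sketch of that distinction (hypothetical type and numbers, not the fetcher code):

```rust
use std::collections::HashSet;

fn main() {
    // with_capacity(8) only pre-allocates room for roughly 8 elements;
    // it does not cap the set, which keeps growing (and reallocating)
    // as more elements are inserted.
    let mut peers: HashSet<u32> = HashSet::with_capacity(8);
    for id in 0..100u32 {
        peers.insert(id);
    }
    assert_eq!(peers.len(), 100);
    assert!(peers.capacity() >= 100);
}
```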

It seems like latency is the main KPI here (correct me if I'm wrong). I processed events from the last 7 days from this Grafana panel:

[Image: Grafana panel showing the event latency histogram over the last 7 days]

  • Most events are concentrated between 10^2 (100 ns) and 10^3 (1000 ns), which for me is an indicator that the majority of events are relatively quick.

Before changing the code, I think we need to take a more systematic approach here.
I propose the following action steps:

  1. Understand the current system
    • Map out critical paths and interactions (I can help with the parts I understand, but we should get input from others on unclear areas)
  2. Performance Analysis Plan
    • What operations correspond to the long tail events (>1000ns)
    • Actual HashSet sizes and growth patterns
    • Reallocation frequency and impact
    • Resource contention patterns
