bloom-filter: default value for components_memory_reclaim_threshold is too strict #18607

Open
Tracked by #17972
denesb opened this issue May 10, 2024 · 1 comment · May be fixed by #18611
Labels: area/memory footprint, backport/5.2, backport/5.4, backport/6.0, status/regression

Comments

denesb (Contributor) commented May 10, 2024

components_memory_reclaim_threshold controls when we start evicting bloom filters from memory. It defaults to 0.1, i.e. 10% of the shard's memory. This turns out to be too strict and causes unnecessary bloom filter eviction in otherwise healthy clusters; one cluster impacted by this had a BF/memory ratio of 0.6. The eviction mechanism was introduced to avert OOM disasters, not to enforce lean bloom filters, so the default should be adjusted accordingly and increased to 0.8 or maybe even 0.9. The idea is that by default it should only kick in when OOM is imminent; it should not intervene on clusters which have high BF memory consumption but are otherwise fine.
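
For anyone hitting this before the default is changed, the threshold can be raised per node. A minimal example, assuming the option is set in scylla.yaml like other config options (the 0.9 value here is illustrative):

```yaml
# scylla.yaml -- illustrative override; assumes the option is exposed here
# like other config options. With 0.9, bloom filter reclaim only starts once
# the filters exceed 90% of the shard's memory.
components_memory_reclaim_threshold: 0.9
```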

denesb added the status/regression, area/memory footprint, backport/5.2, backport/5.4, backport/6.0 labels May 10, 2024
lkshminarayanan added a commit to lkshminarayanan/scylladb that referenced this issue May 10, 2024
…fault

Incremented the components_memory_reclaim_threshold config's default
value to 0.8 as the previous value was too strict and caused unnecessary
eviction in otherwise healthy clusters.

Fixes scylladb#18607

Signed-off-by: Lakshmi Narayanan Sreethar <[email protected]>
mykaul added this to the 5.4.7 milestone May 10, 2024
vladzcloudius (Contributor) commented May 13, 2024

According to what @avikivity wrote before, OOM is "imminent" when the ratio reaches 0.5.
However, we know for a fact this is not universally true: we have seen nodes survive 0.6 and more just fine.
OOM happens not because BF occupies this or that amount of RAM, but because an allocation fails.

The BF amount can contribute, but it is not the root cause; it is just one of many possible triggers for the OOM.

IMO we should not enable this feature by default, just as we don't enable compaction I/O bandwidth throttling by default.

Instead, we should invest in actually fixing the problem:

  1. Make sure the BF size is always correct: we know exactly how many partitions an sstable has right before we seal it. Using estimations like we do today is a performance optimization, and it should always be discarded when it gets in the way of correctness. Instead we keep "fighting" for that heuristic, piling up spaghetti code for no good reason whatsoever (see the sizing sketch after this list).
  2. If we want to restrict the BF memory size we should implement proper caching (we have a few classes in the code that can make it pretty straightforward, e.g. loading_cache.hh).
    2.1) If we do that, we should add a configuration option to control the maximum amount of RAM consumed by the BF cache.
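
For point 1, a minimal sketch of the sizing step (illustrative, not ScyllaDB code): once the exact partition count is known at seal time, the filter can be sized with the standard bloom-filter formulas instead of an up-front estimate.

```cpp
// Illustrative sketch, not ScyllaDB code: size a bloom filter from the exact
// element count n and a target false-positive rate p, using the standard
// formulas m = -n*ln(p)/(ln 2)^2 bits and k = (m/n)*ln 2 hash functions.
#include <algorithm>
#include <cmath>
#include <cstdint>

struct bloom_params {
    uint64_t bits;    // total filter size in bits
    unsigned hashes;  // number of hash functions
};

bloom_params size_bloom_filter(uint64_t n, double p) {
    if (n == 0) {
        return {1, 1};  // degenerate case: empty sstable
    }
    static const double ln2 = std::log(2.0);
    const double m = -static_cast<double>(n) * std::log(p) / (ln2 * ln2);
    const long k = std::max<long>(1, std::lround(m / static_cast<double>(n) * ln2));
    return {static_cast<uint64_t>(std::ceil(m)), static_cast<unsigned>(k)};
}
```

Sizing from the real count rather than an estimate avoids both undersized filters (higher false-positive rate) and oversized ones (wasted memory).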

fyi @denesb

denesb modified the milestones: 5.4.7, 6.1 May 23, 2024