Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Enhancement]: consider replacing bloom filters with xor filters / binary fuse filters #32995

Open
1 task done
alexanderguzhva opened this issue May 12, 2024 · 4 comments
Open
1 task done
Assignees
Labels
kind/enhancement Issues or changes related to enhancement

Comments

@alexanderguzhva
Copy link
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

What would you like to be added?

consider replacing bloom filters with xor filters / binary fuse filters

an image from https://github.com/FastFilter/xor_singleheader

Why is this needed?

No response

Anything else?

No response

@alexanderguzhva alexanderguzhva added the kind/enhancement Issues or changes related to enhancement label May 12, 2024
@xiaofan-luan
Copy link
Contributor

is there any performance profiling result?

@alexanderguzhva
Copy link
Contributor Author

significantly faster.

https://arxiv.org/pdf/2201.01174

Untitled

Alternatively, we may use `Blocked Bloom filters' (https://save-buffer.github.io/bloom_filter.html; an example of a go package is https://github.com/greatroar/blobloom)

@xiaofan-luan
Copy link
Contributor

@weiliu1031 let's profiling all the filter in multiple dimensions:

  1. write speed
  2. query speed/batch query speed
  3. memory utilization
  4. if it can merge without loss precision it is a big plus

@weiliu1031
Copy link
Contributor

/assign

sre-ci-robot pushed a commit that referenced this issue May 31, 2024
…#33405)

issue: #32995
To speed up the construction and querying of Bloom filters, we chose a
blocked Bloom filter instead of a basic Bloom filter implementation.

WARN: This PR is compatible with old version bf impl, but if fall back
to old milvus version, it may causes bloom filter deserialize failed.

In single Bloom filter test cases with a capacity of 1,000,000 and a
false positive rate (FPR) of 0.001, the blocked Bloom filter is 5 times
faster than the basic Bloom filter in both querying and construction, at
the cost of a 30% increase in memory usage.

- Block BF construct time	{"time": "54.128131ms"}
- Block BF size	                {"size": 3021578}
- Block BF Test cost	        {"time": "55.407352ms"}
- Basic BF construct time	{"time": "210.262183ms"}
- Basic BF size	                {"size": 2396308}
- Basic BF Test cost	        {"time": "192.596229ms"}

In multi Bloom filter test cases with a capacity of 100,000, an FPR of
0.001, and 100 Bloom filters, we reuse the primary key locations for all
Bloom filters to avoid repeated hash computations. As a result, the
blocked Bloom filter is also 5 times faster than the basic Bloom filter
in querying.

- Block BF TestLocation cost    {"time": "529.97183ms"}
- Basic BF TestLocation cost	{"time": "3.197430181s"}

---------

Signed-off-by: Wei Liu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/enhancement Issues or changes related to enhancement
Projects
None yet
Development

No branches or pull requests

3 participants