Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Rolling Hashes #103

Open
ashvardanian opened this issue Feb 28, 2024 · 0 comments
Open

Improve Rolling Hashes #103

ashvardanian opened this issue Feb 28, 2024 · 0 comments
Labels
core Work on the algorithm design and implementation help wanted Extra attention is needed huge Large task potentially involving architectural or breaking changes performance Performance related discussion or suggestion

Comments

@ashvardanian
Copy link
Owner

In StringZilla a 64-bit rolling hash function is reused for both string hashes and substring hashes, Rabin-style fingerprints. Rolling hashes take the same amount of time to compute hashes with different window sizes, and are fast to update. Those are not however perfect hashes, and collisions are frequent. StringZilla attempts to use SIMD, but the performance is not yet satisfactory. On Intel Sapphire Rapids, the following numbers can be expected for N-way parallel variants.

  • 4-way AVX2 throughput with 64-bit integer multiplication (no native support): 0.28 GB/s.
  • 4-way AVX2 throughput with 32-bit integer multiplication: 0.54 GB/s.
  • 4-way AVX-512DQ throughput with 64-bit integer multiplication: 0.46 GB/s.
  • 4-way AVX-512 throughput with 32-bit integer multiplication: 0.58 GB/s.
  • 8-way AVX-512 throughput with 32-bit integer multiplication: 0.11 GB/s.

That's extremely slow. Using SIMD and a better scheme, we should be approaching memcpy speeds. Trying alternative rolling hash approaches is the biggest goal for the library in the upcoming releases, so algorithm recommendations would be appreciated!

@ashvardanian ashvardanian added help wanted Extra attention is needed performance Performance related discussion or suggestion core Work on the algorithm design and implementation huge Large task potentially involving architectural or breaking changes labels Feb 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Work on the algorithm design and implementation help wanted Extra attention is needed huge Large task potentially involving architectural or breaking changes performance Performance related discussion or suggestion
Projects
None yet
Development

No branches or pull requests

1 participant