Performance of compression: The clone wars #8

Open · 4 tasks
torfmaster opened this issue Jan 9, 2022 · 7 comments
Labels: difficulty medium, help wanted (Extra attention is needed)

Comments

@torfmaster (Owner)

ribzip2 uses a very naive representation of Huffman codes and also writes them in a naive way: it uses dynamically allocated arrays of enums. The habit of representing bits as arrays of enums is costly in other places too; in total, at least 5% of the runtime is spent cloning these arrays.
These are the places where the cost can be eliminated by a better internal representation, e.g. a 32-bit integer (bzip2 Huffman codes are length-limited to 17 bits when writing and 20 bits when reading anyway); see the sketch after the task list.

  • replace the representation of Huffman codes by 32-bit integers to avoid cloning of arrays
  • replace the bitwriter's internal representation by bytes or integers
  • use a bit array (represented by integers, for example) instead of arrays of enums
  • store block data more efficiently instead of just using arrays of Bit enums
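A minimal sketch of what such a representation could look like (names and layout are illustrative assumptions, not ribzip2's actual API): a bzip2 Huffman code is at most 20 bits, so code bits plus length fit in a plain `Copy` struct around a `u32`, and the bit writer can buffer bits in a `u64` accumulator and emit whole bytes.

```rust
// Hypothetical packed representation (illustrative names): the code's bits
// live in the low `len` bits of `bits`. Copying this is a cheap bitwise
// copy - no heap allocation, no cloning of a Vec of enums.
#[derive(Clone, Copy)]
struct PackedCode {
    bits: u32, // the code, right-aligned
    len: u8,   // 1..=20 for bzip2
}

// Hypothetical bit writer that buffers bits in an integer accumulator and
// flushes complete bytes, instead of pushing one enum value per bit.
struct BitWriter {
    acc: u64,    // buffered bits; the low `n_bits` bits are valid
    n_bits: u32, // number of valid bits in `acc`
    out: Vec<u8>,
}

impl BitWriter {
    fn new() -> Self {
        BitWriter { acc: 0, n_bits: 0, out: Vec::new() }
    }

    fn write_code(&mut self, code: PackedCode) {
        // Append the new bits below the already-buffered ones.
        self.acc = (self.acc << code.len) | u64::from(code.bits);
        self.n_bits += u32::from(code.len);
        // Emit whole bytes, most significant bits first (bzip2 is MSB-first).
        while self.n_bits >= 8 {
            self.n_bits -= 8;
            self.out.push((self.acc >> self.n_bits) as u8);
        }
    }

    fn finish(mut self) -> Vec<u8> {
        // Pad the final partial byte with zero bits.
        if self.n_bits > 0 {
            self.out.push((self.acc << (8 - self.n_bits)) as u8);
        }
        self.out
    }
}
```

Writing a symbol then costs two shifts and at most a few byte pushes, with no allocation or cloning on the hot path.
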
torfmaster added the difficulty medium and help wanted labels on Jan 9, 2022
@pdogr commented Mar 24, 2022

Hey @torfmaster. I wanted to take a shot at this, if this is still open.

@torfmaster (Owner)

Yes it is. Go ahead!

@ohsnyt commented Apr 1, 2022

torfmaster - greetings. I saw your project just a couple of days ago and am impressed with your progress. My son, Micah Snyder, is the current custodian of the bzip2 repository.

We have been talking about writing a Rust version. I'm somewhat new to Rust and have been cutting my teeth on a bzip2 implementation (private repository). Just last night I achieved very similar compression levels to Julian's code - I was so happy. I'll pull your code and look at it. (The solution for me was recreating a "partial sort" when optimizing the Huffman tables, patterned off Julian's code.)

@torfmaster (Owner)

@ohsnyt I'm very happy that my work (more than a year of my spare time) is getting some attention :) By the way, I achieved a similar compression level to the original implementation by using a more or less classical k-means clustering approach, sketched below. You can enable this with a command line parameter (or, inside the lib, by setting the appropriate flag). However, it is currently a bit slower, so I didn't enable it by default. To be honest, I was unable to understand the original implementation of the Huffman table optimization - if you have any insight into why it works, give me a hint :)
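
(For readers following along: the k-means idea here treats each 50-symbol group's frequency histogram as a point and each of bzip2's up-to-6 coding tables as a centroid, alternating an assignment step and a refit step. A rough, illustrative sketch under those assumptions - not ribzip2's actual code; `huffman_code_lengths` below is a deliberately crude stand-in for a real length-limited Huffman construction.)

```rust
// Illustrative k-means-style table selection (not ribzip2's real API).
// Each "point" is the symbol-frequency histogram of one 50-symbol group;
// each "centroid" is a coding table, i.e. a set of code lengths.
fn select_tables(groups: &[Vec<u32>], n_tables: usize, n_syms: usize, iters: usize) -> Vec<usize> {
    // Trivial seed; a real implementation would seed the tables more cleverly.
    let mut tables: Vec<Vec<u32>> = vec![vec![8; n_syms]; n_tables];
    let mut assignment = vec![0usize; groups.len()];
    for _ in 0..iters {
        // Assignment step: give each group the table that codes it cheapest.
        for (g, freqs) in groups.iter().enumerate() {
            assignment[g] = (0..n_tables)
                .min_by_key(|&t| cost(freqs, &tables[t]))
                .unwrap();
        }
        // Update step: refit each table from the frequencies of its groups.
        for t in 0..n_tables {
            let mut total = vec![0u32; n_syms];
            for (g, freqs) in groups.iter().enumerate() {
                if assignment[g] == t {
                    for (s, &f) in freqs.iter().enumerate() {
                        total[s] += f;
                    }
                }
            }
            tables[t] = huffman_code_lengths(&total);
        }
    }
    assignment
}

// Cost of coding one group with one table: sum of freq * code length, in bits.
fn cost(freqs: &[u32], lengths: &[u32]) -> u64 {
    freqs.iter().zip(lengths).map(|(&f, &l)| u64::from(f) * u64::from(l)).sum()
}

// Crude stand-in for a real length-limited Huffman construction: code length
// is roughly -log2(probability), clamped to bzip2's 17-bit writing limit.
fn huffman_code_lengths(freqs: &[u32]) -> Vec<u32> {
    let total = freqs.iter().map(|&f| u64::from(f)).sum::<u64>().max(1);
    freqs
        .iter()
        .map(|&f| {
            if f == 0 {
                17
            } else {
                (((total as f64) / (f as f64)).log2().ceil() as u32).clamp(1, 17)
            }
        })
        .collect()
}
```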

I would be happy to bring this code to a broader audience - if you have any ideas how, feel free to tell me!

@ohsnyt commented Apr 5, 2022

My GitHub skills are not great (yet). I tried to create a fork and publish it back up. It is heavily commented. Look there and see if that helps. (If the code didn't push up, let me know.) The new code has a new subcommand for "standard" bzip2 encoding.

Since Julian's code computes Huffman codes so differently, I decided to make a looooong match branch at line 53 of block_encoders.rs. I didn't change much code before that point (except some CLI stuff to wire up the new option). I didn't touch the decoding side.

@torfmaster (Owner)

@ohsnyt thank you! If you create a PR and tick "allow edits from maintainers", I can try to integrate this into the existing code. I am still unsure why this code (I mean the original bzip2 code) works at all, but I'll benchmark it, and if it proves more efficient than the k-means clustering method, I'll integrate it.
One question regarding your port: have you tried benchmarking your code with large files? I initially used the same approach (i.e. fat pivot radix sort), but it turned out to be much slower than the now-used SA-IS approach; a toy version of the fat-pivot idea is sketched below for comparison.
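
(A toy illustration of what "fat pivot radix sort" typically means - ternary-split multi-key quicksort in the style of Bentley-Sedgewick, where the "fat" middle partition holds everything equal to the pivot at the current character and recurses one character deeper. This sketch sorts plain suffixes rather than bzip2's wrapped rotations, and it is an assumption about the approach meant here, not code from either implementation.)

```rust
// Toy multi-key ("fat pivot") quicksort of suffix indices: partitions on
// the character at `depth`, then recurses into the equal (fat) partition
// at depth + 1. Suffixes past the end of the text sort first. Repetitive
// input keeps the equal partition large at every depth, which is the
// degenerate case that SA-IS avoids.
fn sort_suffixes(text: &[u8], sa: &mut [usize], depth: usize) {
    if sa.len() <= 1 {
        return;
    }
    // -1 stands for "suffix ends before `depth`", sorting below all bytes.
    let key = |s: usize| text.get(s + depth).map(|&c| c as i32).unwrap_or(-1);
    let pivot = key(sa[sa.len() / 2]);
    // Three-way partition: [0, lo) < pivot, [lo, hi) == pivot, [hi, len) > pivot.
    let (mut lo, mut i, mut hi) = (0usize, 0usize, sa.len());
    while i < hi {
        let k = key(sa[i]);
        if k < pivot {
            sa.swap(lo, i);
            lo += 1;
            i += 1;
        } else if k > pivot {
            hi -= 1;
            sa.swap(i, hi);
        } else {
            i += 1;
        }
    }
    let (rest, gt) = sa.split_at_mut(hi);
    let (lt, eq) = rest.split_at_mut(lo);
    sort_suffixes(text, lt, depth);
    if pivot >= 0 {
        // The fat partition agrees at `depth`; compare one character deeper.
        sort_suffixes(text, eq, depth + 1);
    }
    sort_suffixes(text, gt, depth);
}
```

On b"banana" this yields the suffix order [5, 3, 1, 0, 4, 2]; on long runs of equal bytes the equal partition barely shrinks, matching the slowdown on large files.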

@ohsnyt commented Apr 22, 2022 via email
