-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance of compression: The clone wars #8
Comments
Hey @torfmaster. I wanted to take a shot at this, if this is still open. |
Yes it is. Go ahead! |
torfmaster - greetings. I saw your project just a couple days ago, and am impressed with your progress. My son, Micah Snyder, is the current custodian of the bzip2 repository. We have been talking about writing a Rust version. I'm somewhat new to Rust, and have been cutting my teeth on a bzip2 implementation (private repository). Just last night I actually achieved very similar compression level's to Julian's code - was so happy. I'll pull your code and look at it. (The solution for me was recreating a "partial sort" when optimizing the Huffman tables, patterned off Julian's code.) |
@ohsnyt I'm very happy, that my work (more than a year in my spare time) gets some attention :) By the way I achieved a similar compression level to the original implementation by using a more or less classical k-means-clustering approach. You can enable this using the command line parameter (or inside the lib set the appropriate flag). However, it is a bit slower by now and I didn't enable it by default. To be honest I was unable to understand the original implementation of the optimization of the Huffman tables - if you have any hint why it works, give me a hint :) I would be happy to bring this code to a broader audience, if you have any ideas, how - feel free to tell me! |
My GitHub skills are not great (yet). I tried create a fork and publish it back up. It is heavily commented. Look there and see if that helps. (If the code didn't push up, let me know.) The new code has a new subcommand for "standard" bzip2 encoding. Since Julian's code computes Huffman codes so differently, I decided to make a looooong match branch at line 53 of block_encoders.rs. I didn't change much of any code before that point (except some CLI stuff to get the new option up). I didn't touch the decoding side. |
@ohsnyt thank you! If you create a PR and tick "allow edits from maintainers" I can try to integrate this into the existing code. Still I am unsure why this code (I mean the original bzip2 code) works at all, but I'll benchmark it and if it shows to be more efficient than the k means clustering method I'll integrate it. |
Sorry for the delay in replying. I am still working some bugs out of a larger import. I’m getting closer, but finding in initial testing that the speeds are very similar.
I’ll be in touch more later.
David
… On Apr 8, 2022, at 1:39 AM, torfmaster ***@***.***> wrote:
@ohsnyt <https://github.com/ohsnyt> thank you! If you create a PR and tick "allow edits from maintainers" I can try to integrate this into the existing code. Still I am unsure why this code (I mean the original bzip2 code) works at all, but I'll benchmark it and if it shows to be more efficient than the k means clustering method I'll integrate it.
One question regarding your port. Have you tried to benchmarked your code with large files? I have used the same approach (i.e. fat pivot radix sort) initiallly but it has shown much slower than the now-used SA-IS approach.
—
Reply to this email directly, view it on GitHub <#8 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AOT32KDNSYR733NRKZ7ID3TVD7ICRANCNFSM5LSLFSWA>.
You are receiving this because you were mentioned.
|
ribzip2
uses a very naive representation of Huffman codes and also writes them in a naive way: it uses dynamically allocated arrays of enums. Also at other places the habit of representing bits as arrays of enums has large costs, in total at least 5% are spent during clone operations of these arrays.There are basically these places where this can be eliminated by a better internal representation, e.g. a 32 bit integer (bzip2 Huffman codes are length-limited to 17 bits writing and 20 bits reading anyway).
Bit
enumsThe text was updated successfully, but these errors were encountered: