Performance of compression: The clone wars #8

Open · 4 tasks
torfmaster opened this issue Jan 9, 2022 · 7 comments
Labels: difficulty medium, help wanted (Extra attention is needed)

Comments

@torfmaster (Owner)

ribzip2 uses a very naive representation of Huffman codes and also writes them in a naive way: it uses dynamically allocated arrays of enums. The habit of representing bits as arrays of enums is costly in other places too; in total, at least 5% of the runtime is spent cloning these arrays.
These are the places where the cost can be eliminated by a better internal representation, e.g. a 32-bit integer (bzip2 Huffman codes are length-limited to 17 bits when writing and 20 bits when reading anyway); see the sketch after the task list.

  • replace the representation of Huffman codes by 32-bit integers to avoid cloning of arrays
  • replace the bitwriter's internal representation by bytes or integers
  • use a bit array (represented by integers, for example) instead of arrays of enums
  • store block data more efficiently instead of just using arrays of Bit enums
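A minimal sketch of what such a representation could look like (names and layout are illustrative assumptions, not ribzip2's actual API): a bzip2 Huffman code is at most 20 bits, so code bits plus length fit in a plain `Copy` struct around a `u32`, and the bit writer can buffer bits in a `u64` accumulator and emit whole bytes.

```rust
// Hypothetical packed representation (illustrative names): the code's bits
// live in the low `len` bits of `bits`. Copying this is a cheap bitwise
// copy - no heap allocation, no cloning of a Vec of enums.
#[derive(Clone, Copy)]
struct PackedCode {
    bits: u32, // the code, right-aligned
    len: u8,   // 1..=20 for bzip2
}

// Hypothetical bit writer that buffers bits in an integer accumulator and
// flushes complete bytes, instead of pushing one enum value per bit.
struct BitWriter {
    acc: u64,    // buffered bits; the low `n_bits` bits are valid
    n_bits: u32, // number of valid bits in `acc`
    out: Vec<u8>,
}

impl BitWriter {
    fn new() -> Self {
        BitWriter { acc: 0, n_bits: 0, out: Vec::new() }
    }

    fn write_code(&mut self, code: PackedCode) {
        // Append the new bits below the already-buffered ones.
        self.acc = (self.acc << code.len) | u64::from(code.bits);
        self.n_bits += u32::from(code.len);
        // Emit whole bytes, most significant bits first (bzip2 is MSB-first).
        while self.n_bits >= 8 {
            self.n_bits -= 8;
            self.out.push((self.acc >> self.n_bits) as u8);
        }
    }

    fn finish(mut self) -> Vec<u8> {
        // Pad the final partial byte with zero bits.
        if self.n_bits > 0 {
            self.out.push((self.acc << (8 - self.n_bits)) as u8);
        }
        self.out
    }
}
```

Writing a symbol then costs two shifts and at most a few byte pushes, with no allocation or cloning on the hot path.
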
torfmaster added the difficulty medium and help wanted labels on Jan 9, 2022
@pdogr commented Mar 24, 2022

Hey @torfmaster. I wanted to take a shot at this, if this is still open.

@torfmaster (Owner)

Yes it is. Go ahead!

@ohsnyt commented Apr 1, 2022

torfmaster - greetings. I saw your project just a couple of days ago and am impressed with your progress. My son, Micah Snyder, is the current custodian of the bzip2 repository.

We have been talking about writing a Rust version. I'm somewhat new to Rust and have been cutting my teeth on a bzip2 implementation (private repository). Just last night I achieved very similar compression levels to Julian's code - I was so happy. I'll pull your code and look at it. (The solution for me was recreating a "partial sort" when optimizing the Huffman tables, patterned off Julian's code.)

@torfmaster (Owner)

@ohsnyt I'm very happy that my work (more than a year of my spare time) is getting some attention :) By the way, I achieved a similar compression level to the original implementation by using a more or less classical k-means clustering approach, sketched below. You can enable this with a command line parameter (or, inside the lib, by setting the appropriate flag). However, it is currently a bit slower, so I didn't enable it by default. To be honest, I was unable to understand the original implementation of the Huffman table optimization - if you have any insight into why it works, give me a hint :)
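
(For readers following along: the k-means idea here treats each 50-symbol group's frequency histogram as a point and each of bzip2's up-to-6 coding tables as a centroid, alternating an assignment step and a refit step. A rough, illustrative sketch under those assumptions - not ribzip2's actual code; `huffman_code_lengths` below is a deliberately crude stand-in for a real length-limited Huffman construction.)

```rust
// Illustrative k-means-style table selection (not ribzip2's real API).
// Each "point" is the symbol-frequency histogram of one 50-symbol group;
// each "centroid" is a coding table, i.e. a set of code lengths.
fn select_tables(groups: &[Vec<u32>], n_tables: usize, n_syms: usize, iters: usize) -> Vec<usize> {
    // Trivial seed; a real implementation would seed the tables more cleverly.
    let mut tables: Vec<Vec<u32>> = vec![vec![8; n_syms]; n_tables];
    let mut assignment = vec![0usize; groups.len()];
    for _ in 0..iters {
        // Assignment step: give each group the table that codes it cheapest.
        for (g, freqs) in groups.iter().enumerate() {
            assignment[g] = (0..n_tables)
                .min_by_key(|&t| cost(freqs, &tables[t]))
                .unwrap();
        }
        // Update step: refit each table from the frequencies of its groups.
        for t in 0..n_tables {
            let mut total = vec![0u32; n_syms];
            for (g, freqs) in groups.iter().enumerate() {
                if assignment[g] == t {
                    for (s, &f) in freqs.iter().enumerate() {
                        total[s] += f;
                    }
                }
            }
            tables[t] = huffman_code_lengths(&total);
        }
    }
    assignment
}

// Cost of coding one group with one table: sum of freq * code length, in bits.
fn cost(freqs: &[u32], lengths: &[u32]) -> u64 {
    freqs.iter().zip(lengths).map(|(&f, &l)| u64::from(f) * u64::from(l)).sum()
}

// Crude stand-in for a real length-limited Huffman construction: code length
// is roughly -log2(probability), clamped to bzip2's 17-bit writing limit.
fn huffman_code_lengths(freqs: &[u32]) -> Vec<u32> {
    let total = freqs.iter().map(|&f| u64::from(f)).sum::<u64>().max(1);
    freqs
        .iter()
        .map(|&f| {
            if f == 0 {
                17
            } else {
                (((total as f64) / (f as f64)).log2().ceil() as u32).clamp(1, 17)
            }
        })
        .collect()
}
```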

I would be happy to bring this code to a broader audience - if you have any ideas how, feel free to tell me!

@ohsnyt commented Apr 5, 2022

My GitHub skills are not great (yet). I tried to create a fork and publish it back up. It is heavily commented. Look there and see if that helps. (If the code didn't push up, let me know.) The new code has a new subcommand for "standard" bzip2 encoding.

Since Julian's code computes Huffman codes so differently, I decided to make a looooong match branch at line 53 of block_encoders.rs. I didn't change much code before that point (except some CLI stuff to wire up the new option). I didn't touch the decoding side.

@torfmaster (Owner)

@ohsnyt thank you! If you create a PR and tick "allow edits from maintainers", I can try to integrate this into the existing code. I am still unsure why this code (I mean the original bzip2 code) works at all, but I'll benchmark it, and if it proves more efficient than the k-means clustering method, I'll integrate it.
One question regarding your port: have you tried benchmarking your code with large files? I initially used the same approach (i.e. fat pivot radix sort), but it turned out to be much slower than the now-used SA-IS approach; a toy version of the fat-pivot idea is sketched below for comparison.
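
(A toy illustration of what "fat pivot radix sort" typically means - ternary-split multi-key quicksort in the style of Bentley-Sedgewick, where the "fat" middle partition holds everything equal to the pivot at the current character and recurses one character deeper. This sketch sorts plain suffixes rather than bzip2's wrapped rotations, and it is an assumption about the approach meant here, not code from either implementation.)

```rust
// Toy multi-key ("fat pivot") quicksort of suffix indices: partitions on
// the character at `depth`, then recurses into the equal (fat) partition
// at depth + 1. Suffixes past the end of the text sort first. Repetitive
// input keeps the equal partition large at every depth, which is the
// degenerate case that SA-IS avoids.
fn sort_suffixes(text: &[u8], sa: &mut [usize], depth: usize) {
    if sa.len() <= 1 {
        return;
    }
    // -1 stands for "suffix ends before `depth`", sorting below all bytes.
    let key = |s: usize| text.get(s + depth).map(|&c| c as i32).unwrap_or(-1);
    let pivot = key(sa[sa.len() / 2]);
    // Three-way partition: [0, lo) < pivot, [lo, hi) == pivot, [hi, len) > pivot.
    let (mut lo, mut i, mut hi) = (0usize, 0usize, sa.len());
    while i < hi {
        let k = key(sa[i]);
        if k < pivot {
            sa.swap(lo, i);
            lo += 1;
            i += 1;
        } else if k > pivot {
            hi -= 1;
            sa.swap(i, hi);
        } else {
            i += 1;
        }
    }
    let (rest, gt) = sa.split_at_mut(hi);
    let (lt, eq) = rest.split_at_mut(lo);
    sort_suffixes(text, lt, depth);
    if pivot >= 0 {
        // The fat partition agrees at `depth`; compare one character deeper.
        sort_suffixes(text, eq, depth + 1);
    }
    sort_suffixes(text, gt, depth);
}
```

On b"banana" this yields the suffix order [5, 3, 1, 0, 4, 2]; on long runs of equal bytes the equal partition barely shrinks, matching the slowdown on large files.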

@ohsnyt commented Apr 22, 2022 via email
