Benchmarking Unishox #50

Open
powturbo opened this issue Dec 22, 2022 · 2 comments


@powturbo

powturbo commented Dec 22, 2022

I've integrated Unishox into TurboBench and ran some tests with your test files.
Because Unishox2 is very slow, I ran the tests with unishox3. unishox3 can't compress all of the test files without errors.
Below is a benchmark of the files that were compressed and decompressed without error.
Unishox compresses better than the other codecs only for very small files (< 1k).

     C Size  ratio%     C MB/s     D MB/s   Name            File              (bold = pareto) MB=1.000.0000
        8212    13.5      16.20      22.55   bsc 0e2         hindi.txt
        9927    16.3       0.84     535.41   brotli 11       hindi.txt
       10226    16.8       4.71     182.20   lzma 9          hindi.txt
       10408    17.1       0.47    1968.94   zstd 22         hindi.txt
       10502    17.2       0.21     792.69   zopfli          hindi.txt
       10546    17.3       5.07    1795.21   libdeflate 12   hindi.txt
       11425    18.7       6.06     782.53   zlib 9          hindi.txt
       13811    22.6      11.72    4695.15   lz4 16          hindi.txt
       15248    25.0       0.20     182.75   unishox3        hindi.txt
       61037   100.0   30518.50   30518.50   memcpy          hindi.txt
      116431   190.8     544.97     792.69   shoco           hindi.txt

     C Size  ratio%     C MB/s     D MB/s   Name            File              
       14612    32.3      10.00      12.89   bsc 0e2         spanish.txt
       15442    34.2       0.86     223.61   brotli 11       spanish.txt
       16522    36.6       6.21      68.34   lzma 9          spanish.txt
       16744    37.1       0.40     406.94   zopfli          spanish.txt
       16769    37.1       8.40     740.49   libdeflate 12   spanish.txt
       16797    37.2       0.38     868.65   zstd 22         spanish.txt
       17584    38.9      18.86     370.25   zlib 9          spanish.txt
       19692    43.6       0.04      88.05   unishox3        spanish.txt
       21889    48.5      18.69    3011.33   lz4 16          spanish.txt
       37079    82.1      94.10     438.54   shoco           spanish.txt
       45170   100.0   22585.00   45170.00   memcpy          spanish.txt

     C Size  ratio%     C MB/s     D MB/s   Name            File               
        6235    43.2       0.93     166.02   brotli 11       chinese.txt
        6506    45.0       4.58       6.36   bsc 0e2         chinese.txt
        6828    47.3       7.73      53.50   lzma 9          chinese.txt
        7093    49.1       0.31     515.86   zopfli          chinese.txt
        7098    49.1       0.13     555.54   zstd 22         chinese.txt
        7187    49.8      12.58     577.76   libdeflate 12   chinese.txt
        7450    51.6      33.51     498.07   zlib 9          chinese.txt
        8989    62.2       0.20      76.02   unishox3        chinese.txt
        9458    65.5      30.41    2888.80   lz4 16          chinese.txt
       14444   100.0   14444.00   14444.00   memcpy          chinese.txt
       25170   174.3     283.22     722.20   shoco           chinese.txt

     C Size  ratio%     C MB/s     D MB/s   Name            File               
        2882    35.9       4.45       6.03   bsc 0e2         zh.txt
        2932    36.5       0.95     167.35   brotli 11       zh.txt
        3061    38.1       0.07     669.42   zstd 22         zh.txt
        3091    38.5       7.01      96.78   lzma 9          zh.txt
        3098    38.6       0.54     669.42   zopfli          zh.txt
        3105    38.7       6.46     730.27   libdeflate 12   zh.txt
        3233    40.2      19.98     669.42   zlib 9          zh.txt
        4386    54.6      23.22    2677.67   lz4 16          zh.txt
        4492    55.9       0.67     110.04   unishox3        zh.txt
        8033   100.0    8033.00    8033.00   memcpy          zh.txt
       15668   195.0     259.13    1004.12   shoco           zh.txt

     C Size  ratio%     C MB/s     D MB/s   Name            File               
        2150    28.5       3.83       5.64   bsc 0e2         ru.txt
        2255    29.9       0.75     160.62   brotli 11       ru.txt
        2484    32.9       4.24     125.82   lzma 9          ru.txt
        2515    33.3       0.12     629.08   zopfli          ru.txt
        2530    33.5       4.91     754.90   libdeflate 12   ru.txt
        2546    33.7       0.07     686.27   zstd 22         ru.txt
        2628    34.8      15.01     580.69   zlib 9          ru.txt
        3168    42.0       0.43     193.56   unishox3        ru.txt
        3591    47.6      15.22    1258.17   lz4 16          ru.txt
        7549   100.0    7549.00    7549.00   memcpy          ru.txt
       14244   188.7     215.69     686.27   shoco           ru.txt
	   
     C Size  ratio%     C MB/s     D MB/s   Name            File               
        1916    28.5       0.88     197.68   brotli 11       json3.txt
        2068    30.8       5.73     160.02   lzma 9          json3.txt
        2115    31.5       0.35     611.00   zopfli          json3.txt
        2120    31.5       0.06     840.12   zstd 22         json3.txt
        2120    31.5       5.13     746.78   libdeflate 12   json3.txt
        2124    31.6       3.81       5.59   bsc 0e2         json3.txt
        2163    32.2      23.18     611.00   zlib 9          json3.txt
        2261    33.6       0.09     292.22   unishox3        json3.txt
        2937    43.7      23.02    3360.50   lz4 16          json3.txt
        5727    85.2      97.41     517.00   shoco           json3.txt
        6721   100.0    6721.00    6721.00   memcpy          json3.txt

     C Size  ratio%     C MB/s     D MB/s   Name            File               
         282    25.3       0.51      53.14   brotli 11       xml1.txt
         308    27.6       3.58     372.00   libdeflate 12   xml1.txt
         308    27.6       0.06     279.00   zopfli          xml1.txt
         312    28.0      10.63     279.00   zlib 9          xml1.txt
         318    28.5       0.01     372.00   zstd 22         xml1.txt
         321    28.8       2.89      65.65   lzma 9          xml1.txt
         344    30.8       0.99       1.45   bsc 0e2         xml1.txt
         363    32.5       0.08     223.20   unishox3        xml1.txt
         458    41.0       5.81     558.00   lz4 16          xml1.txt
         963    86.3      85.85     558.00   shoco           xml1.txt
        1116   100.0    1116.00    1116.00   memcpy          xml1.txt

     C Size  ratio%     C MB/s     D MB/s   Name            File               
         102    42.3       0.05     120.50   unishox3        json1.txt
         130    53.9       0.62      17.21   lzma 9          json1.txt
         131    54.4       0.02      80.33   zopfli          json1.txt
         132    54.8       2.11     241.00   zlib 9          json1.txt
         132    54.8       0.74      80.33   libdeflate 12   json1.txt
         133    55.2       0.16       8.61   brotli 11       json1.txt
         134    55.6       0.00     120.50   zstd 22         json1.txt
         165    68.5       1.48     241.00   lz4 16          json1.txt
         172    71.4       0.24       0.28   bsc 0e2         json1.txt
         221    91.7      14.18     241.00   shoco           json1.txt
         241   100.0     241.00     241.00   memcpy          json1.txt
@siara-cc
Owner

siara-cc commented Dec 23, 2022

Hi @powturbo ,

Thank you for sharing this! That's exactly what Unishox was made for: compressing short strings and files smaller than 1 KB.

It's really not a general-purpose compressor, so this comparison is not fair except for showing speed (slowness), which I am working on.

However, to achieve better compression overall, the following logic could be used, since the magic bit(s) at the beginning of the compressed bytes can be used to identify Unishox or other methods:

if (size < 1024)
    output = compress_with_unishox(input);
else
    output = compress_with_any_other(input);

The threshold of 1024 is arbitrary; if speed is not a concern, it is also possible to compress with both methods and keep the smaller output.
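As a rough illustration of the "compress with both and keep the smaller output" idea, here is a minimal Python sketch. It makes several assumptions that are not from this thread: it uses a one-byte method tag rather than Unishox's actual magic bits, the names `TAG_RAW`, `TAG_ZLIB`, `compress_best`, and `decompress_tagged` are hypothetical, and zlib stands in as the only registered codec because no particular Unishox binding is assumed.

```python
import zlib

# Hypothetical one-byte method tags. Unishox's real magic bits are NOT modelled here.
TAG_RAW = 0   # payload stored uncompressed
TAG_ZLIB = 1  # payload compressed with zlib

# Map each tag to a (compress, decompress) pair. A Unishox binding, if one is
# available, could be registered under a further tag in the same way.
CODECS = {
    TAG_ZLIB: (lambda b: zlib.compress(b, 9), zlib.decompress),
}

def compress_best(data: bytes, codecs: dict) -> bytes:
    """Try every codec and keep the smallest output, prefixed with its tag."""
    best_tag, best_out = TAG_RAW, data
    for tag, (compress, _decompress) in codecs.items():
        out = compress(data)
        if len(out) < len(best_out):
            best_tag, best_out = tag, out
    return bytes([best_tag]) + best_out

def decompress_tagged(blob: bytes, codecs: dict) -> bytes:
    """Read the tag byte and dispatch to the matching decompressor."""
    tag, payload = blob[0], blob[1:]
    if tag == TAG_RAW:
        return payload
    _compress, decompress = codecs[tag]
    return decompress(payload)
```

The tag costs one byte per string, which matters at these sizes; reusing magic bits that the codecs already emit, as suggested above, would avoid that overhead.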

The real competitors to Unishox are Smaz, Shoco, Unicode.org's SCSU and BOCU (implementations here and here), and AIMCS (implementation here).

See here for applications of Unishox; it is being used in the Tasmota and Meshtastic projects. It is also being used to compress URLs and email addresses, and one person even uses it for obfuscation :-).

I have included Unishox3-Alpha in the CI/CD to test compression of all sample files and I did not find any errors. Is it possible to let me know what the errors were?

@powturbo
Author

Yeah, unishox3 is excellent on short strings.
Note that if you have a large database of short strings such as URLs or email addresses, you can also use lz4, zstd, or zlib with a preset dictionary.
I haven't actually tested how that compares to Unishox.
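For readers unfamiliar with preset dictionaries, the following is a minimal sketch using the `zdict` parameter of Python's standard `zlib` module. The dictionary contents and the helper names `compress_with_dict` and `decompress_with_dict` are illustrative assumptions; in practice the dictionary would be built from substrings that are frequent in the actual data set, and both sides must use the same dictionary.

```python
import zlib

# Assumed sample dictionary: substrings expected to be common in the data.
# A real dictionary would be derived from a representative corpus.
PRESET_DICT = b"https://www.http://.com/.org/index.html@example.com"

def compress_with_dict(data: bytes, zdict: bytes = PRESET_DICT) -> bytes:
    """Compress one short string, letting zlib reference the preset dictionary."""
    compressor = zlib.compressobj(level=9, zdict=zdict)
    return compressor.compress(data) + compressor.flush()

def decompress_with_dict(blob: bytes, zdict: bytes = PRESET_DICT) -> bytes:
    """Decompress with the same dictionary that was used for compression."""
    decompressor = zlib.decompressobj(zdict=zdict)
    return decompressor.decompress(blob) + decompressor.flush()
```

The zlib stream header and checksum still add several bytes per string, so whether this beats Unishox on very short inputs would need to be measured rather than assumed.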

I'm getting a turbobench compare error for the following files:
ERROR at 72996:20, 77 file=alice_wland.txt
ERROR at 68970:48, 65 file=french.txt
ERROR at 79796:32, 65 file=json4.txt
ERROR at 29339:3a, 34 file=korean.txt
ERROR at 74296:20, 2e file=world95.txt
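To help reproduce such mismatches outside the benchmark, here is a minimal round-trip comparison sketch. It assumes, without confirmation from this thread, that each error line reads as offset, then the expected byte and the byte actually found, both in hex. The names `first_mismatch` and `describe` are hypothetical, and the decompressed output must be produced separately by whichever codec is under test.

```python
from itertools import zip_longest

def first_mismatch(original: bytes, restored: bytes):
    """Return (offset, expected, actual) for the first differing byte, else None.

    A side that ended early is reported as None at that offset.
    """
    pairs = zip_longest(original, restored)
    for offset, (expected, actual) in enumerate(pairs):
        if expected != actual:
            return offset, expected, actual
    return None

def describe(mismatch) -> str:
    """Format a mismatch in the assumed 'offset:expected, actual' hex layout."""
    if mismatch is None:
        return "OK"
    offset, expected, actual = mismatch

    def hex_or_end(byte):
        return "--" if byte is None else f"{byte:02x}"

    return f"ERROR at {offset}:{hex_or_end(expected)}, {hex_or_end(actual)}"
```

Calling `describe(first_mismatch(data, restored))` on the original file bytes and the decompressed bytes gives the first offset at which the round trip diverges.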
