Support Fractional-Power-of-Two Window Sizes in Compression #4275

felixhandte · 2025-01-30T21:58:05Z

This PR adds support for configuring and performing compression with non-power-of-two window sizes.

The frame format and decoder have always supported more window sizes than just powers of two. The frame format defines a 3 bit window mantissa field in addition to the 4 bit window log descriptor. However, the compressor has only ever supported power-of-two windows. This PR changes that. In order to do that, it makes a number of changes:

It separates the public and private CParams struct definitions.
It adds a windowFrac field to the private CParams (but not to the public one, because we are scared of changing that struct definition.
It adds a new CCtx Param ZSTD_c_windowFrac which allows users to set this param.

To-Do:

Testing.
CLI Support.

lib/compress/clevels.h

Cyan4973

I'm not convinced by the strategy of creating a new intermediate internal structure to pass the new parameter.
We have an established pattern to add new parameters via ZSTD_CCtx_params.
I prefer we stick to it and don't introduce a new pattern specific to this new parameter.

.github/workflows/dev-long-tests.yml

felixhandte · 2025-02-04T15:22:45Z

I agree that it would have been more expedient to just put it in the CCtxParams, but it seems bad to separate these highly interdependent variables.

In this future where the Zstd compression side supports fractional window sizes, it is a bug if you look at just the windowLog and don't look at the windowFrac at the same time. I worry that if they're not right next to each other, it is both mechanically harder to make both available to all the places that interact with window sizing, it is also easier to make mistakes as a developer and look only at the windowLog. (Similarly, if you look closely at the code for setting these values, there are nuances that require that they interact.)

And, as we discussed, the match-finders need access to this parameter, so it needs to be available from the MatchState, which currently only has the CParams, and not the CCtxParams.

I considered an alternate approach where, rather than introduce a new variable, we change the internal windowLog to be a "windowDesc" which carries both components, as, e.g., windowLog << 3 | windowFrac. But then even if you didn't have to change the struct shape, you'd still have to translate the values of the internal cParams with the public ones, and it's actually harder to validate that you're doing that correctly if they are the same struct.

Even though the `ZSTD_compressionParameters` struct is part of the unstable API, and therefore we are allowed to modify it, it seems likely that to actually do so would cause widespread chaos and bloodshed. So we are strongly incentivized to avoid changing it. We would, however, like to modify the CParams that are actually passed around internally. In particular, we want to be able to configure non-power-of-2 window sizes. So we need something more expressive than an integral window log. And we want this new field to be next to the window log, rather than hanging out in the CCtxParams, which are not passed to the block compressors. So, in order to support that, this commit: 1. Introduces a new struct, the `ZSTD_CParams`, that (for the moment) mirrors the definition of the public `ZSTD_compressionParameters` struct. 2. Codemods all internal use and storage of `cparams` internally to use the new struct definition. (The exception to this is the `ZSTD_parameters` struct, which is a public definition but is also passed around internally sometimes.) 3. Adds translation functions to convert the public and private struct defs to each other. 4. Uses those translation functions at the user API boundary (and when handling `ZSTD_parameters`).

This commit extends the refactor done in the previous and applies it to the `ZSTD_parameters` public struct, introducing an internal variant called `ZSTD_Params` whose only difference is containing a `ZSTD_CParams` instead of a `ZSTD_compressionParameters`. This commit similarly introduces conversion functions and rewrites internal functions to use the new, private definition. This allows us to clean up some internal conversions of the private `CParams` back into the public `CParams` in order to put them in a `ZSTD_parameters` struct to pass into something.

Store it in the CParams. Don't use it yet.

Using a blocksize of 100KB with a 65KB dictionary just *happens* to work with the current window-sizing logic, because it has to round all the way up to the next power of two. But it doesn't *have* to work. So this pre-emptively fixes up this test to use a size that will work after switching to tighter window sizing, but still uses a block size that is bigger than the dict size in case that matters for some reason?

We want to turn this on by default immediately after this release. And we want to thoroughly exercise the underlying code paths handling fractional windows, because they will be available in this upcoming release via new APIs (the explicit CCtx param and the constrain window for protocol path). And it's scary stuff. We can remove all these flags once we turn this on by default.

I found some slight errors via the exhaustive testing added to the fuzzer. This commit fixes them.

For inputs smaller than `1 << ZSTD_WINDOWLOG_MIN` bytes, the pre-existing code in `ZSTD_adjustCParams_internal()` temporarily calculated an illegally small window log for the purposes of adjusting the hash- and chain-table sizes, before fixing that bad window log to comply with the minimums. The new code clamped the window log earlier and therefore didn't shrink the tables as much. This commit fixes that to mirror the pre-existing behavior. This resolves the remaining regression test differences.

Cyan4973 · 2025-02-22T00:28:06Z

tests/fuzzer.c

@@ -3670,7 +3781,7 @@ static int basicUnitTests(U32 const seed, double compressibility)
    {   ZSTD_CCtx* const cctx = ZSTD_createCCtx();
        ZSTD_DCtx* const dctx = ZSTD_createDCtx();
        static const size_t dictSize = 65 KB;
-        static const size_t blockSize = 100 KB;   /* won't cause pb with small dict size */
+        static const size_t blockSize = 71 KB;   /* won't cause pb with small dict size */


Interesting,
why changing blockSize in this test ?
Is this related to non-power-of-2 window sizes ?
In which case, what's better tested with 71 KB than with previous 100 KB ?

Yeah, the commit message for the commit that makes this change describes the motivation:

Using a blocksize of 100KB with a 65KB dictionary just happens to work with
the current window-sizing logic, because it has to round all the way up to
the next power of two. But it doesn't have to work.

So this pre-emptively fixes up this test to use a size that will work after
switching to tighter window sizing, but still uses a block size that is
bigger than the dict size in case that matters for some reason?

As discussed elsewhere, this PR doesn't change the default window sizing logic: it still rounds up to the next power of two by default (this is what is controlled by the ZSTD_WINDOW_ALLOW_PICKING_FRACTIONAL_SIZES macro, which defaults to false). Which is to say, this PR doesn't break this test. But if you set -DZSTD_WINDOW_ALLOW_PICKING_FRACTIONAL_SIZES=1, this test as it exists will break, because Zstd will pick a 72KB window size, and blocks can't be larger than the window size, and this test tries to invoke the block-level compression API with a larger input.

So the change here preserves the behavior of the test of checking a block larger than the dictionary (although I don't think that's actually an important property of how this test is set up), while complying with the new, smaller selected window size.

Cyan4973 · 2025-03-04T18:33:23Z

lib/compress/zstd_compress_internal.h

+} ZSTD_CParams;
+
+/**
+ * Internal equivalent of public ZSTD_Params struct, wrapping the internal


nit: public ZSTD_parameters struct

Cyan4973 · 2025-03-04T18:44:50Z

lib/compress/zstd_compress_internal.h

+ * translation. The only long-term justified use of these should be at the
+ * User API.
+ */
+ZSTD_CParams ZSTD_getCParamsFromPublicCParams(ZSTD_compressionParameters cParams);


Using Published Names for Clarity

When referring to specific parameters or functions, it feels more consistent to use their published names to reduce confusion. In the case of PublicCParams, while its meaning can be inferred from context, using the public name directly feels preferable.

Consider renaming the following functions to use more consistent names:

ZSTD_getCParamsFromPublicCParams() => ZSTD_getCParamsFromCompressionParameters()

ZSTD_getPublicCParamsFromCParams() => ZSTD_getCompressionParametersFromCParams()

ZSTD_getParamsFromPublicParams() => ZSTD_getParamsFromParameters()

etc.

By using published names consistently, code becomes more readable for future contributors.

also: minor: get feels very generic.
convert might be more accurate to describe this operation.

Cyan4973 · 2025-03-04T19:01:30Z

Just some minor naming suggestions.

zstd_compress.c is relatively straightforward to review, compared to #4305, because the code structure remains the same. So no comment there.

facebook-github-bot added the CLA Signed label Jan 30, 2025

Cyan4973 reviewed Feb 3, 2025

View reviewed changes

lib/compress/clevels.h Show resolved Hide resolved

Cyan4973 requested changes Feb 3, 2025

View reviewed changes

Cyan4973 reviewed Feb 3, 2025

View reviewed changes

.github/workflows/dev-long-tests.yml Show resolved Hide resolved

felixhandte force-pushed the win-frac branch 2 times, most recently from 810c02a to bfc4b7d Compare February 7, 2025 15:36

felixhandte added 20 commits February 18, 2025 07:48

Add a windowFrac Field to ZSTD_CParams

aa76d3f

Add Public ZSTD_c_windowFrac CCtxParam

139b725

Store it in the CParams. Don't use it yet.

Implement Support for Fractional Window Sizes

4dbc1a1

Easy: Rename Arg to ZSTD_cycleLog

3e308e2

Trivial fuzzer.c Additions

a8da454

Macro Guard to Control Picking Fractional Window Sizes

301a34d

A Little More Bounds-Checking

a2f3b98

Only Apply windowFrac when windowLog is Also Set

fe80106

Avoid Illegal WL and WF Combos

abc3d6f

Correct Corner-Case Bugs in Window Sizing

78996a6

I found some slight errors via the exhaustive testing added to the fuzzer. This commit fixes them.

Exhaustive Testing of Window Sizing Combinations

9848c21

Add WindowFrac Param to Fuzz Tests

705a5b2

Add Advanced Parameter to CLI to Set Window Fraction

d720558

Add Man Page Documentation

8b12bd6

Add CLI Test for Fractional Window Size Param

2af69f9

Fix Various Contbuild Warnings

72c38c9

felixhandte force-pushed the win-frac branch from bfc4b7d to 72c38c9 Compare February 18, 2025 16:14

felixhandte marked this pull request as ready for review February 18, 2025 17:39

felixhandte mentioned this pull request Feb 20, 2025

Put windowFrac in the CCtxParams, not the CParams #4305

Draft

Cyan4973 reviewed Feb 22, 2025

View reviewed changes

Cyan4973 reviewed Mar 4, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Support Fractional-Power-of-Two Window Sizes in Compression #4275

Support Fractional-Power-of-Two Window Sizes in Compression #4275

Uh oh!

felixhandte commented Jan 30, 2025 •

edited

Loading

Uh oh!

Uh oh!

Cyan4973 left a comment •

edited

Loading

Uh oh!

Uh oh!

felixhandte commented Feb 4, 2025

Uh oh!

Cyan4973 Feb 22, 2025

Uh oh!

felixhandte Mar 5, 2025

Uh oh!

Cyan4973 Mar 4, 2025

Uh oh!

Cyan4973 Mar 4, 2025 •

edited

Loading

Uh oh!

Cyan4973 Mar 4, 2025 •

edited

Loading

Uh oh!

Cyan4973 commented Mar 4, 2025

Uh oh!

Uh oh!

Support Fractional-Power-of-Two Window Sizes in Compression #4275

Are you sure you want to change the base?

Support Fractional-Power-of-Two Window Sizes in Compression #4275

Uh oh!

Conversation

felixhandte commented Jan 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Cyan4973 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

felixhandte commented Feb 4, 2025

Uh oh!

Cyan4973 Feb 22, 2025

Choose a reason for hiding this comment

Uh oh!

felixhandte Mar 5, 2025

Choose a reason for hiding this comment

Uh oh!

Cyan4973 Mar 4, 2025

Choose a reason for hiding this comment

Uh oh!

Cyan4973 Mar 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Using Published Names for Clarity

Uh oh!

Cyan4973 Mar 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Cyan4973 commented Mar 4, 2025

Uh oh!

Uh oh!

felixhandte commented Jan 30, 2025 •

edited

Loading

Cyan4973 left a comment •

edited

Loading

Cyan4973 Mar 4, 2025 •

edited

Loading

Cyan4973 Mar 4, 2025 •

edited

Loading