Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pack-objects: Create an alternative name hash algorithm (recreated) #1823

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Commits on Nov 22, 2024

  1. pack-objects: add --full-name-hash option

    The pack_name_hash() method has not been materially changed since it was
    introduced in ce0bd64 (pack-objects: improve path grouping
    heuristics., 2006-06-05). The intention here is to group objects by path
    name, but also attempt to group similar file types together by making
    the most-significant digits of the hash be focused on the final
    characters.
    
    Here's the crux of the implementation:
    
    	/*
    	 * This effectively just creates a sortable number from the
    	 * last sixteen non-whitespace characters. Last characters
    	 * count "most", so things that end in ".c" sort together.
    	 */
    	while ((c = *name++) != 0) {
    		if (isspace(c))
    			continue;
    		hash = (hash >> 2) + (c << 24);
    	}
    
    As the comment mentions, this only cares about the last sixteen
    non-whitespace characters. This cause some filenames to collide more than
    others. This collision is somewhat by design in order to promote hash
    locality for files that have similar types (.c, .h, .json) or could be the
    same file across a directory rename (a/foo.txt to b/foo.txt). This leads to
    decent cross-path deltas in cases like shallow clones or packing a
    repository with very few historical versions of files that share common data
    with other similarly-named files.
    
    However, when the name-hash instead leads to a large number of name-hash
    collisions for otherwise unrelated files, this can lead to confusing the
    delta calculation to prefer cross-path deltas over previous versions of the
    same file.
    
    Here are some examples that I've seen while investigating repositories
    that are growing more than they should be:
    
     * "/CHANGELOG.json" is 15 characters, and is created by the beachball
       [1] tool. Only the final character of the parent directory can
       differentiate different versions of this file, but also only the two
       most-significant digits. If that character is a letter, then this is
       always a collision. Similar issues occur with the similar
       "/CHANGELOG.md" path, though there is more opportunity for
       differences in the parent directory.
    
     * Localization files frequently have common filenames but differentiates
       via parent directories. In C#, the name "/strings.resx.lcl" is used
       for these localization files and they will all collide in name-hash.
    
    [1] https://github.com/microsoft/beachball
    
    I've come across many other examples where some internal tool uses a
    common name across multiple directories and is causing Git to repack
    poorly due to name-hash collisions.
    
    It is clear that the existing name-hash algorithm is optimized for
    repositories with short path names, but also is optimized for packing a
    single snapshot of a repository, not a repository with many versions of
    the same file. In my testing, this has proven out where the name-hash
    algorithm does a good job of finding peer files as delta bases when
    unable to use a historical version of that exact file.
    
    However, for repositories that have many versions of most files and
    directories, it is more important that the objects that appear at the
    same path are grouped together.
    
    Create a new pack_full_name_hash() method and a new --full-name-hash
    option for 'git pack-objects' to call that method instead. Add a simple
    pass-through for 'git repack --full-name-hash' for additional testing in
    the context of a full repack, where I expect this will be most
    effective.
    
    The hash algorithm is as simple as possible to be reasonably effective:
    for each character of the path string, add a multiple of that character
    and a large prime number (chosen arbitrarily, but intended to be large
    relative to the size of a uint32_t). Then, shift the current hash value
    to the right by 5, with overlap. The addition and shift parameters are
    standard mechanisms for creating hard-to-predict behaviors in the bits
    of the resulting hash.
    
    This is not meant to be cryptographic at all, but uniformly distributed
    across the possible hash values. This creates a hash that appears
    pseudorandom. There is no ability to consider similar file types as
    being close to each other.
    
    In a later change, a test-tool will be added so the effectiveness of
    this hash can be demonstrated directly.
    
    For now, let's consider how effective this mechanism is when repacking a
    repository with and without the --full-name-hash option. Specifically,
    let's use 'git repack -adf [--full-name-hash]' as our test.
    
    On the Git repository, we do not expect much difference. All path names
    are short. This is backed by our results:
    
    | Stage                 | Pack Size | Repack Time |
    |-----------------------|-----------|-------------|
    | After clone           | 260 MB    | N/A         |
    | Standard Repack       | 127MB     | 106s        |
    | With --full-name-hash | 126 MB    | 99s         |
    
    This example demonstrates how there is some natural overhead coming from
    the cloned copy because the server is hosting many forks and has not
    optimized for exactly this set of reachable objects. But the full repack
    has similar characteristics with and without --full-name-hash.
    
    However, we can test this in a repository that uses one of the
    problematic naming conventions above. The fluentui [2] repo uses
    beachball to generate CHANGELOG.json and CHANGELOG.md files, and these
    files have very poor delta characteristics when comparing against
    versions across parent directories.
    
    | Stage                 | Pack Size | Repack Time |
    |-----------------------|-----------|-------------|
    | After clone           | 694 MB    | N/A         |
    | Standard Repack       | 438 MB    | 728s        |
    | With --full-name-hash | 168 MB    | 142s        |
    
    [2] https://github.com/microsoft/fluentui
    
    In this example, we see significant gains in the compressed packfile
    size as well as the time taken to compute the packfile.
    
    Using a collection of repositories that use the beachball tool, I was
    able to make similar comparisions with dramatic results. While the
    fluentui repo is public, the others are private so cannot be shared for
    reproduction. The results are so significant that I find it important to
    share here:
    
    | Repo     | Standard Repack | With --full-name-hash |
    |----------|-----------------|-----------------------|
    | fluentui |         438 MB  |               168 MB  |
    | Repo B   |       6,255 MB  |               829 MB  |
    | Repo C   |      37,737 MB  |             7,125 MB  |
    | Repo D   |     130,049 MB  |             6,190 MB  |
    
    Future changes could include making --full-name-hash implied by a config
    value or even implied by default during a full repack.
    
    It is important to point out that the name hash value is stored in the
    .bitmap file format, so we must disable the --full-name-hash option when
    bitmaps are being read or written. Later, the bitmap format could be
    updated to be aware of the name hash version so deltas can be quickly
    computed across the bitmapped/not-bitmapped boundary.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 22, 2024
    Configuration menu
    Copy the full SHA
    c04c98f View commit details
    Browse the repository at this point in the history
  2. repack: add --full-name-hash option

    The new '--full-name-hash' option for 'git repack' is a simple
    pass-through to the underlying 'git pack-objects' subcommand. However,
    this subcommand may have other options and a temporary filename as part
    of the subcommand execution that may not be predictable or could change
    over time.
    
    The existing test_subcommand method requires an exact list of arguments
    for the subcommand. This is too rigid for our needs here, so create a
    new method, test_subcommand_flex. Use it to check that the
    --full-name-hash option is passing through.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 22, 2024
    Configuration menu
    Copy the full SHA
    6c37277 View commit details
    Browse the repository at this point in the history
  3. pack-objects: add GIT_TEST_FULL_NAME_HASH

    Add a new environment variable to opt-in to the --full-name-hash option
    in 'git pack-objects'. This allows for extra testing of the feature
    without repeating all of the test scenarios. Unlike many GIT_TEST_*
    variables, we are choosing to not add this to the linux-TEST-vars CI build
    as that test run is already overloaded. The behavior exposed by this test
    variable is of low risk and should be sufficient to allow manual testing
    when an issue arises.
    
    But this option isn't free. There are a few tests that change behavior
    with the variable enabled.
    
    First, there are a few tests that are very sensitive to certain delta
    bases being picked. These are both involving the generation of thin
    bundles and then counting their objects via 'git index-pack --fix-thin'
    which pulls the delta base into the new packfile. For these tests,
    disable the option as a decent long-term option.
    
    Second, there are two tests in t5616-partial-clone.sh that I believe are
    actually broken scenarios. While the client is set up to clone the
    'promisor-server' repo via a treeless partial clone filter (tree:0),
    that filter does not translate to the 'server' repo. Thus, fetching from
    these repos causes the server to think that the client has all reachable
    trees and blobs from the commits advertised as 'haves'. This leads the
    server to providing a thin pack assuming those objects as delta bases.
    Changing the name-hash algorithm presents new delta bases and thus
    breaks the expectations of these tests. An alternative could be to set
    up 'server' as a promisor server with the correct filter enabled. This
    may also point out more issues with partial clone being set up as a
    remote-based filtering mechanism and not a repository-wide setting. For
    now, do the minimal change to make the test work by disabling the test
    variable.
    
    Third, there are some tests that compare the exact output of a 'git
    pack-objects' process when using bitmaps. The warning that disables the
    --full-name-hash option causes these tests to fail. Disable the
    environment variable to get around this issue.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 22, 2024
    Configuration menu
    Copy the full SHA
    00e7e8b View commit details
    Browse the repository at this point in the history
  4. p5313: add size comparison test

    As custom options are added to 'git pack-objects' and 'git repack' to
    adjust how compression is done, use this new performance test script to
    demonstrate their effectiveness in performance and size.
    
    The recently-added --full-name-hash option swaps the default name-hash
    algorithm with one that attempts to uniformly distribute the hashes
    based on the full path name instead of the last 16 characters.
    
    This has a dramatic effect on full repacks for repositories with many
    versions of most paths. It can have a negative impact on cases such as
    pushing a single change.
    
    This can be seen by running pt5313 on the open source fluentui
    repository [1]. Most commits will have this kind of output for the thin
    and big pack cases, though certain commits (such as [2]) will have
    problematic thin pack size for other reasons.
    
    [1] https://github.com/microsoft/fluentui
    [2] a637a06df05360ce5ff21420803f64608226a875
    
    Checked out at the parent of [2], I see the following statistics:
    
    Test                                               HEAD
    ---------------------------------------------------------------------
    5313.2: thin pack                                  0.37(0.43+0.02)
    5313.3: thin pack size                                        1.2M
    5313.4: thin pack with --full-name-hash            0.06(0.09+0.02)
    5313.5: thin pack size with --full-name-hash                 20.4K
    5313.6: big pack                                   2.01(7.73+0.23)
    5313.7: big pack size                                        20.3M
    5313.8: big pack with --full-name-hash             1.32(2.77+0.27)
    5313.9: big pack size with --full-name-hash                  19.9M
    5313.10: shallow fetch pack                        1.40(3.01+0.08)
    5313.11: shallow pack size                                   34.4M
    5313.12: shallow pack with --full-name-hash        1.08(1.25+0.14)
    5313.13: shallow pack size with --full-name-hash             35.4M
    5313.14: repack                                    90.70(672.88+2.46)
    5313.15: repack size                                        439.6M
    5313.16: repack with --full-name-hash              18.53(123.41+2.53)
    5313.17: repack size with --full-name-hash                  169.7M
    
    In this case, we see positive behaviors such as a significant shrink in
    the size of the thin pack and full repack. The big pack is slightly
    smaller with --full-name-hash than without. The shallow pack is slightly
    larger with --full-name-hash.
    
    In the case of the Git repository, these numbers show some of the issues
    with this approach:
    
    Test                                               HEAD
    --------------------------------------------------------------------
    5313.2: thin pack                                  0.00(0.00+0.00)
    5313.3: thin pack size                                         589
    5313.4: thin pack with --full-name-hash            0.00(0.00+0.00)
    5313.5: thin pack size with --full-name-hash                 14.9K
    5313.6: big pack                                   2.07(3.57+0.17)
    5313.7: big pack size                                        17.6M
    5313.8: big pack with --full-name-hash             2.00(3.07+0.19)
    5313.9: big pack size with --full-name-hash                  17.9M
    5313.10: shallow fetch pack                        1.41(2.23+0.06)
    5313.11: shallow pack size                                   12.1M
    5313.12: shallow pack with --full-name-hash        1.22(1.66+0.04)
    5313.13: shallow pack size with --full-name-hash             12.4M
    5313.14: repack                                    15.75(89.29+1.54)
    5313.15: repack size                                        126.4M
    5313.16: repack with --full-name-hash              15.56(89.78+1.32)
    5313.17: repack size with --full-name-hash                  126.0M
    
    The thin pack that simulates a push is much worse with --full-name-hash
    in this case. The name hash values are doing a lot to assist with delta
    bases, it seems. The big pack and shallow clone cases are slightly worse
    with the --full-name-hash option. Only the full repack gains some
    benefits in size.
    
    The results are similar with the nodejs/node repo:
    
    Test                                               HEAD
    ---------------------------------------------------------------------
    5313.2: thin pack                                  0.01(0.01+0.00)
    5313.3: thin pack size                                        1.6K
    5313.4: thin pack with --full-name-hash            0.01(0.00+0.00)
    5313.5: thin pack size with --full-name-hash                  3.1K
    5313.6: big pack                                   4.26(8.03+0.24)
    5313.7: big pack size                                        56.0M
    5313.8: big pack with --full-name-hash             4.16(6.55+0.22)
    5313.9: big pack size with --full-name-hash                  56.2M
    5313.10: shallow fetch pack                        7.67(11.80+0.29)
    5313.11: shallow pack size                                  104.6M
    5313.12: shallow pack with --full-name-hash        7.52(9.65+0.23)
    5313.13: shallow pack size with --full-name-hash            105.9M
    5313.14: repack                                    71.22(317.61+3.95)
    5313.15: repack size                                        739.9M
    5313.16: repack with --full-name-hash              48.85(267.02+3.72)
    5313.17: repack size with --full-name-hash                  793.5M
    
    The Linux kernel repository was the initial target of the default name
    hash value, and its naming conventions are practically build to take the
    most advantage of the default name hash values:
    
    Test                                               HEAD
    -------------------------------------------------------------------------
    5313.2: thin pack                                  0.15(0.01+0.03)
    5313.3: thin pack size                                        4.6K
    5313.4: thin pack with --full-name-hash            0.03(0.02+0.01)
    5313.5: thin pack size with --full-name-hash                  6.8K
    5313.6: big pack                                   18.51(33.74+0.95)
    5313.7: big pack size                                       201.1M
    5313.8: big pack with --full-name-hash             16.01(29.81+0.88)
    5313.9: big pack size with --full-name-hash                 202.1M
    5313.10: shallow fetch pack                        11.49(17.61+0.54)
    5313.11: shallow pack size                                  269.2M
    5313.12: shallow pack with --full-name-hash        11.24(15.25+0.56)
    5313.13: shallow pack size with --full-name-hash            269.8M
    5313.14: repack                                    1001.25(2271.06+38.86)
    5313.15: repack size                                          2.5G
    5313.16: repack with --full-name-hash              625.75(1941.96+36.09)
    5313.17: repack size with --full-name-hash                    2.6G
    
    Finally, an internal Javascript repo of moderate size shows significant
    gains when repacking with --full-name-hash due to it having many name
    hash collisions. However, it's worth noting that only the full repack
    case has enough improvement to be worth it. But the improvements are
    significant: 6.4 GB to 862 MB.
    
    Test                                               HEAD
    --------------------------------------------------------------------------
    5313.2: thin pack                                  0.03(0.02+0.00)
    5313.3: thin pack size                                        1.2K
    5313.4: thin pack with --full-name-hash            0.03(0.03+0.00)
    5313.5: thin pack size with --full-name-hash                  2.6K
    5313.6: big pack                                   2.20(3.23+0.30)
    5313.7: big pack size                                       130.7M
    5313.8: big pack with --full-name-hash             2.33(3.17+0.34)
    5313.9: big pack size with --full-name-hash                 131.0M
    5313.10: shallow fetch pack                        3.56(6.02+0.32)
    5313.11: shallow pack size                                   44.5M
    5313.12: shallow pack with --full-name-hash        2.94(3.94+0.32)
    5313.13: shallow pack size with --full-name-hash             45.3M
    5313.14: repack                                    2435.22(12523.11+23.53)
    5313.15: repack size                                          6.4G
    5313.16: repack with --full-name-hash              473.25(1805.11+17.22)
    5313.17: repack size with --full-name-hash                  861.9M
    
    These tests demonstrate that it is important to be careful about which
    cases are best for using the --full-name-hash option.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 22, 2024
    Configuration menu
    Copy the full SHA
    21fa75d View commit details
    Browse the repository at this point in the history
  5. pack-objects: disable --full-name-hash when shallow

    As demonstrated in the previous change, the --full-name-hash option of
    'git pack-objects' is less effective in a trunctated history. Thus, even
    when the option is selected via a command-line option or config, disable
    this option when the '--shallow' option is specified. This will help
    performance in servers that choose to enable the --full-name-hash option
    by default for a repository while not regressing their ability to serve
    shallow clones.
    
    This will not present a compatibility issue in the future when the full
    name hash values are stored in the reachability bitmaps, since shallow
    clones disable bitmaps.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 22, 2024
    Configuration menu
    Copy the full SHA
    842f720 View commit details
    Browse the repository at this point in the history
  6. test-tool: add helper for name-hash values

    Add a new test-tool helper, name-hash, to output the value of the
    name-hash algorithms for the input list of strings, one per line.
    
    Since the name-hash values can be stored in the .bitmap files, it is
    important that these hash functions do not change across Git versions.
    Add a simple test to t5310-pack-bitmaps.sh to provide some testing of
    the current values. Due to how these functions are implemented, it would
    be difficult to change them without disturbing these values.
    
    Create a performance test that uses test_size to demonstrate how
    collisions occur for these hash algorithms. This test helps inform
    someone as to the behavior of the name-hash algorithms for their repo
    based on the paths at HEAD.
    
    My copy of the Git repository shows modest statistics around the
    collisions of the default name-hash algorithm:
    
    Test                                              this tree
    -----------------------------------------------------------------
    5314.1: paths at head                                        4.5K
    5314.2: number of distinct name-hashes                       4.1K
    5314.3: number of distinct full-name-hashes                  4.5K
    5314.4: maximum multiplicity of name-hashes                    13
    5314.5: maximum multiplicity of fullname-hashes                 1
    
    Here, the maximum collision multiplicity is 13, but around 10% of paths
    have a collision with another path.
    
    In a more interesting example, the microsoft/fluentui [1] repo had these
    statistics at time of committing:
    
    Test                                              this tree
    -----------------------------------------------------------------
    5314.1: paths at head                                       19.6K
    5314.2: number of distinct name-hashes                       8.2K
    5314.3: number of distinct full-name-hashes                 19.6K
    5314.4: maximum multiplicity of name-hashes                   279
    5314.5: maximum multiplicity of fullname-hashes                 1
    
    [1] https://github.com/microsoft/fluentui
    
    That demonstrates that of the nearly twenty thousand path names, they
    are assigned around eight thousand distinct values. 279 paths are
    assigned to a single value, leading the packing algorithm to sort
    objects from those paths together, by size.
    
    In this repository, no collisions occur for the full-name-hash
    algorithm.
    
    In a more extreme example, an internal monorepo had a much worse
    collision rate:
    
    Test                                              this tree
    -----------------------------------------------------------------
    5314.1: paths at head                                      221.6K
    5314.2: number of distinct name-hashes                      72.0K
    5314.3: number of distinct full-name-hashes                221.6K
    5314.4: maximum multiplicity of name-hashes                 14.4K
    5314.5: maximum multiplicity of fullname-hashes                 2
    
    Even in this repository with many more paths at HEAD, the collision rate
    was low and the maximum number of paths being grouped into a single
    bucket by the full-path-name algorithm was two.
    
    Signed-off-by: Derrick Stolee <[email protected]>
    derrickstolee committed Nov 22, 2024
    Configuration menu
    Copy the full SHA
    c6cb866 View commit details
    Browse the repository at this point in the history