-
Notifications
You must be signed in to change notification settings - Fork 133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pack-objects: Create an alternative name hash algorithm (recreated) #1823
base: master
Are you sure you want to change the base?
Commits on Nov 22, 2024
-
pack-objects: add --full-name-hash option
The pack_name_hash() method has not been materially changed since it was introduced in ce0bd64 (pack-objects: improve path grouping heuristics., 2006-06-05). The intention here is to group objects by path name, but also attempt to group similar file types together by making the most-significant digits of the hash be focused on the final characters. Here's the crux of the implementation: /* * This effectively just creates a sortable number from the * last sixteen non-whitespace characters. Last characters * count "most", so things that end in ".c" sort together. */ while ((c = *name++) != 0) { if (isspace(c)) continue; hash = (hash >> 2) + (c << 24); } As the comment mentions, this only cares about the last sixteen non-whitespace characters. This cause some filenames to collide more than others. This collision is somewhat by design in order to promote hash locality for files that have similar types (.c, .h, .json) or could be the same file across a directory rename (a/foo.txt to b/foo.txt). This leads to decent cross-path deltas in cases like shallow clones or packing a repository with very few historical versions of files that share common data with other similarly-named files. However, when the name-hash instead leads to a large number of name-hash collisions for otherwise unrelated files, this can lead to confusing the delta calculation to prefer cross-path deltas over previous versions of the same file. Here are some examples that I've seen while investigating repositories that are growing more than they should be: * "/CHANGELOG.json" is 15 characters, and is created by the beachball [1] tool. Only the final character of the parent directory can differentiate different versions of this file, but also only the two most-significant digits. If that character is a letter, then this is always a collision. Similar issues occur with the similar "/CHANGELOG.md" path, though there is more opportunity for differences in the parent directory. * Localization files frequently have common filenames but differentiates via parent directories. In C#, the name "/strings.resx.lcl" is used for these localization files and they will all collide in name-hash. [1] https://github.com/microsoft/beachball I've come across many other examples where some internal tool uses a common name across multiple directories and is causing Git to repack poorly due to name-hash collisions. It is clear that the existing name-hash algorithm is optimized for repositories with short path names, but also is optimized for packing a single snapshot of a repository, not a repository with many versions of the same file. In my testing, this has proven out where the name-hash algorithm does a good job of finding peer files as delta bases when unable to use a historical version of that exact file. However, for repositories that have many versions of most files and directories, it is more important that the objects that appear at the same path are grouped together. Create a new pack_full_name_hash() method and a new --full-name-hash option for 'git pack-objects' to call that method instead. Add a simple pass-through for 'git repack --full-name-hash' for additional testing in the context of a full repack, where I expect this will be most effective. The hash algorithm is as simple as possible to be reasonably effective: for each character of the path string, add a multiple of that character and a large prime number (chosen arbitrarily, but intended to be large relative to the size of a uint32_t). Then, shift the current hash value to the right by 5, with overlap. The addition and shift parameters are standard mechanisms for creating hard-to-predict behaviors in the bits of the resulting hash. This is not meant to be cryptographic at all, but uniformly distributed across the possible hash values. This creates a hash that appears pseudorandom. There is no ability to consider similar file types as being close to each other. In a later change, a test-tool will be added so the effectiveness of this hash can be demonstrated directly. For now, let's consider how effective this mechanism is when repacking a repository with and without the --full-name-hash option. Specifically, let's use 'git repack -adf [--full-name-hash]' as our test. On the Git repository, we do not expect much difference. All path names are short. This is backed by our results: | Stage | Pack Size | Repack Time | |-----------------------|-----------|-------------| | After clone | 260 MB | N/A | | Standard Repack | 127MB | 106s | | With --full-name-hash | 126 MB | 99s | This example demonstrates how there is some natural overhead coming from the cloned copy because the server is hosting many forks and has not optimized for exactly this set of reachable objects. But the full repack has similar characteristics with and without --full-name-hash. However, we can test this in a repository that uses one of the problematic naming conventions above. The fluentui [2] repo uses beachball to generate CHANGELOG.json and CHANGELOG.md files, and these files have very poor delta characteristics when comparing against versions across parent directories. | Stage | Pack Size | Repack Time | |-----------------------|-----------|-------------| | After clone | 694 MB | N/A | | Standard Repack | 438 MB | 728s | | With --full-name-hash | 168 MB | 142s | [2] https://github.com/microsoft/fluentui In this example, we see significant gains in the compressed packfile size as well as the time taken to compute the packfile. Using a collection of repositories that use the beachball tool, I was able to make similar comparisions with dramatic results. While the fluentui repo is public, the others are private so cannot be shared for reproduction. The results are so significant that I find it important to share here: | Repo | Standard Repack | With --full-name-hash | |----------|-----------------|-----------------------| | fluentui | 438 MB | 168 MB | | Repo B | 6,255 MB | 829 MB | | Repo C | 37,737 MB | 7,125 MB | | Repo D | 130,049 MB | 6,190 MB | Future changes could include making --full-name-hash implied by a config value or even implied by default during a full repack. It is important to point out that the name hash value is stored in the .bitmap file format, so we must disable the --full-name-hash option when bitmaps are being read or written. Later, the bitmap format could be updated to be aware of the name hash version so deltas can be quickly computed across the bitmapped/not-bitmapped boundary. Signed-off-by: Derrick Stolee <[email protected]>
Configuration menu - View commit details
-
Copy full SHA for c04c98f - Browse repository at this point
Copy the full SHA c04c98fView commit details -
repack: add --full-name-hash option
The new '--full-name-hash' option for 'git repack' is a simple pass-through to the underlying 'git pack-objects' subcommand. However, this subcommand may have other options and a temporary filename as part of the subcommand execution that may not be predictable or could change over time. The existing test_subcommand method requires an exact list of arguments for the subcommand. This is too rigid for our needs here, so create a new method, test_subcommand_flex. Use it to check that the --full-name-hash option is passing through. Signed-off-by: Derrick Stolee <[email protected]>
Configuration menu - View commit details
-
Copy full SHA for 6c37277 - Browse repository at this point
Copy the full SHA 6c37277View commit details -
pack-objects: add GIT_TEST_FULL_NAME_HASH
Add a new environment variable to opt-in to the --full-name-hash option in 'git pack-objects'. This allows for extra testing of the feature without repeating all of the test scenarios. Unlike many GIT_TEST_* variables, we are choosing to not add this to the linux-TEST-vars CI build as that test run is already overloaded. The behavior exposed by this test variable is of low risk and should be sufficient to allow manual testing when an issue arises. But this option isn't free. There are a few tests that change behavior with the variable enabled. First, there are a few tests that are very sensitive to certain delta bases being picked. These are both involving the generation of thin bundles and then counting their objects via 'git index-pack --fix-thin' which pulls the delta base into the new packfile. For these tests, disable the option as a decent long-term option. Second, there are two tests in t5616-partial-clone.sh that I believe are actually broken scenarios. While the client is set up to clone the 'promisor-server' repo via a treeless partial clone filter (tree:0), that filter does not translate to the 'server' repo. Thus, fetching from these repos causes the server to think that the client has all reachable trees and blobs from the commits advertised as 'haves'. This leads the server to providing a thin pack assuming those objects as delta bases. Changing the name-hash algorithm presents new delta bases and thus breaks the expectations of these tests. An alternative could be to set up 'server' as a promisor server with the correct filter enabled. This may also point out more issues with partial clone being set up as a remote-based filtering mechanism and not a repository-wide setting. For now, do the minimal change to make the test work by disabling the test variable. Third, there are some tests that compare the exact output of a 'git pack-objects' process when using bitmaps. The warning that disables the --full-name-hash option causes these tests to fail. Disable the environment variable to get around this issue. Signed-off-by: Derrick Stolee <[email protected]>
Configuration menu - View commit details
-
Copy full SHA for 00e7e8b - Browse repository at this point
Copy the full SHA 00e7e8bView commit details -
p5313: add size comparison test
As custom options are added to 'git pack-objects' and 'git repack' to adjust how compression is done, use this new performance test script to demonstrate their effectiveness in performance and size. The recently-added --full-name-hash option swaps the default name-hash algorithm with one that attempts to uniformly distribute the hashes based on the full path name instead of the last 16 characters. This has a dramatic effect on full repacks for repositories with many versions of most paths. It can have a negative impact on cases such as pushing a single change. This can be seen by running pt5313 on the open source fluentui repository [1]. Most commits will have this kind of output for the thin and big pack cases, though certain commits (such as [2]) will have problematic thin pack size for other reasons. [1] https://github.com/microsoft/fluentui [2] a637a06df05360ce5ff21420803f64608226a875 Checked out at the parent of [2], I see the following statistics: Test HEAD --------------------------------------------------------------------- 5313.2: thin pack 0.37(0.43+0.02) 5313.3: thin pack size 1.2M 5313.4: thin pack with --full-name-hash 0.06(0.09+0.02) 5313.5: thin pack size with --full-name-hash 20.4K 5313.6: big pack 2.01(7.73+0.23) 5313.7: big pack size 20.3M 5313.8: big pack with --full-name-hash 1.32(2.77+0.27) 5313.9: big pack size with --full-name-hash 19.9M 5313.10: shallow fetch pack 1.40(3.01+0.08) 5313.11: shallow pack size 34.4M 5313.12: shallow pack with --full-name-hash 1.08(1.25+0.14) 5313.13: shallow pack size with --full-name-hash 35.4M 5313.14: repack 90.70(672.88+2.46) 5313.15: repack size 439.6M 5313.16: repack with --full-name-hash 18.53(123.41+2.53) 5313.17: repack size with --full-name-hash 169.7M In this case, we see positive behaviors such as a significant shrink in the size of the thin pack and full repack. The big pack is slightly smaller with --full-name-hash than without. The shallow pack is slightly larger with --full-name-hash. In the case of the Git repository, these numbers show some of the issues with this approach: Test HEAD -------------------------------------------------------------------- 5313.2: thin pack 0.00(0.00+0.00) 5313.3: thin pack size 589 5313.4: thin pack with --full-name-hash 0.00(0.00+0.00) 5313.5: thin pack size with --full-name-hash 14.9K 5313.6: big pack 2.07(3.57+0.17) 5313.7: big pack size 17.6M 5313.8: big pack with --full-name-hash 2.00(3.07+0.19) 5313.9: big pack size with --full-name-hash 17.9M 5313.10: shallow fetch pack 1.41(2.23+0.06) 5313.11: shallow pack size 12.1M 5313.12: shallow pack with --full-name-hash 1.22(1.66+0.04) 5313.13: shallow pack size with --full-name-hash 12.4M 5313.14: repack 15.75(89.29+1.54) 5313.15: repack size 126.4M 5313.16: repack with --full-name-hash 15.56(89.78+1.32) 5313.17: repack size with --full-name-hash 126.0M The thin pack that simulates a push is much worse with --full-name-hash in this case. The name hash values are doing a lot to assist with delta bases, it seems. The big pack and shallow clone cases are slightly worse with the --full-name-hash option. Only the full repack gains some benefits in size. The results are similar with the nodejs/node repo: Test HEAD --------------------------------------------------------------------- 5313.2: thin pack 0.01(0.01+0.00) 5313.3: thin pack size 1.6K 5313.4: thin pack with --full-name-hash 0.01(0.00+0.00) 5313.5: thin pack size with --full-name-hash 3.1K 5313.6: big pack 4.26(8.03+0.24) 5313.7: big pack size 56.0M 5313.8: big pack with --full-name-hash 4.16(6.55+0.22) 5313.9: big pack size with --full-name-hash 56.2M 5313.10: shallow fetch pack 7.67(11.80+0.29) 5313.11: shallow pack size 104.6M 5313.12: shallow pack with --full-name-hash 7.52(9.65+0.23) 5313.13: shallow pack size with --full-name-hash 105.9M 5313.14: repack 71.22(317.61+3.95) 5313.15: repack size 739.9M 5313.16: repack with --full-name-hash 48.85(267.02+3.72) 5313.17: repack size with --full-name-hash 793.5M The Linux kernel repository was the initial target of the default name hash value, and its naming conventions are practically build to take the most advantage of the default name hash values: Test HEAD ------------------------------------------------------------------------- 5313.2: thin pack 0.15(0.01+0.03) 5313.3: thin pack size 4.6K 5313.4: thin pack with --full-name-hash 0.03(0.02+0.01) 5313.5: thin pack size with --full-name-hash 6.8K 5313.6: big pack 18.51(33.74+0.95) 5313.7: big pack size 201.1M 5313.8: big pack with --full-name-hash 16.01(29.81+0.88) 5313.9: big pack size with --full-name-hash 202.1M 5313.10: shallow fetch pack 11.49(17.61+0.54) 5313.11: shallow pack size 269.2M 5313.12: shallow pack with --full-name-hash 11.24(15.25+0.56) 5313.13: shallow pack size with --full-name-hash 269.8M 5313.14: repack 1001.25(2271.06+38.86) 5313.15: repack size 2.5G 5313.16: repack with --full-name-hash 625.75(1941.96+36.09) 5313.17: repack size with --full-name-hash 2.6G Finally, an internal Javascript repo of moderate size shows significant gains when repacking with --full-name-hash due to it having many name hash collisions. However, it's worth noting that only the full repack case has enough improvement to be worth it. But the improvements are significant: 6.4 GB to 862 MB. Test HEAD -------------------------------------------------------------------------- 5313.2: thin pack 0.03(0.02+0.00) 5313.3: thin pack size 1.2K 5313.4: thin pack with --full-name-hash 0.03(0.03+0.00) 5313.5: thin pack size with --full-name-hash 2.6K 5313.6: big pack 2.20(3.23+0.30) 5313.7: big pack size 130.7M 5313.8: big pack with --full-name-hash 2.33(3.17+0.34) 5313.9: big pack size with --full-name-hash 131.0M 5313.10: shallow fetch pack 3.56(6.02+0.32) 5313.11: shallow pack size 44.5M 5313.12: shallow pack with --full-name-hash 2.94(3.94+0.32) 5313.13: shallow pack size with --full-name-hash 45.3M 5313.14: repack 2435.22(12523.11+23.53) 5313.15: repack size 6.4G 5313.16: repack with --full-name-hash 473.25(1805.11+17.22) 5313.17: repack size with --full-name-hash 861.9M These tests demonstrate that it is important to be careful about which cases are best for using the --full-name-hash option. Signed-off-by: Derrick Stolee <[email protected]>
Configuration menu - View commit details
-
Copy full SHA for 21fa75d - Browse repository at this point
Copy the full SHA 21fa75dView commit details -
pack-objects: disable --full-name-hash when shallow
As demonstrated in the previous change, the --full-name-hash option of 'git pack-objects' is less effective in a trunctated history. Thus, even when the option is selected via a command-line option or config, disable this option when the '--shallow' option is specified. This will help performance in servers that choose to enable the --full-name-hash option by default for a repository while not regressing their ability to serve shallow clones. This will not present a compatibility issue in the future when the full name hash values are stored in the reachability bitmaps, since shallow clones disable bitmaps. Signed-off-by: Derrick Stolee <[email protected]>
Configuration menu - View commit details
-
Copy full SHA for 842f720 - Browse repository at this point
Copy the full SHA 842f720View commit details -
test-tool: add helper for name-hash values
Add a new test-tool helper, name-hash, to output the value of the name-hash algorithms for the input list of strings, one per line. Since the name-hash values can be stored in the .bitmap files, it is important that these hash functions do not change across Git versions. Add a simple test to t5310-pack-bitmaps.sh to provide some testing of the current values. Due to how these functions are implemented, it would be difficult to change them without disturbing these values. Create a performance test that uses test_size to demonstrate how collisions occur for these hash algorithms. This test helps inform someone as to the behavior of the name-hash algorithms for their repo based on the paths at HEAD. My copy of the Git repository shows modest statistics around the collisions of the default name-hash algorithm: Test this tree ----------------------------------------------------------------- 5314.1: paths at head 4.5K 5314.2: number of distinct name-hashes 4.1K 5314.3: number of distinct full-name-hashes 4.5K 5314.4: maximum multiplicity of name-hashes 13 5314.5: maximum multiplicity of fullname-hashes 1 Here, the maximum collision multiplicity is 13, but around 10% of paths have a collision with another path. In a more interesting example, the microsoft/fluentui [1] repo had these statistics at time of committing: Test this tree ----------------------------------------------------------------- 5314.1: paths at head 19.6K 5314.2: number of distinct name-hashes 8.2K 5314.3: number of distinct full-name-hashes 19.6K 5314.4: maximum multiplicity of name-hashes 279 5314.5: maximum multiplicity of fullname-hashes 1 [1] https://github.com/microsoft/fluentui That demonstrates that of the nearly twenty thousand path names, they are assigned around eight thousand distinct values. 279 paths are assigned to a single value, leading the packing algorithm to sort objects from those paths together, by size. In this repository, no collisions occur for the full-name-hash algorithm. In a more extreme example, an internal monorepo had a much worse collision rate: Test this tree ----------------------------------------------------------------- 5314.1: paths at head 221.6K 5314.2: number of distinct name-hashes 72.0K 5314.3: number of distinct full-name-hashes 221.6K 5314.4: maximum multiplicity of name-hashes 14.4K 5314.5: maximum multiplicity of fullname-hashes 2 Even in this repository with many more paths at HEAD, the collision rate was low and the maximum number of paths being grouped into a single bucket by the full-path-name algorithm was two. Signed-off-by: Derrick Stolee <[email protected]>
Configuration menu - View commit details
-
Copy full SHA for c6cb866 - Browse repository at this point
Copy the full SHA c6cb866View commit details