Have --analyze provide frequency of changes to individual paths as well #392
When using filter-repo --analyze on problematic repos, especially those with a deep history, I often find that the problematic blobs are not necessarily the largest ones, but medium-sized blobs that are committed either by an automated system or a common binary being frequently rewritten by users. Is there currently an easy way to output the frequency of blobs committed? If not, would there be interest in having a blob-frequency.txt with the blob name and count?

Comments
I don't understand what good that would do. If you have any blob repeated a trillion times, Git will only store one copy of it, so removing highly duplicated blobs doesn't shrink the size of history (well, it does, but only by the size of a single compressed copy).

What can matter is lots of different but similar blobs of medium size, but then you need a way to find those "similar" blobs. The best ways I know of to do that are: (1) filename (e.g. someone stores a medium-sized blob in a given file, then keeps tweaking that file throughout history), (2) directory (e.g. storing a bunch of medium-sized blobs together), (3) extension (e.g. lots of image files or pdfs or presentations or whatever). filter-repo already has facilities for those, though.

The only place highly duplicated medium-sized blobs could cause problems that I can think of is in git bombs, where the concern is not history size but checkout size. However, that's a case where all the blobs are part of the same commit, just checked out at many different paths. But you specifically brought up a deep repository history, which isn't at all a requirement for git bombs, which makes me think it's not relevant for your case. Perhaps you could clarify a bit more what you are seeing?
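(For reference, those per-path, per-directory, and per-extension reports come from filter-repo's analysis pass; the output location below is per filter-repo's documentation:)

```sh
# Generate filter-repo's analysis reports; results are written under
# .git/filter-repo/analysis/ and include size breakdowns by path,
# directory, and extension.
git filter-repo --analyze
```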
Thanks for commenting, @newren! Yes, in particular, I am referring to your scenario 1, where a blob with a particular filename is being rewritten in history many times. I see this often when working with many monorepos, where developers unfamiliar with proper git paradigms will, either manually or via an automated process, commit the same binary throughout history. Because it's only a small-to-medium-sized blob, it's not easy to discern from the analysis outputs that it is a good candidate for repository cleanup (unless a user is specifically looking for filename repetition). This is where I think a blobs-by-count.txt or something similar would be beneficial. It would essentially just be a list of blob ids with a count of how many times each was committed. Hopefully that clears it up a bit! I don't think this would be a computationally expensive file to generate, and the file itself would be smaller than the blob-shas-and-paths.txt file, which I know can tend to be 100mb+ on very large repositories 😅. I am also happy to contribute to this 😄
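(Something along these lines may be what's being proposed — a hypothetical sketch using plain git, not an existing filter-repo output:)

```sh
# Hypothetical sketch of a "blobs-by-count" report: count how many
# commits (re)introduced each blob id across all branches.
# `git log --raw` emits ":<oldmode> <newmode> <oldsha> <newsha> <status>\t<path>"
# per changed file; field 4 is the post-image blob (all zeros = deletion).
git log --all --format= --raw --no-renames --no-abbrev \
  | awk 'NF >= 4 && $4 !~ /^0+$/ {print $4}' \
  | sort | uniq -c | sort -rn | head -20
```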
Sorry, I'm still confused. My scenario 1 was someone committing a blob (let's say its hash abbreviates to deadbeef01) at some path (let's say subdir/dir2/somefile.ext), then updating those contents periodically, meaning the blob changes. So subdir/dir2/somefile.ext has an abbreviated blob hash of deadbeef01 at first, then after the next update it has an abbreviated hash of deadbeef02, then deadbeef03, etc. Checking for duplication by blob_id would give you a count of 1 for each of those blobs. I also don't understand what you mean by a "blobs-by-count.txt".
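(To make the scenario concrete, this sketch lists the blob id stored at the example path in each commit that touched it; in scenario 1 every line differs, so per-blob duplication counts are all 1:)

```sh
# List the blob id recorded for subdir/dir2/somefile.ext in each commit
# that touched it, newest first (commits that deleted the path are skipped).
git log --format='%H' -- subdir/dir2/somefile.ext \
  | while read -r commit; do
      git rev-parse "$commit:subdir/dir2/somefile.ext" 2>/dev/null
    done
```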
If I get @pmartindev right, he wants to know how many times a blob was changed.
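(A pipeline of the kind referenced in the next comment — one that counts the unique blobs ever stored at each path — might look like this; a sketch, and it has exactly the shortcoming described below:)

```sh
# Sketch: count distinct blobs per path across all history.
# `git rev-list --objects --all` lists each object once as "<sha> <path>";
# filtering to blobs and counting occurrences per path gives the number
# of unique blobs ever stored at that path.
git rev-list --objects --all \
  | git cat-file --batch-check='%(objecttype) %(rest)' \
  | awk '$1 == "blob" {sub(/^blob /, ""); print}' \
  | sort | uniq -c | sort -rn | head -20
```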
Ah, a list of paths with a count of changes to the content stored at that path. Thanks for the explanation. Maybe in a file named "frequency-of-changes-per-path.txt"? I'd be fine with something like that.

(As a side note: the above awk/grep pipelines have a potential shortcoming in that they only count the number of unique blobs stored at a given path, which means that if someone reverts to an older version, that change wouldn't be counted. At an extreme, if there was a weird history where people repeatedly reverted the contents of some path back and forth between A and B, and did that thousands of times, the frequency count from the above awk/grep pipelines would only be two, because there are only 2 unique blobs that were ever stored at that path, even if there were thousands of changes to the path. Not sure if that matters, but just thought I'd point it out for when you go to implement it.)
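(Counting commits that touched each path, rather than unique blobs, sidesteps the revert problem; a sketch — renames and merge commits are not specially handled:)

```sh
# Sketch: count how many commits changed each path, so flip-flopping a
# path between contents A and B thousands of times counts as thousands
# of changes instead of collapsing to 2 unique blobs.
git log --all --format= --name-only \
  | awk 'NF' \
  | sort | uniq -c | sort -rn | head -20
```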
Thanks for the explanation @JLuszawski. That is exactly what I was trying to convey 😄
That's a thought I had too. However, I kind of like the fact that the six files matching *-sizes.txt share a consistent naming scheme.