Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FR: Improve/expand the 1MB file auto-snaphot heuristic #5278

Open
necauqua opened this issue Jan 6, 2025 · 6 comments
Open

FR: Improve/expand the 1MB file auto-snaphot heuristic #5278

necauqua opened this issue Jan 6, 2025 · 6 comments
Labels
enhancement New feature or request

Comments

@necauqua
Copy link
Contributor

necauqua commented Jan 6, 2025

Is your feature request related to a problem? Please describe.
Related to auto-tracking (#323).

As far as I understand, currently the snapshotting process get aborted with an error if it encounters a file that's >1mb (or configurable) in size.

However, that heuristic will miss a situation when there's a folder full of small files that get added - in my head when the heuristic is tripped I always think "pfew, good that the target/ folder had large enough files".

Describe the solution you'd like
Instead of checking for individual file sizes, the snapshotting process could measure the total logical size of all files in working copy and compare that to the total logical size of @ (not sure how well that'll work with things like watchman though, this is just an idea).
Now if that differs by >1mb (or configurable) we sound the alarm, "are you sure" type of deal.
It may also catch a large deletion, although not like that's of any concern with the oplog :)

I think I had a .direnv folder snapshotted once or twice where it would probably have been caught by this more generic heuristic.

The most important part - for me, anyway, as I always cared about the size of the content store, not so much about secrets or whatever other auto-tracking concerns there are, is that with this algorithm no snapshot can add more than 1MB to the store

Describe alternatives you've considered
Keep it as-is 🤷

@necauqua
Copy link
Contributor Author

necauqua commented Jan 6, 2025

Even better idea - to just measure the sum of all new files to be snapshotted, although idk about changes in tracked files - but this will cover the case where "we add 1mb but also delete 1mb, so the heuristic did not trip"

@PhilipMetzger
Copy link
Contributor

It no longer fails since 168c797 but I agree that the heuristic can be improved.

@PhilipMetzger PhilipMetzger added the enhancement New feature or request label Jan 6, 2025
@necauqua
Copy link
Contributor Author

necauqua commented Jan 6, 2025

Huh, in context of 168c797 my idea makes a bit less sense - since I was thinking about aborting the snapshot completely if the sum of changes is >1MB, and I'm not sure how it can be molded into the new "print a warning and don't track the offending file" behavior - which btw makes my example with "pfew a file in target/ was large enough for the entire target/ to not get snapshotted" moot

Could be a config option too?.
But the comment on the commit is valid, overruling the heuristic (by a config override) was a bit annoying, although personally I never had false positives :)

@arxanas
Copy link
Contributor

arxanas commented Jan 6, 2025

The most important part - for me, anyway, as I always cared about the size of the content store, not so much about secrets or whatever other auto-tracking concerns there are, is that with this algorithm no snapshot can add more than 1MB to the store

In #323 (comment), I was obliquely proposing that we can treat untracked files similarly to Git LFS files. In that case, we could start tracking the files without tracking their contents (and leave the contents in the working copy). You'd have to eventually either ignore such files or start tracking their contents (perhaps as a part of trying to push?), but the content store wouldn't be bloated until such time as you explicitly start tracking their contents. Do you think a workflow like that would work for you?

@necauqua
Copy link
Contributor Author

necauqua commented Jan 6, 2025

Yeah I wanted to say that your idea there sounds incredible as a replacement (adjustment?) of the current 1mb heuristic.

This issue is about if a dir with a ton of small files was added - I don't think your proposal covers it?
Not sure what can even cover it, probably we'll just have to live with it lol
Wait your proposal does actually kind of cover it, if a not-yet-tracked directory is >1mb total we can store the reference type of thing in the tree for that dir, like you said.

Yes, this does sound very interesting then.

@arxanas
Copy link
Contributor

arxanas commented Jan 6, 2025

This issue is about if a dir with a ton of small files was added - I don't think your proposal covers it?
Not sure what can even cover it, probably we'll just have to live with it lol
Wait your proposal does actually kind of cover it, if a not-yet-tracked directory is >1mb total we can store the reference type of thing in the tree for that dir, like you said.

Yeah, my thought was that we probably don't want to descend into untracked directories and instead want to leave an opaque 'untracked tree' in the commit in all circumstances, just from a practical perspective.

  • The main specific consideration I had in mind was that build artifact directories can have a ton of files.
    • In the untracked workflow, I didn't see that we would benefit much from visiting them all? There might be a use-case that I haven't thought of where it makes sense, though.
    • If we do visit all the files, we might incur a noticeable performance overhead.
    • I hadn't really worked out whether untracked files and directories would only include paths, or would also include their content hashes (without actually including the corresponding content in the store). Obviously, hashing the files will have performance implications.
  • I didn't distinguish between untracked directories of greater than any particular size. I was just proposing not to descend into untracked directories at all.
    • I personally don't like it when git status starts dumping one million (1,000,000) files into my terminal for my consideration, but there are probably several ways to address this issue from a UX perspective.
    • From a UX perspective, to support the workflow of selecting individual files to track from a directory, you could imagine even committing just the empty directory itself as an empty tree, and then perhaps jj status would start listing the untracked files within that directory. (Note that Git doesn't usually allow you to commit empty directories, and jj may inherit some quirks.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants