FR: Improve/expand the 1MB file auto-snaphot heuristic #5278

necauqua · 2025-01-06T18:55:01Z

Is your feature request related to a problem? Please describe.
Related to auto-tracking (#323).

As far as I understand, currently the snapshotting process get aborted with an error if it encounters a file that's >1mb (or configurable) in size.

However, that heuristic will miss a situation when there's a folder full of small files that get added - in my head when the heuristic is tripped I always think "pfew, good that the target/ folder had large enough files".

Describe the solution you'd like
Instead of checking for individual file sizes, the snapshotting process could measure the total logical size of all files in working copy and compare that to the total logical size of @ (not sure how well that'll work with things like watchman though, this is just an idea).
Now if that differs by >1mb (or configurable) we sound the alarm, "are you sure" type of deal.
It may also catch a large deletion, although not like that's of any concern with the oplog :)

I think I had a .direnv folder snapshotted once or twice where it would probably have been caught by this more generic heuristic.

The most important part - for me, anyway, as I always cared about the size of the content store, not so much about secrets or whatever other auto-tracking concerns there are, is that with this algorithm no snapshot can add more than 1MB to the store

Describe alternatives you've considered
Keep it as-is 🤷

The text was updated successfully, but these errors were encountered:

necauqua · 2025-01-06T18:58:03Z

Even better idea - to just measure the sum of all new files to be snapshotted, although idk about changes in tracked files - but this will cover the case where "we add 1mb but also delete 1mb, so the heuristic did not trip"

PhilipMetzger · 2025-01-06T18:59:54Z

It no longer fails since 168c797 but I agree that the heuristic can be improved.

necauqua · 2025-01-06T20:15:54Z

Huh, in context of 168c797 my idea makes a bit less sense - since I was thinking about aborting the snapshot completely if the sum of changes is >1MB, and I'm not sure how it can be molded into the new "print a warning and don't track the offending file" behavior - which btw makes my example with "pfew a file in target/ was large enough for the entire target/ to not get snapshotted" moot

Could be a config option too?.
But the comment on the commit is valid, overruling the heuristic (by a config override) was a bit annoying, although personally I never had false positives :)

arxanas · 2025-01-06T20:47:06Z

The most important part - for me, anyway, as I always cared about the size of the content store, not so much about secrets or whatever other auto-tracking concerns there are, is that with this algorithm no snapshot can add more than 1MB to the store

In #323 (comment), I was obliquely proposing that we can treat untracked files similarly to Git LFS files. In that case, we could start tracking the files without tracking their contents (and leave the contents in the working copy). You'd have to eventually either ignore such files or start tracking their contents (perhaps as a part of trying to push?), but the content store wouldn't be bloated until such time as you explicitly start tracking their contents. Do you think a workflow like that would work for you?

necauqua · 2025-01-06T20:52:55Z

Yeah I wanted to say that your idea there sounds incredible as a replacement (adjustment?) of the current 1mb heuristic.

This issue is about if a dir with a ton of small files was added - I don't think your proposal covers it?
~~Not sure what can even cover it, probably we'll just have to live with it lol~~
Wait your proposal does actually kind of cover it, if a not-yet-tracked directory is >1mb total we can store the reference type of thing in the tree for that dir, like you said.

Yes, this does sound very interesting then.

arxanas · 2025-01-06T21:05:54Z

This issue is about if a dir with a ton of small files was added - I don't think your proposal covers it?
~~Not sure what can even cover it, probably we'll just have to live with it lol~~
Wait your proposal does actually kind of cover it, if a not-yet-tracked directory is >1mb total we can store the reference type of thing in the tree for that dir, like you said.

Yeah, my thought was that we probably don't want to descend into untracked directories and instead want to leave an opaque 'untracked tree' in the commit in all circumstances, just from a practical perspective.

The main specific consideration I had in mind was that build artifact directories can have a ton of files.
- In the untracked workflow, I didn't see that we would benefit much from visiting them all? There might be a use-case that I haven't thought of where it makes sense, though.
- If we do visit all the files, we might incur a noticeable performance overhead.
- I hadn't really worked out whether untracked files and directories would only include paths, or would also include their content hashes (without actually including the corresponding content in the store). Obviously, hashing the files will have performance implications.
I didn't distinguish between untracked directories of greater than any particular size. I was just proposing not to descend into untracked directories at all.
- I personally don't like it when git status starts dumping one million (1,000,000) files into my terminal for my consideration, but there are probably several ways to address this issue from a UX perspective.
- From a UX perspective, to support the workflow of selecting individual files to track from a directory, you could imagine even committing just the empty directory itself as an empty tree, and then perhaps jj status would start listing the untracked files within that directory. (Note that Git doesn't usually allow you to commit empty directories, and jj may inherit some quirks.)

PhilipMetzger added the enhancement New feature or request label Jan 6, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FR: Improve/expand the 1MB file auto-snaphot heuristic #5278

FR: Improve/expand the 1MB file auto-snaphot heuristic #5278

necauqua commented Jan 6, 2025

necauqua commented Jan 6, 2025 •

edited

Loading

PhilipMetzger commented Jan 6, 2025

necauqua commented Jan 6, 2025

arxanas commented Jan 6, 2025

necauqua commented Jan 6, 2025

arxanas commented Jan 6, 2025

FR: Improve/expand the 1MB file auto-snaphot heuristic #5278

FR: Improve/expand the 1MB file auto-snaphot heuristic #5278

Comments

necauqua commented Jan 6, 2025

necauqua commented Jan 6, 2025 • edited Loading

PhilipMetzger commented Jan 6, 2025

necauqua commented Jan 6, 2025

arxanas commented Jan 6, 2025

necauqua commented Jan 6, 2025

arxanas commented Jan 6, 2025

necauqua commented Jan 6, 2025 •

edited

Loading