Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add git to filesystem source 301 #312
Add git to filesystem source 301 #312
Changes from 7 commits
6681b22
3e21941
4a84bee
d3073b7
13a891e
1d425dc
4fec235
a6a34ac
6f13dc9
6714daa
90d3cd0
bb3b76c
e2f2995
14eaa82
b2e2d5f
bc35230
9daaf9a
b09f919
3d3222c
f3de00f
2c3c45f
9a2b573
e763f92
d4312ec
6a8834b
fc839db
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
Large diffs are not rendered by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah this will create a process for each file to run git rev parse or something similar. this is indeed a no go... what we should do is to index and cache a whole tree or a path when doing
ls
with a single git command.ie. look here https://stackoverflow.com/questions/1964470/whats-the-equivalent-of-subversions-use-commit-times-for-git
git --no-pager whatchanged --pretty=%at
this is really fast
then here you'll just look into cache
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First cut of this is at 9daaf9a
It is not as fast as I had hoped.
git log --raw
instead ofgit whatchanged
.log
has better documentation. Apparently they both usegit rev-list
behind the scenes. Also using a git pathspec to limit what is returned to the scope of whatls()
is doing.ls()
is operating on. So one call togit log --raw
for every folder in the repo.git log --raw
for the entire folder hierarchy, for given ref. It starts to push the architecture of fsspec though. As far as I can see, we would need to override (or examine at runtime) something up the hierarchy (or call stack) - like glob(), walk() - to detect when they're about to deeply traverse the repo. The glob pattern could then be applied in the pathspec of the git command so it only pre-fetches revisions that will be used.II will try to find where it is slow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2nd cut.f3de00f. Eagerly caches all repo revisions with one
git log --raw
.Here are results for
walk(path="", ref=""HEAD)
per repo:Is this acceptable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be cached in init. 100% it spawns git command and creates a process per every ls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 14eaa82 (link is not in PR context)
Since git refs feature was added the trees differ by commit, so the cache is a little more complex than setting a variable in init.
There is some danger of the user getting unexpected results if they move a pointer (branch, HEAD) during the lifetime of an fsspec instance (noting that fsspec caches instances by defaul). For now, I have simply mentioned that in docstrings as I believe it's an edge case. If it's a common case, the cache could be indexed by commit sha, with conversion functions between ref <--> sha. But I'm wary of spawning more git commands just to manage the cache :) !
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.