-
Notifications
You must be signed in to change notification settings - Fork 456
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix/issue 2090 : Add a repo_files
command, with recursive deletion.
#2280
Conversation
79e5a25
to
8e3ad70
Compare
8e3ad70
to
4b30a31
Compare
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
@Wauplin just ran Ruff, sorry 'bout this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @OlivierKessler01, thanks a lot for this extensive work! It's a very good first version and I've reviewed it top to bottom. You'll find my comments below. Please let me know if you have any questions :) Thanks again!
src/huggingface_hub/hf_api.py
Outdated
Specifying `parent_commit` ensures the repo has not changed before committing the changes, and can be | ||
especially useful if the repo is updated / committed to concurrently. | ||
""" | ||
patterns = [p + "*" if p[-1] == "/" else p for p in patterns] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. I had a look into it and I think it should be done in filter_repo_objects
directly. I just pushed 19d6b19 to your PR, hope you don't mind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @OlivierKessler01, many thanks for the changes! I've pushed a few commits (see above) to finalize it and now everything looks good to me.
Last thing to update when implementing a new feature: adding a section in the docs! In this case, it should be here: https://github.com/huggingface/huggingface_hub/blob/main/docs/source/en/guides/cli.md. Which correspond to these docs. I think a section between huggingface-cli upload
and huggingface-cli scan-cache
would be good. You can have:
## huggingface-cli repo-files
(... some introduction content ...)
### Delete files
(... docs of your PR with 2-3 examples ...)
## huggingface-cli scan-cache
(...)
… repo_files delete command
Hi @Wauplin thank you for the contribution. I just added the documentation block, have a look :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yay! Thanks again @OlivierKessler01 for all the work on this PR 🚀 I just had a look to the docs and I think we are good to go. Let's wait for the CI to complete and I'll merge.
And, merged! |
Implement a new
repo-files
command to manipulate files (deletion...)Fixes #2235
Opiniated choice made regarding globs matching
Crux
I've encoutered an issue when implementing the folder matching as per this issue specification :
To implement the pattern lexer I reused
huggingface_hub.hf_api._prepare_folder_deletions()
, which is already used in theupload
command to lexe/tokenize allow and disallow lists.The function is a wrapper around
fnmatch.fnmatch()
(viahuggingface_hub.utils._paths.filter_repo_objects()
). It matches files against unix shell-style file matching patterns.Unix shell-style automatons respond to this case in an rather unconvenient way :
Solution implemented
It adds "*" at the end of the patterns that end with "/". Given git untracks any folder that is empty, deleting all files in the folder effectively deletes the folder. Did it in the new
huggingface_hub.hf_api.delete_files_r()
function, because updatinghuggingface_hub.utils._paths.filter_repo_objects()
at a lower-level would alter the behavior of various functions like theupload
command and the commit scheduler in a non-BC compatible way.What do you think about it ?
All this has been documented in the function doc.
What could have been done instead
Remove support for this
huggingface-cli repo-files <repo_id> delete folder/
Natively support this through already existing logic, and let Git untrack the folder.
huggingface-cli repo-files <repo_id> delete folder/*
Tests added (duration that might disrupt DevX and project output)
Remarks
Feel free to tell me if you'd like something different here :)