Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix/issue 2090 : Add a repo_files command, with recursive deletion. #2280

Merged
merged 26 commits into from
Jun 3, 2024

Conversation

OlivierKessler01
Copy link
Contributor

@OlivierKessler01 OlivierKessler01 commented May 13, 2024

Implement a new repo-files command to manipulate files (deletion...)

Fixes #2235

Opiniated choice made regarding globs matching

Crux

I've encoutered an issue when implementing the folder matching as per this issue specification :

huggingface-cli repo-files <repo_id> delete folder/

To implement the pattern lexer I reused huggingface_hub.hf_api._prepare_folder_deletions(), which is already used in the upload command to lexe/tokenize allow and disallow lists.

The function is a wrapper around fnmatch.fnmatch() (via huggingface_hub.utils._paths.filter_repo_objects()). It matches files against unix shell-style file matching patterns.

Unix shell-style automatons respond to this case in an rather unconvenient way :

>>> fnmatch.fnmatch("nested/file.bin", "nested/")
False

Solution implemented

It adds "*" at the end of the patterns that end with "/". Given git untracks any folder that is empty, deleting all files in the folder effectively deletes the folder. Did it in the new huggingface_hub.hf_api.delete_files_r() function, because updating huggingface_hub.utils._paths.filter_repo_objects() at a lower-level would alter the behavior of various functions like the upload command and the commit scheduler in a non-BC compatible way.

What do you think about it ?

huggingface-cli upload --delete="nested/"
#the line above still doesn't work, but it's consistent (logic unchanged, interface unchanged). 

huggingface-cli repo-files delete nested/ #interpreted as `huggingface-cli repo-files delete nested/*`
#the line above works as per https://github.com/huggingface/huggingface_hub/issues/2235 specs.

All this has been documented in the function doc.

What could have been done instead

Remove support for this
huggingface-cli repo-files <repo_id> delete folder/

Natively support this through already existing logic, and let Git untrack the folder.
huggingface-cli repo-files <repo_id> delete folder/*

Tests added (duration that might disrupt DevX and project output)

20.46s call     tests/test_hf_api.py::CommitApiTest::test_delete_files_r

Remarks

Feel free to tell me if you'd like something different here :)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@OlivierKessler01
Copy link
Contributor Author

@Wauplin just ran Ruff, sorry 'bout this.

Copy link
Contributor

@Wauplin Wauplin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @OlivierKessler01, thanks a lot for this extensive work! It's a very good first version and I've reviewed it top to bottom. You'll find my comments below. Please let me know if you have any questions :) Thanks again!

src/huggingface_hub/commands/repo_files.py Outdated Show resolved Hide resolved
src/huggingface_hub/commands/repo_files.py Outdated Show resolved Hide resolved
src/huggingface_hub/commands/repo_files.py Outdated Show resolved Hide resolved
src/huggingface_hub/commands/repo_files.py Outdated Show resolved Hide resolved
tests/test_command_repo_files.py Outdated Show resolved Hide resolved
tests/test_command_repo_files.py Outdated Show resolved Hide resolved
tests/test_hf_api.py Outdated Show resolved Hide resolved
tests/test_hf_api.py Outdated Show resolved Hide resolved
src/huggingface_hub/commands/repo_files.py Show resolved Hide resolved
src/huggingface_hub/commands/repo_files.py Outdated Show resolved Hide resolved
Specifying `parent_commit` ensures the repo has not changed before committing the changes, and can be
especially useful if the repo is updated / committed to concurrently.
"""
patterns = [p + "*" if p[-1] == "/" else p for p in patterns]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. I had a look into it and I think it should be done in filter_repo_objects directly. I just pushed 19d6b19 to your PR, hope you don't mind.

tests/test_hf_api.py Outdated Show resolved Hide resolved
Copy link
Contributor

@Wauplin Wauplin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @OlivierKessler01, many thanks for the changes! I've pushed a few commits (see above) to finalize it and now everything looks good to me.

Last thing to update when implementing a new feature: adding a section in the docs! In this case, it should be here: https://github.com/huggingface/huggingface_hub/blob/main/docs/source/en/guides/cli.md. Which correspond to these docs. I think a section between huggingface-cli upload and huggingface-cli scan-cache would be good. You can have:

## huggingface-cli repo-files

(... some introduction content ...)

### Delete files

(... docs of your PR with 2-3 examples ...)

## huggingface-cli scan-cache

(...)

@OlivierKessler01
Copy link
Contributor Author

Hi @Wauplin thank you for the contribution. I just added the documentation block, have a look :)

docs/source/en/guides/cli.md Outdated Show resolved Hide resolved
docs/source/en/guides/cli.md Outdated Show resolved Hide resolved
docs/source/en/guides/cli.md Outdated Show resolved Hide resolved
docs/source/en/guides/cli.md Outdated Show resolved Hide resolved
docs/source/en/guides/cli.md Outdated Show resolved Hide resolved
docs/source/en/guides/cli.md Outdated Show resolved Hide resolved
docs/source/en/guides/cli.md Outdated Show resolved Hide resolved
src/huggingface_hub/commands/repo_files.py Outdated Show resolved Hide resolved
Copy link
Contributor

@Wauplin Wauplin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay! Thanks again @OlivierKessler01 for all the work on this PR 🚀 I just had a look to the docs and I think we are good to go. Let's wait for the CI to complete and I'll merge.

@Wauplin Wauplin merged commit c8dc5f5 into huggingface:main Jun 3, 2024
14 checks passed
@Wauplin
Copy link
Contributor

Wauplin commented Jun 3, 2024

And, merged!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CLI] Command line to delete files on a repo
3 participants