[offload] DiskCache.clean_offload_dir#678
Conversation
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
| } | ||
|
|
||
| @classmethod | ||
| def clean_offload_dir(cls, offload_dir: str | os.PathLike | None = None) -> int: |
There was a problem hiding this comment.
Do we ever expect there to be files in the offload dir that aren't created by CT? Or files created by CT which are not in the offload dir? Why not just delete the whole directly?
There was a problem hiding this comment.
I went with a design goal of scoping to only files created in this session, as the user might be running multiple sessions or jobs and clearing out a user's offload_dir entirely seems like a bad side-effect / UX
There was a problem hiding this comment.
Makes sense, #666 will help handle the case where two sessions share the same folder
| except FileNotFoundError: | ||
| pass # Already deleted |
There was a problem hiding this comment.
Why would this case occur?
There was a problem hiding this comment.
if a file is deleted by some other means? seems like a place where it's worth a few lines to fool-proof it
There was a problem hiding this comment.
Consider using os.path.exists instead of try; catch
| } | ||
|
|
||
| @classmethod | ||
| def clean_offload_dir(cls, offload_dir: str | os.PathLike | None = None) -> int: |
There was a problem hiding this comment.
How do you expect this to be called in distributed cases?
There was a problem hiding this comment.
we need to discuss distributed case, how it would get called, but implementation is below in dist_disk.py
| file_path = cls.index[offloaded]["safetensors_file"] | ||
|
|
||
| # Only delete files in the specified dir (if provided) | ||
| if offload_dir is not None and not file_path.startswith(offload_dir): |
There was a problem hiding this comment.
Technically Path.resolve will resolve symlinks right? So maybe the startswith would fail here?
There was a problem hiding this comment.
i can remove this if we're fine to just use cls.offload_dir
There was a problem hiding this comment.
This should also be an assert
| If None, clean all files in the shared index. | ||
| :return: Number of files cleaned up | ||
| """ | ||
| if offload_dir is not None: |
There was a problem hiding this comment.
Why provide offload_dir and test for nullability? It's guaranteed at this point that cls.offload_dir is populated, right?
There was a problem hiding this comment.
I can remove this, was following the convention used in create_checkpoint_symlink which provides optional offload_dir. same question arises there
There was a problem hiding this comment.
I think this is fine, but we should raise major warnings/errors if the caller provided offload_dir ever conflicts with cls.offload_dir
There was a problem hiding this comment.
Looks good, my only points are
- There are 3 cases where "if" statements should be assertions. If any of these assumptions are broken, then something has seriously gone wrong
os.unlink == os.remove?- I think we should remove
cls.index, since it no longer contains valid data. You might consider renaming toDiskCache.clear_index() - I think you do need to think more about when this is going to be called. As I see it, there are two cases for when this could be called, and both conflict with this implementation
Case A
If we delete files before the model is cleaned up, then we've created a model which will error if any parameters are accessed, which is a bad state to be in. Assuming no user shenanigans this should be fine if we make it clear to users that the model is invalid after this and will error if used.
Case B
If we delete files during model cleanup, then we'll want to delete incrementally, so the implementation will look very different than this. One solution would be to add a finalizer to all instances of DiskCache which attempts file deletion on garbage collection. This is the safest but not what's implemented.
Case C
If we delete files after model cleanup, this is safe, but requires the user to del their model, then call this, which like case A needs to made clear to the user.
| continue | ||
|
|
||
| # Only delete files we created (with our prefix) | ||
| if os.path.basename(file_path).startswith(cls._new_file_prefix): |
There was a problem hiding this comment.
This should be an assert, not an if
| file_path = cls.index[offloaded]["safetensors_file"] | ||
|
|
||
| # Only delete files in the specified dir (if provided) | ||
| if offload_dir is not None and not file_path.startswith(offload_dir): |
There was a problem hiding this comment.
This should also be an assert
| # Only delete files we created (with our prefix) | ||
| if os.path.basename(file_path).startswith(cls._new_file_prefix): | ||
| try: | ||
| # os.unlink removes symlink itself, not target |
There was a problem hiding this comment.
I don't think I understand this point? What does it mean to remove the symlink itself? os.remove and os.unlink are exactly equivalent from my testing and the documentation.
| If None, clean all files in the shared index. | ||
| :return: Number of files cleaned up | ||
| """ | ||
| if offload_dir is not None: |
There was a problem hiding this comment.
I think this is fine, but we should raise major warnings/errors if the caller provided offload_dir ever conflicts with cls.offload_dir
| } | ||
|
|
||
| @classmethod | ||
| def clean_offload_dir(cls, offload_dir: str | os.PathLike | None = None) -> int: |
There was a problem hiding this comment.
Makes sense, #666 will help handle the case where two sessions share the same folder
|
This pull request has merge conflicts that must be resolved before it can be |
Corequisite:
The DiskCache creates several hundreds of intermediate files in user-provided
offload_dirduring normal operation with large models. These are either symlinks or intermediate tensors that are not necessary/valuable to keep around after successfully running an entrypoint.PR adds helpers to clean any intermediate files that were added during normal operation.