Skip to content

[offload] DiskCache.clean_offload_dir#678

Draft
brian-dellabetta wants to merge 1 commit into
mainfrom
bdellabe/offload-dir-prune
Draft

[offload] DiskCache.clean_offload_dir#678
brian-dellabetta wants to merge 1 commit into
mainfrom
bdellabe/offload-dir-prune

Conversation

@brian-dellabetta

@brian-dellabetta brian-dellabetta commented Apr 10, 2026

Copy link
Copy Markdown
Collaborator

Corequisite:

The DiskCache creates several hundreds of intermediate files in user-provided offload_dir during normal operation with large models. These are either symlinks or intermediate tensors that are not necessary/valuable to keep around after successfully running an entrypoint.

PR adds helpers to clean any intermediate files that were added during normal operation.

Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
@coderabbitai

coderabbitai Bot commented Apr 10, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4dc0a2c0-e1ad-4215-8577-1a3ba798d4ae

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch bdellabe/offload-dir-prune

Comment @coderabbitai help to get the list of available commands and usage tips.

}

@classmethod
def clean_offload_dir(cls, offload_dir: str | os.PathLike | None = None) -> int:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we ever expect there to be files in the offload dir that aren't created by CT? Or files created by CT which are not in the offload dir? Why not just delete the whole directly?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with a design goal of scoping to only files created in this session, as the user might be running multiple sessions or jobs and clearing out a user's offload_dir entirely seems like a bad side-effect / UX

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, #666 will help handle the case where two sessions share the same folder

Comment on lines +209 to +210
except FileNotFoundError:
pass # Already deleted

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would this case occur?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if a file is deleted by some other means? seems like a place where it's worth a few lines to fool-proof it

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using os.path.exists instead of try; catch

}

@classmethod
def clean_offload_dir(cls, offload_dir: str | os.PathLike | None = None) -> int:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you expect this to be called in distributed cases?

@brian-dellabetta brian-dellabetta Apr 13, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to discuss distributed case, how it would get called, but implementation is below in dist_disk.py

file_path = cls.index[offloaded]["safetensors_file"]

# Only delete files in the specified dir (if provided)
if offload_dir is not None and not file_path.startswith(offload_dir):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically Path.resolve will resolve symlinks right? So maybe the startswith would fail here?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i can remove this if we're fine to just use cls.offload_dir

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also be an assert

If None, clean all files in the shared index.
:return: Number of files cleaned up
"""
if offload_dir is not None:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why provide offload_dir and test for nullability? It's guaranteed at this point that cls.offload_dir is populated, right?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can remove this, was following the convention used in create_checkpoint_symlink which provides optional offload_dir. same question arises there

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine, but we should raise major warnings/errors if the caller provided offload_dir ever conflicts with cls.offload_dir

@kylesayrs kylesayrs left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, my only points are

  1. There are 3 cases where "if" statements should be assertions. If any of these assumptions are broken, then something has seriously gone wrong
  2. os.unlink == os.remove?
  3. I think we should remove cls.index, since it no longer contains valid data. You might consider renaming to DiskCache.clear_index()
  4. I think you do need to think more about when this is going to be called. As I see it, there are two cases for when this could be called, and both conflict with this implementation

Case A

If we delete files before the model is cleaned up, then we've created a model which will error if any parameters are accessed, which is a bad state to be in. Assuming no user shenanigans this should be fine if we make it clear to users that the model is invalid after this and will error if used.

Case B

If we delete files during model cleanup, then we'll want to delete incrementally, so the implementation will look very different than this. One solution would be to add a finalizer to all instances of DiskCache which attempts file deletion on garbage collection. This is the safest but not what's implemented.

Case C

If we delete files after model cleanup, this is safe, but requires the user to del their model, then call this, which like case A needs to made clear to the user.

continue

# Only delete files we created (with our prefix)
if os.path.basename(file_path).startswith(cls._new_file_prefix):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be an assert, not an if

file_path = cls.index[offloaded]["safetensors_file"]

# Only delete files in the specified dir (if provided)
if offload_dir is not None and not file_path.startswith(offload_dir):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also be an assert

# Only delete files we created (with our prefix)
if os.path.basename(file_path).startswith(cls._new_file_prefix):
try:
# os.unlink removes symlink itself, not target

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I understand this point? What does it mean to remove the symlink itself? os.remove and os.unlink are exactly equivalent from my testing and the documentation.

If None, clean all files in the shared index.
:return: Number of files cleaned up
"""
if offload_dir is not None:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine, but we should raise major warnings/errors if the caller provided offload_dir ever conflicts with cls.offload_dir

}

@classmethod
def clean_offload_dir(cls, offload_dir: str | os.PathLike | None = None) -> int:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, #666 will help handle the case where two sessions share the same folder

@mergify

mergify Bot commented Jun 22, 2026

Copy link
Copy Markdown

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @brian-dellabetta.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants