Use symbolic link for saving best/last checkpoints #5490

Open
ZeroRin opened this issue Apr 23, 2024 · 0 comments

ZeroRin commented Apr 23, 2024

🚀 Feature Request

When writing checkpoint_best.pt and checkpoint_last.pt, create a symbolic link to the saved checkpoint instead of making a full copy of it.

Motivation

I'm running on a machine with poor I/O performance, and writing the same file three times seems extremely inefficient.

It seems there is a plan to implement asynchronous copying, but I believe a symlink would be much more efficient:

if len(checkpoints) > 0 and trainer.should_save_checkpoint_on_current_rank:
    saved_cp = trainer.save_checkpoint(checkpoints[0], extra_state)
    for cp in checkpoints[1:]:
        if cfg.write_checkpoints_asynchronously:
            # TODO[ioPath]: Need to implement a delayed asynchronous
            # file copying/moving feature.
            logger.warning(
                f"ioPath is not copying {checkpoints[0]} to {cp} "
                "since async write mode is on."
            )
        else:
            assert PathManager.copy(
                checkpoints[0], cp, overwrite=True
            ), f"Failed to copy {checkpoints[0]} to {cp}"
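As a rough illustration of the idea, the copy in the else branch could be replaced by something like the helper below. This is only a sketch under assumptions: link_checkpoint is a hypothetical name, not an existing fairseq or PathManager function, and it assumes the checkpoints live on a local POSIX filesystem where os.symlink is available (remote storage backends and Windows without symlink privileges would still need the copy path).

```python
import os
import tempfile


def link_checkpoint(src: str, dst: str) -> None:
    """Hypothetical helper: make dst a symlink to src, replacing any old dst."""
    dst_dir = os.path.dirname(os.path.abspath(dst))
    # A relative target keeps the link valid if the save directory is moved.
    target = os.path.relpath(os.path.abspath(src), start=dst_dir)
    tmp = dst + ".tmp_link"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)
    # Renaming the temporary link over dst swaps it in a single step,
    # so readers never see a missing checkpoint_last.pt.
    os.replace(tmp, dst)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as save_dir:
        saved = os.path.join(save_dir, "checkpoint1.pt")
        with open(saved, "wb") as f:
            f.write(b"dummy weights")
        last = os.path.join(save_dir, "checkpoint_last.pt")
        link_checkpoint(saved, last)
        print(os.readlink(last))
```

One trade-off to keep in mind: a symlink dangles if the file it points to is deleted later, so any cleanup of old checkpoints would have to skip files that checkpoint_best.pt or checkpoint_last.pt still point to.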

Pitch

Alternatives

Additional context
