[ENHANCEMENT] When load_ckpt is called and the loaded iteration count already equals args.train_iters, the train_step loop is skipped entirely, and the subsequent save_checkpoint call may encounter an error #1310

Open
bphwk opened this issue Dec 5, 2024 · 0 comments

bphwk commented Dec 5, 2024

Is your feature request related to a problem? Please describe.
When load_ckpt is called and the loaded iteration count already equals args.train_iters, the train_step loop is skipped entirely. If, at that point, the condition if args.save and iteration != 0 and iteration % args.save_interval != 0: is entered, save_checkpoint may encounter an error because the optimizer-related parameters are absent, causing the training task to fail. With Torch Elastic mode and asynchronous checkpointing (async ckpt) enabled, this can lead to the training job being resumed indefinitely.
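
For context, here is a minimal, self-contained sketch of that control flow. The Args fields, the stub bodies, and the concrete numbers are illustrative assumptions; only the save condition is taken from the report above, and this is not the actual Megatron-LM code:

# Hypothetical reproduction of the control flow described above.
class Args:
    train_iters = 1000
    save = "/tmp/ckpt"      # a save path is configured
    save_interval = 300     # train_iters is not a multiple of save_interval

args = Args()

def load_checkpoint():
    # The checkpoint was written at the end of a finished run, so the
    # stored iteration already equals args.train_iters.
    return args.train_iters

def save_checkpoint(iteration):
    # Stand-in for the real save: in the failing case the optimizer-related
    # parameters were never (re)built, so the real call errors out.
    raise RuntimeError("optimizer state missing, checkpoint save fails")

iteration = load_checkpoint()

# The loop body never runs, so no train_step() is ever executed.
while iteration < args.train_iters:
    iteration += 1

# The post-training save condition from the report is entered
# (1000 % 300 != 0), and save_checkpoint fails.
if args.save and iteration != 0 and iteration % args.save_interval != 0:
    save_checkpoint(iteration)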

Describe alternatives you've considered
Exit the process when iteration == args.train_iters, e.g. by adding a check before the training loop:

import sys

# Proposed check before the training loop: if the loaded iteration already
# equals args.train_iters, exit cleanly instead of falling through to the
# post-loop save_checkpoint call.
if iteration == args.train_iters:
    sys.exit(0)

while iteration < args.train_iters:
    if args.profile and torch.distributed.get_rank() in args.profile_ranks:
        if args.use_pytorch_profiler:
            prof.step()
        elif iteration == args.profile_step_start:
            torch.cuda.cudart().cudaProfilerStart()
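
Assuming the launcher treats exit status 0 as normal completion, this should also keep Torch Elastic from repeatedly restarting the job in the async-checkpoint scenario described above.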