[ENHANCEMENT] When load_ckpt is called and the loaded iteration count already equals args.train_iters, the train_step loop is skipped entirely, and the subsequent save_checkpoint call may encounter an error #1310

Open
bphwk opened this issue Dec 5, 2024 · 0 comments

bphwk commented Dec 5, 2024

Is your feature request related to a problem? Please describe.
When load_ckpt is called and the loaded iteration count already equals args.train_iters, the train_step loop is skipped entirely. If, at that point, the condition if args.save and iteration != 0 and iteration % args.save_interval != 0: is entered, save_checkpoint may encounter an error because the optimizer-related parameters are absent, causing the training task to fail. With Torch Elastic mode and asynchronous checkpointing (async ckpt) enabled, this can lead to the training job being resumed indefinitely.
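
For context, here is a minimal, self-contained sketch of that control flow. The Args fields, the stub bodies, and the concrete numbers are illustrative assumptions; only the save condition is taken from the report above, and this is not the actual Megatron-LM code:

# Hypothetical reproduction of the control flow described above.
class Args:
    train_iters = 1000
    save = "/tmp/ckpt"      # a save path is configured
    save_interval = 300     # train_iters is not a multiple of save_interval

args = Args()

def load_checkpoint():
    # The checkpoint was written at the end of a finished run, so the
    # stored iteration already equals args.train_iters.
    return args.train_iters

def save_checkpoint(iteration):
    # Stand-in for the real save: in the failing case the optimizer-related
    # parameters were never (re)built, so the real call errors out.
    raise RuntimeError("optimizer state missing, checkpoint save fails")

iteration = load_checkpoint()

# The loop body never runs, so no train_step() is ever executed.
while iteration < args.train_iters:
    iteration += 1

# The post-training save condition from the report is entered
# (1000 % 300 != 0), and save_checkpoint fails.
if args.save and iteration != 0 and iteration % args.save_interval != 0:
    save_checkpoint(iteration)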

Describe alternatives you've considered
Exit the process when iteration == args.train_iters, e.g. by adding a check before the training loop:

import sys

# Proposed check before the training loop: if the loaded iteration already
# equals args.train_iters, exit cleanly instead of falling through to the
# post-loop save_checkpoint call.
if iteration == args.train_iters:
    sys.exit(0)

while iteration < args.train_iters:
    if args.profile and torch.distributed.get_rank() in args.profile_ranks:
        if args.use_pytorch_profiler:
            prof.step()
        elif iteration == args.profile_step_start:
            torch.cuda.cudart().cudaProfilerStart()
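
Assuming the launcher treats exit status 0 as normal completion, this should also keep Torch Elastic from repeatedly restarting the job in the async-checkpoint scenario described above.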