[ENHANCEMENT] When load_ckpt is called and the obtained iteration count equals args.train_iters, train_step is skipped entirely and save_checkpoint may then encounter an error
#1310
Is your feature request related to a problem? Please describe.
When load_ckpt is called and the obtained iteration count equals args.train_iters, the train_step loop is skipped entirely. If the condition `if args.save and iteration != 0 and iteration % args.save_interval != 0:` is then entered, save_checkpoint may fail because optimizer-related parameters are absent, causing the training task to fail. In Torch Elastic mode with asynchronous checkpointing (async ckpt) enabled, this can result in the training job being resumed indefinitely.
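To make the failure mode concrete, here is a minimal, self-contained simulation of the control flow described above. The names (`train_iters`, `save_interval`) mirror the Megatron-LM args, but the logic is a simplified stand-in for the real training loop, not the actual code:

```python
def final_save_would_fail(start_iteration, train_iters, save_interval):
    """Return True if the final save_checkpoint call would run without any
    train_step having populated optimizer state (the reported failure)."""
    optimizer_state_ready = False
    iteration = start_iteration
    while iteration < train_iters:
        iteration += 1                 # stands in for train_step()
        optimizer_state_ready = True   # train_step populates optimizer state
    # Mirrors: if args.save and iteration != 0 and iteration % args.save_interval != 0:
    wants_final_save = iteration != 0 and iteration % save_interval != 0
    return wants_final_save and not optimizer_state_ready

# Resuming exactly at train_iters skips every train_step, so the final
# save_checkpoint runs with no optimizer state present.
print(final_save_would_fail(1000, 1000, 300))  # True  (failure scenario)
# A run that executes at least one step initializes the optimizer state.
print(final_save_would_fail(999, 1000, 300))   # False
```

Under Torch Elastic, the failed save crashes the job, the job is restarted, the checkpoint at `train_iters` is loaded again, and the same path is taken, which is the infinite-resumption loop described above.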
Describe alternatives you've considered
Exit the process when `iteration == args.train_iters`, before entering the training loop:

```python
if iteration == args.train_iters:
    sys.exit(0)
while iteration < args.train_iters:
    if args.profile and torch.distributed.get_rank() in args.profile_ranks:
        if args.use_pytorch_profiler:
            prof.step()
        elif iteration == args.profile_step_start:
            torch.cuda.cudart().cudaProfilerStart()
```