DeepSpeed has several ways to call `zero_grad()`, but they behave inconsistently:

- ZeRO-1/2 optimizer's `zero_grad`: Clears `.grad` and `.grad_acc`.
- ZeRO-3 optimizer's `zero_grad`: Clears `.grad` and resets `micro_step_id`. This affects whether gradients are overwritten or accumulated after reduce, and it also causes a mismatch with the engine's `micro_steps`.
- Engine's `zero_grad`: Clears `.grad` only (it does not call the optimizer's `zero_grad`), but the optimizer's `zero_grad` is called after `step()`.

Another source of confusion is that `zero_grad` does not consider the gradient accumulation boundary, while `backward` and `step` do. Users naturally expect code like the sketch below to work, but these inconsistent behaviors can cause unexpected results, as noted in the comments.
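A minimal sketch of such a loop (the PR's original example is not reproduced here; `model_engine` and `data_loader` are placeholder names, and a config with `gradient_accumulation_steps > 1` is assumed):

```python
# Placeholder sketch: a DeepSpeed engine configured with
# gradient_accumulation_steps > 1; `model_engine` and `data_loader`
# are assumed names.
for batch in data_loader:
    loss = model_engine(batch)
    model_engine.backward(loss)   # gradients accumulate until the boundary
    model_engine.step()           # weights update only at the boundary
    model_engine.zero_grad()      # expected to be safe every micro-step, but
                                  # depending on which zero_grad runs and the
                                  # zero stage, it may clear gradients that
                                  # should still be accumulating
```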
This PR aims to improve the behavior of the optimizers:

- `zero_grad` clears gradients only at a gradient accumulation boundary; `force` can be used to clear gradients regardless of the boundary (see the sketch below).
- `optimizer.zero_grad` and `engine.zero_grad` have the same effect with any zero stage.
- `micro_step_id` of the Z3 optimizer is made consistent with the engine's `micro_steps`.
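A sketch of the intended behavior after this PR, assuming the `force` option mentioned above (the exact signature is an assumption):

```python
# Placeholder sketch of the semantics this PR targets; `model_engine` is an
# assumed name and `force=True` follows the option described above.
for batch in data_loader:
    loss = model_engine(batch)
    model_engine.backward(loss)
    model_engine.step()

    # Clears gradients only at a gradient accumulation boundary; otherwise a
    # no-op, with the same effect for engine.zero_grad and optimizer.zero_grad
    # at any zero stage.
    model_engine.zero_grad()

# Clear gradients unconditionally, regardless of the boundary.
model_engine.zero_grad(force=True)
```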
(This PR depends on #6550)