Implement gradient clipping #286
Conversation
Grad clipping by norm is not the only way to clip grads; clipping by absolute value is also valid. Can you update this PR to enable both? Also, the grad clipping should be configured outside the optimizer: clipping happens upstream of the optimizer, and the optimizer has no knowledge of it. Can you move the grad clipping config to a top-level section?
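For reference, a minimal sketch of what clipping upstream of the optimizer could look like, using PyTorch's built-in utilities. The config field names (`grad_clip_norm`, `grad_clip_value`) are hypothetical, not torchprime's actual schema:

```python
import torch

def clip_gradients(model: torch.nn.Module, config) -> None:
    """Apply gradient clipping before optimizer.step(); the optimizer never sees this config."""
    # `config.grad_clip_norm` / `config.grad_clip_value` are hypothetical top-level fields.
    if config.grad_clip_norm is not None:
        # Scale all gradients so their global L2 norm is at most `grad_clip_norm`.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=config.grad_clip_norm)
    if config.grad_clip_value is not None:
        # Clamp each gradient element into [-grad_clip_value, +grad_clip_value].
        torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=config.grad_clip_value)

# Usage in the training step, upstream of the optimizer:
#   loss.backward()
#   clip_gradients(model, config)
#   optimizer.step()
```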
@yaoshiang done. ptal
86636b3 to 74c583c
Looking good! I realize you are pushing hard to finish this off, but if you have cycles, please add a unit test of some kind. A numerical test would be ideal but might take a bit of time: use a single layer with fixed numbers, make the loss function MAE so you know the exact gradients, apply your clip, and check that you get what you expect. If you feed those instructions to an LLM it'll probably do a decent job. Consider it optional, though, since you have numerical proof against the HF curves that the norm version, at least, is working. Thanks!
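A sketch of the kind of numerical test described above, assuming PyTorch's built-in clipping utilities are what the PR calls (exact helper names in torchprime may differ):

```python
import torch

def test_grad_clipping_numerically():
    """Single linear layer, fixed numbers, MAE loss -> exact gradients, then clip and check."""
    # y = w * x with w = 2.0 and no bias.
    layer = torch.nn.Linear(1, 1, bias=False)
    with torch.no_grad():
        layer.weight.fill_(2.0)

    x = torch.tensor([[3.0]])
    target = torch.tensor([[0.0]])

    # MAE loss: |w*x - target| = |6 - 0| = 6, so dL/dw = sign(6) * x = 3.
    loss = torch.nn.functional.l1_loss(layer(x), target)
    loss.backward()
    assert torch.allclose(layer.weight.grad, torch.tensor([[3.0]]))

    # Clip by global norm: ||g|| = 3 and max_norm = 1, so the grad is scaled to exactly 1.
    torch.nn.utils.clip_grad_norm_(layer.parameters(), max_norm=1.0)
    assert torch.allclose(layer.weight.grad, torch.tensor([[1.0]]))

    # Clip by absolute value: a fresh grad of 3 clamped at 2.5 becomes 2.5.
    layer.weight.grad = torch.tensor([[3.0]])
    torch.nn.utils.clip_grad_value_(layer.parameters(), clip_value=2.5)
    assert torch.allclose(layer.weight.grad, torch.tensor([[2.5]]))
```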
Fixes #90
The Hugging Face trainer clips gradients by norm by default. It turns out this makes a big difference in training stability, and simply adding gradient clipping brings torchprime into parity with Hugging Face.
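For context, the Hugging Face `TrainingArguments` default is `max_grad_norm = 1.0`, applied between backward and the optimizer update. The loop below is only a generic sketch of that placement, not torchprime's actual trainer code:

```python
import torch

def train_step(model, inputs, targets, loss_fn, optimizer, max_grad_norm: float = 1.0):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    # Clip after backward, before the optimizer update (mirrors HF's default behavior).
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_grad_norm)
    optimizer.step()
    return loss
```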
Experiment: http://tb/share/yuBUPRF6KqZKPhwJkXNvc
Tested: