[BUG] any clue for MFU drop? #6727
Comments
The wall-clock breakdown log looks fine, and the DeepSpeed engine didn't warn about any potential memory allocation issue in this run.
It happens every 100th step. Is there nothing in your code that runs every 100th step?
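If nothing in the training script obviously runs every 100 steps, a scheduled profiler capture around one of the slow iterations can show what does. A minimal sketch with `torch.profiler`, assuming a plain training loop (`train_dataloader`, `model`, and `optimizer` are placeholders, and the step counts are illustrative; with the HF Trainer you would drive `prof.step()` from a callback instead):

```python
import torch
from torch.profiler import profile, schedule, ProfilerActivity, tensorboard_trace_handler

# Skip the first 97 steps, warm up for 1, then record 4 steps (~99-102),
# so the trace covers one of the slow iterations.
prof = profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=97, warmup=1, active=4, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./prof_traces"),
    with_stack=True,
)

prof.start()
for step, batch in enumerate(train_dataloader):  # your existing training loop
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    prof.step()  # advance the profiler schedule once per training step
prof.stop()
```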
Hi! Thank you for your response.
It may be garbage collection. The reason I am skeptical is that GC wouldn't be that periodic in nature; every 100th step seems intentional.
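One way to confirm or rule out GC is to hook `gc.callbacks` and log every collection pass against the current step. A minimal sketch, assuming a `CURRENT_STEP` counter (hypothetical) is kept up to date from the training loop or a step callback:

```python
import gc
import time

CURRENT_STEP = 0  # hypothetical: update this from the training loop / a step callback

def gc_probe(phase, info):
    # Called by CPython at the start and end of every collection pass.
    if phase == "start":
        gc_probe.t0 = time.perf_counter()
    else:  # phase == "stop"
        dur_ms = (time.perf_counter() - gc_probe.t0) * 1000
        print(f"[gc] step={CURRENT_STEP} gen={info['generation']} "
              f"collected={info['collected']} took={dur_ms:.1f} ms")

gc.callbacks.append(gc_probe)
```

If the full (gen-2) collections line up with the slow iterations, GC is the culprit; if not, the 100-step periodicity comes from somewhere else.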
The y-axis is MFU and the x-axis is the training step.
I'm testing Qwen 72B with the Hugging Face Trainer, and whenever I train a 72B-scale model with ZeRO-3 offload, I encounter a periodic performance drop.
I'm sorry that I can't share the training code, but it always happens, even when I reduce the batch tokens from 8k to 6k (B*T = valid batch tokens, where B and T vary).
(For some reason I'm testing on 40GB A100 GPUs, and this is a 12-node training result.)
Is there any logic that manages or flushes memory periodically? (I couldn't find any clue; see the step-timing sketch below, which I'm using to narrow this down.)
Of course, I never evaluate the model or save a checkpoint in this range (the figure above does not even cover 1 epoch).
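The step-timing sketch mentioned above: a minimal `transformers.TrainerCallback` that prints the wall time and Python GC counters for unusually slow steps, so the dips in the plot can be matched to concrete step numbers (the 2-second threshold and the class name are illustrative):

```python
import gc
import time

from transformers import TrainerCallback

class StepTimer(TrainerCallback):
    """Print wall time and Python GC counters for unusually slow steps."""

    def on_step_begin(self, args, state, control, **kwargs):
        self._t0 = time.perf_counter()

    def on_step_end(self, args, state, control, **kwargs):
        dt = time.perf_counter() - self._t0
        if dt > 2.0:  # arbitrary threshold; set it just above your normal step time
            print(f"step {state.global_step}: {dt:.2f}s, gc counts={gc.get_count()}")

# usage: trainer.add_callback(StepTimer())
```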