Replies: 2 comments 3 replies
Interesting, I'm not sure why this would happen, but if you save your model state and optimizer state along the way, maybe you could continue training from the most recent save?
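A minimal PyTorch sketch of that kind of save/resume, assuming a plain model/optimizer training loop; the checkpoint path and dictionary keys are illustrative, not taken from the tutorial's code:

```python
import torch

# Hypothetical path; in practice use whatever naming scheme the training script uses.
CKPT_PATH = "ckpt_latest.pt"

def save_checkpoint(model, optimizer, step, path=CKPT_PATH):
    # Saving the optimizer state matters here: AdamW's moment estimates are part
    # of the training state, not just the model weights.
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
    }, path)

def load_checkpoint(model, optimizer, path=CKPT_PATH):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]  # resume the training loop from this step
```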
1 reply
Do you mind sharing your training config and L2 norm curve?
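For reference, a minimal sketch of one way to log the gradient L2 norm per step in PyTorch to produce such a curve; the helper name `global_grad_l2_norm` is illustrative and not part of the tutorial's code:

```python
import torch

def global_grad_l2_norm(model):
    # Total L2 norm over all parameter gradients; call after loss.backward().
    total_sq = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total_sq += p.grad.detach().float().pow(2).sum().item()
    return total_sq ** 0.5

# Inside the training loop, after the backward pass:
#     norm = global_grad_l2_norm(model)
#     print(f"step {step}: grad L2 norm = {norm:.4f}")
#
# Note: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm) returns the
# same total norm, so logging its return value is an equivalent option if the
# script already clips gradients.
```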
2 replies
Thanks Andrej for your tutorial! It's amazing to learn so many things from just one YouTube video.
I tried to rerun your experiment with 8 GPUs and 4 x 10B tokens, with the dataset shuffled before sharding. The training dynamics look good up until approximately step 60K. Then the gradient norm starts growing and performance degrades quickly. See the attached picture.
I was wondering whether this is normal, and whether it is possible to somehow escape such a point without retraining from scratch?