Concrete weight decay configuration for GPT-2 pretraining #40
Comments
The figure above compares Adam and Adan on GPT-2 345M pre-trained on the OpenWebText dataset. As you mentioned, you may consider referring to the config in #32; there is no need to tune beta1 and beta2, and using the default values is okay. The most sensitive hyperparams are the lr and wd: you can choose wd from [0.02, 0.05, 0.1], choose beta3 from [0.95, 0.999], and use a larger lr and warmup fraction for Adan. We follow this rule to tune the parameters for the 7B and even 65B models. If you still get an inferior result, I can try to reproduce your experiment on my side.
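To make that tuning rule concrete, here is a minimal sketch of the sweep it implies. The variable names and the baseline lr/warmup numbers are illustrative placeholders, not values taken from the repo's configs:

```python
from itertools import product

# Sweep grid implied by the rule above (illustrative, not from the repo's configs):
# wd and lr are the most sensitive knobs, beta3 has two candidate values, and
# beta1/beta2 stay at the optimizer defaults.
weight_decays = [0.02, 0.05, 0.1]
beta3s = [0.95, 0.999]

base_lr, base_warmup_frac = 6e-4, 0.01   # placeholder AdamW-baseline settings
adan_lr = 2 * base_lr                    # "larger lr ... for Adan"; the factor is a guess
adan_warmup_frac = 2 * base_warmup_frac  # likewise for the warmup fraction

for wd, beta3 in product(weight_decays, beta3s):
    print(f"run: lr={adan_lr:g}, warmup_frac={adan_warmup_frac:g}, wd={wd}, beta3={beta3}")
```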
Dear authors:

According to the README.md of this amazing project, the `weight_decay` param should be `0.02`, while in the configuration file attached in #32, the `WD` seems to be `0.05`. Also, only `beta3` is explicitly specified in the aforementioned configuration file; the remaining hyperparams I can only infer from https://github.com/sail-sg/Adan/blob/main/gpt2/README.md. However, `weight_decay=0.02` together with the other hyperparams above yields an inferior val loss curve compared with [that of the AdamW baseline](https://github.com/karpathy/nanoGPT/blob/master/config/train_gpt2.py). Thus, do you have any suggestions about the hyperparams I mentioned? Thanks!
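For reference, this is roughly how the hyperparams in question would be wired into the optimizer. It is a minimal sketch assuming the `Adan` class from this repo accepts a three-element `betas` tuple; the lr and the beta1/beta2 values are placeholders, since only beta3 is spelled out in the config from #32:

```python
import torch.nn as nn
from adan import Adan  # the optimizer provided by this repo

model = nn.Linear(768, 768)  # stand-in for the GPT-2 345M model

# weight_decay=0.02 is the README value that yields the inferior val loss curve;
# beta3=0.95 is one of the suggested candidates; beta1/beta2 and lr are guesses.
optimizer = Adan(
    model.parameters(),
    lr=1e-3,
    betas=(0.98, 0.92, 0.95),
    weight_decay=0.02,
)
```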