
Concrete weight decay configuration for GPT-2 pretraining #40

Closed
DesperateExplorer opened this issue Aug 31, 2023 · 1 comment

DesperateExplorer commented Aug 31, 2023

Dear authors:

According to the README.md of this amazing project, the weight_decay param should be 0.02, while in the configuration file attached in #32 the weight decay seems to be 0.05. Also, only beta3 is explicitly specified in that configuration file; from https://github.com/sail-sg/Adan/blob/main/gpt2/README.md I can only infer that

beta1 = 0.98
beta2 = 0.92

However, weight_decay=0.02 together with the other hyperparameters above yields an inferior validation loss curve compared with that of [the AdamW baseline](https://github.com/karpathy/nanoGPT/blob/master/config/train_gpt2.py). Do you have any suggestions about the hyperparameters I mentioned? Thanks!
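For reference, my current setup looks roughly like the sketch below. The Adan import path and constructor arguments are my assumptions based on this repo's README, and the lr and beta3 values are placeholders, not claims about the exact config:

```python
import torch.nn as nn
from adan import Adan  # assumed import path for https://github.com/sail-sg/Adan

model = nn.Linear(768, 768)  # stand-in for the GPT-2 model built by nanoGPT

optimizer = Adan(
    model.parameters(),
    lr=6e-4,                   # placeholder: reusing the AdamW baseline's peak lr
    betas=(0.98, 0.92, 0.99),  # beta1/beta2 from gpt2/README.md; beta3 is a placeholder
    weight_decay=0.02,         # value stated in README.md (the #32 config uses 0.05)
)
```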

XingyuXie (Collaborator) commented

[Attached plot: validation loss comparison between Adam and Adan on GPT-2 345M pretrained on OpenWebText]

The above is a comparison between Adam and Adan on GPT-2 345M pretrained on the OpenWebText dataset. As you mentioned, you may refer to the config in #32; there is no need to tune beta1 and beta2, and using the default values is okay.

The most sensitive hyperparameters are the lr and wd: you can choose wd from [0.02, 0.05, 0.1], choose beta3 from [0.95, 0.999], and use a larger lr and warmup fraction for Adan. We follow this rule when tuning the hyperparameters for the 7B and even 65B models.
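For example, a minimal sketch of such a sweep (pure Python; the config keys mirror nanoGPT-style overrides, and the concrete lr/warmup numbers are placeholders rather than recommended values):

```python
from itertools import product

# Grid suggested above; beta1/beta2 stay at their default values.
weight_decays = [0.02, 0.05, 0.1]
beta3s = [0.95, 0.999]

for wd, beta3 in product(weight_decays, beta3s):
    cfg = dict(
        optimizer="adan",
        weight_decay=wd,
        betas=(0.98, 0.92, beta3),  # only beta3 is swept
        learning_rate=1e-3,         # placeholder: larger than the AdamW baseline lr
        warmup_iters=4000,          # placeholder: a larger warmup fraction, as suggested
    )
    print(cfg)  # each cfg would drive a separate training run
```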

If you still get an inferior result, I can try to reproduce your experiment on my side.
