
Embeddings are initialized with std of 0.02 #18

Open
eryk-mazus opened this issue Jun 12, 2024 · 2 comments

@eryk-mazus

I noticed that in the following snippet, the std of nn.Embedding is set to 0.02:

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            std = 0.02
            if hasattr(module, 'NANOGPT_SCALE_INIT'):
                std *= (2 * self.config.n_layer) ** -0.5
            torch.nn.init.normal_(module.weight, mean=0.0, std=std)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

The official implementation sets it to 0.01, as noted in the video. It only matters for the positional embeddings, due to the weight sharing scheme between wte and lm_head.
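
For reference, a minimal self-contained sketch of the 0.01/0.02 split (the TinyGPT class, its sizes, and the wte/wpe attribute names below are illustrative assumptions, not the repo's actual code):

    import torch
    import torch.nn as nn

    class TinyGPT(nn.Module):
        # illustrative stand-in for the real model, just to show the split init
        def __init__(self, vocab_size=1000, block_size=64, n_embd=128):
            super().__init__()
            self.wte = nn.Embedding(vocab_size, n_embd)   # token embeddings
            self.wpe = nn.Embedding(block_size, n_embd)   # positional embeddings
            self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
            self.lm_head.weight = self.wte.weight         # weight sharing, as in GPT-2
            self.apply(self._init_weights)

        def _init_weights(self, module):
            if isinstance(module, nn.Linear):
                torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
                if module.bias is not None:
                    torch.nn.init.zeros_(module.bias)
            elif isinstance(module, nn.Embedding):
                # 0.01 only for the positional table; the tied wte/lm_head stays at 0.02
                std = 0.01 if module is self.wpe else 0.02
                torch.nn.init.normal_(module.weight, mean=0.0, std=std)

    m = TinyGPT()
    print(m.wte.weight.std().item(), m.wpe.weight.std().item())  # ~0.02, ~0.01

Because of the weight sharing, whichever of wte / lm_head gets initialized last determines the shared tensor, so in practice only wpe can carry a different std.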

@peter-ni-noob

0.02 is OK according to the Megatron-LM training code.
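
If I remember right, Megatron-LM's --init-method-std also defaults to 0.02 and is used for both the word and position embeddings, with a 1/sqrt(2*num_layers) scaling for output projections. A paraphrased sketch of those helpers (not the exact Megatron source):

    import math
    import torch

    def init_method_normal(sigma=0.02):
        # plain N(0, sigma) init, used for embeddings and most weights
        def init_(tensor):
            return torch.nn.init.normal_(tensor, mean=0.0, std=sigma)
        return init_

    def scaled_init_method_normal(sigma=0.02, num_layers=12):
        # N(0, sigma / sqrt(2 * num_layers)), the same scaling nanoGPT
        # applies via the NANOGPT_SCALE_INIT flag above
        std = sigma / math.sqrt(2.0 * num_layers)
        def init_(tensor):
            return torch.nn.init.normal_(tensor, mean=0.0, std=std)
        return init_

So 0.02 matches the Megatron convention for embeddings, even though it doesn't reproduce GPT-2's 0.01 for the positional table.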

@melqtx

melqtx commented Jun 13, 2024

> The official implementation sets it to 0.01 as noted in the video. It only matters for the positional embeddings, due to the weight sharing scheme between wte and lm_head.

It doesn't matter much; 0.02 kind of works.
