
Embeddings are initialized with std of 0.02 #18

Open
eryk-mazus opened this issue Jun 12, 2024 · 2 comments

@eryk-mazus

I noticed that in the following snippet, the std of nn.Embedding is set to 0.02:

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            std = 0.02
            if hasattr(module, 'NANOGPT_SCALE_INIT'):
                std *= (2 * self.config.n_layer) ** -0.5
            torch.nn.init.normal_(module.weight, mean=0.0, std=std)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

The official implementation sets it to 0.01, as noted in the video. It only matters for the positional embeddings, due to the weight sharing scheme between wte and lm_head.
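
For reference, a minimal self-contained sketch of the 0.01/0.02 split (the TinyGPT class, its sizes, and the wte/wpe attribute names below are illustrative assumptions, not the repo's actual code):

    import torch
    import torch.nn as nn

    class TinyGPT(nn.Module):
        # illustrative stand-in for the real model, just to show the split init
        def __init__(self, vocab_size=1000, block_size=64, n_embd=128):
            super().__init__()
            self.wte = nn.Embedding(vocab_size, n_embd)   # token embeddings
            self.wpe = nn.Embedding(block_size, n_embd)   # positional embeddings
            self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
            self.lm_head.weight = self.wte.weight         # weight sharing, as in GPT-2
            self.apply(self._init_weights)

        def _init_weights(self, module):
            if isinstance(module, nn.Linear):
                torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
                if module.bias is not None:
                    torch.nn.init.zeros_(module.bias)
            elif isinstance(module, nn.Embedding):
                # 0.01 only for the positional table; the tied wte/lm_head stays at 0.02
                std = 0.01 if module is self.wpe else 0.02
                torch.nn.init.normal_(module.weight, mean=0.0, std=std)

    m = TinyGPT()
    print(m.wte.weight.std().item(), m.wpe.weight.std().item())  # ~0.02, ~0.01

Because of the weight sharing, whichever of wte / lm_head gets initialized last determines the shared tensor, so in practice only wpe can carry a different std.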

@peter-ni-noob

0.02 is OK according to the Megatron-LM training code.
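
If I remember right, Megatron-LM's --init-method-std also defaults to 0.02 and is used for both the word and position embeddings, with a 1/sqrt(2*num_layers) scaling for output projections. A paraphrased sketch of those helpers (not the exact Megatron source):

    import math
    import torch

    def init_method_normal(sigma=0.02):
        # plain N(0, sigma) init, used for embeddings and most weights
        def init_(tensor):
            return torch.nn.init.normal_(tensor, mean=0.0, std=sigma)
        return init_

    def scaled_init_method_normal(sigma=0.02, num_layers=12):
        # N(0, sigma / sqrt(2 * num_layers)), the same scaling nanoGPT
        # applies via the NANOGPT_SCALE_INIT flag above
        std = sigma / math.sqrt(2.0 * num_layers)
        def init_(tensor):
            return torch.nn.init.normal_(tensor, mean=0.0, std=std)
        return init_

So 0.02 matches the Megatron convention for embeddings, even though it doesn't reproduce GPT-2's 0.01 for the positional table.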

@melqtx

melqtx commented Jun 13, 2024

> The official implementation sets it to 0.01 as noted in the video. It only matters for the positional embeddings, due to the weight sharing scheme between wte and lm_head.

It doesn't matter much; 0.02 kind of works.
