Question regarding overfitting a single batch and the weight-tying scheme #61
-
Question regarding overfitting a single batch, at 00:56:42 in the video. When I overfit a single batch without the "weight tying" between the token embedding tensor and the language model head, I can achieve a really low loss (just like in the video). However, with the "weight tying", the loss seems to bottom out around 1. Why does this happen? In other words, why does the weight tying seem to limit the model's ability to overfit a single batch?
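For reference, by "weight tying" I mean pointing the lm_head at the embedding's weight tensor, roughly like this (a minimal sketch with no transformer blocks and illustrative sizes, not my actual code or the code from the video):

```python
import torch.nn as nn

class TinyTiedLM(nn.Module):
    # Toy model with GPT-2-style weight tying: the token embedding and the
    # language-model head share a single (vocab_size, n_embd) tensor.
    def __init__(self, vocab_size=50257, n_embd=768):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight   # weight tying: one shared Parameter

    def forward(self, idx):
        # a real model puts transformer blocks between these two steps
        return self.lm_head(self.wte(idx))      # (B, T) -> (B, T, vocab_size)

model = TinyTiedLM()
assert model.lm_head.weight is model.wte.weight  # same tensor: gradients from both paths accumulate on it
```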
Replies: 1 comment 1 reply
-
I believe weight sharing is a form of regularization, and for that reason it reduces overfitting. Weight sharing reduces the number of effective learnable parameters (which reduces overfitting), and it also imposes constraints on the network weights, which is what regularization methods do to prevent overfitting. In this case the same matrix has to serve as both the input embedding and the output projection, so it cannot memorize the batch as freely as two independent matrices can.
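A rough way to see the "fewer effective parameters" point: build the same toy model untied and tied, count the unique parameters, and try to memorize one fixed batch. This is a minimal sketch with illustrative sizes and a plain embedding-to-head model (no transformer blocks), not the code from the video:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_model(vocab_size, n_embd, tie):
    wte = nn.Embedding(vocab_size, n_embd)
    lm_head = nn.Linear(n_embd, vocab_size, bias=False)
    if tie:
        lm_head.weight = wte.weight          # one shared Parameter for both roles
    return nn.Sequential(wte, lm_head)       # token ids -> embeddings -> logits

def overfit_one_batch(tie, vocab_size=1000, n_embd=64, steps=1000):
    torch.manual_seed(0)
    model = make_model(vocab_size, n_embd, tie)
    n_params = sum(p.numel() for p in model.parameters())  # shared tensors are counted once
    opt = torch.optim.AdamW(model.parameters(), lr=1e-2)
    x = torch.randint(0, vocab_size, (4, 32))               # one fixed batch
    y = torch.randint(0, vocab_size, (4, 32))
    for _ in range(steps):
        loss = F.cross_entropy(model(x).flatten(0, 1), y.flatten())
        opt.zero_grad(); loss.backward(); opt.step()
    print(f"tied={tie}  unique params={n_params:,}  final loss={loss.item():.3f}")

overfit_one_batch(tie=False)  # untied head adds vocab_size*n_embd extra free weights to memorize with
overfit_one_batch(tie=True)   # tied: one tensor must serve both roles, so the fit is more constrained
```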