Question regarding overfitting a single batch and the weight-tying scheme #61
-
Question regarding overfitting a single batch, at 00:56:42 in the video. When I overfit a single batch without the "weight tying" between the token embedding tensor and the language model head, I can achieve a really low loss (just like in the video). However, with the "weight tying", the loss seems to bottom out around 1. Why does this happen? In other words, why does the weight tying seem to limit the model's ability to overfit a single batch?
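For reference, by "weight tying" I mean pointing the lm_head at the embedding's weight tensor, roughly like this (a minimal sketch with no transformer blocks and illustrative sizes, not my actual code or the code from the video):

```python
import torch.nn as nn

class TinyTiedLM(nn.Module):
    # Toy model with GPT-2-style weight tying: the token embedding and the
    # language-model head share a single (vocab_size, n_embd) tensor.
    def __init__(self, vocab_size=50257, n_embd=768):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight   # weight tying: one shared Parameter

    def forward(self, idx):
        # a real model puts transformer blocks between these two steps
        return self.lm_head(self.wte(idx))      # (B, T) -> (B, T, vocab_size)

model = TinyTiedLM()
assert model.lm_head.weight is model.wte.weight  # same tensor: gradients from both paths accumulate on it
```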
Replies: 1 comment 1 reply
-
I believe weight sharing is a form of regularization, and for that reason it reduces overfitting. Weight sharing reduces the number of effective learnable parameters (which reduces overfitting), and it also imposes constraints on the network weights, which is what regularization methods do to prevent overfitting. In this case the same matrix has to serve as both the input embedding and the output projection, so it cannot memorize the batch as freely as two independent matrices can.
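A rough way to see the "fewer effective parameters" point: build the same toy model untied and tied, count the unique parameters, and try to memorize one fixed batch. This is a minimal sketch with illustrative sizes and a plain embedding-to-head model (no transformer blocks), not the code from the video:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_model(vocab_size, n_embd, tie):
    wte = nn.Embedding(vocab_size, n_embd)
    lm_head = nn.Linear(n_embd, vocab_size, bias=False)
    if tie:
        lm_head.weight = wte.weight          # one shared Parameter for both roles
    return nn.Sequential(wte, lm_head)       # token ids -> embeddings -> logits

def overfit_one_batch(tie, vocab_size=1000, n_embd=64, steps=1000):
    torch.manual_seed(0)
    model = make_model(vocab_size, n_embd, tie)
    n_params = sum(p.numel() for p in model.parameters())  # shared tensors are counted once
    opt = torch.optim.AdamW(model.parameters(), lr=1e-2)
    x = torch.randint(0, vocab_size, (4, 32))               # one fixed batch
    y = torch.randint(0, vocab_size, (4, 32))
    for _ in range(steps):
        loss = F.cross_entropy(model(x).flatten(0, 1), y.flatten())
        opt.zero_grad(); loss.backward(); opt.step()
    print(f"tied={tie}  unique params={n_params:,}  final loss={loss.item():.3f}")

overfit_one_batch(tie=False)  # untied head adds vocab_size*n_embd extra free weights to memorize with
overfit_one_batch(tie=True)   # tied: one tensor must serve both roles, so the fit is more constrained
```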