Description
The `DeepIce` model contains a method called `no_weight_decay()`, which is intended to specify that the `cls_token` parameter should not be subject to weight decay during training:
```python
from typing import Set

@torch.jit.ignore
def no_weight_decay(self) -> Set:
    """cls_token should not be subject to weight decay during training."""
    return {"cls_token"}
```
However, the training code never builds `optimizer_grouped_parameters`, so this method is never consulted and has no effect: `cls_token` receives the same weight decay as every other parameter.
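For context, here is a minimal sketch of how such a marker is conventionally consumed when constructing AdamW parameter groups. This is not the repository's actual training code; `build_param_groups` and its arguments are illustrative names:

```python
import torch

def build_param_groups(model: torch.nn.Module, weight_decay: float):
    # Collect parameter names the model marks as exempt, if any.
    skip = set()
    if hasattr(model, "no_weight_decay"):
        skip = set(model.no_weight_decay())

    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Exempt marked parameters (e.g. cls_token) from weight decay;
        # match either the full dotted name or the attribute name.
        if name in skip or name.split(".")[-1] in skip:
            no_decay.append(param)
        else:
            decay.append(param)

    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# Usage:
# optimizer = torch.optim.AdamW(build_param_groups(model, 0.05), lr=1e-3)
```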
I believe that in the original 2nd-place solution code, FastAI's wrapper around AdamW handled this grouping automatically.