AnglE loss #2471
Conversation
@tomaarsen Do you have an idea why only the Ubuntu unit tests are failing?
@johneckberg, many thanks for your implementation! Could it be combined with contrastive loss?
@johneckberg I'm not sure, no. The logs are very confusing too; it seems like the runners just die. I'll investigate it more. @SeanLee97 do you mean the regular cosine loss (the CoSENT one) and in-batch negatives? I think it may make the most sense to have AnglELoss as only the angle-optimized objective, and to allow users to mix and match losses themselves to reproduce your final loss function?
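For reference, standalone usage would then look like any other scored-pair loss in ST (a minimal sketch; the model name and data are placeholders, and it assumes AnglELoss takes the same (sentence pair, float score) inputs as CoSENTLoss):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("bert-base-uncased")
train_examples = [
    InputExample(texts=["A plane is taking off.", "An air plane is taking off."], label=0.95),
    InputExample(texts=["A man is playing a flute.", "A man is eating pasta."], label=0.05),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.AnglELoss(model=model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```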
@tomaarsen that's strange, thanks for the clarification! @SeanLee97, there is an open issue #2440 regarding combining losses. Different loss functions in the library require different input formats, so combining contrastive loss/MNR loss with AnglE loss is not possible yet. Please leave a comment if you have any ideas for a solution to this!
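To make the format mismatch concrete (a sketch; the sentences are placeholders): CoSENT/AnglE-style losses expect a pair plus a float similarity score, while MNR loss expects unscored (anchor, positive) pairs.

```python
from sentence_transformers import InputExample

# CoSENTLoss / AnglELoss: (sentence_A, sentence_B) plus a float similarity label
sts_example = InputExample(texts=["A plane is taking off.", "A plane departs."], label=0.95)

# MultipleNegativesRankingLoss: (anchor, positive) with no label;
# every other in-batch sentence acts as a negative
mnr_example = InputExample(texts=["A plane is taking off.", "A plane departs."])
```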
Initial tests seem to indicate that AnglE on its own performs slightly worse than just CoSENT, but still notably better than just CosineSimilarityLoss. Combining AnglE + CoSENT + MNRL is possible, but seems to result in worse performance at small-ish batch sizes (128 or 256) than pure AnglE or CoSENT, seemingly because MNRL is not doing great in my STS experiment. Note that I've only run this on a few (~4) scripts.
Hey @tomaarsen! I also noticed this performance differential between CoSENT and AnglE when performing informal tests. This is somewhat visible in the ablation study in the AnglE paper, which notes a small (0.13%) performance increase when using just CoSENT over just AnglE. How did you combine MNR with AnglE and CoSENT?
I'll have to dive back into the ablation study! I used:

```python
from torch import nn
from sentence_transformers import losses

class FullAngleLoss(nn.Module):
    def __init__(self, model) -> None:
        super().__init__()
        self.angle_loss = losses.AnglELoss(model=model)
        self.cosent_loss = losses.CoSENTLoss(model=model)
        self.ibn = losses.MultipleNegativesRankingLoss(model=model)

    def forward(self, sentence_features, labels):
        # Sum the angle-optimized, CoSENT, and in-batch negatives objectives
        return (
            self.angle_loss(sentence_features, labels)
            + self.cosent_loss(sentence_features, labels)
            + self.ibn(sentence_features, labels)
        )

train_loss = FullAngleLoss(model=model)
```
Just had another look at the ablation study - the findings mirror mine quite closely!
Hi @tomaarsen @johneckberg, thanks for testing! Here are some suggestions from UAE training to achieve good performance:
As for NLI (multinli + snli), we just used the entailment (label 1) and contradiction (label 0) data for training.
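For anyone reproducing that filtering, here is one way to do it with the Hugging Face `datasets` copy of SNLI (a sketch; note the raw SNLI labels are 0 = entailment, 1 = neutral, 2 = contradiction, remapped below to the 1/0 scheme described above):

```python
from datasets import load_dataset
from sentence_transformers import InputExample

snli = load_dataset("snli", split="train")

# Keep entailment (raw label 0) and contradiction (raw label 2), drop neutral;
# remap to 1 = entailment, 0 = contradiction
train_examples = [
    InputExample(texts=[row["premise"], row["hypothesis"]],
                 label=1.0 if row["label"] == 0 else 0.0)
    for row in snli
    if row["label"] in (0, 2)
]
```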
Very useful information! I will try to run some extra experiments.
Thanks for the insight @SeanLee97! @tomaarsen, my understanding is that the ST implementation of MNR loss treats every input pair as a positive pair; is it possible that part of the performance issue on STS comes from negative or neutral input pairs being treated as positive pairs inside MNR loss?
@johneckberg Oh, you're super right. It makes no sense to apply MNRL on the negative or neutral pairs.
@tomaarsen No worries, glad I could be a second set of eyes! I have been thinking about that sort of data formatting problem in relation to issue #2440, and can't think of any solid ways around it. In the AnglE repo, @SeanLee97 combines losses by always conforming to the y_true, y_pred input convention. Each input pair is just the first two entries in y_pred, the third and fourth in y_pred, and so on. Then when using MNR loss, a target matrix gets created to filter the pairs by label. This is a good solution, but wouldn't work for ST.
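Roughly, the filtering idea looks like this (a hypothetical sketch, not the AnglE repo's actual code; the consecutive-pair layout and `select_positive_rows` are assumptions for illustration):

```python
import torch

def select_positive_rows(y_pred: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # y_pred holds embeddings for consecutive pairs: rows 0 and 1 are pair 0,
    # rows 2 and 3 are pair 1, and so on; labels holds one score per pair.
    positive = labels == 1                    # hypothetical: one flag per pair
    row_mask = positive.repeat_interleave(2)  # expand to one flag per embedding row
    return y_pred[row_mask]                   # only positive pairs reach the MNR-style loss
```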
An interesting solution is to keep the losses separate. E.g. in the current ST codebase that would entail 2 dataloaders & 2 losses (one with AnglE + CoSENT and one with MNRL) that fire round-robin style. This might be less performant, though.
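Something close is already possible with `model.fit`, which accepts multiple `(DataLoader, loss)` objectives and draws a batch from each per training step (a sketch; the example data is a placeholder, and AnglE + CoSENT is collapsed to just AnglELoss for brevity):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("bert-base-uncased")
scored_examples = [InputExample(texts=["A plane is taking off.", "A plane departs."], label=0.9)]
positive_examples = [InputExample(texts=["A plane is taking off.", "A plane departs."])]

model.fit(
    train_objectives=[
        (DataLoader(scored_examples, shuffle=True, batch_size=64),
         losses.AnglELoss(model=model)),
        (DataLoader(positive_examples, shuffle=True, batch_size=64),
         losses.MultipleNegativesRankingLoss(model=model)),
    ],
    epochs=1,
)
```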
In code:

```python
import copy

from torch import nn
from sentence_transformers import losses

class FullAngleLoss(nn.Module):
    def __init__(self, model) -> None:
        super().__init__()
        self.angle_loss = losses.AnglELoss(model=model)
        self.cosent_loss = losses.CoSENTLoss(model=model)
        self.ibn = losses.MultipleNegativesRankingLoss(model=model)

    def forward(self, sentence_features, labels):
        # Only feed MNRL the pairs that are confidently positive
        positive_pairs = [
            {key: value[labels >= 0.8] for key, value in features.items()}
            for features in copy.deepcopy(sentence_features)
        ]
        loss = (
            self.angle_loss(sentence_features, labels)
            + self.cosent_loss(sentence_features, labels)
            # upweight the in-batch negatives term; MNRL ignores the labels argument
            + 3 * self.ibn(positive_pairs, labels)
        )
        return loss

train_loss = FullAngleLoss(model=model)
```

And this reaches a competitive 0.8486 Spearman correlation coefficient on the test set with a batch size of 64.
The docstring changes make the docs slightly prettier
I believe this might be ready. Any last comments or suggestions before I move forward with this @johneckberg @SeanLee97?
@tomaarsen I don't have any!
No more comments. |
Much appreciated to you both; this is a very exciting addition.