
Store embedding similarity/distance metric in model configuration #2441

Closed
tomaarsen opened this issue Jan 23, 2024 · 8 comments

Comments

@tomaarsen
Collaborator

Hello!

Context

This issue describes a feature that I am planning to include in a release before v3, or alternatively, in v3 of Sentence Transformers.
See also #1649 for a PR related to this proposal.

Details

Different models use different embedding similarity/distance metrics or score functions to compare their embeddings. Two notable ones are dot product and cosine similarity, though others exist. It may be valuable to include these in the model configuration (config_sentence_transformers.json) as strings:

{
    ...,
    "score_function": "cosine_similarity",
    ...,
}

or

{
    ...,
    "score_function": "dot_product",
    ...,
}

Ideally, upon loading the model, e.g. model.compare or model.score_function should point to the saved similarity measure string or function (cosine similarity by default). The names of these parameters, etc. are still undecided.
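A minimal sketch of how such a config key could be resolved at load time. The key name, the default, and the resolve_score_function helper below are all hypothetical, since the parameter names are still undecided:

```python
import math

# Hypothetical mapping from the proposed config strings to callables.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

SCORE_FUNCTIONS = {
    "cosine_similarity": cosine_similarity,
    "dot_product": dot_product,
}

def resolve_score_function(config):
    # config is the parsed config_sentence_transformers.json dict;
    # fall back to cosine similarity when no score function is stored.
    return SCORE_FUNCTIONS[config.get("score_function", "cosine_similarity")]
```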

cc @bwanglzu @ir2718 @johneckberg @aamir-s18 as I know you're interested in my TODO list.

  • Tom Aarsen
@ir2718
Contributor

ir2718 commented Jan 25, 2024

I'd like to have a go at this if others don't mind.

@tomaarsen
If I'm not mistaken, this score function is determined by the choice of the distance metric in the loss function?

@tomaarsen
Collaborator Author

Please do! And yes, essentially, although sometimes one metric performs surprisingly well despite the loss function not purposefully aiming to optimize that distance/score metric. Sentence Transformers does have this class: https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/evaluation/SimilarityFunction.py

But it may not be exhaustive. Some of the evaluators rely on the user to specify this in the evaluator initialization, but I think it could be stored in the model itself, then the users won't have to specify it explicitly either.
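For reference, a stripped-down sketch of a string-backed enum along the lines of that SimilarityFunction class (the member names and values below are illustrative, not the library's actual definitions); storing strings as values would make the config round-trip trivial:

```python
from enum import Enum

class SimilarityFunction(Enum):
    # Illustrative values; string values round-trip cleanly through the config.
    COSINE = "cosine_similarity"
    DOT_PRODUCT = "dot_product"
    EUCLIDEAN = "euclidean"
    MANHATTAN = "manhattan"

# String from config_sentence_transformers.json -> enum member, and back for saving.
loaded = SimilarityFunction("dot_product")
saved = loaded.value
```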

Apologies that I haven't gotten around to your other PR yet. I've been very busy these last few days working on #2436. Expect a huge PR for that very soon 😄

  • Tom Aarsen

@ir2718
Contributor

ir2718 commented Jan 25, 2024

Thanks for the quick answer and the enum heads up. Completely missed that file.

Also, don't worry about my other PR. Sentence-transformers is very different from the classic Huggingface libraries so I can only imagine the amount of stuff you've got on your hands.

@tomaarsen
Collaborator Author

I see that there are a handful of places where distance functions are defined elsewhere in the codebase as well.

Perhaps we should introduce one file/class where all of the distance metrics are defined. It's tricky to come up with a good design, though.

  • Tom Aarsen

@ir2718
Contributor

ir2718 commented Jan 26, 2024

@tomaarsen

What makes most sense to me is to fit everything into a single enum. It's probably going to require very little refactoring, and it's going to get rid of distance/similarity definitions all over the place.

Regarding the EmbeddingSimilarityEvaluator class, I noticed manhattan/euclidean similarity is defined as the negative distance. In case the user doesn't define the scoring function and fits a model using the EmbeddingSimilarityEvaluator, I would like to set the scoring function that corresponds to the highest scoring metric. Setting the scoring function as the negative distance seems a bit odd, and I've usually seen manhattan/euclidean similarity definitions such as: 1/(1 + dist(a, b)). What's your take on this?
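To make the two conventions concrete, here is what they look like for Euclidean distance (a small sketch; both forms are monotonically decreasing in the distance, so they rank pairs identically, they just live on different scales):

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def neg_distance_similarity(a, b):
    # the EmbeddingSimilarityEvaluator convention: similarity = -distance
    return -euclidean_distance(a, b)

def bounded_similarity(a, b):
    # the 1 / (1 + dist) convention: maps any distance into (0, 1]
    return 1.0 / (1.0 + euclidean_distance(a, b))
```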

@tomaarsen
Collaborator Author

tomaarsen commented Jan 26, 2024

What makes most sense to me is to fit everything into a single enum. It's probably going to require very little refactoring, and it's going to get rid of distance/similarity definitions all over the place.

Agreed, that seems sensible.

Regarding the EmbeddingSimilarityEvaluator class, I noticed manhattan/euclidean similarity is defined as the negative distance.

I think this originates from the literature where similarity is inversely related to distance. E.g. cosine similarity is often written as
$$\frac{A \cdot B}{||A|| \cdot ||B||}$$
whereas cosine distance is:
$$1 - \frac{A \cdot B}{||A|| \cdot ||B||}$$
(see, e.g., the Wikipedia article on cosine similarity).

I would need to dive into the definitions for manhattan/euclidean, but it seems plausible that it's simply the negation of the similarity (and vice versa). Presumably the proper way to compute the Spearman/Pearson for manhattan/euclidean is then with the distance, rather than the similarity? That could make our enum a bit tricky, perhaps...

  • Tom Aarsen

@ir2718
Contributor

ir2718 commented Jan 27, 2024

That could make our enum a bit tricky, perhaps...

That's exactly what I was thinking. Although, the only thing that changes when you use the similarity definition instead of the negative distance is the Pearson coefficient; Spearman is rank-based, and both definitions order the pairs identically. For now, I'll just set it as the negative distance.
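A quick sanity check of that claim: both -dist and 1/(1 + dist) are strictly decreasing in the distance, so Spearman (which only looks at ranks) is identical under either definition, while Pearson (invariant only under affine transforms) generally differs. A self-contained sketch with made-up numbers:

```python
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    # Rank-based correlation; this simple ranking ignores ties,
    # which is fine for the tie-free toy data below.
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))

gold = [0.1, 0.5, 0.9, 0.3]            # made-up gold similarity labels
dist = [2.0, 1.2, 0.3, 1.6]            # made-up predicted distances
neg = [-d for d in dist]               # negative-distance similarity
bounded = [1 / (1 + d) for d in dist]  # 1 / (1 + dist) similarity
```

Spearman agrees exactly between the two definitions; Pearson does not.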

@tomaarsen
Collaborator Author

Implemented in the v3.0 release.

  • Tom Aarsen
