
Store embedding similarity/distance metric in model configuration #2441

Closed
tomaarsen opened this issue Jan 23, 2024 · 8 comments

Comments

@tomaarsen
Collaborator

Hello!

Context

This issue describes a feature that I am planning to include in a release before v3, or alternatively, in v3 of Sentence Transformers.
See also #1649 for a PR related to this proposal.

Details

Different models use different embedding similarity/distance metrics or score functions to compare their embeddings. Two notable ones are dot product and cosine similarity, though others exist. It may be valuable to include these in the model configuration (config_sentence_transformers.json) as strings:

{
    ...,
    "score_function": "cosine_similarity",
    ...,
}

or

{
    ...,
    "score_function": "dot_product",
    ...,
}

Ideally, upon loading the model, e.g. model.compare or model.score_function should point to the saved similarity measure string or function (cosine similarity by default). The names of these parameters, etc. are still undecided.
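A minimal sketch of how such a config key could be resolved at load time. The key name, the default, and the resolve_score_function helper below are all hypothetical, since the parameter names are still undecided:

```python
import math

# Hypothetical mapping from the proposed config strings to callables.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

SCORE_FUNCTIONS = {
    "cosine_similarity": cosine_similarity,
    "dot_product": dot_product,
}

def resolve_score_function(config):
    # config is the parsed config_sentence_transformers.json dict;
    # fall back to cosine similarity when no score function is stored.
    return SCORE_FUNCTIONS[config.get("score_function", "cosine_similarity")]
```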

cc @bwanglzu @ir2718 @johneckberg @aamir-s18 as I know you're interested in my TODO list.

  • Tom Aarsen
@ir2718
Contributor

ir2718 commented Jan 25, 2024

I'd like to have a go at this if others don't mind.

@tomaarsen
If I'm not mistaken, this score function is determined by the choice of the distance metric in the loss function?

@tomaarsen
Collaborator Author

Please do! And yes, essentially, although sometimes one metric performs surprisingly well despite the loss function not purposefully aiming to optimize that distance/score metric. Sentence Transformers does have this class: https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/evaluation/SimilarityFunction.py

But it may not be exhaustive. Some of the evaluators rely on the user to specify this in the evaluator initialization, but I think it could be stored in the model itself, then the users won't have to specify it explicitly either.
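For reference, a stripped-down sketch of a string-backed enum along the lines of that SimilarityFunction class (the member names and values below are illustrative, not the library's actual definitions); storing strings as values would make the config round-trip trivial:

```python
from enum import Enum

class SimilarityFunction(Enum):
    # Illustrative values; string values round-trip cleanly through the config.
    COSINE = "cosine_similarity"
    DOT_PRODUCT = "dot_product"
    EUCLIDEAN = "euclidean"
    MANHATTAN = "manhattan"

# String from config_sentence_transformers.json -> enum member, and back for saving.
loaded = SimilarityFunction("dot_product")
saved = loaded.value
```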

Apologies that I haven't gotten around to your other PR yet. I've been very busy these last few days working on #2436. Expect a huge PR for that very soon 😄

  • Tom Aarsen

@ir2718
Contributor

ir2718 commented Jan 25, 2024

Thanks for the quick answer and the enum heads up. Completely missed that file.

Also, don't worry about my other PR. Sentence-transformers is very different from the classic Huggingface libraries so I can only imagine the amount of stuff you've got on your hands.

@tomaarsen
Collaborator Author

I see that there are a handful of places where distance functions are defined elsewhere in the codebase as well.

Perhaps we should introduce one file/class where all of the distance metrics are defined. It's tricky to come up with a good design, though.

  • Tom Aarsen

@ir2718
Contributor

ir2718 commented Jan 26, 2024

@tomaarsen

What makes most sense to me is to fit everything into a single enum. It's probably going to require very little refactoring, and it's going to get rid of distance/similarity definitions all over the place.

Regarding the EmbeddingSimilarityEvaluator class, I noticed manhattan/euclidean similarity is defined as the negative distance. In case the user doesn't define the scoring function and fits a model using the EmbeddingSimilarityEvaluator, I would like to set the scoring function that corresponds to the highest scoring metric. Setting the scoring function as the negative distance seems a bit odd, and I've usually seen manhattan/euclidean similarity definitions such as: 1/(1 + dist(a, b)). What's your take on this?
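To make the two conventions concrete, here is what they look like for Euclidean distance (a small sketch; both forms are monotonically decreasing in the distance, so they rank pairs identically, they just live on different scales):

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def neg_distance_similarity(a, b):
    # the EmbeddingSimilarityEvaluator convention: similarity = -distance
    return -euclidean_distance(a, b)

def bounded_similarity(a, b):
    # the 1 / (1 + dist) convention: maps any distance into (0, 1]
    return 1.0 / (1.0 + euclidean_distance(a, b))
```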

@tomaarsen
Collaborator Author

tomaarsen commented Jan 26, 2024

What makes most sense to me is to fit everything into a single enum. It's probably going to require very little refactoring, and it's going to get rid of distance/similarity definitions all over the place.

Agreed, that seems sensible.

Regarding the EmbeddingSimilarityEvaluator class, I noticed manhattan/euclidean similarity is defined as the negative distance.

I think this originates from the literature where similarity is inversely related to distance. E.g. cosine similarity is often written as
$$\frac{A \cdot B}{||A|| \cdot ||B||}$$
whereas cosine distance is:
$$1 - \frac{A \cdot B}{||A|| \cdot ||B||}$$
(see, e.g., the Wikipedia article on cosine similarity).

I would need to dive into the definitions for manhattan/euclidean, but it seems plausible that it's simply the negation of the similarity (and vice versa). Presumably the proper way to compute the Spearman/Pearson for manhattan/euclidean is then with the distance, rather than the similarity? That could make our enum a bit tricky, perhaps...

  • Tom Aarsen

@ir2718
Contributor

ir2718 commented Jan 27, 2024

That could make our enum a bit tricky, perhaps...

That's exactly what I was thinking. Although, the only thing that changes when you use the similarity definition instead of the negative distance is the Pearson coefficient; Spearman is rank-based, and both definitions order the pairs identically. For now, I'll just set it as the negative distance.
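A quick sanity check of that claim: both -dist and 1/(1 + dist) are strictly decreasing in the distance, so Spearman (which only looks at ranks) is identical under either definition, while Pearson (invariant only under affine transforms) generally differs. A self-contained sketch with made-up numbers:

```python
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    # Rank-based correlation; this simple ranking ignores ties,
    # which is fine for the tie-free toy data below.
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))

gold = [0.1, 0.5, 0.9, 0.3]            # made-up gold similarity labels
dist = [2.0, 1.2, 0.3, 1.6]            # made-up predicted distances
neg = [-d for d in dist]               # negative-distance similarity
bounded = [1 / (1 + d) for d in dist]  # 1 / (1 + dist) similarity
```

Spearman agrees exactly between the two definitions; Pearson does not.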

@tomaarsen
Collaborator Author

Implemented in the v3.0 release.

  • Tom Aarsen
