# Store embedding similarity/distance metric in model configuration #2441
I'd like to have a go at this if others don't mind. @tomaarsen

---
Please do! And yes, essentially, although sometimes one metric performs surprisingly well despite the loss function not purposefully aiming to optimize that distance/score metric.

Sentence Transformers does have this class: https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/evaluation/SimilarityFunction.py
But it may not be exhaustive. Some of the evaluators rely on the user to specify this in the evaluator initialization, but I think it could be stored in the model itself; then users won't have to specify it explicitly either.

Apologies that I haven't gotten around to your other PR yet. I've been very busy these last few days working on #2436. Expect a huge PR for that very soon 😄

---
Thanks for the quick answer and the enum heads-up; I completely missed that file. Also, don't worry about my other PR. Sentence Transformers is very different from the classic Hugging Face libraries, so I can only imagine the amount of stuff you've got on your hands.

---
I see that distance functions are defined in a handful of other places across the codebase as well.
Perhaps we should introduce one file/class where all of the distance metrics are defined. It's tricky to come up with a good design, though.
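For illustration, a minimal sketch of what such a central definition might look like. The class and method names here are assumptions for the sake of this thread, not the library's actual API at the time:

```python
from enum import Enum

import torch


class SimilarityFunction(Enum):
    """A single home for every similarity/score function used in the library."""

    COSINE = "cosine"
    DOT_PRODUCT = "dot"
    EUCLIDEAN = "euclidean"
    MANHATTAN = "manhattan"

    def score(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        """Pairwise scores for two batches of embeddings of equal shape."""
        if self is SimilarityFunction.COSINE:
            return torch.nn.functional.cosine_similarity(a, b, dim=-1)
        if self is SimilarityFunction.DOT_PRODUCT:
            return (a * b).sum(dim=-1)
        if self is SimilarityFunction.EUCLIDEAN:
            # Higher is better, hence the negation discussed below.
            return -torch.norm(a - b, p=2, dim=-1)
        return -torch.norm(a - b, p=1, dim=-1)  # MANHATTAN
```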
---
What makes most sense to me is to fit everything into a single enum. It will probably require very little refactoring, and it will get rid of the distance/similarity definitions scattered all over the place.

Regarding the EmbeddingSimilarityEvaluator class: I noticed Manhattan/Euclidean similarity is defined as the negative distance. In case the user doesn't specify the scoring function and fits a model using the EmbeddingSimilarityEvaluator, I would like to set the scoring function to the one corresponding to the highest-scoring metric. Defining the similarity as the negative distance seems a bit odd, though, and I've usually seen Manhattan/Euclidean similarity definitions such as:
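One common bounded form, presumably the kind of definition meant here, is

$$\mathrm{sim}(x, y) = \frac{1}{1 + d(x, y)},$$

which maps any non-negative distance into $(0, 1]$.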
---

Agreed, that seems sensible.
I think this originates from the literature, where similarity is inversely related to distance; e.g. cosine similarity is often written as one minus the cosine distance. I would need to dive into the definitions for Manhattan/Euclidean, but it seems plausible that the distance is simply the negation of the similarity (and vice versa). Presumably the proper way to compute the Spearman/Pearson correlations for Manhattan/Euclidean is then with the distance, rather than the similarity? That could make our enum a bit tricky, perhaps...
---
That's exactly what I was thinking. Although, the only thing that changes when you use that similarity definition instead of the negative distance is the Pearson coefficient. For now, I'll just set it as the negative distance.
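A quick sketch of why only Pearson changes: Spearman depends only on rank order, which every monotone transform of the distance preserves, while Pearson measures linear association, which the nonlinear $1/(1+d)$ reshaping does not preserve. (The data below is synthetic and purely illustrative.)

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
gold = rng.random(50)        # gold similarity labels
dist = rng.random(50) * 4    # model distances, e.g. Euclidean

neg_dist = -dist               # similarity as negative distance
bounded = 1.0 / (1.0 + dist)   # similarity as 1 / (1 + d)

# Both transforms are monotonically decreasing in the distance, so the
# rank order -- and therefore Spearman -- is identical.
print(spearmanr(gold, neg_dist)[0], spearmanr(gold, bounded)[0])
# Pearson is sensitive to the nonlinear reshaping, so the two differ.
print(pearsonr(gold, neg_dist)[0], pearsonr(gold, bounded)[0])
```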
---

Implemented in the v3.0 release.

---
Hello!
### Context
This issue describes a feature that I am planning to include in a release before v3, or alternatively, in v3 of Sentence Transformers.
See also #1649 for a PR related to this proposal.
### Details
Different models use different embedding similarity/distance metrics or score functions to compare their embeddings. Two notable ones are dot product and cosine similarity, though others exist. It may be valuable to include these in the model configuration (`config_sentence_transformers.json`) as strings:
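For instance, the entry might look like one of the following (the key and value names are illustrative only; as noted below, the naming was still undecided at this point):

```json
{
    "similarity_fn_name": "cosine"
}
```

or

```json
{
    "similarity_fn_name": "dot"
}
```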
Ideally, upon loading the model, e.g. `model.compare` or `model.score_function` should point to the saved similarity measure string or function (cosine similarity by default). The names of these parameters, etc. are still undecided.

cc @bwanglzu @ir2718 @johneckberg @aamir-s18, as I know you're interested in my TODO list.
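As a sketch of the intended usage, assuming the attribute and method names that eventually shipped in v3 (`similarity_fn_name` and `similarity()`), which were not yet decided when this issue was written:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Loaded from config_sentence_transformers.json; "cosine" is the default
# when no metric was saved with the model.
print(model.similarity_fn_name)

embeddings = model.encode(["A first sentence.", "A second sentence."])
# Score pairs with the model's own metric instead of hard-coding cosine
# similarity at the call site.
scores = model.similarity(embeddings, embeddings)
print(scores)
```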