
StaticEmbedding.from_distillation("BAAI/bge-base-en-v1.5", device="cuda") not producing similarity results from docstring #3175

Closed
thomasht86 opened this issue Jan 17, 2025 · 4 comments · Fixed by #3177

@thomasht86

Issue

From https://www.sbert.net/docs/package_reference/sentence_transformer/models.html#sentence_transformers.models.StaticEmbedding

(The pre-distilled and randomized options are commented out so that the distilled bge-base is the one that runs.)

from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding
from tokenizers import Tokenizer

# Pre-distilled embeddings:
# static_embedding = StaticEmbedding.from_model2vec("minishlab/M2V_base_output")
# or distill your own embeddings:
static_embedding = StaticEmbedding.from_distillation("BAAI/bge-base-en-v1.5", device="cuda")
# or start with randomized embeddings:
# tokenizer = Tokenizer.from_pretrained("FacebookAI/xlm-roberta-base")
#static_embedding = StaticEmbedding(tokenizer, embedding_dim=512)

model = SentenceTransformer(modules=[static_embedding])

embeddings = model.encode(["What are Pandas?", "The giant panda (Ailuropoda melanoleuca; Chinese: 大熊猫; pinyin: dàxióngmāo), also known as the panda bear or simply the panda, is a bear native to south central China."])
similarity = model.similarity(embeddings[0], embeddings[1])
# tensor([[0.9177]]) (If you use the distilled bge-base)

For me, this produces

tensor([[0.5375]])
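For context, `SentenceTransformer.similarity` defaults to cosine similarity, so the numbers in this thread are just the cosine of the angle between the two sentence embeddings. A minimal NumPy sketch of that computation (the helper name is mine, not the library's):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the two L2-normalized vectors.
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors: identical directions give 1.0, orthogonal directions give 0.0.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Since the score depends entirely on the embeddings, any change in the distilled embedding matrix (distillation settings, vocabulary, library version) shifts the reported similarity, which is why different setups land on different values here.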

Environment

sentence-transformers=3.3.1
model2vec==0.3.6

Colab to reproduce: Open In Colab

@tomaarsen (Collaborator)

Hello!

Hmm, I'm able to reproduce the 0.5375, but I'm unable to get the 0.9177 now, even with older versions of ST and M2V. I have no clue how I once managed to get that value.

cc @Pringled @stephantul do either of you have any idea how I might have gotten 0.9177?

  • Tom Aarsen

@Pringled (Contributor) commented Jan 17, 2025

Hey! Strange, I also get 0.5375 with older versions. I tried using bge-base directly (SentenceTransformer("BAAI/bge-base-en-v1.5")) as well, but that gives 0.7500. I also tried some of the other models we had around that time, but none seem to reproduce 0.9177.

One interesting note: the new default model (minishlab/potion-base-8M) gets 0.7157 for this example, much closer to the original model's result. So if you do decide to update the docs, it might be nice to switch the example model there as well.

@thomasht86 (Author)

I see. Thanks for the quick response.
Feel free to close the issue if/when the example is updated; I'll look into minishlab/potion-base-8M.

@tomaarsen (Collaborator)

Here are the currently interesting Static Embedding models:

See the new Static Embeddings blogpost for more details on the latter two: https://huggingface.co/blog/static-embeddings


  • Tom Aarsen
