which model should be used? #13

Open
MohammadAsadolahi opened this issue Mar 14, 2024 · 5 comments

Comments

@MohammadAsadolahi

Hi and thank you for sharing this amazing work.

I want to use GritLM to produce embeddings to be stored in a vector database for document retrieval, but there are many models on the Hugging Face Hub.

  1. Which model is best for producing text embeddings for document retrieval based on a query?
  2. When I search for a query using cosine similarity, should I add an instruction, or just produce the embeddings and search based on cosine similarity?
@Muennighoff
Collaborator

  1. https://huggingface.co/GritLM/GritLM-7B
  2. You can always add instructions, but for documents it is not needed/unlikely to help a lot.
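
For reference, a minimal sketch of that setup, closely following the usage in this repo's README; the query/document strings and the instruction text below are made-up examples:

```python
# Minimal retrieval sketch with GritLM-7B, following the README usage.
# The query/document strings and the instruction are illustrative only.
from gritlm import GritLM
from scipy.spatial.distance import cosine

model = GritLM("GritLM/GritLM-7B", torch_dtype="auto")

def gritlm_instruction(instruction):
    # GritLM's embedding prompt format: an optional instruction in the user turn,
    # followed by <|embed|>, which marks the text to be embedded.
    return "<|user|>\n" + instruction + "\n<|embed|>\n" if instruction else "<|embed|>\n"

queries = ["how does mean pooling work in embedding models"]
documents = ["Mean pooling averages the token hidden states into a single vector."]

# Instruction on the query side only; documents are embedded without one,
# so the corpus can be encoded once, stored in the vector database, and reused.
q_rep = model.encode(queries, instruction=gritlm_instruction(
    "Given a question, retrieve a passage that answers it"))
d_rep = model.encode(documents, instruction=gritlm_instruction(""))

print(1 - cosine(q_rep[0], d_rep[0]))  # cosine similarity of query 0 vs. document 0
```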

@phartman-keysight

  1. So you don't recommend adding instructions like "Represent X for finding Y" when embedding documents or queries?
  2. If not, do you recommend adding instructions like that if we fine-tune?

@Muennighoff
Collaborator

  1. I do recommend instructions for queries. For documents it is not needed/unlikely to help a lot.
  2. I recommend adding instructions for finetuning, especially for queries. You can also add them for documents and it may help.

More research is needed to establish the exact benefit from document instructions. Here's the part from the paper that discusses it a bit:

Embedding dataset We benchmark MEDI [143], a new version of MEDI with better negatives which we build and call MEDI2, and the E5 dataset [160]. While MEDI and MEDI2 always preface instructions with "Represent" (see e.g. Figure 10), the E5 dataset places no constraint on the instruction prefix (see e.g. Figure 11). Thus, when using the E5 dataset the "<|embed|>" formatting is critical to tell the model that it will be subject to the representation loss, not the generative loss (Figure 3). Further, MEDI and MEDI2 always contain instructions for both queries and documents, which we refer to as two-sided instructions. Meanwhile, the E5 dataset uses one-sided instructions for asymmetric datasets [104], whereby the documents receive no instructions, only the queries. The advantage of not using document instructions is that the document corpus can be encoded once and then cached and reused across a variety of tasks. During training on E5, symmetric tasks are also in a one-sided setting, but we still evaluate them in the two-sided format. This should not be a problem as the cosine similarity function we use during training is transitive: if sentence A with instruction is similar to sentence B without instruction, and sentence B without instruction is similar to sentence C with instruction, then we can confidently say that sentence A with instruction is also similar to sentence C with instruction. As depicted in Table 6, using the E5 dataset performs best by a wide margin. An inspection of samples suggests that this is likely due to its superior hard negatives and diversity of tasks generated by GPT-4 (Appendix N). For our final runs with the E5 dataset, we additionally add scientific data (§3.1).
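
To make the one-sided vs. two-sided distinction concrete, here is an illustrative sketch of how a single query/document pair could be formatted under each scheme using the "<|embed|>" template; the instruction strings and texts are hypothetical, not actual MEDI2 or E5 samples:

```python
# Illustrative only: hypothetical strings showing two-sided vs. one-sided
# instruction formatting with GritLM's embedding template (not real MEDI2/E5 data).

def embed_prompt(text, instruction=""):
    # Same template as gritlm_instruction above: the optional instruction sits in
    # the user turn, and <|embed|> marks the input for the representation loss.
    prefix = "<|user|>\n" + instruction + "\n<|embed|>\n" if instruction else "<|embed|>\n"
    return prefix + text

query = "how does mean pooling work"
document = "Mean pooling averages token hidden states into a single vector."

# Two-sided (MEDI/MEDI2-style): both sides carry a "Represent ..." instruction.
two_sided = (
    embed_prompt(query, "Represent the question for retrieving supporting documents"),
    embed_prompt(document, "Represent the document for retrieval"),
)

# One-sided (E5-style, asymmetric tasks): only the query carries an instruction,
# so document embeddings can be cached and reused across tasks.
one_sided = (
    embed_prompt(query, "Given a question, retrieve a passage that answers it"),
    embed_prompt(document),
)

print(two_sided[1])
print(one_sided[1])
```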

@phartman-keysight

Okay, so use instructions for document retrieval, just on the query embedding side, not the document embedding side. Thanks for the excerpt; I understand the one-sided instructions now.

Do you have any other recommendations for fine-tuning the existing GritLM model for embedding only?

@Muennighoff
Collaborator

Exactly. You can also use them for the document embedding side if you want, but the benefit is unclear to me. Would be interesting to know! If you are only interested in embedding performance, I would probably fine-tune from the embedding-only variant instead: https://huggingface.co/GritLM/emb_m7_nodes16_fast

Other than that, I'd follow the recommendations in the paper (bidirectional attention, large batch size, etc.).
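
In case it's useful, a hedged sketch of loading that embedding-only checkpoint as the starting point; the mode and pooling_method keyword arguments are assumptions about the GritLM wrapper's options, and the actual fine-tuning loop (contrastive loss, bidirectional attention, large batches) would come from the training code in this repo:

```python
# Sketch of loading the embedding-only checkpoint as a fine-tuning starting point.
# The mode/pooling_method kwargs are assumptions about the GritLM wrapper's options;
# the actual contrastive fine-tuning loop lives in this repo's training code.
from gritlm import GritLM

model = GritLM(
    "GritLM/emb_m7_nodes16_fast",
    torch_dtype="auto",
    mode="embedding",       # embedding-only: no generative head/loss needed (assumption)
    pooling_method="mean",  # mean pooling over the embedded tokens, as in the paper
)

# Quick sanity check that the checkpoint produces embeddings before fine-tuning.
embs = model.encode(["example passage to embed"], instruction="<|embed|>\n")
print(embs.shape)
```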
