Use of Huggingface Transformers for Embeddings trades inference speed for embedding speed? #251
-
Are others noticing that inference speed is now slower when using a vectorstore created through the new embedding approach? Relevant changes: 23d24c8#comments. I can see that the speed of ingest.py has increased, but if I compare the latest code of …
Replies: 3 comments 2 replies
-
It should not affect the speed of the LLM directly. A change in the size of the returned pieces of context from the embeddings could affect it indirectly: the longer the prompt, the longer it takes for the LLM to process and respond. Maybe with the new embeddings we are generating slightly bigger prompts. You could adjust the chunk and overlap sizes in the ingest.py file and test it out. You could also reduce the number of sources from the default (4) to, for example, 2; that should have a big impact on overall speed.
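A rough sketch of those two knobs, assuming ingest.py uses a LangChain-style text splitter and retriever; the class names, values, and the retriever call here are illustrative assumptions, not the repo's actual code:

```python
# Sketch only: assumes a LangChain text splitter in ingest.py and a LangChain
# retriever at query time. Values are examples to experiment with.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Smaller chunks and less overlap mean less context text per retrieved source,
# which shortens the prompt the LLM has to process.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
# texts = text_splitter.split_documents(documents)

# Returning fewer sources (k) also shortens the prompt; 4 is a common default.
# retriever = db.as_retriever(search_kwargs={"k": 2})
```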
-
In case others stumble across this in the future: I now have a better understanding of why the change to the document embeddings is not impacting the performance of the LLM inference step. Previously I had assumed (wrongly) that the document embeddings generated in ingest.py were fed directly to the LLM; in fact they are only used to look up the most relevant documents, and those relevant documents are then passed to the LLM for inference as normal text. I still think there are cases where we may want more complex embedding representations when we have very large corpora of text, but for now the new method of HuggingFace Transformers (HGFT) embedding generation is well worth the speed increase.
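To make that flow concrete, here is a minimal sketch of the retrieval step, assuming a LangChain-style setup; the vector store (Chroma), the model name, and the prompt format are assumptions for illustration rather than this repo's exact code:

```python
# Minimal sketch of the retrieval flow described above.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = Chroma(persist_directory="db", embedding_function=embeddings)

# The embedding model is only used here, to find the chunks most similar to the query.
query = "What changed in the latest release?"
docs = db.similarity_search(query, k=4)

# The LLM never sees the vectors; it gets the retrieved chunks as plain text,
# so inference cost depends on prompt length, not on which embedding model was used.
context = "\n\n".join(doc.page_content for doc in docs)
prompt = f"Answer using the context below.\n\n{context}\n\nQuestion: {query}"
# answer = llm(prompt)
```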
-
Will HuggingFaceEmbeddings have GPU support? llamaCPP was adding GPU support this week.
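For what it's worth, LangChain's HuggingFaceEmbeddings wraps sentence-transformers, which can already run on a GPU by passing a device through model_kwargs; a minimal sketch (the model name is just an example, not necessarily what this project ships with):

```python
# Sketch: sentence-transformers (which backs HuggingFaceEmbeddings) accepts a
# device setting, so GPU use is mostly a configuration change.
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cuda"},  # "cpu" or "mps" on other hardware
)
```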