Use of Huggingface Transformers for Embeddings trades inference speed for embedding speed? #251
-
Are others noticing that inference speed is now slower when using a vectorstore created through the new embedding approach? Relevant changes: 23d24c8#comments. I can see that the speed of ingest.py has increased, but if I compare the latest code of …
Replies: 3 comments 2 replies
-
It should not affect the speed of the LLM directly. A change in the size of the returned pieces of context from the embeddings could affect it indirectly: the longer the prompt, the longer it takes for the LLM to process and respond. Maybe with the new embeddings we are generating slightly bigger prompts. You could adjust the chunk and overlap sizes in the ingest.py file and test it out. You could also reduce the number of sources from the default (4) to, for example, 2; that should have a big impact on overall speed.
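A rough sketch of those two knobs, assuming ingest.py uses a LangChain-style text splitter and retriever; the class names, values, and the retriever call here are illustrative assumptions, not the repo's actual code:

```python
# Sketch only: assumes a LangChain text splitter in ingest.py and a LangChain
# retriever at query time. Values are examples to experiment with.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Smaller chunks and less overlap mean less context text per retrieved source,
# which shortens the prompt the LLM has to process.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
# texts = text_splitter.split_documents(documents)

# Returning fewer sources (k) also shortens the prompt; 4 is a common default.
# retriever = db.as_retriever(search_kwargs={"k": 2})
```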
-
In case others stumble across this in the future: I now have a better understanding of why the change to the document embeddings is not impacting the performance of the LLM inference step. Previously I had assumed (wrongly) that the document embeddings generated in ingest.py were fed directly to the LLM; in fact they are only used to look up the most relevant documents, and those relevant documents are then passed to the LLM for inference as normal text. I still think there are cases where we may want more complex embedding representations when we have very large corpora of text, but for now the new method of HuggingFace Transformers (HGFT) embedding generation is well worth the speed increase.
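To make that flow concrete, here is a minimal sketch of the retrieval step, assuming a LangChain-style setup; the vector store (Chroma), the model name, and the prompt format are assumptions for illustration rather than this repo's exact code:

```python
# Minimal sketch of the retrieval flow described above.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = Chroma(persist_directory="db", embedding_function=embeddings)

# The embedding model is only used here, to find the chunks most similar to the query.
query = "What changed in the latest release?"
docs = db.similarity_search(query, k=4)

# The LLM never sees the vectors; it gets the retrieved chunks as plain text,
# so inference cost depends on prompt length, not on which embedding model was used.
context = "\n\n".join(doc.page_content for doc in docs)
prompt = f"Answer using the context below.\n\n{context}\n\nQuestion: {query}"
# answer = llm(prompt)
```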
-
Will HuggingFaceEmbeddings have GPU support? llamaCPP was adding GPU support this week.
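For what it's worth, LangChain's HuggingFaceEmbeddings wraps sentence-transformers, which can already run on a GPU by passing a device through model_kwargs; a minimal sketch (the model name is just an example, not necessarily what this project ships with):

```python
# Sketch: sentence-transformers (which backs HuggingFaceEmbeddings) accepts a
# device setting, so GPU use is mostly a configuration change.
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cuda"},  # "cpu" or "mps" on other hardware
)
```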