Replies: 4 comments 3 replies
-
I get better response times, but they are still quite slow. CPU-based inference will lag far behind GPU equivalents, but GPUs at the proper scale are very expensive.
-
I thought it was my machine. I have a Core i7, 16 GB RAM, an SSD, and a GeForce GPU, but it takes about 20 minutes to get a response. I suspect it's using the CPU rather than the GPU. Anyway, how can I check which it's using? Thanks
-
Everything runs entirely on the CPU at this time. There are efforts afoot to enable GPU usage, but how much that would be worth is unknown given typical consumer GPUs. Sorry I can't offer anything more specific.
-
Inference performance depends on the backend used, either GPT4All or llama.cpp. It is therefore not really dependent on this repo, which acts more like a wrapper around those tools, so I don't think there's an opportunity to do the same thing as the embedding performance improvement. What could be possible is allowing control over optional parameters to the backend, such as the llama.cpp options. For those with performance issues now who are using the llama.cpp backend, you can edit the call to the model in the code and pass extra parameters there (a sketch follows the reference link below).
Reference: https://abetlen.github.io/llama-cpp-python/#llama_cpp.llama.Llama.__init__
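For illustration only, here is a minimal sketch of constructing the model directly through llama-cpp-python with explicit performance options. The parameters `n_ctx`, `n_threads` and `n_batch` come from the reference above; the model path and the concrete values are placeholders, not a recommendation for this repo:

```python
# Minimal sketch: build the llama.cpp model with explicit performance options
# exposed by llama-cpp-python (see the Llama.__init__ reference linked above).
from llama_cpp import Llama

llm = Llama(
    model_path="models/ggml-model-q4_0.bin",  # placeholder path to a quantised model
    n_ctx=2048,      # context window size in tokens
    n_threads=8,     # CPU threads; roughly match the number of physical cores
    n_batch=512,     # prompt tokens processed per batch
)

output = llm(
    "Q: What does the knowledge base say about topic X? A:",
    max_tokens=256,
    stop=["Q:"],
)
print(output["choices"][0]["text"])
```

As a rough rule of thumb, `n_threads` should match the number of physical cores, and a larger `n_batch` trades memory for faster prompt processing.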
-
The ingestion improvement added yesterday was superb, thanks (I can now upload 1,000 files in a very reasonable time).
During inference, however, I noticed the following:
It is very slow in responding to a query. Naturally, my computer may be at fault here, but a single query uses over 6 GB of RAM and takes more than 15 minutes to produce a response. I am not sure how it works internally, but from where I stand it is not great at this point, so hopefully this is already identified and being analysed.
When I did get answers to questions, they were not of sufficient quality. In some cases it simply could not find the answer at all, even though it was in the knowledge base. In other cases it would list some responses that were relevant to a degree (at least when I asked about a keyword, that keyword appeared in the responses), but the rest of the context was not well understood. In general it looks as if the responses are just the chunks from the DB without any post-processing; in some cases the response was truncated in the middle of nowhere, probably because the chunk was incomplete and there is no mechanism to join it via overlap with the next chunk.
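For context on the overlap mechanism mentioned above: chunk overlap is normally configured at ingestion time, before anything reaches the vector store. A minimal sketch, assuming LangChain's RecursiveCharacterTextSplitter; the splitter choice, file name, and sizes are my assumptions, not necessarily what this repo uses:

```python
# Sketch: produce overlapping chunks at ingestion time so that neighbouring
# chunks share some text and answers are less likely to be cut off mid-thought.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # characters per chunk (assumed value)
    chunk_overlap=50,  # characters shared with the neighbouring chunk (assumed value)
)

with open("my_document.txt") as f:   # placeholder input file
    chunks = splitter.split_text(f.read())

print(f"{len(chunks)} chunks produced")
```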
I also missed a final wrapping of the response. Normally, what I see with LangChain is that once a top list of responses is obtained from the similarity search, there is a final query to the LLM that sends the question again with those responses as the in-context (ICL) material, to then produce a final answer for the user. I guess this step is missing? (The pattern is sketched right below.)
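For reference, this is roughly what that final wrapping looks like as a retrieval QA chain. A minimal sketch, assuming the classic (pre-0.1) langchain API, a persisted Chroma store, and a llama.cpp model; the paths, model, and `k` value are placeholders, not the actual configuration of this project:

```python
# Sketch of the "final wrapping" step: retrieve the top chunks by similarity,
# then ask the LLM to answer the question using those chunks as context.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import LlamaCpp
from langchain.chains import RetrievalQA

embeddings = HuggingFaceEmbeddings()
db = Chroma(persist_directory="db", embedding_function=embeddings)  # placeholder dir
llm = LlamaCpp(model_path="models/ggml-model-q4_0.bin", n_ctx=2048)  # placeholder model

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",                                  # stuff retrieved chunks into one prompt
    retriever=db.as_retriever(search_kwargs={"k": 4}),   # top-k chunks (assumed value)
    return_source_documents=True,                        # keep the raw chunks for inspection
)

result = qa("What does the knowledge base say about topic X?")
print(result["result"])                    # final LLM-generated answer
for doc in result["source_documents"]:
    print(doc.metadata.get("source"))      # which chunks were used as context
```

The point is the second LLM call: the raw chunks are only intermediate material, and the user-facing answer is generated from them rather than returned verbatim.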
I am sending this feedback in the Q&A section in case it is relevant. I would appreciate some sort of feedback, and of course I am ready to clarify or provide additional input or references if needed.