Replies: 4 comments 3 replies
-
I get better response times, but they are still quite slow. CPU-based inference will lag far behind GPU equivalents, but GPUs at the proper scale are very expensive.
-
I thought it was my machine. I have a Core i7, 16 GB RAM, an SSD, and a GeForce GPU, but it takes about 20 minutes to get a response. I suspect it's using the CPU rather than the GPU. Anyway, how can I check which it's using? Thanks
-
Everything runs entirely on the CPU at this time. There are efforts afoot to enable GPU usage, but how much that would be worth is unknown given typical consumer GPUs. Sorry I can't offer anything more specific.
-
Inference performance depends on the backend used, either GPT4All or llama.cpp. It is therefore not really dependent on this repo, which acts more like a wrapper around those tools, so I don't think there's an opportunity to do the same thing as the embedding performance improvement. What could be possible is allowing control over optional parameters to the backend, such as the llama.cpp options. For those with performance issues now who are using the llama.cpp backend, you can edit the call to the model in the code and pass extra parameters there (a sketch follows the reference link below).
Reference: https://abetlen.github.io/llama-cpp-python/#llama_cpp.llama.Llama.__init__
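For illustration only, here is a minimal sketch of constructing the model directly through llama-cpp-python with explicit performance options. The parameters `n_ctx`, `n_threads` and `n_batch` come from the reference above; the model path and the concrete values are placeholders, not a recommendation for this repo:

```python
# Minimal sketch: build the llama.cpp model with explicit performance options
# exposed by llama-cpp-python (see the Llama.__init__ reference linked above).
from llama_cpp import Llama

llm = Llama(
    model_path="models/ggml-model-q4_0.bin",  # placeholder path to a quantised model
    n_ctx=2048,      # context window size in tokens
    n_threads=8,     # CPU threads; roughly match the number of physical cores
    n_batch=512,     # prompt tokens processed per batch
)

output = llm(
    "Q: What does the knowledge base say about topic X? A:",
    max_tokens=256,
    stop=["Q:"],
)
print(output["choices"][0]["text"])
```

As a rough rule of thumb, `n_threads` should match the number of physical cores, and a larger `n_batch` trades memory for faster prompt processing.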
-
The ingestion improvement added yesterday was superb, thanks (I can now upload 1,000 files in a very reasonable time).
During inference, however, I noticed the following:
It is very slow in responding to a query. Naturally, my computer may be at fault here, but a single query uses over 6 GB of RAM and takes more than 15 minutes to produce a response. I am not sure how it works internally, but from where I stand it is not great at this point, so hopefully this is already identified and being analysed.
When I did get answers to questions, they were not of sufficient quality. In some cases it simply could not find the answer at all, even though it was in the knowledge base. In other cases it would list some responses that were relevant to a degree (at least when I asked about a keyword, that keyword appeared in the responses), but the rest of the context was not well understood. In general it looks as if the responses are just the chunks from the DB without any post-processing; in some cases the response was truncated in the middle of nowhere, probably because the chunk was incomplete and there is no mechanism to join it via overlap with the next chunk.
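For context on the overlap mechanism mentioned above: chunk overlap is normally configured at ingestion time, before anything reaches the vector store. A minimal sketch, assuming LangChain's RecursiveCharacterTextSplitter; the splitter choice, file name, and sizes are my assumptions, not necessarily what this repo uses:

```python
# Sketch: produce overlapping chunks at ingestion time so that neighbouring
# chunks share some text and answers are less likely to be cut off mid-thought.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # characters per chunk (assumed value)
    chunk_overlap=50,  # characters shared with the neighbouring chunk (assumed value)
)

with open("my_document.txt") as f:   # placeholder input file
    chunks = splitter.split_text(f.read())

print(f"{len(chunks)} chunks produced")
```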
I also missed a final wrapping of the response. Normally, what I see with LangChain is that once a top list of responses is obtained from the similarity search, there is a final query to the LLM that sends the question again with those responses as the in-context (ICL) material, to then produce a final answer for the user. I guess this step is missing? (The pattern is sketched right below.)
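For reference, this is roughly what that final wrapping looks like as a retrieval QA chain. A minimal sketch, assuming the classic (pre-0.1) langchain API, a persisted Chroma store, and a llama.cpp model; the paths, model, and `k` value are placeholders, not the actual configuration of this project:

```python
# Sketch of the "final wrapping" step: retrieve the top chunks by similarity,
# then ask the LLM to answer the question using those chunks as context.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import LlamaCpp
from langchain.chains import RetrievalQA

embeddings = HuggingFaceEmbeddings()
db = Chroma(persist_directory="db", embedding_function=embeddings)  # placeholder dir
llm = LlamaCpp(model_path="models/ggml-model-q4_0.bin", n_ctx=2048)  # placeholder model

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",                                  # stuff retrieved chunks into one prompt
    retriever=db.as_retriever(search_kwargs={"k": 4}),   # top-k chunks (assumed value)
    return_source_documents=True,                        # keep the raw chunks for inspection
)

result = qa("What does the knowledge base say about topic X?")
print(result["result"])                    # final LLM-generated answer
for doc in result["source_documents"]:
    print(doc.metadata.get("source"))      # which chunks were used as context
```

The point is the second LLM call: the raw chunks are only intermediate material, and the user-facing answer is generated from them rather than returned verbatim.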
I am sending this feedback in the Q&A section in case it is relevant. I would appreciate some sort of feedback, and of course I am ready to clarify or provide additional input or references if needed.