Change server approach to handle parallel requests #1550
sergey-zinchenko wants to merge 2 commits into abetlen:main
Conversation
@abetlen What do you think about these changes?
Hey, thanks for this PR. Is it possible to get it merged? 😄
@gerdemann @Smartappli Hi! I authored this PR two months ago. It looks like it has some conflicts now; I can fix them today if someone can merge it right afterwards.
@gerdemann @Smartappli I also see some activity in the main branch over those two months related to how the server handles parallel requests. Is this still an issue?
I still get this error when two requests are made at the same time. I tried to install your branch directly and test it, but I get this error. Do you have any idea what I am doing wrong?
Hi, I encountered the same issue. The service is still not handling concurrent requests properly. When I send a second request while the LLM is still generating a response for the first request, I receive this error. |
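For anyone who wants to reproduce the behaviour described above, a minimal sketch along these lines should trigger it by firing two chat-completion requests at the server concurrently. The base URL, port, and model name are assumptions about a default local setup, not values taken from these reports.

```python
# Hypothetical reproduction sketch: send two requests to a locally running
# llama-cpp-python server at the same time. Adjust URL/port/model to your setup.
import asyncio

import httpx

BASE_URL = "http://localhost:8000/v1/chat/completions"  # assumed default port


async def ask(client: httpx.AsyncClient, prompt: str) -> int:
    response = await client.post(
        BASE_URL,
        json={
            "model": "local-model",  # placeholder model name
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=120,
    )
    return response.status_code


async def main() -> None:
    async with httpx.AsyncClient() as client:
        # Sending both requests concurrently is what triggers the error
        # reported above when the server cannot handle parallel requests.
        statuses = await asyncio.gather(
            ask(client, "Tell me a short story."),
            ask(client, "Summarize the plot of Hamlet."),
            return_exceptions=True,
        )
    print(statuses)


if __name__ == "__main__":
    asyncio.run(main())
```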
I have implemented a smaller alternative change that solves the same problem in #1798.
I have changed the way the server handles concurrent requests. With this PR, arriving requests wait on the model's global async lock, so they are effectively organized into a queue. On top of that, I added a uvicorn configuration that allows only ten concurrent requests. So up to ten parallel requests will wait "in a queue" for the model lock, and the request currently being processed will not be interrupted. If an eleventh request arrives, the server immediately responds with 503. This approach suits the common scenarios of a multiuser chatbot UI and API access; a rough sketch of the idea follows below.
I also changed some other things to fix PEP warnings reported by the linter in my IDE.
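To make the described behaviour concrete, here is a minimal sketch (not the actual diff from this PR): a FastAPI app in which every request awaits a single global asyncio.Lock before calling the model, combined with uvicorn's `limit_concurrency` setting so that anything beyond ten in-flight requests gets an immediate 503. The route path, the fake model call, and the other names are illustrative only.

```python
# Sketch of the approach: requests queue on one global async lock around the
# model, and uvicorn rejects the eleventh concurrent request with a 503.
import asyncio

import uvicorn
from fastapi import FastAPI

app = FastAPI()

model_lock = asyncio.Lock()  # one lock guarding the single llama.cpp model


@app.post("/v1/completions")
async def generate(body: dict) -> dict:
    # Newly arriving requests wait here; the request currently holding the
    # lock finishes its generation without being interrupted.
    async with model_lock:
        # Placeholder for the actual (blocking) llama.cpp call.
        text = await asyncio.to_thread(lambda: f"echo: {body.get('prompt', '')}")
    return {"choices": [{"text": text}]}


if __name__ == "__main__":
    # limit_concurrency=10: the eleventh concurrent request is rejected
    # with an immediate 503 instead of joining the queue.
    uvicorn.run(app, host="0.0.0.0", port=8000, limit_concurrency=10)
```

The key point of this design is that only model access is serialized; the HTTP layer stays fully async, so queued requests simply await the lock instead of blocking a worker, and the concurrency cap bounds how long that queue can grow.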