Thank you for this project; I just started working with it yesterday. Really nice!
I have two RTX 3090s, each with 24 GB of VRAM. I use Ollama to run models (thanks for adding the local_ollama module!)
Two details for context on this issue:
- When the model used is larger than 24 GB, it loads across both GPUs, as expected, and both are used concurrently to answer a question
- When the model is smaller than 24 GB (so it fits on one GPU), it runs most efficiently on a single GPU, and Ollama knows to use only that one, so only one GPU is utilized
When the model is smaller, I can’t help but feel that this is a waste of the idle GPU and/or a waste of the user's time (albeit a small one)
This is not a fault in Gepetto, but I was considering how Gepetto could be adapted to utilize both GPUs, by adding the capability to use two independent instances of Ollama, one for each question
I took a quick look at the Gepetto code and I see that “explain” and “rename” are each a single question, so there’s no way to distribute either single operation across multiple Ollama instances
The best I could come up with to make better use of both GPUs is to request each operation (“explain” and “rename”) roughly concurrently, so each GPU is exercised
In theory this makes Gepetto twice as fast for those cases where the user wants to perform both operations
As I mentioned, to actually accomplish this, at least one change in Gepetto is required: asking both questions concurrently, by adding an “explain and rename” context menu option that fires off both
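For concreteness, here is a minimal sketch of what "fire off both questions" could look like against Ollama's HTTP API. This is not Gepetto's actual code: the endpoint, model name, prompt text, and helper names are all placeholders.

```python
# Hypothetical sketch: send the "explain" and "rename" questions concurrently
# to a single Ollama instance via its /api/generate endpoint. Everything here
# (URL, model, prompts, function names) is illustrative, not Gepetto code.
import concurrent.futures

import requests

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"  # default Ollama port
MODEL = "llama3"  # placeholder model name


def ask(prompt: str) -> str:
    """Send one non-streaming generation request and return the model's text."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]


def explain_and_rename(pseudocode: str) -> tuple[str, str]:
    """Ask both questions at the same time instead of back to back."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        explain = pool.submit(ask, f"Explain what this function does:\n{pseudocode}")
        rename = pool.submit(ask, f"Suggest better variable names:\n{pseudocode}")
        return explain.result(), rename.result()
```

With two Ollama instances (or a proxy in front of them, see below), the two in-flight requests could land on different GPUs.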
The second half of the solution could be implemented in Gepetto, or it could be left up to the user to solve. To actually use both GPUs, assuming Gepetto is first modified to ask both questions at once, I think there are two approaches:
- Add logic to Gepetto to accept two or more API endpoints and send each question to a different endpoint, with each endpoint backed by a separate instance of Ollama (e.g. one on port 11434 and one on port 11435); see the sketch after this list
- Leave the Gepetto logic as it is and put the burden on the user to set up a simple round-robin proxy (either a layer 3/4 load balancer like HAProxy, or an HTTP reverse proxy like nginx) that automatically distributes the questions
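As a rough illustration of the first option, Gepetto could keep a list of endpoints and rotate through them; the list, ports, and helper below are hypothetical, not existing Gepetto configuration.

```python
# Hypothetical sketch of option 1: rotate questions across several Ollama
# endpoints, one per GPU. Endpoint URLs and helper names are placeholders.
import itertools

import requests

ENDPOINTS = itertools.cycle([
    "http://127.0.0.1:11434/api/generate",  # Ollama instance bound to GPU 0
    "http://127.0.0.1:11435/api/generate",  # Ollama instance bound to GPU 1
])


def ask(model: str, prompt: str) -> str:
    """Send each question to the next endpoint in the rotation."""
    url = next(ENDPOINTS)  # simple round-robin; not hardened for heavy concurrency
    resp = requests.post(
        url,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```

For the second option, nginx and HAProxy both default to round-robin across their backends, so pointing Gepetto at the proxy's single address would give the same distribution with no changes to Gepetto at all.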
A few questions:
- Are you interested in this? The change required would be to ask both questions (roughly) simultaneously, by adding an “explain & rename” option to the context menu. I understand you may not be interested in “double the speed”, as it’s not as impactful as it sounds: typically the response time is negligible (2-3 seconds), so this might be a solution for a problem that doesn’t exist. Is it worth saving a user a few seconds?
- Do you have any thoughts on the two load-balancing options? If it were my project, I think I would lean towards simplicity and put it on the user to load balance. That avoids new logic in Gepetto, is more flexible in the long run if more features are added, and lets the user adjust the LLM endpoint(s) on the fly more easily
- Do you have time to add this functionality yourself, or would you insist on a PR? It should be a relatively small effort, as I believe you already have an asynchronous implementation; it might just be a quick copy/paste or two
If you don’t think this is worthwhile, please feel free to close the issue with or without a comment
Thanks again!
Hi! Thanks a lot for the detailed report, really appreciated! In theory, using the two GPUs in parallel could work, but the truth is that the two operations aren't really independent of one another. I've noticed empirically that the variable names are more expressive when requested after the LLM has described the function, because in the second call its own comments are passed back to it.
In this case, my feeling is that the quality of the answer is more important than the speed, and that in the vast majority of cases doing the work in parallel will not be what the user wants even if that's available. Any thoughts?
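To make that dependency concrete, here is a rough sketch of the sequential flow, reusing the hypothetical ask() helper from the first sketch above; the prompts and names are placeholders, not the actual Gepetto implementation.

```python
# Hypothetical sketch of the sequential flow described above: the renaming
# question is only sent once the model's explanation is available, so that
# explanation can be passed back as context. Not Gepetto's actual code.
def explain_then_rename(pseudocode: str) -> tuple[str, str]:
    explanation = ask(f"Explain what this function does:\n{pseudocode}")
    # Feeding the model's own explanation back in tends to produce more
    # expressive variable names than asking both questions in parallel.
    names = ask(
        "Suggest better variable names for this function.\n"
        f"Function:\n{pseudocode}\n"
        f"Prior analysis:\n{explanation}"
    )
    return explanation, names
```

Given that ordering, running the two questions on separate GPUs would trade answer quality for a few seconds of latency.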