
Context menu “explain & rename” to take advantage of multiple GPUs #49

Open

mzpqnxow opened this issue Nov 24, 2024 · 1 comment

@mzpqnxow

Thank you for this project; I just started working with it yesterday. Really nice work.

I have two RTX 3090s, each with 24 GB of VRAM. I use Ollama to run models (thanks for adding the local_ollama module!)

Two details for context on this issue:

  • When the model is larger than 24 GB, it is loaded across both GPUs, as expected, and both are used concurrently to answer a question
  • When the model is smaller than 24 GB (fitting on one GPU), it runs most efficiently on a single GPU, and Ollama knows to use only one, so only one GPU is utilized

When the model is smaller, I can’t help but feel that this is a waste of the idle GPU and/or a waste of the user’s time (albeit a small one).

This is not a fault in Gepetto, but I was considering how it could be adapted to utilize both GPUs by adding the capability to use two independent instances of Ollama, one for each question.

I took a quick look at the Gepetto code, and I see that “explain” and “rename” are each a single question, so there’s no way to distribute either single operation across multiple Ollama instances.

The best I could come up with to make better use of both GPUs is to request each operation (“explain” and “rename”) roughly concurrently, so that each GPU is exercised.

In theory, this makes Gepetto twice as fast in those cases where the user wants to perform both operations.

As I mentioned, to actually accomplish this, at least one change in Gepetto is required: asking both questions concurrently, by adding an “explain & rename” context menu option that fires off both.

The second half of the solution could be implemented in Gepetto, or it could be left up to the user to solve. To actually use both GPUs, assuming Gepetto is first modified to ask both questions at once, there are two approaches, I think:

  1. Add logic to Gepetto to accept two or more API endpoints and send each question to a different endpoint, with each endpoint backed by a separate instance of Ollama (e.g. one on port 11434 and one on port 11435); see the sketch after this list
  2. Leave the Gepetto logic as it is and put the burden on the user to set up a simple round-robin proxy (either a layer 3/4 load balancer like HAProxy, or an HTTP load balancer like nginx with an upstream block) to distribute the questions automatically
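
For illustration, here is a minimal sketch of what option 1 could look like once Gepetto is modified to fire both questions at once. The endpoint URLs, model name, prompt text, and function names are placeholders rather than Gepetto’s actual API; the only concrete piece is Ollama’s /api/generate REST endpoint.

```python
# Minimal sketch (not Gepetto's actual code): dispatch the "explain" and "rename"
# prompts concurrently to two independent Ollama instances, one per GPU.
# Endpoint URLs, model name, and prompts are placeholders.
import concurrent.futures

import requests

ENDPOINTS = [
    "http://127.0.0.1:11434",  # Ollama instance pinned to GPU 0
    "http://127.0.0.1:11435",  # Ollama instance pinned to GPU 1
]
MODEL = "llama3.1"  # placeholder model name


def ask(endpoint: str, prompt: str) -> str:
    """Send a single non-streaming request to an Ollama /api/generate endpoint."""
    resp = requests.post(
        f"{endpoint}/api/generate",
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]


def explain_and_rename(pseudocode: str) -> tuple[str, str]:
    """Fire both questions at (roughly) the same time, one per endpoint."""
    explain_prompt = f"Explain what this function does:\n{pseudocode}"
    rename_prompt = f"Suggest better variable names for this function:\n{pseudocode}"
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        explain_future = pool.submit(ask, ENDPOINTS[0], explain_prompt)
        rename_future = pool.submit(ask, ENDPOINTS[1], rename_prompt)
        return explain_future.result(), rename_future.result()
```

Each Ollama instance would be started on its own port (e.g. via OLLAMA_HOST) and restricted to a single GPU (e.g. via CUDA_VISIBLE_DEVICES); that setup lives outside Gepetto itself.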

A few questions:

  • Are you interested in this? The change required would be to ask both questions (roughly) simultaneously by adding an “explain & rename” entry to the context menu. I understand you may not be interested in “double the speed”, as it’s not as impactful as it sounds - typically the response time is negligible (2-3 seconds). It might be a solution to a problem that doesn’t exist: is it worth saving the user a few seconds?
  • Do you have any thoughts about the two load-balancing options? If it were my project, I think I would lean towards simplicity and put it on the user to load balance. It avoids needing new logic in Gepetto and would be more flexible in the long run if more features are added. It also allows the user to adjust the LLM endpoint(s) on the fly more easily.
  • Do you have time to add this functionality yourself, or would you insist on a PR? It should be a relatively small effort, as I believe you already have an asynchronous implementation; it might just be a quick copy/paste or two.

If you don’t think this is worthwhile, please feel free to close the issue, with or without a comment.

Thanks again!

@JusticeRage
Owner

Hi! Thanks a lot for the detailed report, really appreciated! In theory, using the two GPUs in parallel could work, but the truth is that the two operations aren't really independent of one another. I've noticed empirically that the variable names are more expressive if they are requested after the LLM has described the function, because in the second call its own comments are passed back to it.
In this case, my feeling is that the quality of the answer is more important than the speed, and that in the vast majority of cases doing the work in parallel will not be what the user wants, even if it's available. Any thoughts?
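
In other words, the second call depends on the output of the first, roughly like this (a sketch with placeholder prompts, reusing the hypothetical ask() helper and ENDPOINTS from the snippet above, not Gepetto's actual code):

```python
# Rough illustration of the sequential dependency (not Gepetto's actual code):
# the rename prompt includes the explanation produced by the first call,
# which is what makes the suggested names more expressive.
def explain_then_rename(pseudocode: str) -> tuple[str, str]:
    explanation = ask(ENDPOINTS[0], f"Explain what this function does:\n{pseudocode}")
    rename_prompt = (
        f"Here is a function:\n{pseudocode}\n"
        f"Here is what it does:\n{explanation}\n"
        "Suggest more expressive variable names."
    )
    suggestions = ask(ENDPOINTS[0], rename_prompt)
    return explanation, suggestions
```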
