
Deploying HuggingFace model/pipeline using uvicorn-gunicorn-fastapi-docker on Google Cloud Run #238

Open
GiorgioBarnabo opened this issue Feb 22, 2023 · 2 comments


@GiorgioBarnabo

Hi everybody,

I am pretty new to web app development and have some doubts about how to make the best use of this incredible Docker image.
In short, I have been trying to deploy a Hugging Face pipeline on Google Cloud Run using the uvicorn-gunicorn-fastapi-docker image. The model takes about 3.5 GB, while the underlying Cloud Run instance can have up to 16 vCPUs and 32 GB of RAM. At deployment time, I also need to manually specify the maximum number of concurrent requests per instance before autoscaling kicks in.

How should I set the number of workers/threads for Gunicorn/Uvicorn, and the size of the underlying Cloud Run instance? I noticed that every additional worker and/or thread needs another 3.5 GB of RAM. Also, during execution there is a memory leak, which means a worker needs to be restarted every now and then.

My naive guess is that I should have as many workers as vCPUs, and at least 3.5 GB of RAM times the number of workers. Is that correct? And what about the number of concurrent requests?

Right now, the Uvicorn command in my Dockerfile looks like this:

CMD uvicorn main:app --host 0.0.0.0 --port 8080 --workers 4 --access-log --use-colors

Nonetheless, with this setting, the RAM gets saturated after a while and the service breaks down :(
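To make the numbers concrete, this is roughly the memory math I have in mind (the ~1 GB of base overhead is just a guess on my part):

```python
# Back-of-the-envelope RAM estimate for one Cloud Run instance.
model_size_gb = 3.5      # size of the Hugging Face model/pipeline
workers = 4              # --workers 4 in the CMD above
base_overhead_gb = 1.0   # assumed overhead for the OS, Python, FastAPI, etc.

# Each worker process loads its own copy of the model, so RAM scales linearly.
required_ram_gb = workers * model_size_gb + base_overhead_gb
print(required_ram_gb)   # ~15 GB for 4 workers
```

If that math is right, anything below roughly 15 GB of instance RAM will eventually be saturated with 4 workers.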

Any help is more than welcome.

Thank you in advance. Best

@ahron1

ahron1 commented Jul 14, 2023

If you use def for the FastAPI path function, each incoming request is handled in a new thread (from a threadpool). There is a single copy of the model in the GPU.

If you use async def with N workers, a total of N worker processes are forked. Each request is handled by one of these N processes, and each of the N workers holds its own copy of the model in the GPU. Workers don't share memory or other resources.
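A minimal sketch of the two endpoint styles (the pipeline task, route names, and request shape are placeholders, not anything from the original question):

```python
from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()

# Loaded once per worker process at import time; with N workers there are
# N independent copies of the model in memory.
classifier = pipeline("sentiment-analysis")

# `def` endpoint: FastAPI runs it in a threadpool, so the worker's single
# model copy is shared by all request threads of that worker.
@app.post("/predict-sync")
def predict_sync(text: str):
    return classifier(text)

# `async def` endpoint: runs directly on the worker's event loop; the model
# call blocks the loop unless it is offloaded explicitly.
@app.post("/predict-async")
async def predict_async(text: str):
    return classifier(text)
```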

To decide the number of workers: N = number of CPU threads + 1.
You also need enough GPU memory to comfortably fit N copies of the model.

So if you are GPU-limited, that is the criterion to use for deciding the number of workers.
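For example, with the 3.5 GB model from the question and a hypothetical 16 GB GPU (the GPU size is purely illustrative):

```python
import math

model_size_gb = 3.5   # model size from the question
gpu_memory_gb = 16.0  # hypothetical GPU, for illustration only

# Each worker keeps its own copy of the model, so GPU memory caps the worker count.
max_workers = math.floor(gpu_memory_gb / model_size_gb)
print(max_workers)    # 4 -- in practice leave headroom for activations and the CUDA context
```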

What I wrote above is based on what I observed in a few tests. It might well be incorrect.

@ahron1

ahron1 commented Jul 14, 2023

I would also recommend using Gunicorn (managing Uvicorn workers) instead of plain Uvicorn to run the app.
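Gunicorn can be driven by a Python config file. A sketch of a gunicorn.conf.py (the worker count, max_requests, and timeout values are illustrative, not tuned recommendations), which also recycles workers periodically to contain the memory leak mentioned above:

```python
# gunicorn.conf.py -- illustrative values only
bind = "0.0.0.0:8080"
worker_class = "uvicorn.workers.UvicornWorker"  # run the ASGI (FastAPI) app under Gunicorn
workers = 4                                     # roughly one per vCPU; each needs ~3.5 GB RAM here

# Restart each worker after it has served this many requests, so a slow
# memory leak never accumulates indefinitely.
max_requests = 1000
max_requests_jitter = 100  # stagger restarts so workers don't all recycle at once

timeout = 120  # model inference can be slow; don't kill busy workers too early
```

The Dockerfile CMD then becomes something along the lines of `gunicorn main:app -c gunicorn.conf.py`.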
