
Very slow cold starts even with flashboot #111

Open · avacaondata opened this issue Sep 18, 2024 · 3 comments

@avacaondata

Hi, we're experiencing very slow cold starts even with flash boot enabled, and this didn't happen before with the same model architecture (Llama-3.1-8B; different custom versions, though). In fact, about a minute after the last request the worker appears to initialize again, downloading weights, etc. We've tried attaching a network volume to it, thinking that would lower cold-start times (our hypothesis was that the weights would be downloaded there once and then simply loaded from that disk on subsequent requests). However, that made things even worse, with delays going up to two minutes per request.
We're about to launch an AI-based app and have been using Runpod for development for some months now, but these delay times are not acceptable for the app to run properly. We need to scale down to zero to keep costs variable at the beginning, until we have more customers (then we will use active workers, for sure).
Can you please provide a solution for this? Is there some way to configure Runpod serverless endpoints so that delay times come back to where they were a week or two ago (1-2s)? @alpayariyak @Jorghi12 @pandyamarut @justinmerrell @carlson-svg @mikljohansson @casper-hansen @joennlae @willsamu @rachfop @vladmihaisima
Thanks in advance.

@pandyamarut (Collaborator) commented Sep 18, 2024

@avacaondata Sorry for the inconvenience; I understand the frustration. We changed the flow a bit to make it easier to update the vLLM version, but there's a model caching feature we are rolling out soon, and that should solve this.

The model is indeed downloaded once and then loaded from disk, but let me check this as well.
Alternatively, you can try baking the model into the Docker image.
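A minimal sketch of the bake-the-model-in approach, assuming you build a custom image on top of the worker: download the weights at image build time so they ship inside an image layer and are never pulled at cold start. The script name `download_model.py`, the `/models` directory, and the repo id below are placeholders, not names from this thread; the official worker-vllm image also exposes build arguments for this, so check its README for the exact names.

```python
# download_model.py -- run at image build time (e.g. `RUN python download_model.py`
# in your Dockerfile) so the weights become part of the image layer instead of
# being fetched when a worker cold-starts.
import os

from huggingface_hub import snapshot_download

# Placeholders: substitute your own model repo and target directory.
MODEL_REPO = os.environ.get("MODEL_REPO", "meta-llama/Llama-3.1-8B-Instruct")
MODEL_DIR = os.environ.get("MODEL_DIR", "/models/llama-3.1-8b")

snapshot_download(
    repo_id=MODEL_REPO,
    local_dir=MODEL_DIR,
    token=os.environ.get("HF_TOKEN"),  # required for gated repos such as Llama 3.1
)
print(f"Downloaded {MODEL_REPO} to {MODEL_DIR}")
```

At runtime you would then point the worker at that local path (for the vLLM worker this is an environment variable such as the model name/path setting; see the worker-vllm README for the exact variable) so vLLM loads the weights from local disk rather than downloading them.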

@avacaondata (Author)

Thanks for the quick reply and for your understanding @pandyamarut :)
Could you please guide us a little bit on best practices for deploying models effectively before that model caching feature release? We would really appreciate that.
If we pre-build the Docker image ourselves instead of using the web-interface vLLM template you provide, would that help? What other things can we do to improve cold-start times?
Thanks again for checking this out. I hope this can be fixed soon so that we can keep using Runpod for our production deployment; aside from this, our experience with your cloud has been very nice so far.

@Yhlong00

If you build the model into the Docker image, it'll definitely help with the cold start. Loading the model from the local disk on the host server into GPU VRAM is faster than pulling it from a network volume. You can also extend the idle timeout a bit so the worker doesn't go to sleep right after finishing a request, which helps avoid cold starts. And of course, keeping an active worker is the best way to prevent cold starts.
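To see whether these changes actually move the needle, a small sanity-check sketch is below. It assumes a standard serverless endpoint with the vLLM worker's `{"input": {"prompt": ...}}` request schema (the endpoint id, API key, and sampling parameters are placeholders); it times one request issued after the endpoint has scaled to zero and one issued right after, so you can compare cold vs. warm latency.

```python
# cold_start_check.py -- time a cold request (after scale-to-zero) and a warm
# follow-up request against the same serverless endpoint.
import os
import time

import requests

ENDPOINT_ID = os.environ["RUNPOD_ENDPOINT_ID"]  # placeholder: your endpoint id
API_KEY = os.environ["RUNPOD_API_KEY"]          # placeholder: your RunPod API key
URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"

payload = {"input": {"prompt": "Say hello.", "sampling_params": {"max_tokens": 16}}}
headers = {"Authorization": f"Bearer {API_KEY}"}

for label in ("cold", "warm"):
    start = time.time()
    resp = requests.post(URL, json=payload, headers=headers, timeout=600)
    resp.raise_for_status()
    print(f"{label} request: {time.time() - start:.1f}s, status={resp.json().get('status')}")
```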
