Very slow cold starts even with flashboot #111
@avacaondata Sorry for the inconvenience — I can understand the frustration. We changed the flow a bit to make it easier to update the vLLM version, but there's a model-caching feature we're rolling out soon that should solve this. The model is indeed downloaded once and loaded from disk afterwards, but let me check this as well.
Thanks for the quick reply and for your understanding @pandyamarut :)
If you build the model into the Docker image, it will definitely help with the cold start: loading the model from the host server's local disk into GPU vRAM is faster than pulling it from a network volume. You can also extend the idle timeout a bit so the worker doesn't go to sleep right after finishing a request, which helps avoid cold starts. And of course, keeping an active worker is the best way to prevent cold starts entirely.
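A minimal sketch of baking the weights into the image at build time, so the worker loads them from local disk instead of re-downloading on each cold start. The base image tag and build-arg names here are assumptions — check the worker's README for the exact values your version supports:

```dockerfile
# Sketch: download the model into an image layer during `docker build`.
# NOTE (assumptions): base image tag and build-arg names are illustrative;
# the vLLM worker image you use may expose different ones.
FROM runpod/worker-v1-vllm:stable-cuda12.1.0

ARG MODEL_NAME="meta-llama/Llama-3.1-8B"
ARG HF_TOKEN

# Fetch the weights at build time so they ship inside the image.
RUN python -c "from huggingface_hub import snapshot_download; \
    snapshot_download('${MODEL_NAME}', token='${HF_TOKEN}' or None)"

# Tell the worker which model to serve at runtime.
ENV MODEL_NAME=${MODEL_NAME}
```

Then build and push it, e.g. `docker build --build-arg HF_TOKEN=<your token> -t my-vllm-worker .`, and point the serverless endpoint at that image. The trade-off is a much larger image, but the weights are already on disk when the container starts.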
Hi, we're experiencing very slow cold starts even with FlashBoot enabled, and this didn't happen before with the same model architecture (Llama-3.1-8B; different custom versions, though). In fact, about a minute after the last request, the worker appears to initialize again, downloading weights, etc. We tried attaching a data storage volume, thinking that would lower cold-start times (our hypothesis was that the weights would be downloaded there once and then simply loaded from that disk on subsequent requests). However, that made things even worse, with delays going up to two minutes per request.
We're about to launch an AI-based app and have been using Runpod for development for some months now, but these delay times are not acceptable for the app to run properly. We need to scale down to zero to keep costs variable at the beginning, until we have more customers (then we will certainly use active workers).
Can you please provide a solution for this? Is there some way to configure Runpod serverless endpoints so that delay times return to where they were a week or two ago (1-2 s)? @alpayariyak @Jorghi12 @pandyamarut @justinmerrell @carlson-svg @mikljohansson @casper-hansen @joennlae @willsamu @rachfop @vladmihaisima
Thanks in advance.