Very slow cold starts even with flashboot #111
@avacaondata Sorry for the inconvenience — I can understand the frustration. We changed the flow a bit to make it easier to update the vLLM version, but there's a model-caching feature we're rolling out soon that should solve this. The model is indeed downloaded once and loaded from disk afterwards, but let me check this as well.
Thanks for the quick reply and for your understanding @pandyamarut :)
If you build the model into the Docker image, it will definitely help with the cold start: loading the model from the host server's local disk into GPU vRAM is faster than pulling it from a network volume. You can also extend the idle timeout a bit so the worker doesn't go to sleep right after finishing a request, which helps avoid cold starts. And of course, keeping an active worker is the best way to prevent cold starts entirely.
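A minimal sketch of baking the weights into the image at build time, so the worker loads them from local disk instead of re-downloading on each cold start. The base image tag and build-arg names here are assumptions — check the worker's README for the exact values your version supports:

```dockerfile
# Sketch: download the model into an image layer during `docker build`.
# NOTE (assumptions): base image tag and build-arg names are illustrative;
# the vLLM worker image you use may expose different ones.
FROM runpod/worker-v1-vllm:stable-cuda12.1.0

ARG MODEL_NAME="meta-llama/Llama-3.1-8B"
ARG HF_TOKEN

# Fetch the weights at build time so they ship inside the image.
RUN python -c "from huggingface_hub import snapshot_download; \
    snapshot_download('${MODEL_NAME}', token='${HF_TOKEN}' or None)"

# Tell the worker which model to serve at runtime.
ENV MODEL_NAME=${MODEL_NAME}
```

Then build and push it, e.g. `docker build --build-arg HF_TOKEN=<your token> -t my-vllm-worker .`, and point the serverless endpoint at that image. The trade-off is a much larger image, but the weights are already on disk when the container starts.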
Hi, we're experiencing very slow cold starts even with FlashBoot enabled, and this didn't happen before with the same model architecture (Llama-3.1-8B; different custom versions, though). In fact, about a minute after the last request, the worker appears to initialize again, downloading weights, etc. We tried attaching a data storage volume, thinking that would lower cold-start times (our hypothesis was that the weights would be downloaded there once and then simply loaded from that disk on subsequent requests). However, that made things even worse, with delays going up to two minutes per request.
We're about to launch an AI-based app and have been using Runpod for development for some months now, but these delay times are not acceptable for the app to run properly. We need to scale down to zero to keep costs variable at the beginning, until we have more customers (then we will certainly use active workers).
Can you please provide a solution for this? Is there some way to configure Runpod serverless endpoints so that delay times return to where they were a week or two ago (1-2 s)? @alpayariyak @Jorghi12 @pandyamarut @justinmerrell @carlson-svg @mikljohansson @casper-hansen @joennlae @willsamu @rachfop @vladmihaisima
Thanks in advance.