feat: support runai streamer for vllm #423

Open · wants to merge 19 commits into main

Conversation

@cr7258 (Contributor) commented May 19, 2025

What this PR does / why we need it

Add a new runai-streamer config to the vLLM BackendRuntime to allow loading models with the Run:ai Model Streamer, which improves model loading times. Currently, only vLLM supports the Run:ai Model Streamer.
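
For illustration, here is a rough sketch of what such a BackendRuntime config entry might look like. The args mirror the flags visible in the rendered pod later in this thread; the field layout and the template placeholders are assumptions, not the exact schema:

- name: runai-streamer
  args:
    - --model
    - "{{ .ModelPath }}"   # placeholder; the real template variable name may differ
    - --served-model-name
    - "{{ .ModelName }}"   # placeholder
    - --host
    - "0.0.0.0"
    - --port
    - "8080"
    - --load-format        # the only streamer-specific addition
    - runai_streamer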

[RunAI Streamer] Overall time to stream 942.3 MiB of all files: 0.18s, 5.0 GiB/s

kubectl logs qwen2-0--5b-0
Defaulted container "model-runner" out of: model-runner, model-loader (init)
INFO 05-19 02:00:18 __init__.py:207] Automatically detected platform cuda.
INFO 05-19 02:00:18 api_server.py:912] vLLM API server version 0.7.3
INFO 05-19 02:00:18 api_server.py:913] args: Namespace(host='0.0.0.0', port=8080, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='/workspace/models/models--Qwen--Qwen2-0.5B-Instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='runai_streamer', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['qwen2-0--5b'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, disable_log_requests=False, 
max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
INFO 05-19 02:00:18 api_server.py:209] Started engine process with PID 22
INFO 05-19 02:00:22 __init__.py:207] Automatically detected platform cuda.
INFO 05-19 02:00:24 config.py:549] This model supports multiple tasks: {'classify', 'embed', 'reward', 'score', 'generate'}. Defaulting to 'generate'.
INFO 05-19 02:00:28 config.py:549] This model supports multiple tasks: {'reward', 'classify', 'embed', 'score', 'generate'}. Defaulting to 'generate'.
INFO 05-19 02:00:28 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/workspace/models/models--Qwen--Qwen2-0.5B-Instruct', speculative_config=None, tokenizer='/workspace/models/models--Qwen--Qwen2-0.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.RUNAI_STREAMER, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=qwen2-0--5b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 
INFO 05-19 02:00:29 cuda.py:229] Using Flash Attention backend.
INFO 05-19 02:00:30 model_runner.py:1110] Starting to load model /workspace/models/models--Qwen--Qwen2-0.5B-Instruct...
Loading safetensors using Runai Model Streamer:   0% Completed | 0/1 [00:00<?, ?it/s]
[RunAI Streamer] CPU Buffer size: 942.3 MiB for file: model.safetensors
Read throughput is 9.41 GB per second 
Loading safetensors using Runai Model Streamer: 100% Completed | 1/1 [00:00<00:00,  5.47it/s]
Loading safetensors using Runai Model Streamer: 100% Completed | 1/1 [00:00<00:00,  5.47it/s]

[RunAI Streamer] Overall time to stream 942.3 MiB of all files: 0.18s, 5.0 GiB/s
INFO 05-19 02:00:30 model_runner.py:1115] Loading model weights took 0.9277 GB
INFO 05-19 02:00:31 worker.py:267] Memory profiling takes 0.88 seconds
INFO 05-19 02:00:31 worker.py:267] the current vLLM instance can use total_gpu_memory (22.18GiB) x gpu_memory_utilization (0.90) = 19.97GiB
INFO 05-19 02:00:31 worker.py:267] model weights take 0.93GiB; non_torch_memory takes 0.06GiB; PyTorch activation peak memory takes 1.44GiB; the rest of the memory reserved for KV Cache is 17.54GiB.
INFO 05-19 02:00:31 executor_base.py:111] # cuda blocks: 95795, # CPU blocks: 21845
INFO 05-19 02:00:31 executor_base.py:116] Maximum concurrency for 32768 tokens per request: 46.77x
INFO 05-19 02:00:36 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:12<00:00,  2.78it/s]
INFO 05-19 02:00:49 model_runner.py:1562] Graph capturing finished in 13 secs, took 0.15 GiB
INFO 05-19 02:00:49 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 18.59 seconds
INFO 05-19 02:00:50 api_server.py:958] Starting vLLM API server on http://0.0.0.0:8080
INFO 05-19 02:00:50 launcher.py:23] Available routes are:
INFO 05-19 02:00:50 launcher.py:31] Route: /openapi.json, Methods: HEAD, GET
INFO 05-19 02:00:50 launcher.py:31] Route: /docs, Methods: HEAD, GET
INFO 05-19 02:00:50 launcher.py:31] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 05-19 02:00:50 launcher.py:31] Route: /redoc, Methods: HEAD, GET
INFO 05-19 02:00:50 launcher.py:31] Route: /health, Methods: GET
INFO 05-19 02:00:50 launcher.py:31] Route: /ping, Methods: POST, GET
INFO 05-19 02:00:50 launcher.py:31] Route: /tokenize, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /detokenize, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /v1/models, Methods: GET
INFO 05-19 02:00:50 launcher.py:31] Route: /version, Methods: GET
INFO 05-19 02:00:50 launcher.py:31] Route: /v1/chat/completions, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /v1/completions, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /v1/embeddings, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /pooling, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /score, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /v1/score, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /v1/audio/transcriptions, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /rerank, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /v1/rerank, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /v2/rerank, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /invocations, Methods: POST
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     240.243.170.78:46952 - "GET /health HTTP/1.1" 200 OK
INFO:     240.243.170.78:46958 - "GET /health HTTP/1.1" 200 OK
INFO:     240.243.170.78:35464 - "GET /health HTTP/1.1" 200 OK
INFO:     240.243.170.78:35468 - "GET /health HTTP/1.1" 200 OK
INFO:     240.243.170.78:35482 - "GET /health HTTP/1.1" 200 OK
INFO:     240.243.170.78:51796 - "GET /health HTTP/1.1" 200 OK

Which issue(s) this PR fixes

Fixes #352

Special notes for your reviewer

Does this PR introduce a user-facing change?

support runai streamer for vllm

@InftyAI-Agent added the needs-triage, needs-priority, and do-not-merge/needs-kind labels on May 19, 2025
@InftyAI-Agent requested a review from kerthcet on May 19, 2025 09:11
@cr7258 (Contributor, Author) commented May 19, 2025

/kind feature

@InftyAI-Agent added the feature label and removed the do-not-merge/needs-kind label on May 19, 2025
@kerthcet (Member) commented:

What we hope to achieve here is generally two things:

  • can we move this into our model loader component and make it inference-agnostic?
  • can we support GPUs other than NVIDIA?

Both need experiments; sorry I didn't explain this clearly here. The original comment: #352 (comment)

The configuration is already open to users, so I don't think we need to do anything.

@cr7258 (Contributor, Author) commented May 20, 2025

> can we move this into our model loader component and make it inference-agnostic?

As I understand it, the model loader is responsible for downloading models from remote storage, such as Hugging Face or OSS, to the local disk. When the inference container starts, it uses the model that has already been downloaded locally.

Run:ai Model Streamer can speed up model loading by concurrently loading already-read tensors into the GPU while continuing to read other tensors from storage. This acceleration happens after the model has been downloaded locally, so I don't think there is anything to do in the model loader to support Run:ai Model Streamer.

Additionally, Run:ai Model Streamer is not inference-agnostic — it requires integration with an inference engine, and currently only vLLM is supported. (Related PR)

@kerthcet (Member) commented:

I thought about this a bit and I think you're right: there's nothing for us to do here. The original idea was to explore whether we could load the models into the GPU and send the GPU allocation address to the inference engine. However, it seems no engine supports this now or in the foreseeable future.

But one thing we should be careful about is that we still load the models to disk rather than streaming CPU buffer -> GPU memory. So I suggest we add an annotation to the Playground | Inference Service; then in orchestration, once we detect that the Inference Service has the annotation, we will not construct the initContainer and will not render the ModelPath in the arguments, so the inference engine handles all the loading logic.

Would you like to refactor the PR based on this? @cr7258

@@ -77,6 +77,26 @@ spec:
limits:
cpu: 8
memory: 16Gi
- name: runai-streamer
Member:

It can be part of the example but I wouldn't like to make it part of the default template.

@cr7258 (Contributor, Author) commented May 26, 2025

@kerthcet Ok, I'll refactor the PR this week.

@cr7258 (Contributor, Author) commented Jun 1, 2025

@kerthcet I have refactored the PR according to your suggestion. Please take a look, thanks.

With this PR, the Run:ai Model Streamer can be used with two streaming approaches:

  • Streaming from S3 (S3 -> CPU buffer -> GPU): use the llmaz.io/skip-model-loader: "true" annotation (added on the OpenModel) to skip the model-loader initContainer; vLLM then loads the model directly from S3 (a rough sketch follows this list). example
  • Streaming from a file system (Hugging Face -> local disk -> CPU buffer -> GPU): the model-loader initContainer first downloads the model to local disk; the Streamer then concurrently reads tensor data from the files into a dedicated CPU buffer and transfers the tensors to GPU memory. example
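
A rough sketch of the S3 approach described above, mirroring the example used in this thread (the annotation lives on the OpenModel at this point, and the exact OpenModel fields may differ slightly):

apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: deepseek-r1-distill-qwen-1-5b
  annotations:
    llmaz.io/skip-model-loader: "true"  # skip the model-loader initContainer
spec:
  familyName: deepseek
  source:
    uri: s3://cr7258/DeepSeek-R1-Distill-Qwen-1.5B  # vLLM streams the weights directly from S3

The AWS credentials are injected into the model-runner container from a secret (aws-access-secret in the pod spec below).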

Here are the logs and the LLM pod for streaming from S3.

 kubectl logs deepseek-r1-distill-qwen-1-5b-0

INFO 06-01 17:42:57 __init__.py:207] Automatically detected platform cuda.
INFO 06-01 17:42:57 api_server.py:912] vLLM API server version 0.7.3
INFO 06-01 17:42:57 api_server.py:913] args: Namespace(host='0.0.0.0', port=8080, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='s3://cr7258/DeepSeek-R1-Distill-Qwen-1.5B', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='runai_streamer', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['deepseek-r1-distill-qwen-1-5b'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, 
disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
INFO 06-01 17:42:57 api_server.py:209] Started engine process with PID 21
INFO 06-01 17:43:02 __init__.py:207] Automatically detected platform cuda.
INFO 06-01 17:43:05 config.py:549] This model supports multiple tasks: {'embed', 'reward', 'generate', 'classify', 'score'}. Defaulting to 'generate'.
WARNING 06-01 17:43:05 arg_utils.py:1187] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 06-01 17:43:05 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 06-01 17:43:09 config.py:549] This model supports multiple tasks: {'score', 'reward', 'generate', 'embed', 'classify'}. Defaulting to 'generate'.
WARNING 06-01 17:43:09 arg_utils.py:1187] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 06-01 17:43:09 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 06-01 17:43:09 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/tmp/tmp7oj9ysi2', speculative_config=None, tokenizer='/tmp/tmpkhc7j_h4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.RUNAI_STREAMER, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=deepseek-r1-distill-qwen-1-5b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 
INFO 06-01 17:43:10 cuda.py:229] Using Flash Attention backend.
INFO 06-01 17:43:11 model_runner.py:1110] Starting to load model /tmp/tmp7oj9ysi2...
Loading safetensors using Runai Model Streamer:   0% Completed | 0/1 [00:00<?, ?it/s]
[RunAI Streamer] CPU Buffer size: 3.3 GiB for file: model.safetensors
Read throughput is 598.55 MB per second 
Loading safetensors using Runai Model Streamer: 100% Completed | 1/1 [00:06<00:00,  6.26s/it]
Loading safetensors using Runai Model Streamer: 100% Completed | 1/1 [00:06<00:00,  6.26s/it]

[RunAI Streamer] Overall time to stream 3.3 GiB of all files: 6.26s, 541.5 MiB/s
INFO 06-01 17:43:18 model_runner.py:1115] Loading model weights took 3.3460 GB
INFO 06-01 17:43:18 worker.py:267] Memory profiling takes 0.52 seconds
INFO 06-01 17:43:18 worker.py:267] the current vLLM instance can use total_gpu_memory (22.18GiB) x gpu_memory_utilization (0.90) = 19.97GiB
INFO 06-01 17:43:18 worker.py:267] model weights take 3.35GiB; non_torch_memory takes 0.06GiB; PyTorch activation peak memory takes 1.39GiB; the rest of the memory reserved for KV Cache is 15.17GiB.
INFO 06-01 17:43:19 executor_base.py:111] # cuda blocks: 35510, # CPU blocks: 9362
INFO 06-01 17:43:19 executor_base.py:116] Maximum concurrency for 131072 tokens per request: 4.33x
INFO 06-01 17:43:24 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:13<00:00,  2.55it/s]
INFO 06-01 17:43:38 model_runner.py:1562] Graph capturing finished in 14 secs, took 0.20 GiB
INFO 06-01 17:43:38 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 20.20 seconds
INFO 06-01 17:43:39 api_server.py:958] Starting vLLM API server on http://0.0.0.0:8080
INFO 06-01 17:43:39 launcher.py:23] Available routes are:
INFO 06-01 17:43:39 launcher.py:31] Route: /openapi.json, Methods: GET, HEAD
INFO 06-01 17:43:39 launcher.py:31] Route: /docs, Methods: GET, HEAD
INFO 06-01 17:43:39 launcher.py:31] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 06-01 17:43:39 launcher.py:31] Route: /redoc, Methods: GET, HEAD
INFO 06-01 17:43:39 launcher.py:31] Route: /health, Methods: GET
INFO 06-01 17:43:39 launcher.py:31] Route: /ping, Methods: GET, POST
INFO 06-01 17:43:39 launcher.py:31] Route: /tokenize, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /detokenize, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /v1/models, Methods: GET
INFO 06-01 17:43:39 launcher.py:31] Route: /version, Methods: GET
INFO 06-01 17:43:39 launcher.py:31] Route: /v1/chat/completions, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /v1/completions, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /v1/embeddings, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /pooling, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /score, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /v1/score, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /v1/audio/transcriptions, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /rerank, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /v1/rerank, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /v2/rerank, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /invocations, Methods: POST
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     240.243.170.78:57500 - "GET /health HTTP/1.1" 200 OK
INFO:     240.243.170.78:57502 - "GET /health HTTP/1.1" 200 OK
INFO:     240.243.170.78:45068 - "GET /health HTTP/1.1" 200 OK
INFO:     240.243.170.78:45082 - "GET /health HTTP/1.1" 200 OK
kubectl get pod deepseek-r1-distill-qwen-1-5b-0 -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/containerID: 381531474d3e5bce66c8d41227870222869255d3c66fd4d99eb3656aed46b06f
    cni.projectcalico.org/podIP: 100.64.1.35/32
    cni.projectcalico.org/podIPs: 100.64.1.35/32
    leaderworkerset.sigs.k8s.io/size: "1"
  creationTimestamp: "2025-06-02T00:38:55Z"
  generateName: deepseek-r1-distill-qwen-1-5b-
  labels:
    apps.kubernetes.io/pod-index: "0"
    controller-revision-hash: deepseek-r1-distill-qwen-1-5b-854446fd48
    leaderworkerset.sigs.k8s.io/group-index: "0"
    leaderworkerset.sigs.k8s.io/group-key: fdd0812c01eb16406e88b2bc006cddb7081625d8
    leaderworkerset.sigs.k8s.io/name: deepseek-r1-distill-qwen-1-5b
    leaderworkerset.sigs.k8s.io/template-revision-hash: 57c5d68dc6
    leaderworkerset.sigs.k8s.io/worker-index: "0"
    llmaz.io/model-family-name: deepseek
    llmaz.io/model-name: deepseek-r1-distill-qwen-1-5b
    statefulset.kubernetes.io/pod-name: deepseek-r1-distill-qwen-1-5b-0
  name: deepseek-r1-distill-qwen-1-5b-0
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: StatefulSet
    name: deepseek-r1-distill-qwen-1-5b
    uid: 3e95a2d9-db74-42c6-8f17-db378e29dc99
  resourceVersion: "40962332"
  uid: 6f5be468-2de2-4ecd-9edd-67a3fdd00fce
spec:
  containers:
  - args:
    - --model
    - s3://cr7258/DeepSeek-R1-Distill-Qwen-1.5B
    - --served-model-name
    - deepseek-r1-distill-qwen-1-5b
    - --host
    - 0.0.0.0
    - --port
    - "8080"
    - --load-format
    - runai_streamer
    command:
    - python3
    - -m
    - vllm.entrypoints.openai.api_server
    env:
    - name: LWS_LEADER_ADDRESS
      value: deepseek-r1-distill-qwen-1-5b-0.deepseek-r1-distill-qwen-1-5b.default
    - name: LWS_GROUP_SIZE
      value: "1"
    - name: LWS_WORKER_INDEX
      value: "0"
    - name: RUNAI_STREAMER_S3_REQUEST_TIMEOUT_MS
      value: "10000"
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          key: AWS_ACCESS_KEY_ID
          name: aws-access-secret
          optional: true
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          key: AWS_SECRET_ACCESS_KEY
          name: aws-access-secret
          optional: true
    - name: KUBERNETES_SERVICE_HOST
      value: api.seven.perfx-k8s.internal.canary.k8s.ondemand.com
    image: vllm/vllm-openai:v0.7.3
    imagePullPolicy: IfNotPresent
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - |
            while true; do
              RUNNING=$(curl -s http://localhost:8000/metrics | grep 'vllm:num_requests_running' | grep -v '#' | awk '{print $2}')
              WAITING=$(curl -s http://localhost:8000/metrics | grep 'vllm:num_requests_waiting' | grep -v '#' | awk '{print $2}')
              if [ "$RUNNING" = "0.0" ] && [ "$WAITING" = "0.0" ]; then
                echo "Terminating: No active or waiting requests, safe to terminate" >> /proc/1/fd/1
                exit 0
              else
                echo "Terminating: Running: $RUNNING, Waiting: $WAITING" >> /proc/1/fd/1
                sleep 5
              fi
            done
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /health
        port: 8080
        scheme: HTTP
      initialDelaySeconds: 15
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: model-runner
    ports:
    - containerPort: 8080
      name: http
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /health
        port: 8080
        scheme: HTTP
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      limits:
        cpu: "4"
        memory: 16Gi
        nvidia.com/gpu: "1"
      requests:
        cpu: "4"
        memory: 16Gi
        nvidia.com/gpu: "1"
    startupProbe:
      failureThreshold: 30
      httpGet:
        path: /health
        port: 8080
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /dev/shm
      name: dshm
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-p2ldw
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: deepseek-r1-distill-qwen-1-5b-0
  nodeName: ip-10-180-67-112.ec2.internal
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  subdomain: deepseek-r1-distill-qwen-1-5b
  terminationGracePeriodSeconds: 130
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - emptyDir:
      medium: Memory
      sizeLimit: 2Gi
    name: dshm
  - name: kube-api-access-p2ldw
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2025-06-02T00:42:47Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2025-06-02T00:38:55Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2025-06-02T00:43:45Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2025-06-02T00:43:45Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2025-06-02T00:38:55Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://59115f054d95b2c234cb440697e6ac28540806f8bc47949c637f1af8a6445f0d
    image: docker.io/vllm/vllm-openai:v0.7.3
    imageID: docker.io/vllm/vllm-openai@sha256:4f4037303e8c7b69439db1077bb849a0823517c0f785b894dc8e96d58ef3a0c2
    lastState: {}
    name: model-runner
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2025-06-02T00:42:46Z"
  hostIP: 10.180.67.112
  hostIPs:
  - ip: 10.180.67.112
  phase: Running
  podIP: 100.64.1.35
  podIPs:
  - ip: 100.64.1.35
  qosClass: Guaranteed
  startTime: "2025-06-02T00:38:55Z"

@cr7258 requested a review from kerthcet on June 1, 2025 15:39
@kerthcet (Member) left a comment:

Can we have an integration test?

@@ -30,6 +30,7 @@ var _ ModelSourceProvider = &URIProvider{}

const (
OSS = "OSS"
S3 = "S3"
Member:

I think we support GCS as well.

Contributor Author:

Done

@@ -111,6 +112,10 @@ func (w *OpenModelWebhook) generateValidate(obj runtime.Object) field.ErrorList
if _, _, _, err := util.ParseOSS(address); err != nil {
allErrs = append(allErrs, field.Invalid(sourcePath.Child("uri"), *model.Spec.Source.URI, "URI with wrong address"))
}
case modelSource.S3:
Member:

GCS as well here.

Contributor Author:

Done

@@ -164,3 +170,36 @@ func (p *ModelHubProvider) InjectModelLoader(template *corev1.PodTemplateSpec, i
func spreadEnvToInitContainer(containerEnv []corev1.EnvVar, initContainer *corev1.Container) {
initContainer.Env = append(initContainer.Env, containerEnv...)
}

func (p *ModelHubProvider) InjectModelEnvVars(template *corev1.PodTemplateSpec) {
Member:

I think we have already injected the HF token above at L115. Keeping one is enough.

Contributor Author:

The InjectModelEnvVars function injects the model credentials into the model-runner container instead of the model-loader initContainer, for the case where the model-runner container handles the model loading itself.
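
For reference, the injected credentials on the model-runner container end up looking roughly like the snippet below, mirroring the pod specs elsewhere in this thread (secret and key names are the ones used in these examples):

env:
- name: HUGGING_FACE_HUB_TOKEN
  valueFrom:
    secretKeyRef:
      name: modelhub-secret
      key: HF_TOKEN
      optional: true
- name: HF_TOKEN
  valueFrom:
    secretKeyRef:
      name: modelhub-secret
      key: HF_TOKEN
      optional: true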

kind: OpenModel
metadata:
name: deepseek-r1-distill-qwen-1-5b
annotations:
@kerthcet (Member) commented Jun 2, 2025:

Can we control this in the Playground and iSVC? I think it's more flexible there.

Contributor Author:

A Playground may be associated with multiple OpenModels. For example, opt-350m is loaded by the model-runner container itself, while opt-125m is loaded by the model-loader initContainer. Therefore, I think we should place the annotation on the OpenModel.

apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: opt-350m
  annotations:
    llmaz.io/skip-model-loader: "true"
spec:
  familyName: opt
  source:
    modelHub:
      modelID: facebook/opt-350m
  inferenceConfig:
    flavors:
      - name: a10 # gpu type
        limits:
          nvidia.com/gpu: 1
---
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: opt-125m
spec:
  familyName: opt
  source:
    modelHub:
      modelID: facebook/opt-125m
---
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: vllm-speculator
spec:
  replicas: 1
  modelClaims:
    models:
    - name: opt-350m # the target model
      role: main
    - name: opt-125m  # the draft model
      role: draft

The final LLM pod looks like this:

kubectl get pod vllm-speculator-0 -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/containerID: 731abc79f3fd2ed667a0003e41100bba64fcaf459850c48de837899226798cb7
    cni.projectcalico.org/podIP: 100.64.8.29/32
    cni.projectcalico.org/podIPs: 100.64.8.29/32
    leaderworkerset.sigs.k8s.io/size: "1"
  creationTimestamp: "2025-06-02T07:40:16Z"
  generateName: vllm-speculator-
  labels:
    apps.kubernetes.io/pod-index: "0"
    controller-revision-hash: vllm-speculator-657b7dc4cc
    leaderworkerset.sigs.k8s.io/group-index: "0"
    leaderworkerset.sigs.k8s.io/group-key: fe5f4052a1971b9c5d3ea770f2809e11105693b8
    leaderworkerset.sigs.k8s.io/name: vllm-speculator
    leaderworkerset.sigs.k8s.io/template-revision-hash: 96599544f
    leaderworkerset.sigs.k8s.io/worker-index: "0"
    llmaz.io/model-family-name: opt
    llmaz.io/model-name: opt-350m
    statefulset.kubernetes.io/pod-name: vllm-speculator-0
  name: vllm-speculator-0
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: StatefulSet
    name: vllm-speculator
    uid: 930ce4cf-ead5-4ac8-ac1c-3bb9ab8f8866
  resourceVersion: "41268933"
  uid: 3eba84fd-e569-49b4-9982-2eb7187f2feb
spec:
  containers:
  - args:
    - --model
    - facebook/opt-350m
    - --served-model-name
    - opt-350m
    - --speculative_model
    - /workspace/models/models--facebook--opt-125m
    - --host
    - 0.0.0.0
    - --port
    - "8080"
    - --num_speculative_tokens
    - "5"
    - -tp
    - "1"
    command:
    - python3
    - -m
    - vllm.entrypoints.openai.api_server
    env:
    - name: LWS_LEADER_ADDRESS
      value: vllm-speculator-0.vllm-speculator.default
    - name: LWS_GROUP_SIZE
      value: "1"
    - name: LWS_WORKER_INDEX
      value: "0"
    - name: HUGGING_FACE_HUB_TOKEN
      valueFrom:
        secretKeyRef:
          key: HF_TOKEN
          name: modelhub-secret
          optional: true
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          key: HF_TOKEN
          name: modelhub-secret
          optional: true
    - name: KUBERNETES_SERVICE_HOST
      value: api.seven.perfx-k8s.internal.canary.k8s.ondemand.com
    image: vllm/vllm-openai:v0.7.3
    imagePullPolicy: IfNotPresent
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - |
            while true; do
              RUNNING=$(curl -s http://localhost:8000/metrics | grep 'vllm:num_requests_running' | grep -v '#' | awk '{print $2}')
              WAITING=$(curl -s http://localhost:8000/metrics | grep 'vllm:num_requests_waiting' | grep -v '#' | awk '{print $2}')
              if [ "$RUNNING" = "0.0" ] && [ "$WAITING" = "0.0" ]; then
                echo "Terminating: No active or waiting requests, safe to terminate" >> /proc/1/fd/1
                exit 0
              else
                echo "Terminating: Running: $RUNNING, Waiting: $WAITING" >> /proc/1/fd/1
                sleep 5
              fi
            done
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /health
        port: 8080
        scheme: HTTP
      initialDelaySeconds: 15
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: model-runner
    ports:
    - containerPort: 8080
      name: http
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /health
        port: 8080
        scheme: HTTP
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      limits:
        cpu: "8"
        memory: 16Gi
        nvidia.com/gpu: "1"
      requests:
        cpu: "4"
        memory: 8Gi
        nvidia.com/gpu: "1"
    startupProbe:
      failureThreshold: 30
      httpGet:
        path: /health
        port: 8080
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /dev/shm
      name: dshm
    - mountPath: /workspace/models/
      name: model-volume
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-zbsfx
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: vllm-speculator-0
  initContainers:
  - env:
    - name: LWS_LEADER_ADDRESS
      value: vllm-speculator-0.vllm-speculator.default
    - name: LWS_GROUP_SIZE
      value: "1"
    - name: LWS_WORKER_INDEX
      value: "0"
    - name: MODEL_SOURCE_TYPE
      value: modelhub
    - name: MODEL_ID
      value: facebook/opt-125m
    - name: MODEL_HUB_NAME
      value: Huggingface
    - name: REVISION
      value: main
    - name: HUGGING_FACE_HUB_TOKEN
      valueFrom:
        secretKeyRef:
          key: HF_TOKEN
          name: modelhub-secret
          optional: true
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          key: HF_TOKEN
          name: modelhub-secret
          optional: true
    - name: KUBERNETES_SERVICE_HOST
      value: api.seven.perfx-k8s.internal.canary.k8s.ondemand.com
    image: inftyai/model-loader:v0.0.10
    imagePullPolicy: IfNotPresent
    name: model-loader-1
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /workspace/models/
      name: model-volume
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-zbsfx
      readOnly: true
  nodeName: ip-10-180-71-146.ec2.internal
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  subdomain: vllm-speculator
  terminationGracePeriodSeconds: 130
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - emptyDir:
      medium: Memory
      sizeLimit: 2Gi
    name: dshm
  - emptyDir: {}
    name: model-volume
  - name: kube-api-access-zbsfx
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2025-06-02T07:40:17Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2025-06-02T07:40:16Z"
    message: 'containers with incomplete status: [model-loader-1]'
    reason: ContainersNotInitialized
    status: "False"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2025-06-02T07:40:16Z"
    message: 'containers with unready status: [model-runner]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2025-06-02T07:40:16Z"
    message: 'containers with unready status: [model-runner]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2025-06-02T07:40:16Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: vllm/vllm-openai:v0.7.3
    imageID: ""
    lastState: {}
    name: model-runner
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        reason: PodInitializing
  hostIP: 10.180.71.146
  hostIPs:
  - ip: 10.180.71.146
  initContainerStatuses:
  - containerID: containerd://90b6cc1592273be115f75c8d813812884caa78f7dcc28b22205852747eea701c
    image: docker.io/inftyai/model-loader:v0.0.10
    imageID: docker.io/inftyai/model-loader@sha256:b67a8bb3acbc496a62801b2110056b9774e52ddc029b379c7370113c7879c7d9
    lastState: {}
    name: model-loader-1
    ready: false
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2025-06-02T07:40:17Z"
  phase: Pending
  podIP: 100.64.8.29
  podIPs:
  - ip: 100.64.8.29
  qosClass: Burstable
  startTime: "2025-06-02T07:40:16Z"

Member:

I still think we should leave the knob at the Playground level. For example, in the future when we have a cache layer, people can still choose to load models from S3 directly or from the cache, especially in performance comparison tests.

Member:

But we can follow up on this later, of course.

Member:

Personally, I don't think loading different models with different approaches is a typical example.

Contributor Author:

Ok, I'll refactor it tomorrow.

@cr7258 requested a review from kerthcet on June 2, 2025 08:20
modelInfo["DraftModelPath"] = modelSource.NewModelSourceProvider(p.models[1]).ModelPath()
skipModelLoader = false
draftModel := p.models[1]
if annotations := draftModel.GetAnnotations(); annotations != nil {
Member:

The overall logic would be simpler if we extract the annotation from the isvc.

// Return once not the main model, because all the below has already been injected.
if index != 0 {
return
func spreadEnvToInitContainer(containerEnv []corev1.EnvVar, initContainer *corev1.Container) {
Member:

Again, I'd like to see the same behavior across all the models.

@cr7258 (Contributor, Author) commented Jun 4, 2025

@kerthcet I have moved the annotation to the Playground. Please review it, thanks.
Regarding the e2e test failures: we don't have GPU resources, so we can't test against the vLLM BackendRuntime, right?
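
For reference, a minimal sketch of a Playground carrying the annotation after this change (names mirror the earlier examples, and the modelClaims shape follows the Playground example above; the exact fields may differ):

apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: deepseek-r1-distill-qwen-1-5b
  annotations:
    llmaz.io/skip-model-loader: "true"  # the engine loads the model itself; no model-loader initContainer
spec:
  replicas: 1
  modelClaims:
    models:
    - name: deepseek-r1-distill-qwen-1-5b
      role: main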

@cr7258 requested a review from kerthcet on June 4, 2025 05:24
@kerthcet (Member) left a comment:

We have no GPU nodes for tests right now. Can we just comment out the asserts about service readiness? We can uncomment them in the future.

@@ -168,12 +168,33 @@ func buildWorkloadApplyConfiguration(service *inferenceapi.Service, models []*co
func injectModelProperties(template *applyconfigurationv1.LeaderWorkerTemplateApplyConfiguration, models []*coreapi.OpenModel, service *inferenceapi.Service) {
isMultiNodesInference := template.LeaderTemplate != nil

// Skip model-loader initContainer if llmaz.io/skip-model-loader annotation is set.
skipModelLoader := false
if annotations := service.GetAnnotations(); annotations != nil {
Member:

Let's make it a helper func so we can reuse it:

func SkipModelLoader(obj metav1.Object) bool {
	if annotations := obj.GetAnnotations(); annotations != nil {
		return annotations[inferenceapi.SkipModelLoaderAnnoKey] == "true"
	}
	return false
}

Contributor Author:

Done

})

ginkgo.It("Deploy S3 model with llmaz.io/skip-model-loader annotation", func() {
model := wrapper.MakeModel("deepseek-r1-distill-qwen-1-5b").FamilyName("deepseek").ModelSourceWithURI("s3://test-bucket/DeepSeek-R1-Distill-Qwen-1.5B").Obj()
Member:

Is this a valid address: s3://test-bucket/DeepSeek-R1-Distill-Qwen-1.5B? If not, I guess it will never succeed.

Contributor Author:

No, can we have a community-dedicated S3 bucket for testing?

Member:

Right now, no. I'll try to set one up this weekend.

@@ -244,6 +244,134 @@ func ValidateServicePods(ctx context.Context, k8sClient client.Client, service *
}).Should(gomega.Succeed())
}

// ValidateSkipModelLoaderService validates the Playground resource with llmaz.io/skip-model-loader annotation
func ValidateSkipModelLoaderService(ctx context.Context, k8sClient client.Client, service *inferenceapi.Service) {
Member:

Can we just remove ValidateSkipModelLoaderService and make ValidateSkipModelLoader part of ValidateService? I just see a lot of similar asserts here.

Contributor Author:

Done

@kerthcet (Member) commented Jun 5, 2025

vllm-cpu is another option, but we'd need to build an image ourselves.

@cr7258 (Contributor, Author) commented Jun 5, 2025

@kerthcet (Member) commented Jun 6, 2025

A kind reminder: we'll release a version this weekend to catch KubeCon HK. I hope to include this feature.

@cr7258 (Contributor, Author) commented Jun 8, 2025

vllm-cpu can't serve the model successfully; it keeps terminating with exit code 132. So I commented out some e2e asserts and will revisit them once we have GPU resources for e2e tests.

# logs
opt-125m-0 model-runner [W605 13:51:49.898291639 OperatorEntry.cpp:154] Warning: Warning only once for all operators,  other operators may also be overridden.
opt-125m-0 model-runner   Overriding a previously registered kernel for the same operator and the same dispatch key
opt-125m-0 model-runner   operator: aten::_addmm_activation(Tensor self, Tensor mat1, Tensor mat2, *, Scalar beta=1, Scalar alpha=1, bool use_gelu=False) -> Tensor
opt-125m-0 model-runner     registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
opt-125m-0 model-runner   dispatch key: AutocastCPU
opt-125m-0 model-runner   previous kernel: registered at /pytorch/aten/src/ATen/autocast_mode.cpp:327
opt-125m-0 model-runner        new kernel: registered at /opt/workspace/ipex-cpu-dev/csrc/cpu/autocast/autocast_mode.cpp:112 (function operator())

# container status
  containerStatuses:
  - containerID: containerd://bfa9555eca8808aabd01e430ebdfb3edc7d1a1ecf0fac6eb1daf4ba897cbe1bc
    image: public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.9.0
    imageID: public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo@sha256:8a7db9cd4fd8550f3737a24e64eab7486c55a45975f73305fa6bbd7b93819de4
    lastState:
      terminated:
        containerID: containerd://526ca024155bdc471e181e3952ecd131f23b406edc44a6a1226a4148a1b32db8
        exitCode: 132
        finishedAt: "2025-06-05T13:53:56Z"
        reason: Error
        startedAt: "2025-06-05T13:53:42Z"

@cr7258 (Contributor, Author) commented Jun 8, 2025

@kerthcet All tests pass now; please review the PR again. Thanks.

@cr7258 requested a review from kerthcet on June 8, 2025 15:14
@kerthcet (Member) commented Jun 8, 2025

I'll take a look tomorrow morning (well, actually later this morning).

Labels: feature, needs-priority, needs-triage
Development

Successfully merging this pull request may close these issues.

Support runai model streamer for fast model loading
3 participants