feat: support runai streamer for vllm #423

Open · wants to merge 19 commits into main

Conversation

@cr7258 (Contributor) commented May 19, 2025

What this PR does / why we need it

Add a new runai-streamer config to the vLLM BackendRuntime to allow loading models with the Run:ai Model Streamer, which improves model loading times. Currently, only vLLM supports the Run:ai Model Streamer.
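
For illustration, here is a rough sketch of what such a BackendRuntime config entry might look like. The args mirror the flags visible in the rendered pod later in this thread; the field layout and the template placeholders are assumptions, not the exact schema:

- name: runai-streamer
  args:
    - --model
    - "{{ .ModelPath }}"   # placeholder; the real template variable name may differ
    - --served-model-name
    - "{{ .ModelName }}"   # placeholder
    - --host
    - "0.0.0.0"
    - --port
    - "8080"
    - --load-format        # the only streamer-specific addition
    - runai_streamer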

[RunAI Streamer] Overall time to stream 942.3 MiB of all files: 0.18s, 5.0 GiB/s

kubectl logs qwen2-0--5b-0
Defaulted container "model-runner" out of: model-runner, model-loader (init)
INFO 05-19 02:00:18 __init__.py:207] Automatically detected platform cuda.
INFO 05-19 02:00:18 api_server.py:912] vLLM API server version 0.7.3
INFO 05-19 02:00:18 api_server.py:913] args: Namespace(host='0.0.0.0', port=8080, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='/workspace/models/models--Qwen--Qwen2-0.5B-Instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='runai_streamer', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['qwen2-0--5b'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, disable_log_requests=False, 
max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
INFO 05-19 02:00:18 api_server.py:209] Started engine process with PID 22
INFO 05-19 02:00:22 __init__.py:207] Automatically detected platform cuda.
INFO 05-19 02:00:24 config.py:549] This model supports multiple tasks: {'classify', 'embed', 'reward', 'score', 'generate'}. Defaulting to 'generate'.
INFO 05-19 02:00:28 config.py:549] This model supports multiple tasks: {'reward', 'classify', 'embed', 'score', 'generate'}. Defaulting to 'generate'.
INFO 05-19 02:00:28 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/workspace/models/models--Qwen--Qwen2-0.5B-Instruct', speculative_config=None, tokenizer='/workspace/models/models--Qwen--Qwen2-0.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.RUNAI_STREAMER, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=qwen2-0--5b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 
INFO 05-19 02:00:29 cuda.py:229] Using Flash Attention backend.
INFO 05-19 02:00:30 model_runner.py:1110] Starting to load model /workspace/models/models--Qwen--Qwen2-0.5B-Instruct...
Loading safetensors using Runai Model Streamer:   0% Completed | 0/1 [00:00<?, ?it/s]
[RunAI Streamer] CPU Buffer size: 942.3 MiB for file: model.safetensors
Read throughput is 9.41 GB per second 
Loading safetensors using Runai Model Streamer: 100% Completed | 1/1 [00:00<00:00,  5.47it/s]
Loading safetensors using Runai Model Streamer: 100% Completed | 1/1 [00:00<00:00,  5.47it/s]

[RunAI Streamer] Overall time to stream 942.3 MiB of all files: 0.18s, 5.0 GiB/s
INFO 05-19 02:00:30 model_runner.py:1115] Loading model weights took 0.9277 GB
INFO 05-19 02:00:31 worker.py:267] Memory profiling takes 0.88 seconds
INFO 05-19 02:00:31 worker.py:267] the current vLLM instance can use total_gpu_memory (22.18GiB) x gpu_memory_utilization (0.90) = 19.97GiB
INFO 05-19 02:00:31 worker.py:267] model weights take 0.93GiB; non_torch_memory takes 0.06GiB; PyTorch activation peak memory takes 1.44GiB; the rest of the memory reserved for KV Cache is 17.54GiB.
INFO 05-19 02:00:31 executor_base.py:111] # cuda blocks: 95795, # CPU blocks: 21845
INFO 05-19 02:00:31 executor_base.py:116] Maximum concurrency for 32768 tokens per request: 46.77x
INFO 05-19 02:00:36 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:12<00:00,  2.78it/s]
INFO 05-19 02:00:49 model_runner.py:1562] Graph capturing finished in 13 secs, took 0.15 GiB
INFO 05-19 02:00:49 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 18.59 seconds
INFO 05-19 02:00:50 api_server.py:958] Starting vLLM API server on http://0.0.0.0:8080
INFO 05-19 02:00:50 launcher.py:23] Available routes are:
INFO 05-19 02:00:50 launcher.py:31] Route: /openapi.json, Methods: HEAD, GET
INFO 05-19 02:00:50 launcher.py:31] Route: /docs, Methods: HEAD, GET
INFO 05-19 02:00:50 launcher.py:31] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 05-19 02:00:50 launcher.py:31] Route: /redoc, Methods: HEAD, GET
INFO 05-19 02:00:50 launcher.py:31] Route: /health, Methods: GET
INFO 05-19 02:00:50 launcher.py:31] Route: /ping, Methods: POST, GET
INFO 05-19 02:00:50 launcher.py:31] Route: /tokenize, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /detokenize, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /v1/models, Methods: GET
INFO 05-19 02:00:50 launcher.py:31] Route: /version, Methods: GET
INFO 05-19 02:00:50 launcher.py:31] Route: /v1/chat/completions, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /v1/completions, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /v1/embeddings, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /pooling, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /score, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /v1/score, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /v1/audio/transcriptions, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /rerank, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /v1/rerank, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /v2/rerank, Methods: POST
INFO 05-19 02:00:50 launcher.py:31] Route: /invocations, Methods: POST
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     240.243.170.78:46952 - "GET /health HTTP/1.1" 200 OK
INFO:     240.243.170.78:46958 - "GET /health HTTP/1.1" 200 OK
INFO:     240.243.170.78:35464 - "GET /health HTTP/1.1" 200 OK
INFO:     240.243.170.78:35468 - "GET /health HTTP/1.1" 200 OK
INFO:     240.243.170.78:35482 - "GET /health HTTP/1.1" 200 OK
INFO:     240.243.170.78:51796 - "GET /health HTTP/1.1" 200 OK

Which issue(s) this PR fixes

Fixes #352

Special notes for your reviewer

Does this PR introduce a user-facing change?

support runai streamer for vllm

@InftyAI-Agent added the needs-triage, needs-priority, and do-not-merge/needs-kind labels on May 19, 2025
@InftyAI-Agent requested a review from kerthcet on May 19, 2025 09:11
@cr7258 (Contributor, Author) commented May 19, 2025

/kind feature

@InftyAI-Agent added the feature label and removed the do-not-merge/needs-kind label on May 19, 2025
@kerthcet (Member) commented:

What we hope to achieve here is generally two things:

  • can we move this into our model loader component and make it inference-agnostic?
  • can we support GPUs other than NVIDIA?

Both need experiments; sorry I didn't explain this clearly here. The original comment: #352 (comment)

The configuration is already open to users, so I don't think we need to do anything.

@cr7258 (Contributor, Author) commented May 20, 2025

> can we move this into our model loader component and make it inference-agnostic?

As I understand it, the model loader is responsible for downloading models from remote storage, such as Hugging Face or OSS, to the local disk. When the inference container starts, it uses the model that has already been downloaded locally.

Run:ai Model Streamer can speed up model loading by concurrently loading already-read tensors into the GPU while continuing to read other tensors from storage. This acceleration happens after the model has been downloaded locally, so I don't think there is anything to do in the model loader to support Run:ai Model Streamer.

Additionally, Run:ai Model Streamer is not inference-agnostic — it requires integration with an inference engine, and currently only vLLM is supported. (Related PR)

@kerthcet (Member) commented:

I thought about this a bit and I think you're right: there's nothing for us to do here. The original idea was to explore whether we could load the models into the GPU and send the GPU allocation address to the inference engine. However, it seems no engine supports this now or in the foreseeable future.

But one thing we should be careful about is that we still load the models to disk rather than streaming CPU buffer -> GPU memory. So I suggest we add an annotation to the Playground | Inference Service; then in orchestration, once we detect that the Inference Service has the annotation, we will not construct the initContainer and will not render the ModelPath in the arguments, so the inference engine handles all the loading logic.

Would you like to refactor the PR based on this? @cr7258

@@ -77,6 +77,26 @@ spec:
limits:
cpu: 8
memory: 16Gi
- name: runai-streamer
Member:

It can be part of the example but I wouldn't like to make it part of the default template.

@cr7258 (Contributor, Author) commented May 26, 2025

@kerthcet Ok, I'll refactor the PR this week.

@cr7258 (Contributor, Author) commented Jun 1, 2025

@kerthcet I have refactored the PR according to your suggestion. Please take a look, thanks.

With this PR, the Run:ai Model Streamer can be used with two streaming approaches:

  • Streaming from S3 (S3 -> CPU buffer -> GPU): use the llmaz.io/skip-model-loader: "true" annotation (added on the OpenModel) to skip the model-loader initContainer; vLLM then loads the model directly from S3 (a rough sketch follows this list). example
  • Streaming from a file system (Hugging Face -> local disk -> CPU buffer -> GPU): the model-loader initContainer first downloads the model to local disk; the Streamer then concurrently reads tensor data from the files into a dedicated CPU buffer and transfers the tensors to GPU memory. example
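
A rough sketch of the S3 approach described above, mirroring the example used in this thread (the annotation lives on the OpenModel at this point, and the exact OpenModel fields may differ slightly):

apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: deepseek-r1-distill-qwen-1-5b
  annotations:
    llmaz.io/skip-model-loader: "true"  # skip the model-loader initContainer
spec:
  familyName: deepseek
  source:
    uri: s3://cr7258/DeepSeek-R1-Distill-Qwen-1.5B  # vLLM streams the weights directly from S3

The AWS credentials are injected into the model-runner container from a secret (aws-access-secret in the pod spec below).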

Here are the logs and the LLM pod for streaming from S3.

 kubectl logs deepseek-r1-distill-qwen-1-5b-0

INFO 06-01 17:42:57 __init__.py:207] Automatically detected platform cuda.
INFO 06-01 17:42:57 api_server.py:912] vLLM API server version 0.7.3
INFO 06-01 17:42:57 api_server.py:913] args: Namespace(host='0.0.0.0', port=8080, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='s3://cr7258/DeepSeek-R1-Distill-Qwen-1.5B', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='runai_streamer', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['deepseek-r1-distill-qwen-1-5b'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, 
disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
INFO 06-01 17:42:57 api_server.py:209] Started engine process with PID 21
INFO 06-01 17:43:02 __init__.py:207] Automatically detected platform cuda.
INFO 06-01 17:43:05 config.py:549] This model supports multiple tasks: {'embed', 'reward', 'generate', 'classify', 'score'}. Defaulting to 'generate'.
WARNING 06-01 17:43:05 arg_utils.py:1187] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 06-01 17:43:05 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 06-01 17:43:09 config.py:549] This model supports multiple tasks: {'score', 'reward', 'generate', 'embed', 'classify'}. Defaulting to 'generate'.
WARNING 06-01 17:43:09 arg_utils.py:1187] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 06-01 17:43:09 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 06-01 17:43:09 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/tmp/tmp7oj9ysi2', speculative_config=None, tokenizer='/tmp/tmpkhc7j_h4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.RUNAI_STREAMER, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=deepseek-r1-distill-qwen-1-5b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 
INFO 06-01 17:43:10 cuda.py:229] Using Flash Attention backend.
INFO 06-01 17:43:11 model_runner.py:1110] Starting to load model /tmp/tmp7oj9ysi2...
Loading safetensors using Runai Model Streamer:   0% Completed | 0/1 [00:00<?, ?it/s]
[RunAI Streamer] CPU Buffer size: 3.3 GiB for file: model.safetensors
Read throughput is 598.55 MB per second 
Loading safetensors using Runai Model Streamer: 100% Completed | 1/1 [00:06<00:00,  6.26s/it]
Loading safetensors using Runai Model Streamer: 100% Completed | 1/1 [00:06<00:00,  6.26s/it]

[RunAI Streamer] Overall time to stream 3.3 GiB of all files: 6.26s, 541.5 MiB/s
INFO 06-01 17:43:18 model_runner.py:1115] Loading model weights took 3.3460 GB
INFO 06-01 17:43:18 worker.py:267] Memory profiling takes 0.52 seconds
INFO 06-01 17:43:18 worker.py:267] the current vLLM instance can use total_gpu_memory (22.18GiB) x gpu_memory_utilization (0.90) = 19.97GiB
INFO 06-01 17:43:18 worker.py:267] model weights take 3.35GiB; non_torch_memory takes 0.06GiB; PyTorch activation peak memory takes 1.39GiB; the rest of the memory reserved for KV Cache is 15.17GiB.
INFO 06-01 17:43:19 executor_base.py:111] # cuda blocks: 35510, # CPU blocks: 9362
INFO 06-01 17:43:19 executor_base.py:116] Maximum concurrency for 131072 tokens per request: 4.33x
INFO 06-01 17:43:24 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:13<00:00,  2.55it/s]
INFO 06-01 17:43:38 model_runner.py:1562] Graph capturing finished in 14 secs, took 0.20 GiB
INFO 06-01 17:43:38 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 20.20 seconds
INFO 06-01 17:43:39 api_server.py:958] Starting vLLM API server on http://0.0.0.0:8080
INFO 06-01 17:43:39 launcher.py:23] Available routes are:
INFO 06-01 17:43:39 launcher.py:31] Route: /openapi.json, Methods: GET, HEAD
INFO 06-01 17:43:39 launcher.py:31] Route: /docs, Methods: GET, HEAD
INFO 06-01 17:43:39 launcher.py:31] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 06-01 17:43:39 launcher.py:31] Route: /redoc, Methods: GET, HEAD
INFO 06-01 17:43:39 launcher.py:31] Route: /health, Methods: GET
INFO 06-01 17:43:39 launcher.py:31] Route: /ping, Methods: GET, POST
INFO 06-01 17:43:39 launcher.py:31] Route: /tokenize, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /detokenize, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /v1/models, Methods: GET
INFO 06-01 17:43:39 launcher.py:31] Route: /version, Methods: GET
INFO 06-01 17:43:39 launcher.py:31] Route: /v1/chat/completions, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /v1/completions, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /v1/embeddings, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /pooling, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /score, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /v1/score, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /v1/audio/transcriptions, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /rerank, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /v1/rerank, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /v2/rerank, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /invocations, Methods: POST
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     240.243.170.78:57500 - "GET /health HTTP/1.1" 200 OK
INFO:     240.243.170.78:57502 - "GET /health HTTP/1.1" 200 OK
INFO:     240.243.170.78:45068 - "GET /health HTTP/1.1" 200 OK
INFO:     240.243.170.78:45082 - "GET /health HTTP/1.1" 200 OK
kubectl get pod deepseek-r1-distill-qwen-1-5b-0 -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/containerID: 381531474d3e5bce66c8d41227870222869255d3c66fd4d99eb3656aed46b06f
    cni.projectcalico.org/podIP: 100.64.1.35/32
    cni.projectcalico.org/podIPs: 100.64.1.35/32
    leaderworkerset.sigs.k8s.io/size: "1"
  creationTimestamp: "2025-06-02T00:38:55Z"
  generateName: deepseek-r1-distill-qwen-1-5b-
  labels:
    apps.kubernetes.io/pod-index: "0"
    controller-revision-hash: deepseek-r1-distill-qwen-1-5b-854446fd48
    leaderworkerset.sigs.k8s.io/group-index: "0"
    leaderworkerset.sigs.k8s.io/group-key: fdd0812c01eb16406e88b2bc006cddb7081625d8
    leaderworkerset.sigs.k8s.io/name: deepseek-r1-distill-qwen-1-5b
    leaderworkerset.sigs.k8s.io/template-revision-hash: 57c5d68dc6
    leaderworkerset.sigs.k8s.io/worker-index: "0"
    llmaz.io/model-family-name: deepseek
    llmaz.io/model-name: deepseek-r1-distill-qwen-1-5b
    statefulset.kubernetes.io/pod-name: deepseek-r1-distill-qwen-1-5b-0
  name: deepseek-r1-distill-qwen-1-5b-0
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: StatefulSet
    name: deepseek-r1-distill-qwen-1-5b
    uid: 3e95a2d9-db74-42c6-8f17-db378e29dc99
  resourceVersion: "40962332"
  uid: 6f5be468-2de2-4ecd-9edd-67a3fdd00fce
spec:
  containers:
  - args:
    - --model
    - s3://cr7258/DeepSeek-R1-Distill-Qwen-1.5B
    - --served-model-name
    - deepseek-r1-distill-qwen-1-5b
    - --host
    - 0.0.0.0
    - --port
    - "8080"
    - --load-format
    - runai_streamer
    command:
    - python3
    - -m
    - vllm.entrypoints.openai.api_server
    env:
    - name: LWS_LEADER_ADDRESS
      value: deepseek-r1-distill-qwen-1-5b-0.deepseek-r1-distill-qwen-1-5b.default
    - name: LWS_GROUP_SIZE
      value: "1"
    - name: LWS_WORKER_INDEX
      value: "0"
    - name: RUNAI_STREAMER_S3_REQUEST_TIMEOUT_MS
      value: "10000"
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          key: AWS_ACCESS_KEY_ID
          name: aws-access-secret
          optional: true
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          key: AWS_SECRET_ACCESS_KEY
          name: aws-access-secret
          optional: true
    - name: KUBERNETES_SERVICE_HOST
      value: api.seven.perfx-k8s.internal.canary.k8s.ondemand.com
    image: vllm/vllm-openai:v0.7.3
    imagePullPolicy: IfNotPresent
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - |
            while true; do
              RUNNING=$(curl -s http://localhost:8000/metrics | grep 'vllm:num_requests_running' | grep -v '#' | awk '{print $2}')
              WAITING=$(curl -s http://localhost:8000/metrics | grep 'vllm:num_requests_waiting' | grep -v '#' | awk '{print $2}')
              if [ "$RUNNING" = "0.0" ] && [ "$WAITING" = "0.0" ]; then
                echo "Terminating: No active or waiting requests, safe to terminate" >> /proc/1/fd/1
                exit 0
              else
                echo "Terminating: Running: $RUNNING, Waiting: $WAITING" >> /proc/1/fd/1
                sleep 5
              fi
            done
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /health
        port: 8080
        scheme: HTTP
      initialDelaySeconds: 15
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: model-runner
    ports:
    - containerPort: 8080
      name: http
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /health
        port: 8080
        scheme: HTTP
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      limits:
        cpu: "4"
        memory: 16Gi
        nvidia.com/gpu: "1"
      requests:
        cpu: "4"
        memory: 16Gi
        nvidia.com/gpu: "1"
    startupProbe:
      failureThreshold: 30
      httpGet:
        path: /health
        port: 8080
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /dev/shm
      name: dshm
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-p2ldw
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: deepseek-r1-distill-qwen-1-5b-0
  nodeName: ip-10-180-67-112.ec2.internal
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  subdomain: deepseek-r1-distill-qwen-1-5b
  terminationGracePeriodSeconds: 130
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - emptyDir:
      medium: Memory
      sizeLimit: 2Gi
    name: dshm
  - name: kube-api-access-p2ldw
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2025-06-02T00:42:47Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2025-06-02T00:38:55Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2025-06-02T00:43:45Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2025-06-02T00:43:45Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2025-06-02T00:38:55Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://59115f054d95b2c234cb440697e6ac28540806f8bc47949c637f1af8a6445f0d
    image: docker.io/vllm/vllm-openai:v0.7.3
    imageID: docker.io/vllm/vllm-openai@sha256:4f4037303e8c7b69439db1077bb849a0823517c0f785b894dc8e96d58ef3a0c2
    lastState: {}
    name: model-runner
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2025-06-02T00:42:46Z"
  hostIP: 10.180.67.112
  hostIPs:
  - ip: 10.180.67.112
  phase: Running
  podIP: 100.64.1.35
  podIPs:
  - ip: 100.64.1.35
  qosClass: Guaranteed
  startTime: "2025-06-02T00:38:55Z"

@cr7258 requested a review from kerthcet on June 1, 2025 15:39
@kerthcet (Member) left a comment:

Can we have an integration test?

@@ -30,6 +30,7 @@ var _ ModelSourceProvider = &URIProvider{}

const (
OSS = "OSS"
S3 = "S3"
Member:

I think we support GCS as well.

Contributor Author:

Done

@@ -111,6 +112,10 @@ func (w *OpenModelWebhook) generateValidate(obj runtime.Object) field.ErrorList
if _, _, _, err := util.ParseOSS(address); err != nil {
allErrs = append(allErrs, field.Invalid(sourcePath.Child("uri"), *model.Spec.Source.URI, "URI with wrong address"))
}
case modelSource.S3:
Member:

GCS as well here.

Contributor Author:

Done

@@ -164,3 +170,36 @@ func (p *ModelHubProvider) InjectModelLoader(template *corev1.PodTemplateSpec, i
func spreadEnvToInitContainer(containerEnv []corev1.EnvVar, initContainer *corev1.Container) {
initContainer.Env = append(initContainer.Env, containerEnv...)
}

func (p *ModelHubProvider) InjectModelEnvVars(template *corev1.PodTemplateSpec) {
Member:

I think we have already injected the HF token above at L115. Keeping one is enough.

Contributor Author:

The InjectModelEnvVars function injects the model credentials into the model-runner container instead of the model-loader initContainer, for the case where the model-runner container handles the model loading itself.
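
For reference, the injected credentials on the model-runner container end up looking roughly like the snippet below, mirroring the pod specs elsewhere in this thread (secret and key names are the ones used in these examples):

env:
- name: HUGGING_FACE_HUB_TOKEN
  valueFrom:
    secretKeyRef:
      name: modelhub-secret
      key: HF_TOKEN
      optional: true
- name: HF_TOKEN
  valueFrom:
    secretKeyRef:
      name: modelhub-secret
      key: HF_TOKEN
      optional: true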

kind: OpenModel
metadata:
name: deepseek-r1-distill-qwen-1-5b
annotations:
@kerthcet (Member) commented Jun 2, 2025:

Can we control this in the Playground and iSVC? I think it's more flexible there.

Contributor Author:

A Playground may be associated with multiple OpenModels. For example, opt-350m is loaded by the model-runner container itself, while opt-125m is loaded by the model-loader initContainer. Therefore, I think we should place the annotation on the OpenModel.

apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: opt-350m
  annotations:
    llmaz.io/skip-model-loader: "true"
spec:
  familyName: opt
  source:
    modelHub:
      modelID: facebook/opt-350m
  inferenceConfig:
    flavors:
      - name: a10 # gpu type
        limits:
          nvidia.com/gpu: 1
---
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: opt-125m
spec:
  familyName: opt
  source:
    modelHub:
      modelID: facebook/opt-125m
---
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: vllm-speculator
spec:
  replicas: 1
  modelClaims:
    models:
    - name: opt-350m # the target model
      role: main
    - name: opt-125m  # the draft model
      role: draft

The final LLM pod looks like this:

kubectl get pod vllm-speculator-0 -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/containerID: 731abc79f3fd2ed667a0003e41100bba64fcaf459850c48de837899226798cb7
    cni.projectcalico.org/podIP: 100.64.8.29/32
    cni.projectcalico.org/podIPs: 100.64.8.29/32
    leaderworkerset.sigs.k8s.io/size: "1"
  creationTimestamp: "2025-06-02T07:40:16Z"
  generateName: vllm-speculator-
  labels:
    apps.kubernetes.io/pod-index: "0"
    controller-revision-hash: vllm-speculator-657b7dc4cc
    leaderworkerset.sigs.k8s.io/group-index: "0"
    leaderworkerset.sigs.k8s.io/group-key: fe5f4052a1971b9c5d3ea770f2809e11105693b8
    leaderworkerset.sigs.k8s.io/name: vllm-speculator
    leaderworkerset.sigs.k8s.io/template-revision-hash: 96599544f
    leaderworkerset.sigs.k8s.io/worker-index: "0"
    llmaz.io/model-family-name: opt
    llmaz.io/model-name: opt-350m
    statefulset.kubernetes.io/pod-name: vllm-speculator-0
  name: vllm-speculator-0
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: StatefulSet
    name: vllm-speculator
    uid: 930ce4cf-ead5-4ac8-ac1c-3bb9ab8f8866
  resourceVersion: "41268933"
  uid: 3eba84fd-e569-49b4-9982-2eb7187f2feb
spec:
  containers:
  - args:
    - --model
    - facebook/opt-350m
    - --served-model-name
    - opt-350m
    - --speculative_model
    - /workspace/models/models--facebook--opt-125m
    - --host
    - 0.0.0.0
    - --port
    - "8080"
    - --num_speculative_tokens
    - "5"
    - -tp
    - "1"
    command:
    - python3
    - -m
    - vllm.entrypoints.openai.api_server
    env:
    - name: LWS_LEADER_ADDRESS
      value: vllm-speculator-0.vllm-speculator.default
    - name: LWS_GROUP_SIZE
      value: "1"
    - name: LWS_WORKER_INDEX
      value: "0"
    - name: HUGGING_FACE_HUB_TOKEN
      valueFrom:
        secretKeyRef:
          key: HF_TOKEN
          name: modelhub-secret
          optional: true
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          key: HF_TOKEN
          name: modelhub-secret
          optional: true
    - name: KUBERNETES_SERVICE_HOST
      value: api.seven.perfx-k8s.internal.canary.k8s.ondemand.com
    image: vllm/vllm-openai:v0.7.3
    imagePullPolicy: IfNotPresent
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - |
            while true; do
              RUNNING=$(curl -s http://localhost:8000/metrics | grep 'vllm:num_requests_running' | grep -v '#' | awk '{print $2}')
              WAITING=$(curl -s http://localhost:8000/metrics | grep 'vllm:num_requests_waiting' | grep -v '#' | awk '{print $2}')
              if [ "$RUNNING" = "0.0" ] && [ "$WAITING" = "0.0" ]; then
                echo "Terminating: No active or waiting requests, safe to terminate" >> /proc/1/fd/1
                exit 0
              else
                echo "Terminating: Running: $RUNNING, Waiting: $WAITING" >> /proc/1/fd/1
                sleep 5
              fi
            done
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /health
        port: 8080
        scheme: HTTP
      initialDelaySeconds: 15
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: model-runner
    ports:
    - containerPort: 8080
      name: http
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /health
        port: 8080
        scheme: HTTP
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      limits:
        cpu: "8"
        memory: 16Gi
        nvidia.com/gpu: "1"
      requests:
        cpu: "4"
        memory: 8Gi
        nvidia.com/gpu: "1"
    startupProbe:
      failureThreshold: 30
      httpGet:
        path: /health
        port: 8080
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /dev/shm
      name: dshm
    - mountPath: /workspace/models/
      name: model-volume
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-zbsfx
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: vllm-speculator-0
  initContainers:
  - env:
    - name: LWS_LEADER_ADDRESS
      value: vllm-speculator-0.vllm-speculator.default
    - name: LWS_GROUP_SIZE
      value: "1"
    - name: LWS_WORKER_INDEX
      value: "0"
    - name: MODEL_SOURCE_TYPE
      value: modelhub
    - name: MODEL_ID
      value: facebook/opt-125m
    - name: MODEL_HUB_NAME
      value: Huggingface
    - name: REVISION
      value: main
    - name: HUGGING_FACE_HUB_TOKEN
      valueFrom:
        secretKeyRef:
          key: HF_TOKEN
          name: modelhub-secret
          optional: true
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          key: HF_TOKEN
          name: modelhub-secret
          optional: true
    - name: KUBERNETES_SERVICE_HOST
      value: api.seven.perfx-k8s.internal.canary.k8s.ondemand.com
    image: inftyai/model-loader:v0.0.10
    imagePullPolicy: IfNotPresent
    name: model-loader-1
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /workspace/models/
      name: model-volume
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-zbsfx
      readOnly: true
  nodeName: ip-10-180-71-146.ec2.internal
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  subdomain: vllm-speculator
  terminationGracePeriodSeconds: 130
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - emptyDir:
      medium: Memory
      sizeLimit: 2Gi
    name: dshm
  - emptyDir: {}
    name: model-volume
  - name: kube-api-access-zbsfx
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2025-06-02T07:40:17Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2025-06-02T07:40:16Z"
    message: 'containers with incomplete status: [model-loader-1]'
    reason: ContainersNotInitialized
    status: "False"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2025-06-02T07:40:16Z"
    message: 'containers with unready status: [model-runner]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2025-06-02T07:40:16Z"
    message: 'containers with unready status: [model-runner]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2025-06-02T07:40:16Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: vllm/vllm-openai:v0.7.3
    imageID: ""
    lastState: {}
    name: model-runner
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        reason: PodInitializing
  hostIP: 10.180.71.146
  hostIPs:
  - ip: 10.180.71.146
  initContainerStatuses:
  - containerID: containerd://90b6cc1592273be115f75c8d813812884caa78f7dcc28b22205852747eea701c
    image: docker.io/inftyai/model-loader:v0.0.10
    imageID: docker.io/inftyai/model-loader@sha256:b67a8bb3acbc496a62801b2110056b9774e52ddc029b379c7370113c7879c7d9
    lastState: {}
    name: model-loader-1
    ready: false
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2025-06-02T07:40:17Z"
  phase: Pending
  podIP: 100.64.8.29
  podIPs:
  - ip: 100.64.8.29
  qosClass: Burstable
  startTime: "2025-06-02T07:40:16Z"

Member:

I still think we should leave the knob at the Playground level. For example, in the future when we have a cache layer, people can still choose to load models from S3 directly or from the cache, especially in performance comparison tests.

Member:

But we can follow up on this later, of course.

Member:

Personally, I don't think loading different models with different approaches is a typical example.

Contributor Author:

Ok, I'll refactor it tomorrow.

@cr7258 requested a review from kerthcet on June 2, 2025 08:20
modelInfo["DraftModelPath"] = modelSource.NewModelSourceProvider(p.models[1]).ModelPath()
skipModelLoader = false
draftModel := p.models[1]
if annotations := draftModel.GetAnnotations(); annotations != nil {
Member:

The overall logic would be simpler if we extract the annotation from the isvc.

// Return once not the main model, because all the below has already been injected.
if index != 0 {
return
func spreadEnvToInitContainer(containerEnv []corev1.EnvVar, initContainer *corev1.Container) {
Member:

Again, I'd like to see the same behavior across all the models.

@cr7258 (Contributor, Author) commented Jun 4, 2025

@kerthcet I have moved the annotation to the Playground. Please review it, thanks.
Regarding the e2e test failures: we don't have GPU resources, so we can't test against the vLLM BackendRuntime, right?
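
For reference, a minimal sketch of a Playground carrying the annotation after this change (names mirror the earlier examples, and the modelClaims shape follows the Playground example above; the exact fields may differ):

apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: deepseek-r1-distill-qwen-1-5b
  annotations:
    llmaz.io/skip-model-loader: "true"  # the engine loads the model itself; no model-loader initContainer
spec:
  replicas: 1
  modelClaims:
    models:
    - name: deepseek-r1-distill-qwen-1-5b
      role: main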

@cr7258 requested a review from kerthcet on June 4, 2025 05:24
@kerthcet (Member) left a comment:

We have no GPU nodes for tests right now. Can we just comment out the asserts about service readiness? We can uncomment them in the future.

@@ -168,12 +168,33 @@ func buildWorkloadApplyConfiguration(service *inferenceapi.Service, models []*co
func injectModelProperties(template *applyconfigurationv1.LeaderWorkerTemplateApplyConfiguration, models []*coreapi.OpenModel, service *inferenceapi.Service) {
isMultiNodesInference := template.LeaderTemplate != nil

// Skip model-loader initContainer if llmaz.io/skip-model-loader annotation is set.
skipModelLoader := false
if annotations := service.GetAnnotations(); annotations != nil {
Member:

Let's make it a helper func so we can reuse it:

func SkipModelLoader(obj metav1.Object) bool {
	if annotations := obj.GetAnnotations(); annotations != nil {
		return annotations[inferenceapi.SkipModelLoaderAnnoKey] == "true"
	}
	return false
}

Contributor Author:

Done

})

ginkgo.It("Deploy S3 model with llmaz.io/skip-model-loader annotation", func() {
model := wrapper.MakeModel("deepseek-r1-distill-qwen-1-5b").FamilyName("deepseek").ModelSourceWithURI("s3://test-bucket/DeepSeek-R1-Distill-Qwen-1.5B").Obj()
Member:

Is this a valid address: s3://test-bucket/DeepSeek-R1-Distill-Qwen-1.5B? If not, I guess it will never succeed.

Contributor Author:

No, can we have a community-dedicated S3 bucket for testing?

Member:

Right now, no. I'll try to set one up this weekend.

@@ -244,6 +244,134 @@ func ValidateServicePods(ctx context.Context, k8sClient client.Client, service *
}).Should(gomega.Succeed())
}

// ValidateSkipModelLoaderService validates the Playground resource with llmaz.io/skip-model-loader annotation
func ValidateSkipModelLoaderService(ctx context.Context, k8sClient client.Client, service *inferenceapi.Service) {
Member:

Can we just remove ValidateSkipModelLoaderService and make ValidateSkipModelLoader part of ValidateService? I just see a lot of similar asserts here.

Contributor Author:

Done

@kerthcet (Member) commented Jun 5, 2025

vllm-cpu is another option, but we'd need to build an image ourselves.

@cr7258 (Contributor, Author) commented Jun 5, 2025

@kerthcet (Member) commented Jun 6, 2025

A kind reminder: we'll release a version this weekend to catch KubeCon HK. I hope to include this feature.

@cr7258 (Contributor, Author) commented Jun 8, 2025

vllm-cpu can't serve the model successfully; it keeps terminating with exit code 132. So I commented out some e2e asserts and will revisit them once we have GPU resources for e2e tests.

# logs
opt-125m-0 model-runner [W605 13:51:49.898291639 OperatorEntry.cpp:154] Warning: Warning only once for all operators,  other operators may also be overridden.
opt-125m-0 model-runner   Overriding a previously registered kernel for the same operator and the same dispatch key
opt-125m-0 model-runner   operator: aten::_addmm_activation(Tensor self, Tensor mat1, Tensor mat2, *, Scalar beta=1, Scalar alpha=1, bool use_gelu=False) -> Tensor
opt-125m-0 model-runner     registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
opt-125m-0 model-runner   dispatch key: AutocastCPU
opt-125m-0 model-runner   previous kernel: registered at /pytorch/aten/src/ATen/autocast_mode.cpp:327
opt-125m-0 model-runner        new kernel: registered at /opt/workspace/ipex-cpu-dev/csrc/cpu/autocast/autocast_mode.cpp:112 (function operator())

# container status
  containerStatuses:
  - containerID: containerd://bfa9555eca8808aabd01e430ebdfb3edc7d1a1ecf0fac6eb1daf4ba897cbe1bc
    image: public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.9.0
    imageID: public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo@sha256:8a7db9cd4fd8550f3737a24e64eab7486c55a45975f73305fa6bbd7b93819de4
    lastState:
      terminated:
        containerID: containerd://526ca024155bdc471e181e3952ecd131f23b406edc44a6a1226a4148a1b32db8
        exitCode: 132
        finishedAt: "2025-06-05T13:53:56Z"
        reason: Error
        startedAt: "2025-06-05T13:53:42Z"

@cr7258 (Contributor, Author) commented Jun 8, 2025

@kerthcet All tests pass now; please review the PR again. Thanks.

@cr7258 requested a review from kerthcet on June 8, 2025 15:14
@kerthcet (Member) commented Jun 8, 2025

I'll take a look tomorrow morning (well, actually later this morning).

Labels: feature, needs-priority, needs-triage
Development

Successfully merging this pull request may close these issues.

Support runai model streamer for fast model loading
3 participants