feat: support runai streamer for vllm #423
Conversation
/kind feature
What we hope to achieve here is generally two things:
Both need experiments; sorry I didn't explain this clearly here. The original comment: #352 (comment). The configuration is already open to users, so I don't think we need to do anything there.
As I understand it, the model loader is responsible for downloading models from remote storage, such as Hugging Face or OSS, to the local disk. When the inference container starts, it uses the model that has already been downloaded locally. Run:ai Model Streamer can speed up model loading by concurrently moving already-read tensors into the GPU while continuing to read other tensors from storage. This acceleration happens after the model has been downloaded locally, so I don't think we need to do anything in the model loader to support Run:ai Model Streamer. Additionally, Run:ai Model Streamer is not inference-agnostic: it requires integration with an inference engine, and currently only vLLM is supported. (Related PR)
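For reference, a minimal sketch of what this looks like from vLLM's side (the flag values match the Run:ai Streamer load format shown later in this thread; the s3:// path is a placeholder):

# Hypothetical args fragment for the model-runner container when vLLM itself
# streams the model via Run:ai Model Streamer instead of reading a locally
# downloaded copy. The bucket path is illustrative only.
args:
  - --model
  - s3://<bucket>/<model-name>
  - --load-format
  - runai_streamer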
I thought about this a bit, and I think you're right: there's nothing for us to do here. The original idea was to explore whether we could load the models into GPU memory ourselves and pass the GPU allocation address to the inference engine. However, no engine supports this today, nor seems likely to in the foreseeable future. One thing we should be careful about is that we still load the models to disk rather than through a CPU buffer into GPU memory, so I suggest adding an annotation to the Playground | Inference Service. Then, in orchestration, once we detect that the Inference Service has the annotation, we won't construct the initContainer and won't render the ModelPath in the arguments, so the inference engine will handle all the loading logic. Would you like to refactor the PR based on this? @cr7258
chart/templates/backends/vllm.yaml
@@ -77,6 +77,26 @@ spec:
limits:
cpu: 8
memory: 16Gi
- name: runai-streamer |
It can be part of the example but I wouldn't like to make it part of the default template.
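Something like the following could live in a standalone example instead; a rough sketch that mirrors the name/resources layout of the surrounding entries in chart/templates/backends/vllm.yaml (the exact schema may differ):

# Sketch of a standalone runai-streamer example config for the vLLM
# BackendRuntime. The args/resource values are illustrative assumptions;
# only the --load-format runai_streamer flag is taken from this PR.
- name: runai-streamer
  args:
    - --load-format
    - runai_streamer
  resources:
    requests:
      cpu: 4
      memory: 16Gi
    limits:
      cpu: 8
      memory: 16Gi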
@kerthcet Ok, I'll refactor the PR this week.
@kerthcet I have refactored the PR according to your suggestion. Please take a look, thanks. With this refactor, the Run:ai Model Streamer integration supports two streaming approaches: streaming the model from a local path that the model-loader initContainer has already downloaded, and streaming it directly from object storage such as S3 (skipping the model-loader entirely).
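For the direct-from-S3 approach, the model is declared with an S3 URI, roughly like this (a sketch; the spec.source.uri path is inferred from the URI provider touched in this PR, and the bucket is the one used in the demo below):

# Rough sketch of the OpenModel used when vLLM streams weights directly
# from S3. Field names are assumed from the other OpenModel examples in
# this thread and from model.Spec.Source.URI in the webhook code.
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: deepseek-r1-distill-qwen-1-5b
spec:
  familyName: deepseek
  source:
    uri: s3://cr7258/DeepSeek-R1-Distill-Qwen-1.5B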
Here are the logs and the LLM pod for streaming from S3.

kubectl logs deepseek-r1-distill-qwen-1-5b-0
INFO 06-01 17:42:57 __init__.py:207] Automatically detected platform cuda.
INFO 06-01 17:42:57 api_server.py:912] vLLM API server version 0.7.3
INFO 06-01 17:42:57 api_server.py:913] args: Namespace(host='0.0.0.0', port=8080, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='s3://cr7258/DeepSeek-R1-Distill-Qwen-1.5B', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='runai_streamer', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['deepseek-r1-distill-qwen-1-5b'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, 
disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
INFO 06-01 17:42:57 api_server.py:209] Started engine process with PID 21
INFO 06-01 17:43:02 __init__.py:207] Automatically detected platform cuda.
INFO 06-01 17:43:05 config.py:549] This model supports multiple tasks: {'embed', 'reward', 'generate', 'classify', 'score'}. Defaulting to 'generate'.
WARNING 06-01 17:43:05 arg_utils.py:1187] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 06-01 17:43:05 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 06-01 17:43:09 config.py:549] This model supports multiple tasks: {'score', 'reward', 'generate', 'embed', 'classify'}. Defaulting to 'generate'.
WARNING 06-01 17:43:09 arg_utils.py:1187] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 06-01 17:43:09 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 06-01 17:43:09 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/tmp/tmp7oj9ysi2', speculative_config=None, tokenizer='/tmp/tmpkhc7j_h4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.RUNAI_STREAMER, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=deepseek-r1-distill-qwen-1-5b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
INFO 06-01 17:43:10 cuda.py:229] Using Flash Attention backend.
INFO 06-01 17:43:11 model_runner.py:1110] Starting to load model /tmp/tmp7oj9ysi2...
Loading safetensors using Runai Model Streamer: 0% Completed | 0/1 [00:00<?, ?it/s]
[RunAI Streamer] CPU Buffer size: 3.3 GiB for file: model.safetensors
Read throughput is 598.55 MB per second
Loading safetensors using Runai Model Streamer: 100% Completed | 1/1 [00:06<00:00, 6.26s/it]
Loading safetensors using Runai Model Streamer: 100% Completed | 1/1 [00:06<00:00, 6.26s/it]
[RunAI Streamer] Overall time to stream 3.3 GiB of all files: 6.26s, 541.5 MiB/s
INFO 06-01 17:43:18 model_runner.py:1115] Loading model weights took 3.3460 GB
INFO 06-01 17:43:18 worker.py:267] Memory profiling takes 0.52 seconds
INFO 06-01 17:43:18 worker.py:267] the current vLLM instance can use total_gpu_memory (22.18GiB) x gpu_memory_utilization (0.90) = 19.97GiB
INFO 06-01 17:43:18 worker.py:267] model weights take 3.35GiB; non_torch_memory takes 0.06GiB; PyTorch activation peak memory takes 1.39GiB; the rest of the memory reserved for KV Cache is 15.17GiB.
INFO 06-01 17:43:19 executor_base.py:111] # cuda blocks: 35510, # CPU blocks: 9362
INFO 06-01 17:43:19 executor_base.py:116] Maximum concurrency for 131072 tokens per request: 4.33x
INFO 06-01 17:43:24 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:13<00:00, 2.55it/s]
INFO 06-01 17:43:38 model_runner.py:1562] Graph capturing finished in 14 secs, took 0.20 GiB
INFO 06-01 17:43:38 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 20.20 seconds
INFO 06-01 17:43:39 api_server.py:958] Starting vLLM API server on http://0.0.0.0:8080
INFO 06-01 17:43:39 launcher.py:23] Available routes are:
INFO 06-01 17:43:39 launcher.py:31] Route: /openapi.json, Methods: GET, HEAD
INFO 06-01 17:43:39 launcher.py:31] Route: /docs, Methods: GET, HEAD
INFO 06-01 17:43:39 launcher.py:31] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 06-01 17:43:39 launcher.py:31] Route: /redoc, Methods: GET, HEAD
INFO 06-01 17:43:39 launcher.py:31] Route: /health, Methods: GET
INFO 06-01 17:43:39 launcher.py:31] Route: /ping, Methods: GET, POST
INFO 06-01 17:43:39 launcher.py:31] Route: /tokenize, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /detokenize, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /v1/models, Methods: GET
INFO 06-01 17:43:39 launcher.py:31] Route: /version, Methods: GET
INFO 06-01 17:43:39 launcher.py:31] Route: /v1/chat/completions, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /v1/completions, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /v1/embeddings, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /pooling, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /score, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /v1/score, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /v1/audio/transcriptions, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /rerank, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /v1/rerank, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /v2/rerank, Methods: POST
INFO 06-01 17:43:39 launcher.py:31] Route: /invocations, Methods: POST
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: 240.243.170.78:57500 - "GET /health HTTP/1.1" 200 OK
INFO: 240.243.170.78:57502 - "GET /health HTTP/1.1" 200 OK
INFO: 240.243.170.78:45068 - "GET /health HTTP/1.1" 200 OK
INFO: 240.243.170.78:45082 - "GET /health HTTP/1.1" 200 OK

kubectl get pod deepseek-r1-distill-qwen-1-5b-0 -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
cni.projectcalico.org/containerID: 381531474d3e5bce66c8d41227870222869255d3c66fd4d99eb3656aed46b06f
cni.projectcalico.org/podIP: 100.64.1.35/32
cni.projectcalico.org/podIPs: 100.64.1.35/32
leaderworkerset.sigs.k8s.io/size: "1"
creationTimestamp: "2025-06-02T00:38:55Z"
generateName: deepseek-r1-distill-qwen-1-5b-
labels:
apps.kubernetes.io/pod-index: "0"
controller-revision-hash: deepseek-r1-distill-qwen-1-5b-854446fd48
leaderworkerset.sigs.k8s.io/group-index: "0"
leaderworkerset.sigs.k8s.io/group-key: fdd0812c01eb16406e88b2bc006cddb7081625d8
leaderworkerset.sigs.k8s.io/name: deepseek-r1-distill-qwen-1-5b
leaderworkerset.sigs.k8s.io/template-revision-hash: 57c5d68dc6
leaderworkerset.sigs.k8s.io/worker-index: "0"
llmaz.io/model-family-name: deepseek
llmaz.io/model-name: deepseek-r1-distill-qwen-1-5b
statefulset.kubernetes.io/pod-name: deepseek-r1-distill-qwen-1-5b-0
name: deepseek-r1-distill-qwen-1-5b-0
namespace: default
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: StatefulSet
name: deepseek-r1-distill-qwen-1-5b
uid: 3e95a2d9-db74-42c6-8f17-db378e29dc99
resourceVersion: "40962332"
uid: 6f5be468-2de2-4ecd-9edd-67a3fdd00fce
spec:
containers:
- args:
- --model
- s3://cr7258/DeepSeek-R1-Distill-Qwen-1.5B
- --served-model-name
- deepseek-r1-distill-qwen-1-5b
- --host
- 0.0.0.0
- --port
- "8080"
- --load-format
- runai_streamer
command:
- python3
- -m
- vllm.entrypoints.openai.api_server
env:
- name: LWS_LEADER_ADDRESS
value: deepseek-r1-distill-qwen-1-5b-0.deepseek-r1-distill-qwen-1-5b.default
- name: LWS_GROUP_SIZE
value: "1"
- name: LWS_WORKER_INDEX
value: "0"
- name: RUNAI_STREAMER_S3_REQUEST_TIMEOUT_MS
value: "10000"
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
key: AWS_ACCESS_KEY_ID
name: aws-access-secret
optional: true
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
key: AWS_SECRET_ACCESS_KEY
name: aws-access-secret
optional: true
- name: KUBERNETES_SERVICE_HOST
value: api.seven.perfx-k8s.internal.canary.k8s.ondemand.com
image: vllm/vllm-openai:v0.7.3
imagePullPolicy: IfNotPresent
lifecycle:
preStop:
exec:
command:
- /bin/sh
- -c
- |
while true; do
RUNNING=$(curl -s http://localhost:8000/metrics | grep 'vllm:num_requests_running' | grep -v '#' | awk '{print $2}')
WAITING=$(curl -s http://localhost:8000/metrics | grep 'vllm:num_requests_waiting' | grep -v '#' | awk '{print $2}')
if [ "$RUNNING" = "0.0" ] && [ "$WAITING" = "0.0" ]; then
echo "Terminating: No active or waiting requests, safe to terminate" >> /proc/1/fd/1
exit 0
else
echo "Terminating: Running: $RUNNING, Waiting: $WAITING" >> /proc/1/fd/1
sleep 5
fi
done
livenessProbe:
failureThreshold: 3
httpGet:
path: /health
port: 8080
scheme: HTTP
initialDelaySeconds: 15
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
name: model-runner
ports:
- containerPort: 8080
name: http
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /health
port: 8080
scheme: HTTP
initialDelaySeconds: 5
periodSeconds: 5
successThreshold: 1
timeoutSeconds: 1
resources:
limits:
cpu: "4"
memory: 16Gi
nvidia.com/gpu: "1"
requests:
cpu: "4"
memory: 16Gi
nvidia.com/gpu: "1"
startupProbe:
failureThreshold: 30
httpGet:
path: /health
port: 8080
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /dev/shm
name: dshm
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-p2ldw
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
hostname: deepseek-r1-distill-qwen-1-5b-0
nodeName: ip-10-180-67-112.ec2.internal
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
subdomain: deepseek-r1-distill-qwen-1-5b
terminationGracePeriodSeconds: 130
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- emptyDir:
medium: Memory
sizeLimit: 2Gi
name: dshm
- name: kube-api-access-p2ldw
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2025-06-02T00:42:47Z"
status: "True"
type: PodReadyToStartContainers
- lastProbeTime: null
lastTransitionTime: "2025-06-02T00:38:55Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2025-06-02T00:43:45Z"
status: "True"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2025-06-02T00:43:45Z"
status: "True"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2025-06-02T00:38:55Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: containerd://59115f054d95b2c234cb440697e6ac28540806f8bc47949c637f1af8a6445f0d
image: docker.io/vllm/vllm-openai:v0.7.3
imageID: docker.io/vllm/vllm-openai@sha256:4f4037303e8c7b69439db1077bb849a0823517c0f785b894dc8e96d58ef3a0c2
lastState: {}
name: model-runner
ready: true
restartCount: 0
started: true
state:
running:
startedAt: "2025-06-02T00:42:46Z"
hostIP: 10.180.67.112
hostIPs:
- ip: 10.180.67.112
phase: Running
podIP: 100.64.1.35
podIPs:
- ip: 100.64.1.35
qosClass: Guaranteed
startTime: "2025-06-02T00:38:55Z"
Can we have an integration test?
@@ -30,6 +30,7 @@ var _ ModelSourceProvider = &URIProvider{}

const (
OSS = "OSS"
S3 = "S3"
I think we support GCS as well.
Done
pkg/webhook/openmodel_webhook.go
@@ -111,6 +112,10 @@ func (w *OpenModelWebhook) generateValidate(obj runtime.Object) field.ErrorList
if _, _, _, err := util.ParseOSS(address); err != nil {
allErrs = append(allErrs, field.Invalid(sourcePath.Child("uri"), *model.Spec.Source.URI, "URI with wrong address"))
}
case modelSource.S3:
GCS as well here.
Done
@@ -164,3 +170,36 @@ func (p *ModelHubProvider) InjectModelLoader(template *corev1.PodTemplateSpec, i
func spreadEnvToInitContainer(containerEnv []corev1.EnvVar, initContainer *corev1.Container) {
initContainer.Env = append(initContainer.Env, containerEnv...)
}

func (p *ModelHubProvider) InjectModelEnvVars(template *corev1.PodTemplateSpec) {
I think we have already injected the HF token above at L115. Keeping one is enough.
The InjectModelEnvVars function is used to inject model credentials into the model-runner container instead of the model-loader initContainer, in case the model-runner container handles the model loading itself.
kind: OpenModel
metadata:
name: deepseek-r1-distill-qwen-1-5b
annotations:
Can we control this in the Playground and iSVC? I think it's more flexible there.
A Playground may be associated with multiple OpenModels. For example, opt-350m is loaded by the model-runner container itself, while opt-125m is loaded by the model-loader initContainer. Therefore, I think we should place the annotation on the OpenModel.
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
name: opt-350m
annotations:
llmaz.io/skip-model-loader: "true"
spec:
familyName: opt
source:
modelHub:
modelID: facebook/opt-350m
inferenceConfig:
flavors:
- name: a10 # gpu type
limits:
nvidia.com/gpu: 1
---
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
name: opt-125m
spec:
familyName: opt
source:
modelHub:
modelID: facebook/opt-125m
---
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
name: vllm-speculator
spec:
replicas: 1
modelClaims:
models:
- name: opt-350m # the target model
role: main
- name: opt-125m # the draft model
role: draft
The final LLM pod looks like this:
kgp vllm-speculator-0 -oyaml
apiVersion: v1
kind: Pod
metadata:
annotations:
cni.projectcalico.org/containerID: 731abc79f3fd2ed667a0003e41100bba64fcaf459850c48de837899226798cb7
cni.projectcalico.org/podIP: 100.64.8.29/32
cni.projectcalico.org/podIPs: 100.64.8.29/32
leaderworkerset.sigs.k8s.io/size: "1"
creationTimestamp: "2025-06-02T07:40:16Z"
generateName: vllm-speculator-
labels:
apps.kubernetes.io/pod-index: "0"
controller-revision-hash: vllm-speculator-657b7dc4cc
leaderworkerset.sigs.k8s.io/group-index: "0"
leaderworkerset.sigs.k8s.io/group-key: fe5f4052a1971b9c5d3ea770f2809e11105693b8
leaderworkerset.sigs.k8s.io/name: vllm-speculator
leaderworkerset.sigs.k8s.io/template-revision-hash: 96599544f
leaderworkerset.sigs.k8s.io/worker-index: "0"
llmaz.io/model-family-name: opt
llmaz.io/model-name: opt-350m
statefulset.kubernetes.io/pod-name: vllm-speculator-0
name: vllm-speculator-0
namespace: default
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: StatefulSet
name: vllm-speculator
uid: 930ce4cf-ead5-4ac8-ac1c-3bb9ab8f8866
resourceVersion: "41268933"
uid: 3eba84fd-e569-49b4-9982-2eb7187f2feb
spec:
containers:
- args:
- --model
- facebook/opt-350m
- --served-model-name
- opt-350m
- --speculative_model
- /workspace/models/models--facebook--opt-125m
- --host
- 0.0.0.0
- --port
- "8080"
- --num_speculative_tokens
- "5"
- -tp
- "1"
command:
- python3
- -m
- vllm.entrypoints.openai.api_server
env:
- name: LWS_LEADER_ADDRESS
value: vllm-speculator-0.vllm-speculator.default
- name: LWS_GROUP_SIZE
value: "1"
- name: LWS_WORKER_INDEX
value: "0"
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
key: HF_TOKEN
name: modelhub-secret
optional: true
- name: HF_TOKEN
valueFrom:
secretKeyRef:
key: HF_TOKEN
name: modelhub-secret
optional: true
- name: KUBERNETES_SERVICE_HOST
value: api.seven.perfx-k8s.internal.canary.k8s.ondemand.com
image: vllm/vllm-openai:v0.7.3
imagePullPolicy: IfNotPresent
lifecycle:
preStop:
exec:
command:
- /bin/sh
- -c
- |
while true; do
RUNNING=$(curl -s http://localhost:8000/metrics | grep 'vllm:num_requests_running' | grep -v '#' | awk '{print $2}')
WAITING=$(curl -s http://localhost:8000/metrics | grep 'vllm:num_requests_waiting' | grep -v '#' | awk '{print $2}')
if [ "$RUNNING" = "0.0" ] && [ "$WAITING" = "0.0" ]; then
echo "Terminating: No active or waiting requests, safe to terminate" >> /proc/1/fd/1
exit 0
else
echo "Terminating: Running: $RUNNING, Waiting: $WAITING" >> /proc/1/fd/1
sleep 5
fi
done
livenessProbe:
failureThreshold: 3
httpGet:
path: /health
port: 8080
scheme: HTTP
initialDelaySeconds: 15
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
name: model-runner
ports:
- containerPort: 8080
name: http
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /health
port: 8080
scheme: HTTP
initialDelaySeconds: 5
periodSeconds: 5
successThreshold: 1
timeoutSeconds: 1
resources:
limits:
cpu: "8"
memory: 16Gi
nvidia.com/gpu: "1"
requests:
cpu: "4"
memory: 8Gi
nvidia.com/gpu: "1"
startupProbe:
failureThreshold: 30
httpGet:
path: /health
port: 8080
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /dev/shm
name: dshm
- mountPath: /workspace/models/
name: model-volume
readOnly: true
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-zbsfx
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
hostname: vllm-speculator-0
initContainers:
- env:
- name: LWS_LEADER_ADDRESS
value: vllm-speculator-0.vllm-speculator.default
- name: LWS_GROUP_SIZE
value: "1"
- name: LWS_WORKER_INDEX
value: "0"
- name: MODEL_SOURCE_TYPE
value: modelhub
- name: MODEL_ID
value: facebook/opt-125m
- name: MODEL_HUB_NAME
value: Huggingface
- name: REVISION
value: main
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
key: HF_TOKEN
name: modelhub-secret
optional: true
- name: HF_TOKEN
valueFrom:
secretKeyRef:
key: HF_TOKEN
name: modelhub-secret
optional: true
- name: KUBERNETES_SERVICE_HOST
value: api.seven.perfx-k8s.internal.canary.k8s.ondemand.com
image: inftyai/model-loader:v0.0.10
imagePullPolicy: IfNotPresent
name: model-loader-1
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /workspace/models/
name: model-volume
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-zbsfx
readOnly: true
nodeName: ip-10-180-71-146.ec2.internal
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
subdomain: vllm-speculator
terminationGracePeriodSeconds: 130
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- emptyDir:
medium: Memory
sizeLimit: 2Gi
name: dshm
- emptyDir: {}
name: model-volume
- name: kube-api-access-zbsfx
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2025-06-02T07:40:17Z"
status: "True"
type: PodReadyToStartContainers
- lastProbeTime: null
lastTransitionTime: "2025-06-02T07:40:16Z"
message: 'containers with incomplete status: [model-loader-1]'
reason: ContainersNotInitialized
status: "False"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2025-06-02T07:40:16Z"
message: 'containers with unready status: [model-runner]'
reason: ContainersNotReady
status: "False"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2025-06-02T07:40:16Z"
message: 'containers with unready status: [model-runner]'
reason: ContainersNotReady
status: "False"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2025-06-02T07:40:16Z"
status: "True"
type: PodScheduled
containerStatuses:
- image: vllm/vllm-openai:v0.7.3
imageID: ""
lastState: {}
name: model-runner
ready: false
restartCount: 0
started: false
state:
waiting:
reason: PodInitializing
hostIP: 10.180.71.146
hostIPs:
- ip: 10.180.71.146
initContainerStatuses:
- containerID: containerd://90b6cc1592273be115f75c8d813812884caa78f7dcc28b22205852747eea701c
image: docker.io/inftyai/model-loader:v0.0.10
imageID: docker.io/inftyai/model-loader@sha256:b67a8bb3acbc496a62801b2110056b9774e52ddc029b379c7370113c7879c7d9
lastState: {}
name: model-loader-1
ready: false
restartCount: 0
started: true
state:
running:
startedAt: "2025-06-02T07:40:17Z"
phase: Pending
podIP: 100.64.8.29
podIPs:
- ip: 100.64.8.29
qosClass: Burstable
startTime: "2025-06-02T07:40:16Z"
I still think we should leave the knob at the Playground level. For example, in the future when we have a cache layer, people could still choose to load models from S3 directly or from the cache, especially for performance comparison tests.
But we can follow up on this later, of course.
I personally don't think this is a typical scenario: loading different models with different approaches.
Ok, I'll refactor it tomorrow.
modelInfo["DraftModelPath"] = modelSource.NewModelSourceProvider(p.models[1]).ModelPath() | ||
skipModelLoader = false | ||
draftModel := p.models[1] | ||
if annotations := draftModel.GetAnnotations(); annotations != nil { |
The overall logic would be simpler if we extracted the annotation from the isvc.
// Return once not the main model, because all the below has already been injected.
if index != 0 {
return
func spreadEnvToInitContainer(containerEnv []corev1.EnvVar, initContainer *corev1.Container) {
Again, I would like to see the same behavior across all the models.
@kerthcet I have moved the annotation to the Playground. Please review it, thanks.
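Roughly, the annotated Playground now looks like this (a sketch; the modelClaims shape is borrowed from the speculative-decoding example above):

# Sketch of a Playground carrying the llmaz.io/skip-model-loader annotation,
# so no model-loader initContainer is injected and vLLM streams the model
# directly from remote storage. Names and field layout are assumptions based
# on the examples earlier in this thread.
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: deepseek-r1-distill-qwen-1-5b
  annotations:
    llmaz.io/skip-model-loader: "true"
spec:
  replicas: 1
  modelClaims:
    models:
      - name: deepseek-r1-distill-qwen-1-5b
        role: main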
We have no GPU nodes for tests now; can we just comment out the asserts about service ready? We can uncomment them in the future.
@@ -168,12 +168,33 @@ func buildWorkloadApplyConfiguration(service *inferenceapi.Service, models []*co
func injectModelProperties(template *applyconfigurationv1.LeaderWorkerTemplateApplyConfiguration, models []*coreapi.OpenModel, service *inferenceapi.Service) {
isMultiNodesInference := template.LeaderTemplate != nil

// Skip model-loader initContainer if llmaz.io/skip-model-loader annotation is set.
skipModelLoader := false
if annotations := service.GetAnnotations(); annotations != nil {
Let's make it a helper func so we can reuse it:
func SkipModelLoader(obj metav1.Object) bool {
if annotations := obj.GetAnnotations(); annotations != nil {
return annotations[inferenceapi.SkipModelLoaderAnnoKey] == "true"
}
return false
}
Done
test/e2e/playground_test.go
})

ginkgo.It("Deploy S3 model with llmaz.io/skip-model-loader annotation", func() {
model := wrapper.MakeModel("deepseek-r1-distill-qwen-1-5b").FamilyName("deepseek").ModelSourceWithURI("s3://test-bucket/DeepSeek-R1-Distill-Qwen-1.5B").Obj()
Is s3://test-bucket/DeepSeek-R1-Distill-Qwen-1.5B a valid address? If not, I guess it will never succeed.
No, can we have a community-dedicated S3 bucket for testing?
Right now no. I'll try to make one this weekend.
@@ -244,6 +244,134 @@ func ValidateServicePods(ctx context.Context, k8sClient client.Client, service *
}).Should(gomega.Succeed())
}

// ValidateSkipModelLoaderService validates the Playground resource with llmaz.io/skip-model-loader annotation
func ValidateSkipModelLoaderService(ctx context.Context, k8sClient client.Client, service *inferenceapi.Service) {
Can we just remove the ValidateSkipModelLoaderService but make ValidateSkipModelLoader part of the ValidateService? I just saw a lot of similar asserts here.
Done
vllm-cpu is another option, but we would need to build the image ourselves.
vLLM has pre-built CPU images; let's try those: https://docs.vllm.ai/en/stable/getting_started/installation/cpu.html#pre-built-images
A kind reminder: we'll cut a release this weekend to catch KubeCon HK. I hope to have this feature included.
The vllm-cpu image can't serve the model successfully; the container keeps terminating with exit code 132 (SIGILL), which typically points to an unsupported CPU instruction set on the node.

# logs
opt-125m-0 model-runner [W605 13:51:49.898291639 OperatorEntry.cpp:154] Warning: Warning only once for all operators, other operators may also be overridden.
opt-125m-0 model-runner Overriding a previously registered kernel for the same operator and the same dispatch key
opt-125m-0 model-runner operator: aten::_addmm_activation(Tensor self, Tensor mat1, Tensor mat2, *, Scalar beta=1, Scalar alpha=1, bool use_gelu=False) -> Tensor
opt-125m-0 model-runner registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
opt-125m-0 model-runner dispatch key: AutocastCPU
opt-125m-0 model-runner previous kernel: registered at /pytorch/aten/src/ATen/autocast_mode.cpp:327
opt-125m-0 model-runner new kernel: registered at /opt/workspace/ipex-cpu-dev/csrc/cpu/autocast/autocast_mode.cpp:112 (function operator())
# container status
containerStatuses:
- containerID: containerd://bfa9555eca8808aabd01e430ebdfb3edc7d1a1ecf0fac6eb1daf4ba897cbe1bc
image: public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.9.0
imageID: public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo@sha256:8a7db9cd4fd8550f3737a24e64eab7486c55a45975f73305fa6bbd7b93819de4
lastState:
terminated:
containerID: containerd://526ca024155bdc471e181e3952ecd131f23b406edc44a6a1226a4148a1b32db8
exitCode: 132
finishedAt: "2025-06-05T13:53:56Z"
reason: Error
startedAt: "2025-06-05T13:53:42Z"
@kerthcet All tests have passed now, please review the PR again. Thanks.
I'll take a look tomorrow morning... actually, later this morning.
What this PR does / why we need it
Add a new config runai-streamer in the vLLM BackendRuntime to allow loading models with Run:ai Model Streamer, which improves model loading times. Currently, only vLLM supports Run:ai Model Streamer.

Which issue(s) this PR fixes
Fixes #352
Special notes for your reviewer
Does this PR introduce a user-facing change?