Your current environment
Error processing images with Qwen3-VL
🐛 Describe the bug
I am running vLLM with two RTX 5070 Ti GPUs and the qwen3-vl-4b-instruct model. After roughly 10 sequential image-processing requests the service crashes, and some responses degenerate into the same text repeated endlessly.
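The client code is not part of this report, so the loop below is only a rough sketch of the kind of sequential image request that triggers the crash. The image file, prompt, and request count are placeholders; the model name matches --served-model-name in the compose file below.

```python
# Hedged reproduction sketch: send sequential image requests to the
# OpenAI-compatible endpoint served by vLLM. Paths and prompt are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Encode a local test image as a base64 data URL.
with open("sample.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

for i in range(20):  # the crash reportedly appears after ~10 sequential requests
    resp = client.chat.completions.create(
        model="qwen3-vl",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image."},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            }
        ],
        max_tokens=512,
    )
    print(i, resp.choices[0].message.content[:80])
```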
My docker-compose:
```yaml
services:
  vllm-server:
    image: dockerhub.timeweb.cloud/vllm/vllm-openai:latest
    container_name: vllm-server
    ports:
      - "8000:8000"
    volumes:
      - /home/notires/huggingface:/root/.local/share/vllm
      - ./cache:/root/.cache/huggingface
    environment:
      - CUDA_VISIBLE_DEVICES=0,1
      - TOKENIZERS_PARALLELISM=false
      - HF_HOME=/root/.cache/huggingface
    command:
      - --model=/root/.local/share/vllm/qwen3-vl-4b-instruct
      - --served-model-name=qwen3-vl
      - --host=0.0.0.0
      - --port=8000
      - --tensor-parallel-size=2
      - --download-dir=/root/.local/share/vllm
      - --max-num-seqs=128
      - --gpu-memory-utilization=0.85
      - --max-model-len=16384
      - --enable-chunked-prefill
      - --enable-prefix-caching
      - --trust-remote-code
      - --mm-encoder-tp-mode=data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    restart: unless-stopped
```
Error log:
vllm-server | (APIServer pid=1) INFO: 172.19.0.1:44742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
vllm-server | (Worker_TP1 pid=117) INFO 11-04 11:30:20 [multiproc_executor.py:558] Parent process exited, terminating worker
vllm-server | (Worker_TP0 pid=116) INFO 11-04 11:30:20 [multiproc_executor.py:558] Parent process exited, terminating worker
vllm-server | (APIServer pid=1) ERROR 11-04 11:30:20 [core_client.py:564] Engine core proc EngineCore_DP0 died unexpectedly, shutting down client.
vllm-server | (Worker_TP1 pid=117) INFO 11-04 11:30:20 [multiproc_executor.py:599] WorkerProc shutting down.
vllm-server | (Worker_TP0 pid=116) INFO 11-04 11:30:20 [multiproc_executor.py:599] WorkerProc shutting down.
vllm-server | (APIServer pid=1) ERROR 11-04 11:30:20 [async_llm.py:480] AsyncLLM output_handler failed.
vllm-server | (APIServer pid=1) ERROR 11-04 11:30:20 [async_llm.py:480] Traceback (most recent call last):
vllm-server | (APIServer pid=1) ERROR 11-04 11:30:20 [async_llm.py:480] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 439, in output_handler
vllm-server | (APIServer pid=1) ERROR 11-04 11:30:20 [async_llm.py:480] outputs = await engine_core.get_output_async()
vllm-server | (APIServer pid=1) ERROR 11-04 11:30:20 [async_llm.py:480] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-server | (APIServer pid=1) ERROR 11-04 11:30:20 [async_llm.py:480] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 846, in get_output_async
vllm-server | (APIServer pid=1) ERROR 11-04 11:30:20 [async_llm.py:480] raise self._format_exception(outputs) from None
vllm-server | (APIServer pid=1) ERROR 11-04 11:30:20 [async_llm.py:480] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
vllm-server | (APIServer pid=1) INFO: 172.19.0.1:44742 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
vllm-server | (APIServer pid=1) INFO: 172.19.0.1:35672 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
vllm-server | (APIServer pid=1) INFO: Shutting down
vllm-server | (APIServer pid=1) INFO: Waiting for application shutdown.
vllm-server | (APIServer pid=1) INFO: Application shutdown complete.
vllm-server | (APIServer pid=1) INFO: Finished server process [1]
vllm-server | /usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
vllm-server | warnings.warn('resource_tracker: There appear to be %d '
vllm-server exited with code 0 (restarting)
vllm-server | INFO 11-04 11:30:30 [__init__.py:216] Automatically detected platform cuda.
vllm-server | (APIServer pid=1) INFO 11-04 11:30:31 [api_server.py:1839] vLLM API server version 0.11.0
vllm-server | (APIServer pid=1) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
vllm-server | (APIServer pid=1) INFO 11-04 11:30:31 [utils.py:233] non-default args: {'host': '0.0.0.0', 'model': '/root/.local/share/vllm/qwen3-vl-4b-instruct', 'trust_remote_code': True, 'max_model_len': 16384, 'served_model_name': ['qwen3-vl'], 'download_dir': '/root/.local/share/vllm', 'tensor_parallel_size': 2, 'gpu_memory_utilization': 0.85, 'enable_prefix_caching': True, 'mm_encoder_tp_mode': 'data', 'max_num_seqs': 128, 'enable_chunked_prefill': True}
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.