Describe the bug
I deploy Xinference with docker-compose. Most of the time it works fine, but at some random moment a KeyError is raised and the entire service fails. Here are my steps.
To Reproduce
The compose file is named xinference-local.yml; I bring the stack up with docker-compose -f xinference-local.yml up -d.
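The compose file itself is not included in this report. For context, a minimal single-GPU xinference-local.yml might look like the sketch below; the image tag, command, port, and volume path are assumptions based on Xinference's published Docker image, not taken from this report:

```yaml
# Hypothetical xinference-local.yml -- a minimal single-node sketch.
# The reporter's actual file is not shown in the issue.
services:
  xinference-local:
    image: xprobe/xinference:latest        # assumed image tag
    container_name: xinference-local
    command: xinference-local -H 0.0.0.0   # listen on all interfaces
    ports:
      - "9997:9997"                        # Xinference's default REST port
    volumes:
      - ./xinference-cache:/root/.xinference   # persist downloaded models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```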
After a prolonged period of usage (driving the model from code), the following error appears in the log:
xinference-local | , generate config: {'temperature': 0.1, 'stream': True, 'stop': ['<|endoftext|>', '<|im_start|>', '<|im_end|>'], 'stop_token_ids': [151643, 151644, 151645]}
xinference-local | 2024-05-14 08:08:04,096 xinference.core.model 113 DEBUG After request chat, current serve request count: 0 for the model qwen1.5-chat
xinference-local | 2024-05-14 08:08:04,097 xinference.core.model 113 DEBUG Leave wrapped_func, elapsed time: 0 s
xinference-local | 2024-05-14 08:08:04,099 xinference.api.restful_api 1 ERROR Chat completion stream got an error: b'\xc5a\xae \t\x1b\xf4\x8a\xa0\x95\x9e\xe4\xc3\xd5\xb1\xc5BG\x11?]:\xb1\x0c\xb8\x83\xb3\x9d\xb7\xa2}0'
xinference-local | Traceback (most recent call last):
xinference-local | File "/opt/conda/lib/python3.10/site-packages/xinference/api/restful_api.py", line 1365, in stream_results
xinference-local | async for item in iterator:
xinference-local | File "/opt/conda/lib/python3.10/site-packages/xoscar/api.py", line 335, in __anext__
xinference-local | self._actor_ref = await actor_ref(
xinference-local | File "/opt/conda/lib/python3.10/site-packages/xoscar/api.py", line 125, in actor_ref
xinference-local | return await ctx.actor_ref(*args, **kwargs)
xinference-local | File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 197, in actor_ref
xinference-local | result = await self._wait(future, actor_ref.address, message)
xinference-local | File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 115, in _wait
xinference-local | return await future
xinference-local | File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/core.py", line 87, in _listen
xinference-local | future = self._client_to_message_futures[client].pop(message.message_id)
xinference-local | KeyError: b'\xc5a\xae \t\x1b\xf4\x8a\xa0\x95\x9e\xe4\xc3\xd5\xb1\xc5BG\x11?]:\xb1\x0c\xb8\x83\xb3\x9d\xb7\xa2}0'
xinference-local | INFO 05-14 08:08:04 metrics.py:229] Avg prompt throughput: 1085.6 tokens/s, Avg generation throughput: 42.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 43.6%, CPU KV cache usage: 0.0%
xinference-local | INFO 05-14 08:08:04 async_llm_engine.py:120] Finished request 15c42c7e-11c9-11ef-a06b-0242ac1e0202.
I want to know what triggers this error. Is it caused by excessive memory usage, or by malformed data in a request?
I deployed the full qwen1.5-14b-chat model on a single A100. The exception is triggered after calling the model continuously for about 6 hours. Every prompt is different, and after redeployment the same prompt does not trigger the exception again.
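For what it's worth, the traceback points at bookkeeping in xoscar rather than at GPU memory: a response arrives carrying a message_id that has no matching entry in the per-client futures map, so the unconditional pop raises KeyError. The sketch below is a deliberately simplified illustration of that pattern (it is not xoscar's actual code, and the names are invented for the example):

```python
# Simplified model of xoscar's _client_to_message_futures bookkeeping.
# A late or duplicate response whose pending future was already removed
# (e.g. after a timeout or a dropped connection) triggers the KeyError.
client_to_message_futures = {"client-1": {b"msg-1": "pending-future"}}

def on_response(client, message_id):
    futures = client_to_message_futures[client]
    # Unconditional pop: raises KeyError if the entry is already gone.
    return futures.pop(message_id)

on_response("client-1", b"msg-1")      # normal case: future handed back
try:
    on_response("client-1", b"msg-1")  # same id again: entry already popped
except KeyError:
    print("orphan response with no pending future")
```

This is consistent with the observation that redeploying clears the problem: the stale state lives in the supervisor process, not in the prompt data.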
Expected behavior
The model should not hang, and even if one model instance hangs, it should not affect the others.
Additional context
GPU memory usage is 38770MiB / 40960MiB, and GPU utilization hovers around 80%-90% by visual observation.