
An occasionally triggered KeyError exception can cause the entire service to crash. #1496

Open
yinghaodang opened this issue May 14, 2024 · 0 comments

Describe the bug

I use docker-compose to deploy Xinference. Most of the time it works fine, but at some random moment a KeyError is triggered, causing the entire service to fail. Here are my steps.

To Reproduce

version: '3.8'

services:
  xinference-local:
    image: xprobe/xinference:v0.11.0
    container_name: xinference-local
    ports:
      - 9999:9997
    environment:
      - XINFERENCE_MODEL_SRC=modelscope
      - XINFERENCE_HOME=/root/MODEL_PATH
    volumes:
      - /home/ecidi/MODEL_PATH:/root/MODEL_PATH
    restart: always
    shm_size: '512g'
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: xinference-local -H 0.0.0.0 --log-level debug
    networks:
      - xinference-local
networks:
  xinference-local:
    driver: bridge
    ipam:
      driver: default
      config:
        - subnet: "172.30.2.0/24"

This file is saved as xinference-local.yml, and the service is brought up with docker-compose -f xinference-local.yml up -d.
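
For completeness, a small sanity check I use to confirm the service is reachable on the mapped port (my own snippet, not part of the compose file; host port 9999 from the mapping above, adjust if yours differs). It queries the OpenAI-compatible model list endpoint:

# Sanity check: once the container is up, the OpenAI-compatible API on the
# mapped host port should respond and list the launched models.
import requests

resp = requests.get("http://localhost:9999/v1/models", timeout=10)
resp.raise_for_status()
print(resp.json())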

After a prolonged period of usage (calls issued from code), the following error appears in the log:

xinference-local  | , generate config: {'temperature': 0.1, 'stream': True, 'stop': ['<|endoftext|>', '<|im_start|>', '<|im_end|>'], 'stop_token_ids': [151643, 151644, 151645]}
xinference-local  | 2024-05-14 08:08:04,096 xinference.core.model 113 DEBUG    After request chat, current serve request count: 0 for the model qwen1.5-chat
xinference-local  | 2024-05-14 08:08:04,097 xinference.core.model 113 DEBUG    Leave wrapped_func, elapsed time: 0 s
xinference-local  | 2024-05-14 08:08:04,099 xinference.api.restful_api 1 ERROR    Chat completion stream got an error: b'\xc5a\xae \t\x1b\xf4\x8a\xa0\x95\x9e\xe4\xc3\xd5\xb1\xc5BG\x11?]:\xb1\x0c\xb8\x83\xb3\x9d\xb7\xa2}0'
xinference-local  | Traceback (most recent call last):
xinference-local  |   File "/opt/conda/lib/python3.10/site-packages/xinference/api/restful_api.py", line 1365, in stream_results
xinference-local  |     async for item in iterator:
xinference-local  |   File "/opt/conda/lib/python3.10/site-packages/xoscar/api.py", line 335, in __anext__
xinference-local  |     self._actor_ref = await actor_ref(
xinference-local  |   File "/opt/conda/lib/python3.10/site-packages/xoscar/api.py", line 125, in actor_ref
xinference-local  |     return await ctx.actor_ref(*args, **kwargs)
xinference-local  |   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 197, in actor_ref
xinference-local  |     result = await self._wait(future, actor_ref.address, message)
xinference-local  |   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 115, in _wait
xinference-local  |     return await future
xinference-local  |   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/core.py", line 87, in _listen
xinference-local  |     future = self._client_to_message_futures[client].pop(message.message_id)
xinference-local  | KeyError: b'\xc5a\xae \t\x1b\xf4\x8a\xa0\x95\x9e\xe4\xc3\xd5\xb1\xc5BG\x11?]:\xb1\x0c\xb8\x83\xb3\x9d\xb7\xa2}0'
xinference-local  | INFO 05-14 08:08:04 metrics.py:229] Avg prompt throughput: 1085.6 tokens/s, Avg generation throughput: 42.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 43.6%, CPU KV cache usage: 0.0%
xinference-local  | INFO 05-14 08:08:04 async_llm_engine.py:120] Finished request 15c42c7e-11c9-11ef-a06b-0242ac1e0202.
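
The failing line in the traceback pops a pending future keyed by the reply's message_id; a minimal sketch of why that raises (plain dict semantics only, not xoscar internals):

# Illustration only: dict.pop(key) without a default raises KeyError when the
# key is absent. In the traceback above, the reply's message_id apparently has
# no pending future registered for that client connection anymore.
pending_futures = {}  # stands in for self._client_to_message_futures[client]
message_id = b"\xc5a\xae ..."  # truncated raw id from the log

try:
    future = pending_futures.pop(message_id)
except KeyError:
    print("no pending future for this message_id -> the KeyError seen above")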

I want to know how this error is triggered. Is it caused by excessive memory usage, or by dirty data in a request?
I deployed the qwen1.5-14b-chat model in its entirety on a single A100. The exception is triggered after roughly 6 hours of continuous calls to the model. Each prompt is different, and after redeployment the same prompt does not trigger the exception.
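
For reference, a rough sketch of my client-side usage pattern (my own test harness, not Xinference code): varied prompts are streamed continuously against the deployed model through the OpenAI-compatible endpoint.

# Rough reproduction sketch: continuously stream chat completions with varied
# prompts. Host port and model uid match the compose file and debug log above.
import itertools
import requests

URL = "http://localhost:9999/v1/chat/completions"

for i in itertools.count():
    payload = {
        "model": "qwen1.5-chat",  # model uid from the debug log
        "messages": [{"role": "user", "content": f"prompt #{i}: ..."}],
        "temperature": 0.1,       # matches the generate config in the log
        "stream": True,
    }
    with requests.post(URL, json=payload, stream=True, timeout=600) as resp:
        resp.raise_for_status()
        for _line in resp.iter_lines():
            pass  # consume the SSE stream; the server-side KeyError shows up after hours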

Expected behavior

The model should not get stuck, and even if one model does get stuck, it should not affect the others.

Additional context

GPU memory usage is 38770MiB / 40960MiB, and GPU utilization is roughly 80%-90% based on visual observation.

@XprobeBot XprobeBot added the gpu label May 14, 2024
@XprobeBot XprobeBot modified the milestones: v0.11.1, v0.11.2 May 14, 2024
@XprobeBot XprobeBot modified the milestones: v0.11.2, v0.11.3 May 24, 2024
@XprobeBot XprobeBot modified the milestones: v0.11.3, v0.11.4 May 31, 2024