
[Bug]: strange repetition issue #788

Open
ehartford opened this issue Oct 23, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@ehartford

Your current environment

The output of `python env.py`:

```text
Your output of `python env.py` here
```

🐛 Describe the bug

There is strangely repetitive output from aphrodite that doesn't happen in other engines such as tabbyAPI.

curl output:

curl http://eric-quad:2242/v1/chat/completions   -H "Content-Type: application/json"   -d '{
            "messages": [
              {
                "role": "user",
                "content": "Why is the sky blue?"
              }
            ], "model": "DolphinPod/dolphin-2.9.1-llama3.1-8b"
          }'
{"id":"chat-53ed73f2ba3143c1b7cb33f4c2cafdc2","object":"chat.completion","created":1729706300,"model":"DolphinPod/dolphin-2.9.1-llama3.1-8b","choices":[{"index":0,"message":{"role":"assistant","content":"The sky is blue because the color blue is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. 
The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase for the language. The word is a general phrase","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":14,"total_tokens":607,"completion_tokens":593},"prompt_logprobs":null}⏎
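For anyone triaging this, a quick way to quantify the degeneration (a standalone diagnostic sketch, not part of either engine) is to measure the fraction of repeated word n-grams in a completion; healthy answers score near zero, while looping output like the above scores near one:

```python
from collections import Counter

def repeated_ngram_fraction(text: str, n: int = 4) -> float:
    """Fraction of word n-grams that occur more than once in the text."""
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

# The looping aphrodite output vs. a healthy answer:
bad = "The word is a general phrase for the language. " * 20
good = ("The sky appears blue due to a phenomenon called Rayleigh "
        "scattering of shorter wavelengths by air molecules.")
print(repeated_ngram_fraction(bad))   # close to 1.0 for the looping output
print(repeated_ngram_fraction(good))  # 0.0 for the healthy answer
```

Running this over a batch of completions from each engine would make it easy to tell whether the repetition is systematic or prompt-dependent.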

aphrodite output:

(base) eric@eric-desk:~/aphrodite-engine$ ./runtime.sh aphrodite run DolphinPod/dolphin-2.9.1-llama3.1-8b --max-model-len 8000
INFO:     Multiprocessing frontend to use ipc:///tmp/b1c097e5-7c6d-4f8e-b7ae-81e0c53ca1fe for RPC Path.
INFO:     Started engine process with PID 76774
INFO:     -------------------------------------------------------------------------------------
INFO:     Initializing Aphrodite Engine (v0.6.2.post1 commit dcb794a3) with the following config:
INFO:     Model = 'DolphinPod/dolphin-2.9.1-llama3.1-8b'
INFO:     DataType = torch.bfloat16
INFO:     Tensor Parallel Size = 1
INFO:     Pipeline Parallel Size = 1
INFO:     Disable Custom All-Reduce = False
INFO:     Context Length = 8000
INFO:     Enforce Eager Mode = False
INFO:     Prefix Caching = False
INFO:     Device = device(type='cuda')
INFO:     Guided Decoding Backend = DecodingConfig(guided_decoding_backend='outlines')
INFO:     -------------------------------------------------------------------------------------
INFO:     Loading model DolphinPod/dolphin-2.9.1-llama3.1-8b...
INFO:     Using model weights format ['*.safetensors']
⠇ Loading modules... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸ 14/14 100% 0:00:02
INFO:     Model weights loaded in 2.80 seconds.
INFO:     Weights memory usage: 14.99 GiB
INFO:     Profiling peak memory usage...
INFO:     Model profiling took 1.81 seconds.
INFO:     Estimated KV Cache memory usage: 1.73 GB
INFO:     # GPU blocks: 2371, # CPU blocks: 2048
INFO:     Minimum concurrency: 4.74x
INFO:     Maximum sequence length allowed in the cache: 37936
INFO:     Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the
model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO:     CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing
`gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO:     Graph capturing finished in 10.64 secs
INFO:     Aphrodite to use /tmp/tmp16t2tw7_ as PROMETHEUS_MULTIPROC_DIR
WARNING:  embedding_mode is False. Embedding API will not work.
INFO:     Started server process [76365]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:2242 (Press CTRL+C to quit)
INFO:     172.17.0.2:56292 - "GET /v1/models HTTP/1.1" 200 OK
INFO:     Received request chat-53ed73f2ba3143c1b7cb33f4c2cafdc2: prompt: , params: SamplingParams(temperature=0.7,
max_tokens=7986), prompt_token_ids: [], lora_request: None, prompt_adapter_request: None.
INFO:     Added request chat-53ed73f2ba3143c1b7cb33f4c2cafdc2.
INFO:     Avg prompt throughput: 1.6 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs,
Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 48.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs,
Pending: 0 reqs, GPU KV cache usage: 0.7%, CPU KV cache usage: 0.0%.
INFO:     Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 47.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs,
Pending: 0 reqs, GPU KV cache usage: 1.3%, CPU KV cache usage: 0.0%.
INFO:     Finished request chat-53ed73f2ba3143c1b7cb33f4c2cafdc2.
INFO:     192.168.86.21:49866 - "POST /v1/chat/completions HTTP/1.1" 200 OK
@ehartford ehartford added the bug Something isn't working label Oct 23, 2024
@ehartford

ehartford commented Oct 23, 2024

Here is the same query, same model, run on tabbyapi:

curl output:

(base) eric@eric-desk:~$ curl http://127.0.0.1:5000/v1/chat/completions   -H "Content-Type: application/json" -H "Authorization: Bearer bd23c74aae243dc7e1302b2cccd93938"   -d '
           { "messages": [
              {
                "role": "user",
                "content": "Why is the sky blue?"
              }
            ]
          }'
{"id":"chatcmpl-b25229683a9d4f06b4102fc989852ed5","choices":[{"index":0,"finish_reason":"stop","stop_str":"<|im_end|>","message":{"role":"assistant","content":"The sky appears blue due to a phenomenon called Rayleigh scattering. As sunlight enters Earth's atmosphere, it interacts with gas molecules and small particles like dust and water droplets. The short-wavelength colors (blue and violet) are scattered more effectively due to their smaller wavelengths, while longer-wavelength colors (red and yellow) are more likely to be absorbed or pass through the atmosphere without being scattered. This causes the blue light to be scattered in all directions, making the sky appear blue during the day.","tool_calls":null},"logprobs":null}],"created":1729708671,"model":"7959095b9f1bc6a003894084bcfdcb2b6809089a","object":"chat.completion","usage":{"prompt_tokens":15,"completion_tokens":101,"total_tokens":116}}(base) eric@eric-desk:~$

tabby output

(tabby) eric@eric-desk:~/tabbyAPI$ python main.py --model-name DolphinPod_dolphin-2.9.1-llama3.1-8b --max-seq-len 8192 --cache-size 8192
INFO:     The 'config.yml' file cannot be found
INFO:     ExllamaV2 version: 0.2.3
INFO:     Your API key is: bd23c74aae243dc7e1302b2cccd93938
INFO:     Your admin key is: 2558718950dc164504436d8f6df539d4
INFO:
INFO:     If these keys get compromised, make sure to delete api_tokens.yml and restart the server. Have fun!
INFO:     Generation logging is disabled
WARNING:  Draft model is disabled because a model name wasn't provided. Please check your config.yml!
WARNING:  The given cache_size (8192) is less than 2 * max_seq_len and may be too small for requests using CFG.
WARNING:  Ignore this warning if you do not plan on using CFG.
INFO:     Attempting to load a prompt template if present.
INFO:     Using template "from_tokenizer_config" for chat completions.
INFO:     Loading model:
/home/eric/.cache/huggingface/hub/models--DolphinPod--dolphin-2.9.1-llama3.1-8b/snapshots/7959095b9f1bc6a003894084bcfdcb2b68090
89a
INFO:     Loading with autosplit
Loading model modules ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 67/67 0:00:00
INFO:     Model successfully loaded.
INFO:     Developer documentation: http://127.0.0.1:5000/redoc
INFO:     Starting OAI API
INFO:     Completions: http://127.0.0.1:5000/v1/completions
INFO:     Chat completions: http://127.0.0.1:5000/v1/chat/completions
INFO:     Started server process [80573]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:5000 (Press CTRL+C to quit)
INFO:     Metrics (ID: b25229683a9d4f06b4102fc989852ed5): 102 tokens generated in 2.42 seconds (Queue: 0.0 s, Process: 0 cached
tokens and 15 new tokens at 233.38 T/s, Generate: 43.36 T/s, Context: 15 tokens)
INFO:     Finished chat completion request b25229683a9d4f06b4102fc989852ed5
INFO:     127.0.0.1:36474 - "POST /v1/chat/completions HTTP/1.1" 200

@ehartford

I will try to repro in TGI and vLLM.

@AlpinDale

AlpinDale commented Oct 23, 2024

I've been trying to replicate this, with some success:

DolphinPod/dolphin-2.9.1-llama3.1-8b:

{
  "id": "chat-93aaefa981c046d497a5699308ad094d",
  "object": "chat.completion",
  "created": 1729716396,
  "model": "DolphinPod/dolphin-2.9.1-llama3.1-8b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The sky is blue because the color blue is defined by the color of the blue sky is the the the the the the the is the.",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "total_tokens": 43,
    "completion_tokens": 29
  },
  "prompt_logprobs": null
}

NousResearch/Meta-Llama-3.1-8B-Instruct:

{
  "id": "chat-0a08ca7350f74464a6f7a4d0c0ca420a",
  "object": "chat.completion",
  "created": 1729716528,
  "model": "NousResearch/Meta-Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The sky appears blue to us during the daytime because of a phenomenon called scattering. When sunlight enters Earth's atmosphere, it encounters tiny molecules of gases such as nitrogen and oxygen. These molecules scatter the shorter, blue wavelengths of light more than the longer, red wavelengths, due to a principle called Rayleigh scattering. This scattering makes the blue light more visible to our eyes, giving the sky its blue appearance.\n\nHere's a simplified explanation of the process:\n\n1. **Sunlight reaches the Earth**: Sunlight is composed of different colors, with each corresponding to a specific wavelength. The colors range from red (longer wavelengths) to blue (shorter wavelengths).\n\n2. **Scattering occurs**: When sunlight enters Earth's atmosphere, it encounters molecules of gases like nitrogen and oxygen. These molecules scatter or bounce off the light in different ways, depending on their size and the wavelength of light they encounter.\n\n3. **Rayleigh scattering**: The scattering of light by small molecules or particles is known as Rayleigh scattering. This is more pronounced for shorter wavelengths, which are in the blue part of the spectrum. Therefore, the blue light is scattered more than the other colors.\n\n4. **Scattered light reaches the observer**: Because blue light is scattered more, from the observer's perspective, more blue light reaches the eyes than other wavelengths. This results in the sky appearing blue.\n\nIt's worth noting that this effect is especially pronounced during the daytime, when the sun is high in the sky and its light has to travel through more of the atmosphere to reach the observer. The blue visibility can also be maximized with clearer skies, as clouds can scatter light in different ways, altering the apparent color of the sky.\n\nIn the context of different environmental and celestial conditions, the sky's color can be perceived differently. 
For example, during sunrise and sunset, the color of the sky can change or appear different due to the different angles at which the sun's light enters the atmosphere, scattering the light in various ways. At dawn and dusk, the sky often appears more red or orange because the light has to travel through a greater distance, scattering the shorter wavelengths of blue light and increasing the dominance of longer wavelengths like red and orange.",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 16,
    "total_tokens": 461,
    "completion_tokens": 445
  },
  "prompt_logprobs": null
}

I'll try and see what's happening with this model in particular.
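One way to narrow it down (a hedged suggestion, assuming both servers expose the OpenAI-style `logprobs`/`top_logprobs` chat-completions fields) would be to run the same prompt greedily against both engines and find the first token where they diverge; at temperature 0 any divergence points at the logits rather than sampling. A minimal sketch:

```python
import json

# Payload to send to both servers; temperature 0 makes decoding greedy,
# so any divergence between engines reflects the logits themselves.
payload = {
    "model": "DolphinPod/dolphin-2.9.1-llama3.1-8b",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "temperature": 0,
    "logprobs": True,   # assumes OpenAI-compatible logprobs support
    "top_logprobs": 5,
}

def first_divergence(tokens_a, tokens_b):
    """Index of the first position where two greedy decodes differ,
    or None if one is a prefix of the other and lengths match."""
    for i, (a, b) in enumerate(zip(tokens_a, tokens_b)):
        if a != b:
            return i
    if len(tokens_a) != len(tokens_b):
        return min(len(tokens_a), len(tokens_b))
    return None

print(json.dumps(payload, indent=2))
print(first_divergence(["The", "sky", "is"], ["The", "sky", "appears"]))  # 2
```

Comparing the `top_logprobs` at that first divergent position between aphrodite and tabbyAPI should show whether the model weights, the prompt template, or the sampler is at fault.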

@AlpinDale

Whoops, sorry, misinput (accidentally closed the issue).

@AlpinDale AlpinDale reopened this Oct 23, 2024
@ehartford

Any idea what might cause this?

@ehartford

@AlpinDale I would like to offer a $1,000 USD bounty for this issue.
