Multibyte UTF-8 Characters Broken in Streaming Mode (� Substitution) (A5000, H100) #4370

Nurgl · 2025-02-27T09:36:18Z

Description
When using Triton Inference Server in streaming mode (generate_stream endpoint) with REST API or gRPC, multibyte UTF-8 characters (e.g., emojis or non-ASCII symbols like Cyrillic) are incorrectly split and replaced with the "�" symbol in the response. This issue is observed across different versions of Triton and is visible in raw server output (e.g., via curl) as well as client-side processing. Notably, this problem does not occur in non-streaming mode, where the response is returned as a single, correctly encoded UTF-8 string.

The problem appears to occur before the decoding stage on the server side: multibyte UTF-8 sequences are not preserved as complete units during streaming. As a result, the text_output field in the response contains "�" for affected characters, and client-side buffering cannot recover the original data because the substitution happens before transmission.
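The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not Triton's actual code path: it assumes that a tokenizer splits one emoji across three byte-level tokens (the split points are hypothetical) and that each token is decoded to text on its own. It then shows that an incremental decoder, which buffers incomplete sequences, keeps the character intact.

```python
import codecs

emoji = "🧘"                       # U+1F9D8, four bytes in UTF-8
raw = emoji.encode("utf-8")        # b'\xf0\x9f\xa7\x98'

# Hypothetical token boundaries (an assumption for illustration): the four
# bytes are spread over three tokens, as byte-level vocabularies can do.
pieces = [raw[:2], raw[2:3], raw[3:]]

# Decoding each piece independently: no piece is valid UTF-8 on its own,
# so every chunk ends up containing U+FFFD and the emoji is lost.
naive = [p.decode("utf-8", errors="replace") for p in pieces]

# Incremental decoding: incomplete sequences are held back until the
# remaining bytes arrive, so the character is emitted once, intact.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
streamed = [decoder.decode(p) for p in pieces]
streamed_tail = decoder.decode(b"", final=True)  # flush at end of stream

print(naive)
print(streamed)  # ['', '', '🧘']
```

With this approach some streamed chunks are empty strings, so a server using it would either skip empty chunks or send them as-is.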

curl -X POST localhost:8000/v2/models/ensemble/generate_stream -d '{
    "text_input": "<|im_start|>system <|im_end|> <|im_start|>user Write a poem using the words click Nutcracker and yoga. Format it with emojis.<|im_end|><|im_start|>assistant ", 
    "parameters": {
      "max_tokens": 512,
      "temperature": 0.7,
      "stream": true,
      "top_p": 0.8,
      "top_k": 1,
      "length_penalty": 0.8,
      "beam_width": 1,
      "stop_words":["<|end|>"]
    }
}'

Response:
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"✨"}

data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"🌟"}

data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":" **"}

data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"Click"}

data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":","}

data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":" Nut"}

data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"cr"}

data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"acker"}

data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":","}

data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":" and"}

data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":" Yoga"}

data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"**"}

data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":" �"}

data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"�"}

data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"�"}

data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"✨"}

data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"\n\n"}
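To check a captured stream for this problem without reading it by eye, the "data:" lines can be parsed and each text_output scanned for U+FFFD. This is a small sketch; the helper names text_chunks and find_corrupted are hypothetical, and the sample lines are trimmed copies of the output above (most fields omitted).

```python
import json

REPLACEMENT = "\ufffd"  # the "�" character


def text_chunks(sse_lines):
    """Yield the text_output value of every 'data: {...}' line."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines
        payload = json.loads(line[len("data:"):])
        yield payload.get("text_output", "")


def find_corrupted(sse_lines):
    """Return (chunk_index, text) for chunks containing U+FFFD."""
    return [
        (i, chunk)
        for i, chunk in enumerate(text_chunks(sse_lines))
        if REPLACEMENT in chunk
    ]


# Trimmed sample based on the response above.
sample = [
    'data: {"model_name":"ensemble","text_output":"**"}',
    "",
    'data: {"model_name":"ensemble","text_output":" \ufffd"}',
    "",
    'data: {"model_name":"ensemble","text_output":"\ufffd"}',
    "",
    'data: {"model_name":"ensemble","text_output":"✨"}',
]

for index, text in find_corrupted(sample):
    print(f"chunk {index}: {text!r}")
```

Saving the curl output to a file and passing its lines to find_corrupted gives the indices of the affected chunks.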

Environment

TensorRT Version: 10.8

NVIDIA GPU: A5000, H100

NVIDIA Driver Version: 570

CUDA Version: 12.8

Operating System: Ubuntu 22.04

Python Version (if applicable): 3.12.3

Model link: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct

Steps To Reproduce

Run the Docker container

cd tensorrtllm_backend/
sudo docker run --rm -it -p 8000:8000 -p 8001:8001 -p 8002:8002 --shm-size=4g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /home/dfraltcov/digital_mind/tensorrtllm_backend:/tensorrtllm_backend triton_trt_llm bash

Create the checkpoint

cd /tensorrtllm_backend/tensorrt_llm/examples/qwen/
python convert_checkpoint.py --model_dir ./Qwen2.5-72B-Instruct --output_dir ./qwen_checkpoint_4gpu_tp4 --dtype bfloat16 --workers 4 --tp_size 4

Build the engines

trtllm-build --checkpoint_dir ./qwen_checkpoint/ --output_dir ./tmp/qwen/72B/16_batch --gpt_attention_plugin bfloat16 --gemm_plugin bfloat16 --remove_input_padding enable --context_fmha enable --kv_cache_type paged --use_paged_context_fmha enable --max_num_tokens 131072 --max_input_len 4096 --max_batch_size 16 --log_level info --monitor_memory --workers 4

Launch the Triton server

cd /tensorrtllm_backend
python3 scripts/launch_triton_server.py --world_size=4 --model_repo=/tensorrtllm_backend/triton_qwen_repo

With the ensemble model deployed, send a streaming request whose prompt elicits multibyte UTF-8 characters (e.g., emojis or Cyrillic).

curl -X POST 192.168.108.68:8000/v2/models/ensemble/generate_stream -d '{
    "text_input": "<|im_start|>system <|im_end|> <|im_start|>user Write a poem using the words click Nutcracker and yoga. Format it with emojis.<|im_end|><|im_start|>assistant ", 
    "parameters": {
      "max_tokens": 512,
      "temperature": 0.7,
      "stream": true,
      "top_p": 0.8,
      "top_k": 1,
      "length_penalty": 0.8,
      "beam_width": 1,
      "stop_words":["<|im_end|>"]
    }
}'
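The same request can be scripted so that the number of affected chunks is counted automatically. The sketch below uses only the Python standard library; build_request and count_replacement_chunks are hypothetical helper names, only a subset of the parameters from the curl command is sent, and the stream is assumed to consist of newline-delimited "data: {...}" lines as in the output shown earlier.

```python
import json
import urllib.request


def build_request(url, prompt, max_tokens=512):
    """Build a POST request for the generate_stream endpoint."""
    body = {
        "text_input": prompt,
        "parameters": {"max_tokens": max_tokens, "stream": True},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def count_replacement_chunks(url, prompt):
    """Return (chunks_with_U+FFFD, total_chunks) for one streamed reply."""
    bad = total = 0
    with urllib.request.urlopen(build_request(url, prompt)) as resp:
        # Lines end in "\n", which never occurs inside a multibyte UTF-8
        # sequence, so decoding line by line cannot split a character.
        for raw_line in resp:
            line = raw_line.decode("utf-8").strip()
            if not line.startswith("data:"):
                continue
            text = json.loads(line[len("data:"):]).get("text_output", "")
            total += 1
            bad += "\ufffd" in text
    return bad, total
```

For example, count_replacement_chunks("http://localhost:8000/v2/models/ensemble/generate_stream", prompt) with the prompt from the curl command should return a nonzero first value whenever the problem reproduces.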


Nurgl mentioned this issue Feb 28, 2025

Encoding error in stream response from Triton server NVIDIA/TensorRT-LLM#2544
