Description
When using Triton Inference Server in streaming mode (generate_stream endpoint) over the REST API or gRPC, multibyte UTF-8 characters (e.g., emojis or non-ASCII symbols like Cyrillic) are incorrectly split and replaced with the "�" replacement character (U+FFFD) in the response. This issue is observed across different versions of Triton and is visible in raw server output (e.g., via curl) as well as in client-side processing. Notably, the problem does not occur in non-streaming mode, where the response is returned as a single, correctly encoded UTF-8 string.
The problem appears to occur before the decoding stage on the server side, where multibyte UTF-8 sequences (e.g., emojis or Cyrillic characters) are not preserved as complete units during the streaming process. As a result, the text_output field in the response contains "�" for affected characters, and client-side buffering cannot recover the original data since the substitution happens prior to transmission.
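To illustrate the suspected mechanism, here is a minimal Python sketch. It is not Triton or TensorRT-LLM code, and the function names and the byte split are hypothetical. It assumes the emoji's UTF-8 bytes arrive in separate chunks (as can happen when a tokenizer emits partial byte sequences) and compares decoding each chunk on its own against an incremental decoder that buffers incomplete sequences.

```python
import codecs


def decode_per_chunk(chunks):
    """Decode every chunk independently (hypothetical naive streaming path)."""
    # A chunk ending or starting mid-character cannot be decoded, so
    # errors="replace" substitutes U+FFFD for the incomplete bytes.
    return [chunk.decode("utf-8", errors="replace") for chunk in chunks]


def decode_incrementally(chunks):
    """Decode with a stateful decoder that holds back incomplete sequences."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    pieces = [decoder.decode(chunk) for chunk in chunks]
    # Flush whatever is left once the stream has ended.
    pieces.append(decoder.decode(b"", final=True))
    return pieces


# Assumed split: a space plus the 4-byte emoji, spread over three chunks.
emoji_bytes = "🧘".encode("utf-8")
chunks = [b" " + emoji_bytes[:1], emoji_bytes[1:3], emoji_bytes[3:]]

naive_text = "".join(decode_per_chunk(chunks))
buffered_text = "".join(decode_incrementally(chunks))

print(repr(naive_text))
print(repr(buffered_text))
```

In this sketch the per-chunk path yields only replacement characters for the emoji, while the incremental decoder emits nothing until the last byte arrives and then returns the complete character. A server-side fix would need comparable buffering before text_output is written.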
curl -X POST localhost:8000/v2/models/ensemble/generate_stream -d '{
  "text_input": "<|im_start|>system <|im_end|> <|im_start|>user Write a poem using the words click Nutcracker and yoga. Format it with emojis.<|im_end|><|im_start|>assistant ",
  "parameters": {
    "max_tokens": 512,
    "temperature": 0.7,
    "stream": true,
    "top_p": 0.8,
    "top_k": 1,
    "length_penalty": 0.8,
    "beam_width": 1,
    "stop_words": ["<|end|>"]
  }
}'
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"✨"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"🌟"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":" **"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"Click"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":","}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":" Nut"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"cr"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"acker"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":","}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":" and"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":" Yoga"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"**"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":" �"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"�"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"�"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"✨"}
data: {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"\n\n"}
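For anyone trying to confirm the problem on their own deployment, the following Python sketch scans captured generate_stream output for events whose text_output contains U+FFFD. The helper name find_corrupted_chunks is hypothetical, and the sample lines are abbreviated copies of the output above.

```python
import json

REPLACEMENT_CHAR = "\ufffd"


def find_corrupted_chunks(sse_lines):
    """Return (event_index, text_output) for events containing U+FFFD."""
    corrupted = []
    event_index = 0
    for line in sse_lines:
        line = line.strip()
        # Only "data:" lines carry a JSON payload in the stream.
        if not line.startswith("data:"):
            continue
        payload = json.loads(line[len("data:"):])
        text = payload.get("text_output", "")
        if REPLACEMENT_CHAR in text:
            corrupted.append((event_index, text))
        event_index += 1
    return corrupted


# Abbreviated sample of the stream shown above.
sample = [
    'data: {"model_name":"ensemble","text_output":"**"}',
    'data: {"model_name":"ensemble","text_output":" \ufffd"}',
    'data: {"model_name":"ensemble","text_output":"\ufffd"}',
    'data: {"model_name":"ensemble","text_output":"\ufffd"}',
    'data: {"model_name":"ensemble","text_output":"✨"}',
]

bad_events = find_corrupted_chunks(sample)
print(bad_events)
```

Consecutive flagged events, as in this sample, are consistent with a single multibyte character having been split across several streamed chunks.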
cd /tensorrtllm_backend
python3 scripts/launch_triton_server.py --world_size=4 --model_repo=/tensorrtllm_backend/triton_qwen_repo
Deploy an ensemble model on Triton Inference Server that generates text with multibyte UTF-8 characters (e.g., emojis or Cyrillic).
curl -X POST 192.168.108.68:8000/v2/models/ensemble/generate_stream -d '{
  "text_input": "<|im_start|>system <|im_end|> <|im_start|>user Write a poem using the words click Nutcracker and yoga. Format it with emojis.<|im_end|><|im_start|>assistant ",
  "parameters": {
    "max_tokens": 512,
    "temperature": 0.7,
    "stream": true,
    "top_p": 0.8,
    "top_k": 1,
    "length_penalty": 0.8,
    "beam_width": 1,
    "stop_words": ["<|im_end|>"]
  }
}'
Environment
TensorRT Version: 10.8
NVIDIA GPU: A5000, H100
NVIDIA Driver Version: 570
CUDA Version: 12.8
Operating System: Ubuntu 22.04
Python Version (if applicable): 3.12.3
Model link: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct
Steps To Reproduce
Run docker container
Create the checkpoint
Build the engines
trtllm-build --checkpoint_dir ./qwen_checkpoint/ \
  --output_dir ./tmp/qwen/72B/16_batch \
  --gpt_attention_plugin bfloat16 \
  --gemm_plugin bfloat16 \
  --remove_input_padding enable \
  --context_fmha enable \
  --kv_cache_type paged \
  --use_paged_context_fmha enable \
  --max_num_tokens 131072 \
  --max_input_len 4096 \
  --max_batch_size 16 \
  --log_level info \
  --monitor_memory \
  --workers 4
Deploy an ensemble model on Triton Inference Server that generates text with multibyte UTF-8 characters (e.g., emojis or Cyrillic).