[QA] About chat format: does [UNUSED_TOKEN_146] mean the 146th token or the token itself? #618

Answered by ZwwWayne
limoncc asked this question in Q&A

[UNUSED_TOKEN_146] means the 146th unused token reserved during pre-training, so it should be treated as a single whole token during tokenization. However, it seems that you are using a different tokenizer, because the correct results should be:

In [1]: import torch
   ...: from transformers import AutoModelForCausalLM, AutoTokenizer
   ...: 
   ...: model_path = "internlm/internlm2-chat-7b"
   ...: 
   ...: tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)


In [2]: tokenizer.encode('[UNUSED_TOKEN_146]')
Out[2]: [1, 92543]

In [3]: tokenizer.encode('[UNUSED_TOKEN_145]')
Out[3]: [1, 92542]

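If helpful, here is a minimal check (a sketch, assuming the same internlm/internlm2-chat-7b tokenizer as above) that the token is kept as one piece rather than split into sub-tokens; passing add_special_tokens=False drops the leading BOS id (the 1 in the outputs above):

from transformers import AutoTokenizer

model_path = "internlm/internlm2-chat-7b"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Without the BOS token, [UNUSED_TOKEN_146] should map to exactly one id (92543 above).
ids = tokenizer.encode('[UNUSED_TOKEN_146]', add_special_tokens=False)
assert len(ids) == 1, f'expected a single token id, got {ids}'

# The id should map back to the literal token string, confirming it was not split.
print(tokenizer.convert_ids_to_tokens(ids))  # expected: ['[UNUSED_TOKEN_146]']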

Category: Q&A
Labels: question (further information is requested)

This discussion was converted from issue #614 on January 18, 2024 02:39.