[QA] About chat format: does [UNUSED_TOKEN_146] mean the 146th token or the token itself? #618

Answered by ZwwWayne
limoncc asked this question in Q&A

[UNUSED_TOKEN_146] means the 146th unused token reserved during pre-training, so it should be treated as a single whole token during tokenization. However, it seems that you are using a different tokenizer, because the correct results should be:

In [1]: import torch
   ...: from transformers import AutoModelForCausalLM, AutoTokenizer
   ...: 
   ...: model_path = "internlm/internlm2-chat-7b"
   ...: 
   ...: tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)


In [2]: tokenizer.encode('[UNUSED_TOKEN_146]')
Out[2]: [1, 92543]

In [3]: tokenizer.encode('[UNUSED_TOKEN_145]')
Out[3]: [1, 92542]

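If helpful, here is a minimal check (a sketch, assuming the same internlm/internlm2-chat-7b tokenizer as above) that the token is kept as one piece rather than split into sub-tokens; passing add_special_tokens=False drops the leading BOS id (the 1 in the outputs above):

from transformers import AutoTokenizer

model_path = "internlm/internlm2-chat-7b"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Without the BOS token, [UNUSED_TOKEN_146] should map to exactly one id (92543 above).
ids = tokenizer.encode('[UNUSED_TOKEN_146]', add_special_tokens=False)
assert len(ids) == 1, f'expected a single token id, got {ids}'

# The id should map back to the literal token string, confirming it was not split.
print(tokenizer.convert_ids_to_tokens(ids))  # expected: ['[UNUSED_TOKEN_146]']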

Category: Q&A
Labels: question (further information is requested)

This discussion was converted from issue #614 on January 18, 2024 02:39.