Problem description: a question about the chat format. Does [UNUSED_TOKEN_146] mean the token with ID 146, or the literal string itself?
a = tokenizer.decode([146])
print(a)
# <0x8F>
b = tokenizer.encode("[UNUSED_TOKEN_146]")
print(b)
# [1, 640, 2010, 15927, 8783, 3553, 304, 284, 10389, 332]
So what does it mean?
Answered by ZwwWayne on Jan 18, 2024
Answer selected by
ZwwWayne
#619 provides more description.
[UNUSED_TOKEN_146] means the 146th unused token reserved in pre-training, so it should be treated as a single unit in tokenization. However, it seems that you are using a different tokenizer, because the correct results should be:
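To check whether a given tokenizer treats the string as a whole, one option is to encode it without BOS/EOS tokens and count the resulting IDs. This is a minimal sketch, assuming a Hugging Face-style tokenizer whose `encode` accepts `add_special_tokens`. The helper `encodes_as_single_token`, the stand-in class `ToyTokenizer`, and the ID `1000` are hypothetical and used only for illustration; they are not the real tokenizer or its vocabulary.

```python
def encodes_as_single_token(tokenizer, text):
    """Return (is_single, ids) for `text` encoded without BOS/EOS tokens."""
    # add_special_tokens=False keeps automatically added BOS/EOS out of the count.
    ids = list(tokenizer.encode(text, add_special_tokens=False))
    return len(ids) == 1, ids


class ToyTokenizer:
    """Hypothetical stand-in so the sketch runs without downloading a model."""

    def __init__(self, special_tokens):
        # Maps a reserved string to one made-up token ID.
        self.special_tokens = special_tokens

    def encode(self, text, add_special_tokens=True):
        if text in self.special_tokens:
            return [self.special_tokens[text]]
        # Unknown strings fall back to one ID per character.
        return [ord(ch) for ch in text]


if __name__ == "__main__":
    aware = ToyTokenizer({"[UNUSED_TOKEN_146]": 1000})  # 1000 is an invented ID
    unaware = ToyTokenizer({})
    print(encodes_as_single_token(aware, "[UNUSED_TOKEN_146]"))
    print(encodes_as_single_token(unaware, "[UNUSED_TOKEN_146]")[0])
```

With a real tokenizer, the same helper can be called on the loaded tokenizer object. If the first element of the result is False, the string is being split into pieces, as in the question above, which suggests the tokenizer does not match the model.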