Imprecise description about removing token "pu" in section Unigram tokenization #699

Open
yaojingguo opened this issue Apr 21, 2024 · 0 comments


https://huggingface.co/learn/nlp-course/chapter6/7?fw=pt says:

In this (very) particular case, we had two equivalent tokenizations of all the words: as we saw earlier, for example, "pug" could be tokenized ["p", "ug"] with the same score. Thus, removing the "pu" token from the vocabulary will give the exact same loss.

However, the following list from that page shows that "pun" is tokenized as ["pu", "n"]. If the "pu" token is removed, "pun" must be tokenized differently, so its score may change. The loss stays the same only if "pun" keeps the same score after "pu" is removed.

"hug": ["hug"] (score 0.071428)
"pug": ["pu", "g"] (score 0.007710)
"pun": ["pu", "n"] (score 0.006168)
"bun": ["bu", "n"] (score 0.001451)
"hugs": ["hug", "s"] (score 0.001701)