error fixed at chapter 6 /7.md #604

Open · wants to merge 2 commits into main
4 changes: 2 additions & 2 deletions chapters/en/chapter6/7.mdx
@@ -64,11 +64,11 @@ So, the sum of all frequencies is 210, and the probability of the subword `"ug"`

Now, to tokenize a given word, we look at all the possible segmentations into tokens and compute the probability of each according to the Unigram model. Since all tokens are considered independent, this probability is just the product of the probability of each token. For instance, the tokenization `["p", "u", "g"]` of `"pug"` has the probability:

$$P([``p", ``u", ``g"]) = P(``p") \times P(``u") \times P(``g") = \frac{5}{210} \times \frac{36}{210} \times \frac{20}{210} = 0.000389$$
$$P([``p", ``u", ``g"]) = P(``p") \times P(``u") \times P(``g") = \frac{17}{210} \times \frac{36}{210} \times \frac{20}{210} = 0.001322$$

Comparatively, the tokenization `["pu", "g"]` has the probability:

$$P([``pu", ``g"]) = P(``pu") \times P(``g") = \frac{5}{210} \times \frac{20}{210} = 0.0022676$$
$$P([``pu", ``g"]) = P(``pu") \times P(``g") = \frac{17}{210} \times \frac{20}{210} = 0.007709$$

so that one is way more likely. In general, tokenizations with the least tokens possible will have the highest probability (because of that division by 210 repeated for each token), which corresponds to what we want intuitively: to split a word into the least number of tokens possible.

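As a quick sanity check of the corrected values (not part of the diff), here is a minimal Python sketch that recomputes both probabilities from the chapter's subword frequencies. The `freqs` dictionary and the `tokenization_prob` helper are illustrative names for this check, not code from the chapter:

```python
from math import prod

# Subword frequencies from the chapter's toy corpus (only the ones needed here);
# 210 is the sum of all subword frequencies.
freqs = {"p": 17, "u": 36, "g": 20, "pu": 17}
total = 210

def tokenization_prob(tokens):
    # Unigram model: tokens are independent, so the probability of a
    # tokenization is the product of the individual token probabilities.
    return prod(freqs[t] / total for t in tokens)

print(tokenization_prob(["p", "u", "g"]))  # ≈ 0.001322
print(tokenization_prob(["pu", "g"]))      # ≈ 0.00771
```

The two-token segmentation comes out roughly six times more likely than the three-token one, which matches the corrected numbers in the diff and the chapter's point that shorter segmentations win.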