
Fast and correct BPE algorithms #9

Merged
merged 2 commits into from
Jul 29, 2024

Conversation

aneubeck
Collaborator

No description provided.

The solution is to track ALL encodings for all text prefixes. For our example `ababc` we would get:
- `a` ------> `a`
- `ab` -----> `ab`
- `aba` ----> `ab a`
Member

Question: Why is the resulting encoding `ab a` rather than `a ba`? Is this a consequence of

Then, one scans over the preliminary encoding and determines the smallest token id by which any pair of neighboring token ids could be replaced.
The leftmost of them is replaced with that token id.

I guess my question is why the single `a` token is not considered the smallest token id in this case, given that its position in the dictionary implies the highest frequency?

Collaborator Author

If you want to encode `aba`, then you start with the byte tokens `a b a`.
Next, you look for potential merges. In this example you can choose between `ab` and `ba`.
Since `ab` has the smaller token id of the two, you replace that one first,
i.e. you now get `ab a`.
But at this point no merge operation is possible anymore (`ba` is no longer an option).

Member

Next, you look for potential merges. In this example you can choose between `ab` and `ba`.
Since `ab` has the smaller token id of the two, you replace that one first.

Thanks for clarifying, I understand now. The merge operation produces the smallest token id over pairs of neighboring tokens (individual byte tokens or already merged tokens), starting from the leftmost such pair and working towards the end of the sequence.
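
To make the rule discussed above concrete, here is a minimal sketch of that greedy merge loop in Rust. The `merges` map, which takes a pair of token ids to the id of the merged token, and the function name `encode_greedy` are hypothetical stand-ins for illustration, not the crate's actual API.

```rust
use std::collections::HashMap;

// A minimal sketch of the greedy merge procedure described above. The
// `merges` table (pair of token ids -> merged token id) is a hypothetical
// stand-in for the dictionary; smaller ids were added to it earlier.
fn encode_greedy(bytes: &[u8], merges: &HashMap<(u32, u32), u32>) -> Vec<u32> {
    // Start with one token per input byte.
    let mut tokens: Vec<u32> = bytes.iter().map(|&b| b as u32).collect();
    loop {
        // Find the neighboring pair whose merged token has the smallest id;
        // ties are broken by taking the leftmost occurrence.
        let best = tokens
            .windows(2)
            .enumerate()
            .filter_map(|(i, w)| merges.get(&(w[0], w[1])).map(|&m| (m, i)))
            .min();
        match best {
            Some((merged, i)) => {
                // Replace the pair at position i with the merged token.
                tokens[i] = merged;
                tokens.remove(i + 1);
            }
            None => return tokens, // no more merges possible
        }
    }
}
```

With a hypothetical table `{(a, b) -> ab, (b, a) -> ba}` in which `ab` has the smaller id, encoding `aba` merges the pair at position 0 first and returns `[ab, a]`, matching the example above.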

Contributor

@hendrikvanantwerpen left a comment

Generally looks good. The various tests are useful.

I'd need to spend quite a bit more time to understand all the details, but the code structure corresponds to the program set out in the readme. Together with the tests, it looks good to me.

There are mainly three strategies for BPE encoding.
1) Trivial solution. Search brute force for the most frequent pair in the encoded text according to the dictionary and replace those occurrences. This has `O(n^2)` complexity and is therefore not very appealing in production.
2) Heap based. Set up a heap with the frequencies. This improves the linear search time to a logarithmic factor. If done properly, the overall complexity now reduces to `O(n log n)`.
3) Split the input into sections of a maximum size first and then process each section individually. In theory this shrinks the complexity to `O(n)` if the section size is small enough. But in general it will now produce distinct results. In order to produce the "correct" encoding, one would need to choose split points at token boundaries. But without having the text encoded already, this is essentially impossible.
Contributor

Nit: add what the "distinct results" are distinct from?
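
For reference, here is a minimal sketch of the heap-based strategy (2) quoted above. It assumes the same hypothetical `merges` table as the earlier sketch and orders the heap by merged token id (matching the merge rule discussed in this thread rather than raw frequencies); stale heap entries are skipped lazily. This is an illustration, not the crate's implementation.

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};

// Heap-based BPE sketch: a doubly linked list over positions makes each merge
// O(1), and a min-heap picks the next candidate merge in O(log n).
fn encode_heap(bytes: &[u8], merges: &HashMap<(u32, u32), u32>) -> Vec<u32> {
    let n = bytes.len();
    let mut token: Vec<u32> = bytes.iter().map(|&b| b as u32).collect();
    let mut prev: Vec<isize> = (0..n).map(|i| i as isize - 1).collect();
    let mut next: Vec<isize> = (0..n)
        .map(|i| if i + 1 < n { i as isize + 1 } else { -1 })
        .collect();
    let mut alive = vec![true; n];

    // Candidate merges: (merged id, left position, snapshot of the pair).
    let mut heap: BinaryHeap<Reverse<(u32, usize, u32, u32)>> = BinaryHeap::new();
    for i in 0..n.saturating_sub(1) {
        if let Some(&m) = merges.get(&(token[i], token[i + 1])) {
            heap.push(Reverse((m, i, token[i], token[i + 1])));
        }
    }

    while let Some(Reverse((m, i, l, r))) = heap.pop() {
        // Skip entries that no longer describe an existing neighboring pair.
        if !alive[i] || next[i] < 0 {
            continue;
        }
        let j = next[i] as usize;
        if token[i] != l || token[j] != r {
            continue;
        }
        // Merge the pair: position i now holds the merged token, j is unlinked.
        token[i] = m;
        alive[j] = false;
        next[i] = next[j];
        if next[j] >= 0 {
            prev[next[j] as usize] = i as isize;
        }
        // The merge creates at most two new candidate pairs around position i.
        if prev[i] >= 0 {
            let p = prev[i] as usize;
            if let Some(&m2) = merges.get(&(token[p], token[i])) {
                heap.push(Reverse((m2, p, token[p], token[i])));
            }
        }
        if next[i] >= 0 {
            let q = next[i] as usize;
            if let Some(&m2) = merges.get(&(token[i], token[q])) {
                heap.push(Reverse((m2, i, token[i], token[q])));
            }
        }
    }
    (0..n).filter(|&i| alive[i]).map(|i| token[i]).collect()
}
```

Each merge pushes at most two new heap entries, so the total work stays within `O(n log n)`, as stated in point 2.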

Comment on lines +79 to +86
### Corollary III

Given two valid encoding sequences `e_0..e_i` and `e_i..e_n`, then `e_0..e_i..e_n` is also a valid encoding sequence.
Note that the end/start token has to be identical between the two sequences!

The correctness of this statement follows by a similar argument to the one used in Corollary II.
Consider the merge operations performed by BPE for both valid encoding sequences. The merge operations which lead to the shared token `e_i` must be identical in order to produce the same token, and those are the only redundant merge operations. Combining the two sets of merge operations leads to the combined token sequence.
If BPE wants to make a different merge decision when it sees the full input, then this merge decision must involve either the token boundary to the left or to the right of `e_i`. But that means it would have to make a different merge decision for at least one of the substrings `e_0..e_i` or `e_i..e_n`, which cover those token boundaries. So, by contradiction, the corollary must be true.
Contributor

I wondered if this was true if one of the sequences has length one, but I suppose it is because the shared token is present in both.

crates/bpe/README.md (outdated, resolved)
@aneubeck marked this pull request as ready for review July 29, 2024 09:31
@aneubeck merged commit e32c83d into main Jul 29, 2024
2 of 3 checks passed