Fast and correct BPE algorithms #9
Conversation
The solution is to track ALL encodings for all text prefixes. For our example `ababc` we would get:
- `a` ------> `a`
- `ab` -----> `ab`
- `aba` ----> `ab a`
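
To make the table above concrete, here is a minimal brute-force reference sketch (not the crate's implementation) that prints the encoding of every prefix. It assumes a made-up vocabulary `["a", "b", "c", "ab", "ba", "abc"]` in which the index is the token id and a smaller id means higher merge priority:

```rust
/// Hypothetical vocabulary: index == token id, lower id == higher merge priority.
const VOCAB: [&str; 6] = ["a", "b", "c", "ab", "ba", "abc"];

fn token_id(s: &str) -> Option<usize> {
    VOCAB.iter().position(|t| *t == s)
}

/// Brute-force reference encoder: repeatedly merge the neighboring pair that
/// forms the token with the smallest id, taking the leftmost occurrence on ties.
fn encode(text: &str) -> Vec<usize> {
    let mut tokens: Vec<usize> = text
        .chars()
        .map(|c| token_id(&c.to_string()).expect("byte token missing"))
        .collect();
    loop {
        // (merged token id, position of the left token of the pair)
        let mut best: Option<(usize, usize)> = None;
        for i in 0..tokens.len().saturating_sub(1) {
            let merged = format!("{}{}", VOCAB[tokens[i]], VOCAB[tokens[i + 1]]);
            if let Some(id) = token_id(&merged) {
                if best.map_or(true, |(best_id, _)| id < best_id) {
                    best = Some((id, i));
                }
            }
        }
        match best {
            Some((id, i)) => {
                tokens[i] = id;
                tokens.remove(i + 1);
            }
            None => return tokens,
        }
    }
}

fn main() {
    let text = "ababc";
    // Print the encoding of every prefix, mirroring the table above.
    for end in 1..=text.len() {
        let prefix = &text[..end];
        let pretty: Vec<&str> = encode(prefix).iter().map(|&id| VOCAB[id]).collect();
        println!("{:5} -> {}", prefix, pretty.join(" "));
    }
}
```

With that hypothetical vocabulary the remaining prefixes come out as `ab ab` and `ab abc`; the actual tokenizer's vocabulary will of course differ.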
Question: Why is the resulting encoding `ab a` rather than `a ba`? Is this a consequence of

> Then, one scans over the preliminary encoding and determines the smallest token id by which any pair of neighboring token ids could be replaced. The left most of them is replaced with that token id.

I guess my question is: why is the single `a` token not considered the smallest token id in this case, given that its position in the dictionary implies the highest frequency?
If you want to encode `aba`, then you start with the byte tokens `a` `b` `a`.
Next, you look for potential merges. In this example you can choose between `ab` and `ba`.
Since `ab` is the smaller one of the two, you replace that one first, i.e. you now get `ab` `a`.
But at this point no possible merge operation exists anymore (`ba` is no longer possible).
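
The choice can be made mechanical: enumerate the mergeable neighboring pairs and pick the one whose merged token has the smallest id. A tiny sketch, reusing the same hypothetical vocabulary as above (where `ab` happens to have a smaller id than `ba`):

```rust
fn main() {
    // Hypothetical vocabulary, index == token id: a=0, b=1, c=2, ab=3, ba=4, abc=5.
    let vocab = ["a", "b", "c", "ab", "ba", "abc"];
    let tokens = ["a", "b", "a"]; // byte tokens for "aba"
    for pair in tokens.windows(2) {
        let merged = format!("{}{}", pair[0], pair[1]);
        match vocab.iter().position(|t| *t == merged) {
            Some(id) => println!("{} + {} -> {} (id {})", pair[0], pair[1], merged, id),
            None => println!("{} + {} -> no token", pair[0], pair[1]),
        }
    }
    // Prints "ab" with id 3 and "ba" with id 4, so `ab` is merged first;
    // afterwards the sequence is `ab` `a` and `ba` is no longer a candidate.
}
```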
> Next, you look for potential merges. In this example you can choose between `ab` and `ba`.
> Since `ab` is the smaller one of the two, you replace that one first.

Thanks for clarifying, I understand now. The merge operation produces the smallest token id for a sequence of individual byte tokens or already paired tokens, starting with the leftmost element and working towards the end of the sequence.
Generally looks good. The various tests are useful.
I'd need to spend quite a bit more time to understand all the details, but the code structure corresponds to the program set out in the readme. Together with the tests, it looks good to me.
crates/bpe/README.md
There are essentially three strategies for BPE encoding:
1) Trivial solution. Search by brute force for the most frequent pair in the encoded text according to the dictionary and replace those occurrences. This has `O(n^2)` complexity and is therefore not very appealing in production.
2) Heap based. Set up a heap with the frequencies. This reduces the linear search to a logarithmic one. If done properly, the overall complexity reduces to `O(n log n)` (a rough sketch follows after this list).
3) Split the input into sections of a maximum size first and then process each section individually. In theory this shrinks the complexity to `O(n)` if the section size is small enough. But it will in general produce distinct results. In order to produce the "correct" encoding, one would need to choose the split points at token boundaries. But without having the text encoded already, this is essentially impossible.
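
The heap-based strategy (2) could look roughly like the sketch below. This is not the crate's code, just an illustration under the same hypothetical vocabulary as above: candidates live in a min-heap keyed by the merged token id, a doubly linked index list skips over consumed tokens, and stale heap entries are detected by re-checking the pair they were created for.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical vocabulary: index == token id, lower id == higher merge priority.
const VOCAB: [&str; 6] = ["a", "b", "c", "ab", "ba", "abc"];

fn token_id(s: &str) -> Option<usize> {
    VOCAB.iter().position(|t| *t == s)
}

type Candidate = Reverse<(usize, usize, usize, usize, usize)>; // (merged id, i, j, left id, right id)

fn push_candidate(heap: &mut BinaryHeap<Candidate>, tokens: &[Option<usize>], i: usize, j: usize) {
    if let (Some(a), Some(b)) = (tokens[i], tokens[j]) {
        if let Some(id) = token_id(&format!("{}{}", VOCAB[a], VOCAB[b])) {
            heap.push(Reverse((id, i, j, a, b)));
        }
    }
}

/// Heap-based BPE: every merge pushes at most two new candidates and every
/// candidate is popped once, giving O(n log n) overall.
fn encode(text: &str) -> Vec<usize> {
    let mut tokens: Vec<Option<usize>> = text
        .chars()
        .map(|c| Some(token_id(&c.to_string()).expect("byte token missing")))
        .collect();
    let n = tokens.len();
    // Doubly linked list over the token positions; `n` and `-1` act as sentinels.
    let mut next: Vec<usize> = (1..=n).collect();
    let mut prev: Vec<isize> = (0..n as isize).map(|i| i - 1).collect();

    let mut heap: BinaryHeap<Candidate> = BinaryHeap::new();
    for i in 0..n.saturating_sub(1) {
        push_candidate(&mut heap, &tokens, i, i + 1);
    }

    while let Some(Reverse((id, i, j, a, b))) = heap.pop() {
        // Skip stale candidates whose tokens were already merged away.
        if tokens[i] != Some(a) || tokens[j] != Some(b) || next[i] != j {
            continue;
        }
        tokens[i] = Some(id);
        tokens[j] = None;
        next[i] = next[j];
        if next[i] < n {
            prev[next[i]] = i as isize;
        }
        // The merge creates at most two new neighboring pairs.
        if prev[i] >= 0 {
            push_candidate(&mut heap, &tokens, prev[i] as usize, i);
        }
        if next[i] < n {
            push_candidate(&mut heap, &tokens, i, next[i]);
        }
    }
    tokens.into_iter().flatten().collect()
}

fn main() {
    let pretty: Vec<&str> = encode("ababc").iter().map(|&id| VOCAB[id]).collect();
    println!("{}", pretty.join(" ")); // with this vocabulary: "ab abc"
}
```

The stale-entry trick (lazy deletion) avoids updating the heap when a neighbor disappears: each popped entry is simply re-validated against the current token sequence before it is applied.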
Nit: Add what the "distinct results" are distinct from.
### Corollary III

Given two valid encoding sequences `e_0..e_i` and `e_i..e_n`, then `e_0..e_i..e_n` is also a valid encoding sequence.
Note that the end/start token has to be identical between the two sequences!

The correctness of this statement follows with a similar argument as used in Corollary II.
Consider the merge operations performed by BPE for both valid encoding sequences. The merge operations which lead to the shared token `e_i` must be identical in order to produce the same token, and those are the only redundant merge operations. Combining the two sets of merge operations leads to the combined token sequence.
If BPE wanted to make a different merge decision when it sees the full input, then this merge decision would have to involve either the token boundary to the left or to the right of `e_i`. But that means it would have had to make a different merge decision for at least one of the substrings `e_0..e_i` or `e_i..e_n`, which cover those token boundaries. So, by contradiction, the corollary must be true.
I wondered if this was true if one of the sequences has length one, but I suppose it is because the shared token is present in both.