
Fast and correct BPE algorithms #9

Merged
merged 2 commits into from
Jul 29, 2024

Conversation

aneubeck
Collaborator

No description provided.

The solution is to track ALL encodings for all text prefixes. For our example `ababc` we would get:
- `a` ------> `a`
- `ab` -----> `ab`
- `aba` ----> `ab a`
Member

Question: Why is the resulting encoding `ab a` rather than `a ba`? Is this a consequence of

Then, one scans over the preliminary encoding and determines the smallest token id by which any pair of neighboring token ids could be replaced.
The leftmost of them is replaced with that token id.

I guess my question is why the single `a` token is not considered the smallest token id in this case, given that its position in the dictionary implies the highest frequency?

Collaborator Author

If you want to encode `aba`, then you start with the byte tokens `a b a`.
Next, you look for potential merges. In this example you can choose between `ab` and `ba`.
Since `ab` has the smaller token id of the two, you replace that one first,
i.e. you now get `ab a`.
But at this point no merge operation is possible anymore (`ba` is no longer an option).

Member

Next, you look for potential merges. In this example you can choose between `ab` and `ba`.
Since `ab` has the smaller token id of the two, you replace that one first.

Thanks for clarifying, I understand now. The merge operation produces the smallest token id over pairs of neighboring tokens (individual byte tokens or already merged tokens), starting from the leftmost such pair and working towards the end of the sequence.
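
To make the rule discussed above concrete, here is a minimal sketch of that greedy merge loop in Rust. The `merges` map, which takes a pair of token ids to the id of the merged token, and the function name `encode_greedy` are hypothetical stand-ins for illustration, not the crate's actual API.

```rust
use std::collections::HashMap;

// A minimal sketch of the greedy merge procedure described above. The
// `merges` table (pair of token ids -> merged token id) is a hypothetical
// stand-in for the dictionary; smaller ids were added to it earlier.
fn encode_greedy(bytes: &[u8], merges: &HashMap<(u32, u32), u32>) -> Vec<u32> {
    // Start with one token per input byte.
    let mut tokens: Vec<u32> = bytes.iter().map(|&b| b as u32).collect();
    loop {
        // Find the neighboring pair whose merged token has the smallest id;
        // ties are broken by taking the leftmost occurrence.
        let best = tokens
            .windows(2)
            .enumerate()
            .filter_map(|(i, w)| merges.get(&(w[0], w[1])).map(|&m| (m, i)))
            .min();
        match best {
            Some((merged, i)) => {
                // Replace the pair at position i with the merged token.
                tokens[i] = merged;
                tokens.remove(i + 1);
            }
            None => return tokens, // no more merges possible
        }
    }
}
```

With a hypothetical table `{(a, b) -> ab, (b, a) -> ba}` in which `ab` has the smaller id, encoding `aba` merges the pair at position 0 first and returns `[ab, a]`, matching the example above.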

Contributor

@hendrikvanantwerpen left a comment

Generally looks good. The various tests are useful.

I'd need to spend quite a bit more time to understand all the details, but the code structure corresponds to the program set out in the readme. Together with the tests, it looks good to me.

There are mainly three strategies for BPE encoding.
1) Trivial solution. Search brute force for the most frequent pair in the encoded text according to the dictionary and replace those occurrences. This has `O(n^2)` complexity and is therefore not very appealing in production.
2) Heap based. Set up a heap with the frequencies. This improves the linear search time to a logarithmic factor. If done properly, the overall complexity now reduces to `O(n log n)`.
3) Split the input into sections of a maximum size first and then process each section individually. In theory this shrinks the complexity to `O(n)` if the section size is small enough. But in general it will now produce distinct results. In order to produce the "correct" encoding, one would need to choose split points at token boundaries. But without having the text encoded already, this is essentially impossible.
Contributor

Nit: add what the "distinct results" are distinct from?
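
For reference, here is a minimal sketch of the heap-based strategy (2) quoted above. It assumes the same hypothetical `merges` table as the earlier sketch and orders the heap by merged token id (matching the merge rule discussed in this thread rather than raw frequencies); stale heap entries are skipped lazily. This is an illustration, not the crate's implementation.

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};

// Heap-based BPE sketch: a doubly linked list over positions makes each merge
// O(1), and a min-heap picks the next candidate merge in O(log n).
fn encode_heap(bytes: &[u8], merges: &HashMap<(u32, u32), u32>) -> Vec<u32> {
    let n = bytes.len();
    let mut token: Vec<u32> = bytes.iter().map(|&b| b as u32).collect();
    let mut prev: Vec<isize> = (0..n).map(|i| i as isize - 1).collect();
    let mut next: Vec<isize> = (0..n)
        .map(|i| if i + 1 < n { i as isize + 1 } else { -1 })
        .collect();
    let mut alive = vec![true; n];

    // Candidate merges: (merged id, left position, snapshot of the pair).
    let mut heap: BinaryHeap<Reverse<(u32, usize, u32, u32)>> = BinaryHeap::new();
    for i in 0..n.saturating_sub(1) {
        if let Some(&m) = merges.get(&(token[i], token[i + 1])) {
            heap.push(Reverse((m, i, token[i], token[i + 1])));
        }
    }

    while let Some(Reverse((m, i, l, r))) = heap.pop() {
        // Skip entries that no longer describe an existing neighboring pair.
        if !alive[i] || next[i] < 0 {
            continue;
        }
        let j = next[i] as usize;
        if token[i] != l || token[j] != r {
            continue;
        }
        // Merge the pair: position i now holds the merged token, j is unlinked.
        token[i] = m;
        alive[j] = false;
        next[i] = next[j];
        if next[j] >= 0 {
            prev[next[j] as usize] = i as isize;
        }
        // The merge creates at most two new candidate pairs around position i.
        if prev[i] >= 0 {
            let p = prev[i] as usize;
            if let Some(&m2) = merges.get(&(token[p], token[i])) {
                heap.push(Reverse((m2, p, token[p], token[i])));
            }
        }
        if next[i] >= 0 {
            let q = next[i] as usize;
            if let Some(&m2) = merges.get(&(token[i], token[q])) {
                heap.push(Reverse((m2, i, token[i], token[q])));
            }
        }
    }
    (0..n).filter(|&i| alive[i]).map(|i| token[i]).collect()
}
```

Each merge pushes at most two new heap entries, so the total work stays within `O(n log n)`, as stated in point 2.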

Comment on lines +79 to +86
### Corollary III

Given two valid encoding sequences `e_0..e_i` and `e_i..e_n`, then `e_0..e_i..e_n` is also a valid encoding sequence.
Note that the end/start token has to be identical between the two sequences!

The correctness of this statement follows by a similar argument to the one used in Corollary II.
Consider the merge operations performed by BPE for both valid encoding sequences. The merge operations which lead to the shared token `e_i` must be identical in order to produce the same token, and those are the only redundant merge operations. Combining the two sets of merge operations leads to the combined token sequence.
If BPE wants to make a different merge decision when it sees the full input, then this merge decision must involve either the token boundary to the left or to the right of `e_i`. But that means it would have to make a different merge decision for at least one of the substrings `e_0..e_i` or `e_i..e_n`, which cover those token boundaries. So, by contradiction, the corollary must be true.
Contributor

I wondered if this was true if one of the sequences has length one, but I suppose it is because the shared token is present in both.

crates/bpe/README.md (outdated, resolved)
@aneubeck marked this pull request as ready for review July 29, 2024 09:31
@aneubeck merged commit e32c83d into main Jul 29, 2024
2 of 3 checks passed