Update crates/bpe/README.md
Co-authored-by: Hendrik van Antwerpen <[email protected]>
aneubeck and hendrikvanantwerpen authored Jul 24, 2024
1 parent 4f0ba1b commit b2aae47
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion crates/bpe/README.md
@@ -95,7 +95,7 @@ Given a valid encoding sequence `e_0..e_i` and a valid encoding tuple `e_i e_j`,
At first glance, it seems impossible to achieve `O(n)` complexity while preserving the encoding output of the original BPE algorithm, since the original BPE algorithm needs to first scan the full input before it can make any encoding decision.
For instance, the sequence `abab` would be encoded as `ab ab` when the dictionary contains the tokens `a b ab ba bc abc babc ababc` ordered by frequency. But appending a single character to get `ababc` would result in a quite different tokenization: `ab a bc`. So without looking ahead it seems impossible to properly tokenize the text.

-The solution is to track ALL encodings for all text prefixes. For our example `ababc` we would get:
+The solution is to track the encodings of ALL text prefixes. For our example `ababc` we would get:
- `a` ------> `a`
- `ab` -----> `ab`
- `aba` ----> `ab a`
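
To make the prefix bookkeeping concrete, here is a minimal sketch. It assumes one compact representation that the lines above do not spell out: for each prefix, keep only the byte length of its last token, and rebuild a full encoding by walking backwards through the table. The function name `encoding_of_prefix` and the table `last_token_len` are hypothetical, and the table values are copied by hand from the example (`a`, `ab`, `a`, `ab`, `bc`) rather than computed by a BPE encoder.

```rust
/// Rebuilds the encoding of `text[..prefix_len]` from a table that stores,
/// for every prefix, only the byte length of that prefix's last token.
/// `last_token_len[i]` belongs to the prefix of length `i + 1`.
///
/// Hypothetical helper for illustration; it assumes ASCII input so that
/// byte offsets are valid `str` slice boundaries.
fn encoding_of_prefix<'a>(
    text: &'a str,
    last_token_len: &[usize],
    prefix_len: usize,
) -> Vec<&'a str> {
    let mut tokens = Vec::new();
    let mut end = prefix_len;
    while end > 0 {
        let len = last_token_len[end - 1];
        // A last token must be non-empty and fit inside its prefix.
        assert!(len >= 1 && len <= end, "invalid last-token length");
        let start = end - len;
        tokens.push(&text[start..end]);
        // Continue with the shorter prefix that ends where this token starts.
        end = start;
    }
    // Tokens were collected back to front.
    tokens.reverse();
    tokens
}
```

With the table `[1, 2, 1, 2, 2]` for the text `ababc`, a `prefix_len` of 5 gives `ab a bc` and a `prefix_len` of 4 gives `ab ab`, matching the encodings quoted above. The sketch covers only the lookup side; filling the table in a way that reproduces the original BPE output is the part the README addresses and is not attempted here.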