Commit 94f5cc3

Update README.md
1 parent dc77f0c commit 94f5cc3

1 file changed

+10
-0
lines changed

crates/bpe/README.md

@@ -41,6 +41,16 @@ Note: the tik-token library uses a combination of 1) and 3) where sections are d

## Properties of BPE

### Definition: Byte Pair Encoding

Byte pair encoding is defined by an ordered list of tokens, where the position of a token in the list is its token id.
Each multi-byte token must be the concatenation of exactly two tokens that appear earlier in the list.

Encoding starts by converting every byte into its one-byte token id.
Then, one scans over the preliminary encoding and determines the smallest token id by which any pair of neighboring token ids could be replaced.
The leftmost such pair is replaced with that token id.
The process continues until no further replacement is possible.

### Definition: Valid Encoding Sequence

An encoding sequence `e_0..e_n` is said to be valid if decoding the sequence and re-encoding it with BPE produces the very same sequence.
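
The two definitions above can be illustrated with a minimal Python sketch. It is a direct, quadratic-time transcription of the definitions and is not the implementation used by this crate. The names `encode`, `decode`, `is_valid`, and the toy vocabulary `TOKENS` are hypothetical. The sketch assumes that tokens are unique, that every input byte has a one-byte token, and that a neighboring pair can be replaced whenever the concatenation of its two tokens is itself in the list.

```python
# Toy vocabulary (an assumption for illustration): index in the list == token id.
# Every multi-byte token is the concatenation of two earlier tokens.
TOKENS = [b"a", b"b", b"c", b"ab", b"bc", b"abc"]


def encode(tokens, data):
    """Encode `data` (bytes) by repeatedly applying the smallest possible merge."""
    lookup = {tok: i for i, tok in enumerate(tokens)}
    # Start with every byte converted into its one-byte token id.
    ids = [lookup[bytes([b])] for b in data]
    while True:
        best_id, best_pos = None, None
        # Scan all neighboring pairs for the smallest replacement token id.
        for pos in range(len(ids) - 1):
            merged = tokens[ids[pos]] + tokens[ids[pos + 1]]
            cand = lookup.get(merged)
            # Strict `<` keeps the leftmost pair when several pairs tie.
            if cand is not None and (best_id is None or cand < best_id):
                best_id, best_pos = cand, pos
        if best_id is None:
            return ids  # no further replacement is possible
        ids[best_pos:best_pos + 2] = [best_id]


def decode(tokens, ids):
    """Concatenate the bytes of each token id."""
    return b"".join(tokens[i] for i in ids)


def is_valid(tokens, ids):
    """A sequence is valid if decoding and re-encoding reproduces it."""
    return encode(tokens, decode(tokens, ids)) == list(ids)
```

With this vocabulary, `encode(TOKENS, b"abc")` yields `[5]`, so `[5]` is a valid encoding sequence. The sequence `[0, 4]` (`a` followed by `bc`) decodes to the same bytes but is not valid, because re-encoding merges `a` and `b` first (token id 3 is smaller than 4) and ends at `[5]`.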
