@@ -41,6 +41,16 @@ Note: the tik-token library uses a combination of 1) and 3) where sections are d

## Properties of BPE
+ ### Definition: Byte Pair Encoding
+
+ A byte pair encoding is defined by an ordered list of tokens, where each token's position in the list is its token id.
+ Every multi-byte token must be the concatenation of exactly two tokens that appear earlier in the list.
+
+ Encoding starts by converting every input byte into its one-byte token id.
+ One then scans the preliminary encoding and determines the smallest token id that could replace any pair of neighboring token ids.
+ The leftmost such pair is replaced with that token id.
+ The process repeats until no further replacement is possible.
+
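The encoding procedure in this definition can be sketched in a few lines of Python. This is a naive, quadratic-time illustration and not the implementation this repository uses. It makes two assumptions that are not part of the definition: one-byte tokens have ids 0..255 equal to their byte value, and the vocabulary is given as a hypothetical `merges` list in which `merges[i]` names the two earlier token ids that form token id `256 + i`.

```python
from typing import Dict, List, Optional, Tuple

Pair = Tuple[int, int]


def build_pair_table(merges: List[Pair]) -> Dict[Pair, int]:
    """Map each pair of neighboring token ids to the token id that replaces it.

    Assumption: ids 0..255 are the one-byte tokens, and merges[i] defines
    token id 256 + i.
    """
    table: Dict[Pair, int] = {}
    for i, pair in enumerate(merges):
        # If a pair were listed twice, keep the smallest token id.
        table.setdefault(pair, 256 + i)
    return table


def encode(data: bytes, table: Dict[Pair, int]) -> List[int]:
    """Naive BPE encoding that follows the definition step by step."""
    ids = list(data)  # every byte becomes its one-byte token id
    while True:
        best_id: Optional[int] = None
        best_pos = -1
        for pos in range(len(ids) - 1):
            cand = table.get((ids[pos], ids[pos + 1]))
            # Strict '<' keeps the leftmost pair among equal token ids.
            if cand is not None and (best_id is None or cand < best_id):
                best_id, best_pos = cand, pos
        if best_id is None:
            return ids  # no further replacement is possible
        ids[best_pos:best_pos + 2] = [best_id]


# Toy vocabulary: 256 = "ab", 257 = "abc" (built from 256 and "c").
table = build_pair_table([(97, 98), (256, 99)])
print(encode(b"abcab", table))
```

With this toy vocabulary, both occurrences of `ab` are merged before `abc` is formed, because token id 256 is smaller than 257.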
### Definition: Valid Encoding Sequence
An encoding sequence ` e_0..e_n ` is said to be valid if decoding the sequence and reencoding it with BPE produces the very same sequence.
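
As a minimal sketch of this validity check, the following Python decodes a sequence and re-encodes it with a naive BPE encoder. The `merges` list is a hypothetical vocabulary representation: it is assumed that ids 0..255 are the one-byte tokens and that `merges[i]` gives the two earlier token ids forming token id `256 + i`. The encoder is deliberately simple and is not meant to reflect how this repository implements it.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def decode(ids: List[int], merges: List[Pair]) -> bytes:
    """Expand every token id back into the bytes it stands for."""
    def expand(tok: int) -> bytes:
        if tok < 256:  # assumption: one-byte tokens use their byte value as id
            return bytes([tok])
        left, right = merges[tok - 256]
        return expand(left) + expand(right)

    return b"".join(expand(tok) for tok in ids)


def encode(data: bytes, merges: List[Pair]) -> List[int]:
    """Naive BPE: repeatedly apply the smallest possible token id, leftmost first."""
    table: Dict[Pair, int] = {}
    for i, pair in enumerate(merges):
        table.setdefault(pair, 256 + i)
    ids = list(data)
    while True:
        candidates = [
            (table[pair], pos)
            for pos, pair in enumerate(zip(ids, ids[1:]))
            if pair in table
        ]
        if not candidates:
            return ids
        # Tuples compare by token id first, then by position (leftmost).
        new_id, pos = min(candidates)
        ids[pos:pos + 2] = [new_id]


def is_valid(ids: List[int], merges: List[Pair]) -> bool:
    """A sequence is valid if decoding and re-encoding reproduces it."""
    return encode(decode(ids, merges), merges) == list(ids)


# Toy vocabulary: 256 = "ab", 257 = "bc".
merges = [(97, 98), (98, 99)]
print(is_valid([256, 99], merges))  # "ab" + "c"
print(is_valid([97, 257], merges))  # "a" + "bc"
```

Both sequences decode to `abc`, but only `[256, 99]` is what the encoder produces, since token id 256 is applied before 257. The sequence `[97, 257]` is therefore not a valid encoding sequence under this definition.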