
Commit

Update README.md
aneubeck committed Jul 16, 2024
1 parent dc77f0c commit 94f5cc3
Showing 1 changed file with 10 additions and 0 deletions.
10 changes: 10 additions & 0 deletions crates/bpe/README.md
@@ -41,6 +41,16 @@ Note: the tik-token library uses a combination of 1) and 3) where sections are d

## Properties of BPE

### Definition: Byte Pair Encoding

A byte pair encoding is defined by an ordered list of tokens, where a token's position in the list is its token id.
Each multi-byte token must be the concatenation of exactly two tokens that appear earlier in the list.

Encoding starts by converting every input byte into its one-byte token id.
Then, one scans the preliminary encoding and determines the smallest token id that could replace any pair of neighboring token ids.
The leftmost such pair is replaced with that token id.
The process repeats until no further replacement is possible.
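
The procedure above can be sketched directly in Rust. This is a minimal, unoptimized illustration of the definition, not this crate's implementation. It assumes that token ids `0..=255` are the one-byte tokens in byte order and that the multi-byte tokens are given as a hypothetical `merges` slice, where entry `i` names the two earlier token ids whose concatenation forms token id `256 + i`.

```rust
use std::collections::HashMap;

/// Naive BPE encoder that follows the definition literally.
/// Assumption: ids 0..=255 are the one-byte tokens, and `merges[i]` is the
/// pair of earlier token ids that forms token id `256 + i`.
fn encode(text: &[u8], merges: &[(u32, u32)]) -> Vec<u32> {
    // Map each mergeable pair to the smallest token id it can be replaced by.
    let mut replacement: HashMap<(u32, u32), u32> = HashMap::new();
    for (i, &pair) in merges.iter().enumerate() {
        replacement.entry(pair).or_insert(256 + i as u32);
    }

    // Start with every byte converted into its one-byte token id.
    let mut enc: Vec<u32> = text.iter().map(|&b| b as u32).collect();

    loop {
        // Scan for the smallest replacement id; the strict `<` keeps the
        // leftmost position when the same id is possible more than once.
        let mut best: Option<(usize, u32)> = None;
        for i in 0..enc.len().saturating_sub(1) {
            if let Some(&id) = replacement.get(&(enc[i], enc[i + 1])) {
                if best.map_or(true, |(_, best_id)| id < best_id) {
                    best = Some((i, id));
                }
            }
        }
        match best {
            // Replace the chosen pair with the single merged token id.
            Some((i, id)) => {
                enc[i] = id;
                enc.remove(i + 1);
            }
            // No neighboring pair can be replaced any more.
            None => return enc,
        }
    }
}
```

For example, with `merges = [(97, 98), (256, 99)]` (that is, `ab` is token 256 and `abc` is token 257), the input `abcab` is first rewritten to `[256, 99, 97, 98]`, then to `[256, 99, 256]`, and finally to `[257, 256]`. Every replacement rescans the whole sequence, so this sketch takes quadratic time in the worst case.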

### Definition: Valid Encoding Sequence

An encoding sequence `e_0..e_n` is said to be valid if decoding the sequence and re-encoding it with BPE produces the very same sequence.
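
This definition translates into a round-trip check. The sketch below is illustrative only: `decode`, `is_valid`, the `tokens` table (where `tokens[id]` holds the bytes of token `id`), and the `encode` parameter are hypothetical names, and the BPE encoder is passed in as a function rather than taken from this crate.

```rust
/// Concatenates the bytes of every token in the sequence.
/// Assumption: `tokens[id]` holds the byte string of token `id`.
fn decode(seq: &[u32], tokens: &[Vec<u8>]) -> Vec<u8> {
    let mut bytes = Vec::new();
    for &id in seq {
        bytes.extend_from_slice(&tokens[id as usize]);
    }
    bytes
}

/// Returns true if decoding `seq` and re-encoding the bytes with the given
/// BPE encoder yields exactly `seq` again.
fn is_valid(
    seq: &[u32],
    tokens: &[Vec<u8>],
    encode: impl Fn(&[u8]) -> Vec<u32>,
) -> bool {
    encode(&decode(seq, tokens)).as_slice() == seq
}
```

With a vocabulary in which `ab` is token 256, the sequence `[256, 99]` is valid, whereas `[97, 98, 99]` is not: it decodes to the same bytes `abc`, but BPE would encode those bytes as `[256, 99]`.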
