Merge pull request #10 from github/aneubeck/bpe
linter
aneubeck authored Jul 29, 2024
2 parents e32c83d + a455767 commit 0a012df
Showing 4 changed files with 10 additions and 5 deletions.
2 changes: 1 addition & 1 deletion crates/bpe/README.md
@@ -191,6 +191,6 @@ We compared our implementations with the tiktoken implementation on a MacBook Pr
 | Heap | 1900 µs ||
 
 As can be seen, our Backtracking implementation beats the TikToken Rust implementation by ~4x.
-And even the fully deynamic programming solution is faster with a more consistent runtime.
+And even the fully dynamic programming solution is faster with a more consistent runtime.
 The tuned heap implementation is still quite competitive to TikToken (especially for smaller inputs).
 If the requirement of correct BPE output can be relaxed, then the Greedy approach or the minimal encoding approach are the clear winners.
9 changes: 7 additions & 2 deletions crates/bpe/src/appendable_encoder.rs
@@ -56,10 +56,15 @@ impl<'a> AppendableEncoder<'a> {
     }
 
     /// Returns the number of tokens required to tokenize the input text.
-    /// This operation is O(1) and can be called at any point in time.
+    /// This operation is O(1) and can be called at any point in time.
     pub fn len(&self) -> usize {
         self.counts.last().copied().unwrap_or(0) as usize
     }
+
+    /// Returns true if the structure represents the empty string.
+    pub fn is_empty(&self) -> bool {
+        self.counts.is_empty()
+    }
 }
 
 #[cfg(test)]
@@ -78,4 +83,4 @@ mod tests {
             enc.push(*c);
         }
     }
-}
+}
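
Note on the `len`/`is_empty` pair added above: this is the pairing that Clippy's `len_without_is_empty` lint asks for. That the lint motivated the change is an assumption based on the commit message ("linter"). A minimal sketch of the idea follows, using a hypothetical `PrefixCounts` type and a toy tokenization rule. It is not BPE and not the crate's `AppendableEncoder`; only the `counts` field and the bodies of `len` and `is_empty` mirror the diff.

```rust
/// Hypothetical stand-in for `AppendableEncoder` (illustration only).
/// `counts[i]` holds the number of tokens needed for the first `i + 1` bytes.
pub struct PrefixCounts {
    counts: Vec<u32>,
    last_byte: Option<u8>,
}

impl PrefixCounts {
    pub fn new() -> Self {
        Self { counts: Vec::new(), last_byte: None }
    }

    /// Appends one byte. Toy rule (an assumption, NOT BPE): every space is its
    /// own token and every maximal run of non-space bytes is one token.
    pub fn push(&mut self, byte: u8) {
        let prev = self.counts.last().copied().unwrap_or(0);
        let continues_word =
            byte != b' ' && matches!(self.last_byte, Some(b) if b != b' ');
        self.counts.push(if continues_word { prev } else { prev + 1 });
        self.last_byte = Some(byte);
    }

    /// O(1): the token count of the whole input is the last prefix count.
    pub fn len(&self) -> usize {
        self.counts.last().copied().unwrap_or(0) as usize
    }

    /// True only if no byte has been pushed yet.
    pub fn is_empty(&self) -> bool {
        self.counts.is_empty()
    }
}
```

Because one count is stored per pushed byte, `is_empty` can check the vector directly instead of comparing `len()` with zero.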
2 changes: 1 addition & 1 deletion crates/bpe/src/backtrack_encoder.rs
@@ -6,7 +6,7 @@ use crate::byte_pair_encoding::BytePairEncoding;
 /// for a given input text.
 /// It keeps track of visited states in a bitfield and only remembers the tokenization
 /// of the currently processed dynamic programming state.
-///
+///
 /// The biggest downside of this approach is that the search for the longest leftmost match
 /// has to be reset at every (backtracking) step which is still a net win in practice compared to other approaches.
 pub(crate) struct BacktrackEncoder<'a> {
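
Note on the doc comment above: the combination of "longest leftmost match", backtracking, and a bitfield of visited states can be illustrated with a much simpler problem. The sketch below is a toy analogue, not the crate's `BacktrackEncoder`. It splits a string into dictionary words instead of BPE tokens, and the names `segment`, `is_dead`, and `mark_dead` are hypothetical. Bit `i` of the bitfield records that position `i` was already explored and cannot be completed, so no position is abandoned twice.

```rust
/// Returns true if bit `i` is set in the bitfield.
fn is_dead(bits: &[u64], i: usize) -> bool {
    (bits[i / 64] >> (i % 64)) & 1 == 1
}

/// Sets bit `i` in the bitfield.
fn mark_dead(bits: &mut [u64], i: usize) {
    bits[i / 64] |= 1 << (i % 64);
}

/// Toy analogue (NOT BPE): split `text` into words from `dict`, always trying
/// the longest match first and backtracking when a position is a dead end.
pub fn segment<'a>(text: &'a str, dict: &[&str]) -> Option<Vec<&'a str>> {
    let bytes = text.as_bytes();
    let n = bytes.len();
    // One bit per position 0..=n; a set bit means "cannot finish from here".
    let mut dead = vec![0u64; n / 64 + 1];
    // Only the current path is remembered, as (start, length) pairs.
    let mut chosen: Vec<(usize, usize)> = Vec::new();
    let mut pos = 0;
    while pos < n {
        // Longest dictionary word matching at `pos` whose end is not known dead.
        let next = dict
            .iter()
            .filter(|w| !w.is_empty() && bytes[pos..].starts_with(w.as_bytes()))
            .map(|w| w.len())
            .filter(|&len| !is_dead(&dead, pos + len))
            .max();
        match next {
            Some(len) => {
                chosen.push((pos, len));
                pos += len;
            }
            None => {
                // Dead end: record it, then restart the search one step back.
                mark_dead(&mut dead, pos);
                let (start, _) = chosen.pop()?;
                pos = start;
            }
        }
    }
    Some(chosen.iter().map(|&(s, l)| &text[s..s + l]).collect())
}
```

With the dictionary `["abc", "ab", "cd"]` and the input `"abcd"`, the search first picks `abc`, finds no match at position 3, marks it dead, and falls back to `ab` followed by `cd`. As in the doc comment, the match search restarts from scratch after every backtracking step.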
2 changes: 1 addition & 1 deletion crates/bpe/src/interval_encoding.rs
@@ -57,7 +57,7 @@ impl<'a> IntervalEncoding<'a> {
     /// Thereby it reencodes the prefix with the `BacktrackEncoder` until the encoding sequence becomes
     /// compatible with the precomputed tables. Once that's the case, the remainder of the range becomes
     /// a simple O(1) lookup.
-    ///
+    ///
     /// Note: in the worst-case the complexity is O(n). This happens for instance for a whitespace input
     /// where the encoding changes when the starting position changes.
     pub fn count(&self, range: Range<usize>) -> usize {
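
Note on the `count` doc comment above: the "re-encode the prefix until it is compatible with the precomputed tables, then do an O(1) lookup" idea can be sketched with a toy tokenizer. Everything below is an assumption made for illustration: `ToyIntervalCounts`, `count_direct`, and the rule "each space is a token, each maximal run of non-space bytes is a token" are hypothetical and stand in for the crate's `IntervalEncoding` and real BPE.

```rust
use std::ops::Range;

/// Toy analogue of a precomputed interval table (NOT the crate's type).
pub struct ToyIntervalCounts {
    /// `starts[i]` = number of token starts among byte positions `< i`
    /// when the whole text is tokenized from position 0.
    starts: Vec<u32>,
}

impl ToyIntervalCounts {
    pub fn new(text: &[u8]) -> Self {
        let mut starts: Vec<u32> = Vec::with_capacity(text.len() + 1);
        starts.push(0);
        for (i, &b) in text.iter().enumerate() {
            let is_start = b == b' ' || i == 0 || text[i - 1] == b' ';
            let prev = starts[i];
            starts.push(prev + is_start as u32);
        }
        Self { starts }
    }

    /// True if a token of the full-text tokenization begins at position `i`.
    fn is_start(&self, i: usize) -> bool {
        self.starts[i + 1] > self.starts[i]
    }

    /// Number of tokens needed for `text[range]` tokenized on its own.
    pub fn count(&self, range: Range<usize>) -> usize {
        // Scan the prefix until the range's tokenization lines up with the
        // full-text one. This is the part that can take O(n) steps.
        let mut sync = range.start;
        while sync < range.end && !self.is_start(sync) {
            sync += 1;
        }
        // A non-empty scanned prefix is one (cut-off) token in this toy rule.
        let head = usize::from(sync > range.start);
        // From the sync point on, the answer is a table difference: O(1).
        head + (self.starts[range.end] - self.starts[sync]) as usize
    }
}

/// Reference implementation: tokenize a standalone slice from scratch.
pub fn count_direct(text: &[u8]) -> usize {
    (0..text.len())
        .filter(|&i| text[i] == b' ' || i == 0 || text[i - 1] == b' ')
        .count()
}
```

In this toy the slow case is a range that starts in the middle of a long word, whereas the doc comment names whitespace input as the slow case for real BPE. The structure is the same in both: a scan that is linear in the worst case, followed by a constant-time lookup.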
