Commit 0a012df

Merge pull request #10 from github/aneubeck/bpe
linter
2 parents e32c83d + a455767 commit 0a012df

4 files changed: +10 -5 lines changed

crates/bpe/README.md

Lines changed: 1 addition & 1 deletion
@@ -191,6 +191,6 @@ We compared our implementations with the tiktoken implementation on a MacBook Pr
 | Heap | 1900 µs ||

 As can be seen, our Backtracking implementation beats the TikToken Rust implementation by ~4x.
-And even the fully deynamic programming solution is faster with a more consistent runtime.
+And even the fully dynamic programming solution is faster with a more consistent runtime.
 The tuned heap implementation is still quite competitive to TikToken (especially for smaller inputs).
 If the requirement of correct BPE output can be relaxed, then the Greedy approach or the minimal encoding approach are the clear winners.

crates/bpe/src/appendable_encoder.rs

Lines changed: 7 additions & 2 deletions
@@ -56,10 +56,15 @@ impl<'a> AppendableEncoder<'a> {
     }

     /// Returns the number of tokens required to tokenize the input text.
-    /// This operation is O(1) and can be called at any point in time.
+    /// This operation is O(1) and can be called at any point in time.
     pub fn len(&self) -> usize {
         self.counts.last().copied().unwrap_or(0) as usize
     }
+
+    /// Returns true if the structure represents the empty string.
+    pub fn is_empty(&self) -> bool {
+        self.counts.is_empty()
+    }
 }

 #[cfg(test)]
@@ -78,4 +83,4 @@ mod tests {
             enc.push(*c);
         }
     }
-}
+}
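
The pairing of `len` and `is_empty` above can be illustrated in isolation. The commit message only says "linter", so the motivation is an assumption here: Clippy's `len_without_is_empty` lint flags a public `len` method that has no matching `is_empty`. The sketch below uses a hypothetical `PrefixCounts` type, not the crate's `AppendableEncoder`, and its `push` rule (one token per byte) is a placeholder rather than real BPE. It only shows why `len` is O(1) when a running count is stored per pushed byte.

```rust
/// Hypothetical stand-in for `AppendableEncoder`; it performs no tokenization.
#[derive(Default)]
pub struct PrefixCounts {
    /// Assumed layout: counts[i] is the token count of the first i + 1 bytes.
    counts: Vec<u32>,
}

impl PrefixCounts {
    pub fn new() -> Self {
        Self::default()
    }

    /// Placeholder rule (an assumption, not BPE): every byte adds one token.
    pub fn push(&mut self, _byte: u8) {
        let prev = self.counts.last().copied().unwrap_or(0);
        self.counts.push(prev + 1);
    }

    /// O(1): the answer is the last stored running count, or 0 if nothing was pushed.
    pub fn len(&self) -> usize {
        self.counts.last().copied().unwrap_or(0) as usize
    }

    /// True only while no byte has been pushed.
    pub fn is_empty(&self) -> bool {
        self.counts.is_empty()
    }
}
```

With this layout, `is_empty` checks the backing vector rather than comparing `len()` to zero, which matches the form used in the diff.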

crates/bpe/src/backtrack_encoder.rs

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ use crate::byte_pair_encoding::BytePairEncoding;
 /// for a given input text.
 /// It keeps track of visited states in a bitfield and only remembers the tokenization
 /// of the currently processed dynamic programming state.
-///
+///
 /// The biggest downside of this approach is that the search for the longest leftmost match
 /// has to be reset at every (backtracking) step which is still a net win in practice compared to other approaches.
 pub(crate) struct BacktrackEncoder<'a> {
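
The doc comment above mentions tracking visited states in a bitfield. As a rough illustration of that idea only, here is a minimal sketch of a bitfield with one bit per text position. The `VisitedBits` name and its methods are hypothetical and are not taken from the crate, whose actual bitfield type is not shown in this diff.

```rust
/// Hypothetical bitfield with one bit per state; not the crate's actual type.
pub struct VisitedBits {
    words: Vec<u64>,
}

impl VisitedBits {
    /// Allocates enough 64-bit words to hold `bits` flags, all cleared.
    pub fn new(bits: usize) -> Self {
        Self {
            words: vec![0; (bits + 63) / 64],
        }
    }

    /// Returns whether state `i` has been marked. Panics if `i` is out of range.
    pub fn is_set(&self, i: usize) -> bool {
        self.words[i / 64] & (1u64 << (i % 64)) != 0
    }

    /// Marks state `i` as visited. Panics if `i` is out of range.
    pub fn set(&mut self, i: usize) {
        self.words[i / 64] |= 1u64 << (i % 64);
    }
}
```

A backtracking search could consult `is_set` before exploring a position and call `set` once that position has been tried, so that each state is expanded at most once while using one bit of memory per state.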

crates/bpe/src/interval_encoding.rs

Lines changed: 1 addition & 1 deletion
@@ -57,7 +57,7 @@ impl<'a> IntervalEncoding<'a> {
     /// Thereby it reencodes the prefix with the `BacktrackEncoder` until the encoding sequence becomes
     /// compatible with the precomputed tables. Once that's the case, the remainder of the range becomes
     /// a simple O(1) lookup.
-    ///
+    ///
     /// Note: in the worst-case the complexity is O(n). This happens for instance for a whitespace input
     /// where the encoding changes when the starting position changes.
     pub fn count(&self, range: Range<usize>) -> usize {
