4 files changed, +10 -5 lines changed
@@ -191,6 +191,6 @@ We compared our implementations with the tiktoken implementation on a MacBook Pr
 | Heap | 1900 µs | ✔ |

 As can be seen, our Backtracking implementation beats the TikToken Rust implementation by ~ 4x.
-And even the fully deynamic programming solution is faster with a more consistent runtime.
+And even the fully dynamic programming solution is faster with a more consistent runtime.
 The tuned heap implementation is still quite competitive to TikToken (especially for smaller inputs).
 If the requirement of correct BPE output can be relaxed, then the Greedy approach or the minimal encoding approach are the clear winners.
@@ -56,10 +56,15 @@ impl<'a> AppendableEncoder<'a> {
     }

     /// Returns the number of tokens required to tokenize the input text.
-    /// This operation is O(1) and can be called at any point in time.
+    /// This operation is O(1) and can be called at any point in time.
     pub fn len(&self) -> usize {
         self.counts.last().copied().unwrap_or(0) as usize
     }
+
+    /// Returns true if the structure represents the empty string.
+    pub fn is_empty(&self) -> bool {
+        self.counts.is_empty()
+    }
 }

 #[cfg(test)]
@@ -78,4 +83,4 @@ mod tests {
             enc.push(*c);
         }
     }
-}
+}
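
The `len` / `is_empty` pair in the hunk above relies on a vector of per-prefix token counts. A minimal sketch of that idea, assuming a hypothetical `PrefixCounts` type with `u32` entries (neither the name nor the element type is taken from the crate, and the caller supplies the counts instead of an encoder computing them):

```rust
/// Hypothetical stand-in, not the crate's `AppendableEncoder`.
/// Entry `i` holds the token count of the prefix ending at byte `i`.
pub struct PrefixCounts {
    counts: Vec<u32>, // element type is an assumption
}

impl PrefixCounts {
    pub fn new() -> Self {
        Self { counts: Vec::new() }
    }

    /// Records the token count of the prefix ending at the newly appended byte.
    /// A real encoder would compute this value; here the caller passes it in.
    pub fn push(&mut self, prefix_token_count: u32) {
        self.counts.push(prefix_token_count);
    }

    /// O(1): the last entry is the count for everything pushed so far.
    pub fn len(&self) -> usize {
        self.counts.last().copied().unwrap_or(0) as usize
    }

    /// True if nothing has been pushed yet, i.e. the empty string.
    pub fn is_empty(&self) -> bool {
        self.counts.is_empty()
    }
}
```

Note that `is_empty` checks the vector itself rather than `len() == 0`, so it answers "has any byte been pushed" independently of the stored counts.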
@@ -6,7 +6,7 @@ use crate::byte_pair_encoding::BytePairEncoding;
 /// for a given input text.
 /// It keeps track of visited states in a bitfield and only remembers the tokenization
 /// of the currently processed dynamic programming state.
-///
+///
 /// The biggest downside of this approach is that the search for the longest leftmost match
 /// has to be reset at every (backtracking) step which is still a net win in practice compared to other approaches.
 pub(crate) struct BacktrackEncoder<'a> {
@@ -57,7 +57,7 @@ impl<'a> IntervalEncoding<'a> {
     /// Thereby it reencodes the prefix with the `BacktrackEncoder` until the encoding sequence becomes
     /// compatible with the precomputed tables. Once that's the case, the remainder of the range becomes
     /// a simple O(1) lookup.
-    ///
+    ///
     /// Note: in the worst-case the complexity is O(n). This happens for instance for a whitespace input
     /// where the encoding changes when the starting position changes.
     pub fn count(&self, range: Range<usize>) -> usize {