An efficient implementation of BytePairTokenizer #36
Comments
Hi @gboduljak! Yeah, we would want a tokenizer in C++. I think for starters implementing it similarly to the Python one, but in C++, would be sufficient. BPE is quite a simple algorithm, and if a Python implementation is usable, I think a C++ one would be at least as usable (probably much faster), with the benefit of allowing us to use threads. Subsequently, we can optimize it if needed.
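For reference, the simple approach mentioned above can be sketched roughly as follows. This is only an illustration in Python, not the transformers code and not a proposed mlx-data API: the function name `bpe_naive` and the `ranks` table (mapping a symbol pair to its merge priority, lower first) are assumptions made for the example.

```python
def bpe_naive(word, ranks):
    """Greedy BPE encoding of one word by rescanning all pairs per merge.

    `ranks` is a hypothetical merge table: (left, right) -> priority,
    where a lower number means the pair is merged earlier.
    """
    symbols = list(word)
    while len(symbols) > 1:
        # Scan every adjacent pair and pick the one with the lowest rank.
        best = min(
            zip(symbols, symbols[1:]),
            key=lambda pair: ranks.get(pair, float("inf")),
        )
        if best not in ranks:
            break  # no applicable merge is left
        # Merge every non-overlapping occurrence of `best`, left to right.
        merged = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols
```

Each round costs a full scan over the current symbols, which is fine for short words and is the part a later optimization would target.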
As suggested in @angeloskath's code review ml-explore/mlx-examples#315 (comment), an implementation of BytePairTokenizer seems useful for many use cases, but it is currently missing in mlx-data. I did some research on byte pair tokenization in transformers, and I think that the implementation in transformers is somewhat slow. More precisely, every time a merge could be done, the implementation iterates over all adjacent symbol pairs to determine the best pair to merge. With up to n - 1 merges for a word of n symbols, each requiring a scan over the remaining pairs, this implies quadratic time complexity in the worst case. However, in the referenced paper, there is an elegant linearithmic time implementation. Since the implementation requires some pointer trickery, it seems that we could (relatively) easily implement this in C++ and expose it to Python.

I would appreciate your thoughts on:
Paper: https://arxiv.org/pdf/2306.16837.pdf
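To make the "pointer trickery" concrete, here is a minimal Python sketch of one common way to avoid rescanning: keep the symbols in a doubly linked list (index arrays here) and the candidate pairs in a min-heap, discarding stale heap entries lazily. It is not the paper's algorithm verbatim and not an existing mlx-data or transformers function. The name `bpe_heap` and the `ranks` table (symbol pair to merge priority, lower first) are assumptions for illustration, and the sketch assumes a consistent merge table in which the merge that creates a symbol ranks before any merge that uses it.

```python
import heapq


def bpe_heap(word, ranks):
    """Greedy BPE encoding of one word using a linked list and a min-heap."""
    symbols = list(word)
    n = len(symbols)
    prev = list(range(-1, n - 1))  # prev[i]: index of left neighbour, -1 if none
    nxt = list(range(1, n + 1))    # nxt[i]: index of right neighbour, -1 if none
    if n:
        nxt[-1] = -1
    alive = [True] * n
    heap = []

    def push(i):
        # Queue the pair starting at node i if it is a known merge.
        j = nxt[i]
        if j != -1:
            pair = (symbols[i], symbols[j])
            if pair in ranks:
                heapq.heappush(heap, (ranks[pair], i, pair))

    for i in range(n):
        push(i)

    while heap:
        _, i, pair = heapq.heappop(heap)
        j = nxt[i] if alive[i] else -1
        if j == -1 or (symbols[i], symbols[j]) != pair:
            continue  # stale entry: one of the two nodes changed since the push
        # Merge node j into node i and unlink j from the list.
        symbols[i] = pair[0] + pair[1]
        alive[j] = False
        nxt[i] = nxt[j]
        if nxt[j] != -1:
            prev[nxt[j]] = i
        # Only the two pairs touching the merged node can be new.
        if prev[i] != -1:
            push(prev[i])
        push(i)

    return [s for s, a in zip(symbols, alive) if a]
```

Every merge removes one node and pushes at most two heap entries, so the number of heap operations is linear in the word length and each costs logarithmic time. In C++ the same structure would map naturally onto index arrays plus `std::priority_queue`.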