Issues: huggingface/tokenizers
#1546: "Solution" to memory hogging in train_new_from_iterator with a hack (opened Jun 4, 2024 by morphpiece)
#1545: How can I get the mapping between byte values and Unicode characters of the fast tokenizer? (opened Jun 4, 2024 by LuoKaiGSW)
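The byte-to-Unicode mapping asked about here comes from the GPT-2 byte-level scheme, where every byte 0–255 is mapped to a printable Unicode character so that arbitrary bytes can appear in a text vocabulary. A minimal stdlib sketch of that table (mirroring the `bytes_to_unicode` helper used by GPT-2-style tokenizers in `transformers`):

```python
def bytes_to_unicode():
    """Map every byte (0-255) to a printable Unicode character, as in
    GPT-2's byte-level BPE: printable bytes map to themselves and the
    remaining bytes are shifted into the U+0100+ range (e.g. space -> 'Ġ')."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte_to_char = bytes_to_unicode()
# Inverting the dict recovers the byte value for each vocabulary character.
char_to_byte = {c: b for b, c in byte_to_char.items()}
```

Because the mapping is a bijection, inverting it is enough to recover raw bytes from the characters seen in a byte-level vocabulary (for example, `'Ġ'` maps back to the space byte, 32).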
#1537: Training HuggingFace tokenizer - ignore_merges [Feature Request, planned] (opened May 22, 2024 by ykoyfman)
#1534: How to allow the merging of consecutive newline tokens \n when training a byte-level BPE tokenizer? (opened May 18, 2024 by liuslnlp)
#1531: How to batch-encode paired input sentences with tokenizers: seeking clarification (opened May 14, 2024 by insookim43)
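For the pair-encoding question, `Tokenizer.encode_batch` in the `tokenizers` library accepts `(sentence_a, sentence_b)` tuples directly. A minimal sketch using a toy `WordLevel` vocabulary (the vocabulary and sentences are illustrative; a real setup would load a trained tokenizer):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Toy vocabulary for demonstration only.
vocab = {"[UNK]": 0, "hello": 1, "world": 2, "how": 3, "are": 4, "you": 5}
tok = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

# Each tuple is (sentence_a, sentence_b); the whole batch is encoded in one call.
pairs = [("hello world", "how are you"), ("hello", "world")]
encodings = tok.encode_batch(pairs)

for enc in encodings:
    # type_ids distinguish the two segments: 0 for sentence A, 1 for sentence B.
    print(enc.tokens, enc.type_ids)
```

Each returned `Encoding` holds the merged pair, with `type_ids` marking which segment each token came from; model-specific special tokens (e.g. `[CLS]`/`[SEP]`) are only added if a post-processor is configured.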
#1527: Special token handling breaks idempotency of sentencepiece due to extra spaces (opened May 9, 2024 by cat-state)
#1526: Link to download the training text in docs/source/quicktour.rst is broken (opened May 9, 2024 by 14jdelap)
#1522: Error: Cannot find module 'tokenizers/bindings/tokenizer' [Stale] (opened May 6, 2024 by meichangsu1)
#1515: UnigramTrainer: byte_fallback is false [Feature Request, training] (opened Apr 25, 2024 by Moddus)
#1514: BPE Trainer doesn't respect the vocab_size parameter when dataset size is increased [Stale] (opened Apr 25, 2024 by Abhinay1997)
#1501: Extended-vocab tokenizer merging text into a single string without spaces while decoding (opened Apr 17, 2024 by savanth14)