Error in Training a Byte-Level BPE in "Building a tokenizer, block by block" #775

Open
FurtherAI opened this issue Jan 26, 2025 · 0 comments

This line for training a Byte-Level BPE has an error. You have to add an initial alphabet of bytes, otherwise the tokenizer will not fall back to individual bytes when tokens are missing from the vocabulary, and characters from your input can silently disappear when the output is decoded.
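
To see the failure mode, here is a minimal sketch. The tiny corpus, vocab size, and test string are made up for illustration; the point is only that the training data is ASCII-only and no initial alphabet is given:

from tokenizers import Tokenizer, decoders, pre_tokenizers, trainers
from tokenizers.models import BPE

# Hypothetical repro: no initial_alphabet, ASCII-only training corpus
tok = Tokenizer(BPE())
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tok.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(vocab_size=500)  # note: no initial_alphabet here
tok.train_from_iterator(["hello world", "byte level bpe"], trainer=trainer)

enc = tok.encode("héllo")
print(tok.decode(enc.ids))  # the byte symbols for "é" are not in the vocab, so the character is dropped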

For reference, and to help anyone hitting the same problem, training a byte-level BPE should go as in this example.

Here is some shortened code so you don't have to follow the link or piece it together from the tutorial:

from tokenizers import Tokenizer, decoders, pre_tokenizers, processors, trainers
from tokenizers.models import BPE

# Byte-level components for pre-tokenization, decoding and post-processing
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=add_prefix_space)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=trim_offsets)

# ds is a dataset with a "text" column; yields batches of raw strings
def batch_generator(ds, batch_size):
    for i in range(0, len(ds), batch_size):
        yield ds[i : i + batch_size]['text']

trainer = trainers.BpeTrainer(
    vocab_size=vocab_size,
    min_frequency=min_frequency,
    show_progress=show_progress,
    special_tokens=special_tokens,
    # The important part: seed the vocabulary with the 256 byte-level symbols
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train_from_iterator(
    batch_generator(ds, batch_size),
    trainer=trainer,
    length=length,
)
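
With the byte alphabet in the vocabulary, you can sanity-check the round trip. This is just an illustrative check (assuming add_prefix_space=False), not part of the tutorial:

# Characters never seen during training should survive encode/decode via byte fallback
enc = tokenizer.encode("héllo wörld")
assert tokenizer.decode(enc.ids) == "héllo wörld"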