Error in Training a Byte-Level BPE in "Building a tokenizer, block by block" #775

Open
FurtherAI opened this issue Jan 26, 2025 · 0 comments

This line for training a Byte-Level BPE has an error. You have to add an initial alphabet of bytes, otherwise the tokenizer will not fall back to individual bytes when tokens are missing from the vocabulary, and characters from your input can silently disappear when the output is decoded.
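
To see the failure mode, here is a minimal sketch. The tiny corpus, vocab size, and test string are made up for illustration; the point is only that the training data is ASCII-only and no initial alphabet is given:

from tokenizers import Tokenizer, decoders, pre_tokenizers, trainers
from tokenizers.models import BPE

# Hypothetical repro: no initial_alphabet, ASCII-only training corpus
tok = Tokenizer(BPE())
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tok.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(vocab_size=500)  # note: no initial_alphabet here
tok.train_from_iterator(["hello world", "byte level bpe"], trainer=trainer)

enc = tok.encode("héllo")
print(tok.decode(enc.ids))  # the byte symbols for "é" are not in the vocab, so the character is dropped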

For reference, and to help anyone hitting the same problem, training a byte-level BPE should go as in this example.

Here is some shortened code so you don't have to follow the link or piece it together from the tutorial:

from tokenizers import Tokenizer, decoders, pre_tokenizers, processors, trainers
from tokenizers.models import BPE

# Byte-level components for pre-tokenization, decoding and post-processing
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=add_prefix_space)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=trim_offsets)

# ds is a dataset with a "text" column; yields batches of raw strings
def batch_generator(ds, batch_size):
    for i in range(0, len(ds), batch_size):
        yield ds[i : i + batch_size]['text']

trainer = trainers.BpeTrainer(
    vocab_size=vocab_size,
    min_frequency=min_frequency,
    show_progress=show_progress,
    special_tokens=special_tokens,
    # The important part: seed the vocabulary with the 256 byte-level symbols
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train_from_iterator(
    batch_generator(ds, batch_size),
    trainer=trainer,
    length=length,
)
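
With the byte alphabet in the vocabulary, you can sanity-check the round trip. This is just an illustrative check (assuming add_prefix_space=False), not part of the tutorial:

# Characters never seen during training should survive encode/decode via byte fallback
enc = tokenizer.encode("héllo wörld")
assert tokenizer.decode(enc.ids) == "héllo wörld"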