spm-tokenizer-pattern

This repository implements a common pattern for training a SentencePiece tokenizer from a raw corpus and using it within the TensorFlow framework. It basically replicates the process implemented in this repository. The roles of its two main components are:

preprocess.py : fetch data from the source, extract information from it, and save the train and test data
train.py : extract the corpus from the train data, then train and save the tokenizer

Unlike a Hugging Face tokenizer, a SentencePiece tokenizer can be incorporated directly into a TensorFlow graph (i.e., the model). The resulting NLP model can then be defined and saved entirely as a TensorFlow SavedModel, which keeps any system that uses the model relatively compact and simple. For this reason, the repository also implements an additional pattern in a separate module, described below.

evaluate.py : load the trained SentencePiece tokenizer through the tensorflow-text API and use it

The code was written and run on macOS with Python 3.9.

Execute

To run the implemented process, change directory into the cloned repository and execute the following:

python3 --version  # Python 3.9.5
python3 -m venv venv  # generate virtual environment for this project
source venv/bin/activate  # activate generated virtual environment
pip install -r requirements.txt  # install required packages

Each step can then be run by passing the appropriate argument to the main module. For example, to fetch the IMDb data from its web source and compose the corpus, execute:

python3 main.py preprocess

Or, to train the SentencePiece tokenizer on the preprocessed corpus, execute:

python3 main.py train

If the configuration in config.py is left unmodified, these steps save the trained SentencePiece tokenizer to the following absolute path on the local machine:

/tmp/spmTokenizer/tokenizer.model
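As a rough illustration of how a training step can produce that file, the sketch below collects keyword arguments for `sentencepiece.SentencePieceTrainer.train`. It is not the code in train.py: the helper names, the corpus path, and the vocabulary size are hypothetical, and the special-token pieces are only assumed to match the `[CLS]`, `[SEP]`, and `[PAD]` pieces that appear in the tokenization output later in this README.

```python
from pathlib import Path


def build_trainer_args(corpus_path: str, model_dir: str, vocab_size: int = 8000) -> dict:
    """Collect keyword arguments for sentencepiece.SentencePieceTrainer.train.

    All values here are illustrative assumptions, not the repository's config.
    """
    return {
        "input": corpus_path,
        # SentencePiece writes <model_prefix>.model and <model_prefix>.vocab.
        "model_prefix": f"{model_dir}/tokenizer",
        "vocab_size": vocab_size,  # hypothetical default
        # Special-token ids and pieces (assumed).
        "pad_id": 0,
        "unk_id": 1,
        "bos_id": 2,
        "eos_id": 3,
        "pad_piece": "[PAD]",
        "bos_piece": "[CLS]",
        "eos_piece": "[SEP]",
    }


def train_tokenizer(corpus_path: str, model_dir: str) -> str:
    """Train a tokenizer and return the path of the saved model file."""
    # Imported lazily so build_trainer_args can be used without the package.
    import sentencepiece as spm

    Path(model_dir).mkdir(parents=True, exist_ok=True)
    args = build_trainer_args(corpus_path, model_dir)
    spm.SentencePieceTrainer.train(**args)
    return args["model_prefix"] + ".model"
```

Calling `train_tokenizer("corpus.txt", "/tmp/spmTokenizer")` with a real corpus file would, under these assumptions, write the model to the path shown above.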

Once this file exists, the logic implemented in the evaluate.py module can be run with the following command:

python3 main.py evaluate
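For readers curious how a single entry point can route the three commands above, here is a minimal sketch using `argparse`. It is an assumption about the general shape of such a dispatcher, not the contents of main.py; the three functions are placeholders for the logic in preprocess.py, train.py, and evaluate.py.

```python
import argparse


def preprocess() -> str:
    return "preprocess"  # placeholder for the logic in preprocess.py


def train() -> str:
    return "train"  # placeholder for the logic in train.py


def evaluate() -> str:
    return "evaluate"  # placeholder for the logic in evaluate.py


# Map each command-line argument to the function that handles it.
COMMANDS = {"preprocess": preprocess, "train": train, "evaluate": evaluate}


def run(argv: list) -> str:
    """Parse the command name from argv and call the matching function."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("command", choices=sorted(COMMANDS))
    args = parser.parse_args(argv)
    return COMMANDS[args.command]()
```

In a real entry point, `run(sys.argv[1:])` would be called under an `if __name__ == "__main__":` guard, so that `python3 main.py train` reaches `train()`.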

The SentencepieceTokenizerLayer class defined in evaluate.py implements the text preprocessing logic as a series of TensorFlow operations. On initialization it loads the trained tokenizer saved in the local directory, and then it applies the following operations in order. Check the source code for more details.

normalize strings
1. tf.strings.lower: takes a raw text tensor as input, converts the uppercase characters of every string in the tensor to lowercase, and returns the processed tensor
2. tf.strings.regex_replace: takes a text tensor as input, replaces every match of a regular expression pattern in every string in the tensor, and returns the processed tensor
tokenize normalized strings
1. SentencepieceTokenizer.tokenize: takes a tensor of strings as input and returns the corresponding ragged tensor of tokens in a user-defined data type
2. to_tensor: method of a ragged tensor that converts it into a dense rectangular tensor. It evens out the row lengths by filling shorter rows with a given value (usually the pad token) up to the maximum row length in the input tensor
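The steps listed above can be mimicked in plain Python to see what each one does to the data. This is only an analogue, not the TensorFlow ops themselves: the regular expression is an assumed pattern (the actual one is in evaluate.py), Python's `re` syntax differs from the RE2 syntax used by `tf.strings.regex_replace`, and `normalize` and `to_dense` are hypothetical helper names.

```python
import re

# Assumed pattern: anything that is not a lowercase letter, a digit, or a space.
NON_ALNUM = re.compile(r"[^a-z0-9 ]")


def normalize(text: str) -> str:
    lowered = text.lower()              # analogue of tf.strings.lower
    return NON_ALNUM.sub(" ", lowered)  # analogue of tf.strings.regex_replace


def to_dense(rows, pad):
    """Analogue of RaggedTensor.to_tensor: pad rows to the longest row in the batch."""
    width = max((len(row) for row in rows), default=0)
    return [list(row) + [pad] * (width - len(row)) for row in rows]


print(normalize("You'll see that CAPITAL letters"))
# you ll see that capital letters
print(to_dense([[0], [1, 2], [3, 4, 5]], pad=-1))
# [[0, -1, -1], [1, 2, -1], [3, 4, 5]]
```

Note that the width of the dense result depends on the longest row in the batch, so two batches can come out with different widths.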

In most cases, the length of a token sequence has to be fixed before it is fed to a language model. Because the width of the tensor returned by to_tensor varies from batch to batch, a utility that pads every sequence to a predefined constant maximum length is required. For this, TensorFlow provides the Keras pad_sequences function, which takes an iterable of iterables as input and returns a NumPy array of sequences whose length is fixed to the maximum length. This is why attaching to_tensor at the end of tokenization was essential: a ragged tensor itself is not iterable. To check, execute:

import tensorflow as tf

hasattr(tf.ragged.constant([[0], [1,2], [3,4,5]]), "__iter__")  # False
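The fixed-length padding described above can be sketched in plain Python. This is a simplified stand-in for illustration, not the Keras implementation: `pad_to_length` is a hypothetical helper that pads and truncates at the end of each sequence, whereas Keras `pad_sequences` pads and truncates at the front by default unless `padding="post"` and `truncating="post"` are passed.

```python
def pad_to_length(sequences, max_len, pad_value=0):
    """Return every sequence padded or truncated to exactly max_len items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:max_len]                                # truncate at the end
        padded.append(seq + [pad_value] * (max_len - len(seq)))  # pad at the end
    return padded


# Two batches with different longest rows end up with the same width.
batch_a = pad_to_length([[5, 6], [7]], max_len=4)
batch_b = pad_to_length([[1, 2, 3, 4, 5, 6]], max_len=4)
print(batch_a)  # [[5, 6, 0, 0], [7, 0, 0, 0]]
print(batch_b)  # [[1, 2, 3, 4]]
```

Padding at the end matches the padded output shown later in this README, where the `[PAD]` tokens follow `[SEP]`.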

To show how the tokenizer actually works, the main module prints the tokenization results for a list of dummy texts. The list is defined as

dummy_texts = [
    "You'll see that CAPITAL letters in the review",
    "is INDEED converted into lowercase letters,",
    "texts are normalized and eos, bos tokens are added."
]

which is tokenized respectively into:

[
    ['[CLS]', '▁you', '▁ll', '▁see', '▁that', '▁capital', '▁letters', '▁in', '▁the', '▁review', '[SEP]'],
    ['[CLS]', '▁is', '▁indeed', '▁converted', '▁into', '▁lower', 'case', '▁letters', '[SEP]'],
    ['[CLS]', '▁text', 's', '▁are', '▁normal', 'ized', '▁and', '▁e', 'os', '▁bos', '▁to', 'k', 'ens', '▁are', '▁added', '[SEP]']
]

and when sequence padding is applied:

[
    ['[CLS]', '▁you', '▁ll', '▁see', '▁that', '▁capital', '▁letters', '▁in', '▁the', '▁review', '[SEP]', '[PAD]', ...],
    ['[CLS]', '▁is', '▁indeed', '▁converted', '▁into', '▁lower', 'case', '▁letters', '[SEP]', '[PAD]', '[PAD]', ...],
    ['[CLS]', '▁text', 's', '▁are', '▁normal', 'ized', '▁and', '▁e', 'os', '▁bos', '▁to', 'k', 'ens', '▁are', '▁added', '[SEP]', '[PAD]', '[PAD]', ...]
]


sunsikim/tf-spm-tokenizer-pattern
