This directory contains several utilities, e.g., for data pre-processing.

Instructions for using the BPE and WPM encodings are as follows. See examples/transformer for a real example that uses these encodings.

Note that there are a few different (sub-)word encoding approaches and implementations which are used by several popular models. For example:

  • BPE by Rico Sennrich: Used in the Transformer for machine translation. This is the version in this directory, including both BPE training and encoding/decoding (a toy sketch of the BPE merge procedure follows this list).
  • BPE by OpenAI: Used in the GPT-2 language model. Includes BPE encoding/decoding and a provided BPE vocab (no training).
  • WordPiece by Google: Used in BERT for text embedding. Includes WordPiece encoding/decoding and a provided vocab (no training).
  • SPM by sentencepiece: Used in the Transformer for machine translation. Includes both SPM training and encoding/decoding; this is the implementation used by the WPM pipeline below (installed via pip).
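For intuition, below is a minimal toy sketch of the BPE training idea: start from characters and repeatedly merge the most frequent adjacent symbol pair. It is illustrative only, not the implementation behind any of the tools above.

from collections import Counter

# Toy corpus: words as tuples of symbols with frequencies; "</w>" marks word end.
vocab = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
}

def merge_step(vocab):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    best = max(pairs, key=pairs.get)  # most frequent pair becomes one new symbol
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                out.append(word[i] + word[i + 1])  # apply the merge
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged, best

for _ in range(5):  # learn_bpe -s 32000 below performs 32000 such merges
    vocab, best = merge_step(vocab)
    print("merged:", best)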
Byte Pair Encoding (BPE) pipeline

  • Adding the bin directory to the PATH environment variable
TEXAR=$(pwd)  
export PATH=$PATH:$TEXAR/bin
  • Learning BPE vocab on source and target combined
cat train.src train.trg | learn_bpe -s 32000 > bpe-codes.32000
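If the learn_bpe wrapper is not on your PATH, the same step can be reproduced with Rico Sennrich's subword-nmt package (pip install subword-nmt), which the version in this directory is based on. A sketch, assuming subword_nmt's learn_bpe function:

from itertools import chain
from subword_nmt.learn_bpe import learn_bpe

# Learn 32000 BPE merge operations on source and target combined.
with open("train.src") as src, open("train.trg") as trg, \
        open("bpe-codes.32000", "w") as out:
    learn_bpe(chain(src, trg), out, num_symbols=32000)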
  • Applying BPE on source and target files
apply_bpe -c bpe-codes.32000 < train.src > train.src.bpe
apply_bpe -c bpe-codes.32000 < train.trg > train.trg.bpe
apply_bpe -c bpe-codes.32000 < dev.src > dev.src.bpe
apply_bpe -c bpe-codes.32000 < dev.trg > dev.trg.bpe
apply_bpe -c bpe-codes.32000 < test.src > test.src.bpe
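The apply step can likewise be driven from Python via subword-nmt; a sketch, where the codes file is the one produced by learn_bpe above:

from subword_nmt.apply_bpe import BPE

with open("bpe-codes.32000") as codes_file:
    bpe = BPE(codes_file)

# Segment each file line by line; process_line keeps the trailing newline.
for name in ["train.src", "train.trg", "dev.src", "dev.trg", "test.src"]:
    with open(name) as fin, open(name + ".bpe", "w") as fout:
        for line in fin:
            fout.write(bpe.process_line(line))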
  • BPE-decoding the model output to match the references
mv test.out test.out.bpe
cat test.out.bpe | sed -E 's/(@@ )|(@@ ?$)//g' > test.out
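The sed expression simply deletes the "@@ " continuation markers that apply_bpe inserts at subword boundaries. The same substitution in Python, for illustration:

import re

bpe_line = "the new@@ est mod@@ el"
print(re.sub(r"(@@ )|(@@ ?$)", "", bpe_line))  # prints: the newest model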
  • Evaluating with the Transformer example's BLEU tool
python $TEXAR/examples/transformer/bleu_tool.py --translation=test.out --reference=test.trg

Word Piece Model (WPM) pipeline

  • Installing the sentencepiece package
pip install sentencepiece
  • Learning Word Piece on source and target combined
spm_train --input=train.src,train.tgt --vocab_size=32000 --model_prefix=wpm-codes
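Equivalently, training can be driven from the sentencepiece Python API (a sketch; the flag string mirrors the CLI flags above):

import sentencepiece as spm

# Trains a 32000-piece model on source and target combined;
# writes wpm-codes.model and wpm-codes.vocab.
spm.SentencePieceTrainer.Train(
    "--input=train.src,train.tgt --model_prefix=wpm-codes --vocab_size=32000")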
  • Applying Word Piece on source and target
spm_encode --model wpm-codes.model --output_format=id < train.src > train.src.wpm
spm_encode --model wpm-codes.model --output_format=id < train.tgt > train.tgt.wpm
spm_encode --model wpm-codes.model --output_format=id < valid.src > valid.src.wpm
spm_encode --model wpm-codes.model --output_format=id < valid.tgt > valid.tgt.wpm
spm_encode --model wpm-codes.model --output_format=id < test.src > test.src.wpm
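A Python sketch of the same encoding step, using sentencepiece's SentencePieceProcessor (EncodeAsIds corresponds to --output_format=id):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("wpm-codes.model")

# Write one line of space-separated piece ids per input line.
for name in ["train.src", "train.tgt", "valid.src", "valid.tgt", "test.src"]:
    with open(name) as fin, open(name + ".wpm", "w") as fout:
        for line in fin:
            ids = sp.EncodeAsIds(line.rstrip("\n"))
            fout.write(" ".join(map(str, ids)) + "\n")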
  • WPM-decoding/detokenizing the model output to match the references
mv test.out test.out.wpm
spm_decode --model wpm-codes.model --input_format=id < test.out.wpm > test.out
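And the corresponding decode step in Python (DecodeIds is the id-input counterpart of spm_decode --input_format=id):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("wpm-codes.model")

# Convert each line of piece ids back to detokenized text.
with open("test.out.wpm") as fin, open("test.out", "w") as fout:
    for line in fin:
        ids = [int(tok) for tok in line.split()]
        fout.write(sp.DecodeIds(ids) + "\n")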