This directory contains several utilities for, e.g., data pre-processing.
Instructions for using BPE and WPM encoding are given below. See examples/transformer for a real example of using these encodings.
Note that there are a few different (sub-)word encoding approaches and implementations which are used by several popular models. For example:
- BPE by Rico Sennrich: Used in Transformer for machine translation. This is the version in this folder, including both BPE training and encoding/decoding.
- BPE by OpenAI: Used in GPT-2 language model. Includes BPE encoding/decoding and provided BPE vocab (no training).
- WordPiece by Google: Used in BERT for text embedding. Includes WordPiece encoding/decoding and a provided vocab (no training).
- SPM by Google's sentencepiece: A general subword tokenizer usable for, e.g., machine translation. This is the version used in the WPM pipeline below, including both SPM training and encoding/decoding.
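As a rough, made-up illustration of how the two flavors used below mark subword boundaries (actual splits depend on the learned vocab):
echo "the unbelievable truth"
# subword-nmt-style BPE marks word-internal splits with a trailing "@@":
#   the un@@ believ@@ able truth
# SentencePiece marks word starts with "▁" (or emits integer ids):
#   ▁the ▁un believ able ▁truth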
Byte Pair Encoding (BPE) pipeline
- Add the bin directory to the PATH environment variable
TEXAR=$(pwd)
export PATH=$PATH:$TEXAR/bin
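A quick sanity check that the scripts are now on PATH (both should resolve under $TEXAR/bin):
command -v learn_bpe apply_bpe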
- Learning BPE vocab on source and target combined
cat train.src train.trg | learn_bpe -s 32000 > bpe-codes.32000
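The codes file stores the learned merges, one pair of symbols per line in order of decreasing frequency (possibly after a version header); to take a quick look:
head -n 5 bpe-codes.32000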
- Applying BPE on source and target files (a loop form is given after these commands)
apply_bpe -c bpe-codes.32000 < train.src > train.src.bpe
apply_bpe -c bpe-codes.32000 < train.trg > train.trg.bpe
apply_bpe -c bpe-codes.32000 < dev.src > dev.src.bpe
apply_bpe -c bpe-codes.32000 < dev.trg > dev.trg.bpe
apply_bpe -c bpe-codes.32000 < test.src > test.src.bpe
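The repetitive invocations above can also be written as a loop (test.trg stays unencoded since it is only used as the reference):
for f in train.src train.trg dev.src dev.trg test.src; do
  apply_bpe -c bpe-codes.32000 < "$f" > "$f.bpe"
done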
- BPE decoding the target to match the references
mv test.out test.out.bpe
cat test.out.bpe | sed -E 's/(@@ )|(@@ ?$)//g' > test.out
python $TEXAR/examples/transformer/bleu_tool.py --translation=test.out --reference=test.trg
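The sed expression strips the @@ continuation markers that apply_bpe inserts; a quick check on a made-up line:
echo "the un@@ believ@@ able truth" | sed -E 's/(@@ )|(@@ ?$)//g'
# prints: the unbelievable truth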
Word Piece Encoding (WPM) pipeline
- This requires installing the sentencepiece library
pip install sentencepiece
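To verify that the library installed correctly:
python -c "import sentencepiece"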
- Learning Word Piece on source and target combined
spm_train --input=train.src,train.tgt --vocab_size=32000 --model_prefix=wpm-codes
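spm_train writes two files: wpm-codes.model (the binary model) and wpm-codes.vocab (a human-readable list, one token and its score per line). To peek at the vocab:
head -n 5 wpm-codes.vocab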
- Applying Word Piece on source and target (a round-trip check follows these commands)
spm_encode --model wpm-codes.model --output_format=id < train.src > train.src.wpm
spm_encode --model wpm-codes.model --output_format=id < train.tgt > train.tgt.wpm
spm_encode --model wpm-codes.model --output_format=id < valid.src > valid.src.wpm
spm_encode --model wpm-codes.model --output_format=id < valid.tgt > valid.tgt.wpm
spm_encode --model wpm-codes.model --output_format=id < test.src > test.src.wpm
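A round-trip sanity check: decoding the encoded ids should reproduce the input, up to the model's text normalization (NFKC by default), so small differences are possible:
spm_encode --model wpm-codes.model --output_format=id < test.src \
  | spm_decode --model wpm-codes.model --input_format=id \
  | diff - test.src && echo "round-trip OK"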
- WPM decoding/detokenizing the target to match the references
mv test.out test.out.wpm
spm_decode --model wpm-codes.model --input_format=id < test.out.wpm > test.out
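As in the BPE pipeline, the detokenized output can then be scored against the raw references:
python $TEXAR/examples/transformer/bleu_tool.py --translation=test.out --reference=test.tgt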