zs-nmt-dae

Official implementation of the EMNLP 2021 paper "Rethinking Zero-shot Neural Machine Translation: From a Perspective of Latent Variables".

Citation

Please cite our paper if you find this repository helpful in your research:

@article{wang2021rethinking,
  title={Rethinking Zero-shot Neural Machine Translation: From a Perspective of Latent Variables},
  author={Wang, Weizhi and Zhang, Zhirui and Du, Yichao and Chen, Boxing and Xie, Jun and Luo, Weihua},
  journal={arXiv preprint arXiv:2109.04705},
  year={2021}
}

Requirements and Installation

  • Python version == 3.6
  • PyTorch version == 1.5.0
  • numpy == 1.19.5
  • sacremoses == 0.0.43
  • sacrebleu == 1.5.1
  • jieba == 0.42.1
  • tqdm == 4.59.0
  • To install the revised fairseq 0.10.1:
git clone https://github.com/Victorwz/zs-nmt-dae.git;
cd zs-nmt-dae;
pip install --editable ./;
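
To verify that the editable install succeeded, a minimal sanity check (not part of the official instructions) is:

# The revised fairseq should import and report version 0.10.1,
# and the fairseq CLI entry points should be on PATH.
python -c "import fairseq; print(fairseq.__version__)"
fairseq-train --help | head -n 5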

Data Downloading

We conduct experiments on two multilingual corpora, MultiUN and Europarl.

To download MultiUN, please refer to its official website and scripts. The downloaded corpus should be placed in the folder ./data/MultiUN. Alternatively, you can use our script to download the corpus:

cd data;
bash download-multiun.sh
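
If the script is unavailable, the corpus can also be fetched by hand. The sketch below assumes the OPUS mirror of MultiUN and an English-French pair; both the URL and the pair are assumptions, so defer to download-multiun.sh and the official website:

# Hypothetical manual download of one MultiUN language pair from OPUS
# (URL pattern is an assumption; prefer download-multiun.sh).
mkdir -p MultiUN && cd MultiUN
wget https://object.pouta.csc.fi/OPUS-MultiUN/v1/moses/en-fr.txt.zip
unzip en-fr.txt.zip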

To download Europarl, please refer to its official website and scripts. The official validation and test sets of Europarl are WMT dev2006 and devtest2006. The download step is integrated into the Europarl data pre-processing script.
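
For reference, one Europarl pair can also be fetched manually from the official site. The sketch below assumes release v7 and the German-English pair; prepare-europarl.sh remains the authoritative source:

# Hypothetical manual download of the German-English Europarl v7 data
# (prepare-europarl.sh performs this step automatically).
wget https://www.statmt.org/europarl/v7/de-en.tgz
tar -xzf de-en.tgz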

Data Preprocessing

For preprocessing the MultiUN corpus, please run the following shell script:

cd data;
bash prepare-multiun.sh
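
The script itself is authoritative; as a rough orientation, its per-language tokenization step likely resembles the following sketch (file names and flags are assumptions):

# Hypothetical sacremoses tokenization of one parallel file
# (prepare-multiun.sh may differ in flags, truecasing, and BPE steps).
sacremoses -l en -j 4 tokenize < train.raw.en > train.tok.en
sacremoses -l fr -j 4 tokenize < train.raw.fr > train.tok.fr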

For downloading and preprocessing the Europarl corpus, please run the following shell script:

cd data;
bash prepare-europarl.sh

Binarizing and Training with FairSeq

For training the multilingual NMT model with the denoising autoencoder objective on MultiUN, please run the following shell script:

bash train_multiun_mnmt_dn.sh
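
The training script presumably handles binarization internally; for reference, a generic fairseq-preprocess call for one translation direction looks like the following, where all paths and language codes are placeholders:

# Generic fairseq 0.10.x binarization for one direction
# (paths, language codes, and worker count are illustrative only).
fairseq-preprocess \
    --source-lang en --target-lang fr \
    --trainpref data/MultiUN/train.bpe \
    --validpref data/MultiUN/valid.bpe \
    --testpref data/MultiUN/test.bpe \
    --joined-dictionary \
    --destdir data-bin/multiun \
    --workers 8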

For training the multilingual NMT model with the denoising autoencoder objective on Europarl, please run the following shell script:

bash train_europarl_mnmt_dn.sh

Decoding and Testing

For decoding and testing on MultiUN, you first need to train the Transformer model from scratch to obtain your own checkpoint, or you can use our checkpoint to reproduce the results reported in the paper. The checkpoint, dictionary, and BPE codes are available on Google Drive. Download the archive and unzip it to ./checkpoints/multiun_mnmt_denoising, then modify the model and dictionary paths in the testing script before running it.

For testing, please run the following shell script:

bash test_multiun_mnmt_dn.sh
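
Internally, the test script presumably wraps fairseq-generate followed by sacrebleu scoring; a generic sketch of that pair, with checkpoint path, beam size, and file names as assumptions:

# Hypothetical decoding and scoring for one direction
# (test_multiun_mnmt_dn.sh is the authoritative version).
fairseq-generate data-bin/multiun \
    --path checkpoints/multiun_mnmt_denoising/checkpoint_best.pt \
    --beam 5 --remove-bpe > gen.out
grep ^H gen.out | LC_ALL=C sort -V | cut -f3- > hyp.txt
sacrebleu ref.txt < hyp.txt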

We also find that averaging the last 5 checkpoints, starting from checkpoint_valid_bleu_best, may lead to better performance. You can uncomment the corresponding part of our scripts to try this trick. However, we did not use it for our method or for any baseline method in the paper.
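
fairseq ships a helper script for checkpoint averaging; a sketch of averaging the last 5 epoch checkpoints in a directory follows (the file layout is an assumption, and the reported results do not use this trick):

# Hypothetical checkpoint averaging with fairseq's bundled helper
# (directory and output names are placeholders).
python scripts/average_checkpoints.py \
    --inputs checkpoints/multiun_mnmt_denoising \
    --num-epoch-checkpoints 5 \
    --output checkpoints/multiun_mnmt_denoising/checkpoint_avg5.pt

Note that --num-epoch-checkpoints averages the last N epoch checkpoints; selecting exactly the 5 checkpoints from checkpoint_valid_bleu_best onward may require listing the files explicitly via --inputs.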

Credit

Our project is developed based on the FairSeq toolkit.
