GitHub - vaibhav4595/Seq2Seq_MT: MT implementation for Seq2Seq_Assignment1

In this project, you are going to implement a neural machine translation model, trained and tested on the IWSLT 2014 data set. To help you start, we have prepared some template (pseudo-) code in this repo. Note that you are not required to use this template code, and it may not be the best implementation, however you may find this a good reference.

File Structure

nmt.py: contains the neural machine translation model and training/testing code.
vocab.py: a script that extracts vocabulary from training data
util.py: contains utility/helper functions

Dataset

The IWSLT 2014 dataset has 150K German-English training sentences. The data/ folder contains a copy of the public release of the dataset. Files with suffix *.wmixerprep are pre-processed versions of the dataset from Ranzato et al., 2015, with long sentences chopped and rared words replaced by a special <unk> token. You could use the pre-processed training files for training/developing (or come up with your own pre-processing strategy), but for testing you have to use the original version of testing files, ie., test.de-en.(de|en).

Environment

The (pseudo-) template code is written in Python 3.6 using some supporting third-party libraries. We provided a conda environment to install Python 3.6 with required libraries. Simply run

conda env create -f environment.yml

Usage

First, we extract a vocabulary file from the training data using the command:

python vocab.py --train-src=data/train.de-en.de.wmixerprep --train-tgt=data/train.de-en.en.wmixerprep data/vocab.bin

This generates a vocabulary file data/vocab.bin. The script also has options to control the cutoff frequency and the size of generated vocabulary, which you may play with.

For training and decoding/testing, you may refer to data/train.sh. Note that in the training script we set the values of some hyper parameters. They are not guaranteed to be the best hyper-parameters, and you are free to play with them. After training and decoding, we call the official evaluation script multi-bleu.perl to compute the corpus-level BLEU score of the decoding results against the gold-standard.

Name		Name	Last commit message	Last commit date
Latest commit History 82 Commits
analysis_results		analysis_results
data		data
decoded_text_best		decoded_text_best
scripts		scripts
README.md		README.md
environment.yml		environment.yml
model.py		model.py
multi-bleu.perl		multi-bleu.perl
nmt.py		nmt.py
requirements.txt		requirements.txt
utils.py		utils.py
vocab.py		vocab.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

File Structure

Dataset

Environment

Usage

About

Releases

Packages

Contributors 3

Languages

vaibhav4595/Seq2Seq_MT

Folders and files

Latest commit

History

Repository files navigation

File Structure

Dataset

Environment

Usage

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages