Commit dcfd590

Initial commit
1 parent 0a2039c commit dcfd590

File tree

14 files changed (+1898, -2 lines)


.gitignore

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+data/
+output/
+
+*~
+.DS_Store
+__pycache__/
+*.py[cod]
+*$py.class
+.idea/

LICENSE.txt

Lines changed: 674 additions & 0 deletions
Large diffs are not rendered by default.

README.md

Lines changed: 71 additions & 2 deletions
@@ -1,2 +1,71 @@
-# UNdreaMT (Unsupervised Neural Machine Translation)
-Coming soon ;)
+UNdreaMT: Unsupervised Neural Machine Translation
+==============
+
+This is an open source implementation of our unsupervised neural machine translation system, described in the following paper:
+
+Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018. **[Unsupervised Neural Machine Translation](https://arxiv.org/pdf/1710.11041.pdf)**. In *Proceedings of the Sixth International Conference on Learning Representations (ICLR 2018)*.
+
+If you use this software for academic research, please cite the paper in question:
+```
+@inproceedings{artetxe2018iclr,
+  author = {Artetxe, Mikel and Labaka, Gorka and Agirre, Eneko and Cho, Kyunghyun},
+  title = {Unsupervised neural machine translation},
+  booktitle = {Proceedings of the Sixth International Conference on Learning Representations},
+  month = {April},
+  year = {2018}
+}
+```
+
+
+Requirements
+--------
+- Python 3
+- PyTorch (tested with v0.3)
+
+
+Usage
+--------
+
+The following command trains an unsupervised NMT system from monolingual corpora using the exact same settings described in the paper:
+
+```
+python3 train.py --src SRC.MONO.TXT --trg TRG.MONO.TXT --src_embeddings SRC.EMB.TXT --trg_embeddings TRG.EMB.TXT --save MODEL_PREFIX --cuda
+```
+
+The data in the above command should be provided as follows:
+- `SRC.MONO.TXT` and `TRG.MONO.TXT` are the source and target language monolingual corpora. They should both be pre-processed so that atomic symbols (either tokens or BPE units) are separated by whitespace. For that purpose, we recommend using [Moses](http://www.statmt.org/moses/) to tokenize and truecase the corpora and, optionally, [Subword-NMT](https://github.com/rsennrich/subword-nmt) if you want to use BPE.
+- `SRC.EMB.TXT` and `TRG.EMB.TXT` are the source and target language cross-lingual embeddings. In order to obtain them, we recommend training monolingual embeddings on the corpora above using either [word2vec](https://github.com/tmikolov/word2vec) or [fasttext](https://github.com/facebookresearch/fastText), and then mapping them to a shared space using [VecMap](https://github.com/artetxem/vecmap). Please make sure to cut off the vocabulary as desired before mapping the embeddings.
+- `MODEL_PREFIX` is the prefix of the output model.
+
+Using the above settings, training takes about 3 days on a single Titan Xp. Once training is done, you can use the resulting model for translation as follows:
+
+```
+python3 translate.py MODEL_PREFIX.final.src2trg.pth < INPUT.TXT > OUTPUT.TXT
+```
+
+For more details and additional options, run the above scripts with the `--help` flag.
+
+
+FAQ
+--------
+
+###### You claim that your unsupervised NMT system is trained on monolingual corpora alone, but it also requires bilingual embeddings... Isn't that cheating?
+
+Not really, because we also learn the bilingual embeddings from monolingual corpora alone. We use our companion tool [VecMap](https://github.com/artetxem/vecmap) for that.
+
+
+###### Can I use this software to train a regular NMT system on parallel corpora?
+
+Yes! You can use the following arguments to make UNdreaMT behave like a regular NMT system:
+
+```
+python3 train.py --src2trg SRC.PARALLEL.TXT TRG.PARALLEL.TXT --src_vocabulary SRC.VOCAB.TXT --trg_vocabulary TRG.VOCAB.TXT --embedding_size 300 --learn_encoder_embeddings --disable_denoising --save MODEL_PREFIX --cuda
+```
+
+
+License
+-------
+
+Copyright (C) 2018, Mikel Artetxe
+
+Licensed under the terms of the GNU General Public License, either version 3 or (at your option) any later version. A full copy of the license can be found in LICENSE.txt.

train.py

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+# Copyright (C) 2018 Mikel Artetxe <[email protected]>
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+import undreamt.train
+
+
+if __name__ == '__main__':
+    undreamt.train.main_train()
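
The entry point simply delegates to `undreamt.train.main_train()`. As a rough, assumption-laden sketch (assuming `main_train()` parses its options from `sys.argv` via argparse, which the `--help` flag mentioned in the README suggests), training could also be launched programmatically like this; all paths and flag values mirror the README command and are placeholders:

```
import sys
import undreamt.train

# Sketch only: main_train() is assumed to read its arguments from sys.argv,
# so we set sys.argv before calling it (all paths below are placeholders).
sys.argv = ['train.py',
            '--src', 'SRC.MONO.TXT', '--trg', 'TRG.MONO.TXT',
            '--src_embeddings', 'SRC.EMB.TXT', '--trg_embeddings', 'TRG.EMB.TXT',
            '--save', 'MODEL_PREFIX', '--cuda']
undreamt.train.main_train()
```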

translate.py

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
+# Copyright (C) 2018 Mikel Artetxe <[email protected]>
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+import argparse
+import sys
+import torch
+
+
+def main():
+    # Parse command line arguments
+    parser = argparse.ArgumentParser(description='Translate using a pre-trained model')
+    parser.add_argument('model', help='a model previously trained with train.py')
+    parser.add_argument('--batch_size', type=int, default=50, help='the batch size (defaults to 50)')
+    parser.add_argument('--beam_size', type=int, default=12, help='the beam size (defaults to 12, 0 for greedy search)')
+    parser.add_argument('--encoding', default='utf-8', help='the character encoding for input/output (defaults to utf-8)')
+    parser.add_argument('-i', '--input', default=sys.stdin.fileno(), help='the input file (defaults to stdin)')
+    parser.add_argument('-o', '--output', default=sys.stdout.fileno(), help='the output file (defaults to stdout)')
+    args = parser.parse_args()
+
+    # Load model
+    translator = torch.load(args.model)
+
+    # Translate sentences
+    end = False
+    fin = open(args.input, encoding=args.encoding, errors='surrogateescape')
+    fout = open(args.output, mode='w', encoding=args.encoding, errors='surrogateescape')
+    while not end:
+        batch = []
+        while len(batch) < args.batch_size and not end:
+            line = fin.readline()
+            if not line:
+                end = True
+            else:
+                batch.append(line)
+        if args.beam_size <= 0 and len(batch) > 0:
+            for translation in translator.greedy(batch, train=False):
+                print(translation, file=fout)
+        elif len(batch) > 0:
+            for translation in translator.beam_search(batch, train=False, beam_size=args.beam_size):
+                print(translation, file=fout)
+        fout.flush()
+    fin.close()
+    fout.close()
+
+
+if __name__ == '__main__':
+    main()
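
For reference, a minimal sketch of driving the loaded translator object directly from Python, based only on the calls that appear in the script above; the model path and the sentences are placeholders, and the input is assumed to be tokenized the same way as the training data:

```
import torch

# Load a model produced by train.py (placeholder path) and translate a small
# batch of pre-tokenized sentences with the same beam_search call used above.
translator = torch.load('MODEL_PREFIX.final.src2trg.pth')
sentences = ['this is a tokenized sentence .', 'another example .']
for translation in translator.beam_search(sentences, train=False, beam_size=12):
    print(translation)
```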

undreamt/__init__.py

Whitespace-only changes.

undreamt/attention.py

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
+# Copyright (C) 2018 Mikel Artetxe <[email protected]>
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+import torch.nn as nn
+
+
+class GlobalAttention(nn.Module):
+    def __init__(self, dim, alignment_function='general'):
+        super(GlobalAttention, self).__init__()
+        self.alignment_function = alignment_function
+        if self.alignment_function == 'general':
+            self.linear_align = nn.Linear(dim, dim, bias=False)
+        elif self.alignment_function != 'dot':
+            raise ValueError('Invalid alignment function: {0}'.format(alignment_function))
+        self.softmax = nn.Softmax(dim=1)
+        self.linear_context = nn.Linear(dim, dim, bias=False)
+        self.linear_query = nn.Linear(dim, dim, bias=False)
+        self.tanh = nn.Tanh()
+
+    def forward(self, query, context, mask):
+        # query: batch*dim
+        # context: length*batch*dim
+        # ans: batch*dim
+
+        context_t = context.transpose(0, 1)  # batch*length*dim
+
+        # Compute alignment scores
+        q = query if self.alignment_function == 'dot' else self.linear_align(query)
+        align = context_t.bmm(q.unsqueeze(2)).squeeze(2)  # batch*length
+
+        # Mask alignment scores
+        if mask is not None:
+            align.data.masked_fill_(mask, -float('inf'))
+
+        # Compute attention weights from alignment scores
+        attention = self.softmax(align)  # batch*length
+
+        # Compute weighted context
+        weighted_context = attention.unsqueeze(1).bmm(context_t).squeeze(1)  # batch*dim
+
+        # Combine context and query
+        return self.tanh(self.linear_context(weighted_context) + self.linear_query(query))
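
As an illustration of the interface above, here is a minimal sketch that runs the module on random tensors; the sizes are made up for the example, and with the PyTorch v0.3 that the README lists as tested, the tensors would additionally need to be wrapped in `torch.autograd.Variable`:

```
import torch
from undreamt.attention import GlobalAttention

# Illustrative sizes only: 2 queries attending over a context of length 5,
# with hidden dimension 8; mask=None means no positions are masked out.
attention = GlobalAttention(dim=8, alignment_function='general')
query = torch.randn(2, 8)       # batch*dim
context = torch.randn(5, 2, 8)  # length*batch*dim
output = attention(query, context, mask=None)
print(output.size())            # (2, 8), i.e. batch*dim
```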
