# The Latent Bag of Words Model

Implementation of Yao Fu, Yansong Feng, and John Cunningham, _Paraphrase Generation with Latent Bag of Words_, NeurIPS 2019. [paper](https://github.com/FranxYao/dgm_latent_bow/doc/latent_bow_camera_ready.pdf)

<img src="etc/sample_sentences.png" alt="example" title="Example" width="800" />

For more background about deep generative models for natural language processing, see the [DGM4NLP](https://github.com/FranxYao/Deep-Generative-Models-for-Natural-Language-Processing) journal list.

## Results - Quora

Models | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | Rouge-1 | Rouge-2 | Rouge-L
------ | ------ | ------ | ------ | ------ | ------- | ------- | -------
seq2seq | 51.34 | 36.88 | 28.08 | 22.27 | 52.66 | 29.17 | 50.29
seq2seq-attn | 53.24 | 38.79 | 29.56 | 23.34 | 54.71 | 30.68 | 52.29
beta-vae, beta = 1e-3 | 43.02 | 28.60 | 20.98 | 16.29 | 41.81 | 21.17 | 40.09
beta-vae, beta = 1e-4 | 47.86 | 33.21 | 24.96 | 19.73 | 47.62 | 25.49 | 45.46
bow-hard | 33.40 | 21.18 | 14.43 | 10.36 | 36.08 | 16.23 | 33.77
latent-bow-topk | 54.93 | 41.19 | 31.98 | 25.57 | 58.05 | 33.95 | 55.74
latent-bow-gumbel | 54.82 | 40.96 | 31.74 | 25.33 | 57.75 | 33.67 | 55.46
cheating-bow | 72.96 | 61.78 | 54.40 | 49.47 | 72.15 | 52.61 | 68.53

Note: strictly, we should call this a cross-aligned VAE.
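
The BLEU-n and Rouge columns are standard n-gram overlap scores against the reference paraphrase, reported as percentages. The exact evaluation script and tokenization behind these numbers are not stated here, so the following is only a minimal corpus-level BLEU sketch with NLTK, on made-up tokenized sentences:

```python
# Minimal corpus-BLEU sketch with NLTK (assumed scorer; may not match the
# script that produced the numbers above).
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One list of reference token lists per hypothesis.
refs = [[["a", "kitchen", "with", "a", "stove", "and", "a", "refrigerator"]]]
hyps = [["a", "kitchen", "with", "a", "stove", "oven", "and", "a", "fridge"]]

smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # uniform weights up to n-grams
    score = corpus_bleu(refs, hyps, weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {100 * score:.2f}")
```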

## Results - MSCOCO

Models | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | Rouge-1 | Rouge-2 | Rouge-L
------ | ------ | ------ | ------ | ------ | ------- | ------- | -------
seq2seq | 69.61 | 47.14 | 31.64 | 21.65 | 40.11 | 14.31 | 36.28
seq2seq-attn | 71.24 | 49.65 | 34.04 | 23.66 | 41.07 | 15.26 | 37.35
beta-vae, beta = 1e-3 | 68.81 | 45.82 | 30.56 | 20.99 | 39.63 | 13.86 | 35.81
beta-vae, beta = 1e-4 | 70.04 | 47.59 | 32.29 | 22.54 | 40.72 | 14.75 | 36.75
bow-hard | 48.14 | 28.35 | 16.25 | 9.28 | 31.66 | 8.30 | 27.37
latent-bow-topk | 72.60 | 51.14 | 35.66 | 25.27 | 42.08 | 16.13 | 38.16
latent-bow-gumbel | 72.37 | 50.81 | 35.32 | 24.98 | 42.12 | 16.05 | 38.13
cheating-bow | 80.87 | 65.38 | 51.72 | 41.48 | 45.54 | 20.57 | 40.97

## Results - MSCOCO - Detailed

Models | PPL | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4
------ | --- | ------ | ------ | ------ | ------
seq2seq | 4.36 | 69.61 | 47.14 | 31.64 | 21.65
seq2seq-attn | 4.88 | 71.24 | 49.65 | 34.04 | 23.66
beta-vae, beta = 1e-3 | 3.94 | 68.81 | 45.82 | 30.56 | 20.99
beta-vae, beta = 1e-4 | 4.12 | 70.04 | 47.59 | 32.29 | 22.54
bow-hard | 19.13 | 48.14 | 28.35 | 16.25 | 9.28
latent-bow-topk | 4.75 | 72.60 | 51.14 | 35.66 | 25.27
latent-bow-gumbel | 4.69 | 72.37 | 50.81 | 35.32 | 24.98
cheating-bow | 15.65 | 80.87 | 65.38 | 51.72 | 41.48
latent-bow-memory-only | - | - | - | - | -
seq2seq-attn top2 sampling | - | - | - | - | -
bow-seq2seq, enc baseline | - | 63.39 | 40.31 | 24.40 | 14.76
bow-seq2seq, ref baseline | - | 76.09 | 49.90 | 31.79 | 20.41
bow, predict all para bow | - | 64.44 | 41.26 | 25.90 | 16.47
bow, predict all para bow exclude self bow | - | - | - | - | -
hierarchical vae | - | - | - | - | -
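
PPL in the first column is presumably the decoder's perplexity, i.e. the exponential of the average per-token negative log-likelihood; a minimal sketch of that conversion (whether the average is taken per token or per sentence here is an assumption):

```python
import math

def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log base)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

print(perplexity([1.2, 0.8, 1.5, 2.0]))  # exp(1.375) ≈ 3.96
```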

Models | Rouge-1 | Rouge-2 | Rouge-L
------ | ------- | ------- | -------
seq2seq | 40.11 | 14.31 | 36.28
seq2seq-attn | 41.07 | 15.26 | 37.35
beta-vae, beta = 1e-3 | 39.63 | 13.86 | 35.81
beta-vae, beta = 1e-4 | 40.72 | 14.75 | 36.75
bow-hard | 31.66 | 8.30 | 27.37
latent-bow-topk | 42.08 | 16.13 | 38.16
latent-bow-gumbel | 42.12 | 16.05 | 38.13
cheating-bow | 45.54 | 20.57 | 40.97
seq2seq-attn top2 sampling | - | - | -
latent-bow-memory-only | - | - | -

Models | Dist-1 | Dist-2 | Dist-3
------ | ------ | ------ | ------
seq2seq | 689 | 3343 | 7400
seq2seq-attn | 943 | 4867 | 11494
beta-vae, beta = 1e-3 | 737 | 3367 | 6923
beta-vae, beta = 1e-4 | 1090 | 5284 | 11216
bow-hard | 2100 | 24505 | 71293
latent-bow-topk | 1407 | 7496 | 17062
latent-bow-gumbel | 1433 | 7563 | 17289
cheating-bow | 2399 | 26963 | 70128
seq2seq-attn top2 sampling | - | - | -
latent-bow-memory-only | - | - | -
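
Dist-1/2/3 appear to be raw counts of distinct unigrams, bigrams, and trigrams over all generated outputs (counts rather than ratios, judging from the magnitudes); a small sketch under that assumption, with a hypothetical helper `distinct_ngrams`:

```python
from typing import Iterable, List

def distinct_ngrams(sentences: Iterable[List[str]], n: int) -> int:
    """Count distinct n-grams across a corpus of tokenized sentences."""
    grams = set()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            grams.add(tuple(tokens[i:i + n]))
    return len(grams)

outputs = [["a", "man", "is", "holding", "a", "baby"],
           ["a", "man", "is", "sitting", "on", "a", "bench"]]
print(distinct_ngrams(outputs, 1), distinct_ngrams(outputs, 2))  # 8 9
```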

Models | IN-BLEU-1 | IN-BLEU-2 | IN-BLEU-3 | IN-BLEU-4 | Jaccard Dist
------ | --------- | --------- | --------- | --------- | ------------
seq2seq | 46.01 | 28.17 | 18.41 | 12.76 | 33.74
seq2seq-attn | 49.28 | 32.23 | 22.19 | 16.06 | 37.60
beta-vae, beta = 1e-3 | 44.92 | 26.82 | 17.34 | 12.02 | 32.41
beta-vae, beta = 1e-4 | 46.97 | 29.07 | 19.33 | 13.68 | 34.42
bow-hard | 27.62 | 14.31 | 7.59 | 4.06 | 21.08
latent-bow-topk | 51.22 | 34.36 | 24.31 | 18.04 | 39.25
latent-bow-gumbel | - | - | - | - | -
cheating-bow | 34.95 | 18.98 | 10.79 | 6.41 | 24.85
seq2seq-attn top2 sampling | - | - | - | - | -
latent-bow-memory-only | - | - | - | - | -
bow-seq2seq, enc baseline | 41.40 | 25.31 | 15.78 | 10.13 | -
bow-seq2seq, ref baseline | 29.56 | 13.95 | 7.11 | 3.83 | -
bow, predict all para bow | 49.07 | 31.17 | 20.55 | 14.18 | -
bow, predict all para bow exclude self bow | - | - | - | - | -
hierarchical vae | - | - | - | - | -
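
IN-BLEU presumably measures n-gram overlap between the generated paraphrase and the *input* sentence (so lower means more novel wording), and the Jaccard column presumably compares the two token sets. A minimal Jaccard-similarity sketch under that assumption (whether the table reports similarity or distance, and with what scaling, is not stated):

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity of two token sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

src = "a man sitting on a bench reading a piece of paper".split()
gen = "a man is sitting on a bench in front of a building".split()
print(round(jaccard(src, gen), 3))  # 0.462
```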

Sentence samples - seq2seq-attn
* I: Five slices of bread are placed on a surface .
* O: A bunch of food that is sitting on a plate .
* I: A wooden floor inside of a kitchen next to a stove top oven .
* O: A kitchen with a stove , oven , and a refrigerator .
* I: Four horses pull a carriage carrying people in a parade .
* O: A group of people riding horses down a street .

Random Walk samples - seq2seq-attn
* I: A man sitting on a bench reading a piece of paper
* -> A man is sitting on a bench in front of a building
* -> A man is standing in the middle of a park bench
* -> A man is holding a baby in the park
* -> A man is holding a baby in a park
* -> A man is holding a baby in a park
* I: A water buffalo grazes on tall grass while an egret stands by
* -> A large bison standing in a grassy field
* -> A large buffalo standing in a field with a large green grass
* -> A bison with a green grass covered in green grass
* -> A large bison grazing in a field with a green grass covered field
* -> A large bison grazing in a field with a large tree in the background
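
The random-walk samples above are produced by repeatedly feeding the model's previous output back in as the next input; a minimal sketch of that loop, where `model.paraphrase` is a hypothetical single-sentence generation call:

```python
def random_walk(model, sentence, steps=5):
    """Iteratively paraphrase, feeding each output back in as the next input."""
    trajectory = [sentence]
    for _ in range(steps):
        sentence = model.paraphrase(sentence)  # hypothetical API
        trajectory.append(sentence)
    return trajectory
```

As the examples above suggest, the walk often settles into a fixed point after a few steps.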

## Project Vision

* "Use probabilistic models where we have inductive bias; use flexible function approximators where we do not."
* This project aims to explore effective generative modeling techniques for natural language generation.

* Two paths
  1. Improving text generation diversity by injecting randomness (or by other means)
     * Existing text generation models tend to produce repeated and dull expressions from a few fixed learned modes.
     * E.g. "I do not know" as the answer to any question in a question answering system.
     * With MLE training, models usually converge to a local maximum dominated by the most frequent patterns, thus losing text variety.
     * We aim to promote text diversity by injecting randomness.
     * \# NOTE: many existing works do this with adversarial regularization (Xu et al., Zhang et al.), but I want to utilize the randomness of a VAE. This idea is not so mainstream, so I should do some preliminary verification.
     * \# NOTE: I have had this idea since last year but have not seen any work about it, so if the preliminary experiments do not work I may switch back to the existing line.
  2. Language generation guided by global semantics
     * Many recent works incorporate global semantic signals (e.g. topics) into sentence generation systems with latent variable models.
     * These models exhibit many advantages, such as better generation quality (though it can also be worse, honestly), controllable generation (which has been desirable for decades), and improved interpretability (though sometimes at the cost of quality).
     * This work explores new methods for utilizing global semantic signals with latent variable models to improve downstream generation quality, e.g. language variety.
     * \# NOTE: These two topics are the most compelling in my mind, but I cannot decide which one is more practical at this time (Feb 06 2019). Will do a survey this week and decide next week.

* Methods (tentative):
  * Every time one wants to say something, one has certain _concepts_ in mind, e.g. "lunch .. burger .. good".
  * At this stage, this _concept_ is not a sentence yet; it is an idea in the mind that has not been said.
  * There are many ways to say this sentence; the resulting sentences differ from each other to some extent, but they all convey the same meaning. They are _paraphrases_ of each other.
  * We can think of the different sentence realizations of this _concept_ as different samples from the same distribution.
  * Because of stochasticity, each sample differs from the others, which is to say, **stochasticity induces language diversity**.
  * Our idea is to use stochasticity to model language diversity.
  * We model one _concept_ as a Gaussian.
  * We model the different _realizations_ of this concept as a Gaussian mixture, where each component shares the _concept_ Gaussian as its prior.
  * Given a sentence, we recover the Gaussian mixture, then use different samples from the mixture to get different paraphrases of that sentence. This requires reparameterizing through a Gaussian mixture, see (Graves 16); a rough sketch of this kind of sampling is given after this list.
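
A minimal sketch of one common way to draw reparameterized-style samples from a Gaussian mixture: relax the discrete component choice with Gumbel-softmax and reparameterize each component Gaussian. This only illustrates the idea, it is not the exact estimator of Graves (2016) or of this repo, and all names and shapes (`mixture_sample`, `logits`, `mu`, `log_sigma`, `tau`) are hypothetical.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0):
    """Relaxed one-hot component weights: softmax((logits + Gumbel noise) / tau)."""
    g = -np.log(-np.log(np.random.uniform(size=logits.shape)))
    y = (logits + g) / tau
    y = y - y.max()                       # numerical stability
    e = np.exp(y)
    return e / e.sum()

def mixture_sample(logits, mu, log_sigma, tau=1.0):
    """Soft sample from a K-component diagonal Gaussian mixture.

    logits:    [K]    unnormalized mixture weights
    mu:        [K, D] component means
    log_sigma: [K, D] component log standard deviations
    """
    w = gumbel_softmax(logits, tau)       # soft component choice, shape [K]
    eps = np.random.normal(size=mu.shape)
    z_k = mu + np.exp(log_sigma) * eps    # one reparameterized draw per component
    return w @ z_k                        # convex combination, shape [D]

# Example: 3 components over a 4-dimensional latent space.
z = mixture_sample(np.zeros(3), np.random.randn(3, 4), np.full((3, 4), -1.0))
print(z.shape)  # (4,)
```

As `tau` approaches 0 the component choice becomes nearly one-hot; a larger `tau` keeps the relaxation smooth, which matters if gradients have to flow through the sample during training.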

* Assumptions
  * A single Gaussian cannot model the stochasticity because of posterior collapse -- TO BE VERIFIED (but I think I have done this before, not 100% sure).

* Goal
  * Effectiveness: we can actually generate paraphrases
  * surface difference: lower BLEU between different paraphrases
  * semantic similarity: use a classifier to give a similarity score

* Vision
  * upper bound: build new effective models (for one focused application).
  * upper bound: investigate existing methods and gain a deeper understanding (thus giving a position paper).
  * lower bound: test existing state-of-the-art models and analyze their pros and cons.
  * lower bound: continuous trial and error, getting to know the many ways that do not work.

* Related Works
  * Text Generation Models (with a particular sentence quality objective)
  * Sentence Variational Autoencoders
  * Adversarial Regularization for Text Generation

## Code structure
* AdaBound.py
* config.py
* controller.py
* data_utils.py
* hierarchical_vae.py
* lm.py
* main.py
* seq2seq.py
* similarity.py
* vae.py