[TOC]
## Word representation
Until now we have represented words with one-hot encodings, for example as indices into the dictionary.
But this representation has a weakness: the algorithm can't learn relationships between words.
So instead of one-hot encoding, we can use a feature representation:
Features | Man (5391) | Woman (9853) | King (4914) | Queen (7157) | Apple (466) | Orange (6257) |
---|---|---|---|---|---|---|
Gender | -1 | 1 | -0.95 | 0.97 | 0 | 0.01 |
Royal | 0.01 | 0.02 | 0.93 | 0.95 | -0.01 | 0 |
Age | 0.03 | 0.02 | 0.7 | 0.69 | 0.03 | 0.02 |
Food | 0.01 | 0.01 | 0.02 | 0.01 | 0.95 | 0.95 |
... | ... | ... | ... | ... | ... | ... |

Each column is the embedding vector of that word, e.g. $e_{5391}$ for man and $e_{9853}$ for woman.
For example, each word will have 300 features:
- Gender: -1 if man, 1 if woman, so apple and orange are close to 0.
- Food: close to 1 if the word is a food, so man, woman, king and queen are close to 0 (they are not foods).

Then apple and orange share more similarities.
We call this representation word embeddings.
To visualize word embeddings we use the t-SNE algorithm to reduce the features to 2 dimensions: the more similar two words are, the closer they appear in the figure.
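A minimal sketch of this kind of visualization with scikit-learn's t-SNE; the word list and the random `E` matrix below are placeholders for real trained embeddings.

```python
import numpy as np
from sklearn.manifold import TSNE

# Illustrative data: 6 words with 300-dimensional embeddings (random here, trained in practice)
words = ["man", "woman", "king", "queen", "apple", "orange"]
E = np.random.randn(len(words), 300)

# Reduce to 2 dimensions; with real embeddings, similar words end up close together
coords = TSNE(n_components=2, perplexity=3, init="random").fit_transform(E)
for word, (x, y) in zip(words, coords):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```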
Example: named entity recognition.
- "Sally Johnson is an orange farmer."
- "Robert Lin is an apple farmer." It will probably be easy to recognize the entity here thanks to the previous sentence.
- "Robert Lin is a durian cultivator." Here we face unusual words, like durian or cultivator, which may not be in our dictionary; with one-hot encoding it would be difficult to recognize the entity. However, if we have a trained word embedding, it will tell us that durian is a fruit like orange and cultivator is similar to farmer, so the model can still recognize that Robert Lin is a person.
The effect is similar to transfer learning.
Transfer learning and word embeddings:
- Learn word embeddings from a large text corpus, or download pre-trained embeddings online.
- Transfer the embeddings to a new task with a smaller training set.
- Optional: continue to fine-tune the word embeddings with the new data.
It also can help with analogy reasoning.
In the previous table of man, woman, king and queen:
$$
e_{man} - e_{woman} = \begin{bmatrix} -1 \\ 0.01 \\ 0.03 \\ 0.09 \end{bmatrix} - \begin{bmatrix} 1 \\ 0.02 \\ 0.02 \\ 0.01 \end{bmatrix} \approx \begin{bmatrix} -2 \\ 0 \\ 0 \\ 0 \end{bmatrix}
$$
The main difference between man and woman is gender.
Similarly, for king and queen the main difference is also gender:
$$
e_{king} - e_{queen} = \begin{bmatrix} -0.95 \\ 0.93 \\ 0.70 \\ 0.02 \end{bmatrix} - \begin{bmatrix} 0.97 \\ 0.95 \\ 0.69 \\ 0.01 \end{bmatrix} \approx \begin{bmatrix} -2 \\ 0 \\ 0 \\ 0 \end{bmatrix}
$$
If we know that man is analogous to woman, what corresponds to king?
What we want is to find the word $w$ that maximizes the similarity: $\arg\max_w \; sim(e_w, e_{king}-e_{man}+e_{woman})$
The most used similarity function is cosine similarity.
According to the Euclidean dot product formula:
$$
A \cdot B =\left| A \right|\left| B \right|\cos \theta
$$
Then
$$
sim(u, v) = \frac{u^T v}{|| u ||_2 || v ||_2}
$$
where $||u||_2$ is the Euclidean (L2) norm of $u$.
The smaller the angle between the two vectors, the more similar they are.
You can also use Euclidean distance, but it measures dissimilarity rather than similarity.
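A small numpy sketch of cosine similarity and the analogy search; `embeddings` is an assumed dictionary mapping words to their embedding vectors.

```python
import numpy as np

def cosine_similarity(u, v):
    # sim(u, v) = u . v / (||u||_2 * ||v||_2)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def complete_analogy(word_a, word_b, word_c, embeddings):
    # "a is to b as c is to ?": find w maximizing sim(e_w, e_c - e_a + e_b)
    target = embeddings[word_c] - embeddings[word_a] + embeddings[word_b]
    best_word, best_sim = None, -2.0
    for w, e_w in embeddings.items():
        if w in (word_a, word_b, word_c):
            continue
        s = cosine_similarity(e_w, target)
        if s > best_sim:
            best_word, best_sim = w, s
    return best_word

# complete_analogy("man", "woman", "king", embeddings)  # should return "queen"
```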
Suppose that we have a 10000-word dictionary and we use 300 features; then we can construct a 300 x 10000 embedding matrix $E$.
Multiplying the embedding matrix by the original one-hot encoded vector gives the embedding vector of that word.
Our goal is to learn an embedding matrix E.
In practice, we use a specialized lookup function to get an embedding. For example, Keras has an `Embedding` layer.
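A sketch of why the lookup works: multiplying $E$ by a one-hot vector just selects one column, so in practice we index that column directly. The shapes assume the 10000-word vocabulary and 300 features above; the random $E$ stands in for a trained matrix.

```python
import numpy as np

vocab_size, n_features = 10000, 300
E = np.random.randn(n_features, vocab_size)   # embedding matrix, shape (300, 10000)

j = 6257                       # e.g. the index of "orange"
o_j = np.zeros(vocab_size)     # one-hot vector for word j
o_j[j] = 1.0

e_j = E @ o_j                  # embedding vector, but costs 300 * 10000 multiplications
e_j_lookup = E[:, j]           # the same vector, obtained by a cheap column lookup
assert np.allclose(e_j, e_j_lookup)
# In Keras this lookup is what tf.keras.layers.Embedding(10000, 300) does.
```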
For example, we want to predict the next word given the sentence: "I want a glass of orange ___".
Each context word $j$ is represented by its one-hot vector $O_j$ and mapped through the embedding matrix $E$ to its embedding vector $e_j$.
- Once we have the feature vectors, we feed them into a neural network and then pass the result through a softmax layer to generate a prediction.
- The hidden layer has parameters $W^{[1]}, b^{[1]}$.
- The softmax layer has its own parameters $W^{[2]}, b^{[2]}$.
Normally we use a fixed window size, for example the last 4 words, to predict the next one.
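A minimal Keras sketch of this fixed-window neural language model (4 previous words, 10000-word vocabulary, 300-dimensional embeddings); the hidden layer size of 128 is an arbitrary choice for illustration.

```python
import tensorflow as tf

vocab_size, embed_dim, window = 10000, 300, 4

model = tf.keras.Sequential([
    tf.keras.Input(shape=(window,), dtype="int32"),           # indices of the last 4 words
    tf.keras.layers.Embedding(vocab_size, embed_dim),         # E: looks up e_j for each index
    tf.keras.layers.Flatten(),                                 # stack the 4 embeddings (1200 values)
    tf.keras.layers.Dense(128, activation="relu"),             # hidden layer: W[1], b[1]
    tf.keras.layers.Dense(vocab_size, activation="softmax"),   # softmax layer: W[2], b[2]
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(context_word_indices, next_word_indices, ...)
```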
Other context/target pairs:
- Last 4 words
- 2 words left and right
- Last 1 word
- Nearby 1 word
Given a specific word in the middle of a sentence (the input word), we look at the words nearby and pick one at random. The network is going to tell us, for every word in our vocabulary, the probability of it being the "nearby word" that we chose. The output probabilities relate to how likely it is to find each vocabulary word near our input word.
Example: I want a glass of orange juice to go along with my cereal
We can generate pairs like
Context | Target | Distance |
---|---|---|
orange | juice | 1 |
orange | glass | -1 |
orange | my | 6 |
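A sketch of generating such (context, target) pairs by sampling a random word within a window around each context word; the window size and sentence are illustrative.

```python
import random

def skipgram_pairs(tokens, window=4, pairs_per_word=1):
    """For each context word, pick a random target within +/- window positions."""
    pairs = []
    for i, context in enumerate(tokens):
        for _ in range(pairs_per_word):
            offset = random.choice([d for d in range(-window, window + 1) if d != 0])
            j = i + offset
            if 0 <= j < len(tokens):
                pairs.append((context, tokens[j]))
    return pairs

sentence = "I want a glass of orange juice to go along with my cereal".lower().split()
print(skipgram_pairs(sentence, window=4))   # e.g. [("i", "glass"), ("want", "a"), ...]
```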
Model details
We want to map a context c to a target t using a softmax:
$$
p(t|c) = \frac{e^{(\theta_t^T e_c)}}{\sum^m_{j=1}e^{(\theta_j^T e_c)}}
$$
where $\theta_t$ is the parameter vector associated with the output (target) word $t$, $e_c$ is the embedding of the context word, and the sum in the denominator runs over all words in the vocabulary.
- If we have a 10000-word vocabulary and are learning 500-dimensional word embeddings, then $\theta_t$ and $e_c$ are both 500-dimensional.
- $\theta_t$ and $e_c$ are both trained with an optimization algorithm such as Adam or gradient descent.
- The loss function is defined as
$$
L(\hat y, y) = -\sum^m_{i=1} y_i \log \hat y_i
$$
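A numpy sketch of this softmax model and its loss; the randomly initialized `theta` and `E` matrices stand in for trained parameters.

```python
import numpy as np

vocab_size, embed_dim = 10000, 500
theta = np.random.randn(vocab_size, embed_dim) * 0.01   # one theta_t per possible target word
E = np.random.randn(vocab_size, embed_dim) * 0.01       # one e_c per word used as context

def p_target_given_context(c):
    # p(t|c) = exp(theta_t . e_c) / sum_j exp(theta_j . e_c)
    logits = theta @ E[c]            # (vocab_size,): this full sum is the expensive part
    logits -= logits.max()           # for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

def loss(c, t):
    # L(y_hat, y) = -sum_i y_i log(y_hat_i), with y one-hot at the true target t
    return -np.log(p_target_given_context(c)[t])
```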
Problems with softmax classification:
- If the vocabulary size is large, it is quite computationally expensive to compute the denominator $\sum^m_{j=1}e^{(\theta_j^T e_c)}$.
The solution is to use a hierarchical softmax classifier.
![Hierarchical softmax](figures/herarchical softmax.png)
In practice, we don't use a perfectly balanced tree: commonly used words (the, of) are placed near the top and less common words (durian) near the bottom.
How do we sample the context c?
- With uniform random sampling, very frequent words such as "the" and "of" would appear too often.
- In practice, we don't sample the context uniformly at random; instead there are heuristics to balance the common words and the less common words.
Continuous bag of words (CBOW) model: instead of predicting a nearby word from a single context word (as in skip-gram), it uses the surrounding context words to predict the word in the middle.
Negative sampling is used to reduce the computational cost of the skip-gram model.
We define a new learning problem: given a context word and a word picked from the dictionary, is that word a real target for the context?
Context(c) | Word(t) | Target?(y) |
---|---|---|
orange | juice | 1 |
orange | king | 0 |
orange | book | 0 |
y = 1 if the word is a real target for the context and 0 if it is not.
The steps to generate the samples are:
- Pick a positive (context, target) pair from the text.
- Pick k negative words at random from the dictionary for the same context. In this case, king and book are negative samples.
- For smaller datasets use k = 5 to 20.
- For larger datasets use k = 2 to 5.
Model
We create a logistic regression model:
$$
P(y=1 \mid c, t) = \sigma(\theta_t^T e_c)
$$
We have 10000 binary classification problems (one per word in the vocabulary), but we only train k+1 of them (k negative + 1 positive example) in each iteration.
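A numpy sketch of one such update: for a (context, target) pair plus the k sampled negative words, only k+1 sigmoid classifiers are touched. The parameter layout and learning rate are illustrative, not the reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_step(E, theta, c, t, negatives, lr=0.01):
    """One SGD step on k+1 binary classifiers: p(y=1|c,word) = sigmoid(theta_word . e_c)."""
    e_c = E[c].copy()
    for word, label in [(t, 1.0)] + [(w, 0.0) for w in negatives]:
        pred = sigmoid(theta[word] @ e_c)
        grad = pred - label                   # derivative of binary cross-entropy w.r.t. the logit
        grad_theta = grad * e_c
        grad_e = grad * theta[word]
        theta[word] -= lr * grad_theta
        e_c -= lr * grad_e
    E[c] = e_c
```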
How to select the negative examples?
The best way, according to the authors, is to sample each word with probability:
$$
p(w_i) = \frac{f(w_i)^{\frac{3}{4}}}{\sum^m_{j=0}f(w_j)^{\frac{3}{4}}}
$$
where $f(w_i)$ is the observed frequency of word $w_i$ in the corpus.
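A sketch of that sampling heuristic; `word_counts` is an assumed mapping from word to corpus frequency.

```python
import numpy as np

def negative_sampling_distribution(word_counts):
    """p(w_i) = f(w_i)^(3/4) / sum_j f(w_j)^(3/4)."""
    words = list(word_counts)
    freqs = np.array([word_counts[w] for w in words], dtype=float)
    probs = freqs ** 0.75
    return words, probs / probs.sum()

words, probs = negative_sampling_distribution({"the": 500, "of": 300, "orange": 20, "durian": 2})
negatives = list(np.random.choice(words, size=5, p=probs))   # sample k = 5 negative words
```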
Global vectors for word representation
Let:
- $x_{ij}$ = number of times word i appears in the context of word j.
- $x_{ij} = x_{ji}$ if we use a symmetric window around each word, but they will not be equal if, for example, the context is only the previous words. GloVe uses a symmetric window, so they are equal.
Model:
$$
J = \sum_{i=1}^{N} \sum_{j=1}^{N} f\left(X_{ij}\right)\left(\theta_{i}^{T} e_{j}+b_{i}+b_{j}^{\prime}-\log X_{ij}\right)^{2}
$$
- $f(X_{ij})$ is a weighting term, with $f(X_{ij}) = 0$ if $X_{ij} = 0$ (using the convention $0 \cdot \log X_{ij} = 0$), so pairs that never co-occur do not contribute.
- Notice that $\theta_i$ and $e_j$ are symmetric; both are randomly initialized and trained, which helps when building the final word embedding.

Given a word $w$, the final embedding is taken as the average $e_w^{(final)} = \frac{e_w + \theta_w}{2}$.
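A numpy sketch of this objective for a given co-occurrence matrix $X$; the specific weighting function $(x/x_{max})^{\alpha}$ is the common choice from the GloVe paper and is an assumption here, since the notes above only require $f(0)=0$.

```python
import numpy as np

def glove_weight(X, x_max=100.0, alpha=0.75):
    # f(X_ij): 0 when X_ij = 0, grows with the count and is capped at 1 (assumed GloVe-paper form)
    return np.where(X < x_max, (X / x_max) ** alpha, 1.0) * (X > 0)

def glove_cost(X, theta, e, b, b_prime):
    # J = sum_ij f(X_ij) * (theta_i . e_j + b_i + b'_j - log X_ij)^2
    log_X = np.log(np.where(X > 0, X, 1.0))   # the X_ij = 0 terms are masked out by f(X_ij)
    pred = theta @ e.T + b[:, None] + b_prime[None, :]
    return np.sum(glove_weight(X) * (pred - log_X) ** 2)

# After training, the final embedding of word w is the average (e_w + theta_w) / 2.
```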
For sentiment classification we usually do not have a large training set, but we can use word embeddings:
- The embedding matrix may have been trained on around 100 billion words.
- The number of features in the word embedding is, for example, 300.
- We can sum or average the embedding vectors of the words in the review and pass the result through a softmax classifier (see the sketch below).
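A minimal sketch of this average-embedding classifier, assuming a pre-trained `embeddings` dictionary and softmax parameters `W`, `b` trained on the sentiment labels.

```python
import numpy as np

def average_embedding(review, embeddings):
    # Average the embedding vectors of the words that appear in the review
    vectors = [embeddings[w] for w in review.lower().split() if w in embeddings]
    return np.mean(vectors, axis=0)

def predict_sentiment(review, embeddings, W, b):
    # Pass the averaged embedding through a softmax classifier with parameters W, b
    z = W @ average_embedding(review, embeddings) + b
    exp = np.exp(z - z.max())
    return exp / exp.sum()          # probabilities over the sentiment classes (e.g. 1-5 stars)
```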
The problem with this approach is that it ignores word order; for example, "Completely lacking in good taste, good service, and good ambience" contains the word "good" three times but is a negative review.
Using an RNN we can build a better sentiment classifier.
In addition, it generalizes better, even to words that weren't in your training set.
Word embeddings can reflect gender, ethnicity, age, sexual orientation and other biases present in the text used to train the model.
For example:
- man:king vs. woman:queen
- man:computer programmer vs. woman:homemaker
- father:doctor vs. mother:nurse
We have to reduce the bias.
Addressing bias in word embeddings:
- Identify the bias direction: compute $e_{he}-e_{she}$, $e_{male}-e_{female}$, ... and average them to obtain the bias direction $g$.
- Neutralize: for every word that is not definitional (in this case, babysitter and doctor), project it to get rid of the bias (see the sketch after this list). After that, they will be neutral in terms of gender.
If $g$ is the bias direction, then $g_{\perp}$ is the unbiased direction; in the example, we have to project $e_{receptionist}$ onto $g_{\perp}$:
$$
\begin{aligned}
e^{bias\_component} &= \frac{e \cdot g}{||g||_2^2} \, g \\
e^{debiased} &= e - e^{bias\_component}
\end{aligned}
$$
- Equalize pairs:
$$
\begin{array}{ccl}
\mu & = & \frac{e_{w1} + e_{w2}}{2} \\
\mu_{B} & = & \frac{\mu \cdot \text{bias\_axis}}{||\text{bias\_axis}||_2^2} \, \text{bias\_axis} \\
\mu_{\perp} & = & \mu - \mu_{B} \\
e_{w1B} & = & \frac{e_{w1} \cdot \text{bias\_axis}}{||\text{bias\_axis}||_2^2} \, \text{bias\_axis} \\
e_{w2B} & = & \frac{e_{w2} \cdot \text{bias\_axis}}{||\text{bias\_axis}||_2^2} \, \text{bias\_axis} \\
e_{w1B}^{corrected} & = & \sqrt{\left|1 - ||\mu_{\perp}||_2^2\right|} \cdot \frac{e_{w1B} - \mu_B}{||(e_{w1} - \mu_{\perp}) - \mu_B||} \\
e_{w2B}^{corrected} & = & \sqrt{\left|1 - ||\mu_{\perp}||_2^2\right|} \cdot \frac{e_{w2B} - \mu_B}{||(e_{w2} - \mu_{\perp}) - \mu_B||} \\
e_1 & = & e_{w1B}^{corrected} + \mu_{\perp} \\
e_2 & = & e_{w2B}^{corrected} + \mu_{\perp}
\end{array}
$$
  - We want each pair to differ only in gender:
    - grandfather - grandmother
    - he - she
    - boy - girl
  - We move grandfather and grandmother to points that are symmetric about the non-bias axis (equidistant from it).
  - The number of word pairs you need to do this for is relatively small.
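A numpy sketch of the neutralize and equalize steps above, following those formulas directly; the bias direction `g` / `bias_axis` is assumed to have been computed as described.

```python
import numpy as np

def neutralize(e, g):
    """Neutralize: remove the component of e along the bias direction g."""
    bias_component = (np.dot(e, g) / np.sum(g ** 2)) * g
    return e - bias_component

def equalize(e_w1, e_w2, bias_axis):
    """Equalize: make a pair such as (grandfather, grandmother) differ only along the bias axis."""
    mu = (e_w1 + e_w2) / 2.0
    mu_B = (np.dot(mu, bias_axis) / np.sum(bias_axis ** 2)) * bias_axis
    mu_orth = mu - mu_B

    e_w1B = (np.dot(e_w1, bias_axis) / np.sum(bias_axis ** 2)) * bias_axis
    e_w2B = (np.dot(e_w2, bias_axis) / np.sum(bias_axis ** 2)) * bias_axis

    scale = np.sqrt(np.abs(1.0 - np.sum(mu_orth ** 2)))
    e1B = scale * (e_w1B - mu_B) / np.linalg.norm((e_w1 - mu_orth) - mu_B)
    e2B = scale * (e_w2B - mu_B) / np.linalg.norm((e_w2 - mu_orth) - mu_B)

    return e1B + mu_orth, e2B + mu_orth
```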
Projecting one vector onto another: given $\vec e$, we want to project it onto $\vec g$.
- Direction (unit) vector: $\frac{\vec g}{||\vec g||_2}$
- $\vec{P_{e,g}} = ||\vec{P_{e,g}}|| \cdot \frac{\vec g}{||\vec g||_2} = \frac{\vec e \cdot \vec g}{||\vec g||_2} \cdot \frac{\vec g}{||\vec g||_2} = \frac{\vec e \cdot \vec g}{||\vec g||_2^2} \, \vec g$
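A quick numpy check of this projection formula on a toy 2-D example:

```python
import numpy as np

def project(e, g):
    # P_{e,g} = (e . g / ||g||_2^2) * g
    return (np.dot(e, g) / np.dot(g, g)) * g

e = np.array([2.0, 1.0])
g = np.array([1.0, 0.0])
print(project(e, g))   # [2. 0.] -- the component of e along g
```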