[TOC]
## Word representation
Until now we have represented words with one-hot encodings, for example as indices into the dictionary.
But this representation has a weakness: the algorithm can't learn relationships between words.
So instead of one-hot encoding, we can use a feature representation:
Features | Man (5391) | Woman (9853) | King (4914) | Queen (7157) | Apple (466) | Orange (6257) |
---|---|---|---|---|---|---|
Gender | -1 | 1 | -0.95 | 0.97 | 0 | 0.01 |
Royal | 0.01 | 0.02 | 0.93 | 0.95 | -0.01 | 0 |
Age | 0.03 | 0.02 | 0.7 | 0.69 | 0.03 | 0.02 |
Food | 0.01 | 0.01 | 0.02 | 0.01 | 0.95 | 0.95 |
... | ... | ... | ... | ... | ... | ... |

Each column is the embedding vector of that word, e.g. $e_{5391}$ for man and $e_{9853}$ for woman.
For example, each word will have 300 features:
- Gender: -1 if man, 1 if woman, so apple and orange are close to 0.
- Food: close to 1 if the word is a food, so man, woman, king and queen are close to 0 (they are not foods).

Then apple and orange share more similarities.
We call this representation word embeddings.
To visualize word embeddings we use the t-SNE algorithm to reduce the features to 2 dimensions: the more similar two words are, the closer they appear in the figure.
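A minimal sketch of this kind of visualization with scikit-learn's t-SNE; the word list and the random `E` matrix below are placeholders for real trained embeddings.

```python
import numpy as np
from sklearn.manifold import TSNE

# Illustrative data: 6 words with 300-dimensional embeddings (random here, trained in practice)
words = ["man", "woman", "king", "queen", "apple", "orange"]
E = np.random.randn(len(words), 300)

# Reduce to 2 dimensions; with real embeddings, similar words end up close together
coords = TSNE(n_components=2, perplexity=3, init="random").fit_transform(E)
for word, (x, y) in zip(words, coords):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```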
Example: named entity recognition.
- "Sally Johnson is an orange farmer."
- "Robert Lin is an apple farmer." It will probably be easy to recognize the entity here thanks to the previous sentence.
- "Robert Lin is a durian cultivator." Here we face unusual words, like durian or cultivator, which may not be in our dictionary; with one-hot encoding it would be difficult to recognize the entity. However, if we have a trained word embedding, it will tell us that durian is a fruit like orange and cultivator is similar to farmer, so the model can still recognize that Robert Lin is a person.
The effect is similar to transfer learning.
Transfer learning and word embeddings:
- Learn word embeddings from a large text corpus, or download pre-trained embeddings online.
- Transfer the embeddings to a new task with a smaller training set.
- Optional: continue to fine-tune the word embeddings with the new data.
It also can help with analogy reasoning.
In the previous table of man, woman, king and queen:
$$
e_{man} - e_{woman} = \begin{bmatrix} -1 \\ 0.01 \\ 0.03 \\ 0.09 \end{bmatrix} - \begin{bmatrix} 1 \\ 0.02 \\ 0.02 \\ 0.01 \end{bmatrix} \approx \begin{bmatrix} -2 \\ 0 \\ 0 \\ 0 \end{bmatrix}
$$
The main difference between man and woman is gender.
Similarly, for king and queen the main difference is also gender:
$$
e_{king} - e_{queen} = \begin{bmatrix} -0.95 \\ 0.93 \\ 0.70 \\ 0.02 \end{bmatrix} - \begin{bmatrix} 0.97 \\ 0.95 \\ 0.69 \\ 0.01 \end{bmatrix} \approx \begin{bmatrix} -2 \\ 0 \\ 0 \\ 0 \end{bmatrix}
$$
If we know that man is analogous to woman, what corresponds to king?
What we want is to find the word $w$ that maximizes the similarity: $\arg\max_w \; sim(e_w, e_{king}-e_{man}+e_{woman})$
The most used similarity function is cosine similarity.
According to the Euclidean dot product formula:
$$
A \cdot B =\left| A \right|\left| B \right|\cos \theta
$$
Then
$$
sim(u, v) = \frac{u^T v}{|| u ||_2 || v ||_2}
$$
where $||u||_2$ is the Euclidean (L2) norm of $u$.
The smaller the angle between the two vectors, the more similar they are.
You can also use Euclidean distance, but it measures dissimilarity rather than similarity.
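A small numpy sketch of cosine similarity and the analogy search; `embeddings` is an assumed dictionary mapping words to their embedding vectors.

```python
import numpy as np

def cosine_similarity(u, v):
    # sim(u, v) = u . v / (||u||_2 * ||v||_2)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def complete_analogy(word_a, word_b, word_c, embeddings):
    # "a is to b as c is to ?": find w maximizing sim(e_w, e_c - e_a + e_b)
    target = embeddings[word_c] - embeddings[word_a] + embeddings[word_b]
    best_word, best_sim = None, -2.0
    for w, e_w in embeddings.items():
        if w in (word_a, word_b, word_c):
            continue
        s = cosine_similarity(e_w, target)
        if s > best_sim:
            best_word, best_sim = w, s
    return best_word

# complete_analogy("man", "woman", "king", embeddings)  # should return "queen"
```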
Suppose that we have a 10000-word dictionary and we use 300 features; then we can construct a 300 x 10000 embedding matrix $E$.
Multiplying the embedding matrix by the original one-hot encoded vector gives the embedding vector of that word.
Our goal is to learn an embedding matrix E.
In practice, we use a specialized lookup function to get an embedding. For example, Keras has an `Embedding` layer.
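A sketch of why the lookup works: multiplying $E$ by a one-hot vector just selects one column, so in practice we index that column directly. The shapes assume the 10000-word vocabulary and 300 features above; the random $E$ stands in for a trained matrix.

```python
import numpy as np

vocab_size, n_features = 10000, 300
E = np.random.randn(n_features, vocab_size)   # embedding matrix, shape (300, 10000)

j = 6257                       # e.g. the index of "orange"
o_j = np.zeros(vocab_size)     # one-hot vector for word j
o_j[j] = 1.0

e_j = E @ o_j                  # embedding vector, but costs 300 * 10000 multiplications
e_j_lookup = E[:, j]           # the same vector, obtained by a cheap column lookup
assert np.allclose(e_j, e_j_lookup)
# In Keras this lookup is what tf.keras.layers.Embedding(10000, 300) does.
```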
For example, we want to predict the next word given the sentence: "I want a glass of orange ___".
Each context word $j$ is represented by its one-hot vector $O_j$ and mapped through the embedding matrix $E$ to its embedding vector $e_j$.
- Once we have the feature vectors, we feed them into a neural network and then pass the result through a softmax layer to generate a prediction.
- The hidden layer has parameters $W^{[1]}, b^{[1]}$.
- The softmax layer has its own parameters $W^{[2]}, b^{[2]}$.
Normally we use a fixed window size, for example the last 4 words, to predict the next one.
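A minimal Keras sketch of this fixed-window neural language model (4 previous words, 10000-word vocabulary, 300-dimensional embeddings); the hidden layer size of 128 is an arbitrary choice for illustration.

```python
import tensorflow as tf

vocab_size, embed_dim, window = 10000, 300, 4

model = tf.keras.Sequential([
    tf.keras.Input(shape=(window,), dtype="int32"),           # indices of the last 4 words
    tf.keras.layers.Embedding(vocab_size, embed_dim),         # E: looks up e_j for each index
    tf.keras.layers.Flatten(),                                 # stack the 4 embeddings (1200 values)
    tf.keras.layers.Dense(128, activation="relu"),             # hidden layer: W[1], b[1]
    tf.keras.layers.Dense(vocab_size, activation="softmax"),   # softmax layer: W[2], b[2]
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(context_word_indices, next_word_indices, ...)
```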
Other context/target pairs:
- Last 4 words
- 2 words left and right
- Last 1 word
- Nearby 1 word
Given a specific word in the middle of a sentence (the input word), we look at the words nearby and pick one at random. The network is going to tell us, for every word in our vocabulary, the probability of it being the "nearby word" that we chose. The output probabilities relate to how likely it is to find each vocabulary word near our input word.
Example: I want a glass of orange juice to go along with my cereal
We can generate pairs like
Context | Target | Distance |
---|---|---|
orange | juice | 1 |
orange | glass | -1 |
orange | my | 6 |
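A sketch of generating such (context, target) pairs by sampling a random word within a window around each context word; the window size and sentence are illustrative.

```python
import random

def skipgram_pairs(tokens, window=4, pairs_per_word=1):
    """For each context word, pick a random target within +/- window positions."""
    pairs = []
    for i, context in enumerate(tokens):
        for _ in range(pairs_per_word):
            offset = random.choice([d for d in range(-window, window + 1) if d != 0])
            j = i + offset
            if 0 <= j < len(tokens):
                pairs.append((context, tokens[j]))
    return pairs

sentence = "I want a glass of orange juice to go along with my cereal".lower().split()
print(skipgram_pairs(sentence, window=4))   # e.g. [("i", "glass"), ("want", "a"), ...]
```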
Model details
We want to map a context c to a target t using a softmax:
$$
p(t|c) = \frac{e^{(\theta_t^T e_c)}}{\sum^m_{j=1}e^{(\theta_j^T e_c)}}
$$
where $\theta_t$ is the parameter vector associated with the output (target) word $t$, $e_c$ is the embedding of the context word, and the sum in the denominator runs over all words in the vocabulary.
- If we have a 10000-word vocabulary and are learning 500-dimensional word embeddings, then $\theta_t$ and $e_c$ are both 500-dimensional.
- $\theta_t$ and $e_c$ are both trained with an optimization algorithm such as Adam or gradient descent.
- The loss function is defined as
$$
L(\hat y, y) = -\sum^m_{i=1} y_i \log \hat y_i
$$
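A numpy sketch of this softmax model and its loss; the randomly initialized `theta` and `E` matrices stand in for trained parameters.

```python
import numpy as np

vocab_size, embed_dim = 10000, 500
theta = np.random.randn(vocab_size, embed_dim) * 0.01   # one theta_t per possible target word
E = np.random.randn(vocab_size, embed_dim) * 0.01       # one e_c per word used as context

def p_target_given_context(c):
    # p(t|c) = exp(theta_t . e_c) / sum_j exp(theta_j . e_c)
    logits = theta @ E[c]            # (vocab_size,): this full sum is the expensive part
    logits -= logits.max()           # for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

def loss(c, t):
    # L(y_hat, y) = -sum_i y_i log(y_hat_i), with y one-hot at the true target t
    return -np.log(p_target_given_context(c)[t])
```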
Problems with softmax classification:
- If the vocabulary size is large, it is quite computationally expensive to compute the denominator $\sum^m_{j=1}e^{(\theta_j^T e_c)}$.
The solution is to use a hierarchical softmax classifier.
![Hierarchical softmax](figures/herarchical softmax.png)
In practice, we don't use a perfectly balanced tree: commonly used words (the, of) are placed near the top and less common words (durian) near the bottom.
How do we sample the context c?
- With uniform random sampling, very frequent words such as "the" and "of" would appear too often.
- In practice, we don't sample the context uniformly at random; instead there are heuristics to balance the common words and the less common words.
Continuous bag of words (CBOW) model: instead of predicting a nearby word from a single context word (as in skip-gram), it uses the surrounding context words to predict the word in the middle.
Negative sampling is used to reduce the computational cost of the skip-gram model.
We define a new learning problem: given a context word and a word picked from the dictionary, is that word a real target for the context?
Context(c) | Word(t) | Target?(y) |
---|---|---|
orange | juice | 1 |
orange | king | 0 |
orange | book | 0 |
y = 1 if the word is a real target for the context and 0 if it is not.
The steps to generate the samples are:
- Pick a positive (context, target) pair from the text.
- Pick k negative words at random from the dictionary for the same context. In this case, king and book are negative samples.
- For smaller datasets use k = 5 to 20.
- For larger datasets use k = 2 to 5.
Model
We create a logistic regression model:
$$
P(y=1 \mid c, t) = \sigma(\theta_t^T e_c)
$$
We have 10000 binary classification problems (one per word in the vocabulary), but we only train k+1 of them (k negative + 1 positive example) in each iteration.
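A numpy sketch of one such update: for a (context, target) pair plus the k sampled negative words, only k+1 sigmoid classifiers are touched. The parameter layout and learning rate are illustrative, not the reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_step(E, theta, c, t, negatives, lr=0.01):
    """One SGD step on k+1 binary classifiers: p(y=1|c,word) = sigmoid(theta_word . e_c)."""
    e_c = E[c].copy()
    for word, label in [(t, 1.0)] + [(w, 0.0) for w in negatives]:
        pred = sigmoid(theta[word] @ e_c)
        grad = pred - label                   # derivative of binary cross-entropy w.r.t. the logit
        grad_theta = grad * e_c
        grad_e = grad * theta[word]
        theta[word] -= lr * grad_theta
        e_c -= lr * grad_e
    E[c] = e_c
```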
How to select the negative examples?
The best way, according to the authors, is to sample each word with probability:
$$
p(w_i) = \frac{f(w_i)^{\frac{3}{4}}}{\sum^m_{j=0}f(w_j)^{\frac{3}{4}}}
$$
where $f(w_i)$ is the observed frequency of word $w_i$ in the corpus.
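A sketch of that sampling heuristic; `word_counts` is an assumed mapping from word to corpus frequency.

```python
import numpy as np

def negative_sampling_distribution(word_counts):
    """p(w_i) = f(w_i)^(3/4) / sum_j f(w_j)^(3/4)."""
    words = list(word_counts)
    freqs = np.array([word_counts[w] for w in words], dtype=float)
    probs = freqs ** 0.75
    return words, probs / probs.sum()

words, probs = negative_sampling_distribution({"the": 500, "of": 300, "orange": 20, "durian": 2})
negatives = list(np.random.choice(words, size=5, p=probs))   # sample k = 5 negative words
```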
Global vectors for word representation
Let:
- $x_{ij}$ = number of times word i appears in the context of word j.
- $x_{ij} = x_{ji}$ if we use a symmetric window around each word, but they will not be equal if, for example, the context is only the previous words. GloVe uses a symmetric window, so they are equal.
Model:
$$
J = \sum_{i=1}^{N} \sum_{j=1}^{N} f\left(X_{ij}\right)\left(\theta_{i}^{T} e_{j}+b_{i}+b_{j}^{\prime}-\log X_{ij}\right)^{2}
$$
- $f(X_{ij})$ is a weighting term, with $f(X_{ij}) = 0$ if $X_{ij} = 0$ (using the convention $0 \cdot \log X_{ij} = 0$), so pairs that never co-occur do not contribute.
- Notice that $\theta_i$ and $e_j$ are symmetric; both are randomly initialized and trained, which helps when building the final word embedding.

Given a word $w$, the final embedding is taken as the average $e_w^{(final)} = \frac{e_w + \theta_w}{2}$.
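A numpy sketch of this objective for a given co-occurrence matrix $X$; the specific weighting function $(x/x_{max})^{\alpha}$ is the common choice from the GloVe paper and is an assumption here, since the notes above only require $f(0)=0$.

```python
import numpy as np

def glove_weight(X, x_max=100.0, alpha=0.75):
    # f(X_ij): 0 when X_ij = 0, grows with the count and is capped at 1 (assumed GloVe-paper form)
    return np.where(X < x_max, (X / x_max) ** alpha, 1.0) * (X > 0)

def glove_cost(X, theta, e, b, b_prime):
    # J = sum_ij f(X_ij) * (theta_i . e_j + b_i + b'_j - log X_ij)^2
    log_X = np.log(np.where(X > 0, X, 1.0))   # the X_ij = 0 terms are masked out by f(X_ij)
    pred = theta @ e.T + b[:, None] + b_prime[None, :]
    return np.sum(glove_weight(X) * (pred - log_X) ** 2)

# After training, the final embedding of word w is the average (e_w + theta_w) / 2.
```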
For sentiment classification we usually do not have a large training set, but we can use word embeddings:
- The embedding matrix may have been trained on around 100 billion words.
- The number of features in the word embedding is, for example, 300.
- We can sum or average the embedding vectors of the words in the review and pass the result through a softmax classifier (see the sketch below).
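A minimal sketch of this average-embedding classifier, assuming a pre-trained `embeddings` dictionary and softmax parameters `W`, `b` trained on the sentiment labels.

```python
import numpy as np

def average_embedding(review, embeddings):
    # Average the embedding vectors of the words that appear in the review
    vectors = [embeddings[w] for w in review.lower().split() if w in embeddings]
    return np.mean(vectors, axis=0)

def predict_sentiment(review, embeddings, W, b):
    # Pass the averaged embedding through a softmax classifier with parameters W, b
    z = W @ average_embedding(review, embeddings) + b
    exp = np.exp(z - z.max())
    return exp / exp.sum()          # probabilities over the sentiment classes (e.g. 1-5 stars)
```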
The problem with this approach is that it ignores word order; for example, "Completely lacking in good taste, good service, and good ambience" contains the word "good" three times but is a negative review.
Using an RNN we can build a better sentiment classifier.
In addition, it generalizes better, even to words that weren't in your training set.
Word embeddings can reflect gender, ethnicity, age, sexual orientation and other biases present in the text used to train the model.
For example:
- man:king vs. woman:queen
- man:computer programmer vs. woman:homemaker
- father:doctor vs. mother:nurse
We have to reduce the bias.
Addressing bias in word embeddings:
- Identify the bias direction: compute $e_{he}-e_{she}$, $e_{male}-e_{female}$, ... and average them to obtain the bias direction $g$.
- Neutralize: for every word that is not definitional (in this case, babysitter and doctor), project it to get rid of the bias (see the sketch after this list). After that, they will be neutral in terms of gender.
If $g$ is the bias direction, then $g_{\perp}$ is the unbiased direction; in the example, we have to project $e_{receptionist}$ onto $g_{\perp}$:
$$
\begin{aligned}
e^{bias\_component} &= \frac{e \cdot g}{||g||_2^2} \, g \\
e^{debiased} &= e - e^{bias\_component}
\end{aligned}
$$
- Equalize pairs:
$$
\begin{array}{ccl}
\mu & = & \frac{e_{w1} + e_{w2}}{2} \\
\mu_{B} & = & \frac{\mu \cdot \text{bias\_axis}}{||\text{bias\_axis}||_2^2} \, \text{bias\_axis} \\
\mu_{\perp} & = & \mu - \mu_{B} \\
e_{w1B} & = & \frac{e_{w1} \cdot \text{bias\_axis}}{||\text{bias\_axis}||_2^2} \, \text{bias\_axis} \\
e_{w2B} & = & \frac{e_{w2} \cdot \text{bias\_axis}}{||\text{bias\_axis}||_2^2} \, \text{bias\_axis} \\
e_{w1B}^{corrected} & = & \sqrt{\left|1 - ||\mu_{\perp}||_2^2\right|} \cdot \frac{e_{w1B} - \mu_B}{||(e_{w1} - \mu_{\perp}) - \mu_B||} \\
e_{w2B}^{corrected} & = & \sqrt{\left|1 - ||\mu_{\perp}||_2^2\right|} \cdot \frac{e_{w2B} - \mu_B}{||(e_{w2} - \mu_{\perp}) - \mu_B||} \\
e_1 & = & e_{w1B}^{corrected} + \mu_{\perp} \\
e_2 & = & e_{w2B}^{corrected} + \mu_{\perp}
\end{array}
$$
  - We want each pair to differ only in gender:
    - grandfather - grandmother
    - he - she
    - boy - girl
  - We move grandfather and grandmother to points that are symmetric about the non-bias axis (equidistant from it).
  - The number of word pairs you need to do this for is relatively small.
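A numpy sketch of the neutralize and equalize steps above, following those formulas directly; the bias direction `g` / `bias_axis` is assumed to have been computed as described.

```python
import numpy as np

def neutralize(e, g):
    """Neutralize: remove the component of e along the bias direction g."""
    bias_component = (np.dot(e, g) / np.sum(g ** 2)) * g
    return e - bias_component

def equalize(e_w1, e_w2, bias_axis):
    """Equalize: make a pair such as (grandfather, grandmother) differ only along the bias axis."""
    mu = (e_w1 + e_w2) / 2.0
    mu_B = (np.dot(mu, bias_axis) / np.sum(bias_axis ** 2)) * bias_axis
    mu_orth = mu - mu_B

    e_w1B = (np.dot(e_w1, bias_axis) / np.sum(bias_axis ** 2)) * bias_axis
    e_w2B = (np.dot(e_w2, bias_axis) / np.sum(bias_axis ** 2)) * bias_axis

    scale = np.sqrt(np.abs(1.0 - np.sum(mu_orth ** 2)))
    e1B = scale * (e_w1B - mu_B) / np.linalg.norm((e_w1 - mu_orth) - mu_B)
    e2B = scale * (e_w2B - mu_B) / np.linalg.norm((e_w2 - mu_orth) - mu_B)

    return e1B + mu_orth, e2B + mu_orth
```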
Projecting one vector onto another: given $\vec e$, we want to project it onto $\vec g$.
- Direction (unit) vector: $\frac{\vec g}{||\vec g||_2}$
- $\vec{P_{e,g}} = ||\vec{P_{e,g}}|| \cdot \frac{\vec g}{||\vec g||_2} = \frac{\vec e \cdot \vec g}{||\vec g||_2} \cdot \frac{\vec g}{||\vec g||_2} = \frac{\vec e \cdot \vec g}{||\vec g||_2^2} \, \vec g$
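A quick numpy check of this projection formula on a toy 2-D example:

```python
import numpy as np

def project(e, g):
    # P_{e,g} = (e . g / ||g||_2^2) * g
    return (np.dot(e, g) / np.dot(g, g)) * g

e = np.array([2.0, 1.0])
g = np.array([1.0, 0.0])
print(project(e, g))   # [2. 0.] -- the component of e along g
```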