[TOC]
Hyperparameters:
- Most important: $\alpha$
- Second important: $\beta$ (momentum term), number of layers, mini-batch size
- Third important: number of hidden units, learning rate decay
- $\beta_1=0.9, \beta_2=0.999, \varepsilon=10^{-8}$: normally we use the default values for Adam
How to tune:
- Don't use a grid; sample random values, because we don't know in advance which hyperparameter will be the most important
- Coarse to fine: zoom into a smaller region of hyperparameters and then sample more densely within this space
Appropriate scale for hyperparameters
- $\alpha$: from 0.0001 to 1, but it is better to sample on a log scale (see the sketch after this list): `r = -4*np.random.rand()  # r in [-4, 0]`, then `alpha = 10**r  # 10^-4 ... 10^0`
- $\beta$: from 0.9 to 0.999, i.e. $1-\beta$ in [0.001, 0.1]: `r = -2*np.random.rand() - 1  # r in [-3, -1]`, then `beta = 1 - 10**r  # beta in [0.9, 0.999]`
- It is important not to use a linear scale because $\frac{1}{1-\beta}$ is very sensitive when $\beta$ is close to 1:
  - $\beta: 0.9000 \rightarrow 0.9005$: we are still averaging over about 10 values, almost no change
  - $\beta: 0.999 \,(\approx 1000 \text{ samples}) \rightarrow 0.9995 \,(\approx 2000 \text{ samples})$: a huge difference
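To make the sampling above concrete, here is a minimal sketch, assuming numpy; the ranges follow the notes above, and you would then train one model per candidate and re-sample more densely around the best region (coarse to fine).

```python
import numpy as np

def sample_hyperparameters(rng=np.random):
    """Draw one (alpha, beta) pair at random on a log scale."""
    r_alpha = -4 * rng.rand()        # r in [-4, 0]
    alpha = 10 ** r_alpha            # alpha in [1e-4, 1]
    r_beta = -2 * rng.rand() - 1     # r in [-3, -1]
    beta = 1 - 10 ** r_beta          # beta in [0.9, 0.999]
    return alpha, beta

# Random search (not a grid): draw many candidates, train one model per
# candidate, keep the best, then re-sample around it (coarse to fine).
candidates = [sample_hyperparameters() for _ in range(25)]
```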
Babysitting one model: when you have a huge dataset but not enough computational resources
- Day 0 you might initialize your parameters randomly and then start training. Then you watch your learning curve gradually decrease over the days.
- Each day you nudge your parameters a little during training (increase/decrease the learning rate, add momentum if the model performs well; otherwise go back to the previous model).
- This is called the panda approach.
Training many models in parallel
- Run multiple models with different hyperparameters and compare their costs $J$
- Pick the best one
- Normalize inputs to speed up learning
- Batch normalization: normalize the hidden layer values $Z^{[l]}$ as well, so that training of the later layers also speeds up
Implementation
Given some intermediate values $Z^{(1)}, \dots, Z^{(m)}$ in the NN:
- Compute the mean: $\mu = \frac{1}{m} \sum_i Z^{(i)}$
- Compute the variance: $\sigma^2 = \frac{1}{m} \sum_i (Z^{(i)}-\mu)^2$
- Normalize: $$Z_{norm}^{(i)} = \frac{Z^{(i)}-\mu}{\sqrt{\sigma^2+\epsilon}}$$
Then $Z$ has mean 0 and variance 1. But we don't always want $Z$ to have mean 0 and variance 1 (for example, with a sigmoid activation we may not want all the values squeezed into the near-linear region around 0):
- $\widetilde{Z}^{(i)} = \gamma Z_{norm}^{(i)} + \beta$, where $\gamma, \beta$ are learnable parameters
- $\beta, \gamma$ can be learned using Adam, gradient descent with momentum, or RMSprop, not just with plain gradient descent
- They set the mean and the variance of the linear variable $z^{[l]}$ of a given layer
- If $$ \begin{aligned} \gamma & = \sqrt{\sigma^2 +\epsilon} \\ \beta & = \mu \end{aligned} $$ then $\widetilde{Z}^{(i)} = Z^{(i)}$ (batch norm just recovers the original $Z^{(i)}$)
- Batch normalization is usually applied per mini-batch.
- As we subtract the mean, adding a constant has no effect, so the parameters are just $W, \beta, \gamma$ (the bias $b^{[l]}$ can be dropped).
- Dimensions: $Z^{[l]}: [n^{[l]}, 1]$, $b^{[l]}: [n^{[l]}, 1]$, $\beta^{[l]}: [n^{[l]}, 1]$, $\gamma^{[l]}: [n^{[l]}, 1]$
Implementation
for t = 1 ... num_mini_batches
    Compute forward prop on X^{t}
        In each hidden layer, use BN to replace Z^[l] with Ztilde^[l]
    Use backprop to compute dW^[l], dbeta^[l], dgamma^[l]
    Update parameters:
        W^[l] = W^[l] - alpha * dW^[l]
        beta^[l] = beta^[l] - alpha * dbeta^[l]
        gamma^[l] = gamma^[l] - alpha * dgamma^[l]
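As a concrete illustration of the forward step in the loop above, here is a minimal numpy sketch (my own illustration, not the course's code; the helper name `batch_norm_forward` is assumed) of batch norm applied to one layer's $Z^{[l]}$ on a single mini-batch.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    # Z: (n_units, batch_size); gamma, beta: (n_units, 1)
    mu = Z.mean(axis=1, keepdims=True)        # per-unit mean over the mini-batch
    var = Z.var(axis=1, keepdims=True)        # per-unit variance over the mini-batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)    # mean 0, variance 1
    Z_tilde = gamma * Z_norm + beta           # learnable rescaling
    return Z_tilde, (Z_norm, mu, var)         # cache is reused in backprop

# usage on a toy layer with 4 units and a mini-batch of 8 examples
Z = np.random.randn(4, 8)
gamma, beta = np.ones((4, 1)), np.zeros((4, 1))
Z_tilde, cache = batch_norm_forward(Z, gamma, beta)
```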
- Same reason as normalizing the inputs $X$: it makes gradient descent easier.
- It makes changes of the weights in earlier layers have less effect on later layers.
- When the distribution of the input X changes (covariate shift), we have to retrain the model. With batch normalization, no matter how the hidden-layer values shift, their mean and variance remain the same, so later layers see a more stable distribution.
- It also has a regularization effect:
  - Each mini-batch is scaled by the mean and variance computed on just that mini-batch.
  - This adds some noise to $Z^{[l]}$ within that mini-batch because the mean/variance are calculated on a small number of samples. So, similar to dropout, it adds some noise to each hidden layer's activations.
  - Because of that, it has a slight regularization effect (it helps prevent overfitting).
  - If we use a larger mini-batch size, we reduce the noise and therefore the regularization effect.
- At training time (per mini-batch): $$ \begin{aligned} \mu & =\frac{1}{m} \sum_{i} z^{(i)} \\ \sigma^{2} & =\frac{1}{m} \sum_{i}\left(z^{(i)}-\mu\right)^{2} \\ z_{norm}^{(i)} &=\frac{z^{(i)}-\mu}{\sqrt{\sigma^{2}+\varepsilon}} \\ \tilde{z}^{(i)} &=\gamma z_{norm}^{(i)}+\beta \end{aligned} $$ At test time we don't have mini-batches:
- Estimate $\mu, \sigma^2$ using an exponentially weighted average across the mini-batches seen during training.
- At test time, compute $z_{norm}$ and $\widetilde{Z}$ with the estimated $\mu, \sigma^2$.
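A minimal sketch of how those test-time statistics could be tracked, assuming a small helper class of my own (`RunningStats`, with an assumed `momentum` parameter) that keeps exponentially weighted averages of the per-mini-batch mean and variance; deep learning frameworks handle this for you.

```python
import numpy as np

class RunningStats:
    def __init__(self, n_units, momentum=0.9):
        self.momentum = momentum
        self.mu = np.zeros((n_units, 1))
        self.var = np.ones((n_units, 1))

    def update(self, Z):
        # Z: (n_units, batch_size); called once per mini-batch during training
        mu_batch = Z.mean(axis=1, keepdims=True)
        var_batch = Z.var(axis=1, keepdims=True)
        self.mu = self.momentum * self.mu + (1 - self.momentum) * mu_batch
        self.var = self.momentum * self.var + (1 - self.momentum) * var_batch

    def normalize_test(self, Z, gamma, beta, eps=1e-8):
        # at test time, use the running estimates instead of batch statistics
        Z_norm = (Z - self.mu) / np.sqrt(self.var + eps)
        return gamma * Z_norm + beta
```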
Softmax regression
- C: number of classes
- $Z^{[L]} = W^{[L]}a^{[L-1]}+b^{[L]}$
- Activation function: $$ \begin{array}{lcl} t & = & e^{z^{[L]}} \\ a^{[L]} & = & \frac{e^{z^{[L]}}}{\sum_{j=1}^C t_j} \\ a^{[L]}_i & = & \frac{t_i}{\sum_{j=1}^C t_j} \end{array} $$
If we use subindex $j$ for $a$ and $k$ for $z$:
$$
a_j = \frac{e^{z_j}}{\sum_{c=1}^C e^{z_c}}
$$
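As a quick illustration, here is a minimal numpy version of this activation (the max-subtraction is a standard numerical-stability trick, not part of the notes).

```python
import numpy as np

def softmax(z):
    t = np.exp(z - z.max(axis=0, keepdims=True))   # t = e^z, shifted for stability
    return t / t.sum(axis=0, keepdims=True)        # a_i = t_i / sum_j t_j

z = np.array([[5.0], [2.0], [-1.0], [3.0]])        # Z^[L] for C = 4 classes
a = softmax(z)                                      # entries are in (0, 1) and sum to 1
```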
The cost function is defined as: $$ J = \frac{1}{m}\sum^m_{i=1}L(\hat y^{(i)}, y^{(i)}) $$
When we compute the backward propagation, we divide it into 3 steps:
- Derivative of the loss function: $\frac{\partial L}{\partial a}$
- Derivative of the activation function: $\frac{\partial a}{\partial z}$
- Then $\frac{\partial J}{\partial z} = \frac{\partial J}{\partial L}\frac{\partial L}{\partial a}\frac{\partial a}{\partial z}$
Derivative of the softmax activation function
Following the forward-propagation example, when we differentiate $a_j$ with respect to $z_k$ there are two cases:
- $j = k$: $$ \begin{array}{rcl} \frac{\partial a_j}{\partial z_k} & = & \frac{e^{z_j}\sum_{c=1}^C e^{z_c} - e^{z_k}e^{z_j}}{\left(\sum_{c=1}^C e^{z_c}\right)^2} \\ & = & \frac{e^{z_j}}{\sum_{c=1}^C e^{z_c}}\left(1 - \frac{e^{z_k}}{\sum_{c=1}^C e^{z_c}}\right) \\ & = & a_j (1-a_k) \\ & = & a_j (1-a_j) \end{array} $$
- $j \ne k$: $$ \begin{array}{rcl} \frac{\partial a_j}{\partial z_k} & = & \frac{0\cdot\sum_{c=1}^C e^{z_c} - e^{z_j}e^{z_k}}{\left(\sum_{c=1}^C e^{z_c}\right)^2} \\ & = & -\frac{e^{z_j}}{\sum_{c=1}^C e^{z_c}}\cdot\frac{e^{z_k}}{\sum_{c=1}^C e^{z_c}} \\ & = & -a_ja_k \end{array} $$ Then $$ \frac{\partial a_j}{\partial z_k} = \left\{ \begin{array}{ll} a_j(1-a_j) & \text{if } j = k \\ -a_ja_k & \text{if } j \ne k \end{array} \right. $$
Loss function
The loss function for a single sample compares the real label with the prediction: $$ L(a, y) = -\sum^C_{j=1}y_j \log a_j \qquad\Rightarrow\qquad \frac{\partial L}{\partial a_j} = -\frac{y_j}{a_j} $$ Then, in the previous example (differentiating with respect to $z_0$), the chain-rule terms are $$ \begin{array}{lclcl} \frac{\partial L}{\partial a_0}\frac{\partial a_0}{\partial z_0} & = & -\frac{y_0}{a_0}\,a_0(1-a_0) & = & -y_0(1-a_0) \\ \frac{\partial L}{\partial a_1}\frac{\partial a_1}{\partial z_0} & = & -\frac{y_1}{a_1}(-a_1a_0) & = & y_1a_0 \\ \frac{\partial L}{\partial a_2}\frac{\partial a_2}{\partial z_0} & = & -\frac{y_2}{a_2}(-a_2a_0) & = & y_2a_0 \\ \frac{\partial L}{\partial z_0} & = & -y_0+y_0a_0 +y_1a_0+y_2a_0 & = & (y_0+y_1+y_2)a_0 -y_0 \end{array} $$ As $y$ is one-hot encoded, $\sum_j y_j = 1$, so $\frac{\partial L}{\partial z_0} = a_0-y_0$. In general, $$ \begin{array}{lcl} \frac{\partial L}{\partial z_k} & = & \sum_{j=1}^C \frac{\partial L}{\partial a_j}\frac{\partial a_j}{\partial z_k} \\ & = & -\frac{y_k}{a_k}\, a_k(1-a_k) + \sum_{j \ne k} \left(-\frac{y_j}{a_j}\right)\left(-a_ja_k\right) \\ & = & -y_k(1-a_k) + \sum_{j \ne k} y_j a_k \\ & = & \left(\sum_{j=1}^C y_j\right)a_k - y_k \\ & = & a_k - y_k \end{array} $$
In conclusion: $$ dz^{[L]} = \hat y - y $$
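A small numerical sanity check of this result (my own illustration, not from the notes): the finite-difference gradient of the cross-entropy loss with respect to $z$ matches $a - y$.

```python
import numpy as np

def softmax(z):
    t = np.exp(z - z.max())
    return t / t.sum()

z = np.array([1.0, 2.0, 0.5])
y = np.array([0.0, 1.0, 0.0])                       # one-hot label
a = softmax(z)

eps = 1e-6
numeric = np.zeros_like(z)
for k in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[k] += eps
    zm[k] -= eps
    Lp = -np.sum(y * np.log(softmax(zp)))           # L(z_k + eps)
    Lm = -np.sum(y * np.log(softmax(zm)))           # L(z_k - eps)
    numeric[k] = (Lp - Lm) / (2 * eps)              # central difference

print(np.allclose(numeric, a - y, atol=1e-6))       # True: dz = a - y
```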
Choosing deep learning frameworks
- Ease of programming (development and deployment)
- Running speed
- Truly open (open source with good governance)
TensorFlow:
- Declare variables: `w = tf.Variable(0, dtype=tf.float32)`
- Define the cost: `cost = w**2 - 10*w + 25`
- Define the training step: `train = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)`
- `placeholder`: a variable whose value will be provided later (fed in at run time with `feed_dict`).
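Putting the pieces together, a minimal runnable sketch in the TensorFlow 1.x style used in these notes (in TF 2.x you would use `tf.compat.v1` or `tf.GradientTape` instead); the cost coefficients are fed in through a `placeholder`.

```python
import numpy as np
import tensorflow as tf   # TensorFlow 1.x style, as in the notes

coefficients = np.array([[1.0], [-10.0], [25.0]])        # cost = w^2 - 10w + 25

w = tf.Variable(0, dtype=tf.float32)
x = tf.placeholder(tf.float32, [3, 1])                   # values provided later
cost = x[0][0] * w**2 + x[1][0] * w + x[2][0]
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

init = tf.global_variables_initializer()
with tf.Session() as session:
    session.run(init)
    for _ in range(1000):
        session.run(train, feed_dict={x: coefficients})  # feed the placeholder
    print(session.run(w))                                # close to 5, the minimum of the cost
```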