[TOC]
Hyperparameters:
- Most important: $\alpha$
- Second important: $\beta$ (momentum term), number of layers, mini-batch size
- Third important: number of hidden units, learning rate decay
- $\beta_1=0.9, \beta_2=0.999, \varepsilon=10^{-8}$: normally we use the default values for Adam
How to tune:
- Don't use a grid; sample random values, because we don't know in advance which hyperparameter will be the most important
- Coarse to fine: zoom into a smaller region of hyperparameters and then sample more densely within this space
Appropriate scale for hyperparameters
- $\alpha$: from 0.0001 to 1, but it is better to sample on a log scale (see the sketch after this list): `r = -4*np.random.rand()  # r in [-4, 0]`, then `alpha = 10**r  # 10^-4 ... 10^0`
- $\beta$: from 0.9 to 0.999, i.e. $1-\beta$ in [0.001, 0.1]: `r = -2*np.random.rand() - 1  # r in [-3, -1]`, then `beta = 1 - 10**r  # beta in [0.9, 0.999]`
- It is important not to use a linear scale because $\frac{1}{1-\beta}$ is very sensitive when $\beta$ is close to 1:
  - $\beta: 0.9000 \rightarrow 0.9005$: we are still averaging over about 10 values, almost no change
  - $\beta: 0.999 \,(\approx 1000 \text{ samples}) \rightarrow 0.9995 \,(\approx 2000 \text{ samples})$: a huge difference
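To make the sampling above concrete, here is a minimal sketch, assuming numpy; the ranges follow the notes above, and you would then train one model per candidate and re-sample more densely around the best region (coarse to fine).

```python
import numpy as np

def sample_hyperparameters(rng=np.random):
    """Draw one (alpha, beta) pair at random on a log scale."""
    r_alpha = -4 * rng.rand()        # r in [-4, 0]
    alpha = 10 ** r_alpha            # alpha in [1e-4, 1]
    r_beta = -2 * rng.rand() - 1     # r in [-3, -1]
    beta = 1 - 10 ** r_beta          # beta in [0.9, 0.999]
    return alpha, beta

# Random search (not a grid): draw many candidates, train one model per
# candidate, keep the best, then re-sample around it (coarse to fine).
candidates = [sample_hyperparameters() for _ in range(25)]
```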
Babysitting one model: when you have a huge dataset but not enough computational resources
- Day 0 you might initialize your parameters randomly and then start training. Then you watch your learning curve gradually decrease over the days.
- Each day you nudge your parameters a little during training (increase/decrease the learning rate, add momentum if the model performs well; otherwise go back to the previous model).
- This is called the panda approach.
Training many models in parallel
- Run multiple models with different hyperparameters and compare their costs $J$
- Pick the best one
- Normalize inputs to speed up learning
- Batch normalization: normalize the hidden layer values $Z^{[l]}$ as well, so that training of the later layers also speeds up
Implementation
Given some intermediate values $Z^{(1)}, \dots, Z^{(m)}$ in the NN:
- Compute the mean: $\mu = \frac{1}{m} \sum_i Z^{(i)}$
- Compute the variance: $\sigma^2 = \frac{1}{m} \sum_i (Z^{(i)}-\mu)^2$
- Normalize: $$Z_{norm}^{(i)} = \frac{Z^{(i)}-\mu}{\sqrt{\sigma^2+\epsilon}}$$
Then $Z$ has mean 0 and variance 1. But we don't always want $Z$ to have mean 0 and variance 1 (for example, with a sigmoid activation we may not want all the values squeezed into the near-linear region around 0):
- $\widetilde{Z}^{(i)} = \gamma Z_{norm}^{(i)} + \beta$, where $\gamma, \beta$ are learnable parameters
- $\beta, \gamma$ can be learned using Adam, gradient descent with momentum, or RMSprop, not just with plain gradient descent
- They set the mean and the variance of the linear variable $z^{[l]}$ of a given layer
- If $$ \begin{aligned} \gamma & = \sqrt{\sigma^2 +\epsilon} \\ \beta & = \mu \end{aligned} $$ then $\widetilde{Z}^{(i)} = Z^{(i)}$ (batch norm just recovers the original $Z^{(i)}$)
- Batch normalization is usually applied per mini-batch.
- As we subtract the mean, adding a constant has no effect, so the parameters are just $W, \beta, \gamma$ (the bias $b^{[l]}$ can be dropped).
- Dimensions: $Z^{[l]}: [n^{[l]}, 1]$, $b^{[l]}: [n^{[l]}, 1]$, $\beta^{[l]}: [n^{[l]}, 1]$, $\gamma^{[l]}: [n^{[l]}, 1]$
Implementation
for t = 1 ... num_mini_batches
    Compute forward prop on X^{t}
        In each hidden layer, use BN to replace Z^[l] with Ztilde^[l]
    Use backprop to compute dW^[l], dbeta^[l], dgamma^[l]
    Update parameters:
        W^[l] = W^[l] - alpha * dW^[l]
        beta^[l] = beta^[l] - alpha * dbeta^[l]
        gamma^[l] = gamma^[l] - alpha * dgamma^[l]
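As a concrete illustration of the forward step in the loop above, here is a minimal numpy sketch (my own illustration, not the course's code; the helper name `batch_norm_forward` is assumed) of batch norm applied to one layer's $Z^{[l]}$ on a single mini-batch.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    # Z: (n_units, batch_size); gamma, beta: (n_units, 1)
    mu = Z.mean(axis=1, keepdims=True)        # per-unit mean over the mini-batch
    var = Z.var(axis=1, keepdims=True)        # per-unit variance over the mini-batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)    # mean 0, variance 1
    Z_tilde = gamma * Z_norm + beta           # learnable rescaling
    return Z_tilde, (Z_norm, mu, var)         # cache is reused in backprop

# usage on a toy layer with 4 units and a mini-batch of 8 examples
Z = np.random.randn(4, 8)
gamma, beta = np.ones((4, 1)), np.zeros((4, 1))
Z_tilde, cache = batch_norm_forward(Z, gamma, beta)
```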
- Same reason as normalizing the inputs $X$: it makes gradient descent easier.
- It makes changes of the weights in earlier layers have less effect on later layers.
- When the distribution of the input X changes (covariate shift), we have to retrain the model. With batch normalization, no matter how the hidden-layer values shift, their mean and variance remain the same, so later layers see a more stable distribution.
- It also has a regularization effect:
  - Each mini-batch is scaled by the mean and variance computed on just that mini-batch.
  - This adds some noise to $Z^{[l]}$ within that mini-batch because the mean/variance are calculated on a small number of samples. So, similar to dropout, it adds some noise to each hidden layer's activations.
  - Because of that, it has a slight regularization effect (it helps prevent overfitting).
  - If we use a larger mini-batch size, we reduce the noise and therefore the regularization effect.
- At training time (per mini-batch): $$ \begin{aligned} \mu & =\frac{1}{m} \sum_{i} z^{(i)} \\ \sigma^{2} & =\frac{1}{m} \sum_{i}\left(z^{(i)}-\mu\right)^{2} \\ z_{norm}^{(i)} &=\frac{z^{(i)}-\mu}{\sqrt{\sigma^{2}+\varepsilon}} \\ \tilde{z}^{(i)} &=\gamma z_{norm}^{(i)}+\beta \end{aligned} $$ At test time we don't have mini-batches:
- Estimate $\mu, \sigma^2$ using an exponentially weighted average across the mini-batches seen during training.
- At test time, compute $z_{norm}$ and $\widetilde{Z}$ with the estimated $\mu, \sigma^2$.
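A minimal sketch of how those test-time statistics could be tracked, assuming a small helper class of my own (`RunningStats`, with an assumed `momentum` parameter) that keeps exponentially weighted averages of the per-mini-batch mean and variance; deep learning frameworks handle this for you.

```python
import numpy as np

class RunningStats:
    def __init__(self, n_units, momentum=0.9):
        self.momentum = momentum
        self.mu = np.zeros((n_units, 1))
        self.var = np.ones((n_units, 1))

    def update(self, Z):
        # Z: (n_units, batch_size); called once per mini-batch during training
        mu_batch = Z.mean(axis=1, keepdims=True)
        var_batch = Z.var(axis=1, keepdims=True)
        self.mu = self.momentum * self.mu + (1 - self.momentum) * mu_batch
        self.var = self.momentum * self.var + (1 - self.momentum) * var_batch

    def normalize_test(self, Z, gamma, beta, eps=1e-8):
        # at test time, use the running estimates instead of batch statistics
        Z_norm = (Z - self.mu) / np.sqrt(self.var + eps)
        return gamma * Z_norm + beta
```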
Softmax regression
- C: number of classes
- $Z^{[L]} = W^{[L]}a^{[L-1]}+b^{[L]}$
- Activation function: $$ \begin{array}{lcl} t & = & e^{z^{[L]}} \\ a^{[L]} & = & \frac{e^{z^{[L]}}}{\sum_{j=1}^C t_j} \\ a^{[L]}_i & = & \frac{t_i}{\sum_{j=1}^C t_j} \end{array} $$
If we use subindex $j$ for $a$ and $k$ for $z$:
$$
a_j = \frac{e^{z_j}}{\sum_{c=1}^C e^{z_c}}
$$
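As a quick illustration, here is a minimal numpy version of this activation (the max-subtraction is a standard numerical-stability trick, not part of the notes).

```python
import numpy as np

def softmax(z):
    t = np.exp(z - z.max(axis=0, keepdims=True))   # t = e^z, shifted for stability
    return t / t.sum(axis=0, keepdims=True)        # a_i = t_i / sum_j t_j

z = np.array([[5.0], [2.0], [-1.0], [3.0]])        # Z^[L] for C = 4 classes
a = softmax(z)                                      # entries are in (0, 1) and sum to 1
```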
The cost function is defined as: $$ J = \frac{1}{m}\sum^m_{i=1}L(\hat y^{(i)}, y^{(i)}) $$
When we compute the backward propagation, we divide it into 3 steps:
- Derivative of the loss function: $\frac{\partial L}{\partial a}$
- Derivative of the activation function: $\frac{\partial a}{\partial z}$
- Then $\frac{\partial J}{\partial z} = \frac{\partial J}{\partial L}\frac{\partial L}{\partial a}\frac{\partial a}{\partial z}$
Derivative of the softmax activation function
Following the forward-propagation example, when we differentiate $a_j$ with respect to $z_k$ there are two cases:
- $j = k$: $$ \begin{array}{rcl} \frac{\partial a_j}{\partial z_k} & = & \frac{e^{z_j}\sum_{c=1}^C e^{z_c} - e^{z_k}e^{z_j}}{\left(\sum_{c=1}^C e^{z_c}\right)^2} \\ & = & \frac{e^{z_j}}{\sum_{c=1}^C e^{z_c}}\left(1 - \frac{e^{z_k}}{\sum_{c=1}^C e^{z_c}}\right) \\ & = & a_j (1-a_k) \\ & = & a_j (1-a_j) \end{array} $$
- $j \ne k$: $$ \begin{array}{rcl} \frac{\partial a_j}{\partial z_k} & = & \frac{0\cdot\sum_{c=1}^C e^{z_c} - e^{z_j}e^{z_k}}{\left(\sum_{c=1}^C e^{z_c}\right)^2} \\ & = & -\frac{e^{z_j}}{\sum_{c=1}^C e^{z_c}}\cdot\frac{e^{z_k}}{\sum_{c=1}^C e^{z_c}} \\ & = & -a_ja_k \end{array} $$ Then $$ \frac{\partial a_j}{\partial z_k} = \left\{ \begin{array}{ll} a_j(1-a_j) & \text{if } j = k \\ -a_ja_k & \text{if } j \ne k \end{array} \right. $$
Loss function
The loss function for a single sample compares the real label with the prediction: $$ L(a, y) = -\sum^C_{j=1}y_j \log a_j \qquad\Rightarrow\qquad \frac{\partial L}{\partial a_j} = -\frac{y_j}{a_j} $$ Then, in the previous example (differentiating with respect to $z_0$), the chain-rule terms are $$ \begin{array}{lclcl} \frac{\partial L}{\partial a_0}\frac{\partial a_0}{\partial z_0} & = & -\frac{y_0}{a_0}\,a_0(1-a_0) & = & -y_0(1-a_0) \\ \frac{\partial L}{\partial a_1}\frac{\partial a_1}{\partial z_0} & = & -\frac{y_1}{a_1}(-a_1a_0) & = & y_1a_0 \\ \frac{\partial L}{\partial a_2}\frac{\partial a_2}{\partial z_0} & = & -\frac{y_2}{a_2}(-a_2a_0) & = & y_2a_0 \\ \frac{\partial L}{\partial z_0} & = & -y_0+y_0a_0 +y_1a_0+y_2a_0 & = & (y_0+y_1+y_2)a_0 -y_0 \end{array} $$ As $y$ is one-hot encoded, $\sum_j y_j = 1$, so $\frac{\partial L}{\partial z_0} = a_0-y_0$. In general, $$ \begin{array}{lcl} \frac{\partial L}{\partial z_k} & = & \sum_{j=1}^C \frac{\partial L}{\partial a_j}\frac{\partial a_j}{\partial z_k} \\ & = & -\frac{y_k}{a_k}\, a_k(1-a_k) + \sum_{j \ne k} \left(-\frac{y_j}{a_j}\right)\left(-a_ja_k\right) \\ & = & -y_k(1-a_k) + \sum_{j \ne k} y_j a_k \\ & = & \left(\sum_{j=1}^C y_j\right)a_k - y_k \\ & = & a_k - y_k \end{array} $$
In conclusion: $$ dz^{[L]} = \hat y - y $$
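A small numerical sanity check of this result (my own illustration, not from the notes): the finite-difference gradient of the cross-entropy loss with respect to $z$ matches $a - y$.

```python
import numpy as np

def softmax(z):
    t = np.exp(z - z.max())
    return t / t.sum()

z = np.array([1.0, 2.0, 0.5])
y = np.array([0.0, 1.0, 0.0])                       # one-hot label
a = softmax(z)

eps = 1e-6
numeric = np.zeros_like(z)
for k in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[k] += eps
    zm[k] -= eps
    Lp = -np.sum(y * np.log(softmax(zp)))           # L(z_k + eps)
    Lm = -np.sum(y * np.log(softmax(zm)))           # L(z_k - eps)
    numeric[k] = (Lp - Lm) / (2 * eps)              # central difference

print(np.allclose(numeric, a - y, atol=1e-6))       # True: dz = a - y
```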
Choosing deep learning frameworks
- Ease of programming (development and deployment)
- Running speed
- Truly open (open source with good governance)
TensorFlow:
- Declare variables: `w = tf.Variable(0, dtype=tf.float32)`
- Define the cost: `cost = w**2 - 10*w + 25`
- Define the training step: `train = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)`
- `placeholder`: a variable whose value will be provided later (fed in at run time with `feed_dict`).
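Putting the pieces together, a minimal runnable sketch in the TensorFlow 1.x style used in these notes (in TF 2.x you would use `tf.compat.v1` or `tf.GradientTape` instead); the cost coefficients are fed in through a `placeholder`.

```python
import numpy as np
import tensorflow as tf   # TensorFlow 1.x style, as in the notes

coefficients = np.array([[1.0], [-10.0], [25.0]])        # cost = w^2 - 10w + 25

w = tf.Variable(0, dtype=tf.float32)
x = tf.placeholder(tf.float32, [3, 1])                   # values provided later
cost = x[0][0] * w**2 + x[1][0] * w + x[2][0]
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

init = tf.global_variables_initializer()
with tf.Session() as session:
    session.run(init)
    for _ in range(1000):
        session.run(train, feed_dict={x: coefficients})  # feed the placeholder
    print(session.run(w))                                # close to 5, the minimum of the cost
```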