\chapter{Neural Networks}
\label{chap-neural_networks}
You've probably been
hearing a lot about ``neural networks.'' Now that we have several
useful machine-learning concepts (hypothesis classes, classification,
regression, gradient descent, regularization, etc.) we are
well equipped to understand neural networks in detail.
This is, in some sense, the ``third wave'' of neural nets. The basic
idea is founded on the 1943 model of neurons of McCulloch and Pitts
and the learning ideas of Hebb. There was a great deal of excitement, but
not a lot of practical success: there were good training methods
(e.g., perceptron) for linear functions, and interesting examples of
non-linear functions, but no good way to train non-linear functions
from data. Interest died out for a while, but was re-kindled in the
1980s when several people \note{As with many good ideas in science,
the basic idea for how to train non-linear neural networks with
gradient descent was independently developed by more than one
researcher.} came up with a way to train neural networks with
``back-propagation,'' a particular style of implementing
gradient descent that we will study here. By the mid-1990s, the
enthusiasm waned again, because although we could train non-linear
networks, the training tended to be slow and was plagued by a problem
of getting stuck in local optima. Support vector machines ({\sc svm}s),
which regularize high-dimensional hypotheses by seeking to maximize
the margin, and kernel methods, which are an efficient and beautiful way of
using feature transformations to map data non-linearly into a
higher-dimensional space, provided reliable learning methods with
guaranteed convergence and no local optima.
However, during the {\sc svm} enthusiasm, several groups kept working
on neural networks, and their work, in combination with an increase in
available data and computation, has made them rise again. They have
become much more reliable and capable, and are now the method of
choice in many applications. There are many, many \note{The number
increases daily, as may be seen on {\tt arxiv.org}.} variations of
neural networks, which we can't even begin to survey. We will study
the core ``feed-forward'' networks with ``back-propagation''
training, and then, in later chapters, address some of the major
advances beyond this core.
We can view neural networks from several different perspectives:
\begin{description}
\item{\bf View 1}: An application of stochastic gradient descent for
classification and regression with a potentially very rich
hypothesis class.
\item{\bf View 2}: A brain-inspired network of neuron-like computing
elements that learn distributed representations.
\item{\bf View 3}: A method for building applications that make
predictions based on huge amounts of data in very complex domains.
\end{description}
We will mostly take view 1, with the understanding that the techniques
we develop will enable the applications in view 3. View 2 was a major
motivation for the early development of neural networks, but the
techniques we will study do not \note{Some prominent
researchers are, in fact, working hard to find analogues of these methods in
the brain.} seem to actually account for the biological learning
processes in brains.
\section{Basic element}
The basic element of a neural network is a ``neuron,'' pictured
schematically below. We will also sometimes refer to a
neuron as a ``unit'' or ``node.'' \index{neuron}
\begin{center}
\begin{tikzpicture}[main node/.style={thick,circle,font=\Large}]
\node[main node,draw](sum) at (0,0) {$\sum$};
\node[main node](x1) at (-2,1) {$x_1$};
\node[main node](dots) at (-2,0) {$\vdots$};
\node[main node](xm) at (-2,-1) {$x_m$};
\coordinate (w0) at (0,-1.5);
\node[main node, draw] (f) at (2,0) {$f(\cdot)$};
\node[main node] (y) at (4,0) {$a$};
\draw[->, above right] (x1) -- node {$w_1$} (sum);
\draw (dots) -- node {} (sum);
\draw[->, below right] (xm) -- node {$w_m$} (sum);
\draw[above right] (w0) node {$w_0$} -- (sum) ;
\draw[->, above] (sum) -- node (z) {$z$} (f);
\draw[->, above] (f) -- (y);
\node[black!70] (inp) at ($(xm) + (0,-1.5)$) {input};
\node[black!70] (pre) at ($(z) + (0,1)$) {pre-activation};
\node[black!70] (out) at ($(y) + (0,1)$) {output};
\node[black!70] (act) at ($(f) - (0,1.5)$) {activation function};
\draw[->,black!70] (inp) -- (xm);
\draw[->,black!70] (pre) -- (z);
\draw[->,black!70] (out) -- (y);
\draw[->,black!70] (act) -- (f);
\end{tikzpicture}
\end{center}
It is a non-linear function of an input vector $x \in \R^m$ \note{Sorry
for changing our notation here. We were using $d$ as the dimension
of the input, but we are trying to be consistent here with many
other accounts of neural networks. It is impossible to be
consistent with all of them though---there are many different ways
of telling this story.} to a single output value $a \in \R$. It is
parameterized by a vector of {\em weights} $(w_1, \ldots, w_m) \in
\R^m$ and an {\em offset} \note{This
should remind you of our $\theta$ and $\theta_0$ for linear models.} or {\em threshold} $w_0 \in \R$. In order for the neuron to be non-linear, we also specify an {\em
activation function} $f : \R \rightarrow \R$, which can be the
identity ($f(x) = x$, in which case the neuron is a linear function of $x$), but it can also be any other function, though we
will only be able to work with it if it is differentiable.
The function represented by the neuron is expressed as:
\[a = f(z) = f\left(\left(\sum_{j=1}^m x_jw_j\right) + w_0\right) = f(w^Tx + w_0)\;\;.\]
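
For concreteness, here is a minimal sketch of this computation in
Python with NumPy; the names \texttt{neuron}, \texttt{w}, and
\texttt{w0} are just illustrative choices, not a standard API:
\begin{verbatim}
import numpy as np

def neuron(x, w, w0, f):
    z = np.dot(w, x) + w0    # pre-activation: z = w^T x + w_0
    return f(z)              # output: a = f(z)

# example: a sigmoid neuron with m = 3 inputs
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
x = np.array([1.0, -2.0, 0.5])
w = np.array([0.1, 0.2, -0.3])
a = neuron(x, w, -0.05, sigmoid)   # a scalar in (0, 1)
\end{verbatim}
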
Before thinking about a whole network, we can consider how to train a
single unit. Given a loss function $\mathcal{L}(\text{guess}, \text{actual})$
and a dataset $\{(\ex{x}{1}, \ex{y}{1}), \ldots,
(\ex{x}{n},\ex{y}{n})\}$, we can
do (stochastic) gradient descent, adjusting the weights
$w, w_0$ to minimize
\[J(w, w_0) = \sum_{i} \mathcal{L}\left(\text{NN}(\ex{x}{i}; w, w_0), \ex{y}{i}\right)\;,\]
where $\text{NN}$ is the output of our single-unit neural net for a given input.
We have already studied two special cases of the neuron:
linear logistic classifiers (LLCs) with NLL loss and regressors with
quadratic loss! The activation function for the LLC is $f(x) =
\sigma(x)$ and for linear regression it is simply $f(x) = x$.
\question{
Just for a single neuron, imagine for some reason, that we decide to
use activation function $f(z) = e^z$ and loss function $\mathcal{L}(\text{guess}, \text{actual}) = (\text{guess} - \text{actual})^2$. Derive a gradient descent update for $w$ and $w_0$.
}
\section{Networks}
Now, we'll put multiple neurons together into a {\em network}. A
neural network \index{neural network}in general takes in an input $x \in \R^m$ and generates
an output $a \in \R^n$. It is constructed out of multiple neurons;
the inputs of each neuron might be elements of $x$ and/or outputs of
other neurons. The outputs are generated by $n$ {\em output units}. \index{neural network!output units}
In this chapter, we will only consider {\it feed-forward} networks\index{neural network!feed-forward network}. In
a feed-forward network, you can think of the network as defining a
function-call graph that is {\em acyclic}: that is, the input to a
neuron can never depend on that neuron's output. Data flows one way,
from the inputs to the outputs, and the function computed by the
network is just a composition of the functions computed by the
individual neurons.
Although the graph structure of a feed-forward neural network can really be
anything (as long as it satisfies the feed-forward constraint), for
simplicity in software and analysis, we usually organize them into
{\em layers}\index{neural network!layers}. A layer is a group of neurons that are essentially
``in parallel'': their inputs are outputs of neurons in the previous
layer, and their outputs are the input to the neurons in the next
layer. We'll start by describing a single layer, and then go on to
the case of multiple layers.
\subsection{Single layer}
A {\em layer} is a set of units that, as we have just described, are
not connected to each other. The layer is called
{\em fully connected}\index{neural network!fully connected} if, as in the diagram below,
all of the inputs
% the inputs to each unit in the layer are the same
(i.e., $x_1, x_2, \ldots, x_m$ in this
case) are connected to every unit in the layer. A layer has input $x \in \R^m$ and output (also known as
{\em activation}) $a \in \R^n$.
\begin{center}
\begin{tikzpicture}[main node/.style={circle}]
\coordinate (soff) at (0,1.4);
\node[main node,draw](s1) at ($2*(soff)$) {$\sum$};
\node[main node,draw](s2) at (soff) {$\sum$};
\node[main node,draw](s3) at (0,0) {$\sum$};
\node[main node](sdots) at ($(0,0)-(soff)$) {$\vdots$};
\node[main node,draw](sn) at ($(0,0)-2*(soff)$) {$\sum$};
% inputs
\coordinate (x) at (-2,0);
\coordinate (xoff) at (0,1);
\node[main node](x1) at ($(x) + 1.5*(xoff)$) {$x_1$};
\node[main node](x2) at ($(x) + .5*(xoff)$) {$x_2$};
\node[main node](xdots) at ($(x) - .5*(xoff)$) {$\vdots$};
\node[main node](xm) at ($(x) - 1.5*(xoff)$) {$x_m$};
% activations
\coordinate (foff) at (1.5,0);
\node[main node, draw](f1) at ($(s1) + (foff)$) {$f$};
\node[main node, draw](f2) at ($(s2) + (foff)$) {$f$};
\node[main node, draw](f3) at ($(s3) + (foff)$) {$f$};
\node[main node] at ($(sdots) + (foff)$) {$\vdots$};
\node[main node, draw](fn) at ($(sn) + (foff)$) {$f$};
% outputs
\node[main node](y1) at ($(s1) + 2*(foff)$) {$a_1$};
\node[main node](y2) at ($(s2) + 2*(foff)$) {$a_2$};
\node[main node](y3) at ($(s3) + 2*(foff)$) {$a_3$};
\node[main node] at ($(sdots) + 2*(foff)$) {$\vdots$};
\node[main node](yn) at ($(sn) + 2*(foff)$) {$a_n$};
% arrows
\foreach \b in {(s1), (s2), (s3), (sn)}
\foreach \a in {(x1), (x2), (xm)}
\draw[->] \a -- \b;
\foreach \x/\y/\z in {(s1)/(f1)/(y1), (s2)/(f2)/(y2),
(s3)/(f3)/(y3), (sn)/(fn)/(yn)}{
\draw[->] \x -- \y;
\draw[->] \y -- \z;
}
%\foreach \x in {(s1), (s2), (s3), (sn)}
% \draw \x -- +(0,-.65);
\node[main node, below left] (weights) at ($(xm)!0.5!(sn)$)
{$W,W_0$};
\end{tikzpicture}
\end{center}
Since each unit has a vector of weights and a single offset, we can
think of the weights of the whole layer as a matrix, $W$, and the
collection of all the offsets as a vector $W_0$.
If we have $m$ inputs, $n$ units, and $n$ outputs, then
\begin{itemize}
\item $W$ is an $m\times n$ matrix,
\item $W_0$ is an $n \times 1$ column vector,
\item $X$, the input, is an $m \times 1$ column vector,
\item $Z = W^T X + W_0$, the {\em pre-activation}, is an $n \times
1$ column vector,
\item $A$, the {\em activation}, is an $n \times 1$ column vector,
\end{itemize}
and the output vector is
\[A = f(Z) = f(W^TX + W_0)\;\;.\]
The activation function $f$ is applied element-wise to the
pre-activation values $Z$.
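
Since the layer computation $A = f(W^TX + W_0)$ is a matrix product
followed by an element-wise function, a single-layer forward pass is a
few lines of NumPy. A sketch, using the shape conventions above:
\begin{verbatim}
import numpy as np

def layer_forward(X, W, W0, f):
    # X: m x 1 input, W: m x n weights, W0: n x 1 offsets
    Z = W.T @ X + W0    # pre-activation, n x 1
    return f(Z)         # activation applied element-wise, n x 1

m, n = 4, 3
X  = np.random.randn(m, 1)
W  = np.random.randn(m, n)
W0 = np.zeros((n, 1))
A  = layer_forward(X, W, W0, np.tanh)   # shape (3, 1)
\end{verbatim}
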
% What can we do with a single layer? We have already seen single-layer
% networks, in the form of linear separators and linear regressors. All
% we can do with a single layer is make a linear hypothesis.~\note{We
% have used a step or sigmoid function to transform the linear output
% value for classification, but it's important to be clear that the
% resulting {\em separator} is still linear.} The whole reason for
% moving to neural networks is to move in the direction of {\em
% non-linear} hypotheses. To do this, we will have to consider
% multiple layers, where we can view the last layer as still being a
% linear classifier or regressor, but where we interpret the previous
% layers as learning a non-linear feature transformation $\phi(x)$,
% rather than having us hand-specify it.
\subsection{Many layers}
A single neural network generally combines multiple layers, most
typically by feeding the outputs of one layer into the inputs of
another layer.
We have to start by establishing some nomenclature. We will use $l$
to name a layer, and let $m^l$ be the number of inputs to the layer
and $n^l$ be the number of outputs from the layer. Then, $W^l$ and
$W^l_0$ are of shape $m^l \times n^l$ and $n^l \times 1$,
respectively. Note that the input to layer $l$ is the output from
layer $l-1$, so we have $m^l= n^{l-1}$, and as a result $A^{l-1}$ is
of shape $m^l \times 1$, or equivalently $n^{l-1} \times 1$. Let $f^l$
be the activation function\index{activation function} of layer
$l$. \note{It is technically possible to have different activation
functions within the same layer, but, again, for convenience in
specification and implementation, we generally have the same
activation function within a layer.} Then, the pre-activation outputs
are the $n^l \times 1$ vector
\[Z^l = {W^l}^TA^{l-1} + W^l_0\]
and the activation outputs are simply the $n^l \times 1$ vector
\[A^l = f^l(Z^l)\;\;.\]
Here's a diagram of a many-layered network, with two blocks
for each layer, one representing the linear part of the operation and
one representing the non-linear activation function. We will use this
structural decomposition to organize our algorithmic thinking and
implementation.
\begin{center}
\begin{tikzpicture}
\coordinate (x) at (0,0);
\node[inner sep=0em] (w1) at (1.7,0)
{\begin{tabular}{c} $W^1$ \\ $W^1_0$\end{tabular}};
\node[inner sep=1em] (f1) at ($2*(w1)$) {$f^1$};
\node[inner sep=0em] (w2) at ($3*(w1)$)
{\begin{tabular}{c} $W^2$ \\ $W^2_0$\end{tabular}};
\node[inner sep=1em] (f2) at ($4*(w1)$) {$f^2$};
\node (dots) at ($5*(w1)$) {$\cdots$};
\node[inner sep=0em] (wL) at ($6*(w1)$)
{\begin{tabular}{c} $W^L$ \\ $W^L_0$\end{tabular}};
\node[inner sep=1em] (fL) at ($7*(w1)$) {$f^L$};
\coordinate (y) at ($8*(w1) - (.3,0)$);
\draw[->] (x) -- node[above] {$X = A^0$} (w1);
\draw[->] (w1) -- node[above] {$Z^1$} (f1);
\draw[->] (f1) -- node[above] {$A^1$} (w2);
\draw[->] (w2) -- node[above] {$Z^2$} (f2);
\draw[->] (f2) -- node[above] {$A^2$} (dots);
\draw[->] (dots) -- node[above] {$A^{L-1}$} (wL);
\draw[->] (wL) -- node[above] {$Z^L$} (fL);
\draw[->] (fL) -- node[above] {$A^L$} (y);
\draw [decorate,decoration={brace,mirror,amplitude=10pt}]
($(w1) + (-.4, -.7)$) -- node[yshift=-1.8em] {layer 1}
($(f1) + (.4,-.7)$);
\draw [decorate,decoration={brace,mirror,amplitude=10pt}]
($(w2) + (-.4, -.7)$) -- node[yshift=-1.8em] {layer 2}
($(f2) + (.4,-.7)$);
\draw [decorate,decoration={brace,mirror,amplitude=10pt}]
($(wL) + (-.4, -.7)$) -- node[yshift=-1.8em] {layer $L$}
($(fL) + (.4,-.7)$);
\foreach \point in {w1, f1, w2, f2, wL, fL}{
\draw ($(\point) + (-.3,-.5)$) rectangle ($(\point) + (.3,.5)$);
}
\end{tikzpicture}
\end{center}
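
Following this block structure, the whole forward pass is just a loop
over layers. A sketch, where \texttt{layers} is a hypothetical list of
$(W^l, W^l_0, f^l)$ triples:
\begin{verbatim}
def forward(X, layers):
    # layers: a list of (W, W0, f) triples, one per layer
    A = X                  # A^0 = X
    for W, W0, f in layers:
        Z = W.T @ A + W0   # Z^l = (W^l)^T A^{l-1} + W^l_0
        A = f(Z)           # A^l = f^l(Z^l)
    return A               # A^L, the network's output
\end{verbatim}
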
\section{Choices of activation function}
There are many possible choices for the activation function. We will
start by thinking about whether it's really necessary to have an $f$
at all.
What happens if we let $f$ be the identity? Then, in a network with
$L$ layers (we'll leave out $W_0$ for simplicity, but keeping it
wouldn't change the form of this argument),
\[A^L = {W^L}^T A^{L-1} =
{W^L}^T {W^{L-1}}^T \cdots {W^1}^T X\;\;.\]
So, multiplying out the weight matrices, we find that
\[A^L = W^\text{total}X\;\;,\]
which is a {\em linear} function of $X$!
Having all those layers did not change the representational
capacity of the network: the non-linearity of the activation function
is crucial.
\question{Convince yourself that any function representable by any
number of linear layers (where $f$ is the identity function) can be
represented by a single layer.}
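
A quick numerical check of this collapse, for two layers (the general
case follows by repeating the argument):
\begin{verbatim}
import numpy as np

W1 = np.random.randn(4, 5)   # layer 1: 4 inputs, 5 units
W2 = np.random.randn(5, 3)   # layer 2: 5 inputs, 3 units
X  = np.random.randn(4, 1)

two_layers = W2.T @ (W1.T @ X)
W_total    = W1 @ W2         # so that W_total^T = W2^T W1^T
one_layer  = W_total.T @ X
assert np.allclose(two_layers, one_layer)
\end{verbatim}
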
Now that we are convinced we need a non-linear activation, let's
examine a few common choices.
These are shown mathematically below, followed by plots of these functions.
\begin{description}
\item{\bf Step function:} \index{activation function!step function}
$$\text{step}(z) =
\begin{cases}
0 & \text{if $z<0$} \\
1 & \text{otherwise}
\end{cases}$$
\item{\bf Rectified linear unit (ReLU):} \index{activation function!ReLU}
$$\text{ReLU}(z) =
\begin{cases}
0 & \text{if $z<0$} \\
z & \text{otherwise}
\end{cases} = \max(0,z)$$
\item{\bf Sigmoid function:}\index{activation function!sigmoid} Also known as a {\em logistic} function. This can sometimes
be interpreted as a probability, because for any value of $z$ the
output is in $(0, 1)$:
$$\sigma(z) = \frac{1}{1+e^{-z}}$$
\item{\bf Hyperbolic tangent:}\index{activation function!hyperbolic tangent} Always in the range $(-1, 1)$:
$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$
\item{\bf Softmax function:}\index{activation function!softmax}
Takes a whole vector $z \in \R^n$ and generates as output a vector
$A \in (0, 1)^n$ with the property that $\sum_{i = 1}^n A_i = 1$,
which means we can interpret it as a probability distribution over $n$ items:
\[\text{softmax}(z) =
\begin{bmatrix}
\exp(z_1) / \sum_{i} \exp(z_i) \\
\vdots \\
\exp(z_n) / \sum_{i} \exp(z_i)
\end{bmatrix}\]
\end{description}
\noindent\begin{center}
\resizebox{0.9\textwidth}{!}{%
\begin{tikzpicture}%[scale=.9]
\begin{axis}[
name=topleft,
axis lines=middle, axis equal image,
xmin=-2, xmax=2,
ymin=-.5, ymax=1.5,
xlabel={$z$}, ylabel={$\text{step}(z)$},
]
\addplot [domain=-2:0, samples=2, ultra thick] {0};
\addplot [domain=0:2, samples=2, ultra thick] {1};
\end{axis}
\begin{axis}[
name=topright,
axis lines=middle, axis equal image,
xmin=-2, xmax=2,
ymin=-.5, ymax=1.5,
xlabel={$z$}, ylabel={$\text{ReLU}(z)$},
at = (topleft.east), anchor=west, xshift=.7cm
]
\addplot [domain=-2:0, samples=2, ultra thick] {0};
\addplot [domain=0:2, samples=2, ultra thick] {x};
\end{axis}
\begin{axis}[
name=bottomleft,
axis lines=middle, %axis equal image,
xmin=-4, xmax=4,
ymin=-1.25, ymax=1.25,
xlabel={$z$}, ylabel={$\sigma(z)$},
at = (topleft.south), anchor=north, yshift=-.7cm
]
\addplot [domain=-4:4, samples=100, ultra thick] {1/(1+exp(-x))};
\addplot [domain=-4:4, samples=2, dashed] {1};
\addplot [domain=-4:4, samples=2, dashed] {0};
\end{axis}
\begin{axis}[
name=bottomright,
axis lines=middle, %axis equal image,
xmin=-4, xmax=4,
ymin=-1.25, ymax=1.25,
xlabel={$z$}, ylabel={$\tanh(z)$},
at = (bottomleft.east), anchor=west, xshift=.7cm
]
\addplot [domain=-4:4, samples=100, ultra thick]
{(exp(x) - exp(-x))/(exp(x) + exp(-x))};
\addplot [domain=-4:4, samples=2, dashed] {1};
\addplot [domain=-4:4, samples=2, dashed] {-1};
\end{axis}
\end{tikzpicture}
}%
\end{center}
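
All of these activations are one- or two-liners in NumPy; here is a
sketch (the subtraction of the max in \texttt{softmax} is a standard
trick for numerical stability, not part of the mathematical
definition):
\begin{verbatim}
import numpy as np

step    = lambda z: np.where(z < 0, 0.0, 1.0)
relu    = lambda z: np.maximum(0.0, z)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
tanh    = np.tanh

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / np.sum(e)
\end{verbatim}
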
The original idea for neural networks involved using the {\bf step}
function as an activation, but because the derivative of the
step function is zero everywhere except at the discontinuity
(and there it is undefined), gradient-descent methods won't be useful
in finding a good setting of the weights, and so we won't
consider step functions further. They have been replaced, in a sense, by the
sigmoid, ReLU, and tanh activation functions.
\question{
Consider sigmoid, ReLU, and tanh activations. Which one is most like
a step function? Is there an additional parameter you could add to a
sigmoid that would make it be more like a step function?
}
\question{
What is the derivative of the ReLU function? Are there some values of
the input for which the derivative vanishes?
}
ReLUs are especially common in internal (``hidden'') layers,
sigmoid activations are common at the output for binary
classification, and softmax activations are common at the output for multi-class classification
(see Section~\ref{logistic} for an explanation).
\section{Loss functions and activation functions}\index{neural network!loss and activation function choices}
Different loss functions make different assumptions about the range of
values they will get as input and, as we have seen, different
activation functions will produce output values in different ranges.
When you are designing a neural network, it's important to make these
things fit together well. In particular, we will think about matching
loss functions with the activation function in the last layer, $f^L$.
Here is a table of loss functions and activations that make sense for them:
\begin{center}
\begin{tabular}{c c l}
Loss & $f^L$ & task \\
\hline
squared & linear & regression \\
% hinge & linear \\
{\sc nll} & sigmoid & binary classification \\
{\sc nllm} & softmax & multi-class classification
\end{tabular}
\end{center}
We explored squared loss in Chapter~\ref{chap-regression}, and {\sc nll} and {\sc nllm} in
Chapter~\ref{chap-classification}.
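
Using the definitions from those chapters, each pairing in the table
is a one-liner; a sketch, where \texttt{guess} is the network output
$A^L$ and \texttt{actual} is the label:
\begin{verbatim}
import numpy as np

def squared_loss(guess, actual):   # regression, linear f^L
    return (guess - actual) ** 2

def nll_loss(guess, actual):       # binary classification, sigmoid f^L
    return -(actual * np.log(guess)
             + (1 - actual) * np.log(1 - guess))

def nllm_loss(guess, actual):      # multi-class, softmax f^L
    # guess and actual are length-K vectors; actual is one-hot
    return -np.sum(actual * np.log(guess))
\end{verbatim}
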
% \subsection{Two-class classification and log likelihood}
% For classification, the natural loss function is 0-1 loss, but we have
% already discussed the fact that it's very inconvenient for
% gradient-based learning because its derivative is discontinuous.
% It is nice
% and smooth, and extends nicely to multiple classes as we will see below.
% {\em Hinge loss} gives us another way, for binary classification
% problems, to make a smoother objective, penalizing the {\em margin}s
% of the labeled points relative to the separator. The hinge loss is
% defined to be
% \[\mathcal{L}_h(\text{guess},\text{actual}) = \max(1 - \text{guess} \cdot
% \text{actual}, 0)\;\;,\]
% when $\text{actual} \in \{+1, -1\}$. It has the property that, if the
% sign of $\text{guess}$ is the same as the sign of $\text{actual}$ and the
% magnitude of $\text{guess}$ is greater than $1$, then the loss is 0.
% It is trying to enforce not only that the guess have the correct sign,
% but also that it should be some distance away from the separator.
% Using hinge loss, together with a squared-norm regularizer, actually
% forces the learning process to try to find a separator that has the
% maximum {\em margin} relative to the data set. This optimization
% set-up is called a {\em support vector machine}, and was popular
% before the renaissance of neural networks and gradient descent,
% because it has a quadratic form that makes it particularly easy to
% optimize.
% Let's assume that the activation function on the output layer is a
% sigmoid and that there is a single unit in the output layer, so the
% output of the whole neural network is a scalar, $a^L$. Because $f^L$
% is a sigmoid, we know $a^L \in [0, 1]$, and we can interpret it as the
% probability that the input $x$ is a positive example. Let us
% further assume that the labels in the training data are $y \in \{0,
% 1\}$, so they can also be interpreted as probabilities.
% We might want to pick the parameters of our network to maximize the
% probability that the network assigns the correct labels to all the
% points. That would be
% \begin{equation*}
% \prod_{i = 1}^n \begin{cases} \ex{a}{i} & \text{if $\ex{y}{i} =
% 1$} \\ 1 - \ex{a}{i} & \text{otherwise}
% \end{cases}\;\;,
% \end{equation*}
% under the assumption that our predictions are independent. This can
% be cleverly rewritten as
% \begin{equation*}
% \prod_{i = 1}^n {\ex{a}{i}}^{\ex{y}{i}}(1 - \ex{a}{i})^{1 - \ex{y}{i}}\;\;.
% \end{equation*}
% \question{Be sure you can see why these two expressions are the
% same.}
% Now, because products are kind of hard to deal with, and because the
% log function is monotonic, the $W$ that maximizes the log of this
% quantity will be the same as the $W$ that maximizes the original, so
% we can try to maximize
% \begin{equation*}
% \sum_{i = 1}^n {\ex{y}{i}}\log {\ex{a}{i}} + (1 - \ex{y}{i})\log(1 -
% \ex{a}{i})\;\;,
% \end{equation*}
% which we can write in terms of a loss function
% \begin{equation*}
% \sum_{i = 1}^n \mathcal{L}_\text{nll}(\ex{a}{i}, \ex{y}{i})
% \end{equation*}
% where $\mathcal{L}_\text{nll}$ is the {\em negative log likelihood}
% loss function:
% \begin{equation*}
% \mathcal{L}_\text{nll}(\text{guess},\text{actual}) =
% -\left(\text{actual}\cdot \log (\text{guess}) + (1 - \text{actual})\cdot\log (1 -
% \text{guess})\right) \;\;.
% \end{equation*}
% This loss function is also sometimes referred to as the {\em log loss}
% or {\em cross entropy}. \note{You can use any base for the logarithm
% and it won't make any real difference. If we ask you for numbers,
% use log base $e$.}
% \subsection{Multi-class classification and log likelihood}
% We can extend the idea of NLL directly to multi-class classification with
% $K$ classes, where the training label is represented with the one-hot
% vector
% $y=\begin{bmatrix} y_1, \ldots, y_K \end{bmatrix}^T$, where $y_k=1$
% if the example is of class $k$. Assume that our network uses {\em
% softmax} as the activation function in the last layer, so that the
% output is
% $a=\begin{bmatrix} a_1, \ldots, a_K
% \end{bmatrix}^T$, which represents a probability distribution over
% the $K$ possible classes. Then, the probability that our network
% predicts the correct class for this example is
% $\prod_{k=1}^K a_k^{y_k}$ and the log of the probability that it is
% correct is $\sum_{k=1}^K y_k \log a_k$, so
% \[
% \mathcal{L}_\text{nllm}(\text{guess},\text{actual}) =
% - \sum_{k=1}^K \text{actual}_k \cdot \log(\text{guess}_k) \;\;.
% \]
% We'll call this {\sc nllm} for {\em negative log likelihood
% multiclass.}
% \question{Show that $L_\text{nllm}$ for $K = 2$ is the same as
% $L_\text{nll}$. }
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Error back-propagation}
We will train neural networks using gradient descent methods.\index{gradient descent!applied to neural network training} It's
possible to use {\em batch} gradient descent, in which we sum up the
gradient over all the points (as in Section~\ref{sec-gd} of
chapter~\ref{chap-gradient}) or
stochastic gradient descent ({\sc sgd}), in which we take a small step with
respect to the gradient considering a single point at a time (as in
Section~\ref{sec-sgd} of Chapter~\ref{chap-gradient}).
Our notation is going to get pretty hairy pretty quickly. To keep it
as simple as we can, we'll focus on computing the contribution of one
data point $\ex{x}{i}$ to the gradient of the loss with respect to the
weights, for {\sc sgd}; you can simply sum up these gradients over all
the data points if you wish to do batch descent.
So, to do {\sc sgd} for a training example $(x, y)$, we need to
compute $\nabla_W \mathcal{L}(\text{NN}(x;W),y)$,
where $W$ represents all weights $W^l, W_0^l$ in all the layers $l =
(1, \ldots, L)$. This seems terrifying,
but is actually quite easy to do using the chain rule.
\note{Remember the chain rule! If $a = f(b)$ and $b = g(c)$, so that\\
$a = f(g(c))$, then \\$\frac{d a}{d c} = \frac{d a}{d b} \cdot \frac{d
b}{d c} = f'(b) g'(c) = f'(g(c)) g'(c)$.}
Remember that we are always computing the gradient of the loss
function {\em with respect to the weights} for a particular value of
$(x, y)$. That tells us how much we want to change the weights, in
order to reduce the loss incurred on this particular training example.
\subsection{First, suppose everything is one-dimensional}
To get some intuition for how these derivations work, we'll first suppose everything in our neural network is one-dimensional.
In particular, we'll assume there are $m^l = 1$ inputs and $n^l = 1$ outputs at every layer. So layer $l$ looks like:
\[
a^l = f^l(z^l), \quad z^l = w^l a^{l-1} + w^l_0.
\]
In the equation above, we're using the lowercase letters $a^l, z^l, w^l, a^{l-1}, w^l_0$ to emphasize that all of these quantities are scalars just for the moment. We'll look at the more general matrix case below.
To use {\sc sgd}, then, we want \note{Check your understanding: why do we need exactly these quantities for SGD?} to compute $\partial \mathcal{L}(\text{NN}(x;W),y) / \partial w^l$ and $\partial \mathcal{L}(\text{NN}(x;W),y) / \partial w_0^l$ for each layer $l$ and each data point $(x,y)$. Below we'll write ``loss'' as an abbreviation for $\mathcal{L}(\text{NN}(x;W),y)$. Then our first quantity of interest is $\partial \text{loss} / \partial w^l$. The chain rule gives us the following.
First, let's look at the case $l=L$:
\begin{eqnarray*}
\frac{ \partial \text{loss} }{ \partial w^L }
&=& \frac{ \partial \text{loss} }{ \partial a^L }
\cdot \frac{ \partial a^L }{ \partial z^L }
\cdot \frac{ \partial z^L }{ \partial w^L } \\
&=& \frac{ \partial \text{loss} }{ \partial a^L } \cdot (f^L)'(z^L) \cdot a^{L-1}.
\end{eqnarray*}
Now we can look at the case of general $l$:
\begin{eqnarray*}
\frac{ \partial \text{loss} }{ \partial w^l }
&=& \frac{ \partial \text{loss} }{ \partial a^L }
\cdot \frac{ \partial a^L }{ \partial z^L }
\cdot \frac{ \partial z^L }{ \partial a^{L-1} }
\cdot \frac{ \partial a^{L-1} }{ \partial z^{L-1} }
\cdots
%\frac{ \partial a^{l+1} }{ \partial z^{l+1} }
\frac{ \partial z^{l+1} }{ \partial a^{l} }
\cdot \frac{ \partial a^{l} }{ \partial z^l }
\cdot \frac{ \partial z^l }{ \partial w^l } \\
&=& \frac{ \partial \text{loss} }{ \partial a^L } \cdot (f^L)'(z^L) \cdot w^L
\cdot (f^{L-1})'(z^{L-1})
\cdots
%\cdot (f^{l+1})'(z^{l+1})
\cdot w^{l+1}
\cdot (f^{l})'(z^{l})
\cdot a^{l-1} \\
&=& \frac{ \partial \text{loss} }{ \partial z^l } \cdot a^{l-1}.
\end{eqnarray*}
Note that every multiplication above is scalar multiplication because every term in every product above is a scalar. And though we solved for all the other terms in the product, we haven't solved for $\partial \text{loss} / \partial a^L$ because the derivative will depend on which loss function you choose. Once you choose a loss function though, you should be able to compute this derivative.
\question{Suppose you choose squared loss. What is $\partial \text{loss} / \partial a^L$?}
\question{Check the derivations above yourself. You should use the chain rule and also solve for the individual derivatives that arise in the chain rule.}
\question{Check that the final layer ($l=L$) case is a special case of the general layer $l$ case above.}
\question{Derive $\partial \mathcal{L}(\text{NN}(x;W),y) / \partial w_0^l$ for yourself, for both the final layer ($l=L$) and general $l$.}
\question{Does the $L=1$ case remind you of anything from earlier in this course?}
\question{Write out the full SGD algorithm for this neural network.}
It's pretty typical to run the chain rule from left to right like we did above. But, for where we're going next, it will be useful to notice that it's completely equivalent to write it in the other direction. So we can rewrite our result from above as follows:
\begin{eqnarray}
\label{eq:gradloss_oned}
\frac{ \partial \text{loss} }{ \partial w^l }
&=& a^{l-1} \cdot \frac{ \partial \text{loss} }{ \partial z^l } \\
\label{eq:gradz_intermediate_oned}
\frac{ \partial \text{loss} }{ \partial z^l }
&=& \frac{ \partial a^{l} }{ \partial z^l }
\cdot \frac{ \partial z^{l+1} }{ \partial a^{l} }
\cdots
\frac{ \partial a^{L-1} }{ \partial z^{L-1} }
\cdot \frac{ \partial z^L }{ \partial a^{L-1} }
\cdot \frac{ \partial a^L }{ \partial z^L }
\cdot \frac{ \partial \text{loss} }{ \partial a^L } \\
\label{eq:gradz_oned}
&=& \frac{ \partial a^{l} }{ \partial z^l } \cdot w^{l+1}
\cdots
\frac{ \partial a^{L-1} }{ \partial z^{L-1} } \cdot w^L \cdot \frac{ \partial a^L }{ \partial z^L }
\cdot \frac{ \partial \text{loss} }{ \partial a^L }.
\end{eqnarray}
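
These one-dimensional equations translate directly into code. A
sketch for a two-layer scalar network with $f^1 = \tanh$, $f^2$ the
identity, and squared loss (so that
$\partial \text{loss} / \partial a^L = 2(a^L - y)$):
\begin{verbatim}
import numpy as np

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2

def grads_1d(x, y, w1, w01, w2, w02):
    # forward pass (L = 2)
    z1 = w1 * x  + w01; a1 = np.tanh(z1)
    z2 = w2 * a1 + w02; a2 = z2           # f^2 is the identity
    # backward pass: d loss / d z^l via the chain rule
    dloss_dz2 = 2.0 * (a2 - y)            # squared loss; (f^2)' = 1
    dloss_dz1 = tanh_prime(z1) * w2 * dloss_dz2
    # d loss / d w^l = a^{l-1} * d loss / d z^l, with a^0 = x
    return x * dloss_dz1, dloss_dz1, a1 * dloss_dz2, dloss_dz2
\end{verbatim}
The four returned values are the derivatives of the loss with respect
to $w^1, w^1_0, w^2, w^2_0$, respectively.
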
\subsection{The general case}
Next we're going to do everything that we did above, but this time we'll allow any number of inputs $m^l$ and outputs $n^l$ at every layer. First, we'll tell you the results that correspond to our derivations above. Then we'll talk about why they make sense. And finally we'll derive them carefully.
OK, let's start with the results! Again, below we'll be using ``loss'' as an abbreviation for $\mathcal{L}(\text{NN}(x;W),y)$. Then,
\begin{eqnarray}
\label{eq:gradloss}
\underbrace{\frac{\partial \text{loss}}{\partial W^l}}_{m^l \times n^l}
&=&
\underbrace{A^{l-1}}_{m^l \times 1} \;
\underbrace{\left(\frac{\partial \text{loss}}{\partial
Z^l}\right)^T}_{1 \times n^l} \\
%
\label{eq:gradz_intermediate}
\frac{\partial \text{loss}}{\partial Z^l}
&=& \frac{\partial A^l}{\partial Z^l}
%%\cdot \frac{\partial Z^l}{ \partial A^{l+1}}
\cdot \frac{\partial Z^{l+1}}{ \partial A^l}
%\cdot \frac{\partial A^{l+1}}{\partial Z^{l+1}}
\cdots
%\cdot \frac{\partial Z^{L-1}}{\partial A^{L-2}}
\cdot \frac{\partial A^{L-1}}{\partial Z^{L-1}}
\cdot \frac{\partial Z^{L}}{\partial A^{L-1}}
\cdot \frac{\partial A^{L}}{\partial Z^{L}}
\cdot \frac{\partial \text{loss}}{\partial A^L}
\\
\label{eq:gradz}
&=& \frac{\partial A^l}{\partial Z^l}
\cdot W^{l+1}
% \cdot W^l
%\cdot \frac{\partial A^{l+1}}{\partial Z^{l+1}}
\cdots
%W^{L-1}
\cdot \frac{\partial A^{L-1}}{\partial Z^{L-1}}
\cdot W^{L}
\cdot \frac{\partial A^{L}}{\partial Z^{L}}
\cdot \frac{\partial \text{loss}}{\partial A^L} \; .
\end{eqnarray}
First, compare each equation to its one-dimensional counterpart, and make sure you see the similarities.
That is, compare the general weight derivatives in Eq.~\ref{eq:gradloss} to the one-dimensional case in Eq.~\ref{eq:gradloss_oned}. Compare the intermediate derivative of loss with respect to the pre-activations $Z^l$ in Eq.~\ref{eq:gradz_intermediate} to the one-dimensional case in Eq.~\ref{eq:gradz_intermediate_oned}. And finally compare the version where we've substituted in some of the derivatives in Eq.~\ref{eq:gradz} to Eq.~\ref{eq:gradz_oned}. Hopefully you see how the forms are very analogous. But in the matrix case, we now have to be careful about the matrix dimensions. We'll check these matrix dimensions below.
Let's start by talking through each of the terms in the matrix version of these equations. Recall that loss is a scalar, and $W^l$ is a matrix of size $m^l \times n^l$. You can read about the conventions this course uses for matrix derivatives, which we start using in this chapter, in Appendix~\ref{app:matrix_deriv}. By these conventions (not the only possible conventions!), we have that $\partial \text{loss} / \partial W^l$ will be a matrix of size $m^l \times n^l$ whose $(i,j)$ entry is the scalar $\partial \text{loss} / \partial W^l_{i,j}$. In some sense, we're just doing a bunch of traditional scalar derivatives, and the matrix notation lets us write them all simultaneously and succinctly. In particular, for SGD, we need to find the derivative of the loss with respect to every scalar component of the weights, because these are our model's parameters and therefore the things we want to update in SGD.
The next quantity we see in Eq.~\ref{eq:gradloss} is $A^{l-1}$, which we recall has size $m^l \times 1$ (or equivalently $n^{l-1} \times 1$, since it represents the outputs of layer $l-1$). Finally, we see $\partial \text{loss} / \partial Z^l$. Again, loss is a scalar, and $Z^l$ is an $n^l \times 1$ vector. So by the conventions in Appendix~\ref{app:matrix_deriv}, we have that $\partial \text{loss} / \partial Z^l$ has size $n^l \times 1$. The transpose then has size $1 \times n^l$. Now you should be able to check that the dimensions all make sense in Eq.~\ref{eq:gradloss}; in particular, you can check that the inner dimensions agree in the matrix multiplication and that, after the multiplication, we are left with something that has the dimensions of the left-hand side.
Now let's look at Eq.~\ref{eq:gradz}. We're computing $\partial \text{loss} / \partial Z^l$ so that we can use it in Eq.~\ref{eq:gradloss}. The weights are familiar. The one part that remains is terms of the form $\partial A^l / \partial Z^l$. Checking out Appendix~\ref{app:matrix_deriv}, we see that this term should be a matrix of size $n^l \times n^l$ since $A^l$ and $Z^l$ both have size $n^l \times 1$. The $(i,j)$ entry of this matrix is $\partial A^l_j / \partial Z^l_i$. This scalar derivative is something that you can compute when you know your activation function. If you're not using a softmax activation function, $A^l_j$ typically is a function only of $Z^l_j$, which means that $\partial A^l_j / \partial Z^l_i$ should equal 0 whenever $i \ne j$, and that $\partial A^l_j / \partial Z^l_j = (f^l)'(Z^l_j)$.
\question{Compute the dimensions of every term in Eqs.~\ref{eq:gradz_intermediate} and \ref{eq:gradz} using Appendix~\ref{app:matrix_deriv}. After you've done that, check that all the matrix multiplications work; that is, check that the inner dimensions agree and that the left-hand side and right-hand side of these equations have the same dimensions.}
\question{If I use the identity activation function, what is $\partial A^l_j / \partial Z^l_j$ for any $j$? What is the full matrix $\partial A^l / \partial Z^l$?}
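
Because $\partial A^l / \partial Z^l$ is diagonal for an element-wise
activation, code rarely builds the full matrix: multiplying by it is
the same as an element-wise product with the vector of derivatives. A
sketch for $\tanh$:
\begin{verbatim}
import numpy as np

def dA_dZ_tanh(Z):
    # the full n^l x n^l Jacobian: zero off the diagonal
    return np.diag((1.0 - np.tanh(Z) ** 2).flatten())

Z = np.random.randn(3, 1)
v = np.random.randn(3, 1)
# multiplying by the Jacobian == element-wise product with (f^l)'(Z)
assert np.allclose(dA_dZ_tanh(Z) @ v,
                   (1.0 - np.tanh(Z) ** 2) * v)
\end{verbatim}
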
\subsection{Derivations for the general case}
You can use everything above without deriving it yourself. But if you want to find the gradients of loss with respect to $W^l_0$ (which we need for SGD!), then you'll want to know how to actually do these derivations. So next we'll work out the derivations.
The key trick is to just break every equation down into its scalar meaning. For instance, the $(i,j)$ element of $\partial \text{loss} / \partial W^l$ is $\partial \text{loss} / \partial W^l_{i,j}$. If you think about it for a moment (and it might help to go back to the one-dimensional case), the loss is a function of the elements of $Z^l$, and the elements of $Z^l$ are a function of the $W^l_{i,j}$. There are $n^l$ elements of $Z^l$, so we can use the chain rule to write
\begin{equation}
\label{eq:partial_loss_partial_W_intermediate}
\frac{ \partial \text{loss} }{ \partial W^l_{i,j} }
= \sum_{k=1}^{n^l} \frac{ \partial \text{loss} }{ \partial Z^l_k } \frac{ \partial Z^l_k }{ \partial W^l_{i,j} }.
\end{equation}
To figure this out, let's remember that $Z^l = (W^l)^\top A^{l-1} + W_0^l$. We can write one element of the $Z^l$ vector, then, as $Z^l_{b} = \sum_{a=1}^{m^l} W_{a,b}^{l} A_a^{l-1} + (W_0^l)_b$. It follows that $\partial Z^l_k / \partial W^l_{i,j}$ will be zero except when $k=j$ (check you agree!). So we can rewrite Eq.~\ref{eq:partial_loss_partial_W_intermediate} as
\begin{equation}
\label{eq:partial_loss_partial_W_entries}
\frac{ \partial \text{loss} }{ \partial W^l_{i,j} }
= \frac{ \partial \text{loss} }{ \partial Z^l_j } \frac{ \partial Z^l_j }{ \partial W^l_{i,j} }
= \frac{ \partial \text{loss} }{ \partial Z^l_j } A_i^{l-1}.
\end{equation}
Finally, then, we match entries of the matrices on both sides of the equation above to recover Eq.~\ref{eq:gradloss}.
\question{Check that Eq.~\ref{eq:partial_loss_partial_W_entries} and Eq.~\ref{eq:gradloss} say the same thing.}
\question{Convince yourself that $\partial Z^{l} / \partial A^{l-1} = W^l$ by comparing the entries of the matrices on both sides of the equality sign.}
\question{Convince yourself that Eq.~\ref{eq:gradz_intermediate} is true.}
\question{Apply the same reasoning to find the gradients of
$\text{loss}$ with respect to $W_0^l$.}
\subsection{Reflecting on backpropagation}
This general process of computing the gradients of the loss with respect to the weights
is called {\em error back-propagation}.\index{error back-propagation}
\note{We could call this
``blame propagation''. Think of $\text{loss}$ as how mad we
are about the prediction just made. Then $\partial
\text{loss}/ \partial A^L$ is how much we blame $A^L$ for the loss.
The last module has to take in $\partial \text{loss}/ \partial A^L$
and compute $\partial \text{loss}/ \partial Z^L$, which is how much
we blame $Z^L$ for the loss. The next module (working backwards)
takes in $\partial \text{loss}/ \partial Z^L$ and computes $\partial
\text{loss}/ \partial A^{L-1}$. So every module is accepting its
blame for the loss, computing how much of it to allocate to each of
its inputs, and passing the blame back to them.}
The idea
is that we first do a {\em forward pass} to compute all the $a$ and
$z$ values at all the layers, and finally the actual loss. Then, we can work backward and compute the gradient of the
loss with respect to the weights in each layer, starting at layer $L$
and going back to layer 1.
\begin{center}
\begin{tikzpicture}[scale=.98]
\coordinate (x) at (0,0);
\node[inner sep=0em] (w1) at (1.7,0)
{\begin{tabular}{c} $W^1$ \\ $W^1_0$\end{tabular}};
\node[inner sep=1em] (f1) at ($2*(w1)$) {$f^1$};
\node[inner sep=0em] (w2) at ($3*(w1)$)
{\begin{tabular}{c} $W^2$ \\ $W^2_0$\end{tabular}};
\node[inner sep=1em] (f2) at ($4*(w1)$) {$f^2$};
\node (dots) at ($5*(w1)$) {$\cdots$};
\node[inner sep=0em] (wL) at ($6*(w1)$)
{\begin{tabular}{c} $W^L$ \\ $W^L_0$\end{tabular}};
\node[inner sep=1em] (fL) at ($7*(w1)$) {$f^L$};
\node (loss) at ($8*(w1)$) {Loss};
\coordinate (y) at ($8*(w1) + (0,1.5)$);
\draw[->] (x) -- node[above] {$X = A^0$} (w1);
\draw[->] (w1) -- node[above] {$Z^1$} (f1);
\draw[->] (f1) -- node[above] {$A^1$} (w2);
\draw[->] (w2) -- node[above] {$Z^2$} (f2);
\draw[->] (f2) -- node[above] {$A^2$} (dots);
\draw[->] (dots) -- node[above] {$A^{L-1}$} (wL);
\draw[->] (wL) -- node[above] {$Z^L$} (fL);
\draw[->] (fL) -- node[above] {$A^L$} (loss);
\draw[->] (y) -- node[right] {$y$} ($(loss)+(0,.65)$);
%\draw[->,yshift=0.5cm] (loss.west) to [out=150,in=30] (fL.east);
\foreach \s/\e/\t in {loss/fL/A^L,
fL/wL/Z^L,
wL/dots/A^{L-1},
dots/f2/A^2,
f2/w2/Z^2,
w2/f1/A^1,
f1/w1/Z^1}{
\path[->] ($(\s.west) - (0,.5)$) edge[out=210,in=-30] node[below]
{$\frac{\partial \text{loss}}{\partial \t}$}
($(\e.east) - (0,.5)$);
}
\foreach \point in {w1, f1, w2, f2, wL, fL}{
\draw ($(\point) + (-.3,-.5)$) rectangle ($(\point) + (.3,.5)$);
}
\draw ($(loss) + (-.4,-.5)$) rectangle ($(loss) + (.4,.5)$);
\end{tikzpicture}
\end{center}
If we view our neural network as a sequential composition of modules
(in our work so far, it has been an alternation between a linear
transformation with a weight matrix, and a component-wise application
of a non-linear activation function), then we can define a simple API
for a module that will let us compute the forward and backward passes,
as well as do the necessary weight updates for gradient descent. Each
module has to provide the following ``methods.'' We are already using
letters $a, x, y, z$ with particular meanings, so here we will use $u$
as the vector input to the module and $v$ as the vector output:
\begin{itemize}
\item forward: $u \rightarrow v$
\item backward: $u, v, \partial L /
\partial v \rightarrow \partial L / \partial u$
\item weight grad: $u, \partial L / \partial v \rightarrow \partial L
/ \partial W$ (only needed for modules that have weights $W$)
\end{itemize}
In homework we will ask you to implement these modules for neural
network components, and then use them to construct a network and train
it as described in the next section.
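
As a preview (a sketch only, not the required interface for the
homework), such modules might look like this in Python, with a linear
module that caches its input during forward so that backward can
compute the weight gradients:
\begin{verbatim}
import numpy as np

class Linear:
    def __init__(self, m, n):
        self.W  = np.random.randn(m, n) * 0.1   # small random init
        self.W0 = np.zeros((n, 1))
    def forward(self, u):
        self.u = u                      # cache input for backward
        return self.W.T @ u + self.W0
    def backward(self, dL_dv):
        self.dL_dW  = self.u @ dL_dv.T  # weight grad: u (dL/dv)^T
        self.dL_dW0 = dL_dv
        return self.W @ dL_dv           # dL/du, passed back
    def sgd_step(self, eta):
        self.W  -= eta * self.dL_dW
        self.W0 -= eta * self.dL_dW0

class Tanh:
    def forward(self, u):
        self.v = np.tanh(u)
        return self.v
    def backward(self, dL_dv):
        return (1.0 - self.v ** 2) * dL_dv   # element-wise Jacobian
\end{verbatim}
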
% \begin{examplebox}
% What are some conventions for derivatives of matrices and vectors? It
% will always work to explicitly write all indices and treat everything
% as scalars, but we introduce here some shortcuts that are often faster
% to use and helpful for understanding:
% \begin{itemize}
% \item For $v$ of size $1\times 1$ and $u$ of size $1\times 1$,
% $\partial v/\partial u$ is the (scalar) partial derivative of $v$
% w.r.t. $u$. Example: $\underbrace{\partial L}_{1\times
% 1}/\underbrace{\partial W}_{1\times 1}$ is $1\times 1$
% \item For $v$ of size $1\times 1$ and $u$ of size $b\times 1$,
% $\partial v/\partial u$ is a vector of size $b\times 1$ with the $j^{\rm th}$ entry $\partial v/\partial u_j$.
% Example: $\underbrace{\partial L}_{1\times 1}/\underbrace{\partial W_0}_{n\times 1}$ is $n\times 1$
% \item For $v$ of size $c\times 1$ and $u$ of size $b\times 1$,
% $\partial v/\partial u$ is a vector of size $b\times c$ with the $(j, k)$ entry $\partial v_k/\partial u_j$.
% Example: $\underbrace{\partial A}_{n\times 1}/\underbrace{\partial Z}_{n\times 1}$ is $n\times n$
% \item For $v$ of size $1\times 1$ and $u$ of size $b\times c$,
% $\partial v/\partial u$ is a vector of size $b\times c$ with the $(j, k)$ entry $\partial v/\partial u_{j,k}$.
% Example: $\underbrace{\partial L}_{1\times 1}/\underbrace{\partial W}_{m\times n}$ is $m\times n$
% \end{itemize}
% \end{examplebox}
%\chapter{Making NN's Work}
\section{Training}
Here we go! Here's how to do stochastic gradient descent training on
a feed-forward neural network.\index{neural network!training} After this pseudo-code, we motivate
the choice of initialization in lines 2 and 3. The actual
computation of the gradient values (e.g., $\partial\text{loss}/
\partial A^L$) is not directly defined in this code, because we want
to make the structure of the computation clear.
\question{What is $\partial Z^l / \partial W^l$? }
\question{Which terms in the code below depend on $f^L$?}
\begin{codebox}
\Procname{$\proc{SGD-Neural-Net}(\dataTrain, T, L, (m^1, \ldots,
m^L), (f^1, \ldots, f^L), \text{Loss})$}
\li \For $l \gets 1$ \To $L$
\Do
\li $W_{ij}^l \sim \text{Gaussian}(0, 1/m^l)$
\li $W_{0j}^l \sim \text{Gaussian}(0, 1)$
\End
\li \For $t \gets 1$ \To $T$
\li \Do
$i \gets \text{random sample from } \{1,\ldots,n\}$
\li $A^0 \gets \ex{x}{i}$
\li \Comment forward pass to compute the output $A^L$
\li \For $l \gets 1$ \To $L$
\li \Do
${Z^l \gets {W^l}^TA^{l-1} + W_0^l}$
\li $A^l \gets f^l(Z^l)$
\End
\li $\text{loss} \gets \text{Loss}(A^L, \ex{y}{i})$
\li \For $l \gets L$ \To $1$
\Do
\li \Comment error back-propagation
\li $\partial \text{loss}/\partial A^l \gets$ %
% m x 1
{\bf if} $l < L$
{\bf then}
% dims: n x 1 * m x n
$\partial Z^{l+1}/\partial A^l \cdot \partial \text{loss}/\partial Z^{l+1} $
% dims: m x n * n x 1
{\bf else}
$ \partial \text{loss}/\partial A^L$
\li $\partial \text{loss}/\partial Z^l \gets \partial
A^l/\partial Z^l \cdot \partial
\text{loss}/\partial A^l$
\li \Comment compute gradient with respect to weights
\li $\partial \text{loss}/\partial W^l \gets A^{l-1} \cdot \left( \partial \text{loss}/\partial Z^l \right)^T $
% m x n = m x 1 * 1 x n
\li $\partial \text{loss}/\partial W_0^l \gets \partial \text{loss}/\partial Z^l $
% 1 x n = 1 x n
\li \Comment stochastic gradient descent update
\li $W^l = W^l - \eta(t)\cdot \partial \text{loss}/\partial W^l$
\li $W^l_0 = W_0^l - \eta(t)\cdot \partial \text{loss}/\partial W_0^l$
\End
\end{codebox}
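
Here is a sketch of this pseudo-code in NumPy, for the common case of
element-wise activations. The argument names (\texttt{f},
\texttt{f\_prime}, \texttt{dloss\_dAL}) are our own choices: you
supply the activation functions, their derivatives, and the derivative
of the loss with respect to $A^L$; we also use a constant step size
rather than $\eta(t)$ for simplicity:
\begin{verbatim}
import numpy as np

def sgd_neural_net(Xs, Ys, T, dims, f, f_prime, dloss_dAL, eta):
    # dims = [m^1, n^1, ..., n^L]; f[l-1], f_prime[l-1] are layer l's
    L = len(dims) - 1
    W  = [np.random.randn(dims[l], dims[l+1]) / np.sqrt(dims[l])
          for l in range(L)]            # Gaussian(0, 1/m^l) weights,
    W0 = [np.random.randn(dims[l+1], 1) # reading 1/m^l as the variance
          for l in range(L)]
    for t in range(T):
        i = np.random.randint(len(Xs))  # pick a random training point
        A, Z = [Xs[i]], [None]          # A[0] = A^0 = x^(i)
        for l in range(L):              # forward pass
            Z.append(W[l].T @ A[l] + W0[l])
            A.append(f[l](Z[l + 1]))
        dA = dloss_dAL(A[L], Ys[i])     # d loss / d A^L
        for l in range(L, 0, -1):       # error back-propagation
            dZ = f_prime[l - 1](Z[l]) * dA  # d loss / d Z^l
            dW, dW0 = A[l - 1] @ dZ.T, dZ   # grads w.r.t. weights
            dA = W[l - 1] @ dZ              # d loss / d A^{l-1}
            W[l - 1]  -= eta * dW           # SGD updates
            W0[l - 1] -= eta * dW0
    return W, W0
\end{verbatim}
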
Initializing $W$ is important;\index{neural network!initialization of weights} if you do it badly there is a good
chance the neural network training won't work well. First, it is
important to initialize the weights to random values. We want
different parts of the network to tend to ``address'' different
aspects of the problem; if they all start at the same weights, the
symmetry will often keep the values from moving in useful directions.
Second, many of our activation functions have (near) zero slope when
the pre-activation $z$ values have large magnitude, so we generally
want to keep the initial weights small, so that we start in a regime
where the gradients are non-zero and gradient descent has some
useful signal about which way to go.
One good general-purpose strategy is to choose each weight at random
from a Gaussian (normal) distribution with mean 0 and standard
deviation $(1/m)$ where $m$ is the number of inputs to the unit.
\question{If the input $x$ to this unit is a vector of 1's, what would
the expected pre-activation $z$ value be with these initial
weights?} We write this choice (where $\sim$ means ``is drawn
randomly from the distribution'') as $W^l_{ij} \sim \text{Gaussian}\left(0,
\frac{1}{m^l}\right).$

It will often turn out (especially for fancier activations and loss
functions) that computing
$\frac{\partial \text{loss}}{\partial Z^L}$ is easier than computing
$\frac{\partial \text{loss}}{\partial A^L}$ and $\frac{\partial A^L}{\partial Z^L}.$ So, we may instead ask for an implementation of a loss function to provide a backward method that computes $\partial \text{loss}/\partial Z^L$ directly.
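
The classic example is a sigmoid output with {\sc nll} loss. There,
\[\frac{\partial \text{loss}}{\partial A^L} = \frac{A^L - y}{A^L(1 -
A^L)} \quad\text{and}\quad \frac{\partial A^L}{\partial Z^L} = A^L(1 -
A^L)\;\;,\]
each of which is awkward on its own (and numerically touchy when $A^L$
is near $0$ or $1$), but whose product is simply
\[\frac{\partial \text{loss}}{\partial Z^L} = A^L - y\;\;.\]
So a combined ``sigmoid plus {\sc nll}'' backward method can return
$A^L - y$ directly.
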
\section{Optimizing neural network parameters}
\label{sec-make_nn_work}
Because neural networks are just parametric functions, we can optimize
loss with respect to the parameters using standard gradient-descent
software, but we can take advantage of the structure of the loss
function and the hypothesis class to improve optimization. As we have
seen, the modular function-composition structure of a neural network
hypothesis makes it easy to organize the computation of the gradient.
As we have also seen earlier, the structure of the loss function as a
sum over terms, one per training data point, allows us to consider
stochastic gradient methods. In this section we'll consider some
alternative strategies for organizing training, and also for making it
easier to handle the step-size parameter.
\subsection{Batches}
Assume that we have an objective of the form
\[J(W) = \sum_{i = 1}^n \mathcal{L}(h(\ex{x}{i}; W), \ex{y}{i})\;\;,\]
where $h$ is the function computed by a neural network, and $W$ stands
for all the weight matrices and vectors in the network.
Recall that, when we perform {\em batch} (or ``vanilla'') gradient descent, we use the update rule
\[W_t = W_{t-1} - \eta\nabla_W J(W_{t-1})\;\;,\]
which is equivalent to
\[W_t = W_{t-1} - \eta\sum_{i=1}^n \nabla_W \mathcal{L}(h(\ex{x}{i}; W_{t-1}),
\ex{y}{i})\;\;.\]
So, we sum up the gradient of loss at each training point, with
respect to $W$, and then take a step in the negative direction of
the gradient.
In {\em stochastic} gradient descent, we repeatedly pick a point
$(\ex{x}{i}, \ex{y}{i})$ at random from the data set, and execute a
weight update on that point alone:
\[W_t = W_{t-1} - \eta \nabla_W \mathcal{L}(h(\ex{x}{i}; W_{t-1}),
\ex{y}{i})\;\;.\]
As long as we pick points uniformly at random from the data set, and
decrease $\eta$ at an appropriate rate, we are guaranteed, with high
probability, to converge to at least a local optimum.
These two methods have offsetting virtues. The batch method takes
steps in the exact gradient direction but requires a lot of
computation before even a single step can be taken, especially if the
data set is large. The stochastic method begins moving right away,
and can sometimes make very good progress before looking at even a
substantial fraction of the whole data set, but if there is a lot of
variability in the data, it might require a very small $\eta$ to
effectively average over the individual steps moving in ``competing''
directions.
An effective strategy is to ``average'' between batch and stochastic
gradient descent by using {\em mini-batches}. For a mini-batch of
size $K$, we select $K$ distinct data points uniformly at random from the data
set and do the update based just on their contributions to the gradient
\[W_t = W_{t-1} - \eta\sum_{i=1}^K \nabla_W \mathcal{L}(h(\ex{x}{i}; W_{t-1}),
\ex{y}{i})\;\;.\]
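
In code, selecting a mini-batch and taking the step is short. A
sketch, where \texttt{grad} is a hypothetical function that returns
$\nabla_W \mathcal{L}(h(\ex{x}{i}; W), \ex{y}{i})$ for a point $i$
(computed, e.g., by back-propagation as above):
\begin{verbatim}
import numpy as np

def minibatch_step(W, grad, n, K, eta):
    # K distinct indices, chosen uniformly without replacement
    batch = np.random.choice(n, size=K, replace=False)
    # sum the per-point gradients and take one step
    return W - eta * sum(grad(i, W) for i in batch)
\end{verbatim}
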