\chapter{Gradient Descent}
\label{chap-gradient}
\vspace*{-0.1in}
In the previous chapter,
we showed how to describe an interesting objective function for
machine learning, but we need a way to
find the optimal $\Theta^* = \argmin{\Theta} J(\Theta)$, particularly
when the objective function is not amenable to analytical
optimization. For example, this can be the case when $J(\Theta)$
involves a more complex loss function, or more general forms of
regularization.
It can also be the case when there are simply too many
parameters to learn for it to be computationally feasible.
There is an
enormous and fascinating literature on the mathematical and algorithmic
foundations of optimization\note{Which you should consider studying
some day!}, but for this class, we will consider one of the simplest
methods, called {\em gradient descent.}
Intuitively, in one or two dimensions, we can easily think of
$J(\Theta)$ as defining a surface over $\Theta$; that same idea
extends to higher dimensions. Now, our objective is to find the
$\Theta$ value at the lowest point on that surface. One way to think
about gradient descent is that you start at some arbitrary point on
the surface, look to see in which direction the ``hill'' goes down
most steeply, take a small step in that direction, determine the
direction of steepest descent from where you are, take another small
step, etc.
Below, we explicitly give gradient descent algorithms for one and
multidimensional objective functions (Sections~\ref{sec-gd_onedim}
and~\ref{sec-gd}). We then illustrate the application of gradient
descent to a loss function which is not merely mean squared loss
(Section~\ref{sec-gd_ridge}). And we present an important method known
as {\em stochastic gradient descent}
(Section~\ref{sec-sgd}), which is
especially useful when datasets are too large for descent in a single
batch, and has some important behaviors of its own.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Gradient descent in one dimension}\label{sec-gd_onedim}
We start by considering gradient descent in one dimension. Assume
$\Theta \in \R$, and that we know both $J(\Theta)$ and its first
derivative with respect to $\Theta$, $J'(\Theta)$. Here is pseudo-code
for gradient descent on an arbitrary function $f$\index{gradient descent}. Along with $f$ and
its gradient
% AM_S2024 begin
$\nabla_{\Theta}f$ (which, in the case of a scalar $\Theta$, is the same as its derivative $f'$),
% AM_S2024 end
we have to specify some hyper-parameters.
These hyper-parameters include the initial value for parameter $\Theta$, a {\em step-size}
hyper-parameter $\eta$, and an {\em accuracy} hyper-parameter~$\epsilon$\index{gradient descent!learning rate}.
The hyper-parameter $\eta$ is often called {\em learning rate} when
gradient descent is applied in machine learning. For simplicity, $\eta$ may be taken as a constant, as is the case in the pseudo-code below; we'll see adaptive (non-constant) step-sizes soon. What's important to notice, though, is that even when $\eta$ is constant, the actual magnitude of the
change to $\Theta$ may not be constant, since that change also depends on the
magnitude of the gradient itself.
\begin{codebox}
\Procname{$\proc{1D-Gradient-Descent}(\Theta_{\it init}, \eta, f,
f', \epsilon)$}
\li $\Theta^{(0)} \gets \Theta_{\it init}$
\li $t \gets 0$
\li \Repeat
\li $t \gets t+1$
\li $\Theta^{(t)} = \Theta^{(t-1)} - \eta \, f'(\Theta^{(t-1)})$
\li \Until $|f'(\Theta^{(t)})| < \epsilon$
\li \Return $\Theta^{(t)}$
\end{codebox}
Note that this algorithm terminates when the derivative of the function $f$
is sufficiently small. There are many other reasonable ways to decide
to terminate, including\index{gradient descent!stopping criteria}:
\begin{itemize}
\item Stop after a fixed number of iterations $T$, i.e.,\ when $t = T$. Practically, this is the most common choice.
\item Stop when the change in the value of the parameter
$\Theta$ is sufficiently small, i.e.,\ when $\left| \Theta^{(t)} - \Theta^{(t-1)} \right|
<\epsilon$.
% \item Stop when the derivative $f'$ at the latest value of $\Theta$ is sufficiently small, i.e.,\ when $\left|f'(\Theta^{(t)}) \right|
% <\epsilon$.
\end{itemize}
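For concreteness, here is a minimal Python sketch of $\proc{1D-Gradient-Descent}$; the function names and the iteration cap are our own additions, not part of the pseudocode above.
\begin{verbatim}
def gd_1d(theta_init, eta, df, epsilon, max_iter=10000):
    # Constant step size eta; stop when |f'(theta)| < epsilon
    # (or after max_iter steps, as a safeguard).
    theta = theta_init
    for t in range(max_iter):
        theta = theta - eta * df(theta)
        if abs(df(theta)) < epsilon:
            break
    return theta

# The convex example plotted below: f(x) = (x - 2)^2, eta = 1/2
df = lambda x: 2 * (x - 2)
print(gd_1d(4.0, 0.5, df, 1e-8))   # reaches 2.0 (in one step here)
\end{verbatim}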
\question{Consider all of the potential stopping criteria for $\proc{1D-Gradient-Descent}$, both in the algorithm as it appears
and listed separately later.
Can you think of ways that any two of the criteria relate to each other?
}
\begin{theorem}
Choose any small distance $\tilde{\epsilon} > 0$.
If we assume that $f$ has a minimum, is sufficiently ``smooth'' and convex,\note{A function is convex if the line segment between any two points on the graph of the function lies above or on the graph.}
and if the learning rate $\eta$ is sufficiently small, gradient descent will reach a point within $\tilde{\epsilon}$ of a global optimum point $\Theta$.\index{gradient descent!convergence}
% Previous theorem:
% for any desired accuracy $\epsilon$, there is some
% learning rate $\eta$ such that gradient
% descent will converge to within $\epsilon$ of the optimal $\Theta$.
%
% Previous theorem has bug: doesn't quite hold for horizontal line; also,
% it won't be epsilon away; it'll get arbitrarily close with the distances going to 0
\end{theorem}
However, we must be careful when choosing the learning rate to prevent
slow convergence, non-converging oscillation around the minimum, or divergence.
The following plot illustrates a convex function $f(x) = (x - 2)^2$,
starting gradient descent at $x_\textrm{init} = 4.0$ with a step-size of
$1/2$. It is very well-behaved!
\begin{center}
\begin{tikzpicture}
\begin{axis}[
axis lines=middle,
xmin=-1, xmax=6,
ymin=-1, ymax=5,
xlabel={$x$}, ylabel={$f(x)$},
]
\addplot [domain=-1:6, samples=100] {(x-2)^2};
\addplot [only marks,color=black,mark=*,mark size=1.5pt] coordinates {
(2,0)
(4,4)};
\draw[-latex,blue,thick] (axis cs: 4,4) -- (axis cs: 2,0);
\end{axis}
\end{tikzpicture}
\end{center}
\question{What happens in this example with very small $\eta$? With
very big $\eta$?}
If $f$ is non-convex, the point to which gradient descent converges depends on
$x_{\it init}$. First, let's establish some definitions.
Let $f$ be a real-valued function defined over some domain $D$.
A point $x_0 \in D$ is called a {\em global minimum point} of $f$ if $f(x_0) \leq f(x)$
for all other $x \in D$.
A point $x_0 \in D$ is instead called a {\em local minimum point} of a function $f$ if there exists
some constant $\epsilon > 0$ such that for all $x$ within the interval defined by $d(x,x_0) < \epsilon,$
$f(x_0) \leq f(x)$, where $d$ is some distance metric, e.g., $d(x,x_0) = || x - x_0 ||.$
% Suppose we have well defined first and second derivatives for $f$.
% Then we say that $f$ has a {\em local minimum point} or {\em local optimum point}
% at $x$ if $f'(x) = 0$ and $f''(x)
% > 0$, and we say that $f(x)$ is a {\em local minimum value} of $f$.
% % More generally, $x$ is a local minimum point of $f$ if $f(x)$ is at
% % least as low as $f(x')$ for all points $x'$ in some small area around $x$.
% We say that $f$ has a {\em global
% minimum point} at $x$ if $f(x)$ is at least as low as $f(x')$ for every other input $x'$.
% And then we call $f(x)$ a {\em global minimum value}.
A global minimum point is also a local minimum point, but a local
minimum point does not have to be a global minimum point.
If $f$ is non-convex (and sufficiently smooth),
% AM_S2024 begin
one expects that
% AM_S2024 end
gradient descent (run long enough with small enough learning rate) will get very close to a point at which the gradient is zero, though
we cannot guarantee that it will converge to a global minimum point.
% \note{There are many "gaps" between a point having zero gradient and a global minimum. For example, a local maximum, or a so-called saddle point would be a point at which gradient is zero.}
% AM_S2024 begin
There are two notable exceptions to this common-sense expectation: First, gradient descent can stagnate while approaching a point $x$ which is not a local minimum or maximum,
but satisfies $f'(x)=0$. For example, for $f(x)=x^3$, starting gradient descent from the
initial guess $x_{\it init}=1$, while using learning rate $\eta<1/3$
will lead to $x^{(k)}$ converging to zero as $k\to\infty$. Second, there are functions (even convex ones) with no minimum points,
like $f(x)=\exp(-x)$, for which gradient descent with a positive learning rate converges to $+\infty$.
% AM_S2024 end
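A quick numerical check of the first exception (a hypothetical Python snippet, not part of the notes):
\begin{verbatim}
# Gradient descent on f(x) = x^3 from x_init = 1 with eta = 0.1:
# the iterates creep toward x = 0, where f'(x) = 3x^2 = 0, even
# though x = 0 is neither a local minimum nor a maximum.
x, eta = 1.0, 0.1
for t in range(1000):
    x = x - eta * 3 * x**2
print(x)   # a small positive number, slowly approaching 0
\end{verbatim}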
The plot below shows two different $x_{\it init}$, and how gradient descent
started from each point heads toward two different local optimum points.
\begin{center}
\begin{tikzpicture}
\begin{axis}[
axis lines=middle,
xmin=-2, xmax=4,
ymin=2, ymax=10,
xlabel={$x$}, ylabel={$f(x)$},
]
\addplot [domain=-2:4, samples=100] {0.5*x^4 - 3*x^3 + 4*x^2 + 3*x + 3};
\draw[-latex,blue,thick] (axis cs: -1,7.5) -- (axis cs: -0.52,2.98);
\draw[-latex,blue,thick] (axis cs: -.52,2.98) -- (axis cs: -0.40, 2.65);
\draw[-latex,blue,thick] (axis cs: -0.40, 2.65) -- (axis cs: -0.28, 2.54);
\draw[-latex,blue,thick] (axis cs: 2,9) -- (axis cs: 2.15,8.81);
\draw[-latex,blue,thick] (axis cs: 2.15,8.81) -- (axis cs: 2.38,8.40);
\draw[-latex,blue,thick] (axis cs: 2.38,8.40) -- (axis cs: 2.68, 7.82);
\draw[-latex,blue,thick] (axis cs: 2.68, 7.82) -- (axis cs: 2.93, 7.52);
\draw[-latex,blue,thick] (axis cs: 2.93, 7.52) -- (axis cs: 3, 7.5);
\addplot [only marks,color=black,mark=*,mark size=1.5pt] coordinates {
(-1, 7.5)
(-.4, 2.65)
(2, 9)
(3, 7.5)};
\end{axis}
\end{tikzpicture}
\end{center}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Multiple dimensions}
\label{sec-gd}
The extension to the case of multi-dimensional $\Theta$ is
straightforward. Let's assume $\Theta \in \R^m$, so $f: \R^m
\rightarrow \R$. The
gradient of $f$ with respect to $\Theta$ is
\[
\nabla_\Theta f =
\begin{bmatrix}
\partial f / \partial \Theta_1 \\
\vdots \\
\partial f / \partial \Theta_m
\end{bmatrix}
\]
The algorithm remains the same, except that the update step in line 5
becomes
\[ \ex{\Theta}{t} = \ex{\Theta}{t-1} - \eta\nabla_\Theta f(\ex{\Theta}{t-1}) \]
and any termination criteria that depended on the dimensionality
of $\Theta$ would have to change. The easiest thing is
to change the test in line 6 to $\left|f(\ex{\Theta}{t}) -
f(\ex{\Theta}{t-1}) \right| < \epsilon$, which is sensible no matter
the dimensionality of $\Theta$.
\question{Which termination criteria from the 1D case were defined in
a way that assumes $\Theta$ is one dimensional?
}
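As before, a minimal Python sketch (our own hypothetical code, using this function-value-change stopping test) may help make the procedure concrete:
\begin{verbatim}
import numpy as np

def gd(theta_init, eta, f, grad_f, epsilon, max_iter=10000):
    # Multi-dimensional gradient descent; stop when the objective
    # changes by less than epsilon between iterations.
    theta = np.array(theta_init, dtype=float)
    for t in range(max_iter):
        theta_new = theta - eta * grad_f(theta)
        if abs(f(theta_new) - f(theta)) < epsilon:
            return theta_new
        theta = theta_new
    return theta

# Example: f(theta) = ||theta - c||^2, minimized at c = (1, -2)
c = np.array([1.0, -2.0])
f      = lambda th: np.sum((th - c) ** 2)
grad_f = lambda th: 2 * (th - c)
print(gd([0.0, 0.0], 0.1, f, grad_f, 1e-10))   # close to [1, -2]
\end{verbatim}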
% % AM_S2024 begin
% \subsection{Gradient vs. Derivative}
% It is easy to fall into habit of using the words {\sl gradient} and {\sl derivative} interchangeably, but there is actually a significant difference.
% For a function $f: \R^m\rightarrow \R$, its derivative at point $x\in\R^m$ (strictly speaking, its {\sl total derivative} at $x$)
% is a {\sl linear} function
% $L:\R^m\to\R$ which provides the best approximation of $f(x+\Delta)-f(x)$ for small $\Delta$,
% in the sense that
% \[
% \lim_{\|\Delta\|\to0+}\frac{\|f(x+\Delta)-f(x)-L(\Delta)\|}{\|\Delta\|}=0.
% \]
% When the elements of $\R^m$ are viewed as
% $m$-by-1 column vectors (as we do in this class), every linear function $L:\R^m\to\R$ can be written in the form
% $L(\Delta)=p\Delta$, where $p$ is a 1-by-$m$ real row matrix, the {\sl Jacobian} of $f$ at $x$.
% This matrix $p$ is usually denoted by $p=df/dx$, and can be computed in terms of partial derivatives as
% \[
% \frac{df}{dx}=\begin{bmatrix}
% \partial f / \partial x_1 &
% \dots &
% \partial f / \partial x_m
% \end{bmatrix}
% \]
% What is described above corresponds to the so-called "numerator" convention for denoting derivatives, which is most common.
% In contrast, the {\sl gradient} $\nabla f$ of a function $f: \R^m\rightarrow \R$ at point $x\in\R^m$ provides a very rough estimate for the
% argument of maximum of $f(x+\Delta)-f(x)$ with respect to $\Delta$. The "roughness" corresponds to the fact that, instead of maximizing
% $f(x+\Delta)-f(x)$, the gradient maximizes $L(\Delta)-0.5\|\Delta\|^2$ (equivalently, $-\nabla f$ minimizes
% $L(\Delta)+0.5\|\Delta\|^2$). Here $L(\Delta)$ is the total derivative of $f$ at
% $x$, hence a good {\sl linear} approximation of $f(x+\Delta)-f(x)$, but the $-0.5\|\Delta\|^2$ term is a wild guess at the
% second order component of the Taylor series expansion of $f(x+\Delta)-f(x)$ (to be more accurate, $-0.5\|\Delta\|^2$ should be replaced by
% $0.5\Delta^Tf''(x)\Delta$, where $f''(x)$ is the {\sl Hessian} of $f$ at $x$, but this approach is much more expensive, and has its own issues).
% When $L(\Delta)=p\Delta$, where $p$ is a 1-by-$m$ row matrix, the maximum of $L(\Delta)-0.5\|\Delta\|^2$ is achieved at $\Delta=p^T$. Hence
% the gradient is the result of transposing the matrix representation of the derivative, when one is using the "numerator" convention.
% The arbitrariness of the $\|\Delta\|^2$ term in the derivation of the gradient is glaring, which leads to the idea of replacing it with
% some other positive definite quadratic form $\Delta^TQ\Delta$, for which maximizing $L(\Delta)-0.5\Delta^TQ\Delta$ is computationally cheap (using
% $Q=f''(x)$ is very expensive in most applications, and there is the additional concern that,
% for a non-convex function, $f''(x)$ does not have to be positive definite).
% One can argue that many of the popular approaches to gradient-like optimization,
% such as the {\sl Adam} are based on using a smart selection of matrix $Q$ in an alternative definition of the gradient.
% To make things more confusing, there is an alternative "denominator" convention,
% in which the Jacobian is defined as transpose of its "numerator" version.
% The denominator convention comes naturally when one interprets the elements of $\R^m$ as {\sl row vectors}
% (i.e., 1-by-$m$ matrices), in which case linear functions
% $L:~\R^m\to\R$ are conveniently written in the form $L(\Delta)=\Delta p$, where $p$ is a {\sl column} $m$-by-1 matrix. When the denominator
% convention is combined with treating the elements of $\R^m$ as column vectors,
% the chain rule for differentiating compositions becomes rather cumbersome.
% % AM_S2024 end
\section{Application to regression}
\label{sec-gd_ridge}
Recall from the previous chapter that choosing a loss function is the
first step in formulating a machine-learning problem as an
optimization problem, and for regression we studied the mean squared
loss, which captures loss as
$({\rm guess} - {\rm actual})^2$.\index{gradient descent!applied to regression}
This leads to the {\em ordinary least squares} objective
\begin{equation}
J(\theta) = \frac{1}{n}\sum_{i =
1}^n\left(\theta^Tx^{(i)} - \ex{y}{i}\right)^2 \;\;.
\end{equation}
We use the gradient of the objective with respect to the parameters,
\begin{equation}
\nabla_{\theta}J = \frac{2}{n}\underbrace{\Xt^T}_{d \times
n}\underbrace{(\Xt\theta - \Yt)}_{n \times 1}\;\;,
\label{eq:reg_gd_deriv}
\end{equation}
to obtain an
analytical solution to the linear regression problem. Gradient
descent could also be applied to numerically compute a solution, using
the update rule
\begin{equation}
\ex{\theta}{t} = \ex{\theta}{t-1} - \eta \frac{2}{n} \sum_{i=1}^{n} \left( \left[ \ex{\theta}{t-1}\right]^T x^{(i)} - y^{(i)} \right) x^{(i)}
\,.
\end{equation}
{~\hfill ~\note{Beware double superscripts! $\left[ \theta \right]^T$ is the transpose of the vector $\theta$.}}
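To make the shapes in Eq.~\ref{eq:reg_gd_deriv} concrete, here is a small numpy check (hypothetical data; the variable names are ours), with the data matrix of shape $n \times d$ and the targets of shape $n \times 1$:
\begin{verbatim}
import numpy as np

n, d = 5, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))     # data matrix, n x d
Y = rng.normal(size=(n, 1))     # targets, n x 1
theta = np.zeros((d, 1))

# Gradient of the OLS objective: (2/n) X^T (X theta - Y), shape d x 1
grad = (2 / n) * X.T @ (X @ theta - Y)

# One gradient-descent update with learning rate eta
eta = 0.01
theta = theta - eta * grad
\end{verbatim}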
\subsection{Ridge regression}
Now, let's add in the regularization term, to get the ridge-regression
objective:
$$ J_{\text{ridge}}(\theta, \theta_0) = \frac{1}{n}\sum_{i = 1}^n\left(\theta^Tx^{(i)} + \theta_0 - y^{(i)}\right)^2 + \lambda\|\theta\|^2 \;\;.$$
Recall that in ordinary least squares, we finessed handling $\theta_0$
by adding an extra dimension of all 1's. In ridge regression,
we really do need to separate the parameter vector $\theta$ from the
offset $\theta_0$, and so, from the perspective of our general-purpose
gradient descent method, our whole parameter set $\Theta$ is defined
to be $\Theta = (\theta, \theta_0)$. We will go ahead and find the
gradients separately for each one:%
\note{Some passing familiarity with matrix
derivatives is helpful here. A foolproof way of computing them is to compute
partial derivative of $J$ with respect to each component $\theta_i$
of $\theta$. See Appendix~\ref{app:matrix_deriv} on matrix derivatives!}
\begin{align*}
\nabla_\theta J_\text{ridge}(\theta, \theta_0) & = \frac{2}{n}\sum_{i=1}^n
\left(\theta^T\ex{x}{i} + \theta_0 -
\ex{y}{i}\right) \ex{x}{i}
+ 2\lambda\theta \\
\frac{\partial J_\text{ridge}(\theta, \theta_0)}{\partial \theta_0} & =
\frac{2}{n}\sum_{i=1}^n
\left(\theta^T\ex{x}{i} + \theta_0 -
\ex{y}{i} \right) \;\;.
\end{align*}
Note that $\nabla_\theta J_\text{ridge}$ will be of shape $d \times 1$ and
$\partial J_\text{ridge}/\partial \theta_0$ will be a scalar since
we have separated $\theta_0$ from $\theta$ here.
\question{Convince yourself that the dimensions of all these
quantities are correct, under the assumption that $\theta$ is $d
\times 1$. How does $d$ relate to $m$ as discussed for $\Theta$ in the
previous section?
}
\question{Compute $\nabla_\theta \norm{\theta}^2$ by finding the
vector of partial derivatives $(\partial \norm{\theta}^2 / \partial
\theta_1, \ldots, \partial \norm{\theta}^2 / \partial
\theta_d)$. What is the shape of $\nabla_\theta \norm{\theta}^2$?
}
\question{Compute $\nabla_\theta J_\text{ridge}(
\theta^T x + \theta_0, y)$ by finding the
vector of partial derivatives $(\partial J_\text{ridge}(
\theta^T x + \theta_0, y)/ \partial \theta_1, \ldots,
\partial J_\text{ridge}(
\theta^T x + \theta_0, y) / \partial
\theta_d)$.
}
\question{Use these last two results to verify our derivation above.}
Putting everything together, our gradient descent algorithm for ridge
regression becomes\index{gradient descent!applied to ridge regression}
\begin{codebox}
\Procname{$\proc{RR-Gradient-Descent}(\theta_{\it init}, \theta_{0
{\it init}},\eta,\epsilon)$}
\li $\theta^{(0)} \gets \theta_{\it init}$
\li $\theta_0^{(0)} \gets \theta_{0 {\it init}}$
\li $t \gets 0$
\li \Repeat
\li $t \gets t+1$
\li $\theta^{(t)} = \theta^{(t-1)} - \eta\left(\frac{1}{n}\sum_{i=1}^n
\left({\ex{\theta}{t-1}}^T \ex{x}{i} + \ex{\theta_0}{t-1} -
\ex{y}{i}\right) \ex{x}{i}
+ \lambda\ex{\theta}{t-1}
\right)$
\li $\theta_0^{(t)} = \theta_0^{(t-1)} - \eta\left(\frac{1}{n}\sum_{i=1}^n
\left({\ex{\theta}{t-1}}^T \ex{x}{i} + \ex{\theta_0}{t-1} -
\ex{y}{i} \right)
\right)$
\li \Until $\left| J_{\text{ridge}}(\theta^{(t)},\theta_0^{(t)}) - J_{\text{ridge}}(\theta^{(t-1)},
\theta_0^{(t-1)}) \right| <\epsilon$
\li \Return $\theta^{(t)},\theta_0^{(t)}$
\end{codebox}
\question{Is it okay that $\lambda$ doesn't appear in line 7?}
\question{Is it okay that the 2's from the gradient definitions don't
appear in the algorithm?}
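A direct (hypothetical) Python translation of $\proc{RR-Gradient-Descent}$, mirroring the constants used in the pseudocode above, might look like this:
\begin{verbatim}
import numpy as np

def ridge_gd(X, Y, lam, eta, epsilon, max_iter=100000):
    # X is n x d, Y is n x 1; theta_0 is kept separate from theta
    # and is not regularized.
    n, d = X.shape
    theta, theta_0 = np.zeros((d, 1)), 0.0
    J = lambda th, th0: (np.mean((X @ th + th0 - Y) ** 2)
                         + lam * np.sum(th ** 2))
    for t in range(max_iter):
        err = X @ theta + theta_0 - Y              # n x 1 residuals
        new_theta = theta - eta * ((1 / n) * X.T @ err + lam * theta)
        new_theta_0 = theta_0 - eta * np.mean(err)
        if abs(J(new_theta, new_theta_0) - J(theta, theta_0)) < epsilon:
            return new_theta, new_theta_0
        theta, theta_0 = new_theta, new_theta_0
    return theta, theta_0
\end{verbatim}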
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Stochastic gradient descent}
\label{sec-sgd}
When the form of the gradient is a sum, rather than take one big(ish)
step in the direction of the gradient, we can, instead,
randomly\note{The word ``stochastic'' means probabilistic, or random;
so does ``aleatoric,'' which is a very cool word. Look up
aleatoric music, sometime.}
select one term of the sum, and take a very small step in that
direction. This seems sort of crazy, but remember that all the
little steps would average out to the same direction as the big step
if you were to stay in one place. Of course, you're not staying in
that place, so you move, in expectation, in the direction of the
gradient.
Most objective functions in machine learning can end up being written
as a sum over data points, in which case, stochastic gradient descent
({\sc sgd}) is implemented by picking a data point randomly out of the
data set, computing the gradient as if there were only that one point
in the data set, and taking a small step in the negative direction.
Let's assume our objective has the form
\[
f(\Theta) = \sum_{i = 1}^n f_i(\Theta)
\;\;,
\]
where $n$ is the number of data points used in the objective (and this
may be different from the number of points available in the whole data
set). Here is pseudocode for applying {\sc sgd} to such an objective $f$;
it assumes we know the form of $\nabla_\Theta f_i$ for all $i$ in
$1\ldots n$:
\begin{codebox}
\Procname{$\proc{Stochastic-Gradient-Descent}(\Theta_{\it init}, \eta, f,
\nabla_\Theta f_1, \ldots, \nabla_\Theta f_n, T)$}
\li $\Theta^{(0)} \gets \Theta_{\it init}$
\li \For $t \gets 1$ \To $T$
\li \Do
randomly select $i \in \{1, 2, \dots, n\}$
\li $\Theta^{(t)} = \Theta^{(t-1)} - \eta(t) \, \nabla_\Theta f_i(\Theta^{(t-1)})$
\End
\li \Return $\Theta^{(t)}$
\end{codebox}
Note that now instead of a fixed value of $\eta$, $\eta$ is indexed by
the iteration of the algorithm, $t$.\index{stochastic gradient descent}
Choosing a good stopping criterion can be a little trickier for {\sc sgd} than
for traditional gradient descent. Here we've simply chosen
to stop after a fixed number of iterations $T$.
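A minimal Python sketch of $\proc{Stochastic-Gradient-Descent}$ (our own hypothetical code, with $\eta(t) = 1/t$ as a placeholder step-size rule, discussed below):
\begin{verbatim}
import numpy as np

def sgd(theta_init, f_grads, T, eta=lambda t: 1.0 / t):
    # f_grads is a list of per-point gradient functions grad f_i;
    # eta(t) is a step size that decreases with the iteration t.
    theta = np.array(theta_init, dtype=float)
    n = len(f_grads)
    rng = np.random.default_rng()
    for t in range(1, T + 1):
        i = rng.integers(n)          # pick one data point at random
        theta = theta - eta(t) * f_grads[i](theta)
    return theta
\end{verbatim}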
For {\sc sgd} to converge to a local optimum point as $t$ increases, the
learning rate has to decrease as a function of time. The next result shows one
learning rate sequence that works.
\begin{theorem}
If $f$ is convex, and $\eta(t)$ is a sequence satisfying
$$ \sum_{t = 1}^{\infty}\eta(t) = \infty \;\;\text{and}\;\;
\sum_{t = 1}^{\infty}\eta(t)^2 < \infty \;\;,$$
then SGD converges {\em with probability one}%
\note{We have left out some gnarly conditions in this theorem. Also, you
can learn more about the subtle difference between ``with probability
one'' and ``always'' by taking an advanced probability course.}
to the optimal $\Theta$.\index{stochastic gradient descent!convergence}
\end{theorem}
Why these two conditions? The intuition is that the first condition,
on $\sum \eta(t)$, is needed to allow for the possibility of an
unbounded potential range of exploration, while the second condition,
on $\sum\eta(t)^2$, ensures that the learning rates get smaller and
smaller as $t$ increases.
One ``legal'' way of setting the learning rate is to make $\eta(t) = 1/t$, but
people often use rules that decrease more slowly, and so don't
strictly satisfy the criteria for convergence.
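To see why $\eta(t) = 1/t$ is ``legal,'' note that the harmonic series diverges while the sum of inverse squares converges:
\[
\sum_{t = 1}^{\infty}\frac{1}{t} = \infty
\;\;\text{and}\;\;
\sum_{t = 1}^{\infty}\frac{1}{t^2} = \frac{\pi^2}{6} < \infty \;\;.
\]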
\question{
If you start a long way from the optimum, would making $\eta(t)$
decrease more slowly tend to make you move more quickly or more slowly
to the optimum?
}
There are multiple intuitions for why {\sc sgd} might be a better choice
algorithmically than regular {\sc gd} (which is sometimes called {\em
batch} {\sc gd} ({\sc bgd})):
\begin{itemize}
\item {\sc bgd} typically requires computing some quantity over every
data point in a data set. {\sc sgd} may perform well after visiting only
some of the data. This behavior can be useful for very large data sets,
because of the resulting savings in runtime and memory.
\item If your $f$ is actually non-convex, but has many shallow local
optimum points that might trap {\sc bgd}, then taking {\em samples} from the
gradient at some point $\Theta$ might ``bounce'' you around the
landscape and away from the local optimum points.
\item Sometimes, optimizing $f$ really well is not what we want to do,
because it might overfit the training set; so, in fact, although
{\sc sgd} might not get lower training error than {\sc bgd}, it
might result in lower test error.
\end{itemize}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "top"
%%% End: