\chapter{Gradient Descent}
\label{chap-gradient}
\vspace*{-0.1in}
In the previous chapter,
we showed how to describe an interesting objective function for
machine learning, but we need a way to
find the optimal $\Theta^* = \argmin{\Theta} J(\Theta)$, particularly
when the objective function is not amenable to analytical
optimization. For example, this can be the case when $J(\Theta)$
involves a more complex loss function, or more general forms of
regularization.
It can also be the case when there are simply too many
parameters to learn for it to be computationally feasible.
There is an
enormous and fascinating literature on the mathematical and algorithmic
foundations of optimization\note{Which you should consider studying
some day!}, but for this class, we will consider one of the simplest
methods, called {\em gradient descent.}
Intuitively, in one or two dimensions, we can easily think of
$J(\Theta)$ as defining a surface over $\Theta$; that same idea
extends to higher dimensions. Now, our objective is to find the
$\Theta$ value at the lowest point on that surface. One way to think
about gradient descent is that you start at some arbitrary point on
the surface, look to see in which direction the ``hill'' goes down
most steeply, take a small step in that direction, determine the
direction of steepest descent from where you are, take another small
step, etc.
Below, we explicitly give gradient descent algorithms for one and
multidimensional objective functions (Sections~\ref{sec-gd_onedim}
and~\ref{sec-gd}). We then illustrate the application of gradient
descent to a loss function which is not merely mean squared loss
(Section~\ref{sec-gd_ridge}). And we present an important method known
as {\em stochastic gradient descent}
(Section~\ref{sec-sgd}), which is
especially useful when datasets are too large for descent in a single
batch, and has some important behaviors of its own.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Gradient descent in one dimension}\label{sec-gd_onedim}
We start by considering gradient descent in one dimension. Assume
$\Theta \in \R$, and that we know both $J(\Theta)$ and its first
derivative with respect to $\Theta$, $J'(\Theta)$. Here is pseudo-code
for gradient descent on an arbitrary function $f$\index{gradient descent}. Along with $f$ and
its gradient
% AM_S2024 begin
$\nabla_{\Theta}f$ (which, in the case of a scalar $\Theta$, is the same as its derivative $f'$),
% AM_S2024 end
we have to specify some hyper-parameters.
These hyper-parameters include the initial value for parameter $\Theta$, a {\em step-size}
hyper-parameter $\eta$, and an {\em accuracy} hyper-parameter~$\epsilon$\index{gradient descent!learning rate}.
The hyper-parameter $\eta$ is often called {\em learning rate} when
gradient descent is applied in machine learning. For simplicity, $\eta$ may be taken as a constant, as is the case in the pseudo-code below; we'll see adaptive (non-constant) step-sizes soon. What's important to notice, though, is that even when $\eta$ is constant, the actual magnitude of the
change to $\Theta$ may not be constant, since that change also depends on the
magnitude of the gradient itself.
\begin{codebox}
\Procname{$\proc{1D-Gradient-Descent}(\Theta_{\it init}, \eta, f,
f', \epsilon)$}
\li $\Theta^{(0)} \gets \Theta_{\it init}$
\li $t \gets 0$
\li \Repeat
\li $t \gets t+1$
\li $\Theta^{(t)} = \Theta^{(t-1)} - \eta \, f'(\Theta^{(t-1)})$
\li \Until $|f'(\Theta^{(t)})| < \epsilon$
\li \Return $\Theta^{(t)}$
\end{codebox}
Note that this algorithm terminates when the derivative of the function $f$
is sufficiently small. There are many other reasonable ways to decide
to terminate, including\index{gradient descent!stopping criteria}:
\begin{itemize}
\item Stop after a fixed number of iterations $T$, i.e.,\ when $t = T$. Practically, this is the most common choice.
\item Stop when the change in the value of the parameter
$\Theta$ is sufficiently small, i.e.,\ when $\left| \Theta^{(t)} - \Theta^{(t-1)} \right|
<\epsilon$.
% \item Stop when the derivative $f'$ at the latest value of $\Theta$ is sufficiently small, i.e.,\ when $\left|f'(\Theta^{(t)}) \right|
% <\epsilon$.
\end{itemize}
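For concreteness, here is a minimal Python sketch of $\proc{1D-Gradient-Descent}$; the function names and the iteration cap are our own additions, not part of the pseudocode above.
\begin{verbatim}
def gd_1d(theta_init, eta, df, epsilon, max_iter=10000):
    # Constant step size eta; stop when |f'(theta)| < epsilon
    # (or after max_iter steps, as a safeguard).
    theta = theta_init
    for t in range(max_iter):
        theta = theta - eta * df(theta)
        if abs(df(theta)) < epsilon:
            break
    return theta

# The convex example plotted below: f(x) = (x - 2)^2, eta = 1/2
df = lambda x: 2 * (x - 2)
print(gd_1d(4.0, 0.5, df, 1e-8))   # reaches 2.0 (in one step here)
\end{verbatim}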
\question{Consider all of the potential stopping criteria for $\proc{1D-Gradient-Descent}$, both in the algorithm as it appears
and listed separately later.
Can you think of ways that any two of the criteria relate to each other?
}
\begin{theorem}
Choose any small distance $\tilde{\epsilon} > 0$.
If we assume that $f$ has a minimum, is sufficiently ``smooth'' and convex,\note{A function is convex if the line segment between any two points on the graph of the function lies above or on the graph.}
and if the learning rate $\eta$ is sufficiently small, gradient descent will reach a point within $\tilde{\epsilon}$ of a global optimum point $\Theta$.\index{gradient descent!convergence}
% Previous theorem:
% for any desired accuracy $\epsilon$, there is some
% learning rate $\eta$ such that gradient
% descent will converge to within $\epsilon$ of the optimal $\Theta$.
%
% Previous theorem has bug: doesn't quite hold for horizontal line; also,
% it won't be epsilon away; it'll get arbitrarily close with the distances going to 0
\end{theorem}
However, we must be careful when choosing the learning rate to prevent
slow convergence, non-converging oscillation around the minimum, or divergence.
The following plot illustrates a convex function $f(x) = (x - 2)^2$,
starting gradient descent at $x_\textrm{init} = 4.0$ with a step-size of
$1/2$. It is very well-behaved!
\begin{center}
\begin{tikzpicture}
\begin{axis}[
axis lines=middle,
xmin=-1, xmax=6,
ymin=-1, ymax=5,
xlabel={$x$}, ylabel={$f(x)$},
]
\addplot [domain=-1:6, samples=100] {(x-2)^2};
\addplot [only marks,color=black,mark=*,mark size=1.5pt] coordinates {
(2,0)
(4,4)};
\draw[-latex,blue,thick] (axis cs: 4,4) -- (axis cs: 2,0);
\end{axis}
\end{tikzpicture}
\end{center}
\question{What happens in this example with very small $\eta$? With
very big $\eta$?}
If $f$ is non-convex, the point to which gradient descent converges depends on
$x_{\it init}$. First, let's establish some definitions.
Let $f$ be a real-valued function defined over some domain $D$.
A point $x_0 \in D$ is called a {\em global minimum point} of $f$ if $f(x_0) \leq f(x)$
for all other $x \in D$.
A point $x_0 \in D$ is instead called a {\em local minimum point} of a function $f$ if there exists
some constant $\epsilon > 0$ such that for all $x$ within the interval defined by $d(x,x_0) < \epsilon,$
$f(x_0) \leq f(x)$, where $d$ is some distance metric, e.g., $d(x,x_0) = || x - x_0 ||.$
% Suppose we have well defined first and second derivatives for $f$.
% Then we say that $f$ has a {\em local minimum point} or {\em local optimum point}
% at $x$ if $f'(x) = 0$ and $f''(x)
% > 0$, and we say that $f(x)$ is a {\em local minimum value} of $f$.
% % More generally, $x$ is a local minimum point of $f$ if $f(x)$ is at
% % least as low as $f(x')$ for all points $x'$ in some small area around $x$.
% We say that $f$ has a {\em global
% minimum point} at $x$ if $f(x)$ is at least as low as $f(x')$ for every other input $x'$.
% And then we call $f(x)$ a {\em global minimum value}.
A global minimum point is also a local minimum point, but a local
minimum point does not have to be a global minimum point.
If $f$ is non-convex (and sufficiently smooth),
% AM_S2024 begin
one expects that
% AM_S2024 end
gradient descent (run long enough with small enough learning rate) will get very close to a point at which the gradient is zero, though
we cannot guarantee that it will converge to a global minimum point.
% \note{There are many "gaps" between a point having zero gradient and a global minimum. For example, a local maximum, or a so-called saddle point would be a point at which gradient is zero.}
% AM_S2024 begin
There are two notable exceptions to this common-sense expectation: First, gradient descent can stagnate while approaching a point $x$ which is not a local minimum or maximum,
but satisfies $f'(x)=0$. For example, for $f(x)=x^3$, starting gradient descent from the
initial guess $x_{\it init}=1$, while using learning rate $\eta<1/3$
will lead to $x^{(k)}$ converging to zero as $k\to\infty$. Second, there are functions (even convex ones) with no minimum points,
like $f(x)=\exp(-x)$, for which gradient descent with a positive learning rate converges to $+\infty$.
% AM_S2024 end
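A quick numerical check of the first exception (a hypothetical Python snippet, not part of the notes):
\begin{verbatim}
# Gradient descent on f(x) = x^3 from x_init = 1 with eta = 0.1:
# the iterates creep toward x = 0, where f'(x) = 3x^2 = 0, even
# though x = 0 is neither a local minimum nor a maximum.
x, eta = 1.0, 0.1
for t in range(1000):
    x = x - eta * 3 * x**2
print(x)   # a small positive number, slowly approaching 0
\end{verbatim}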
The plot below shows two different $x_{\it init}$, and how gradient descent
started from each point heads toward two different local optimum points.
\begin{center}
\begin{tikzpicture}
\begin{axis}[
axis lines=middle,
xmin=-2, xmax=4,
ymin=2, ymax=10,
xlabel={$x$}, ylabel={$f(x)$},
]
\addplot [domain=-2:4, samples=100] {0.5*x^4 - 3*x^3 + 4*x^2 + 3*x + 3};
\draw[-latex,blue,thick] (axis cs: -1,7.5) -- (axis cs: -0.52,2.98);
\draw[-latex,blue,thick] (axis cs: -.52,2.98) -- (axis cs: -0.40, 2.65);
\draw[-latex,blue,thick] (axis cs: -0.40, 2.65) -- (axis cs: -0.28, 2.54);
\draw[-latex,blue,thick] (axis cs: 2,9) -- (axis cs: 2.15,8.81);
\draw[-latex,blue,thick] (axis cs: 2.15,8.81) -- (axis cs: 2.38,8.40);
\draw[-latex,blue,thick] (axis cs: 2.38,8.40) -- (axis cs: 2.68, 7.82);
\draw[-latex,blue,thick] (axis cs: 2.68, 7.82) -- (axis cs: 2.93, 7.52);
\draw[-latex,blue,thick] (axis cs: 2.93, 7.52) -- (axis cs: 3, 7.5);
\addplot [only marks,color=black,mark=*,mark size=1.5pt] coordinates {
(-1, 7.5)
(-.4, 2.65)
(2, 9)
(3, 7.5)};
\end{axis}
\end{tikzpicture}
\end{center}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Multiple dimensions}
\label{sec-gd}
The extension to the case of multi-dimensional $\Theta$ is
straightforward. Let's assume $\Theta \in \R^m$, so $f: \R^m
\rightarrow \R$. The
gradient of $f$ with respect to $\Theta$ is
\[
\nabla_\Theta f =
\begin{bmatrix}
\partial f / \partial \Theta_1 \\
\vdots \\
\partial f / \partial \Theta_m
\end{bmatrix}
\]
The algorithm remains the same, except that the update step in line 5
becomes
\[ \ex{\Theta}{t} = \ex{\Theta}{t-1} - \eta\nabla_\Theta f(\ex{\Theta}{t-1}) \]
and any termination criteria that depended on the dimensionality
of $\Theta$ would have to change. The easiest thing is
to change the test in line 6 to $\left|f(\ex{\Theta}{t}) -
f(\ex{\Theta}{t-1}) \right| < \epsilon$, which is sensible no matter
the dimensionality of $\Theta$.
\question{Which termination criteria from the 1D case were defined in
a way that assumes $\Theta$ is one dimensional?
}
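As before, a minimal Python sketch (our own hypothetical code, using this function-value-change stopping test) may help make the procedure concrete:
\begin{verbatim}
import numpy as np

def gd(theta_init, eta, f, grad_f, epsilon, max_iter=10000):
    # Multi-dimensional gradient descent; stop when the objective
    # changes by less than epsilon between iterations.
    theta = np.array(theta_init, dtype=float)
    for t in range(max_iter):
        theta_new = theta - eta * grad_f(theta)
        if abs(f(theta_new) - f(theta)) < epsilon:
            return theta_new
        theta = theta_new
    return theta

# Example: f(theta) = ||theta - c||^2, minimized at c = (1, -2)
c = np.array([1.0, -2.0])
f      = lambda th: np.sum((th - c) ** 2)
grad_f = lambda th: 2 * (th - c)
print(gd([0.0, 0.0], 0.1, f, grad_f, 1e-10))   # close to [1, -2]
\end{verbatim}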
% % AM_S2024 begin
% \subsection{Gradient vs. Derivative}
% It is easy to fall into habit of using the words {\sl gradient} and {\sl derivative} interchangeably, but there is actually a significant difference.
% For a function $f: \R^m\rightarrow \R$, its derivative at point $x\in\R^m$ (strictly speaking, its {\sl total derivative} at $x$)
% is a {\sl linear} function
% $L:\R^m\to\R$ which provides the best approximation of $f(x+\Delta)-f(x)$ for small $\Delta$,
% in the sense that
% \[
% \lim_{\|\Delta\|\to0+}\frac{\|f(x+\Delta)-f(x)-L(\Delta)\|}{\|\Delta\|}=0.
% \]
% When the elements of $\R^m$ are viewed as
% $m$-by-1 column vectors (as we do in this class), every linear function $L:\R^m\to\R$ can be written in the form
% $L(\Delta)=p\Delta$, where $p$ is a 1-by-$m$ real row matrix, the {\sl Jacobian} of $f$ at $x$.
% This matrix $p$ is usually denoted by $p=df/dx$, and can be computed in terms of partial derivatives as
% \[
% \frac{df}{dx}=\begin{bmatrix}
% \partial f / \partial x_1 &
% \dots &
% \partial f / \partial x_m
% \end{bmatrix}
% \]
% What is described above corresponds to the so-called "numerator" convention for denoting derivatives, which is most common.
% In contrast, the {\sl gradient} $\nabla f$ of a function $f: \R^m\rightarrow \R$ at point $x\in\R^m$ provides a very rough estimate for the
% argument of maximum of $f(x+\Delta)-f(x)$ with respect to $\Delta$. The "roughness" corresponds to the fact that, instead of maximizing
% $f(x+\Delta)-f(x)$, the gradient maximizes $L(\Delta)-0.5\|\Delta\|^2$ (equivalently, $-\nabla f$ minimizes
% $L(\Delta)+0.5\|\Delta\|^2$). Here $L(\Delta)$ is the total derivative of $f$ at
% $x$, hence a good {\sl linear} approximation of $f(x+\Delta)-f(x)$, but the $-0.5\|\Delta\|^2$ term is a wild guess at the
% second order component of the Taylor series expansion of $f(x+\Delta)-f(x)$ (to be more accurate, $-0.5\|\Delta\|^2$ should be replaced by
% $0.5\Delta^Tf''(x)\Delta$, where $f''(x)$ is the {\sl Hessian} of $f$ at $x$, but this approach is much more expensive, and has its own issues).
% When $L(\Delta)=p\Delta$, where $p$ is a 1-by-$m$ row matrix, the maximum of $L(\Delta)-0.5\|\Delta\|^2$ is achieved at $\Delta=p^T$. Hence
% the gradient is the result of transposing the matrix representation of the derivative, when one is using the "numerator" convention.
% The arbitrariness of the $\|\Delta\|^2$ term in the derivation of the gradient is glaring, which leads to the idea of replacing it with
% some other positive definite quadratic form $\Delta^TQ\Delta$, for which maximizing $L(\Delta)-0.5\Delta^TQ\Delta$ is computationally cheap (using
% $Q=f''(x)$ is very expensive in most applications, and there is the additional concern that,
% for a non-convex function, $f''(x)$ does not have to be positive definite).
% One can argue that many of the popular approaches to gradient-like optimization,
% such as the {\sl Adam} are based on using a smart selection of matrix $Q$ in an alternative definition of the gradient.
% To make things more confusing, there is an alternative "denominator" convention,
% in which the Jacobian is defined as transpose of its "numerator" version.
% The denominator convention comes naturally when one interprets the elements of $\R^m$ as {\sl row vectors}
% (i.e., 1-by-$m$ matrices), in which case linear functions
% $L:~\R^m\to\R$ are conveniently written in the form $L(\Delta)=\Delta p$, where $p$ is a {\sl column} $m$-by-1 matrix. When the denominator
% convention is combined with treating the elements of $\R^m$ as column vectors,
% the chain rule for differentiating compositions becomes rather cumbersome.
% % AM_S2024 end
\section{Application to regression}
\label{sec-gd_ridge}
Recall from the previous chapter that choosing a loss function is the
first step in formulating a machine-learning problem as an
optimization problem, and for regression we studied the mean squared
loss, which captures loss as
$({\rm guess} - {\rm actual})^2$.\index{gradient descent!applied to regression}
This leads to the {\em ordinary least squares} objective
\begin{equation}
J(\theta) = \frac{1}{n}\sum_{i =
1}^n\left(\theta^Tx^{(i)} - \ex{y}{i}\right)^2 \;\;.
\end{equation}
We use the gradient of the objective with respect to the parameters,
\begin{equation}
\nabla_{\theta}J = \frac{2}{n}\underbrace{\Xt^T}_{d \times
n}\underbrace{(\Xt\theta - \Yt)}_{n \times 1}\;\;,
\label{eq:reg_gd_deriv}
\end{equation}
to obtain an
analytical solution to the linear regression problem. Gradient
descent could also be applied to numerically compute a solution, using
the update rule
\begin{equation}
\ex{\theta}{t} = \ex{\theta}{t-1} - \eta \frac{2}{n} \sum_{i=1}^{n} \left( \left[ \ex{\theta}{t-1}\right]^T x^{(i)} - y^{(i)} \right) x^{(i)}
\,.
\end{equation}
{~\hfill ~\note{Beware double superscripts! $\left[ \theta \right]^T$ is the transpose of the vector $\theta$.}}
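To make the shapes in Eq.~\ref{eq:reg_gd_deriv} concrete, here is a small numpy check (hypothetical data; the variable names are ours), with the data matrix of shape $n \times d$ and the targets of shape $n \times 1$:
\begin{verbatim}
import numpy as np

n, d = 5, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))     # data matrix, n x d
Y = rng.normal(size=(n, 1))     # targets, n x 1
theta = np.zeros((d, 1))

# Gradient of the OLS objective: (2/n) X^T (X theta - Y), shape d x 1
grad = (2 / n) * X.T @ (X @ theta - Y)

# One gradient-descent update with learning rate eta
eta = 0.01
theta = theta - eta * grad
\end{verbatim}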
\subsection{Ridge regression}
Now, let's add in the regularization term, to get the ridge-regression
objective:
$$ J_{\text{ridge}}(\theta, \theta_0) = \frac{1}{n}\sum_{i = 1}^n\left(\theta^Tx^{(i)} + \theta_0 - y^{(i)}\right)^2 + \lambda\|\theta\|^2 \;\;.$$
Recall that in ordinary least squares, we finessed handling $\theta_0$
by adding an extra dimension of all 1's. In ridge regression,
we really do need to separate the parameter vector $\theta$ from the
offset $\theta_0$, and so, from the perspective of our general-purpose
gradient descent method, our whole parameter set $\Theta$ is defined
to be $\Theta = (\theta, \theta_0)$. We will go ahead and find the
gradients separately for each one:%
\note{Some passing familiarity with matrix
derivatives is helpful here. A foolproof way of computing them is to compute
partial derivative of $J$ with respect to each component $\theta_i$
of $\theta$. See Appendix~\ref{app:matrix_deriv} on matrix derivatives!}
\begin{align*}
\nabla_\theta J_\text{ridge}(\theta, \theta_0) & = \frac{2}{n}\sum_{i=1}^n
\left(\theta^T\ex{x}{i} + \theta_0 -
\ex{y}{i}\right) \ex{x}{i}
+ 2\lambda\theta \\
\frac{\partial J_\text{ridge}(\theta, \theta_0)}{\partial \theta_0} & =
\frac{2}{n}\sum_{i=1}^n
\left(\theta^T\ex{x}{i} + \theta_0 -
\ex{y}{i} \right) \;\;.
\end{align*}
Note that $\nabla_\theta J_\text{ridge}$ will be of shape $d \times 1$ and
$\partial J_\text{ridge}/\partial \theta_0$ will be a scalar since
we have separated $\theta_0$ from $\theta$ here.
\question{Convince yourself that the dimensions of all these
quantities are correct, under the assumption that $\theta$ is $d
\times 1$. How does $d$ relate to $m$ as discussed for $\Theta$ in the
previous section?
}
\question{Compute $\nabla_\theta \norm{\theta}^2$ by finding the
vector of partial derivatives $(\partial \norm{\theta}^2 / \partial
\theta_1, \ldots, \partial \norm{\theta}^2 / \partial
\theta_d)$. What is the shape of $\nabla_\theta \norm{\theta}^2$?
}
\question{Compute $\nabla_\theta J_\text{ridge}(
\theta^T x + \theta_0, y)$ by finding the
vector of partial derivatives $(\partial J_\text{ridge}(
\theta^T x + \theta_0, y)/ \partial \theta_1, \ldots,
\partial J_\text{ridge}(
\theta^T x + \theta_0, y) / \partial
\theta_d)$.
}
\question{Use these last two results to verify our derivation above.}
Putting everything together, our gradient descent algorithm for ridge
regression becomes\index{gradient descent!applied to ridge regression}
\begin{codebox}
\Procname{$\proc{RR-Gradient-Descent}(\theta_{\it init}, \theta_{0
{\it init}},\eta,\epsilon)$}
\li $\theta^{(0)} \gets \theta_{\it init}$
\li $\theta_0^{(0)} \gets \theta_{0 {\it init}}$
\li $t \gets 0$
\li \Repeat
\li $t \gets t+1$
\li $\theta^{(t)} = \theta^{(t-1)} - \eta\left(\frac{1}{n}\sum_{i=1}^n
\left({\ex{\theta}{t-1}}^T \ex{x}{i} + \ex{\theta_0}{t-1} -
\ex{y}{i}\right) \ex{x}{i}
+ \lambda\ex{\theta}{t-1}
\right)$
\li $\theta_0^{(t)} = \theta_0^{(t-1)} - \eta\left(\frac{1}{n}\sum_{i=1}^n
\left({\ex{\theta}{t-1}}^T \ex{x}{i} + \ex{\theta_0}{t-1} -
\ex{y}{i} \right)
\right)$
\li \Until $\left| J_{\text{ridge}}(\theta^{(t)},\theta_0^{(t)}) - J_{\text{ridge}}(\theta^{(t-1)},
\theta_0^{(t-1)}) \right| <\epsilon$
\li \Return $\theta^{(t)},\theta_0^{(t)}$
\end{codebox}
\question{Is it okay that $\lambda$ doesn't appear in line 7?}
\question{Is it okay that the 2's from the gradient definitions don't
appear in the algorithm?}
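A direct (hypothetical) Python translation of $\proc{RR-Gradient-Descent}$, mirroring the constants used in the pseudocode above, might look like this:
\begin{verbatim}
import numpy as np

def ridge_gd(X, Y, lam, eta, epsilon, max_iter=100000):
    # X is n x d, Y is n x 1; theta_0 is kept separate from theta
    # and is not regularized.
    n, d = X.shape
    theta, theta_0 = np.zeros((d, 1)), 0.0
    J = lambda th, th0: (np.mean((X @ th + th0 - Y) ** 2)
                         + lam * np.sum(th ** 2))
    for t in range(max_iter):
        err = X @ theta + theta_0 - Y              # n x 1 residuals
        new_theta = theta - eta * ((1 / n) * X.T @ err + lam * theta)
        new_theta_0 = theta_0 - eta * np.mean(err)
        if abs(J(new_theta, new_theta_0) - J(theta, theta_0)) < epsilon:
            return new_theta, new_theta_0
        theta, theta_0 = new_theta, new_theta_0
    return theta, theta_0
\end{verbatim}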
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Stochastic gradient descent}
\label{sec-sgd}
When the form of the gradient is a sum, rather than take one big(ish)
step in the direction of the gradient, we can, instead,
randomly\note{The word ``stochastic'' means probabilistic, or random;
so does ``aleatoric,'' which is a very cool word. Look up
aleatoric music, sometime.}
select one term of the sum, and take a very small step in that
direction. This seems sort of crazy, but remember that all the
little steps would average out to the same direction as the big step
if you were to stay in one place. Of course, you're not staying in
that place, so you move, in expectation, in the direction of the
gradient.
Most objective functions in machine learning can end up being written
as a sum over data points, in which case, stochastic gradient descent
({\sc sgd}) is implemented by picking a data point randomly out of the
data set, computing the gradient as if there were only that one point
in the data set, and taking a small step in the negative direction.
Let's assume our objective has the form
\[
f(\Theta) = \sum_{i = 1}^n f_i(\Theta)
\;\;,
\]
where $n$ is the number of data points used in the objective (and this
may be different from the number of points available in the whole data
set). Here is pseudocode for applying {\sc sgd} to such an objective $f$;
it assumes we know the form of $\nabla_\Theta f_i$ for all $i$ in
$1\ldots n$:
\begin{codebox}
\Procname{$\proc{Stochastic-Gradient-Descent}(\Theta_{\it init}, \eta, f,
\nabla_\Theta f_1, \ldots, \nabla_\Theta f_n, T)$}
\li $\Theta^{(0)} \gets \Theta_{\it init}$
\li \For $t \gets 1$ \To $T$
\li \Do
randomly select $i \in \{1, 2, \dots, n\}$
\li $\Theta^{(t)} = \Theta^{(t-1)} - \eta(t) \, \nabla_\Theta f_i(\Theta^{(t-1)})$
\End
\li \Return $\Theta^{(t)}$
\end{codebox}
Note that now instead of a fixed value of $\eta$, $\eta$ is indexed by
the iteration of the algorithm, $t$.\index{stochastic gradient descent}
Choosing a good stopping criterion can be a little trickier for {\sc sgd} than
for traditional gradient descent. Here we've simply chosen
to stop after a fixed number of iterations $T$.
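A minimal Python sketch of $\proc{Stochastic-Gradient-Descent}$ (our own hypothetical code, with $\eta(t) = 1/t$ as a placeholder step-size rule, discussed below):
\begin{verbatim}
import numpy as np

def sgd(theta_init, f_grads, T, eta=lambda t: 1.0 / t):
    # f_grads is a list of per-point gradient functions grad f_i;
    # eta(t) is a step size that decreases with the iteration t.
    theta = np.array(theta_init, dtype=float)
    n = len(f_grads)
    rng = np.random.default_rng()
    for t in range(1, T + 1):
        i = rng.integers(n)          # pick one data point at random
        theta = theta - eta(t) * f_grads[i](theta)
    return theta
\end{verbatim}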
For {\sc sgd} to converge to a local optimum point as $t$ increases, the
learning rate has to decrease as a function of time. The next result shows one
learning rate sequence that works.
\begin{theorem}
If $f$ is convex, and $\eta(t)$ is a sequence satisfying
$$ \sum_{t = 1}^{\infty}\eta(t) = \infty \;\;\text{and}\;\;
\sum_{t = 1}^{\infty}\eta(t)^2 < \infty \;\;,$$
then SGD converges {\em with probability one}%
\note{We have left out some gnarly conditions in this theorem. Also, you
can learn more about the subtle difference between ``with probability
one'' and ``always'' by taking an advanced probability course.}
to the optimal $\Theta$.\index{stochastic gradient descent!convergence}
\end{theorem}
Why these two conditions? The intuition is that the first condition,
on $\sum \eta(t)$, is needed to allow for the possibility of an
unbounded potential range of exploration, while the second condition,
on $\sum\eta(t)^2$, ensures that the learning rates get smaller and
smaller as $t$ increases.
One ``legal'' way of setting the learning rate is to make $\eta(t) = 1/t$, but
people often use rules that decrease more slowly, and so don't
strictly satisfy the criteria for convergence.
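To see why $\eta(t) = 1/t$ is ``legal,'' note that the harmonic series diverges while the sum of inverse squares converges:
\[
\sum_{t = 1}^{\infty}\frac{1}{t} = \infty
\;\;\text{and}\;\;
\sum_{t = 1}^{\infty}\frac{1}{t^2} = \frac{\pi^2}{6} < \infty \;\;.
\]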
\question{
If you start a long way from the optimum, would making $\eta(t)$
decrease more slowly tend to make you move more quickly or more slowly
to the optimum?
}
There are multiple intuitions for why {\sc sgd} might be a better choice
algorithmically than regular {\sc gd} (which is sometimes called {\em
batch} {\sc gd} ({\sc bgd})):
\begin{itemize}
\item {\sc bgd} typically requires computing some quantity over every
data point in a data set. {\sc sgd} may perform well after visiting only
some of the data. This behavior can be useful for very large data sets,
because of the resulting savings in runtime and memory.
\item If your $f$ is actually non-convex, but has many shallow local
optimum points that might trap {\sc bgd}, then taking {\em samples} from the
gradient at some point $\Theta$ might ``bounce'' you around the
landscape and away from the local optimum points.
\item Sometimes, optimizing $f$ really well is not what we want to do,
because it might overfit the training set; so, in fact, although
{\sc sgd} might not get lower training error than {\sc bgd}, it
might result in lower test error.
\end{itemize}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "top"
%%% End: