\chapter{Regression}
\label{chap-regression}
{\em Regression}\note{``Regression,'' in common parlance, means moving
backwards. But this is forward progress!} is an important
machine-learning problem that provides a good starting point for
diving deeply into the field.
\section{Problem formulation}
A {\em hypothesis} $h$ is employed as a
model for solving the regression problem, in that it maps inputs $x$
to outputs $y$,
$$ x \rightarrow \boxed{h} \rightarrow y \;\;,$$
where $x \in \R^d$ (i.e., a length $d$ column vector of real numbers),
and $y \in \mathbb{R}$ (i.e., a real number)\index{real numbers}.
Real life rarely gives us vectors of real numbers; the $x$ we really
want to take as input is usually something like a song, image, or person.
In that case, we'll have to define a function $\varphi(x)$, whose
range is $\R^d$, where $\varphi$ represents
{\em features}\index{features} of $x$, like a person's height or the amount of bass in
a song, and then let $h: \varphi(x) \rightarrow \R$.
In much of the following, we'll omit explicit mention of $\varphi$ and
assume that the $\ex{x}{i}$ are in $\R^d$, but you should always have
in mind that some additional process was almost surely required to go
from the actual input examples to their feature representation, and
we'll talk a lot more about features later in the course.
Regression\index{regression} is a {\em supervised learning} problem, in which we are given a training dataset of the form
$$
\dataTrain = \left\{\left(\ex{x}{1}, \ex{y}{1}\right), \dots,
\left(\ex{x}{n}, \ex{y}{n}\right)\right\}\;\;,
$$
which gives examples of input values $\ex{x}{i}$ and the output values
$\ex{y}{i}$ that should be associated with them.
Because $y$ values are real-valued, our hypotheses will have the form
$$ h: \mathbb{R}^d \rightarrow \mathbb{R} \;\;.$$
This is a good framework when we want to predict a numerical quantity,
like height, stock value, etc., rather than to divide the inputs into
discrete categories.
What makes a hypothesis useful? That it works well on {\em new} data;
that is, that it makes good predictions on examples it hasn't
seen.\note{My favorite analogy is to problem sets. We evaluate a
student's ability to {\em generalize} by putting questions on
the exam that were not on the homework (training set).}
But we don't know exactly what data this hypothesis might be tested on
when we use it in the real world. So, we have to {\em{assume}} a
connection between the training data and testing data -- typically, the assumption is that
they (the training and testing data) are drawn independently from the same probability distribution.
To make this discussion more concrete, we have to provide a {\em loss
function}, to say how unhappy we are when we guess an output $g$
given an input $x$ for which the desired output was $a$.
Given a training set $\data_n$ and a hypothesis $h$ with parameters $\Theta,$ we can define the
{\em training error} of $h$ to be the average loss on the training data:
\begin{eqnarray}\label{erm}
\trainerr(h; \Theta) = \frac{1}{n}\sum_{i = 1}^n
\mathcal{L}(h(\ex{x}{i};\Theta), \ex{y}{i})\;\;.
\end{eqnarray}
The training error of $h$ gives us some idea of how well it
characterizes the relationship between $x$ and $y$ values in our data,
but it isn't the quantity that we {\em most} care about. What we most care about is
{\em test error}:
\begin{eqnarray*}
\mathcal{E}(h) = \frac{1}{n'}\sum_{i = n + 1}^{n + n'}
\mathcal{L}(h(\ex{x}{i}), \ex{y}{i})
\end{eqnarray*}
on $n'$ new examples that were not used in the process of finding the
hypothesis.\note{It might be worthwhile to stare at the two errors and think about what the difference is. For example, notice that $\Theta$ is no longer a variable in the test error; this is because, in evaluating the test error, the parameters have already been ``picked'' (fixed).}
For now, we will try to find a hypothesis with small training error
(later, with some added criteria) and try to make some design choices
so that it {\em{generalizes well}}
to new data, meaning that it also has a small {\em test error}.
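To make these definitions concrete, here is a small, purely illustrative Python sketch (the hypothesis and the data are made up) that computes the average squared loss of a fixed hypothesis on a training set and on a separate test set:
\begin{verbatim}
import numpy as np

# A fixed hypothesis h(x) = 2*x + 1 (chosen arbitrarily for illustration).
def h(x):
    return 2.0 * x + 1.0

# Squared loss between a guess g and an actual value a.
def loss(g, a):
    return (g - a) ** 2

# Made-up one-dimensional training and test data.
x_train = np.array([0.0, 1.0, 2.0, 3.0]); y_train = np.array([1.2, 2.9, 5.1, 7.2])
x_test  = np.array([4.0, 5.0]);           y_test  = np.array([9.3, 10.8])

train_err = np.mean([loss(h(x), y) for x, y in zip(x_train, y_train)])
test_err  = np.mean([loss(h(x), y) for x, y in zip(x_test,  y_test)])
print(train_err, test_err)
\end{verbatim}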
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Regression as an optimization problem}
\label{sec-reg_optim}
Given data, a loss function, and a hypothesis class, we
need a method for finding a good hypothesis in the class. One of the
most general ways to approach this problem is by framing the machine
learning problem as an optimization problem. One reason for taking
this approach is that there is a rich area of math and algorithms
studying and developing efficient methods for solving optimization
problems, and lots of very good software implementations of these
methods. So, if we can turn our problem into one of these problems,
then there will be a lot of work already done for us!
We begin by writing down an \emph{objective function} $J(\Theta)$,
where $\Theta$ stands for {\em all} the parameters in our model (i.e., the complete collection of parameter values that specifies a hypothesis).
%Note that we will sometimes write $J(\theta, \theta_0)$
We often write $J(\Theta; \data)$ to make clear the dependence on\note{Don't be too perturbed by the
semicolon where you expected to see a comma! It's a math way of
saying that we are mostly interested in this as a function of the
arguments before the ``;'', but we should remember that there's a
dependence on the stuff after it, as well.}
the data $\data$.
The objective function describes how we feel about
possible hypotheses $\Theta$: we will generally look for values for
parameters $\Theta$ that minimize the objective function:%
\note{You can think about $\Theta^*$ here as ``the theta that
minimizes $J$'': $\argmin{x} f(x)$
means the value of $x$ for which $f(x)$ is the smallest. Sometimes
we write $\argmin{x \in {\cal X}} f(x)$ when we want to explicitly
specify the set ${\cal X}$ of values of $x$ over which we want to
minimize.
}
\[ \Theta^* = \arg\min_{\Theta} J(\Theta)\;\;. \]
A very common form for a machine-learning objective is
%
\begin{equation}
J(\Theta) = \left(\frac{1}{n} \sum_{i=1}^n
\underbrace{\mathcal{L}(h(\ex{x}{i}; \Theta),
\ex{y}{i})}_\text{loss}\right) + \underbrace{\lambda}
_\text{non-negative constant} {R(\Theta)}.
%
\label{eq:ml_objective_loss}
\end{equation}
%
The \emph{loss} tells us how unhappy we are about the prediction
$h(\ex{x}{i}; \Theta)$ that $\Theta$ makes for $(\ex{x}{i},
\ex{y}{i})$. Minimizing this loss makes the prediction better. The
\emph{regularizer}\index{regularizer} is an additional term that encourages the
prediction to remain general, and the constant $\lambda$ adjusts the
balance between reproducing seen examples, and being able to
generalize to unseen examples. We will return to discuss this
balance, and more about the idea of regularization, in
Section~\ref{sec-regularization}.
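As a sketch of how Eq.~\ref{eq:ml_objective_loss} might look in code, the function below takes the hypothesis, the loss, and the regularizer as arguments, since those are modeling choices we haven't yet made; all of the names here are illustrative, not part of any particular library:
\begin{verbatim}
import numpy as np

def objective(Theta, data, h, loss, R, lam):
    # data is a list of (x, y) pairs; h(x, Theta) is the prediction for x;
    # loss compares a prediction to the true y; R is the regularizer; lam >= 0.
    avg_loss = np.mean([loss(h(x, Theta), y) for (x, y) in data])
    return avg_loss + lam * R(Theta)
\end{verbatim}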
\section{Linear regression}
To make this discussion more concrete, we have to provide a hypothesis
class and a loss function.
We will begin by picking a class of hypotheses $\hclass$ that we think
might provide a good set of possible models of the relationship
between $x$ and $y$ in our data. We will start with a very simple
class of {\em linear} hypotheses for regression\index{regression!linear regression}. It is both simple to
study and very powerful, and will serve as the basis for many other
important techniques (even neural networks!).
In linear regression, the set $\hclass$ of hypotheses\index{hypothesis!linear regression} has the form
\begin{equation}
h(x;\theta, \theta_0) = \theta^Tx + \theta_0
\;\;,
\label{eq:linear_reg_hypothesis}
\end{equation}
%
with model parameters $\Theta = (\theta, \theta_0)$. In one
dimension ($d = 1$) this has the same familiar slope-intercept form as $y = mx + b$; in higher dimensions, this model describes a hyperplane.
We define a {\em loss function}
to describe how to evaluate
the quality of the predictions our hypothesis is making, when compared
to the ``target'' $y$ values in the data set. The choice of loss
function is part of modeling your domain. In the absence of
additional information about a regression problem, we typically use
{\em squared loss}:
\[\mathcal{L}(g, a) = (g - a)^2\;\;,\]
where $g=h(x)$ is our ``guess'' from the hypothesis, and $a$ is the ``actual'' observation (in other words, $a$ here plays the same role as $y$). With this choice of squared loss, the average loss as defined in Eq.~\ref{erm} becomes the so-called {\em mean squared error (MSE)}, which we'll study closely very soon.
The squared loss penalizes guesses that are too high the
same amount as it penalizes guesses that are too low, and has a good
mathematical justification in the case that your data are generated
from an underlying linear hypothesis with Gaussian-distributed\note{We won't get into the details of the Gaussian distribution in our class, but it's one of the most important distributions and well worth studying closely at some point. One obvious fact about the Gaussian is that it's symmetric; this is in fact one of the reasons squared loss works well under Gaussian assumptions, as the loss is also symmetric.}\
noise added to the $y$ values. But there are applications
in which other losses would be better, and much of the framework we
discuss can be applied to different loss functions, although this one
has a form that also makes it particularly computationally convenient.
Our objective in linear regression will be to find a hyperplane that
goes as close as possible, on average, to all of our training data.
Applying the general optimization framework to the linear regression
hypothesis class of
Eq.~\ref{eq:linear_reg_hypothesis} with squared loss and no regularization,
our objective is to find values for
$\Theta = (\theta, \theta_0)$ that minimize the MSE:
\begin{equation}
J(\theta, \theta_0) = \frac{1}{n}\sum_{i =
1}^n\left(\theta^Tx^{(i)} + \theta_0 - \ex{y}{i}\right)^2 \;\;,
\label{eq:reg_mse_withconst}
\end{equation}
resulting in the solution:
\begin{equation}
\theta^*, \theta_0^* = \argmin{\theta, \theta_0} J(\theta,
\theta_0)\;\;.
\label{olsObjective}
\end{equation}
For
one-dimensional data ($d=1$), this becomes the familiar problem of
fitting a line to data. For $d>1$, this
hypothesis may be visualized as a $d$-dimensional hyperplane\index{hyperplane}
embedded in a $(d+1)$-dimensional space (that consists of the input
dimension and the $y$ dimension).
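As a small numerical sketch (with made-up data), the linear hypothesis of Eq.~\ref{eq:linear_reg_hypothesis} and the MSE objective of Eq.~\ref{eq:reg_mse_withconst} can be written down directly:
\begin{verbatim}
import numpy as np

def linear_h(x, theta, theta_0):
    # The linear hypothesis theta^T x + theta_0, with x and theta length-d arrays.
    return theta @ x + theta_0

def mse_objective(theta, theta_0, X, Y):
    # X: n x d array whose rows are the x^(i); Y: length-n array of the y^(i).
    predictions = X @ theta + theta_0
    return np.mean((predictions - Y) ** 2)

# Tiny made-up data set with d = 2 and n = 3, and an arbitrary parameter guess.
X = np.array([[1.0, 2.0], [2.0, 0.5], [0.0, 1.0]])
Y = np.array([3.0, 2.5, 1.0])
print(mse_objective(np.array([0.5, 1.0]), 0.2, X, Y))
\end{verbatim}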
For example, in the left plot
below, we can see data points with labels $y$ and input dimensions
$x_1$ and $x_2$. In the right plot below, we see the result of fitting
these points with a two-dimensional plane that resides in three
dimensions. We interpret the plane as representing a function that
provides a $y$ value for any input $(x_1, x_2)$.
\includegraphics[width=0.9\textwidth]{figures/regression_ex1_plane1.png}
A richer class of hypotheses can be obtained by performing a
non-linear feature transformation before doing the regression, as
we will later see (in Chapter~\ref{chap-features}), but it will still
end up that we have to solve a linear regression problem.
% Here, we begin with the foundations, studying {\em linear} regression, including
% its framing as an optimization problem (Section~\ref{sec-reg_optim}),
% how to use {\em regularization} to keep solutions generalizable
% (Section~\ref{sec-regularization}), and how to evaluate {\em
% learning algorithms} which solve the regression problem
% (Section~\ref{sec-reg_learn_alg}).
\section{A gloriously simple linear regression algorithm}
Okay! Given the objective in Eq.~\ref{eq:reg_mse_withconst}, how
can we find good values of $\theta$ and $\theta_0$? We'll study
several general-purpose, efficient, interesting algorithms. But
before we do that, let's start with the simplest one we can think of:
{\em guess a whole bunch ($k$) of different values of $\theta$ and
$\theta_0$}, see which one has the smallest error on the training
set, and return it.
\begin{codebox}
\Procname{$\proc{Random-Regression}(\data, k)$}
\li For $i$ in $1\dots k$: Randomly generate hypothesis $\ex{\theta}{i},
\ex{\theta_0}{i}$
\li Let $i = \argmin{i} J(\ex{\theta}{i}, \ex{\theta_0}{i}; \data)$
\li Return $\ex{\theta}{i}, \ex{\theta_0}{i}$
\end{codebox}
This seems kind of silly, but it's a learning algorithm, and it's not
completely useless. \index{regression!random regression}
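Here is one way this procedure might look in Python (a sketch, not part of any library; the interval from which the random parameters are drawn is an arbitrary choice made for illustration):
\begin{verbatim}
import numpy as np

def random_regression(X, Y, k, seed=0):
    # X: n x d array of inputs (rows are examples); Y: length-n array of outputs.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    best_err, best_theta, best_theta_0 = np.inf, None, None
    for _ in range(k):
        # Randomly generate a hypothesis (the sampling interval is arbitrary).
        theta = rng.uniform(-5.0, 5.0, size=d)
        theta_0 = rng.uniform(-5.0, 5.0)
        err = np.mean((X @ theta + theta_0 - Y) ** 2)   # training error J
        if err < best_err:
            best_err, best_theta, best_theta_0 = err, theta, theta_0
    return best_theta, best_theta_0
\end{verbatim}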
\question{If your data set has $n$ data points, and the dimension of the
$x$ values is $d$, what is the size of an individual $\ex{\theta}{i}$?}
\question{How do you think increasing the number of guesses $k$ will change the training
error of the resulting hypothesis?}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Analytical solution: ordinary least squares}
One very interesting aspect of the problem of finding a linear hypothesis
that minimizes mean squared error is that we can find a
closed-form formula for the answer!\note{What does ``closed form''
mean? Generally, that it involves direct evaluation of a
mathematical expression using a fixed number of ``typical''
operations (like arithmetic operations, trig functions, powers,
etc.). So equation~\ref{olsObjective} is not in closed form,
because it's not at all clear what operations one needs to perform
to find the solution.} This general problem is often
called the {\em ordinary least squares} ({\sc ols}) problem.
Everything is easier to deal with if we assume that all of the $\ex{x}{i}$
have been augmented with an extra input dimension (feature) that
always has value 1, so that they are in $d+1$ dimensions, and rather
than having an explicit $\theta_0$, we let it be the last element of
our $\theta$ vector, so that we have, simply,
\[y = \theta^T x\;\;.\]
In this case, the objective becomes
\begin{equation}
J(\theta) = \frac{1}{n}\sum_{i =
1}^n\left(\theta^Tx^{(i)} - \ex{y}{i}\right)^2 \;\;.
\label{eq:reg_mse}
\end{equation}
\question{
Stop and prove to yourself that adding that extra feature
with value 1 to every input vector and getting rid of the $\theta_0$
parameter is equivalent to our original model.
}
We approach this just like a minimization problem from calculus
homework: take the derivative of $J$ with respect to $\theta$, set it
to zero, and solve for $\theta$. There are additional steps
required, to check that the resulting $\theta$ is a minimum (rather
than a maximum or an inflection point) but we won't work through that
here. It is possible to approach this problem by:
\begin{itemize}
\item Finding $\partial{J}/\partial{\theta_k}$ for $k$ in
\anchorednote{$1, \ldots, d$,}{We will use $d$ here for the total number of features in
each $\ex{x}{i}$, including the added 1.}
\item Constructing a set of $k$ equations of the form
$\partial{J}/\partial{\theta_k} = 0$, and
\item Solving the system for values of $\theta_k$.
\end{itemize}
That works just fine. To get practice for applying techniques like
this to more complex problems, we will work through a more compact
(and cool!) matrix view. Along the way, it will be helpful to collect all of the derivatives in one
vector. In particular,
the
gradient of $J$ with respect to $\theta$ is the following column vector of length $d$:
\[
\nabla_\theta J =
\begin{bmatrix}
\partial J / \partial \theta_1 \\
\vdots \\
\partial J / \partial \theta_d
\end{bmatrix}.
\]
\question{Work through the next steps and check your answer against ours below.}
We can think of our training data in terms of matrices $X$ and $Y$,
where each column of $X$ is an example, and each ``column'' of $Y$ is
the corresponding target output value:
\begin{equation*}
X = \begin{bmatrix}x_1^{(1)} & \dots & x_1^{(n)} \\\vdots & \ddots &
\vdots \\x_d^{(1)} & \dots & x_d^{(n)}\end{bmatrix} \;\;\;
Y = \begin{bmatrix}y^{(1)} & \dots & y^{(n)}\end{bmatrix}\;\;.
\end{equation*}
\question{
What are the dimensions of $X$ and $Y$?
}
Most textbooks think of an individual example $\ex{x}{i}$ as
a row, rather than a column. So that we get an answer that will be
recognizable to you, we are going to define a new matrix and vector,
% $W$ and $T$, which are just transposes of our $X$ and $Y$, and
$\Xt$ and $\Yt$, which are just transposes of our $X$ and $Y$, and
then work with them:
$$ \Xt = X^T = \begin{bmatrix}x_1^{(1)} & \dots & x_d^{(1)}\\\vdots & \ddots & \vdots\\x_1^{(n)} & \dots & x_d^{(n)}\end{bmatrix} \;\;
\Yt = Y^T = \begin{bmatrix}y^{(1)}\\\vdots\\y^{(n)}\end{bmatrix} \;\;.$$
\question{
What are the dimensions of $\Xt$ and $\Yt$?
}
Now we can write
\[ J(\theta) = \frac{1}{n}\underbrace{(\Xt\theta - \Yt)^T}_{1 \times
n}\underbrace{(\Xt\theta - \Yt)}_{n \times 1} =
\frac{1}{n}\sum_{i=1}^n \left(\left(\sum_{j=1}^d \Xt_{ij}\theta_j
\right) - \Yt_i\right)^2\]
and using facts about matrix/vector calculus, we get \note{You should
be able to verify this by doing a simple (say, $2 \times 2$) example by hand.}
\begin{equation}
\nabla_{\theta}J = \frac{2}{n}\underbrace{\Xt^T}_{d \times n}\underbrace{(\Xt\theta - \Yt)}_{n \times 1}\;\;.
% \label{eq:reg_gd_deriv}
\end{equation}
See Appendix~\ref{app:matrix_deriv} for a nice way to think about finding this
derivative.
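Beyond the by-hand $2\times 2$ check suggested in the margin, a quick numerical sanity check compares this formula against a finite-difference approximation of the gradient; this is a purely illustrative sketch with random data:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 3
Xt = rng.normal(size=(n, d))     # rows are examples, as in the tilde-X layout
Yt = rng.normal(size=(n, 1))
theta = rng.normal(size=(d, 1))

def J(th):
    residual = Xt @ th - Yt
    return np.sum(residual ** 2) / n

# Analytic gradient from the formula above.
grad = (2.0 / n) * Xt.T @ (Xt @ theta - Yt)

# Central finite differences, one coordinate at a time.
eps = 1e-6
fd = np.zeros_like(theta)
for j in range(d):
    e = np.zeros_like(theta)
    e[j] = eps
    fd[j] = (J(theta + e) - J(theta - e)) / (2 * eps)

print(np.max(np.abs(grad - fd)))   # should be very close to zero
\end{verbatim}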
Setting $ \nabla_{\theta}J$ to 0 and solving, we get:
\begin{align*}
\frac{2}{n}\Xt^T(\Xt\theta - \Yt) & = 0 \\
\Xt^T\Xt\theta - \Xt^T \Yt & = 0 \\
\Xt^T\Xt\theta & = \Xt^T \Yt \\
\theta & = (\Xt^T\Xt)^{-1} \Xt^T \Yt \\
\end{align*}
And the dimensions work out!
$$ \theta = \underbrace{\left(\Xt^T\Xt\right)^{-1}}_{d \times d}\underbrace{\Xt^T}_{d \times n}\underbrace{\Yt}_{n \times 1} $$
So, given our data, we can directly compute the linear regression that
minimizes mean squared error. That's pretty awesome!
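In numpy, with examples as rows (the $\Xt$, $\Yt$ layout) and a constant-1 column appended so that $\theta_0$ is folded into $\theta$, the closed-form solution is only a few lines. This is a sketch with made-up data; numerically it is better to call a linear solver than to form the inverse explicitly, which is what \texttt{np.linalg.solve} does here.
\begin{verbatim}
import numpy as np

def ols(X, Y):
    # X: n x d array (rows are examples); Y: length-n array of target values.
    Xt = np.hstack([X, np.ones((X.shape[0], 1))])   # append the constant-1 feature
    # theta = (Xt^T Xt)^{-1} Xt^T Y, computed via a linear solve, not an inverse.
    return np.linalg.solve(Xt.T @ Xt, Xt.T @ Y)

# Made-up data in which y is roughly 3*x_1 - 2*x_2 + 1.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
Y = np.array([1.0, 4.1, -0.9, 2.0, 5.1])
print(ols(X, Y))   # approximately [ 3, -2,  1]
\end{verbatim}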
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Regularization}
\label{sec-regularization}
The objective function of Eq.~\ref{eq:ml_objective_loss} balances
(training-data) memorization, induced by the {\em loss} term, with generalization,
induced by the {\em regularization} term. Here, we address the need
for regularization specifically for linear regression, and show how
this can be realized using one popular regularization technique called {\em ridge regression}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Regularization and linear regression}
If all we cared about was finding a hypothesis with small loss on the
training data, we would have no need for regularization, and could
simply omit the second term in the objective. But remember that our
ultimate goal is to {\em perform well on input values that we haven't
trained on!} It may seem that this is an impossible task, but
humans and machine-learning methods do this successfully all the
time. What allows {\em generalization} to new input values is a
belief that there is an underlying regularity that governs both the
training and testing data. One way to
describe an assumption about such a regularity is by choosing a
limited class of possible hypotheses. Another way to do this is to
provide smoother guidance, saying that, within a hypothesis class, we
prefer some hypotheses to others. The regularizer articulates this
preference and the constant $\lambda$ says how much we are willing to
trade off loss on the training data versus preference over hypotheses.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{comment}
This trade-off is illustrated in the figure below. Hypothesis $h_1$
has zero training loss, but is very complicated. Hypothesis $h_2$
mis-classifies two points, but is very simple. In absence of other
beliefs about the solution, it is often better to prefer that the
solution be ``simpler,'' and so we might prefer $h_2$ over $h_1$,
expecting it to perform better on future examples drawn from this same
distribution. \note{To establish some vocabulary, we say that
$h_1$ is {\em overfit} to the training data.} Another nice way of
thinking about regularization is that we would like to prevent our
hypothesis from being too dependent on the particular training data
that we were given: we would like for it to be the case that if the
training data were changed slightly, the hypothesis would not change
by much.
\begin{center}
\begin{tikzpicture}
\draw [->] (-4,0) -- (4,0);
\draw [->] (0,-4) -- (0,4);
\draw[red,thick] (-2.5,-3) .. controls (-4,1) and (-3.8,1.2) .. (-2.4,-.5)
.. controls (-2,-1) .. (-1,-1) .. controls (-.5,-1) .. (.5,-2)
.. controls (.65, -2.1) .. (.7,-1.8) .. controls (.7,-1.5)
.. (.6,-1.2) .. controls (.2,-.4) .. (1.2, -.3)
.. controls (1.6,-.2) .. (1,4);
\foreach \Point in {(-3.4, .3), (2.5, .2), (-1.5,-2), (-1,-3),(.7,-.7),
(3, -2), (2.5, -3.2), (-.7,-1.2), (0.1,-1.8), (1.5, -1.5)}{
\draw \Point circle[radius=2pt];
}
\foreach \Point in {(.5,-1.7),(-.2,.3),(-2.8,.2),(-2.4,-.3),(-1,2.7),
(-1.5, 2), (-1.6, .9), (-3.8,-1), (-3.5,1.2),(.5, 1.2)}{
\fill \Point circle[radius=2pt];
}
\draw[blue, thick] (-3.5,-1.8) -- (3.5,1.3);
\node[right,blue] at (3.5,1.3) {$h_2$};
\node[below,right,red] at (1,4) {$h_1$};
\end{tikzpicture}
\end{center}
\end{comment}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
For example, consider what happens when $d=2,$ and $x_2$ is highly
correlated with $x_1$, meaning that the data look like a line, as
shown in the left panel of the figure below. Because the inputs (nearly) all lie along a single line, many different hyperplanes pass (nearly) equally close to all of the training points, and thus there isn't a unique best hyperplane.\note{Sometimes there's technically a
unique best hyperplane, but only because of noise.} Such
correlations happen often in real-life data, because of underlying
common causes; for example, across a population, the height of people
may depend on both age and amount of food intake in the same way.
This is especially the case when there are many feature dimensions
used in the regression. Mathematically, this leads to $\Xt^T\Xt$
being close to singular, so that $(\Xt^T\Xt)^{-1}$ is undefined or has huge
values, resulting in unstable models (see the middle panel of the figure
and note the range of the $y$ values---the slope is huge!):
\includegraphics[width=1.12\textwidth]{figures/regression_ex2_plane1.png}
% (* mathematica: effect of regularization *)
% npts = 40;
% SeedRandom[3];
% x1 = RandomReal[{-3, 3}, {npts}];
% x2 = 1.2 * x1 + 0.3;
% noise = RandomReal[{-1.5, 1.5}, {npts}];
% yv = x1 + 1.4 * x2 + 1 + noise;
% pts = Transpose@{x1, x2, yv};
%
% regmodel =
% Fit[pts, {1, Subscript[x, 1], Subscript[x, 2]}, {Subscript[x, 1],
% Subscript[x, 2]}, FitRegularization -> {Tikhonov, 0.1}]
%
% model =
% LinearModelFit[
% pts, { Subscript[x, 1], Subscript[x, 2]}, { Subscript[x, 1],
% Subscript[x, 2]}]
%
% p1 = ListPointPlot3D[ pts,
% AxesLabel -> {Subscript[x, 1], Subscript[x, 2], y} ];
% p2 = ListPlot3D[ pts ];
% p3 = Plot3D[
% model[Subscript[x, 1], Subscript[x, 2]], {Subscript[x, 1], -3,
% 3}, {Subscript[x, 2], -3, 3}, PlotStyle -> Opacity[0.2]
% ];
% p5 = Plot3D[
% regmodel, {Subscript[x, 1], -3, 3}, {Subscript[x, 2], -3, 3},
% PlotStyle -> Opacity[0.2]];
% p4 = Show[{p1, p3}];
% p6 = Show[{p1, p5}];
% gr1 = GraphicsRow[{p1, p4, p6}, ImageSize -> Large]
% Export["regression_ex2_plane1.png", gr1, ImageResolution -> 1200]
A common strategy for specifying a \emph{regularizer} is to use the form
\[ R(\Theta) = \norm{\Theta - \Theta_{\it prior}}^2 \]
when we have some idea in advance that $\Theta$ ought to be near some
value $\Theta_{\it prior}$.%
\note{Learn about Bayesian methods in machine learning to see the theory behind this and cool results!}
Here, the notion of distance is quantified by squaring the $l_2$ \textit{norm} of
the parameter vector: for any $d$-dimensional vector $v \in \mathbb{R}^d,$ the $l_2$ \textit{norm} of $v$ is defined as,
\[\|v\| = \sqrt{\sum_{i=1}^d |v_i|^2}\;\;. \]
In the absence of such knowledge a default is
to {\em regularize toward zero}:
\[ R(\Theta) = \norm{\Theta}^2\;\;. \]
%
When this is done in the example depicted above, the regression model
becomes stable, producing the result shown in the right-hand panel in
the figure. Now the slope is much more sensible.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Ridge regression}
\label{sec-ridge_regression}
There are some kinds of trouble we can get into in regression problems.
What if $\left(\Xt^T\Xt\right)$ is not invertible?
\question{Consider, for example, a situation where the data-set is
just the same point repeated twice:
$\ex{x}{1} = \ex{x}{2} = [1~~2]^T$.
What is $\Xt$ in this case? What is $\Xt^T\Xt$? What is $(\Xt^T\Xt)^{-1}$?
}
Another kind of problem is {\em overfitting}: we have formulated an
objective that is just about fitting the data as well as possible, but
we might also
want to {\em regularize} to keep the hypothesis from getting {\em too}
attached to the data.
We address both the problem of $\Xt^T\Xt$ not being invertible and
the problem of overfitting using a mechanism called {\em ridge
regression}. We add a regularization term $\|\theta\|^2$ to the
{\sc ols} objective, with a non-negative scalar value $\lambda$ to control the tradeoff
between the training error and the regularization term.
\question{Why do we emphasize the non-negativity of the scalar $\lambda?$ When we add a regularizer of the form $\|\theta\|^2$, what
is our most ``preferred'' value of $\theta$, in the absence of any
data?}
Here is the ridge regression objective function:\index{regression!ridge regression}
$$ J_{\text{ridge}}(\theta, \theta_0) = \frac{1}{n}\sum_{i = 1}^n\left(\theta^Tx^{(i)} + \theta_0 - y^{(i)}\right)^2 + \lambda\|\theta\|^2 $$
Larger $\lambda$ values pressure $\theta$ values to be near zero.
Note that we don't penalize $\theta_0$; intuitively, $\theta_0$ is
what ``floats'' the regression surface to the right level for the data
you have, and so you shouldn't make it harder to fit a data set where
the $y$ values tend to be around one million than one where they tend
to be around one. The other parameters control the orientation of the
regression surface, and we prefer it to have a not-too-crazy
orientation.
There is an analytical expression for the $\theta, \theta_0$ values
that minimize $J_\text{ridge}$, but it's a little bit more complicated
to derive than the solution for {\sc ols} because $\theta_0$ needs
special treatment. If we decide not to treat $\theta_0$ specially
(so we add a 1 feature to our input vectors as discussed above), then we get:
$$ \nabla_{\theta}J_\text{ridge} = \frac{2}{n}\Xt^T(\Xt\theta - \Yt) + 2
\lambda \theta\;\;.$$
Setting to 0 and solving, we get:\note{Remember that $I$ stands for
the identity matrix, a square matrix that has 1's along the diagonal and 0's
everywhere else.}
\begin{align*}
\frac{2}{n}\Xt^T(\Xt\theta - \Yt) + 2 \lambda \theta & = 0 \\
\frac{1}{n}\Xt^T\Xt\theta - \frac{1}{n}\Xt^T\Yt + \lambda \theta & = 0 \\
\frac{1}{n}\Xt^T\Xt\theta + \lambda \theta & = \frac{1}{n}\Xt^T\Yt \\
\Xt^T\Xt\theta + n \lambda \theta & = \Xt^T\Yt \\
(\Xt^T\Xt + n \lambda I)\theta & = \Xt^T\Yt \\
\theta & = (\Xt^T\Xt + n \lambda I)^{-1}\Xt^T\Yt
\end{align*}
Whew! So the solution is:
\begin{equation}
\theta_{\text{ridge}} = \left(\Xt^T\Xt + n\lambda I\right)^{-1}\Xt^T\Yt
\label{eq:ridge_regression_solution}
\end{equation}
Now, why is the term $\left(\Xt^T\Xt + n\lambda I\right)$\note{This is called
``ridge'' regression because we are adding a ``ridge'' of $n\lambda$
values along the diagonal of the matrix before inverting it.} invertible?
Explaining this requires some linear algebra. The matrix $\Xt^T\Xt$ is positive
semidefinite, which implies that its eigenvalues $\{\gamma_i\}_i$ are greater
than or equal to 0. The matrix $\Xt^T\Xt+n\lambda I$ has eigenvalues
$\{\gamma_i+n\lambda\}_i$ which are guaranteed to be strictly positive since
$\lambda>0$. Recalling that the determinant of a matrix is simply the product of its eigenvalues,
we get that $\det(\Xt^T\Xt + n\lambda I)>0$ and conclude that $\Xt^T\Xt + n\lambda I$ is invertible.
\question{Why is $\Xt^T\Xt$ positive semidefinite?}
\question{What is the dimension of $I$ in the equation above?}
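As a sketch in numpy (mirroring the {\sc ols} sketch earlier), Eq.~\ref{eq:ridge_regression_solution} becomes the following; the data below are the degenerate repeated-point example from the question above, used only to illustrate that the ridge solution is well defined even when $\Xt^T\Xt$ is singular:
\begin{verbatim}
import numpy as np

def ridge(Xt, Yt, lam):
    # Xt: n x d array (rows are examples, with a constant-1 feature appended if
    # theta_0 is folded into theta); Yt: length-n targets; lam: non-negative constant.
    n, d = Xt.shape
    A = Xt.T @ Xt + n * lam * np.eye(d)   # the "ridge" of n*lambda along the diagonal
    return np.linalg.solve(A, Xt.T @ Yt)

# The same point repeated twice: Xt^T Xt is singular, but the ridge system is not.
Xt = np.array([[1.0, 2.0], [1.0, 2.0]])
Yt = np.array([1.0, 1.0])
print(ridge(Xt, Yt, lam=0.01))
\end{verbatim}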
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Evaluating learning algorithms}\label{sec-reg_learn_alg}
In this section, we will explore how to evaluate supervised
machine-learning algorithms. We will study the special case of
applying them to regression problems, but the basic ideas of
validation, hyper-parameter selection, and cross-validation apply much
more
broadly.
We have seen how linear regression is a well-formed optimization
problem, which has an analytical solution when ridge regularization is
applied. But how can one choose the best amount of regularization, as
parameterized by $\lambda$? Two key ideas involve the evaluation of
the performance of a hypothesis, and a separate evaluation of the
algorithm used to produce hypotheses, as described below.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Evaluating hypotheses}
The performance of a given hypothesis $h$ may be evaluated by
measuring {\em test error} on data that was not used to train it.
\note{It's a bit funny to interpret the analytical
formulas given above for $\theta$ as ``training,'' but later when we employ more statistical methods, ``training'' will be a meaningful concept.}
%
Given a training set $\data_n,$ a regression hypothesis $h$, and if we choose squared loss, we can define the OLS {\em{training error}} of $h$ to be the mean square error between its predictions and the expected outputs:
\begin{eqnarray*}
\trainerr(h) = \frac{1}{n}\sum_{i = 1}^{n} \left[ h(\ex{x}{i}) - \ex{y}{i} \right]^2
\;\;.
\end{eqnarray*}
Test error captures the performance of $h$ on unseen data, and is the
mean square error on the test set, with an expression nearly identical
to that above, differing only in the range of the index $i$:
\begin{eqnarray*}
\mathcal{E}(h) = \frac{1}{n'}\sum_{i = n + 1}^{n + n'} \left[ h(x^{(i)}) - y^{(i)} \right]^2
\end{eqnarray*}
on $n'$ new examples that were not used in the process of constructing $h$.
In machine learning in general, not just regression, it is useful to
distinguish two ways in which a hypothesis $h \in \hclass$ might
contribute to test error. They are:
\begin{description}
\item{\bf Structural error}: This is error that arises because there
is no hypothesis $h \in \hclass$ that will perform well on the data,
for example because the data was really generated by a sine wave but
we are trying to fit it with a line.\index{structural error}
\item{\bf Estimation error}: This is error that arises because we
do not have enough data (or the data are in some way unhelpful) to
allow us to choose a good $h \in \hclass$, or because we didn't
solve the optimization problem well enough to find the best $h$
given the data that we had.\index{estimation error}
\end{description}
\note{There are technical definitions of these concepts that are
studied in more advanced treatments of machine learning. Structural
error is referred to as {\em bias} and estimation error is referred
to as {\em variance}.}
When we increase $\lambda$, we tend to increase structural error but
decrease estimation error, and vice versa.
% \question{
% Consider using a polynomial basis of order $k$ as a feature
% transformation $\phi$ on your data. Would increasing $k$ tend to
% increase or decrease structural error? What about estimation error?
% }
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Evaluating learning algorithms}
{\em Note that this section is relevant to learning algorithms
generally---we are just introducing the topic here since we now have
an algorithm that can be evaluated!}
\vskip0.2in
A {\em learning algorithm}\index{learning algorithm} is a procedure that takes a data set
$\data_n$ as input and returns a hypothesis $h$ from a hypothesis
class $\hclass$; it looks like
\begin{equation*}
\data_n \longrightarrow \boxed{\text{learning alg ($\hclass$)}} \longrightarrow h
\end{equation*}
%
% $\theta$ and $\theta_{0}$.
Keep in mind that $h$ has parameters. The learning algorithm itself may have its own parameters, and such
parameters are often called {\em hyperparameters}. The analytical
solutions presented above for linear regression,
e.g., Eq.~\ref{eq:ridge_regression_solution}, may be thought of as
learning algorithms, where $\lambda$ is a hyperparameter that governs
how the learning algorithm works and can strongly affect its performance.
How should we evaluate the performance of a {learning algorithm}?
This can be tricky. There are many potential sources of variability in
the possible result of computing test error on a learned hypothesis $h$:
\begin{itemize}
\item Which particular {\em training examples} occurred in $\dataTrain$
\item Which particular {\em testing examples} occurred in $\dataTest$
\item Randomization inside the learning {\em algorithm} itself
\end{itemize}
\subsubsection{Validation}
Generally, to evaluate how well a learning {\em algorithm} works,
given an unlimited data source,
we would like to execute the following process multiple
times:
\begin{itemize}
\item Train on a new training set (subset of our big data source)
\item Evaluate resulting $h$ on a {\em validation set} that does not
overlap the training set (but is still a subset of our same big
data source)
\end{itemize}
% Average the performance of the resulting hypotheses over the different
% ``trials'' of training and validation.
Running the algorithm multiple times controls for possible poor choices of
training set or unfortunate randomization inside the algorithm itself.
\subsubsection{Cross validation}
\label{cross-validation}
One concern is that we might need a lot of data to do this, and in
many applications data is expensive or difficult to acquire. We can
re-use data with {\em{cross validation}}\index{cross validation} (but it's harder to do theoretical
analysis). \\
\begin{codebox}
\Procname{$\proc{Cross-Validate}(\data, k)$}
\li divide $\data$ into $k$ chunks $\data_1, \data_2, \ldots \data_k$ (of roughly equal size)
\li \For $i \gets 1$ \To $k$
\li \Do
train $h_i$ on $\data \setminus \data_i$ (withholding chunk $\data_i$ as the validation set)
\li compute ``test'' error $\mathcal{E}_i (h_i)$ on withheld data $\data_i$
\End
\li \Return $\frac{1}{k} \sum_{i=1}^k \mathcal{E}_i (h_i)$
\end{codebox}
It's very important to understand that (cross-)validation neither
delivers nor evaluates a single particular hypothesis $h$. It
evaluates the {\em learning algorithm} that produces hypotheses.
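A direct Python transcription of this procedure might look like the sketch below, where \texttt{learn} is any function from a training set to a hypothesis and \texttt{error} evaluates a hypothesis on held-out data (both are placeholders to be supplied by the caller); the initial shuffle is a common practical addition that is not part of the pseudocode above:
\begin{verbatim}
import numpy as np

def cross_validate(X, Y, k, learn, error, seed=0):
    # X: n x d array of inputs; Y: length-n array of outputs; k: number of chunks.
    idx = np.random.default_rng(seed).permutation(len(Y))   # shuffle once
    chunks = np.array_split(idx, k)                         # k roughly equal chunks
    scores = []
    for i in range(k):
        val = chunks[i]                                          # withheld chunk D_i
        train = np.concatenate([chunks[j] for j in range(k) if j != i])
        h = learn(X[train], Y[train])                            # train on D \ D_i
        scores.append(error(h, X[val], Y[val]))                  # evaluate on D_i
    return np.mean(scores)
\end{verbatim}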
\subsubsection{Hyperparameter tuning}
The hyper-parameters of a learning algorithm affect how the algorithm
{\em works} but they are not part of the resulting hypothesis. So,
for example, $\lambda$ in ridge regression affects {\em which}
hypothesis will be returned, but $\lambda$ itself doesn't show up in
the hypothesis (the hypothesis is specified using parameters $\theta$ and
$\theta_0$).
You can think about each different setting of a hyper-parameter as
specifying a different learning algorithm.
In order to pick a good value of the hyper-parameter, we often end up
just trying a lot of values and seeing which one works best via
validation or cross-validation.\index{hyperparameter!hyperparameter tuning}
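For example, tuning $\lambda$ for ridge regression by cross-validation might look like the sketch below, which assumes the hypothetical \texttt{ridge} and \texttt{cross\_validate} helpers sketched earlier are in scope:
\begin{verbatim}
import numpy as np

def tune_lambda(X, Y, candidate_lambdas, k=5):
    # Each value of lambda defines a different learning algorithm; score each one
    # by cross-validation and keep the best.
    best_lam, best_score = None, np.inf
    for lam in candidate_lambdas:
        learn = lambda Xtr, Ytr, lam=lam: ridge(Xtr, Ytr, lam)
        error = lambda theta, Xv, Yv: np.mean((Xv @ theta - Yv) ** 2)
        score = cross_validate(X, Y, k, learn, error)
        if score < best_score:
            best_lam, best_score = lam, score
    return best_lam
\end{verbatim}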
\question{How could you use cross-validation to decide whether to use
analytic ridge regression or our random-regression algorithm and to pick
$k$ for random regression or $\lambda$ for ridge regression?}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "top"
%%% End: