-
Notifications
You must be signed in to change notification settings - Fork 13
/
Copy pathmachine_learning.qmd
1652 lines (1207 loc) · 79.1 KB
/
machine_learning.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Core Concepts in Machine Learning {#sec-ml-core-concepts}
![](img/chapter_gp_plots/gp_plot_6.svg){.no-border width=75%}
**Machine learning** is used everywhere, and allows us to do things that would have been impossible just a couple decades ago. It is used in everything from self-driving cars, to medical diagnosis, to predicting the next word in your text message. The ubiquity of it is such that machine learning and related adventures like artificial intelligence are used as buzzwords, and it is not always clear what it meant by the one speaking them. In this chapter we hope you'll come away with a better understanding of what machine learning is, and how it can be used in your own work.
At its core, machine learning is a form of data analysis with a primary focus on predictive performance. Honestly, that's pretty much it from a practical standpoint. It is *not* a subset of particular types of models, it does not prohibit using statistical models, it doesn't mean that a program spontaneously learns without human involvement[^machinelearn], it doesn't require any 'machines' outside of a laptop, and it doesn't even mean that the model used is particularly complex. Machine learning is a set of tools and a modeling approach that attempts to improve and generalize model performance[^generalize].
This is a *different focus* than statistical modeling approaches that put much more emphasis on interpreting coefficients and uncertainty. But these two approaches can work together. Some implementations of machine learning include models that have their basis in traditional statistics, while others are often sufficiently complex that they are barely interpretable at all. However, even after you conduct your modeling via machine learning, you may still fall back on statistical analysis for further exploration of the results.
Here we will discuss some of the key ideas in machine learning, such as model assessment, loss functions, and cross-validation. Later we'll demonstrate common models used, but if you want to dive in, [you can head there now](ml_common_models.html)!
[^machinelearn]: The description of ML as machines learning 'without being programmed' can be misleading to the newcomer. In fact, many of the most common models used in machine learning are not capable of learning 'on their own' at any level, and require human intervention to provide processed data, specify the model, its parameters, set up the search through that parameter space, analyze the results, update the model, etc. We only very recently, post-2020, have developed models that appear to be able to generalize well to new tasks as if they have learned them without human involvement, but we still can't ignore all the hands-on work that went into the development of those models, which never could have such capabilities otherwise. When you see this 'learning without being programmed' it is an odd way to say that we don't have to guess the parameters ourselves (aside from the first guess). That said, it does feel like The Matrix, Star Trek and the rest is just around the corner though, doesn't it?
[^generalize]: Generalization in statistical analysis is more about generalizing from our sample of data to the population from which it's drawn. In order to do that well or precisely, one needs to meet certain assumptions about the model. In machine learning, generalization is more about how well the model will perform on new data, and is often referred to as 'out-of-sample' performance. These are not mutually exclusive, but the connotation is different.
::: {.callout-note title='ML by any other name...' collapse='true'}
AI, statistical learning, data mining, predictive analytics, data science, BI, there are a lot of names used alongside or even interchangeably with machine learning. It's mostly worth noting that using 'machine learning' without context makes it very difficult to know what tools have actually been employed, so you may have to do a bit of digging to find out the details.
:::
## Key Ideas
Here are the key ideas we'll cover in this chapter:
- Machine learning is an approach that prioritizes making accurate predictions using a variety of tools and methods.
- Models used in machine learning are typically more complex and difficult to interpret than those used in standard statistical models. However, *any model can be used with ML*.
- There are many performance metrics used in machine learning, and care should be taken to choose the appropriate one for your situation. You can also use multiple performance metrics to evaluate a model.
- Objective functions likewise should be chosen for the situation, and are often different from the performance metric.
- Regularization is a general approach to penalize complexity in a model, and is typically used to improve generalization.
- Cross-validation is a technique that helps us choose parameters for our models and compare different models.
### Why this matters
Machine learning applications help define the modern world and how we interact with it. There are few aspects of modern society that have not been touched by it in some way. With a basic understanding of the core ideas behind machine learning, you will better understand the models and techniques that are used in ML applications, and be able to apply them to your own work. You'll also be able to understand the limitations of these models, and not think of machine learning as 'magic'.
### Helpful context
To dive into applying machine learning models, you really only need a decent grasp of linear models as applied to regression and classification problems (@sec-foundation, @sec-glm). It would also be good to have an idea behind how they are estimated (@sec-estimation), as the same basic logic serves as a starting point here.
## Objective Functions {#sec-ml-objective}
We've implemented a variety of objective functions in other chapters, such as mean squared error for numeric targets and log loss for binary targets (@sec-estimation). The objective function is what we used to estimate model parameters, but it's not necessarily the same as the performance metric we ultimately use to select a model. For example, we may use log loss as the objective function, but then use accuracy as the performance metric. In that setting, the log loss provides a 'smooth' objective function to search the parameter space over, while accuracy is a straightforward and more interpretable metric for stakeholders. In this case, the objective function is used to optimize the model, while the performance metric is used to evaluate the model. In some cases, the objective function and performance metric are the same (e.g., (R)MSE), and even if not, they might have selected the same 'best' model, but this is not always the case. The following table shows some commonly used objective functions in machine learning for regression and classification tasks (@tbl-objective-functions).
<!-- TODO: check table for pdf and too much white space -->
{{< pagebreak >}}
\blandscape
\footnotesize
```{r}
#| echo: false
#| label: tbl-objective-functions
#| tbl-cap: Commonly used objective functions in machine learning for regression and classification tasks.
# NOTE: could not get any latex linebreak approaches to work so had to notably simplify the descriptions
objective_functions = tibble(
Task = c('Regression', 'Regression', 'Regression', 'Regression', 'Classification', 'Classification'),
`Objective Function` = c('Mean Squared Error (MSE)', 'Mean Absolute Error (MAE)', 'Huber Loss', 'Log Likelihood', 'Binary Cross-Entropy / Log Likelihood (Loss)', 'Categorical Cross-Entropy'),
Description = c('Average of the squared differences between the predicted and actual values.',
'Average of the absolute differences between the predicted and actual values.',
'A robust approach that is less sensitive to outliers than MSE.',
'Maximizes the likelihood of the data given the model parameters.',
'Used for binary classification problems. Same as log-likelihood .',
'Binary approach extended to multi-class classification problems.')
)
objective_functions |>
group_by(Task) |>
gt()
```
\normalsize
\elandscape
{{< pagebreak >}}
## Performance Metrics {#sec-ml-metrics}
When discussing how to understand our model (@sec-knowing-model-metrics), we noted there are many performance metrics used in machine learning. Care should be taken to choose the appropriate one for your situation. Usually we have a standard set we might use for the type of predictive problem. For example, for numeric targets, we typically are interested in (R)MSE and MAE. For classification problems, many metrics are based on the **confusion matrix**, which is a table of the predicted classes versus the observed classes. From that we can calculate things like accuracy, precision, recall, AUROC, etc. (see @tbl-performance-metrics).
As an example, and as a reason to get our first taste of machine learning, let's get some metrics for a movie review model. We'll do this for both numeric and classification targets to demonstrate the different types of metrics we can obtain. As we start our journey into machine learning, we'll show Python code first, as it's the dominant tool for ML. Here we'll model the target in both numeric and binary form with corresponding metrics.
:::{.panel-tabset}
##### Python
In Python, we can use the `sklearn.metrics` module to get a variety of metrics.
```{python}
#| label: py-metrics
#| output: false
#| results: 'hide'
from sklearn.metrics import (
mean_squared_error, root_mean_squared_error,
mean_absolute_error, r2_score,
accuracy_score, precision_score, recall_score,
roc_auc_score, roc_curve, auc, confusion_matrix
)
from sklearn.linear_model import LinearRegression, LogisticRegression
import pandas as pd
df_reviews = pd.read_csv('https://tinyurl.com/moviereviewsdata')
X = df_reviews[
[
'word_count',
'age',
'review_year',
'release_year',
'length_minutes',
'children_in_home',
'total_reviews',
]
]
y = df_reviews['rating']
y_class = df_reviews['rating_good']
model_lin_reg = LinearRegression().fit(X, y)
# note that sklearn uses regularization by default for logistic regression
model_log_reg = LogisticRegression().fit(X, y_class)
y_pred_linreg = model_lin_reg.predict(X)
y_pred_logreg = model_log_reg.predict(X)
# regression metrics
rmse = root_mean_squared_error(y, y_pred_linreg)
mae = mean_absolute_error(y, y_pred_linreg)
r2 = r2_score(y, y_pred_linreg)
# classification metrics
accuracy = accuracy_score(y_class, y_pred_logreg)
precision = precision_score(y_class, y_pred_logreg)
recall = recall_score(y_class, y_pred_logreg)
```
##### R
In R, we can use [mlr3measures]{.pack}, which has a variety of metrics.
```{r}
#| label: r-metrics
#| results: hide
library(mlr3measures)
# convert rating_good to factor for some metric inputs
df_reviews = read_csv('https://tinyurl.com/moviereviewsdata') |>
mutate(rating_good = factor(rating_good, labels = c('bad', 'good')))
model_lin_reg = lm(
rating ~
word_count
+ age
+ review_year
+ release_year
+ length_minutes
+ children_in_home
+ total_reviews,
data = df_reviews
)
model_log_reg = glm(
rating_good ~
word_count
+ age
+ review_year
+ release_year
+ length_minutes
+ children_in_home
+ total_reviews,
data = df_reviews,
family = binomial(link = 'logit')
)
y_pred_linreg = predict(model_lin_reg)
y_pred_logreg = predict(model_log_reg, type = 'response')
y_pred_logreg = factor(ifelse(y_pred_logreg > .5, 'good', 'bad'))
# regression metrics
rmse_val = rmse(df_reviews$rating, y_pred_linreg)
mae_val = mae(df_reviews$rating, y_pred_linreg)
r2_val = rsq(df_reviews$rating, y_pred_linreg)
# classification metrics
accuracy = acc(df_reviews$rating_good, y_pred_logreg)
precision = precision(df_reviews$rating_good, y_pred_logreg, positive = 'good')
recall = recall(df_reviews$rating_good, y_pred_logreg, positive = 'good')
```
:::
We put them all together in the following table. Now we know how to get them, and it was easy! But as we'll see later, there is a lot more to think about before we use these for model assessment.
```{r}
#| echo: false
#| label: tbl-r_metrics_show
#| tbl-cap: Example Metrics for Linear and Logistic Regression Models
tibble(
Model = rep(c('Linear Regression', 'Logistic Regression'), each = 3),
Metric = c('RMSE', 'MAE', 'R-squared', 'Accuracy', 'Precision', 'Recall'),
Value = c(rmse_val, mae_val, r2_val, accuracy, precision, recall)
) |>
group_by(Model) |>
gt()
```
## Generalization {#sec-ml-generalization}
So getting metrics is easy enough, but how will we use them? One of the key differences separating ML from traditional statistical modeling approaches is the assessment of performance on unseen or future data, a concept commonly referred to as **generalization**. The basic idea is that we want to build a model that will perform well on *new* data, and not just the data we used to train the model. This is because ultimately data is ever evolving, and when we are concerned with making predictions, we don't want to be beholden to a particular set of data that we just happened to have at a particular time and context.
But how do we do this? As a starting point, we can simply split (often called **partitioning**) our data into two sets, a **training set** and a **test set**, or, **holdout set**. The test set is typically a smaller subset, say 25% of the original data, but this amount is arbitrary, and will reflect the data situation. We **fit** or **train** the model on the training set, and then use the model to make predictions on, or **score**, the test set. This general approach is also known as the **holdout method**.
Consider a simple linear regression. We can fit the linear regression model on the training set, which provides us coefficients, etc. We can then use that model result to predict on the test set, and then compare the predictions to the observed target values in the test set. We can calculate metrics on both the training and test sets. Here we demonstrate this with our simple linear model.
:::{.panel-tabset}
##### Python
```{python}
#| label: py-holdout
#| results: hide
from sklearn.model_selection import train_test_split
X = df_reviews[[
'word_count',
'age',
'review_year',
'release_year',
'length_minutes',
'children_in_home',
'total_reviews',
]]
y = df_reviews['rating']
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.25,
random_state=123
)
model_linreg_train = LinearRegression().fit(X_train, y_train)
# get predictions
y_pred_train = model_linreg_train.predict(X_train)
y_pred_test = model_linreg_train.predict(X_test)
# get RMSE
rmse_train = root_mean_squared_error(y_train, y_pred_train)
rmse_test = root_mean_squared_error(y_test, y_pred_test)
pd.DataFrame(
dict(
prediction = ['Train', 'Test'],
rmse = [rmse_train, rmse_test]
)
).round(3)
```
##### R
```{r}
#| label: r-holdout
#| results: hide
# create a train and test set
library(rsample)
set.seed(123)
split = initial_split(df_reviews, prop = .75)
X_train = training(split)
X_test = testing(split)
model_linreg_train = lm(
rating ~
word_count
+ age
+ review_year
+ release_year
+ length_minutes
+ children_in_home
+ total_reviews,
data = X_train
)
# get predictions
y_train_pred = predict(model_linreg_train, newdata = X_train)
y_test_pred = predict(model_linreg_train, newdata = X_test)
# get RMSE
rmse_train = rmse(X_train$rating, y_train_pred)
rmse_test = rmse(X_test$rating, y_test_pred)
tibble(
prediction = c('Train', 'Test'),
rmse = c(rmse_train, rmse_test)
)
```
:::
```{r}
#| echo: false
#| label: tbl-r-holdout-show
#| tbl-cap: RMSE for Linear Regression Model on Train and Test Sets
tibble(
prediction = c('Train', 'Test'),
rmse = c(py$rmse_train, py$rmse_test)
) |>
gt(decimals = 3)
```
So there you have it, you just did some machine learning! And now we have a model that we can use to predict with any new data that comes along with ease. But as we'll soon see, there are limitations to doing things this simply. But conceptually this is an important idea, and one we will continue to return to.
:::{.callout-note title='Split Results' collapse='true'}
A reminder: the results of the split can vary depending on the random seed used, and whether you are using R or Python. So your results may look slightly different from the presented tables.
:::
### Using metrics for model evaluation and selection {#sec-ml-metrics-eval}
As we've seen elsewhere, there are many performance metrics to choose from to assess model performance, and the choice of metric depends on the type of problem (@sec-knowing-model-metrics). It also turns out that assessing the metric on the data we used to train the model does not give us the best assessment of that metric. This is because the model will do better on the data it was trained on than on data it wasn't trained on, and we can generally always improve that metric in training by making the model more complex. However, in many modeling situations, this complexity comes at the expense of generalization. So what we really want to ultimately say about our model will regard performance on the test set with our chosen metric, and not the data we used to train the model. At that point, we can also compare multiple models to one another given their performance on the test set, and select the one that performs best.
In the previous section you can compare our results on the tests vs. training set. Metrics are notably better on the training set on average, and that's what we see here. But since we should be more interested in how well the model will do on new data, we should focus on the test set result.
### Understanding test error and generalization {#sec-ml-generalization-understanding}
> This part gets into the weeds a bit. If you are not so inclined, skip to the summary of this section.
In the following discussion, you can think of a standard linear model scenario, for example, with squared-error loss function, and a data set where we split some of the observations in a random fashion into a training set, for initial model fitting, and a test set, which will be kept separate and independent, and used to measure generalization performance. We note **training error** as the average loss over all the training sets we could create in this process of random splitting. The **test error** is the average prediction error obtained when a model fitted on the training data is used to make predictions on the test data.
<!-- So, in addition to the previously noted goal of finding the 'best' model, **model selection**, we are interested further in estimating the prediction error with new data, **model performance**. -->
#### Generalization in the classical regime
So what result should we expect in this scenario? Let's look at the following visualization, inspired by @hastie_elements_2017.
```{r}
#| echo: false
#| eval: false
#| label: bias-var-tradeoff
fit_ridgeless = function(X, y, x_test, y_test) {
b = psych::Pinv(crossprod(X)) %*% crossprod(X, y)
predictions_train = X %*% b
predictions_test = x_test %*% b
rmse_train = yardstick::rmse_vec(y, predictions_train[, 1])
rmse_test = yardstick::rmse_vec(y_test, predictions_test[, 1])
list(b = b, rmse_train = rmse_train, rmse_test = rmse_test)
}
fit_reshuffle = function(
n,
scale = TRUE,
random = FALSE,
bs = 'cs',
k = 10
) {
df_cars_shuffled = mtcars[sample(1:nrow(mtcars)), c(1, sample(2:ncol(mtcars)))] # shuffle rows/columns (keep mpg as first)
if (scale) df_cars_shuffled = data.frame(scale(df_cars_shuffled))
sc_wt = mgcv::smoothCon(mgcv::s(wt, bs = bs, k = k), data = df_cars_shuffled)[[1]]$X
sc_hp = mgcv::smoothCon(mgcv::s(hp, bs = bs, k = k), data = df_cars_shuffled)[[1]]$X
sc_disp = mgcv::smoothCon(mgcv::s(disp, bs = bs, k = k), data = df_cars_shuffled)[[1]]$X
if (random) {
ran_dim = 10
sc_wt = matrix(rnorm(nrow(df_cars_shuffled) * ran_dim), ncol = ran_dim)
sc_hp = matrix(rnorm(nrow(df_cars_shuffled) * ran_dim), ncol = ran_dim)
sc_disp = matrix(rnorm(nrow(df_cars_shuffled) * ran_dim), ncol = ran_dim)
}
X = as.matrix(cbind(1, df_cars_shuffled[, -1], sc_wt, sc_hp, sc_disp))
y = df_cars_shuffled$mpg
train_idx = sample(1:nrow(df_cars_shuffled), n)
X_train = X[train_idx, , drop = FALSE]
X_test = X[-train_idx, , drop = FALSE]
y_train = y[train_idx]
y_test = y[-train_idx]
ncol_X_main = ncol(mtcars) # max model.matrix dim for original data
ncol_cubic_wt = ncol(sc_wt)
ncol_cubic_hp = ncol(sc_hp)
ncol_cubic_disp = ncol(sc_disp)
if (random) {
total_explore = 1:ncol(X)
} else {
total_explore = c(
1:ncol_X_main,
ncol_X_main + ncol_cubic_wt,
ncol_X_main + ncol_cubic_wt + ncol_cubic_hp,
ncol(X) # max dim
)
}
# for each dimension, run fit_ridgeless
rmse = map_df(
total_explore,
~ fit_ridgeless(X_train[, 1:.x], y_train, X_test[, 1:.x], y_test)[c('rmse_train', 'rmse_test')] |> as_tibble()
)
tibble(p = total_explore, rmse)
}
set.seed(42)
n = 20
res = map_df(1:250, ~ fit_reshuffle(
n = n,
random = FALSE,
bs = 'cs',
k = 15
), .id = 'sim')
test_error_summary = res |>
group_by(p) |>
summarize(
rmse_train = mean(rmse_train, na.rm = TRUE),
rmse_test = mean(rmse_test, na.rm = TRUE),
) |>
pivot_longer(-p, names_to = 'type', values_to = 'rmse') |>
mutate(
type = str_to_title(str_remove(type, 'rmse_'))
)
# test_error_summary
max_p = 20
p_dd_main = res |>
pivot_longer(-c(p, sim), names_to = 'type', values_to = 'rmse') |>
mutate(
type = str_to_title(str_remove(type, 'rmse_'))
) |>
filter(p <= max_p) |>
ggplot(aes(as.integer(p), rmse)) +
# geom_vline(xintercept = n, alpha = .5) +
annotate(
geom = 'text',
x = 2,
y = 0.9,
label = 'High Bias\nLow Variance',
color = 'black',
size = 2.5,
hjust = 0
) +
annotate(
geom = 'text',
x = 9,
y = 0.9,
label = 'Low Bias\nHigh Variance',
color = 'black',
size = 2.5,
hjust = 0
) +
geom_line(aes(color = type, group = interaction(type, sim)), alpha = .33, linewidth = .05) +
geom_line(aes(color = type, linetype = type),
linewidth = 1.5,
data = test_error_summary |> filter(p <= max_p)
) +
scale_color_manual(values = unname(okabe_ito[c(1,5)])) +
# geom_point(aes(color = type),
# alpha = .75,
# size = 3,
# data = test_error_summary
# ) +
# geom_point(
# # aes(color = type),
# color = 'gray50',
# alpha = 1,
# size = 9,
# data = test_error_summary |>
# filter(p < n, type == 'test') |>
# filter(rmse == min(rmse)),
# show.legend = FALSE
# ) +
# geom_point(
# color = okabe_ito[[1]],
# alpha = 1,
# size = 10,
# data = test_error_summary |> filter(type == 'test') |> filter(rmse == min(rmse)),
# show.legend = FALSE
# ) +
scale_x_continuous(breaks = 1:max_p, limits = c(1, 11)) +
coord_cartesian(
# xlim = c(0, n),
# ylim = c(0, max(test_error_summary$rmse) + 1)
ylim = c(0, 1)
) +
labs(
x = 'Model Complexity',
y = 'Prediction\nError',
caption = 'Training error continues to decrease with model complexity, but test error increases after a certain point.',
subtitle = 'Bias Variance Tradeoff'
) +
theme(
# axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
legend.key.size = unit(1.5, 'cm'),
legend.text = element_text(size = 12),
)
p_dd_main
ggsave('img/biasvar_mtcars.svg')
```
![Bias Variance Tradeoff](img/biasvar_mtcars.svg){#fig-bias-variance width=100%}
Prediction error on the test set is a function of several components, and two of these are **bias** and **variance**.
A key idea is that as the model complexity increases, we potentially capture more of the data variability. This reduces **bias**, which is the difference in our average prediction and the *true* model prediction. But this only works for training error, where eventually our model can potentially fit the training data perfectly!
For test error though, as the model complexity increases, the bias decreases, but the **variance** eventually increases. This variance reflects how much our prediction changes with different data. If our model gets too cozy with the training data, it will do poorly when we try to generalize beyond it, and this will be reflected in increased variance. This is traditionally known as the **bias-variance tradeoff** - we can reduce one source of error in the test set at the expense of the other, but not both at the same time indefinitely. In other words, we can reduce bias by increasing model complexity, but this will eventually increase variance in our test predictions. We can reduce variance by reducing model complexity, but this will increase bias. One additional thing to note is that even if we had the 'true' model given the features specified correctly, for the vast majority of cases there would still be prediction error due to the random data generating process (**noise**). This can potentially be reduced using additional valid features, getting better measurements, etc., but it will still be there to some extent in practice, and so will limit test set performance.
The ultimate goal is to find the sweet spot. We want a model that's complex enough to capture the data, but not so complex that it overfits to the training data.
#### Generalization in deep learning {.unnumbered}
It turns out that with lots of data and very complex models, or maybe even in most settings, the 'classical' understanding just described doesn't hold up. In fact, it is possible to get a model that fits the training data perfectly, and yet ultimately still generalizes well to new data!
This phenomenon is encapsulated in the notion of **double descent**. The idea is that, with overly complex models such as those employed with deep learning, we can get to the point of interpolating the data exactly. But as we continue to increase the complexity of the model, we actually start to generalize better again, as the model continues to explore potential options for fitting the data. This is a fascinating and somewhat counterintuitive result, and visually this displays as a 'double descent' in terms of test error. We see an initial decrease in test error as the model gets better in general. After a while, it begins to rise as we would expect in the classical regime (@fig-bias-variance). Eventually it peaks at the point where we have as many parameters as data points. Beyond that however, as we get even more complex with our model, we can possibly see a decrease in test error again[^grok]. Crazy!
[^grok]: A similar phenomenon is found in the idea of **grokking** in deep learning. In this case, even after seemingly doing as well as the model can on training and validation, the model 'spontaneously' starts to improve on validation with continued iterations. See @power_grokking_2022 for more on this.
We can demonstrate this on the classic `mtcars` dataset[^mtcars], which has only 32 observations! We repeatedly trained a model to predict miles per gallon on only 10 of those observations, and assess test error on the rest. The model we used is a form of ridge regression[^ridgeless], but we implemented splines for the car's weight, horsepower, and displacement, i.e., we GAMed it up (@sec-gam). We trained increasingly complex models, and in what follows we visualize the error as a function of model complexity as we did previously.
<!-- could add demo code to appendix? or note the witten tweets -->
[^mtcars]: If not familiar, the `mtcars` object is a data frame that comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).
[^ridgeless]: It's actually called [*ridgeless* regression](https://www.stat.berkeley.edu/~ryantibs/statlearn-s23/lectures/ridgeless.pdf).
```{r}
#| echo: false
#| eval: false
#| label: fig-double-descent
#| fig-cap: Double Descent on the classic mtcars dataset
set.seed(42)
fit_ridgeless = function(X, y, x_test, y_test) {
b = psych::Pinv(crossprod(X)) %*% crossprod(X, y)
predictions_train = X %*% b
predictions_test = x_test %*% b
rmse_train = yardstick::rmse_vec(y, predictions_train[, 1])
rmse_test = yardstick::rmse_vec(y_test, predictions_test[, 1])
list(b = b, rmse_train = rmse_train, rmse_test = rmse_test)
}
fit_reshuffle = function(
n,
scale = TRUE,
random = FALSE,
bs = 'cs',
k = 10
) {
df_cars_shuffled = mtcars[sample(1:nrow(mtcars)), c(1, sample(2:ncol(mtcars)))] # shuffle rows/columns (keep mpg as first)
if (scale) df_cars_shuffled = data.frame(scale(df_cars_shuffled))
sc_wt = mgcv::smoothCon(mgcv::s(wt, bs = bs, k = k), data = df_cars_shuffled)[[1]]$X
sc_hp = mgcv::smoothCon(mgcv::s(hp, bs = bs, k = k), data = df_cars_shuffled)[[1]]$X
sc_disp = mgcv::smoothCon(mgcv::s(disp, bs = bs, k = k), data = df_cars_shuffled)[[1]]$X
if (random) {
ran_dim = 10
sc_wt = matrix(rnorm(nrow(df_cars_shuffled) * ran_dim), ncol = ran_dim)
sc_hp = matrix(rnorm(nrow(df_cars_shuffled) * ran_dim), ncol = ran_dim)
sc_disp = matrix(rnorm(nrow(df_cars_shuffled) * ran_dim), ncol = ran_dim)
}
X = as.matrix(cbind(1, df_cars_shuffled[, -1], sc_wt, sc_hp, sc_disp))
y = df_cars_shuffled$mpg
train_idx = sample(1:nrow(df_cars_shuffled), n)
X_train = X[train_idx, , drop = FALSE]
X_test = X[-train_idx, , drop = FALSE]
y_train = y[train_idx]
y_test = y[-train_idx]
ncol_X_main = ncol(mtcars) # max model.matrix dim for original data
ncol_cubic_wt = ncol(sc_wt)
ncol_cubic_hp = ncol(sc_hp)
ncol_cubic_disp = ncol(sc_disp)
if (random) {
total_explore = 1:ncol(X)
} else {
total_explore = c(
1:ncol_X_main,
ncol_X_main + ncol_cubic_wt,
ncol_X_main + ncol_cubic_wt + ncol_cubic_hp,
ncol(X) # max dim
)
}
# for each dimension, run fit_ridgeless
rmse = map_df(
total_explore,
~ fit_ridgeless(X_train[, 1:.x], y_train, X_test[, 1:.x], y_test)[c('rmse_train', 'rmse_test')] |> as_tibble()
)
tibble(p = total_explore, rmse)
}
n = 10
res = map_df(1:250, ~ fit_reshuffle(
n = n,
random = FALSE,
bs = 'cs',
k = 10
), .id = 'sim')
test_error_summary = res |>
group_by(p) |>
summarize(
rmse_train = mean(rmse_train, na.rm = TRUE),
rmse_test = mean(rmse_test, na.rm = TRUE),
) |>
pivot_longer(-p, names_to = 'type', values_to = 'rmse') |>
mutate(type = str_to_title(str_remove(type, 'rmse_')))
# test_error_summary
p_dd_main = res |>
pivot_longer(-c(p, sim), names_to = 'type', values_to = 'rmse') |>
mutate(type = str_to_title(str_remove(type, 'rmse_'))) |>
ggplot(aes(as.integer(p), rmse)) +
geom_vline(xintercept = n, alpha = .5) +
geom_line(aes(color = type, group = interaction(type, sim)), alpha = .25, linewidth = .05) +
geom_line(aes(color = type, linetype = type),
linewidth = .5,
data = test_error_summary
) +
scale_color_manual(values = unname(okabe_ito[c(1,5)])) +
geom_point(aes(color = type),
alpha = .75,
size = 3,
data = test_error_summary,
show.legend = FALSE
) +
geom_point(
# aes(color = type),
color = 'gray50',
alpha = 1,
size = 9,
data = test_error_summary |>
filter(p < n, type == 'Test') |>
filter(rmse == min(rmse)),
show.legend = FALSE
) +
geom_point(
color = okabe_ito[[1]],
alpha = 1,
size = 10,
data = test_error_summary |> filter(type == 'Test') |> filter(rmse == min(rmse)),
show.legend = FALSE
) +
scale_x_continuous(breaks = c(1, n, seq(20, 40, 10)), limits = c(1, max(res$p))) +
coord_cartesian(
# xlim = c(0, n),
# ylim = c(0, max(test_error_summary$rmse) + 1)
ylim = c(0, max(test_error_summary$rmse) + 1)
) +
labs(
x = 'Model Complexity',
y = 'Prediction\nError',
subtitle = 'Double Descent on the classic mtcars dataset. This is the same setup used in the bias-variance tradeoff plot.',
)
p_dd_main
ggsave('img/ml-core-double_descent.svg', p_dd_main, width = 8, height = 6)
```
![Double Descent on the classic mtcars dataset](img/ml-core-double_descent.svg){#fig-double-descent}
On the left part of the visualization, we see that the test error dips as we get a better model. Our best test error is noted by the large dot on the left. Eventually though, the test error rises as expected, even as training error gets better. Test error eventually hits a peak when the number of parameters equals the number of training observations. But then we keep going, and the test error starts to decrease again! By the end we have essentially perfect training prediction, and our test error is as good as it was with the simpler models (large dot on the right). This is the double descent phenomenon with one of the simplest datasets around. Cool!
#### Generalization summary
The take home point is this: our primary concern is generalization error. We can reduce this error by increasing model complexity, but this may eventually cause test error to increase. However, with enough data and model complexity, we can get to the point where we can fit the training data perfectly, and yet still generalize well to new data. In many standard, or at least smaller, data and model settings, you can maybe assume the classical regime holds. But when employing deep learning with massive data and billions of parameters, you can worry less about the model's complexity. But no matter what, we should use tools to help make our model work better, and we prefer smaller and simpler models that can do as well as more complex ones, even if those 'smaller' models are still billions of parameters!
## Regularization {#sec-ml-regularization}
We now are very aware that a key aspect of the machine learning approach is having our model to work well with new data. One way to improve generalization is through the use of **regularization**, which is a general approach to penalize complexity in a model, and is typically used to prevent **overfitting**. Overfitting occurs when a model fits the training data very well, but does not generalize well to new data. This usually happens when the model is too complex and starts fitting to random noise in the training data. We can also have the opposite problem, where the model is too simple to capture the patterns in the data, and this is known as **underfitting**[^underfit].
[^underfit]: Underfitting is a notable problem in many academic disciplines, where the models are often too simple to capture the complexity of the underlying process. Typically, the model employed assumes linear relationships without any interactions, and the true data generating process may be anything but. These models are chosen for their simplicity and interpretability, rather than how well they can explain the phenomenon in question. However, one could make the argument that 'understanding' an unrealistic result is not very useful either, and that the goal should be to understand the true process however we can, and not just choose a model that's convenient.
In the following demonstration, the first plot shows results from a model that is probably too complex for the data setting. The curve is very wiggly as it tries as much of the data as possible, and is an example of overfitting. The second plot shows a straight line fit as we'd get from linear regression. It's too simple for the underlying feature-target relationship, and is an example of underfitting. The third plot shows a model that is a better fit to the data, and is an example of a model that is complex enough to capture the nonlinear aspect of the data, but not so complex that it capitalizes on a lot of noise.
:::{.content-visible when-format='html'}
```{r}
#| echo: false
#| label: fig-over-under
#| fig-cap: Overfitting and Underfitting
#| out-width: 100%
# create sample data
set.seed(123)
# x = seq(0, 1, length.out = 10)
# y = sin(x * pi) + rnorm(10, sd = 0.1)
# df = data.frame(x = x, y = y)
N = 200
df = mgcv::gamSim(n= N, scale = 1, verbose = FALSE) |>
rename(x = x2)
train_idx = sample(1:N, N/2)
X_train = df[train_idx, , drop = FALSE]
X_test = df[-train_idx, , drop = FALSE]
mod_over = lm(y ~ poly(x, 25), data = X_train)
mod_under = lm(y ~ x, data = X_train)
mod_goldie = mgcv::gam(y ~ s(x), data = X_train)
p_over = ggplot(X_train, aes(x, y)) +
# geom_line(aes(y = y + rnorm(n = N, sd = 1)), linewidth = 1) +
# geom_smooth(se=FALSE, method ='lm', formula = 'y ~ poly(x, 25)', linewidth = 1) +
stat_smooth(geom='line', se=FALSE, method ='lm', formula = 'y ~ poly(x, 25)', linewidth = 1) +
geom_point() +
labs(subtitle = 'Overfit') +
theme_void()
p_under = ggplot(X_train, aes(x, y)) +
stat_smooth(geom = 'line', method = 'lm', se = FALSE, linewidth = 1) +
geom_point() +
labs(subtitle = 'Underfit') +
theme_void()
p_goldie = ggplot(X_train, aes(x, y)) +
stat_smooth(geom = 'line', method = 'gam', se = FALSE, linewidth = 1) +
geom_point() +
labs(subtitle = 'Better') +
theme_void()
(p_over + p_under + p_goldie) &
theme(
panel.border = element_rect(color = 'gray50', fill = NA, size = .05),
plot.subtitle = element_text(size = 10, hjust = 0.5)
)
ggsave('img/ml-core-over-under-fit.svg', width = 8, height = 6)
```
:::
:::{.content-visible when-format='pdf'}
![Overfitting and Underfitting](img/ml-core-over-under-fit.svg){#fig-over-under}
:::
As a demonstration, let's examine generalization performance in this type of setting[^gamdat] with the following table that represents the test set RMSE. We see that the overfit model does best on training data, but relatively very poorly on test- nearly a 20% increase in the RMSE value. The underfit model doesn't change as much in test performance because it was poor to begin with, and is the worst performer for both. Our 'better' model wasn't best on training, but was best on the test set.
[^gamdat]: The data is based on a simulation (using `mgcv::gamSim`), with training sample of 200 and scale of 1, so the test data is just more simulated data points.
```{r}
#| echo: false
#| label: tbl-over-under
#| tbl-cap: RMSE for each model on test data
X_test_perf = X_train |>
mutate(Data = 'Train') |>
bind_rows(X_test |> mutate(Data = 'Test')) |>
select(Data, y) |>
mutate(
Over = ifelse(Data == 'Train', predict(mod_over, X_train), predict(mod_over, X_test)),
Under = ifelse(Data == 'Train', predict(mod_under, X_train), predict(mod_under, X_test)),
Better = ifelse(Data == 'Train', predict(mod_goldie, X_train), predict(mod_goldie, X_test)),
)
init = X_test_perf |>
pivot_longer(
cols = Over:Better,
names_to = c('Model'),
values_to = 'pred'
) |>
mutate(Data = factor(Data, levels = c('Train', 'Test'))) |>
group_by(Data, Model) |>
summarize(
RMSE = yardstick::rmse_vec(y, pred)
) |>
pivot_wider(names_from = Model, values_from = RMSE) |>
ungroup()
init |>
bind_rows(
tibble(
Data = '% inc. RMSE',
Better = (init$Better[2] / init$Better[1] - 1) * 100,
Over = (init$Over[2] / init$Over[1] - 1) * 100,
Under = (init$Under[2] / init$Under[1] - 1) * 100
)
) |>
select(Data, Over, Under, Better) |>
gt() |>
tab_style(
style = list(
# cell_fill(color = '#F9E3D6'), # will break latex/pdf
cell_text(style = 'italic', weight = 600)
),
locations = cells_body(
columns = Better,
)
) |>
tab_options(
footnotes.font.size = 10
)
```
A fairly simple example of regularization can be seen with a ridge regression model (@sec-estim-penalty), where we add a penalty term to the objective function. The penalty is a function of the size of the coefficients, and helps keep the model from getting too complex. It is also known as **L2 regularization** due to squaring the coefficients (formally, the $L_2$ norm). Another type is the **L1 penalty**, used in the 'lasso' model, which is based on the absolute values of the coefficients ($L_1$ norm). Yet another common approach combines the two, called **elastic net**. There, we use both L1 and L2 penalties with different weights, and use cross-validation to find the best balance. L1 and/or L2 penalties are applied in many other models such as gradient boosting, neural networks, and others, and are a key aspect of machine learning.
Regularization is used in many modeling scenarios. Here is a quick rundown of some examples.
- GAMs use penalized regression for estimation of the coefficients for the basis functions (typically with L2). This keeps the 'wiggly' part of the GAM from getting too wiggly, as in the overfit model in @fig-over-under. This shrinks the feature-target relationship toward a linear one.
- Similarly, the variance estimate of a random effect in mixed models, e.g., for the intercept or slope, is inversely related to an L2 penalty on the effect estimates for that group effect. The more penalization applied, the less random effect variance, and the more the random effect is shrunk toward the overall mean[^mixedpenalty].
[^mixedpenalty]: One more reason to prefer a random effects approach over so-called fixed effects models, as the latter are not penalized at all, and thus are more prone to overfitting.
- Still another form of regularization occurs in the form of priors in Bayesian models. There we use priors to control the influence of the data on the final model. A small variance on the prior shrinks the model towards the prior mean. If large, there is little influence of the prior on the posterior. In regression models, there is correspondence between ridge regression and using a normal distribution prior for the coefficients in Bayesian regression, where the L2 penalty is related to the variance of that prior. Even in deep learning, there is usually a 'Bayesian interpretation' of the regularization approaches employed.
- As a final example of regularization, **dropout** is a technique used in deep learning to prevent overfitting. Feel free to return to this discussion after seeing neural networks in action in the next chapter(@sec-ml-common-dl-nn), as that will provide the appropriate context. But the gist is that dropout works by randomly removing some of the nodes in intervening/hidden layers in the network during training. This tends to force the network to learn more robust features, allowing for better generalization[^batchnorm].
[^batchnorm]: **Batch normalization**, another common layer application in deep learning, also has a regularizing effect by adding a bit of noise to the network. This can help with generalization, and also helps with training stability.
```{r}
#| echo: false
#| eval: false
#| label: tree-graph-setup
g = DiagrammeR::grViz('img/dropout.dot')
g |>
DiagrammeRsvg::export_svg() |>
charToRaw() |>
rsvg::rsvg_svg('img/dropout.svg')
```
![A neural net with dropout](img/dropout.svg){#fig-dropout}
In the end, regularization comes in many forms across the modeling landscape, and is a key aspect of machine learning and traditional statistical modeling alike. The primary goal is to decrease model complexity in the hopes of increasing our ability to generalize the selected model to new data scenarios.
:::{.callout-note title='Regularization with Large Data' collapse='true'}
For the linear model and related models for typical tabular data, a very large dataset can often lessen the need for regularization. This is because the model can learn the patterns in the data without overfitting, and the penalty ultimately is overwhelmed by the other parts of the objective function. However, regularization is still useful in many cases, and can help with model stability and speed of convergence.
:::
## Cross-validation {#sec-ml-cv}
We've talked a lot about generalization, so now let's think about some ways to go about a general process of selecting parameters for a model and assessing performance in a way that helps us generalize to new data better.
We previously used a simple approach where we split the data into training and test sets. We then fit the model on the training set, and subsequently assessed performance on the test set. This is fine, but the test set error, or any other metric, has uncertainty. It would be slightly different with any training-test split we came up with.
We'd also like to get better model assessment when searching the hyperparameter space, because we have no way of guessing the value beforehand, and we'll need to try out different ones. An example would be the penalty parameter in lasso regression. In this case, we need to figure out the best parameters *before* assessing a final model's performance.
One way to do this is to split the training data into different partitions, which we now call **validation sets**. We fit the model on the training set, and then assess performance on the validation set(s). We then repeat this process for many different splits of the data into training and validation sets, and average the results. This is known as **K-fold cross-validation**. It's important to note that we still want a test set to be held out that is in no way used during the training process. The validation sets are used to help us choose the best model based on some metric, and the test set is used to assess the chosen model's performance.
Here is a visualization of 3-fold cross validation. We split the data such that 2/3 of it will be used for training, and 1/3 for validation. We then do this for a total of 3 times, so that the validation set represents a different part of the data each time. Ultimately, all observations are used for both training and validation by the end of the process. We then average the results of any metric across the validation sets. Note that in each case here, there is no overlap of data between the training and validation sets.
:::{.content-visible when-format='html'}
```{r}
#| echo: false
#| eval: false
#| label: viz-create-kfold-image
#| out-width: 100%
# create example data
set.seed(123)
data = data.frame(x = as.numeric(1:30))
# perform 3-fold cross validation
p_dat = data |>
mutate(
fold1 = cut(seq(1,nrow(data)),breaks=3,labels=FALSE),
`Fold 2` = case_when(
fold1 == 1 ~ 3,
fold1 == 2 ~ 1,
fold1 == 3 ~ 2
),
`Fold 3` = case_when(
fold1 == 1 ~ 3,
fold1 == 2 ~ 2,
fold1 == 3 ~ 1
)
) |>
rename(`Fold 1` = fold1) |>
pivot_longer(cols = `Fold 1`:`Fold 3`, names_to = 'fold', values_to = 'value') |>
mutate(fold = as.factor(fold), value = ifelse(value == 1, 'Test', 'Train'))
p_dat |>
ggplot(aes(color= value)) +
geom_segment(
aes(x = 1, y = fold, xend = 30, yend = fold, group = 1),
color = okabe_ito[['orange']],
linewidth = 20
) +
geom_segment(
aes(x = 1, xend= 10, y = fold, yend = fold),
linewidth = 20,
color = okabe_ito[['darkblue']],
data = p_dat |> filter(fold == 'Fold 1')
) +
annotate(
geom = 'text',
x = 20,
y = 'Fold 1',
size = 8,
label = 'Training',
color = 'white',
# data = p_dat |> filter(fold == 'Fold 1')
) +
annotate(
geom = 'text',
x = 5.5,
y = 'Fold 1',
size = 8,
label = 'Validation',
color = 'white',
# data = p_dat |> filter(fold == 'Fold 1')
) +
geom_segment(
aes(x = 10, xend= 20, y = fold, yend = fold),
linewidth = 20,
color = okabe_ito[['darkblue']],
data = p_dat |> filter(fold == 'Fold 2')
) +
annotate(
geom = 'text',
x = c(6, 25),
y = 'Fold 2',
size = 8,
label = 'Training',