-
Notifications
You must be signed in to change notification settings - Fork 16
/
Copy pathModel-Selection-in-R.html
560 lines (518 loc) · 42.8 KB
/
Model-Selection-in-R.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
<!DOCTYPE html>
<html>
<head>
<title>Model Selection</title>
<meta charset="utf-8">
<meta name="Description" content="R Language Tutorials for Advanced Statistics">
<meta name="Keywords" content="R, Tutorial, Machine learning, Statistics, Data Mining, Analytics, Data science, Linear Regression, Logistic Regression, Time series, Forecasting">
<meta name="Distribution" content="Global">
<meta name="Author" content="Selva Prabhakaran">
<meta name="Robots" content="index, follow">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="shortcut icon" href="/screenshots/iconb-64.png" type="image/x-icon" />
<link href="www/bootstrap.min.css" rel="stylesheet">
<link href="www/highlight.css" rel="stylesheet">
<link href='http://fonts.googleapis.com/css?family=Inconsolata:400,700'
rel='stylesheet' type='text/css'>
<!-- Color Script -->
<style type="text/css">
a {
color: #3675C5;
color: rgb(25, 145, 248);
color: #4582ec;
color: #3F73D8;
}
li {
line-height: 1.65;
}
/* reduce spacing around math formula*/
.MathJax_Display {
margin: 0em 0em;
}
</style>
<!-- Add Google search -->
<script language="Javascript" type="text/javascript">
function my_search_google()
{
var query = document.getElementById("my-google-search").value;
window.open("http://google.com/search?q=" + query
+ "%20site:" + "http://r-statistics.co");
}
</script>
</head>
<body>
<div class="container">
<div class="masthead">
<!--
<ul class="nav nav-pills pull-right">
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">
Table of contents<b class="caret"></b>
</a>
<ul class="dropdown-menu pull-right" role="menu">
<li class="dropdown-header"></li>
<li class="dropdown-header">Tutorial</li>
<li><a href="R-Tutorial.html">R Tutorial</a></li>
<li class="dropdown-header">ggplot2</li>
<li><a href="ggplot2-Tutorial-With-R.html">ggplot2 Short Tutorial</a></li>
<li><a href="Complete-Ggplot2-Tutorial-Part1-With-R-Code.html">ggplot2 Tutorial 1 - Intro</a></li>
<li><a href="Complete-Ggplot2-Tutorial-Part2-Customizing-Theme-With-R-Code.html">ggplot2 Tutorial 2 - Theme</a></li>
<li><a href="Top50-Ggplot2-Visualizations-MasterList-R-Code.html">ggplot2 Tutorial 3 - Masterlist</a></li>
<li><a href="ggplot2-cheatsheet.html">ggplot2 Quickref</a></li>
<li class="dropdown-header">Foundations</li>
<li><a href="Linear-Regression.html">Linear Regression</a></li>
<li><a href="Statistical-Tests-in-R.html">Statistical Tests</a></li>
<li><a href="Missing-Value-Treatment-With-R.html">Missing Value Treatment</a></li>
<li><a href="Outlier-Treatment-With-R.html">Outlier Analysis</a></li>
<li><a href="Variable-Selection-and-Importance-With-R.html">Feature Selection</a></li>
<li><a href="Model-Selection-in-R.html">Model Selection</a></li>
<li><a href="Logistic-Regression-With-R.html">Logistic Regression</a></li>
<li><a href="Environments.html">Advanced Linear Regression</a></li>
<li class="dropdown-header">Advanced Regression Models</li>
<li><a href="adv-regression-models.html">Advanced Regression Models</a></li>
<li class="dropdown-header">Time Series</li>
<li><a href="Time-Series-Analysis-With-R.html">Time Series Analysis</a></li>
<li><a href="Time-Series-Forecasting-With-R.html">Time Series Forecasting </a></li>
<li><a href="Time-Series-Forecasting-With-R-part2.html">More Time Series Forecasting</a></li>
<li class="dropdown-header">High Performance Computing</li>
<li><a href="Parallel-Computing-With-R.html">Parallel computing</a></li>
<li><a href="Strategies-To-Improve-And-Speedup-R-Code.html">Strategies to Speedup R code</a></li>
<li class="dropdown-header">Useful Techniques</li>
<li><a href="Association-Mining-With-R.html">Association Mining</a></li>
<li><a href="Multi-Dimensional-Scaling-With-R.html">Multi Dimensional Scaling</a></li>
<li><a href="Profiling.html">Optimization</a></li>
<li><a href="Information-Value-With-R.html">InformationValue package</a></li>
</ul>
</li>
</ul>
-->
<ul class="nav nav-pills pull-right">
<div class="input-group">
<form onsubmit="my_search_google()">
<input type="text" class="form-control" id="my-google-search" placeholder="Search..">
<form>
</div><!-- /input-group -->
</ul><!-- /.col-lg-6 -->
<h3 class="muted"><a href="/">r-statistics.co</a><small> by Selva Prabhakaran</small></h3>
<hr>
</div>
<div class="row">
<div class="col-xs-12 col-sm-3" id="nav">
<div class="well">
<li>
<ul class="list-unstyled">
<li class="dropdown-header"></li>
<li class="dropdown-header">Tutorial</li>
<li><a href="R-Tutorial.html">R Tutorial</a></li>
<li class="dropdown-header">ggplot2</li>
<li><a href="ggplot2-Tutorial-With-R.html">ggplot2 Short Tutorial</a></li>
<li><a href="Complete-Ggplot2-Tutorial-Part1-With-R-Code.html">ggplot2 Tutorial 1 - Intro</a></li>
<li><a href="Complete-Ggplot2-Tutorial-Part2-Customizing-Theme-With-R-Code.html">ggplot2 Tutorial 2 - Theme</a></li>
<li><a href="Top50-Ggplot2-Visualizations-MasterList-R-Code.html">ggplot2 Tutorial 3 - Masterlist</a></li>
<li><a href="ggplot2-cheatsheet.html">ggplot2 Quickref</a></li>
<li class="dropdown-header">Foundations</li>
<li><a href="Linear-Regression.html">Linear Regression</a></li>
<li><a href="Statistical-Tests-in-R.html">Statistical Tests</a></li>
<li><a href="Missing-Value-Treatment-With-R.html">Missing Value Treatment</a></li>
<li><a href="Outlier-Treatment-With-R.html">Outlier Analysis</a></li>
<li><a href="Variable-Selection-and-Importance-With-R.html">Feature Selection</a></li>
<li><a href="Model-Selection-in-R.html">Model Selection</a></li>
<li><a href="Logistic-Regression-With-R.html">Logistic Regression</a></li>
<li><a href="Environments.html">Advanced Linear Regression</a></li>
<li class="dropdown-header">Advanced Regression Models</li>
<li><a href="adv-regression-models.html">Advanced Regression Models</a></li>
<li class="dropdown-header">Time Series</li>
<li><a href="Time-Series-Analysis-With-R.html">Time Series Analysis</a></li>
<li><a href="Time-Series-Forecasting-With-R.html">Time Series Forecasting </a></li>
<li><a href="Time-Series-Forecasting-With-R-part2.html">More Time Series Forecasting</a></li>
<li class="dropdown-header">High Performance Computing</li>
<li><a href="Parallel-Computing-With-R.html">Parallel computing</a></li>
<li><a href="Strategies-To-Improve-And-Speedup-R-Code.html">Strategies to Speedup R code</a></li>
<li class="dropdown-header">Useful Techniques</li>
<li><a href="Association-Mining-With-R.html">Association Mining</a></li>
<li><a href="Multi-Dimensional-Scaling-With-R.html">Multi Dimensional Scaling</a></li>
<li><a href="Profiling.html">Optimization</a></li>
<li><a href="Information-Value-With-R.html">InformationValue package</a></li>
</ul>
</li>
</div>
<div class="well">
<p>Stay up-to-date. <a href="https://docs.google.com/forms/d/1xkMYkLNFU9U39Dd8S_2JC0p8B5t6_Yq6zUQjanQQJpY/viewform">Subscribe!</a></p>
<p><a href="https://docs.google.com/forms/d/13GrkCFcNa-TOIllQghsz2SIEbc-YqY9eJX02B19l5Ow/viewform">Chat!</a></p>
</div>
<h4>Contents</h4>
<ul class="list-unstyled" id="toc"></ul>
<!--
<hr>
<p><a href="/contribute.html">How to contribute</a></p>
<p><a class="btn btn-primary" href="">Edit this page</a></p>
-->
</div>
<div id="content" class="col-xs-12 col-sm-8 pull-right">
<h1>Model Selection Approaches</h1>
<blockquote>
<p>It is possible to build multiple models from a given set of X variables. But building a good quality model can make all the difference. Here, we explore various approaches to build and evaluate regression models.</p>
</blockquote>
<h2>Data Prep</h2>
<p>Lets prepare the data upon which the various model selection approaches will be applied. A dataframe containing only the predictors and one containing the response variable is created for use in the model seection algorithms.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">inputData <-<span class="st"> </span><span class="kw">read.csv</span>(<span class="st">"http://rstatistics.net/wp-content/uploads/2015/09/ozone2.csv"</span>, <span class="dt">stringsAsFactors=</span>F)
response_df <-<span class="st"> </span>inputData[<span class="st">'ozone_reading'</span>] <span class="co"># Y variable</span>
predictors_df <-<span class="st"> </span>inputData[, !<span class="kw">names</span>(inputData) %in%<span class="st"> "ozone_reading"</span> ] <span class="co"># X variables</span>
<span class="kw">head</span>(inputData)
<span class="co">#=> Month Day_of_month Day_of_week ozone_reading pressure_height Wind_speed Humidity</span>
<span class="co">#=> 1 1 4 3.01 5480 8 20.00000</span>
<span class="co">#=> 1 2 5 3.20 5660 6 48.41432</span>
<span class="co">#=> 1 3 6 2.70 5710 4 28.00000</span>
<span class="co">#=> 1 4 7 5.18 5700 3 37.00000</span>
<span class="co">#=> 1 5 1 5.34 5760 3 51.00000</span>
<span class="co">#=> 1 6 2 5.77 5720 4 69.00000</span>
<span class="co">#=></span>
<span class="co">#=> Temperature_Sandburg Temperature_ElMonte Inversion_base_height Pressure_gradient</span>
<span class="co">#=> 37.78175 35.31509 5000.000 -15</span>
<span class="co">#=> 38.00000 45.79294 4060.589 -14</span>
<span class="co">#=> 40.00000 48.48006 2693.000 -25</span>
<span class="co">#=> 45.00000 49.19898 590.000 -24</span>
<span class="co">#=> 54.00000 45.32000 1450.000 25</span>
<span class="co">#=> 35.00000 49.64000 1568.000 15</span>
<span class="co">#=></span>
<span class="co">#=> Inversion_temperature Visibility</span>
<span class="co">#=> 30.56000 200</span>
<span class="co">#=> 46.86914 300</span>
<span class="co">#=> 47.66000 250</span>
<span class="co">#=> 55.04000 100</span>
<span class="co">#=> 57.02000 60</span>
<span class="co">#=> 53.78000 60</span></code></pre></div>
<h2>Stepwise Regression</h2>
<p>In stepwise regression, we pass the full model to <code>step</code> function. It iteratively searches the full scope of variables in <code>backwards</code> directions by default, if <code>scope</code> is not given. It performs multiple iteractions by droping one X variable at a time. In each iteration, multiple models are built by dropping each of the X variables at a time. The <code>AIC</code> of the models is also computed and the model that yields the lowest AIC is retained for the next iteration.</p>
<p>In simpler terms, the variable that gives the minimum AIC when dropped, is dropped for the next iteration, until there is no significant drop in AIC is noticed.</p>
<p>The code below shows how stepwise regression can be done. We are providing the full model here, so a backwards stepwise will be performed, which means, variables will only be removed. In forward stepwise, variables will be progressively added.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">lmMod <-<span class="st"> </span><span class="kw">lm</span>(ozone_reading ~<span class="st"> </span>. , <span class="dt">data =</span> inputData)
selectedMod <-<span class="st"> </span><span class="kw">step</span>(lmMod)
<span class="kw">summary</span>(selectedMod)
<span class="co">#=> Call:</span>
<span class="co">#=> lm(formula = ozone_reading ~ Month + pressure_height + Wind_speed + </span>
<span class="co">#=> Humidity + Temperature_Sandburg + Temperature_ElMonte + Inversion_base_height, </span>
<span class="co">#=> data = inputData)</span>
<span class="co">#=> </span>
<span class="co">#=> Residuals:</span>
<span class="co">#=> Min 1Q Median 3Q Max </span>
<span class="co">#=> -13.5219 -2.6652 -0.1885 2.5702 12.7184 </span>
<span class="co">#=> </span>
<span class="co">#=> Coefficients:</span>
<span class="co">#=> Estimate Std. Error t value Pr(>|t|) </span>
<span class="co">#=> (Intercept) 97.9206462 27.5285900 3.557 0.000425 ***</span>
<span class="co">#=> Month -0.3632285 0.0752403 -4.828 2.05e-06 ***</span>
<span class="co">#=> pressure_height -0.0218974 0.0051670 -4.238 2.87e-05 ***</span>
<span class="co">#=> Wind_speed -0.1738621 0.1207299 -1.440 0.150715 </span>
<span class="co">#=> Humidity 0.0817383 0.0132480 6.170 1.85e-09 ***</span>
<span class="co">#=> Temperature_Sandburg 0.1532862 0.0403667 3.797 0.000172 ***</span>
<span class="co">#=> Temperature_ElMonte 0.5149553 0.0686170 7.505 4.92e-13 ***</span>
<span class="co">#=> Inversion_base_height -0.0003529 0.0001743 -2.025 0.043629 * </span>
<span class="co">#=> ---</span>
<span class="co">#=> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1</span>
<span class="co">#=> </span>
<span class="co">#=> Residual standard error: 4.233 on 358 degrees of freedom</span>
<span class="co">#=> Multiple R-squared: 0.7186, Adjusted R-squared: 0.7131 </span>
<span class="co">#=> F-statistic: 130.6 on 7 and 358 DF, p-value: < 2.2e-16</span>
all_vifs <-<span class="st"> </span>car::<span class="kw">vif</span>(selectedMod)
<span class="kw">print</span>(all_vifs)
<span class="co">#=> Month pressure_height Wind_speed Humidity </span>
<span class="co">#=> 1.377397 5.995937 1.330647 1.386716 </span>
<span class="co">#=> </span>
<span class="co">#=> Temperature_Sandburg Temperature_ElMonte Inversion_base_height </span>
<span class="co">#=> 6.781597 11.616208 1.926758</span></code></pre></div>
<h2>Multicollinearity and Statistical Significance</h2>
<p>Say, one of the methods discussed above or below has given us a best model based on a criteria such as Adj-Rsq. It is not guaranteed that the condition of multicollinearity (checked using <code>car::vif</code>) will be satisfied or even the model be statistically significant. To satisfy these two conditions, the below approach can be taken.</p>
<h4>Recursively remove variables with VIF > 4</h4>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">signif_all <-<span class="st"> </span><span class="kw">names</span>(all_vifs)
<span class="co"># Remove vars with VIF> 4 and re-build model until none of VIFs don't exceed 4.</span>
while(<span class="kw">any</span>(all_vifs ><span class="st"> </span><span class="dv">4</span>)){
var_with_max_vif <-<span class="st"> </span><span class="kw">names</span>(<span class="kw">which</span>(all_vifs ==<span class="st"> </span><span class="kw">max</span>(all_vifs))) <span class="co"># get the var with max vif</span>
signif_all <-<span class="st"> </span>signif_all[!(signif_all) %in%<span class="st"> </span>var_with_max_vif] <span class="co"># remove</span>
myForm <-<span class="st"> </span><span class="kw">as.formula</span>(<span class="kw">paste</span>(<span class="st">"ozone_reading ~ "</span>, <span class="kw">paste</span> (signif_all, <span class="dt">collapse=</span><span class="st">" + "</span>), <span class="dt">sep=</span><span class="st">""</span>)) <span class="co"># new formula</span>
selectedMod <-<span class="st"> </span><span class="kw">lm</span>(myForm, <span class="dt">data=</span>inputData) <span class="co"># re-build model with new formula</span>
all_vifs <-<span class="st"> </span>car::<span class="kw">vif</span>(selectedMod)
}
<span class="kw">summary</span>(selectedMod)
<span class="co"># Call:</span>
<span class="co"># lm(formula = myForm, data = inputData)</span>
<span class="co"># </span>
<span class="co"># Residuals:</span>
<span class="co"># Min 1Q Median 3Q Max </span>
<span class="co"># -15.5859 -3.4922 -0.3876 3.1741 16.7640 </span>
<span class="co"># </span>
<span class="co"># Coefficients:</span>
<span class="co"># Estimate Std. Error t value Pr(>|t|) </span>
<span class="co"># (Intercept) -2.007e+02 1.942e+01 -10.335 < 2e-16 ***</span>
<span class="co"># Month -2.322e-01 8.976e-02 -2.587 0.0101 * </span>
<span class="co"># pressure_height 3.607e-02 3.349e-03 10.773 < 2e-16 ***</span>
<span class="co"># Wind_speed 2.346e-01 1.423e-01 1.649 0.1001 </span>
<span class="co"># Humidity 1.391e-01 1.492e-02 9.326 < 2e-16 ***</span>
<span class="co"># Inversion_base_height -1.122e-03 1.975e-04 -5.682 2.76e-08 ***</span>
<span class="co"># ---</span>
<span class="co"># Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1</span>
<span class="co"># </span>
<span class="co"># Residual standard error: 5.172 on 360 degrees of freedom</span>
<span class="co"># Multiple R-squared: 0.5776, Adjusted R-squared: 0.5717 </span>
<span class="co"># F-statistic: 98.45 on 5 and 360 DF, p-value: < 2.2e-16</span>
car::<span class="kw">vif</span>(selectedMod)
<span class="co"># Month pressure_height Wind_speed Humidity Inversion_base_height </span>
<span class="co"># 1.313154 1.687105 1.238613 1.178276 1.658603 </span></code></pre></div>
<p>The VIFs of all the X’s are below 2 now. So, the condition of multicollinearity is satisfied. But the variable <code>wind_speed</code> in the model with p value > .1 is not statistically significant. For this specific case, we could just re-build the model without <code>wind_speed</code> and check all variables are statistically significant. But, what if you had a different data that selected a model with 2 or more non-significant variables. What if, you had to select models for many such data. So, lets write a generic code for this.</p>
<h4>Recursively remove non-significant variables</h4>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">all_vars <-<span class="st"> </span><span class="kw">names</span>(selectedMod[[<span class="dv">1</span>]])[-<span class="dv">1</span>] <span class="co"># names of all X variables</span>
<span class="co"># Get the non-significant vars</span>
summ <-<span class="st"> </span><span class="kw">summary</span>(selectedMod) <span class="co"># model summary</span>
pvals <-<span class="st"> </span>summ[[<span class="dv">4</span>]][, <span class="dv">4</span>] <span class="co"># get all p values</span>
not_significant <-<span class="st"> </span><span class="kw">character</span>() <span class="co"># init variables that aren't statsitically significant</span>
not_significant <-<span class="st"> </span><span class="kw">names</span>(<span class="kw">which</span>(pvals ><span class="st"> </span><span class="fl">0.1</span>))
not_significant <-<span class="st"> </span>not_significant[!not_significant %in%<span class="st"> "(Intercept)"</span>] <span class="co"># remove 'intercept'. Optional!</span>
<span class="co"># If there are any non-significant variables, </span>
while(<span class="kw">length</span>(not_significant) ><span class="st"> </span><span class="dv">0</span>){
all_vars <-<span class="st"> </span>all_vars[!all_vars %in%<span class="st"> </span>not_significant[<span class="dv">1</span>]]
myForm <-<span class="st"> </span><span class="kw">as.formula</span>(<span class="kw">paste</span>(<span class="st">"ozone_reading ~ "</span>, <span class="kw">paste</span> (all_vars, <span class="dt">collapse=</span><span class="st">" + "</span>), <span class="dt">sep=</span><span class="st">""</span>)) <span class="co"># new formula</span>
selectedMod <-<span class="st"> </span><span class="kw">lm</span>(myForm, <span class="dt">data=</span>inputData) <span class="co"># re-build model with new formula</span>
<span class="co"># Get the non-significant vars.</span>
summ <-<span class="st"> </span><span class="kw">summary</span>(selectedMod)
pvals <-<span class="st"> </span>summ[[<span class="dv">4</span>]][, <span class="dv">4</span>]
not_significant <-<span class="st"> </span><span class="kw">character</span>()
not_significant <-<span class="st"> </span><span class="kw">names</span>(<span class="kw">which</span>(pvals ><span class="st"> </span><span class="fl">0.1</span>))
not_significant <-<span class="st"> </span>not_significant[!not_significant %in%<span class="st"> "(Intercept)"</span>]
}
<span class="kw">summary</span>(selectedMod)
<span class="co">#=> Call:</span>
<span class="co">#=> lm(formula = myForm, data = inputData)</span>
<span class="co"># </span>
<span class="co">#=> Residuals:</span>
<span class="co">#=> Min 1Q Median 3Q Max </span>
<span class="co">#=> -15.1537 -3.5541 -0.2294 3.2273 17.0106 </span>
<span class="co">#=> </span>
<span class="co">#=> Coefficients:</span>
<span class="co">#=> Estimate Std. Error t value Pr(>|t|) </span>
<span class="co">#=> (Intercept) -1.989e+02 1.944e+01 -10.234 < 2e-16 ***</span>
<span class="co">#=> Month -2.694e-01 8.709e-02 -3.093 0.00213 ** </span>
<span class="co">#=> pressure_height 3.589e-02 3.355e-03 10.698 < 2e-16 ***</span>
<span class="co">#=> Humidity 1.466e-01 1.426e-02 10.278 < 2e-16 ***</span>
<span class="co">#=> Inversion_base_height -1.047e-03 1.927e-04 -5.435 1.01e-07 ***</span>
<span class="co">#=> ---</span>
<span class="co">#=> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1</span>
<span class="co">#=> </span>
<span class="co">#=> Residual standard error: 5.184 on 361 degrees of freedom</span>
<span class="co">#=> Multiple R-squared: 0.5744, Adjusted R-squared: 0.5697 </span>
<span class="co">#=> F-statistic: 121.8 on 4 and 361 DF, p-value: < 2.2e-16</span>
<span class="co">#=> car::vif(selectedMod)</span>
<span class="co">#=> Month pressure_height Humidity Inversion_base_height </span>
<span class="co">#=> 1.230346 1.685245 1.071214 1.570431 </span></code></pre></div>
<p>In the resulting model, both statistical significance and multicollinearity is acceptable.</p>
<h2>Best subsets</h2>
<p>Best subsets is a technique that relies on stepwise regression to search, find and visualise regression models. But unlike stepwise regression, you have more options to see what variables were included in various shortlisted models, force-in or force-out some of the explanatory variables and also visually inspect the model’s performance w.r.t Adj R-sq.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(leaps)
regsubsetsObj <-<span class="st"> </span><span class="kw">regsubsets</span>(<span class="dt">x=</span>predictors_df ,<span class="dt">y=</span>response_df, <span class="dt">nbest =</span> <span class="dv">2</span>, <span class="dt">really.big =</span> T)
<span class="kw">plot</span>(regsubsetsObj, <span class="dt">scale =</span> <span class="st">"adjr2"</span>) <span class="co"># regsubsets plot based on R-sq</span></code></pre></div>
<p><img src='screenshots/regsubsets-plot.png' width='528' height='349' /></p>
<h4>How to interpret the <code>regsubsets</code> plot?</h4>
<p>The <code>regsubsets</code> plot shows the adjusted R-sq along the Y-axis for many models created by combinations of variables shown on the X-axis. For instance, draw an imaginary horizontal line along the X-axis from any point along the Y-axis. That line would correspond to a linear model, where, the black boxes that line touches form the X variables. For example, the red line in the image touches the black boxes belonging to <code>Intercept</code>, <code>Month</code>, <code>pressure_height</code>, <code>Humidity</code>, <code>Temperature_Sandburg</code> and <code>Temperature_Elmonte</code>. The Adjusted R-sq for that model is the value at which the red line touches the Y-axis.</p>
<p>The caveat however is that it is not guaranteed that these models will be statistically significant.</p>
<h2>Leaps</h2>
<p>Leaps is similar to best subsets but is known to use a better algorithm to shortlist the models.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(leaps)
leapSet <-<span class="st"> </span><span class="kw">leaps</span>(<span class="dt">x=</span>predictors_df, <span class="dt">y=</span>inputData$ozone_reading, <span class="dt">nbest =</span> <span class="dv">1</span>, <span class="dt">method =</span> <span class="st">"adjr2"</span>) <span class="co"># criterion could be one of "Cp", "adjr2", "r2". Works for max of 32 predictors.</span>
<span class="co">#=> $which</span>
<span class="co">#=> 1 2 3 4 5 6 7 8 9 A B C</span>
<span class="co">#=> 1 FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE</span>
<span class="co">#=> 2 FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE</span>
<span class="co">#=> 3 TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE</span>
<span class="co">#=> 4 TRUE FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE</span>
<span class="co">#=> 5 TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE</span>
<span class="co">#=> 6 TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE</span>
<span class="co">#=> 7 TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE</span>
<span class="co">#=> 8 TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE</span>
<span class="co">#=> 9 TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE</span>
<span class="co">#=> 10 TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE</span>
<span class="co">#=> 11 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE</span>
<span class="co">#=> 12 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE</span>
<span class="co">#=> </span>
<span class="co">#=> $adjr2</span>
<span class="co">#=> [1] 0.5945612 0.6544828 0.6899196 0.6998209 0.7079506 0.7122214 0.7130796 0.7134627 0.7130404 0.7125416</span>
<span class="co">#=> [11] 0.7119148 0.7112852</span>
<span class="co"># Suppose, we want to choose a model with 4 variables.</span>
selectVarsIndex <-<span class="st"> </span>leapSet$which[<span class="dv">4</span>, ] <span class="co"># pick selected vars</span>
newData <-<span class="st"> </span><span class="kw">cbind</span>(response_df, predictors_df[, selectVarsIndex]) <span class="co"># new data for building selected model</span>
selectedMod <-<span class="st"> </span><span class="kw">lm</span>(ozone_reading ~<span class="st"> </span>., <span class="dt">data=</span>newData) <span class="co"># build model</span>
<span class="kw">summary</span>(selectedMod)
<span class="co">#=> Call:</span>
<span class="co">#=> lm(formula = ozone_reading ~ ., data = newData)</span>
<span class="co">#=> </span>
<span class="co">#=> Residuals:</span>
<span class="co">#=> Min 1Q Median 3Q Max </span>
<span class="co">#=> -13.9636 -2.8928 -0.0581 2.8549 12.6286 </span>
<span class="co">#=> </span>
<span class="co">#=> Coefficients:</span>
<span class="co">#=> Estimate Std. Error t value Pr(>|t|) </span>
<span class="co">#=> (Intercept) 74.611786 27.188323 2.744 0.006368 ** </span>
<span class="co">#=> Month -0.426133 0.069892 -6.097 2.78e-09 ***</span>
<span class="co">#=> pressure_height -0.018478 0.005137 -3.597 0.000366 ***</span>
<span class="co">#=> Humidity 0.096978 0.012529 7.740 1.01e-13 ***</span>
<span class="co">#=> Temperature_ElMonte 0.704866 0.049984 14.102 < 2e-16 ***</span>
<span class="co">#=> ---</span>
<span class="co">#=> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1</span>
<span class="co">#=> </span>
<span class="co">#=> Residual standard error: 4.33 on 361 degrees of freedom</span>
<span class="co">#=> Multiple R-squared: 0.7031, Adjusted R-squared: 0.6998 </span>
<span class="co">#=> F-statistic: 213.7 on 4 and 361 DF, p-value: < 2.2e-16</span></code></pre></div>
<h2>RegBest() from FactoMineR</h2>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(FactoMineR)
regMod <-<span class="st"> </span><span class="kw">RegBest</span>(<span class="dt">y=</span>inputData$ozone_reading, <span class="dt">x =</span> predictors_df)
regMod$all <span class="co"># summary of best model of all sizes based on Adj A-sq</span>
regMod$best <span class="co"># best model</span>
<span class="co">#=> Call:</span>
<span class="co">#=> lm(formula = as.formula(as.character(formul)), data = don)</span>
<span class="co">#=> </span>
<span class="co">#=> Residuals:</span>
<span class="co">#=> Min 1Q Median 3Q Max </span>
<span class="co">#=> -13.6805 -2.6589 -0.1952 2.6045 12.6521 </span>
<span class="co">#=> </span>
<span class="co">#=> Coefficients:</span>
<span class="co">#=> Estimate Std. Error t value Pr(>|t|) </span>
<span class="co">#=> (Intercept) 88.8519747 26.8386969 3.311 0.001025 ** </span>
<span class="co">#=> Month -0.3354044 0.0728259 -4.606 5.72e-06 ***</span>
<span class="co">#=> pressure_height -0.0202670 0.0050489 -4.014 7.27e-05 ***</span>
<span class="co">#=> Humidity 0.0784813 0.0130730 6.003 4.73e-09 ***</span>
<span class="co">#=> Temperature_Sandburg 0.1450456 0.0400188 3.624 0.000331 ***</span>
<span class="co">#=> Temperature_ElMonte 0.5069526 0.0684938 7.401 9.65e-13 ***</span>
<span class="co">#=> Inversion_base_height -0.0004224 0.0001677 -2.518 0.012221 * </span>
<span class="co">#=> ---</span>
<span class="co">#=> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1</span>
<span class="co">#=> </span>
<span class="co">#=> Residual standard error: 4.239 on 359 degrees of freedom</span>
<span class="co">#=> Multiple R-squared: 0.717, Adjusted R-squared: 0.7122 </span>
<span class="co">#=> F-statistic: 151.6 on 6 and 359 DF, p-value: < 2.2e-16</span></code></pre></div>
<h2>Simulated Annealing</h2>
<p>Given a set of variables, a simulated annealing algorithm seeks a k-variable subset which is optimal, as a surrogate for the whole set, with respect to a given criterion. Annealing offers a method of finding the best subsets of predictor variables. Since the correlation or covariance matrix is a input to the <code>anneal()</code> function, only continuous variables are used to compute the best subsets.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(subselect)
results <-<span class="st"> </span><span class="kw">anneal</span>(<span class="kw">cor</span>(predictors_df), <span class="dt">kmin=</span><span class="dv">1</span>, <span class="dt">kmax=</span><span class="kw">ncol</span>(predictors_df)-<span class="dv">1</span>, <span class="dt">nsol=</span><span class="dv">4</span>, <span class="dt">niter=</span><span class="dv">10</span>, <span class="dt">setseed=</span><span class="ot">TRUE</span>) <span class="co"># perform annealing<</span>
<span class="kw">print</span>(results$bestsets)
<span class="co">#=> Var.1 Var.2 Var.3 Var.4 Var.5 Var.6 Var.7 Var.8 Var.9 Var.10 Var.11</span>
<span class="co">#=> Card.1 11 0 0 0 0 0 0 0 0 0 0</span>
<span class="co">#=> Card.2 7 10 0 0 0 0 0 0 0 0 0</span>
<span class="co">#=> Card.3 5 6 8 0 0 0 0 0 0 0 0</span>
<span class="co">#=> Card.4 1 2 6 11 0 0 0 0 0 0 0</span>
<span class="co">#=> Card.5 1 3 5 6 11 0 0 0 0 0 0</span>
<span class="co">#=> Card.6 2 3 5 6 9 11 0 0 0 0 0</span>
<span class="co">#=> Card.7 1 2 3 5 10 11 12 0 0 0 0</span>
<span class="co">#=> Card.8 1 2 3 4 5 6 8 12 0 0 0</span>
<span class="co">#=> Card.9 1 2 3 4 5 6 9 10 12 0 0</span>
<span class="co">#=> Card.10 1 2 3 4 5 6 8 9 10 12 0</span>
<span class="co">#=> Card.11 1 2 3 4 5 6 7 8 9 10 12</span></code></pre></div>
<p>The bestsets value in the output reveal the best variables to select for each cardinality (number of predictors). The values inside <code>results$bestsets</code> correspond to the column index position of <code>predicted_df</code>, that is, which variables are selected for each cardinality.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">num_vars <-<span class="st"> </span><span class="dv">3</span>
selectVarsIndex <-<span class="st"> </span>results$bestsets[num_vars, <span class="dv">1</span>:num_vars]
newData <-<span class="st"> </span><span class="kw">cbind</span>(response_df, predictors_df[, selectVarsIndex]) <span class="co"># new data for building selected model</span>
selectedMod <-<span class="st"> </span><span class="kw">lm</span>(ozone_reading ~<span class="st"> </span>., <span class="dt">data=</span>newData) <span class="co"># build model</span>
<span class="kw">summary</span>(selectedMod)
<span class="co">#=> Call:</span>
<span class="co">#=> lm(formula = ozone_reading ~ ., data = newData)</span>
<span class="co">#=> </span>
<span class="co">#=> Residuals:</span>
<span class="co">#=> Min 1Q Median 3Q Max </span>
<span class="co">#=> -14.6948 -2.7279 -0.3532 2.9004 13.4161 </span>
<span class="co">#=> </span>
<span class="co">#=> Coefficients:</span>
<span class="co">#=> Estimate Std. Error t value Pr(>|t|) </span>
<span class="co">#=> (Intercept) -23.98819 1.50057 -15.986 < 2e-16 ***</span>
<span class="co">#=> Wind_speed 0.08796 0.11989 0.734 0.464 </span>
<span class="co">#=> Humidity 0.11169 0.01319 8.468 6.34e-16 ***</span>
<span class="co">#=> Temperature_ElMonte 0.49985 0.02324 21.506 < 2e-16 ***</span>
<span class="co">#=> ---</span>
<span class="co">#=> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1</span>
<span class="co"># </span>
<span class="co"># Residual standard error: 4.648 on 362 degrees of freedom</span>
<span class="co"># Multiple R-squared: 0.6569, Adjusted R-squared: 0.654 </span>
<span class="co"># F-statistic: 231 on 3 and 362 DF, p-value: < 2.2e-16</span></code></pre></div>
<p>Like other methods, <code>anneal()</code> does not guarantee that the model be statistically significant.</p>
<h2>Comparing Models Using ANOVA</h2>
<p>If you have two or more models that are subsets of a larger model, you can use <code>anova()</code> to check if the additional variable(s) contribute to the predictive ability of the model. In below example, the <code>baseMod</code> is a model built with 7 explanatory variables, while, <code>mod1</code> through <code>mod5</code> contain one predictor less than the previous model.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># ANOVA</span>
baseMod <-<span class="st"> </span><span class="kw">lm</span>(ozone_reading ~<span class="st"> </span>Month +<span class="st"> </span>pressure_height +<span class="st"> </span>Humidity +<span class="st"> </span>Temperature_Sandburg +<span class="st"> </span>Temperature_ElMonte +<span class="st"> </span>Inversion_base_height +<span class="st"> </span>Wind_speed, <span class="dt">data=</span>inputData)
mod1 <-<span class="st"> </span><span class="kw">lm</span>(ozone_reading ~<span class="st"> </span>Month +<span class="st"> </span>pressure_height +<span class="st"> </span>Humidity +<span class="st"> </span>Temperature_Sandburg +<span class="st"> </span>Temperature_ElMonte +<span class="st"> </span>Inversion_base_height, <span class="dt">data=</span>inputData)
mod2 <-<span class="st"> </span><span class="kw">lm</span>(ozone_reading ~<span class="st"> </span>Month +<span class="st"> </span>pressure_height +<span class="st"> </span>Humidity +<span class="st"> </span>Temperature_Sandburg +<span class="st"> </span>Temperature_ElMonte, <span class="dt">data=</span>inputData)
mod3 <-<span class="st"> </span><span class="kw">lm</span>(ozone_reading ~<span class="st"> </span>Month +<span class="st"> </span>pressure_height +<span class="st"> </span>Humidity +<span class="st"> </span>Temperature_ElMonte, <span class="dt">data=</span>inputData)
mod4 <-<span class="st"> </span><span class="kw">lm</span>(ozone_reading ~<span class="st"> </span>Month +<span class="st"> </span>pressure_height +<span class="st"> </span>Temperature_ElMonte, <span class="dt">data=</span>inputData)
<span class="kw">anova</span>(baseMod, mod1, mod2, mod3, mod4)
<span class="co">#=> Model 1: ozone_reading ~ Month + pressure_height + Humidity + Temperature_Sandburg + </span>
<span class="co">#=> Temperature_ElMonte + Inversion_base_height + Wind_speed</span>
<span class="co">#=> Model 2: ozone_reading ~ Month + pressure_height + Humidity + Temperature_Sandburg + </span>
<span class="co">#=> Temperature_ElMonte + Inversion_base_height</span>
<span class="co">#=> Model 3: ozone_reading ~ Month + pressure_height + Humidity + Temperature_Sandburg + </span>
<span class="co">#=> Temperature_ElMonte</span>
<span class="co">#=> Model 4: ozone_reading ~ Month + pressure_height + Humidity + Temperature_ElMonte</span>
<span class="co">#=> Model 5: ozone_reading ~ Month + pressure_height + Temperature_ElMonte</span>
<span class="co">#=></span>
<span class="co">#=> Res.Df RSS Df Sum of Sq F Pr(>F) </span>
<span class="co">#=> row 1 358 6414.4 </span>
<span class="co">#=> row 2 359 6451.5 -1 -37.16 2.0739 0.150715 </span>
<span class="co">#=> row 3 360 6565.5 -1 -113.98 6.3616 0.012095 * </span>
<span class="co">#=> row 4 361 6767.0 -1 -201.51 11.2465 0.000883 ***</span>
<span class="co">#=> row 5 362 7890.0 -1 -1123.00 62.6772 3.088e-14 ***</span>
---
Signif. codes:<span class="st"> </span><span class="dv">0</span> <span class="st">'***'</span> <span class="fl">0.001</span> <span class="st">'**'</span> <span class="fl">0.01</span> <span class="st">'*'</span> <span class="fl">0.05</span> <span class="st">'.'</span> <span class="fl">0.1</span> <span class="st">' '</span> <span class="dv">1</span></code></pre></div>
<p>For each row in the output, the <code>anova()</code> tests a hypothesis comparing two models. For instance, <code>row 2</code> compares <code>baseMod</code> (Model 1) and <code>mod1</code> (Model 2) in the output. The null hypothesis is that the two models are equal in fitting the data (i.e. the Y variable), while, the alternative hypothesis is that the full model is better (i.e. the additional X variable improves the model).</p>
<p>So what’s the inference? Except for row 2, all other rows have significant p values. This means all the additional variables in models 1, 2 and 3 are contributing to respective models. From row 1 output, the <code>Wind_speed</code> is not making the <code>baseMod</code> (Model 1) any better. So the best model we have amongst this set is <code>mod1</code> (Model1).</p>
</div>
</div>
<div class="footer">
<hr>
<p>© 2016-17 Selva Prabhakaran. Powered by <a href="http://jekyllrb.com/">jekyll</a>,
<a href="http://yihui.name/knitr/">knitr</a>, and
<a href="http://johnmacfarlane.net/pandoc/">pandoc</a>.
This work is licensed under the <a href="http://creativecommons.org/licenses/by-nc/3.0/">Creative Commons License.</a>
</p>
</div>
</div> <!-- /container -->
<script src="//code.jquery.com/jquery.js"></script>
<script src="www/bootstrap.min.js"></script>
<script src="www/toc.js"></script>
<!-- MathJax Script -->
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}
});
</script>
<script type="text/javascript"
src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<!-- Google Analytics Code -->
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-69351797-1', 'auto');
ga('send', 'pageview');
</script>
<style type="text/css">
/* reduce spacing around math formula*/
.MathJax_Display {
margin: 0em 0em;
}
body {
font-family: 'Helvetica Neue', Roboto, Arial, sans-serif;
font-size: 16px;
line-height: 27px;
font-weight: 400;
}
blockquote p {
line-height: 1.75;
color: #717171;
}
.well li{
line-height: 28px;
}
li.dropdown-header {
display: block;
padding: 0px;
font-size: 14px;
}
</style>
</body>
</html>