<!DOCTYPE html>
<html>
<head>
<title>Logistic Regression With R</title>
<meta charset="utf-8">
<meta name="Description" content="R Language Tutorials for Advanced Statistics">
<meta name="Keywords" content="R, Tutorial, Machine learning, Statistics, Data Mining, Analytics, Data science, Linear Regression, Logistic Regression, Time series, Forecasting">
<meta name="Distribution" content="Global">
<meta name="Author" content="Selva Prabhakaran">
<meta name="Robots" content="index, follow">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="shortcut icon" href="/screenshots/iconb-64.png" type="image/x-icon" />
<link href="www/bootstrap.min.css" rel="stylesheet">
<link href="www/highlight.css" rel="stylesheet">
<link href='http://fonts.googleapis.com/css?family=Inconsolata:400,700'
rel='stylesheet' type='text/css'>
<!-- Color Script -->
<style type="text/css">
a {
color: #3675C5;
color: rgb(25, 145, 248);
color: #4582ec;
color: #3F73D8;
}
li {
line-height: 1.65;
}
/* reduce spacing around math formula*/
.MathJax_Display {
margin: 0em 0em;
}
</style>
<!-- Add Google search -->
<script language="Javascript" type="text/javascript">
function my_search_google()
{
var query = document.getElementById("my-google-search").value;
window.open("http://google.com/search?q=" + query
+ "%20site:" + "http://r-statistics.co");
}
</script>
</head>
<body>
<div class="container">
<div class="masthead">
<!--
<ul class="nav nav-pills pull-right">
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">
Table of contents<b class="caret"></b>
</a>
<ul class="dropdown-menu pull-right" role="menu">
<li class="dropdown-header"></li>
<li class="dropdown-header">Tutorial</li>
<li><a href="R-Tutorial.html">R Tutorial</a></li>
<li class="dropdown-header">ggplot2</li>
<li><a href="ggplot2-Tutorial-With-R.html">ggplot2 Short Tutorial</a></li>
<li><a href="Complete-Ggplot2-Tutorial-Part1-With-R-Code.html">ggplot2 Tutorial 1 - Intro</a></li>
<li><a href="Complete-Ggplot2-Tutorial-Part2-Customizing-Theme-With-R-Code.html">ggplot2 Tutorial 2 - Theme</a></li>
<li><a href="Top50-Ggplot2-Visualizations-MasterList-R-Code.html">ggplot2 Tutorial 3 - Masterlist</a></li>
<li><a href="ggplot2-cheatsheet.html">ggplot2 Quickref</a></li>
<li class="dropdown-header">Foundations</li>
<li><a href="Linear-Regression.html">Linear Regression</a></li>
<li><a href="Statistical-Tests-in-R.html">Statistical Tests</a></li>
<li><a href="Missing-Value-Treatment-With-R.html">Missing Value Treatment</a></li>
<li><a href="Outlier-Treatment-With-R.html">Outlier Analysis</a></li>
<li><a href="Variable-Selection-and-Importance-With-R.html">Feature Selection</a></li>
<li><a href="Model-Selection-in-R.html">Model Selection</a></li>
<li><a href="Logistic-Regression-With-R.html">Logistic Regression</a></li>
<li><a href="Environments.html">Advanced Linear Regression</a></li>
<li class="dropdown-header">Advanced Regression Models</li>
<li><a href="adv-regression-models.html">Advanced Regression Models</a></li>
<li class="dropdown-header">Time Series</li>
<li><a href="Time-Series-Analysis-With-R.html">Time Series Analysis</a></li>
<li><a href="Time-Series-Forecasting-With-R.html">Time Series Forecasting </a></li>
<li><a href="Time-Series-Forecasting-With-R-part2.html">More Time Series Forecasting</a></li>
<li class="dropdown-header">High Performance Computing</li>
<li><a href="Parallel-Computing-With-R.html">Parallel computing</a></li>
<li><a href="Strategies-To-Improve-And-Speedup-R-Code.html">Strategies to Speedup R code</a></li>
<li class="dropdown-header">Useful Techniques</li>
<li><a href="Association-Mining-With-R.html">Association Mining</a></li>
<li><a href="Multi-Dimensional-Scaling-With-R.html">Multi Dimensional Scaling</a></li>
<li><a href="Profiling.html">Optimization</a></li>
<li><a href="Information-Value-With-R.html">InformationValue package</a></li>
</ul>
</li>
</ul>
-->
<ul class="nav nav-pills pull-right">
<div class="input-group">
<form onsubmit="my_search_google()">
<input type="text" class="form-control" id="my-google-search" placeholder="Search..">
</form>
</div><!-- /input-group -->
</ul><!-- /.col-lg-6 -->
<h3 class="muted"><a href="/">r-statistics.co</a><small> by Selva Prabhakaran</small></h3>
<hr>
</div>
<div class="row">
<div class="col-xs-12 col-sm-3" id="nav">
<div class="well">
<li>
<ul class="list-unstyled">
<li class="dropdown-header"></li>
<li class="dropdown-header">Tutorial</li>
<li><a href="R-Tutorial.html">R Tutorial</a></li>
<li class="dropdown-header">ggplot2</li>
<li><a href="ggplot2-Tutorial-With-R.html">ggplot2 Short Tutorial</a></li>
<li><a href="Complete-Ggplot2-Tutorial-Part1-With-R-Code.html">ggplot2 Tutorial 1 - Intro</a></li>
<li><a href="Complete-Ggplot2-Tutorial-Part2-Customizing-Theme-With-R-Code.html">ggplot2 Tutorial 2 - Theme</a></li>
<li><a href="Top50-Ggplot2-Visualizations-MasterList-R-Code.html">ggplot2 Tutorial 3 - Masterlist</a></li>
<li><a href="ggplot2-cheatsheet.html">ggplot2 Quickref</a></li>
<li class="dropdown-header">Foundations</li>
<li><a href="Linear-Regression.html">Linear Regression</a></li>
<li><a href="Statistical-Tests-in-R.html">Statistical Tests</a></li>
<li><a href="Missing-Value-Treatment-With-R.html">Missing Value Treatment</a></li>
<li><a href="Outlier-Treatment-With-R.html">Outlier Analysis</a></li>
<li><a href="Variable-Selection-and-Importance-With-R.html">Feature Selection</a></li>
<li><a href="Model-Selection-in-R.html">Model Selection</a></li>
<li><a href="Logistic-Regression-With-R.html">Logistic Regression</a></li>
<li><a href="Environments.html">Advanced Linear Regression</a></li>
<li class="dropdown-header">Advanced Regression Models</li>
<li><a href="adv-regression-models.html">Advanced Regression Models</a></li>
<li class="dropdown-header">Time Series</li>
<li><a href="Time-Series-Analysis-With-R.html">Time Series Analysis</a></li>
<li><a href="Time-Series-Forecasting-With-R.html">Time Series Forecasting </a></li>
<li><a href="Time-Series-Forecasting-With-R-part2.html">More Time Series Forecasting</a></li>
<li class="dropdown-header">High Performance Computing</li>
<li><a href="Parallel-Computing-With-R.html">Parallel computing</a></li>
<li><a href="Strategies-To-Improve-And-Speedup-R-Code.html">Strategies to Speedup R code</a></li>
<li class="dropdown-header">Useful Techniques</li>
<li><a href="Association-Mining-With-R.html">Association Mining</a></li>
<li><a href="Multi-Dimensional-Scaling-With-R.html">Multi Dimensional Scaling</a></li>
<li><a href="Profiling.html">Optimization</a></li>
<li><a href="Information-Value-With-R.html">InformationValue package</a></li>
</ul>
</li>
</div>
<div class="well">
<p>Stay up-to-date. <a href="https://docs.google.com/forms/d/1xkMYkLNFU9U39Dd8S_2JC0p8B5t6_Yq6zUQjanQQJpY/viewform">Subscribe!</a></p>
<p><a href="https://docs.google.com/forms/d/13GrkCFcNa-TOIllQghsz2SIEbc-YqY9eJX02B19l5Ow/viewform">Chat!</a></p>
</div>
<h4>Contents</h4>
<ul class="list-unstyled" id="toc"></ul>
<!--
<hr>
<p><a href="/contribute.html">How to contribute</a></p>
<p><a class="btn btn-primary" href="">Edit this page</a></p>
-->
</div>
<div id="content" class="col-xs-12 col-sm-8 pull-right">
<h1>Logistic Regression</h1>
<blockquote>
<p>While linear regression is used to predict continuous Y variables, logistic regression is used for binary classification.</p>
</blockquote>
<p>If we use linear regression to model a dichotomous variable (as <span class="math inline"><em>Y</em></span>), the resulting model might not restrict the predicted <span class="math inline"><em>Y</em><em>s</em></span> within 0 and 1. Besides, other <a href="Assumptions-of-Linear-Regression.html">assumptions of linear regression</a>, such as normality of errors, may get violated. So instead, we model the log odds of the event, <span class="math inline">$ln \left( P \over 1-P \right)$</span>, where <span class="math inline"><em>P</em></span> is the probability of the event.</p>
<p><br /><span class="math display">$$Z_{i} = ln{\left(P_{i} \over 1-P_{i} \right)} = \beta_{0} + \beta_{1} x_{1} + . . + \beta_{n} x_{n} $$</span><br /></p>
<p>The above equation can be modeled using <code>glm()</code> by setting the <code>family</code> argument to <code>"binomial"</code>. But we are more interested in the probability of the event than in its log odds. So, the predicted values from the above model, i.e. the log odds of the event, can be converted to the probability of the event as follows:</p>
<p><br /><span class="math display">$$P_{i} = 1 - {\left( 1 \over 1 + e^{z_{i}}\right)}$$</span><br /></p>
<p>This conversion is achieved using the <code>plogis()</code> function, as shown below when we <a href="Logistic-Regression-With-R.html#Build%20Logit%20Models%20and%20Predict">build logit models and predict</a>.</p>
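<p>As a quick numeric check, <code>plogis()</code> agrees with the formula above. A minimal sketch (the log-odds value 1.5 is just an illustrative number):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># convert an illustrative log-odds value to a probability
z <- 1.5
1 - (1 / (1 + exp(z)))   # manual conversion => 0.8175745
plogis(z)                # same result       => 0.8175745</code></pre></div>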
<h2>Example Problem</h2>
<p>Let’s try to predict whether an individual will earn more than $50K using logistic regression, based on demographic variables available in the <a href="http://rstatistics.net/wp-content/uploads/2015/09/adult.csv"><code>adult</code> data</a>. In this process, we will:</p>
<ol style="list-style-type: decimal">
<li>Import the data</li>
<li>Check for class bias</li>
<li>Create training and test samples</li>
<li>Compute information value to find out important variables</li>
<li>Build logit models and predict on test data</li>
<li>Do model diagnostics</li>
</ol>
<h2>Import data</h2>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">inputData <-<span class="st"> </span><span class="kw">read.csv</span>(<span class="st">"http://rstatistics.net/wp-content/uploads/2015/09/adult.csv"</span>)
<span class="kw">head</span>(inputData)
<span class="co">#=> AGE WORKCLASS FNLWGT EDUCATION EDUCATIONNUM MARITALSTATUS</span>
<span class="co">#=> 1 39 State-gov 77516 Bachelors 13 Never-married</span>
<span class="co">#=> 2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse</span>
<span class="co">#=> 3 38 Private 215646 HS-grad 9 Divorced</span>
<span class="co">#=> 4 53 Private 234721 11th 7 Married-civ-spouse</span>
<span class="co">#=> 5 28 Private 338409 Bachelors 13 Married-civ-spouse</span>
<span class="co">#=> 6 37 Private 284582 Masters 14 Married-civ-spouse</span>
<span class="co"># OCCUPATION RELATIONSHIP RACE SEX CAPITALGAIN CAPITALLOSS</span>
<span class="co">#=> 1 Adm-clerical Not-in-family White Male 2174 0</span>
<span class="co">#=> 2 Exec-managerial Husband White Male 0 0</span>
<span class="co">#=> 3 Handlers-cleaners Not-in-family White Male 0 0</span>
<span class="co">#=> 4 Handlers-cleaners Husband Black Male 0 0</span>
<span class="co">#=> 5 Prof-specialty Wife Black Female 0 0</span>
<span class="co">#=> 6 Exec-managerial Wife White Female 0 0</span>
<span class="co"># HOURSPERWEEK NATIVECOUNTRY ABOVE50K</span>
<span class="co">#=> 1 40 United-States 0</span>
<span class="co">#=> 2 13 United-States 0</span>
<span class="co">#=> 3 40 United-States 0</span>
<span class="co">#=> 4 40 United-States 0</span>
<span class="co">#=> 5 40 Cuba 0</span>
<span class="co">#=> 6 40 United-States 0</span></code></pre></div>
<h2>Check Class bias</h2>
<p>Ideally, the proportion of events and non-events in the Y variable should be approximately the same. So, let’s first check the proportion of classes in the dependent variable <code>ABOVE50K</code>.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">table</span>(inputData$ABOVE50K)
<span class="co"># 0 1 </span>
<span class="co"># 24720 7841 </span></code></pre></div>
<p>Clearly, there is a class bias, a condition observed when the proportion of events is much smaller than the proportion of non-events. So we must sample the observations in approximately equal proportions to get a better model.</p>
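<p>The imbalance is easier to see as proportions. A quick check derived from the counts above:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">round(prop.table(table(inputData$ABOVE50K)), 4)
#=>      0      1
#=> 0.7592 0.2408</code></pre></div>
<p>So only about 24% of the observations are events.</p>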
<h2>Create Training and Test Samples</h2>
<p>One way to address the problem of class bias is to draw the 0’s and 1’s for the <code>trainingData</code> (development sample) in equal proportions. In doing so, we will put the rest of the <code>inputData</code> not included for training into <code>testData</code> (validation sample). As a result, the development sample will be smaller than the validation sample, which is okay because there are a large number of observations (>10K).</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Create Training Data</span>
input_ones <-<span class="st"> </span>inputData[<span class="kw">which</span>(inputData$ABOVE50K ==<span class="st"> </span><span class="dv">1</span>), ] <span class="co"># all 1's</span>
input_zeros <-<span class="st"> </span>inputData[<span class="kw">which</span>(inputData$ABOVE50K ==<span class="st"> </span><span class="dv">0</span>), ] <span class="co"># all 0's</span>
<span class="kw">set.seed</span>(<span class="dv">100</span>) <span class="co"># for repeatability of samples</span>
input_ones_training_rows <-<span class="st"> </span><span class="kw">sample</span>(<span class="dv">1</span>:<span class="kw">nrow</span>(input_ones), <span class="fl">0.7</span>*<span class="kw">nrow</span>(input_ones)) <span class="co"># 1's for training</span>
input_zeros_training_rows <-<span class="st"> </span><span class="kw">sample</span>(<span class="dv">1</span>:<span class="kw">nrow</span>(input_zeros), <span class="fl">0.7</span>*<span class="kw">nrow</span>(input_ones)) <span class="co"># 0's for training. Pick as many 0's as 1's</span>
training_ones <-<span class="st"> </span>input_ones[input_ones_training_rows, ]
training_zeros <-<span class="st"> </span>input_zeros[input_zeros_training_rows, ]
trainingData <-<span class="st"> </span><span class="kw">rbind</span>(training_ones, training_zeros) <span class="co"># row bind the 1's and 0's </span>
<span class="co"># Create Test Data</span>
test_ones <-<span class="st"> </span>input_ones[-input_ones_training_rows, ]
test_zeros <-<span class="st"> </span>input_zeros[-input_zeros_training_rows, ]
testData <-<span class="st"> </span><span class="kw">rbind</span>(test_ones, test_zeros) <span class="co"># row bind the 1's and 0's </span></code></pre></div>
<p>Next it is desirable to find the <a href="Variable-Selection-and-Importance-With-R.html#7.%20Information%20value%20and%20Weight%20of%20evidence">information value</a> of variables to get an idea of how valuable they are in explaining the dependent variable (ABOVE50K).</p>
<h2>Create WOE for categorical variables (optional)</h2>
<p>We can optionally create <code>WOE</code> equivalents for all categorical variables. For simplicity, this optional step is NOT run for this analysis.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">for(factor_var in factor_vars){
inputData[[factor_var]] <-<span class="st"> </span><span class="kw">WOE</span>(<span class="dt">X=</span>inputData[, factor_var], <span class="dt">Y=</span>inputData$ABOVE50K)
}
<span class="kw">head</span>(inputData)
<span class="co">#> AGE WORKCLASS FNLWGT EDUCATION EDUCATIONNUM MARITALSTATUS OCCUPATION</span>
<span class="co">#> 1 39 0.1608547 77516 0.7974104 13 -1.8846680 -0.713645</span>
<span class="co">#> 2 50 0.2254209 83311 0.7974104 13 0.9348331 1.084280</span>
<span class="co">#> 3 38 -0.1278453 215646 -0.5201257 9 -1.0030638 -1.555142</span>
<span class="co">#> 4 53 -0.1278453 234721 -1.7805021 7 0.9348331 -1.555142</span>
<span class="co">#> 5 28 -0.1278453 338409 0.7974104 13 0.9348331 0.943671</span>
<span class="co">#> 6 37 -0.1278453 284582 1.3690863 14 0.9348331 1.084280</span>
<span class="co">#> RELATIONSHIP RACE SEX CAPITALGAIN CAPITALLOSS HOURSPERWEEK</span>
<span class="co">#> 1 -1.015318 0.08064715 0.3281187 2174 0 40</span>
<span class="co">#> 2 0.941801 0.08064715 0.3281187 0 0 13</span>
<span class="co">#> 3 -1.015318 0.08064715 0.3281187 0 0 40</span>
<span class="co">#> 4 0.941801 -0.80794676 0.3281187 0 0 40</span>
<span class="co">#> 5 1.048674 -0.80794676 -0.9480165 0 0 40</span>
<span class="co">#> 6 1.048674 0.08064715 -0.9480165 0 0 40</span>
<span class="co">#> NATIVECOUNTRY ABOVE50K</span>
<span class="co">#> 1 0.02538318 0</span>
<span class="co">#> 2 0.02538318 0</span>
<span class="co">#> 3 0.02538318 0</span>
<span class="co">#> 4 0.02538318 0</span>
<span class="co">#> 5 0.11671564 0</span>
<span class="co">#> 6 0.02538318 0</span></code></pre></div>
<h2>Compute Information Values</h2>
<p>The <code>smbinning::smbinning</code> function converts a continuous variable into a categorical variable using recursive partitioning. We will first convert the continuous variables to categorical, then capture the information values for all variables in <code>iv_df</code>.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(smbinning)
<span class="co"># segregate continuous and factor variables</span>
factor_vars <-<span class="st"> </span><span class="kw">c</span> (<span class="st">"WORKCLASS"</span>, <span class="st">"EDUCATION"</span>, <span class="st">"MARITALSTATUS"</span>, <span class="st">"OCCUPATION"</span>, <span class="st">"RELATIONSHIP"</span>, <span class="st">"RACE"</span>, <span class="st">"SEX"</span>, <span class="st">"NATIVECOUNTRY"</span>)
continuous_vars <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"AGE"</span>, <span class="st">"FNLWGT"</span>,<span class="st">"EDUCATIONNUM"</span>, <span class="st">"HOURSPERWEEK"</span>, <span class="st">"CAPITALGAIN"</span>, <span class="st">"CAPITALLOSS"</span>)
iv_df <-<span class="st"> </span><span class="kw">data.frame</span>(<span class="dt">VARS=</span><span class="kw">c</span>(factor_vars, continuous_vars), <span class="dt">IV=</span><span class="kw">numeric</span>(<span class="dv">14</span>)) <span class="co"># init for IV results</span>
<span class="co"># compute IV for categoricals</span>
for(factor_var in factor_vars){
smb <-<span class="st"> </span><span class="kw">smbinning.factor</span>(trainingData, <span class="dt">y=</span><span class="st">"ABOVE50K"</span>, <span class="dt">x=</span>factor_var) <span class="co"># WOE table</span>
if(<span class="kw">class</span>(smb) !=<span class="st"> "character"</span>){ <span class="co"># check if an error occurred</span>
iv_df[iv_df$VARS ==<span class="st"> </span>factor_var, <span class="st">"IV"</span>] <-<span class="st"> </span>smb$iv
}
}
<span class="co"># compute IV for continuous vars</span>
for(continuous_var in continuous_vars){
smb <-<span class="st"> </span><span class="kw">smbinning</span>(trainingData, <span class="dt">y=</span><span class="st">"ABOVE50K"</span>, <span class="dt">x=</span>continuous_var) <span class="co"># WOE table</span>
if(<span class="kw">class</span>(smb) !=<span class="st"> "character"</span>){ <span class="co"># any error while calculating scores.</span>
iv_df[iv_df$VARS ==<span class="st"> </span>continuous_var, <span class="st">"IV"</span>] <-<span class="st"> </span>smb$iv
}
}
iv_df <-<span class="st"> </span>iv_df[<span class="kw">order</span>(-iv_df$IV), ] <span class="co"># sort</span>
iv_df
<span class="co">#> VARS IV</span>
<span class="co">#> RELATIONSHIP 1.5739</span>
<span class="co">#> MARITALSTATUS 1.3356</span>
<span class="co">#> AGE 1.1748</span>
<span class="co">#> CAPITALGAIN 0.8389</span>
<span class="co">#> OCCUPATION 0.8259</span>
<span class="co">#> EDUCATIONNUM 0.7776</span>
<span class="co">#> EDUCATION 0.7774</span>
<span class="co">#> HOURSPERWEEK 0.4682</span>
<span class="co">#> SEX 0.3087</span>
<span class="co">#> WORKCLASS 0.1633</span>
<span class="co">#> CAPITALLOSS 0.1507</span>
<span class="co">#> NATIVECOUNTRY 0.0815</span>
<span class="co">#> RACE 0.0607</span>
<span class="co">#> FNLWGT 0.0000</span></code></pre></div>
<h2>Build Logit Models and Predict</h2>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">logitMod <-<span class="st"> </span><span class="kw">glm</span>(ABOVE50K ~<span class="st"> </span>RELATIONSHIP +<span class="st"> </span>AGE +<span class="st"> </span>CAPITALGAIN +<span class="st"> </span>OCCUPATION +<span class="st"> </span>EDUCATIONNUM, <span class="dt">data=</span>trainingData, <span class="dt">family=</span><span class="kw">binomial</span>(<span class="dt">link=</span><span class="st">"logit"</span>))
predicted <-<span class="st"> </span><span class="kw">plogis</span>(<span class="kw">predict</span>(logitMod, testData)) <span class="co"># predicted scores</span>
<span class="co"># or</span>
predicted <-<span class="st"> </span><span class="kw">predict</span>(logitMod, testData, <span class="dt">type=</span><span class="st">"response"</span>) <span class="co"># predicted scores</span></code></pre></div>
<p>A quick note about the <code>plogis</code> function: the <code>glm()</code> procedure with <code>family="binomial"</code> builds the logistic regression model on the given formula. When we use the <code>predict</code> function on this model, it predicts the log(odds) of the Y variable. This is not what we ultimately want, because the predicted values may not lie within the 0 and 1 range as expected. So, to convert them into prediction probability scores bound between 0 and 1, we use <code>plogis()</code>.</p>
<h4>Decide on optimal prediction probability cutoff for the model</h4>
<p>The default cutoff prediction probability score is 0.5 or the ratio of 1’s and 0’s in the training data. But sometimes, tuning the probability cutoff can improve the accuracy in both the development and validation samples. The <code>InformationValue::optimalCutoff</code> function provides ways to find the cutoff that improves the prediction of 1’s, of 0’s, of both, or that reduces the misclassification error. Let’s compute the optimal score that minimizes the misclassification error for the above model.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(InformationValue)
optCutOff <-<span class="st"> </span><span class="kw">optimalCutoff</span>(testData$ABOVE50K, predicted)[<span class="dv">1</span>]
<span class="co">#=> 0.71</span></code></pre></div>
<h2>Model Diagnostics</h2>
<p>The <code>summary(logitMod)</code> gives the beta coefficients, standard error, z value and p value. If your model has categorical variables with multiple levels, you will find a row entry for each category of that variable, because each individual category is treated as an independent binary variable by <code>glm()</code>. In this case it is OK if a few of the categories in a multi-category variable don’t turn out to be significant in the model (i.e., the p value turns out greater than the significance level of 0.05).</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">summary</span>(logitMod)
<span class="co">#> Call:</span>
<span class="co">#> glm(formula = ABOVE50K ~ RELATIONSHIP + AGE + CAPITALGAIN + OCCUPATION + </span>
<span class="co">#> EDUCATIONNUM, family = "binomial", data = trainingData)</span>
<span class="co">#> </span>
<span class="co">#> Deviance Residuals: </span>
<span class="co">#> Min 1Q Median 3Q Max </span>
<span class="co">#> -3.8380 -0.5319 -0.0073 0.6267 3.2847 </span>
<span class="co">#> </span>
<span class="co">#> Coefficients:</span>
<span class="co">#> Estimate Std. Error z value Pr(>|z|) </span>
<span class="co">#> (Intercept) -4.57657130 0.24641856 -18.572 < 0.0000000000000002 ***</span>
<span class="co">#> RELATIONSHIP Not-in-family -2.27712854 0.07205131 -31.604 < 0.0000000000000002 ***</span>
<span class="co">#> RELATIONSHIP Other-relative -2.72926866 0.27075521 -10.080 < 0.0000000000000002 ***</span>
<span class="co">#> RELATIONSHIP Own-child -3.56051255 0.17892546 -19.899 < 0.0000000000000002 ***</span>
<span class="co">#> ...</span>
<span class="co">#> ...</span>
<span class="co">#> Null deviance: 15216.0 on 10975 degrees of freedom</span>
<span class="co">#> Residual deviance: 8740.9 on 10953 degrees of freedom</span>
<span class="co">#> AIC: 8786.9</span>
<span class="co">#> </span>
<span class="co">#> Number of Fisher Scoring iterations: 8</span></code></pre></div>
<h4>VIF</h4>
<p>As in linear regression, we should check for multicollinearity in the model. As seen below, all X variables in the model have VIF well below 4.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">vif</span>(logitMod)
<span class="co">#> GVIF Df GVIF^(1/(2*Df))</span>
<span class="co">#> RELATIONSHIP 1.340895 5 1.029768</span>
<span class="co">#> AGE 1.119782 1 1.058198</span>
<span class="co">#> CAPITALGAIN 1.023506 1 1.011685</span>
<span class="co">#> OCCUPATION 1.733194 14 1.019836</span>
<span class="co">#> EDUCATIONNUM 1.454267 1 1.205930</span></code></pre></div>
<h4>Misclassification Error</h4>
<p>Misclassification error is the percentage mismatch of predicted vs. actual values, irrespective of 1’s or 0’s. The lower the misclassification error, the better the model.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">misClassError</span>(testData$ABOVE50K, predicted, <span class="dt">threshold =</span> optCutOff)
<span class="co">#=> 0.0899</span></code></pre></div>
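<p>The same number can be computed by hand: classify each score against the cutoff and take the proportion of mismatches. A sketch, assuming the same strict cutoff comparison and using the objects created above:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">pred_class <- ifelse(predicted > optCutOff, 1, 0)  # apply the cutoff
mean(pred_class != testData$ABOVE50K)              # proportion misclassified, ~0.09</code></pre></div>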
<h4>ROC</h4>
<p>The Receiver Operating Characteristics (ROC) curve traces the percentage of true positives accurately predicted by a given logit model as the prediction probability cutoff is lowered from 1 to 0. For a good model, as the cutoff is lowered, it should mark more of the actual 1’s as positives and fewer of the actual 0’s as 1’s. So the curve should rise steeply, indicating that the true positive rate (Y-axis) increases faster than the false positive rate (X-axis) as the cutoff score decreases. The greater the area under the ROC curve, the better the predictive ability of the model.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">plotROC</span>(testData$ABOVE50K, predicted)</code></pre></div>
<p><img src='screenshots/ROC-Curve.png' width='528' height='528' /></p>
<p>The above model has an area under the ROC curve of 88.78%, which is pretty good.</p>
<h4>Concordance</h4>
<p>Ideally, the model-calculated probability scores of all actual positives (aka ones) should be greater than the model-calculated probability scores of all actual negatives (aka zeroes). Such a model is said to be perfectly concordant, and is highly reliable. This phenomenon is measured by concordance and discordance.</p>
<p>In simpler words, of all possible 1-0 pairs (actuals), <em>concordance</em> is the percentage of pairs in which the score of the actual positive is greater than the score of the actual negative. For a perfect model, this is 100%. So, the higher the concordance, the better the quality of the model.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">Concordance</span>(testData$ABOVE50K, predicted)
<span class="co">#> 0.8915</span></code></pre></div>
<p>The above model, with a concordance of 89.15%, is indeed a good quality model.</p>
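<p>The pair-counting definition can also be sketched directly. The full test set has millions of 1-0 pairs, so this illustrative version works on a random subsample (ties between scores are ignored, so the result will only approximate <code>Concordance()</code>):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">set.seed(100)
ones  <- sample(predicted[testData$ABOVE50K == 1], 500)  # scores of actual 1's
zeros <- sample(predicted[testData$ABOVE50K == 0], 500)  # scores of actual 0's
# fraction of sampled 1-0 pairs where the 1's score exceeds the 0's score
mean(outer(ones, zeros, ">"))</code></pre></div>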
<h4>Specificity and Sensitivity</h4>
<p>Sensitivity (or True Positive Rate) is the percentage of 1’s (actuals) correctly predicted by the model, while specificity is the percentage of 0’s (actuals) correctly predicted. Specificity can also be calculated as <span class="math inline">1 − <em>False Positive Rate</em></span>.</p>
<p><br /><span class="math display">$$Sensitivity = \frac{\# \ Actual \ 1's \ and \ Predicted \ as \ 1's}{\# \ of \ Actual \ 1's}$$</span><br /></p>
<p><br /><span class="math display">$$Specificity = {\# \ Actual \ 0's \ and \ Predicted \ as \ 0's \over \# \ of \ Actual \ 0's}$$</span><br /></p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">sensitivity</span>(testData$ABOVE50K, predicted, <span class="dt">threshold =</span> optCutOff)
<span class="co">#> 0.3089</span>
<span class="kw">specificity</span>(testData$ABOVE50K, predicted, <span class="dt">threshold =</span> optCutOff)
<span class="co">#> 0.9836</span></code></pre></div>
<p>The above numbers are calculated on the validation sample that was not used for training the model. So, a truth detection rate of 31% on test data is good.</p>
<h4>Confusion Matrix</h4>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">confusionMatrix</span>(testData$ABOVE50K, predicted, <span class="dt">threshold =</span> optCutOff)
<span class="co"># Columns are actuals; rows are predictions.</span>
<span class="co">#> 0 1</span>
<span class="co">#> 0 18918 1626</span>
<span class="co">#> 1 314 727</span></code></pre></div>
</div>
</div>
<div class="footer">
<hr>
<p>© 2016-17 Selva Prabhakaran. Powered by <a href="http://jekyllrb.com/">jekyll</a>,
<a href="http://yihui.name/knitr/">knitr</a>, and
<a href="http://johnmacfarlane.net/pandoc/">pandoc</a>.
This work is licensed under the <a href="http://creativecommons.org/licenses/by-nc/3.0/">Creative Commons License.</a>
</p>
</div>
</div> <!-- /container -->
<script src="//code.jquery.com/jquery.js"></script>
<script src="www/bootstrap.min.js"></script>
<script src="www/toc.js"></script>
<!-- MathJax Script -->
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}
});
</script>
<script type="text/javascript"
src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<!-- Google Analytics Code -->
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-69351797-1', 'auto');
ga('send', 'pageview');
</script>
<style type="text/css">
/* reduce spacing around math formula*/
.MathJax_Display {
margin: 0em 0em;
}
body {
font-family: 'Helvetica Neue', Roboto, Arial, sans-serif;
font-size: 16px;
line-height: 27px;
font-weight: 400;
}
blockquote p {
line-height: 1.75;
color: #717171;
}
.well li{
line-height: 28px;
}
li.dropdown-header {
display: block;
padding: 0px;
font-size: 14px;
}
</style>
</body>
</html>