<!DOCTYPE html>
<html>
<head>
<title>Feature Selection With R | Boruta</title>
<meta charset="utf-8">
<meta name="Description" content="R Language Tutorials for Advanced Statistics">
<meta name="Keywords" content="R, Tutorial, Machine learning, Statistics, Data Mining, Analytics, Data science, Linear Regression, Logistic Regression, Time series, Forecasting">
<meta name="Distribution" content="Global">
<meta name="Author" content="Selva Prabhakaran">
<meta name="Robots" content="index, follow">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="shortcut icon" href="/screenshots/iconb-64.png" type="image/x-icon" />
<link href="www/bootstrap.min.css" rel="stylesheet">
<link href="www/highlight.css" rel="stylesheet">
<link href='http://fonts.googleapis.com/css?family=Inconsolata:400,700'
rel='stylesheet' type='text/css'>
<!-- Color Script -->
<style type="text/css">
a {
color: #3675C5;
color: rgb(25, 145, 248);
color: #4582ec;
color: #3F73D8;
}
li {
line-height: 1.65;
}
/* reduce spacing around math formula*/
.MathJax_Display {
margin: 0em 0em;
}
</style>
<!-- Add Google search -->
<script language="Javascript" type="text/javascript">
function my_search_google()
{
var query = document.getElementById("my-google-search").value;
window.open("http://google.com/search?q=" + query
+ "%20site:" + "http://r-statistics.co");
}
</script>
</head>
<body>
<div class="container">
<div class="masthead">
<ul class="nav nav-pills pull-right">
<div class="input-group">
<form onsubmit="my_search_google()">
<input type="text" class="form-control" id="my-google-search" placeholder="Search..">
</form>
</div><!-- /input-group -->
</ul><!-- /.col-lg-6 -->
<h3 class="muted"><a href="/">r-statistics.co</a><small> by Selva Prabhakaran</small></h3>
<hr>
</div>
<div class="row">
<div class="col-xs-12 col-sm-3" id="nav">
<div class="well">
<li>
<ul class="list-unstyled">
<li class="dropdown-header"></li>
<li class="dropdown-header">Tutorial</li>
<li><a href="R-Tutorial.html">R Tutorial</a></li>
<li class="dropdown-header">ggplot2</li>
<li><a href="ggplot2-Tutorial-With-R.html">ggplot2 Short Tutorial</a></li>
<li><a href="Complete-Ggplot2-Tutorial-Part1-With-R-Code.html">ggplot2 Tutorial 1 - Intro</a></li>
<li><a href="Complete-Ggplot2-Tutorial-Part2-Customizing-Theme-With-R-Code.html">ggplot2 Tutorial 2 - Theme</a></li>
<li><a href="Top50-Ggplot2-Visualizations-MasterList-R-Code.html">ggplot2 Tutorial 3 - Masterlist</a></li>
<li><a href="ggplot2-cheatsheet.html">ggplot2 Quickref</a></li>
<li class="dropdown-header">Foundations</li>
<li><a href="Linear-Regression.html">Linear Regression</a></li>
<li><a href="Statistical-Tests-in-R.html">Statistical Tests</a></li>
<li><a href="Missing-Value-Treatment-With-R.html">Missing Value Treatment</a></li>
<li><a href="Outlier-Treatment-With-R.html">Outlier Analysis</a></li>
<li><a href="Variable-Selection-and-Importance-With-R.html">Feature Selection</a></li>
<li><a href="Model-Selection-in-R.html">Model Selection</a></li>
<li><a href="Logistic-Regression-With-R.html">Logistic Regression</a></li>
<li><a href="Environments.html">Advanced Linear Regression</a></li>
<li class="dropdown-header">Advanced Regression Models</li>
<li><a href="adv-regression-models.html">Advanced Regression Models</a></li>
<li class="dropdown-header">Time Series</li>
<li><a href="Time-Series-Analysis-With-R.html">Time Series Analysis</a></li>
<li><a href="Time-Series-Forecasting-With-R.html">Time Series Forecasting </a></li>
<li><a href="Time-Series-Forecasting-With-R-part2.html">More Time Series Forecasting</a></li>
<li class="dropdown-header">High Performance Computing</li>
<li><a href="Parallel-Computing-With-R.html">Parallel computing</a></li>
<li><a href="Strategies-To-Improve-And-Speedup-R-Code.html">Strategies to Speedup R code</a></li>
<li class="dropdown-header">Useful Techniques</li>
<li><a href="Association-Mining-With-R.html">Association Mining</a></li>
<li><a href="Multi-Dimensional-Scaling-With-R.html">Multi Dimensional Scaling</a></li>
<li><a href="Profiling.html">Optimization</a></li>
<li><a href="Information-Value-With-R.html">InformationValue package</a></li>
</ul>
</li>
</div>
<div class="well">
<p>Stay up-to-date. <a href="https://docs.google.com/forms/d/1xkMYkLNFU9U39Dd8S_2JC0p8B5t6_Yq6zUQjanQQJpY/viewform">Subscribe!</a></p>
<p><a href="https://docs.google.com/forms/d/13GrkCFcNa-TOIllQghsz2SIEbc-YqY9eJX02B19l5Ow/viewform">Chat!</a></p>
</div>
<h4>Contents</h4>
<ul class="list-unstyled" id="toc"></ul>
<!--
<hr>
<p><a href="/contribute.html">How to contribute</a></p>
<p><a class="btn btn-primary" href="">Edit this page</a></p>
-->
</div>
<div id="content" class="col-xs-12 col-sm-8 pull-right">
<h1>Feature Selection Approaches</h1>
<blockquote>
<p>Finding the most important predictor variables (or features) that explain most of the variance in the response variable is key to identifying and building high-performing models.</p>
</blockquote>
<h2>Import Data</h2>
<p>To illustrate the various methods, we will use the ‘Ozone’ data from the ‘mlbench’ package, except for the information value method, which applies only to binary categorical response variables.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">inputData <-<span class="st"> </span><span class="kw">read.csv</span>(<span class="st">"http://rstatistics.net/wp-content/uploads/2015/09/ozone1.csv"</span>, <span class="dt">stringsAsFactors=</span><span class="ot">FALSE</span>)</code></pre></div>
<h2>1. Random Forest Method</h2>
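<p>A random forest can be very effective at finding a set of predictors that best explains the variance in the response variable. The example below fits a conditional inference forest with <code>cforest</code> from the ‘party’ package.</p>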
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(party)
cf1 <-<span class="st"> </span><span class="kw">cforest</span>(ozone_reading ~<span class="st"> </span>. , <span class="dt">data=</span> inputData, <span class="dt">control=</span><span class="kw">cforest_unbiased</span>(<span class="dt">mtry=</span><span class="dv">2</span>,<span class="dt">ntree=</span><span class="dv">50</span>)) <span class="co"># fit the random forest</span>
<span class="kw">varimp</span>(cf1) <span class="co"># get variable importance, based on mean decrease in accuracy</span>
<span class="co">#=> Month Day_of_month Day_of_week </span>
<span class="co">#=> 0.689167598 0.115937291 -0.004641633 </span>
<span class="co">#=> pressure_height Wind_speed Humidity </span>
<span class="co">#=> 5.519633507 0.125868789 3.474611356 </span>
<span class="co">#=> Temperature_Sandburg Temperature_ElMonte Inversion_base_height </span>
<span class="co">#=> 12.878794481 14.175901506 4.276103121 </span>
<span class="co">#=> Pressure_gradient Inversion_temperature Visibility </span>
<span class="co">#=> 3.234732558 11.738969777 2.283430842</span>
<span class="kw">varimp</span>(cf1, <span class="dt">conditional=</span><span class="ot">TRUE</span>) <span class="co"># conditional=TRUE adjusts for correlations between predictors</span>
<span class="co">#=> Month Day_of_month Day_of_week </span>
<span class="co">#=> 0.08899435 0.19311805 0.02526252 </span>
<span class="co">#=> pressure_height Wind_speed Humidity </span>
<span class="co">#=> 0.35458493 -0.19089686 0.14617239 </span>
<span class="co">#=> Temperature_Sandburg Temperature_ElMonte Inversion_base_height </span>
<span class="co">#=> 0.74640367 1.19786882 0.69662788 </span>
<span class="co">#=> Pressure_gradient Inversion_temperature Visibility </span>
<span class="co">#=> 0.58295887 0.65507322 0.05380003</span>
<span class="kw">varimpAUC</span>(cf1) <span class="co"># AUC-based importance; designed for binary classification responses, where it is more robust to class imbalance</span>
<span class="co">#=> Month Day_of_month Day_of_week </span>
<span class="co">#=> 1.12821259 -0.04079495 0.07800158 </span>
<span class="co">#=> pressure_height Wind_speed Humidity </span>
<span class="co">#=> 5.85160593 0.11250973 3.32289714 </span>
<span class="co">#=> Temperature_Sandburg Temperature_ElMonte Inversion_base_height </span>
<span class="co">#=> 11.97425093 13.66085973 3.70572939 </span>
<span class="co">#=> Pressure_gradient Inversion_temperature Visibility </span>
<span class="co">#=> 3.05169171 11.48762432 2.04145930</span></code></pre></div>
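<p>To turn the importance scores into a shortlist, one option is to rank them and keep only the variables that score above the noise level. The sketch below is an illustration rather than part of the <code>party</code> API: it hard-codes the unconditional <code>varimp(cf1)</code> values printed above (the name <code>imp</code> is hypothetical) and assumes a common rule of thumb, namely that a variable is informative only if its importance exceeds the absolute value of the most negative score.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Importance scores copied from the varimp(cf1) output above (illustrative stand-in for the real result)
imp <- c(Month = 0.689167598, Day_of_month = 0.115937291, Day_of_week = -0.004641633,
         pressure_height = 5.519633507, Wind_speed = 0.125868789, Humidity = 3.474611356,
         Temperature_Sandburg = 12.878794481, Temperature_ElMonte = 14.175901506,
         Inversion_base_height = 4.276103121, Pressure_gradient = 3.234732558,
         Inversion_temperature = 11.738969777, Visibility = 2.283430842)

ranked <- sort(imp, decreasing = TRUE)  # most important variable first

# Assumed heuristic: negative scores are pure noise, so their magnitude sets the noise floor.
# Including 0 keeps the threshold at zero when no score is negative.
threshold <- abs(min(c(imp, 0)))

selected <- names(ranked[ranked > threshold])  # variables above the noise floor
print(selected)</code></pre></div>
<p>With a fitted forest, <code>imp</code> would simply be the result of <code>varimp(cf1)</code>. The cutoff is a heuristic and is worth checking against domain knowledge.</p>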
<h2>2. Relative Importance</h2>
<p>Using <code>calc.relimp</code> {relaimpo}, the relative importance of each variable in an <code>lm</code> model can be expressed as its share of the explained variance.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(relaimpo)
lmMod <-<span class="st"> </span><span class="kw">lm</span>(ozone_reading ~<span class="st"> </span>. , <span class="dt">data =</span> inputData) <span class="co"># fit lm() model</span>
relImportance <-<span class="st"> </span><span class="kw">calc.relimp</span>(lmMod, <span class="dt">type =</span> <span class="st">"lmg"</span>, <span class="dt">rela =</span> <span class="ot">TRUE</span>) <span class="co"># calculate relative importance, normalized to sum to 1 (100%)</span>
<span class="kw">sort</span>(relImportance$lmg, <span class="dt">decreasing=</span><span class="ot">TRUE</span>) <span class="co"># relative importance</span>
<span class="co">#=> Temperature_ElMonte Temperature_Sandburg Inversion_temperature </span>
<span class="co">#=> 0.2297491560 0.2095385438 0.1692950876 </span>
<span class="co">#=> pressure_height Inversion_base_height Humidity </span>
<span class="co">#=> 0.1104276154 0.1000912612 0.0833080699 </span>
<span class="co">#=> Visibility Pressure_gradient Month </span>
<span class="co">#=> 0.0433277042 0.0320457048 0.0164342902 </span>
<span class="co">#=> Wind_speed Day_of_month Day_of_week </span>
<span class="co">#=> 0.0034984964 0.0016927799 0.0005912906</span></code></pre></div>
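<p>Because the shares sum to 1, a cumulative sum shows how many variables are needed to cover a given fraction of the explained variance. The following sketch hard-codes the sorted <code>lmg</code> values printed above. The vector name <code>lmg</code> and the 90% cutoff are assumptions chosen for illustration, not a recommendation from <code>relaimpo</code>.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Sorted relative importances copied from the output above (illustrative stand-in)
lmg <- c(Temperature_ElMonte = 0.2297491560, Temperature_Sandburg = 0.2095385438,
         Inversion_temperature = 0.1692950876, pressure_height = 0.1104276154,
         Inversion_base_height = 0.1000912612, Humidity = 0.0833080699,
         Visibility = 0.0433277042, Pressure_gradient = 0.0320457048,
         Month = 0.0164342902, Wind_speed = 0.0034984964,
         Day_of_month = 0.0016927799, Day_of_week = 0.0005912906)

cutoff <- 0.9                        # hypothetical coverage target
cum_share <- cumsum(lmg)             # running total of explained-variance share
n_keep <- which(cum_share >= cutoff)[1]  # first position that reaches the target
keep <- names(lmg)[seq_len(n_keep)]
print(keep)</code></pre></div>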
<h2>4. MARS</h2>
<p>The earth package implements variable importance based on generalized cross-validation (GCV), the number of subset models in which the variable occurs (nsubsets), and the residual sum of squares (RSS).</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(earth)
marsModel <-<span class="st"> </span><span class="kw">earth</span>(ozone_reading ~<span class="st"> </span>., <span class="dt">data=</span>inputData) <span class="co"># build model</span>
ev <-<span class="st"> </span><span class="kw">evimp</span> (marsModel) <span class="co"># estimate variable importance</span>
<span class="co">#=> nsubsets gcv rss</span>
<span class="co">#=> Temperature_ElMonte 29 100.0 100.0</span>
<span class="co">#=> Pressure_gradient 28 42.5 48.4</span>
<span class="co">#=> pressure_height 26 30.1 38.1</span>
<span class="co">#=> Month9 25 26.1 34.8</span>
<span class="co">#=> Month5 24 21.9 31.7</span>
<span class="co">#=> Month4 23 19.9 30.0</span>
<span class="co">#=> Month3 22 17.6 28.3</span>
<span class="co">#=> Inversion_base_height 21 14.4 26.1</span>
<span class="co">#=> Month11 19 12.3 24.1</span>
<span class="co">#=> Visibility 18 11.4 23.2</span>
<span class="co">#=> Day_of_month23 14 8.9 19.8</span>
<span class="co">#=> Humidity 13 7.4 18.7</span>
<span class="co">#=> Month6 11 5.1 16.6</span>
<span class="co">#=> Temperature_Sandburg 9 7.0 15.6</span>
<span class="co">#=> Wind_speed 7 5.1 13.4</span>
<span class="co">#=> Month12 6 4.2 12.3</span>
<span class="co">#=> Day_of_month9 3 4.6 9.1</span>
<span class="co">#=> Day_of_week4 2 -3.9 5.9</span>
<span class="co">#=> Day_of_month7-unused 1 -4.7 2.8</span>
<span class="kw">plot</span>(ev)</code></pre></div>
<p><img src='screenshots/variable-importance-mars.png' width='528' height='349' /></p>
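<p>The importance table lists individual dummy-coded levels such as <code>Month9</code> or <code>Day_of_month23</code> rather than the original columns. A minimal base R sketch for mapping those terms back to their parent predictors follows. The term names are copied from the output above, the helper <code>parent_of</code> is hypothetical, and the approach assumes that every term name begins with the name of its source column.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Term names as printed in the importance table above
term_names <- c("Temperature_ElMonte", "Pressure_gradient", "pressure_height", "Month9",
                "Month5", "Month4", "Month3", "Inversion_base_height", "Month11",
                "Visibility", "Day_of_month23", "Humidity", "Month6",
                "Temperature_Sandburg", "Wind_speed", "Month12", "Day_of_month9",
                "Day_of_week4", "Day_of_month7-unused")

# Original predictor columns of the Ozone data used in this tutorial
predictors <- c("Month", "Day_of_month", "Day_of_week", "pressure_height", "Wind_speed",
                "Humidity", "Temperature_Sandburg", "Temperature_ElMonte",
                "Inversion_base_height", "Pressure_gradient", "Inversion_temperature",
                "Visibility")

term_names <- sub("-unused$", "", term_names)  # drop the "-unused" marker

# Hypothetical helper: return the longest predictor name that prefixes the term
parent_of <- function(term) {
  hits <- predictors[startsWith(term, predictors)]
  hits[which.max(nchar(hits))]
}

parents <- unique(vapply(term_names, parent_of, character(1), USE.NAMES = FALSE))
print(parents)</code></pre></div>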
<h2>5. Step-wise Regression</h2>
<p>If you have a large number of predictors (> 15), split <code>inputData</code> into chunks of 10 predictors, with each chunk also holding the response variable (<code>responseVar</code>).</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">base.mod <-<span class="st"> </span><span class="kw">lm</span>(ozone_reading ~<span class="st"> </span><span class="dv">1</span> , <span class="dt">data=</span> inputData) <span class="co"># base intercept only model</span>
all.mod <-<span class="st"> </span><span class="kw">lm</span>(ozone_reading ~<span class="st"> </span>. , <span class="dt">data=</span> inputData) <span class="co"># full model with all predictors</span>
stepMod <-<span class="st"> </span><span class="kw">step</span>(base.mod, <span class="dt">scope =</span> <span class="kw">list</span>(<span class="dt">lower =</span> base.mod, <span class="dt">upper =</span> all.mod), <span class="dt">direction =</span> <span class="st">"both"</span>, <span class="dt">trace =</span> <span class="dv">0</span>, <span class="dt">steps =</span> <span class="dv">1000</span>) <span class="co"># perform step-wise algorithm</span>
shortlistedVars <-<span class="st"> </span><span class="kw">names</span>(<span class="kw">unlist</span>(stepMod[[<span class="dv">1</span>]])) <span class="co"># get the names of the shortlisted terms (model coefficients)</span>
shortlistedVars <-<span class="st"> </span>shortlistedVars[!shortlistedVars %in%<span class="st"> "(Intercept)"</span>] <span class="co"># remove intercept </span>
<span class="kw">print</span>(shortlistedVars)
<span class="co">#=> [1] "Temperature_Sandburg" "Humidity" "Temperature_ElMonte" </span>
<span class="co">#=> [4] "Month" "pressure_height" "Inversion_base_height"</span></code></pre></div>
<p>Because stepwise selection is based on linear regression, the output can include individual levels of categorical variables rather than the variables themselves.</p>
<p>If you have a large number of predictor variables (100+), the above code may need to be placed in a loop that runs stepwise selection on sequential chunks of predictors. The shortlisted variables can be accumulated at the end of each iteration for further analysis. This can be a very effective method if you want to (i) be highly selective about discarding valuable predictor variables, or (ii) build multiple models on the response variable.</p>
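<p>The chunked loop described above could be sketched as follows. This is only an illustration under stated assumptions: the data are simulated (30 noise predictors named <code>V1</code> to <code>V30</code>, of which three drive the hypothetical response <code>y</code>), the chunk size of 10 follows the suggestion above, and the variable names are not from the Ozone data.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">set.seed(1)
n <- 200
simData <- as.data.frame(matrix(rnorm(n * 30), n, 30))  # columns V1..V30
simData$y <- 3 * simData$V1 - 2 * simData$V12 + 1.5 * simData$V25 + rnorm(n)

predictors <- setdiff(names(simData), "y")
chunks <- split(predictors, ceiling(seq_along(predictors) / 10))  # chunks of 10 predictors

shortlisted <- character(0)
for (chunk in chunks) {
  chunkData <- simData[, c("y", chunk)]        # each chunk also holds the response
  base.mod <- lm(y ~ 1, data = chunkData)      # intercept-only model
  all.mod <- lm(y ~ ., data = chunkData)       # all predictors in this chunk
  stepMod <- step(base.mod,
                  scope = list(lower = formula(base.mod), upper = formula(all.mod)),
                  direction = "both", trace = 0)
  # accumulate the terms kept in this chunk, dropping the intercept
  shortlisted <- c(shortlisted, setdiff(names(coef(stepMod)), "(Intercept)"))
}
print(shortlisted)</code></pre></div>
<p>Since each chunk is screened without the predictors from the other chunks, it is usually worth running one more stepwise pass over the accumulated shortlist.</p>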
<h2>6. Boruta</h2>
<p>The ‘Boruta’ method can be used to decide if a variable is important or not.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(Boruta)
<span class="co"># Decide if a variable is important or not using Boruta</span>
boruta_output <-<span class="st"> </span><span class="kw">Boruta</span>(ozone_reading ~<span class="st"> </span>., <span class="dt">data=</span><span class="kw">na.omit</span>(inputData), <span class="dt">doTrace=</span><span class="dv">2</span>) <span class="co"># perform Boruta search</span>
<span class="co"># Confirmed 10 attributes: Humidity, Inversion_base_height, Inversion_temperature, Month, Pressure_gradient and 5 more.</span>
<span class="co"># Rejected 3 attributes: Day_of_month, Day_of_week, Wind_speed.</span>
boruta_signif <-<span class="st"> </span><span class="kw">names</span>(boruta_output$finalDecision[boruta_output$finalDecision %in%<span class="st"> </span><span class="kw">c</span>(<span class="st">"Confirmed"</span>, <span class="st">"Tentative"</span>)]) <span class="co"># collect Confirmed and Tentative variables</span>
<span class="kw">print</span>(boruta_signif) <span class="co"># significant variables</span>
<span class="co">#=> [1] "Month" "ozone_reading" "pressure_height" </span>
<span class="co">#=> [4] "Humidity" "Temperature_Sandburg" "Temperature_ElMonte" </span>
<span class="co">#=> [7] "Inversion_base_height" "Pressure_gradient" "Inversion_temperature"</span>
<span class="co">#=> [10] "Visibility"</span>
<span class="kw">plot</span>(boruta_output, <span class="dt">cex.axis=</span>.<span class="dv">7</span>, <span class="dt">las=</span><span class="dv">2</span>, <span class="dt">xlab=</span><span class="st">""</span>, <span class="dt">main=</span><span class="st">"Variable Importance"</span>) <span class="co"># plot variable importance</span></code></pre></div>
<p><img src='screenshots/boruta-variable-importance.png' width='528' height='305' /></p>
<h2>7. Information value and Weight of evidence</h2>
<p>The <a href="https://cran.r-project.org/web/packages/InformationValue/vignettes/InformationValue.html">InformationValue package</a> provides convenient functions to compute <em>weights of evidence</em> and <em>information value</em> for categorical variables.</p>
<p><strong>Weights of Evidence (WOE)</strong> provides a method of recoding a categorical X variable to a continuous variable. For each category of a categorical variable, the <strong>WOE</strong> is calculated as:</p>
<p><br /><span class="math display">$$WOE = \ln \left(\frac{percentage\ good\ of\ all\ goods}{percentage\ bad\ of\ all\ bads}\right)$$</span><br /></p>
<p>In the above formula, ‘goods’ are the ‘ones’ and ‘bads’ are the ‘zeros’ of the binary response.</p>
<p><strong>Information Value (IV)</strong> measures how well a categorical <code>x</code> variable separates the goods from the bads. For each category of <code>x</code>, the information value is computed as:</p>
<p><br /><span class="math display">$$Information\ Value_{category} = \left(percentage\ good\ of\ all\ goods - percentage\ bad\ of\ all\ bads\right) \times WOE$$</span><br /></p>
<p>The total IV of a variable is the sum of the IVs of its categories.</p>
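<p>To make the definitions concrete, here is a small base R sketch that computes WOE and IV by hand on a made-up ten-row data set. It takes each category's IV to be the difference between its share of goods and its share of bads, multiplied by its WOE, which is the usual definition. The vectors <code>x</code> and <code>y</code> are invented for illustration, and the code does not use the <code>InformationValue</code> package.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Toy data (hypothetical): a two-level categorical predictor and a binary response
x <- c("a", "a", "a", "a", "b", "b", "b", "b", "b", "b")
y <- c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0)

by_cat <- split(y, x)
goods <- vapply(by_cat, function(v) sum(v == 1), numeric(1))  # ones per category
bads <- vapply(by_cat, function(v) sum(v == 0), numeric(1))   # zeros per category

pct_good <- goods / sum(goods)  # share of all goods falling in each category
pct_bad <- bads / sum(bads)     # share of all bads falling in each category

woe <- log(pct_good / pct_bad)          # weight of evidence per category
iv_cat <- (pct_good - pct_bad) * woe    # information value per category
iv_total <- sum(iv_cat)                 # total IV of the variable

print(data.frame(goods, bads, woe, iv_cat))
print(iv_total)</code></pre></div>
<p>Each per-category IV is non-negative, because the difference in shares and the WOE always have the same sign. Note that a category with no goods or no bads gives an infinite WOE, so real data usually needs some handling of empty cells.</p>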
<h3>Example</h3>
<p>The following example shows how to create the weights of evidence for categorical variables using the <code>WOE</code> function in the <code>InformationValue</code> package. For this we will use the <code>adult</code> data, imported below. The response variable in <code>adult</code> is <code>ABOVE50K</code>, which indicates whether the yearly salary of the individual in that row exceeds $50K. Several of the predictor variables are categorical. For these, we will compute the <code>InformationValue::IV</code> of each variable and derive the respective <code>WOE</code>s using the <code>InformationValue::WOE</code> function.</p>
<h4>Install package from github</h4>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(devtools)
<span class="kw">install_github</span>(<span class="st">"selva86/InformationValue"</span>)</code></pre></div>
<h4>Import the data</h4>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(InformationValue)
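inputData <-<span class="st"> </span><span class="kw">read.csv</span>(<span class="st">"http://rstatistics.net/wp-content/uploads/2015/09/adult.csv"</span>, <span class="dt">stringsAsFactors=</span><span class="ot">TRUE</span>) <span class="co"># read text columns as factors (no longer the default since R 4.0)</span>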
<span class="kw">head</span>(inputData)
<span class="co">#=> AGE WORKCLASS FNLWGT EDUCATION EDUCATIONNUM MARITALSTATUS</span>
<span class="co">#=> 1 39 State-gov 77516 Bachelors 13 Never-married</span>
<span class="co">#=> 2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse</span>
<span class="co">#=> 3 38 Private 215646 HS-grad 9 Divorced</span>
<span class="co">#=> 4 53 Private 234721 11th 7 Married-civ-spouse</span>
<span class="co">#=> 5 28 Private 338409 Bachelors 13 Married-civ-spouse</span>
<span class="co">#=> 6 37 Private 284582 Masters 14 Married-civ-spouse</span>
<span class="co">#=> OCCUPATION RELATIONSHIP RACE SEX CAPITALGAIN CAPITALLOSS</span>
<span class="co">#=> 1 Adm-clerical Not-in-family White Male 2174 0</span>
<span class="co">#=> 2 Exec-managerial Husband White Male 0 0</span>
<span class="co">#=> 3 Handlers-cleaners Not-in-family White Male 0 0</span>
<span class="co">#=> 4 Handlers-cleaners Husband Black Male 0 0</span>
<span class="co">#=> 5 Prof-specialty Wife Black Female 0 0</span>
<span class="co">#=> 6 Exec-managerial Wife White Female 0 0</span>
<span class="co">#=> HOURSPERWEEK NATIVECOUNTRY ABOVE50K</span>
<span class="co">#=> 1 40 United-States 0</span>
<span class="co">#=> 2 13 United-States 0</span>
<span class="co">#=> 3 40 United-States 0</span>
<span class="co">#=> 4 40 United-States 0</span>
<span class="co">#=> 5 40 Cuba 0</span>
<span class="co">#=> 6 40 United-States 0</span></code></pre></div>
<h4>Calculate the Information Values</h4>
<p>Below, the information value of each categorical variable is calculated using <code>InformationValue::IV</code>, and the strength of each variable is stored in the <code>howgood</code> attribute of the returned result. If you want to dig further into the <code>IV</code> of individual categories within a categorical variable, <a href="https://cran.r-project.org/web/packages/InformationValue/vignettes/InformationValue.html#woetable"><code>InformationValue::WOETable</code></a> will be helpful.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">factor_vars <-<span class="st"> </span><span class="kw">c</span> (<span class="st">"WORKCLASS"</span>, <span class="st">"EDUCATION"</span>, <span class="st">"MARITALSTATUS"</span>, <span class="st">"OCCUPATION"</span>, <span class="st">"RELATIONSHIP"</span>, <span class="st">"RACE"</span>, <span class="st">"SEX"</span>, <span class="st">"NATIVECOUNTRY"</span>) <span class="co"># get all categorical variables</span>
all_iv <-<span class="st"> </span><span class="kw">data.frame</span>(<span class="dt">VARS=</span>factor_vars, <span class="dt">IV=</span><span class="kw">numeric</span>(<span class="kw">length</span>(factor_vars)), <span class="dt">STRENGTH=</span><span class="kw">character</span>(<span class="kw">length</span>(factor_vars)), <span class="dt">stringsAsFactors =</span> <span class="ot">FALSE</span>) <span class="co"># init output dataframe</span>
for (factor_var in factor_vars){
iv <-<span class="st"> </span>InformationValue::<span class="kw">IV</span>(<span class="dt">X=</span>inputData[, factor_var], <span class="dt">Y=</span>inputData$ABOVE50K) <span class="co"># compute the IV once per variable</span>
all_iv[all_iv$VARS ==<span class="st"> </span>factor_var, <span class="st">"IV"</span>] <-<span class="st"> </span><span class="kw">as.numeric</span>(iv) <span class="co"># numeric information value</span>
all_iv[all_iv$VARS ==<span class="st"> </span>factor_var, <span class="st">"STRENGTH"</span>] <-<span class="st"> </span><span class="kw">attr</span>(iv, <span class="st">"howgood"</span>) <span class="co"># strength label</span>
}
all_iv <-<span class="st"> </span>all_iv[<span class="kw">order</span>(-all_iv$IV), ] <span class="co"># sort</span>
<span class="co">#> VARS IV STRENGTH</span>
<span class="co">#> RELATIONSHIP 1.53560810 Highly Predictive</span>
<span class="co">#> MARITALSTATUS 1.33882907 Highly Predictive</span>
<span class="co">#> OCCUPATION 0.77622839 Highly Predictive</span>
<span class="co">#> EDUCATION 0.74105372 Highly Predictive</span>
<span class="co">#> SEX 0.30328938 Highly Predictive</span>
<span class="co">#> WORKCLASS 0.16338802 Highly Predictive</span>
<span class="co">#> NATIVECOUNTRY 0.07939344 Somewhat Predictive</span>
<span class="co">#> RACE 0.06929987 Somewhat Predictive</span></code></pre></div>
<h4>Compute the weights of evidence (optional)</h4>
<p>Optionally, we could create the weights of evidence for the factor variables and use them as continuous variables in place of the factors.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">for(factor_var in factor_vars){
inputData[[factor_var]] <-<span class="st"> </span><span class="kw">WOE</span>(<span class="dt">X=</span>inputData[, factor_var], <span class="dt">Y=</span>inputData$ABOVE50K)
}
<span class="co">#> AGE WORKCLASS FNLWGT EDUCATION EDUCATIONNUM MARITALSTATUS OCCUPATION</span>
<span class="co">#> 1 39 0.1608547 77516 0.7974104 13 -1.8846680 -0.713645</span>
<span class="co">#> 2 50 0.2254209 83311 0.7974104 13 0.9348331 1.084280</span>
<span class="co">#> 3 38 -0.1278453 215646 -0.5201257 9 -1.0030638 -1.555142</span>
<span class="co">#> 4 53 -0.1278453 234721 -1.7805021 7 0.9348331 -1.555142</span>
<span class="co">#> 5 28 -0.1278453 338409 0.7974104 13 0.9348331 0.943671</span>
<span class="co">#> 6 37 -0.1278453 284582 1.3690863 14 0.9348331 1.084280</span>
<span class="co">#> RELATIONSHIP RACE SEX CAPITALGAIN CAPITALLOSS HOURSPERWEEK</span>
<span class="co">#> 1 -1.015318 0.08064715 0.3281187 2174 0 40</span>
<span class="co">#> 2 0.941801 0.08064715 0.3281187 0 0 13</span>
<span class="co">#> 3 -1.015318 0.08064715 0.3281187 0 0 40</span>
<span class="co">#> 4 0.941801 -0.80794676 0.3281187 0 0 40</span>
<span class="co">#> 5 1.048674 -0.80794676 -0.9480165 0 0 40</span>
<span class="co">#> 6 1.048674 0.08064715 -0.9480165 0 0 40</span>
<span class="co">#> NATIVECOUNTRY ABOVE50K</span>
<span class="co">#> 1 0.02538318 0</span>
<span class="co">#> 2 0.02538318 0</span>
<span class="co">#> 3 0.02538318 0</span>
<span class="co">#> 4 0.02538318 0</span>
<span class="co">#> 5 0.11671564 0</span>
<span class="co">#> 6 0.02538318 0</span></code></pre></div>
<p>The newly created WOE variables can then be used in place of the original factor variables.</p>
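<p>As a rough illustration of using a WOE-coded variable in a model, the sketch below recodes a simulated three-level predictor by hand and feeds it to a logistic regression with base R's <code>glm</code>. The data, the level names and the variable names (<code>x</code>, <code>y</code>, <code>x_woe</code>) are all made up, and the WOE values are computed directly rather than with <code>InformationValue::WOE</code>.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">set.seed(42)
# Simulated data (hypothetical): the probability of a 'good' depends on the level of x
x <- sample(c("low", "mid", "high"), 300, replace = TRUE)
p <- c(low = 0.2, mid = 0.5, high = 0.8)[x]
y <- rbinom(300, 1, p)

# WOE lookup table: one value per category
by_cat <- split(y, x)
pct_good <- vapply(by_cat, function(v) sum(v == 1), numeric(1)) / sum(y == 1)
pct_bad <- vapply(by_cat, function(v) sum(v == 0), numeric(1)) / sum(y == 0)
woe_lookup <- log(pct_good / pct_bad)

# Replace each category label with its WOE, giving a continuous predictor
x_woe <- unname(woe_lookup[x])

# Logistic regression on the WOE-coded variable
fit <- glm(y ~ x_woe, family = binomial)
print(coef(fit))</code></pre></div>
<p>With a single WOE-coded predictor, the fitted slope comes out at essentially 1 and the intercept at the log of the overall goods-to-bads ratio, since WOE is each category's log-odds measured relative to the overall log-odds. This is a useful sanity check on the recoding.</p>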
</div>
</div>
<div class="footer">
<hr>
<p>© 2016-17 Selva Prabhakaran. Powered by <a href="http://jekyllrb.com/">jekyll</a>,
<a href="http://yihui.name/knitr/">knitr</a>, and
<a href="http://johnmacfarlane.net/pandoc/">pandoc</a>.
This work is licensed under the <a href="http://creativecommons.org/licenses/by-nc/3.0/">Creative Commons License.</a>
</p>
</div>
</div> <!-- /container -->
<script src="//code.jquery.com/jquery.js"></script>
<script src="www/bootstrap.min.js"></script>
<script src="www/toc.js"></script>
<!-- MathJax Script -->
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}
});
</script>
<script type="text/javascript"
src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<!-- Google Analytics Code -->
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-69351797-1', 'auto');
ga('send', 'pageview');
</script>
<style type="text/css">
/* reduce spacing around math formula*/
.MathJax_Display {
margin: 0em 0em;
}
body {
font-family: 'Helvetica Neue', Roboto, Arial, sans-serif;
font-size: 16px;
line-height: 27px;
font-weight: 400;
}
blockquote p {
line-height: 1.75;
color: #717171;
}
.well li{
line-height: 28px;
}
li.dropdown-header {
display: block;
padding: 0px;
font-size: 14px;
}
</style>
</body>
</html>