Variable-Selection-and-Importance-With-R.html

<!DOCTYPE html>
<html>
  <head>
    <title>Feature Selection With R | Boruta</title>
    <meta charset="utf-8">
    <meta name="Description" content="R Language Tutorials for Advanced Statistics">
    <meta name="Keywords" content="R, Tutorial, Machine learning, Statistics, Data Mining, Analytics, Data science, Linear Regression, Logistic Regression, Time series, Forecasting">
    <meta name="Distribution" content="Global">
    <meta name="Author" content="Selva Prabhakaran">
    <meta name="Robots" content="index, follow">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <link rel="shortcut icon" href="/screenshots/iconb-64.png" type="image/x-icon" />
    <link href="www/bootstrap.min.css" rel="stylesheet">
    <link href="www/highlight.css" rel="stylesheet">

    <link href='http://fonts.googleapis.com/css?family=Inconsolata:400,700'
      rel='stylesheet' type='text/css'>
    <!-- Color Script -->
    <style type="text/css">
      a {
       color: #3675C5;
       color: rgb(25, 145, 248);
       color: #4582ec;
       color: #3F73D8;
      }
      li {
        line-height: 1.65;
      }
      /* reduce spacing around math formula*/
      .MathJax_Display {
        margin: 0em 0em;
      }   
    </style>
    <!-- Add Google search -->
    <script language="Javascript" type="text/javascript">
  function my_search_google()
  {
    var query = document.getElementById("my-google-search").value;
    window.open("http://google.com/search?q=" + query
	+ "%20site:" + "http://r-statistics.co");
  }
</script>
  </head>

  <body>

    <div class="container">

      <div class="masthead">
       <!--
        <ul class="nav nav-pills pull-right">
          <li class="dropdown">
            <a href="#" class="dropdown-toggle" data-toggle="dropdown">
              Table of contents<b class="caret"></b>
            </a>
            <ul class="dropdown-menu pull-right" role="menu">
              <li class="dropdown-header"></li>
<li class="dropdown-header">Tutorial</li>
<li><a href="R-Tutorial.html">R Tutorial</a></li>
<li class="dropdown-header">ggplot2</li>
<li><a href="ggplot2-Tutorial-With-R.html">ggplot2 Short Tutorial</a></li>
<li><a href="Complete-Ggplot2-Tutorial-Part1-With-R-Code.html">ggplot2 Tutorial 1 - Intro</a></li>
<li><a href="Complete-Ggplot2-Tutorial-Part2-Customizing-Theme-With-R-Code.html">ggplot2 Tutorial 2 - Theme</a></li>
<li><a href="Top50-Ggplot2-Visualizations-MasterList-R-Code.html">ggplot2 Tutorial 3 - Masterlist</a></li>
<li><a href="ggplot2-cheatsheet.html">ggplot2 Quickref</a></li>
<li class="dropdown-header">Foundations</li>
<li><a href="Linear-Regression.html">Linear Regression</a></li>
<li><a href="Statistical-Tests-in-R.html">Statistical Tests</a></li>
<li><a href="Missing-Value-Treatment-With-R.html">Missing Value Treatment</a></li>
<li><a href="Outlier-Treatment-With-R.html">Outlier Analysis</a></li>
<li><a href="Variable-Selection-and-Importance-With-R.html">Feature Selection</a></li>
<li><a href="Model-Selection-in-R.html">Model Selection</a></li>
<li><a href="Logistic-Regression-With-R.html">Logistic Regression</a></li>
<li><a href="Environments.html">Advanced Linear Regression</a></li>

<li class="dropdown-header">Advanced Regression Models</li>
<li><a href="adv-regression-models.html">Advanced Regression Models</a></li>

<li class="dropdown-header">Time Series</li>
<li><a href="Time-Series-Analysis-With-R.html">Time Series Analysis</a></li>
<li><a href="Time-Series-Forecasting-With-R.html">Time Series Forecasting </a></li>
<li><a href="Time-Series-Forecasting-With-R-part2.html">More Time Series Forecasting</a></li>

<li class="dropdown-header">High Performance Computing</li>
<li><a href="Parallel-Computing-With-R.html">Parallel computing</a></li>
<li><a href="Strategies-To-Improve-And-Speedup-R-Code.html">Strategies to Speedup R code</a></li>

<li class="dropdown-header">Useful Techniques</li>
<li><a href="Association-Mining-With-R.html">Association Mining</a></li>
<li><a href="Multi-Dimensional-Scaling-With-R.html">Multi Dimensional Scaling</a></li>
<li><a href="Profiling.html">Optimization</a></li>
<li><a href="Information-Value-With-R.html">InformationValue package</a></li>
            </ul>
          </li>
        </ul>
        -->

        <ul class="nav nav-pills pull-right">
          <div class="input-group">
            <form onsubmit="my_search_google()">
                <input type="text" class="form-control" id="my-google-search" placeholder="Search..">
            <form>
          </div><!-- /input-group -->
        </ul><!-- /.col-lg-6 -->

        <h3 class="muted"><a href="/">r-statistics.co</a><small> by Selva Prabhakaran</small></h3>
        <hr>
      </div>

      <div class="row">
        <div class="col-xs-12 col-sm-3" id="nav">
        <div class="well">
          <li>
            <ul class="list-unstyled">
                <li class="dropdown-header"></li>
<li class="dropdown-header">Tutorial</li>
<li><a href="R-Tutorial.html">R Tutorial</a></li>
<li class="dropdown-header">ggplot2</li>
<li><a href="ggplot2-Tutorial-With-R.html">ggplot2 Short Tutorial</a></li>
<li><a href="Complete-Ggplot2-Tutorial-Part1-With-R-Code.html">ggplot2 Tutorial 1 - Intro</a></li>
<li><a href="Complete-Ggplot2-Tutorial-Part2-Customizing-Theme-With-R-Code.html">ggplot2 Tutorial 2 - Theme</a></li>
<li><a href="Top50-Ggplot2-Visualizations-MasterList-R-Code.html">ggplot2 Tutorial 3 - Masterlist</a></li>
<li><a href="ggplot2-cheatsheet.html">ggplot2 Quickref</a></li>
<li class="dropdown-header">Foundations</li>
<li><a href="Linear-Regression.html">Linear Regression</a></li>
<li><a href="Statistical-Tests-in-R.html">Statistical Tests</a></li>
<li><a href="Missing-Value-Treatment-With-R.html">Missing Value Treatment</a></li>
<li><a href="Outlier-Treatment-With-R.html">Outlier Analysis</a></li>
<li><a href="Variable-Selection-and-Importance-With-R.html">Feature Selection</a></li>
<li><a href="Model-Selection-in-R.html">Model Selection</a></li>
<li><a href="Logistic-Regression-With-R.html">Logistic Regression</a></li>
<li><a href="Environments.html">Advanced Linear Regression</a></li>

<li class="dropdown-header">Advanced Regression Models</li>
<li><a href="adv-regression-models.html">Advanced Regression Models</a></li>

<li class="dropdown-header">Time Series</li>
<li><a href="Time-Series-Analysis-With-R.html">Time Series Analysis</a></li>
<li><a href="Time-Series-Forecasting-With-R.html">Time Series Forecasting </a></li>
<li><a href="Time-Series-Forecasting-With-R-part2.html">More Time Series Forecasting</a></li>

<li class="dropdown-header">High Performance Computing</li>
<li><a href="Parallel-Computing-With-R.html">Parallel computing</a></li>
<li><a href="Strategies-To-Improve-And-Speedup-R-Code.html">Strategies to Speedup R code</a></li>

<li class="dropdown-header">Useful Techniques</li>
<li><a href="Association-Mining-With-R.html">Association Mining</a></li>
<li><a href="Multi-Dimensional-Scaling-With-R.html">Multi Dimensional Scaling</a></li>
<li><a href="Profiling.html">Optimization</a></li>
<li><a href="Information-Value-With-R.html">InformationValue package</a></li>
            </ul>
          </li>
        </div>

        <div class="well">
          <p>Stay up-to-date. <a href="https://docs.google.com/forms/d/1xkMYkLNFU9U39Dd8S_2JC0p8B5t6_Yq6zUQjanQQJpY/viewform">Subscribe!</a></p>
          <p><a href="https://docs.google.com/forms/d/13GrkCFcNa-TOIllQghsz2SIEbc-YqY9eJX02B19l5Ow/viewform">Chat!</a></p>
        </div>

        
        <h4>Contents</h4>
        

          <ul class="list-unstyled" id="toc"></ul>
        <!--
          <hr>
          <p><a href="/contribute.html">How to contribute</a></p>

          <p><a class="btn btn-primary" href="">Edit this page</a></p>
        -->  
        </div>

        <div id="content" class="col-xs-12 col-sm-8 pull-right">

          <h1>Feature Selection Approaches</h1>
<blockquote>
<p>Finding the most important predictor variables (of features) that explains major part of variance of the response variable is key to identify and build high performing models.</p>
</blockquote>
<h2>Import Data</h2>
<p>For illustrating the various methods, we will use the ‘Ozone’ data from ‘mlbench’ package, except for Information value method which is applicable for binary categorical response variables.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">inputData &lt;-<span class="st"> </span><span class="kw">read.csv</span>(<span class="st">&quot;http://rstatistics.net/wp-content/uploads/2015/09/ozone1.csv&quot;</span>, <span class="dt">stringsAsFactors=</span>F)</code></pre></div>
<h2>1. Random Forest Method</h2>
<p>Random forest can be very effective to find a set of predictors that best explains the variance in the response variable.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(party)
cf1 &lt;-<span class="st"> </span><span class="kw">cforest</span>(ozone_reading ~<span class="st"> </span>. , <span class="dt">data=</span> inputData, <span class="dt">control=</span><span class="kw">cforest_unbiased</span>(<span class="dt">mtry=</span><span class="dv">2</span>,<span class="dt">ntree=</span><span class="dv">50</span>)) <span class="co"># fit the random forest</span>
<span class="kw">varimp</span>(cf1) <span class="co"># get variable importance, based on mean decrease in accuracy</span>
<span class="co">#=&gt;                 Month          Day_of_month           Day_of_week </span>
<span class="co">#=&gt;           0.689167598           0.115937291          -0.004641633 </span>
<span class="co">#=&gt;       pressure_height            Wind_speed              Humidity </span>
<span class="co">#=&gt;           5.519633507           0.125868789           3.474611356 </span>
<span class="co">#=&gt;  Temperature_Sandburg   Temperature_ElMonte Inversion_base_height </span>
<span class="co">#=&gt;          12.878794481          14.175901506           4.276103121 </span>
<span class="co">#=&gt;     Pressure_gradient Inversion_temperature            Visibility </span>
<span class="co">#=&gt;           3.234732558          11.738969777           2.283430842</span>
<span class="kw">varimp</span>(cf1, <span class="dt">conditional=</span><span class="ot">TRUE</span>)  <span class="co"># conditional=True, adjusts for correlations between predictors</span>
<span class="co">#=&gt;                 Month          Day_of_month           Day_of_week </span>
<span class="co">#=&gt;            0.08899435            0.19311805            0.02526252 </span>
<span class="co">#=&gt;       pressure_height            Wind_speed              Humidity </span>
<span class="co">#=&gt;            0.35458493           -0.19089686            0.14617239 </span>
<span class="co">#=&gt;  Temperature_Sandburg   Temperature_ElMonte Inversion_base_height </span>
<span class="co">#=&gt;            0.74640367            1.19786882            0.69662788 </span>
<span class="co">#=&gt;     Pressure_gradient Inversion_temperature            Visibility </span>
<span class="co">#=&gt;            0.58295887            0.65507322            0.05380003</span>
<span class="kw">varimpAUC</span>(cf1)  <span class="co"># more robust towards class imbalance.</span>
<span class="co">#=&gt;                 Month          Day_of_month           Day_of_week </span>
<span class="co">#=&gt;            1.12821259           -0.04079495            0.07800158 </span>
<span class="co">#=&gt;       pressure_height            Wind_speed              Humidity </span>
<span class="co">#=&gt;            5.85160593            0.11250973            3.32289714 </span>
<span class="co">#=&gt;  Temperature_Sandburg   Temperature_ElMonte Inversion_base_height </span>
<span class="co">#=&gt;           11.97425093           13.66085973            3.70572939 </span>
<span class="co">#=&gt;     Pressure_gradient Inversion_temperature            Visibility </span>
<span class="co">#=&gt;            3.05169171           11.48762432            2.04145930</span></code></pre></div>
<h2>2. Relative Importance</h2>
<p>Using <code>calc.relimp</code> {relaimpo}, the relative importance of variables fed into a lm model can be determined as a relative percentage.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(relaimpo)
lmMod &lt;-<span class="st"> </span><span class="kw">lm</span>(ozone_reading ~<span class="st"> </span>. , <span class="dt">data =</span> inputData)  <span class="co"># fit lm() model</span>
relImportance &lt;-<span class="st"> </span><span class="kw">calc.relimp</span>(lmMod, <span class="dt">type =</span> <span class="st">&quot;lmg&quot;</span>, <span class="dt">rela =</span> <span class="ot">TRUE</span>)  <span class="co"># calculate relative importance scaled to 100</span>
<span class="kw">sort</span>(relImportance$lmg, <span class="dt">decreasing=</span><span class="ot">TRUE</span>)  <span class="co"># relative importance</span>
<span class="co">#=&gt;   Temperature_ElMonte  Temperature_Sandburg Inversion_temperature </span>
<span class="co">#=&gt;          0.2297491560          0.2095385438          0.1692950876 </span>
<span class="co">#=&gt;       pressure_height Inversion_base_height              Humidity </span>
<span class="co">#=&gt;          0.1104276154          0.1000912612          0.0833080699 </span>
<span class="co">#=&gt;            Visibility     Pressure_gradient                 Month </span>
<span class="co">#=&gt;          0.0433277042          0.0320457048          0.0164342902 </span>
<span class="co">#=&gt;            Wind_speed          Day_of_month           Day_of_week </span>
<span class="co">#=&gt;          0.0034984964          0.0016927799          0.0005912906</span></code></pre></div>
<h2>4. MARS</h2>
<p>The earth package implements variable importance based on Generalized cross validation (GCV), number of subset models the variable occurs (nsubsets) and residual sum of squares (RSS).</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(earth)
marsModel &lt;-<span class="st"> </span><span class="kw">earth</span>(ozone_reading ~<span class="st"> </span>., <span class="dt">data=</span>inputData) <span class="co"># build model</span>
ev &lt;-<span class="st"> </span><span class="kw">evimp</span> (marsModel) <span class="co"># estimate variable importance</span>
<span class="co">#=&gt;                       nsubsets   gcv    rss</span>
<span class="co">#=&gt; Temperature_ElMonte         29 100.0  100.0</span>
<span class="co">#=&gt; Pressure_gradient           28  42.5   48.4</span>
<span class="co">#=&gt; pressure_height             26  30.1   38.1</span>
<span class="co">#=&gt; Month9                      25  26.1   34.8</span>
<span class="co">#=&gt; Month5                      24  21.9   31.7</span>
<span class="co">#=&gt; Month4                      23  19.9   30.0</span>
<span class="co">#=&gt; Month3                      22  17.6   28.3</span>
<span class="co">#=&gt; Inversion_base_height       21  14.4   26.1</span>
<span class="co">#=&gt; Month11                     19  12.3   24.1</span>
<span class="co">#=&gt; Visibility                  18  11.4   23.2</span>
<span class="co">#=&gt; Day_of_month23              14   8.9   19.8</span>
<span class="co">#=&gt; Humidity                    13   7.4   18.7</span>
<span class="co">#=&gt; Month6                      11   5.1   16.6</span>
<span class="co">#=&gt; Temperature_Sandburg         9   7.0   15.6</span>
<span class="co">#=&gt; Wind_speed                   7   5.1   13.4</span>
<span class="co">#=&gt; Month12                      6   4.2   12.3</span>
<span class="co">#=&gt; Day_of_month9                3   4.6    9.1</span>
<span class="co">#=&gt; Day_of_week4                 2  -3.9    5.9</span>
<span class="co">#=&gt; Day_of_month7-unused         1  -4.7    2.8</span>

<span class="kw">plot</span>(ev)</code></pre></div>
<p><img src='screenshots/variable-importance-mars.png' width='528' height='349' /></p>
<h2>5. Step-wise Regression</h2>
<p>If you have large number of predictors (&gt; 15), split the inputData in chunks of 10 predictors with each chunk holding the <code>responseVar</code>.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">base.mod &lt;-<span class="st"> </span><span class="kw">lm</span>(ozone_reading ~<span class="st"> </span><span class="dv">1</span> , <span class="dt">data=</span> inputData)  <span class="co"># base intercept only model</span>
all.mod &lt;-<span class="st"> </span><span class="kw">lm</span>(ozone_reading ~<span class="st"> </span>. , <span class="dt">data=</span> inputData) <span class="co"># full model with all predictors</span>
stepMod &lt;-<span class="st"> </span><span class="kw">step</span>(base.mod, <span class="dt">scope =</span> <span class="kw">list</span>(<span class="dt">lower =</span> base.mod, <span class="dt">upper =</span> all.mod), <span class="dt">direction =</span> <span class="st">&quot;both&quot;</span>, <span class="dt">trace =</span> <span class="dv">0</span>, <span class="dt">steps =</span> <span class="dv">1000</span>)  <span class="co"># perform step-wise algorithm</span>
shortlistedVars &lt;-<span class="st"> </span><span class="kw">names</span>(<span class="kw">unlist</span>(stepMod[[<span class="dv">1</span>]])) <span class="co"># get the shortlisted variable.</span>
shortlistedVars &lt;-<span class="st"> </span>shortlistedVars[!shortlistedVars %in%<span class="st"> &quot;(Intercept)&quot;</span>]  <span class="co"># remove intercept </span>
<span class="kw">print</span>(shortlistedVars)
<span class="co">#=&gt; [1] &quot;Temperature_Sandburg&quot;  &quot;Humidity&quot;              &quot;Temperature_ElMonte&quot;  </span>
<span class="co">#=&gt; [4] &quot;Month&quot;                 &quot;pressure_height&quot;       &quot;Inversion_base_height&quot;</span></code></pre></div>
<p>The output could includes levels within categorical variables, since ‘stepwise’ is a linear regression based technique, as seen above.</p>
<p>If you have a large number of predictor variables (100+), the above code may need to be placed in a loop that will run stepwise on sequential chunks of predictors. The shortlisted variables can be accumulated for further analysis towards the end of each iteration. This can be very effective method, if you want to (i) be highly selective about discarding valuable predictor variables. (ii) build multiple models on the response variable.</p>
<h2>6. Boruta</h2>
<p>The ‘Boruta’ method can be used to decide if a variable is important or not.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(Boruta)
<span class="co"># Decide if a variable is important or not using Boruta</span>
boruta_output &lt;-<span class="st"> </span><span class="kw">Boruta</span>(ozone_reading ~<span class="st"> </span>., <span class="dt">data=</span><span class="kw">na.omit</span>(inputData), <span class="dt">doTrace=</span><span class="dv">2</span>)  <span class="co"># perform Boruta search</span>
<span class="co"># Confirmed 10 attributes: Humidity, Inversion_base_height, Inversion_temperature, Month, Pressure_gradient and 5 more.</span>
<span class="co"># Rejected 3 attributes: Day_of_month, Day_of_week, Wind_speed.</span>
boruta_signif &lt;-<span class="st"> </span><span class="kw">names</span>(boruta_output$finalDecision[boruta_output$finalDecision %in%<span class="st"> </span><span class="kw">c</span>(<span class="st">&quot;Confirmed&quot;</span>, <span class="st">&quot;Tentative&quot;</span>)])  <span class="co"># collect Confirmed and Tentative variables</span>
<span class="kw">print</span>(boruta_signif)  <span class="co"># significant variables</span>
<span class="co">#=&gt; [1] &quot;Month&quot;                 &quot;ozone_reading&quot;         &quot;pressure_height&quot;      </span>
<span class="co">#=&gt; [4] &quot;Humidity&quot;              &quot;Temperature_Sandburg&quot;  &quot;Temperature_ElMonte&quot;  </span>
<span class="co">#=&gt; [7] &quot;Inversion_base_height&quot; &quot;Pressure_gradient&quot;     &quot;Inversion_temperature&quot;</span>
<span class="co">#=&gt; [10] &quot;Visibility&quot;</span>
<span class="kw">plot</span>(boruta_output, <span class="dt">cex.axis=</span>.<span class="dv">7</span>, <span class="dt">las=</span><span class="dv">2</span>, <span class="dt">xlab=</span><span class="st">&quot;&quot;</span>, <span class="dt">main=</span><span class="st">&quot;Variable Importance&quot;</span>)  <span class="co"># plot variable importance</span></code></pre></div>
<p><img src='screenshots/boruta-variable-importance.png' width='528' height='305' /></p>
<h2>7. Information value and Weight of evidence</h2>
<p>The <a href="https://cran.r-project.org/web/packages/InformationValue/vignettes/InformationValue.html">InformationValue package</a> provides convenient functions to compute <em>weights of evidence</em> and <em>information value</em> for categorical variables.</p>
<p><strong>Weights of Evidence (WOE)</strong> provides a method of recoding a categorical X variable to a continuous variable. For each category of a categorical variable, the <strong>WOE</strong> is calculated as:</p>
<p><br /><span class="math display">$$WOE = ln \left(\frac{percentage\ good\ of\ all\ goods}{percentage\ bad\ of\ all\ bads}\right)$$</span><br /></p>
<p>In above formula, ‘goods’ is same as ‘ones’ and ‘bads’ is same as ‘zeros’.</p>
<p><strong>Information Value (IV)</strong> is a measure of the predictive capability of a categorical <code>x</code> variable to accurately predict the goods and bads. For each category of <code>x</code>, information value is computed as:</p>
<p><br /><span class="math display">$$Information Value_{category} = {percentage\ good\ of\ all\ goods - percentage\ bad\ of\ all\ bads \over WOE} $$</span><br /></p>
<p>The total IV of a variable is the sum of IV’s of its categories.</p>
<h3>Example</h3>
<p>Let me demonstrate how to create the weights of evidence for categorical variables using the <code>WOE</code> function in <code>InformationValue</code> pkg. For this we will use the <code>adult</code> data as imported below. The response variable in <code>adult</code> is the <code>ABOVE50K</code> which indicates if the yearly salary of the individual in that row exceeds $50K. We have a number of predictor variables originally, out of which few of them are categorical variables. On these categorical variables, we will derive the respective <code>WOE</code>s using the <code>InformationValue::WOE</code> function. Then, lets find out the <code>InformationValue:IV</code> of all categorical variables.</p>
<h4>Install package from github</h4>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(devtools)
<span class="kw">install_github</span>(<span class="st">&quot;selva86/InformationValue&quot;</span>)</code></pre></div>
<h4>Import the data</h4>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(InformationValue)
inputData &lt;-<span class="st"> </span><span class="kw">read.csv</span>(<span class="st">&quot;http://rstatistics.net/wp-content/uploads/2015/09/adult.csv&quot;</span>)
<span class="kw">head</span>(inputData)
<span class="co">#=&gt;   AGE         WORKCLASS FNLWGT  EDUCATION EDUCATIONNUM       MARITALSTATUS</span>
<span class="co">#=&gt; 1  39         State-gov  77516  Bachelors           13       Never-married</span>
<span class="co">#=&gt; 2  50  Self-emp-not-inc  83311  Bachelors           13  Married-civ-spouse</span>
<span class="co">#=&gt; 3  38           Private 215646    HS-grad            9            Divorced</span>
<span class="co">#=&gt; 4  53           Private 234721       11th            7  Married-civ-spouse</span>
<span class="co">#=&gt; 5  28           Private 338409  Bachelors           13  Married-civ-spouse</span>
<span class="co">#=&gt; 6  37           Private 284582    Masters           14  Married-civ-spouse</span>
<span class="co">#             OCCUPATION   RELATIONSHIP   RACE     SEX CAPITALGAIN CAPITALLOSS</span>
<span class="co">#=&gt; 1       Adm-clerical  Not-in-family  White    Male        2174           0</span>
<span class="co">#=&gt; 2    Exec-managerial        Husband  White    Male           0           0</span>
<span class="co">#=&gt; 3  Handlers-cleaners  Not-in-family  White    Male           0           0</span>
<span class="co">#=&gt; 4  Handlers-cleaners        Husband  Black    Male           0           0</span>
<span class="co">#=&gt; 5     Prof-specialty           Wife  Black  Female           0           0</span>
<span class="co">#=&gt; 6    Exec-managerial           Wife  White  Female           0           0</span>
<span class="co">#     HOURSPERWEEK  NATIVECOUNTRY ABOVE50K</span>
<span class="co">#=&gt; 1           40  United-States        0</span>
<span class="co">#=&gt; 2           13  United-States        0</span>
<span class="co">#=&gt; 3           40  United-States        0</span>
<span class="co">#=&gt; 4           40  United-States        0</span>
<span class="co">#=&gt; 5           40           Cuba        0</span>
<span class="co">#=&gt; 6           40  United-States        0</span></code></pre></div>
<h4>Calculate the Information Values</h4>
<p>Below, the information value of each categorical variable is calculated using the <code>InformationValue::IV</code> and the strength of each variable is contained within the <code>howgood</code> attribute in the returned result. If you are want to dig further into the <code>IV</code> of individual categories within a categorical variable, the <a href="https://cran.r-project.org/web/packages/InformationValue/vignettes/InformationValue.html#woetable"><code>InformationValue::WOETable</code></a> will be helpful.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">factor_vars &lt;-<span class="st"> </span><span class="kw">c</span> (<span class="st">&quot;WORKCLASS&quot;</span>, <span class="st">&quot;EDUCATION&quot;</span>, <span class="st">&quot;MARITALSTATUS&quot;</span>, <span class="st">&quot;OCCUPATION&quot;</span>, <span class="st">&quot;RELATIONSHIP&quot;</span>, <span class="st">&quot;RACE&quot;</span>, <span class="st">&quot;SEX&quot;</span>, <span class="st">&quot;NATIVECOUNTRY&quot;</span>)  <span class="co"># get all categorical variables</span>
all_iv &lt;-<span class="st"> </span><span class="kw">data.frame</span>(<span class="dt">VARS=</span>factor_vars, <span class="dt">IV=</span><span class="kw">numeric</span>(<span class="kw">length</span>(factor_vars)), <span class="dt">STRENGTH=</span><span class="kw">character</span>(<span class="kw">length</span>(factor_vars)), <span class="dt">stringsAsFactors =</span> F)  <span class="co"># init output dataframe</span>
for (factor_var in factor_vars){
  all_iv[all_iv$VARS ==<span class="st"> </span>factor_var, <span class="st">&quot;IV&quot;</span>] &lt;-<span class="st"> </span>InformationValue::<span class="kw">IV</span>(<span class="dt">X=</span>inputData[, factor_var], <span class="dt">Y=</span>inputData$ABOVE50K)
  all_iv[all_iv$VARS ==<span class="st"> </span>factor_var, <span class="st">&quot;STRENGTH&quot;</span>] &lt;-<span class="st"> </span><span class="kw">attr</span>(InformationValue::<span class="kw">IV</span>(<span class="dt">X=</span>inputData[, factor_var], <span class="dt">Y=</span>inputData$ABOVE50K), <span class="st">&quot;howgood&quot;</span>)
}

all_iv &lt;-<span class="st"> </span>all_iv[<span class="kw">order</span>(-all_iv$IV), ]  <span class="co"># sort</span>
<span class="co">#&gt;           VARS         IV            STRENGTH</span>
<span class="co">#&gt;   RELATIONSHIP 1.53560810   Highly Predictive</span>
<span class="co">#&gt;  MARITALSTATUS 1.33882907   Highly Predictive</span>
<span class="co">#&gt;     OCCUPATION 0.77622839   Highly Predictive</span>
<span class="co">#&gt;      EDUCATION 0.74105372   Highly Predictive</span>
<span class="co">#&gt;            SEX 0.30328938   Highly Predictive</span>
<span class="co">#&gt;      WORKCLASS 0.16338802   Highly Predictive</span>
<span class="co">#&gt;  NATIVECOUNTRY 0.07939344 Somewhat Predictive</span>
<span class="co">#&gt;           RACE 0.06929987 Somewhat Predictive</span></code></pre></div>
<h4>Compute the weights of evidence (optional)</h4>
<p>Optionally, we could create the weights of evidence for the factor variables and use it as continuous variables in place of the factors.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">for(factor_var in factor_vars){
  inputData[[factor_var]] &lt;-<span class="st"> </span><span class="kw">WOE</span>(<span class="dt">X=</span>inputData[, factor_var], <span class="dt">Y=</span>inputData$ABOVE50K)
}
<span class="co">#&gt;   AGE  WORKCLASS FNLWGT  EDUCATION EDUCATIONNUM MARITALSTATUS OCCUPATION</span>
<span class="co">#&gt; 1  39  0.1608547  77516  0.7974104           13    -1.8846680  -0.713645</span>
<span class="co">#&gt; 2  50  0.2254209  83311  0.7974104           13     0.9348331   1.084280</span>
<span class="co">#&gt; 3  38 -0.1278453 215646 -0.5201257            9    -1.0030638  -1.555142</span>
<span class="co">#&gt; 4  53 -0.1278453 234721 -1.7805021            7     0.9348331  -1.555142</span>
<span class="co">#&gt; 5  28 -0.1278453 338409  0.7974104           13     0.9348331   0.943671</span>
<span class="co">#&gt; 6  37 -0.1278453 284582  1.3690863           14     0.9348331   1.084280</span>

<span class="co">#&gt;   RELATIONSHIP        RACE        SEX CAPITALGAIN CAPITALLOSS HOURSPERWEEK</span>
<span class="co">#&gt; 1    -1.015318  0.08064715  0.3281187        2174           0           40</span>
<span class="co">#&gt; 2     0.941801  0.08064715  0.3281187           0           0           13</span>
<span class="co">#&gt; 3    -1.015318  0.08064715  0.3281187           0           0           40</span>
<span class="co">#&gt; 4     0.941801 -0.80794676  0.3281187           0           0           40</span>
<span class="co">#&gt; 5     1.048674 -0.80794676 -0.9480165           0           0           40</span>
<span class="co">#&gt; 6     1.048674  0.08064715 -0.9480165           0           0           40</span>

<span class="co">#&gt;   NATIVECOUNTRY ABOVE50K</span>
<span class="co">#&gt; 1    0.02538318        0</span>
<span class="co">#&gt; 2    0.02538318        0</span>
<span class="co">#&gt; 3    0.02538318        0</span>
<span class="co">#&gt; 4    0.02538318        0</span>
<span class="co">#&gt; 5    0.11671564        0</span>
<span class="co">#&gt; 6    0.02538318        0</span></code></pre></div>
<p>The newly created woe variables can alternatively be in place of the original factor variables.</p>


        </div>
      </div>

      <div class="footer">
        <hr>
        <p>&copy; 2016-17 Selva Prabhakaran. Powered by <a href="http://jekyllrb.com/">jekyll</a>,
          <a href="http://yihui.name/knitr/">knitr</a>, and
          <a href="http://johnmacfarlane.net/pandoc/">pandoc</a>.
          This work is licensed under the <a href="http://creativecommons.org/licenses/by-nc/3.0/">Creative Commons License.</a>
        </p>
      </div>

    </div> <!-- /container -->

  <script src="//code.jquery.com/jquery.js"></script>
  <script src="www/bootstrap.min.js"></script>
  <script src="www/toc.js"></script>
  <!-- MathJax Script -->
  
  <script type="text/x-mathjax-config">
    MathJax.Hub.Config({
      tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}
    });
  </script>
  <script type="text/javascript"
    src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
  </script>
  
  <!-- Google Analytics Code -->
  <script>
    (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
    (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
    m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
    })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

    ga('create', 'UA-69351797-1', 'auto');
    ga('send', 'pageview');
  </script>

  <style type="text/css">
  /* reduce spacing around math formula*/
    .MathJax_Display {
      margin: 0em 0em;
    }
    body {
      font-family: 'Helvetica Neue', Roboto, Arial, sans-serif;
      font-size: 16px;
      line-height: 27px;
      font-weight: 400;
    }

    blockquote p {
      line-height: 1.75;
      color: #717171;
    }

    .well li{
      line-height: 28px;
    }

    li.dropdown-header {
      display: block;
      padding: 0px;
      font-size: 14px;
    }
  </style>
  </body>
</html>