# (PART) Preprocessing {-}
# Preprocessing
Following the data mining process, we describe what is meant by preprocessing, classical supervised models, unsupervised models and evaluation in the context of software engineering, with examples.
This task is probably the hardest part of the data mining process and where most of the effort is spent. It is quite typical to transform the data by, for example, finding inconsistencies, normalising, imputing missing values, transforming input data, merging variables, etc.
Typically, pre-processing consists of the following tasks (subprocesses):
+ Data cleaning (consistency, noise detection, outliers)
+ Data integration
+ Data transformation (normalisation, discretisation) and derivation of new attributes from existing ones (e.g., population density from population and area)
+ Missing data imputation
+ Data reduction (feature selection and instance selection)
## Data consistency
_Consistent_ data are semantically correct according to real-world knowledge of the problem, i.e., no constraints are violated and the data can be used for inducing models and for analysis. For example, LoC or effort are constrained to non-negative values. We can also require multiple attributes to be consistent among themselves, and even across datasets (e.g., the same metrics collected by different tools).
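As a minimal sketch, with a small hypothetical project dataset (the `projects` data frame below is illustrative, not part of the book's datasets), such constraints can be checked directly:
```{r}
# Hypothetical project data: the third record violates the non-negativity constraint
projects <- data.frame(Project = c("P1", "P2", "P3"),
                       LOC     = c(1200, 450, -30),
                       Effort  = c(35, 12, 8))
# Records violating the constraints LOC >= 0 and Effort >= 0
subset(projects, LOC < 0 | Effort < 0)
```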
## Missing values
_Missing values_ have a negative effect when analysing the data or learning models. The results can be biased when compared with models induced from complete data, they can be harder to analyse, and, depending on the algorithm, it may be necessary to discard records with missing values, which is an important problem with small datasets such as the effort estimation ones.
Missing data is typically classified into:
* MCAR (Missing Completely At Random) or MAR (Missing At Random), where there is no systematic reason for the missing values and we can assume that they follow the attribute's distribution.
* MNAR (Missing Not At Random), where there is a pattern in the missing values and it may be advisable to check the data gathering process to try to understand why such information is missing.
_Imputation_ consists in replacing missing values with estimates of them. Many algorithms cannot handle missing values and, therefore, imputation methods are needed. We can use simple approaches such as replacing the missing values with the mean or mode of the attribute. More elaborate approaches include:
* EM (Expectation-Maximisation)
* Distance-based
    + $k$-NN ($k$-Nearest Neighbours)
    + Clustering
In R, a missing value is represented with `NA` and the analyst must decide what to do with missing data. The simplest approach is to leave out instances with missing data (ignore missing, IM). This functionality is supported by many base functions through the `na.rm` option.
The `mice` R package implements MICE (Multivariate Imputation by Chained Equations), which assumes that data are missing at random. Other packages include `Amelia`, `missForest`, `Hmisc` and `mi`.
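As a minimal sketch on a small hypothetical data frame (the column names are illustrative), the `na.rm` option, a simple mean imputation and `mice` can be used as follows:
```{r message=FALSE, warning=FALSE}
library(mice)
# Hypothetical data frame with missing values
eff <- data.frame(LOC    = c(100, 250, NA, 400, 120, 300, 220, 180),
                  Effort = c(10, 22, 8, NA, 12, NA, 20, 15))
# Ignore missing values (IM) in base functions
mean(eff$Effort, na.rm = TRUE)
# Simple imputation: replace missing values with the attribute mean
effMean <- eff
effMean$Effort[is.na(effMean$Effort)] <- mean(eff$Effort, na.rm = TRUE)
# Multivariate imputation by chained equations (predictive mean matching)
imp <- mice(eff, m = 5, method = "pmm", seed = 1, printFlag = FALSE)
complete(imp)
```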
## Noise
_Noise_ refers to imperfections in real-world data that negatively influence the induced machine learning models. Approaches to deal with noisy data include:
* Robust learners capable of handling noisy data (e.g., C4.5 through pruning strategies)
* Data polishing methods, which aim to correct noisy instances prior to training
* Noise filters which are used to identify and eliminate noisy instances from the training data.
Types of noisy data:
* Class noise (a.k.a. label noise).
    + There can be contradictory cases (all attributes have the same values except the class), as illustrated in the sketch after this list.
    + Misclassifications: the class attribute is not labelled with the true label (ground truth).
* Attribute noise: values of attributes that are noisy, missing or unknown.
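A minimal sketch of detecting contradictory cases (class noise) with base R, using a small hypothetical training set (the attribute names are illustrative):
```{r}
# Hypothetical training set: instances 1 and 3 share all attribute values
# but have different class labels, i.e., they are contradictory cases
train <- data.frame(LOC      = c(120, 200, 120, 80),
                    Branches = c(10, 25, 10, 5),
                    Class    = c("Y", "N", "N", "N"))
# Number of distinct class labels per combination of attribute values
labels.per.case <- aggregate(Class ~ LOC + Branches, data = train,
                             FUN = function(x) length(unique(x)))
# Attribute-value combinations labelled with more than one class
subset(labels.per.case, Class > 1)
```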
## Outliers
There is a large body of literature related to outlier detection, and several definitions of outlier exist.
```{r message=FALSE, warning=FALSE}
library(DMwR2)
library(foreign)
kc1 <- read.arff("./datasets/defectPred/D1/KC1.arff")
```
The LOF (Local Outlier Factor) algorithm, implemented by `lofactor`, takes a data set and produces a vector with a local outlier factor for each case.
```{r message=FALSE, warning=FALSE}
kc1num <- kc1[, 1:21]                      # numeric metrics only (column 22 is the class)
outlier.scores <- lofactor(kc1num, k = 5)  # local outlier factor for each instance
plot(density(na.omit(outlier.scores)))
outliers <- order(outlier.scores, decreasing = TRUE)[1:5]  # indices of the top-5 outliers
print(outliers)
```
Another simple method, for positive observations, is that of Hidiroglou and Berthelot.
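The details are not reproduced here; as a minimal sketch, a common single-variable adaptation of the Hidiroglou-Berthelot idea first centres the positive values around their median, so that deviations below and above it are comparable, and then flags values far from the quartiles of the transformed scores (the constants `C` and `A` below are arbitrary tuning parameters):
```{r}
x <- kc1$LOC_TOTAL[kc1$LOC_TOTAL > 0]     # positive observations only
med <- median(x)
# Hidiroglou-Berthelot centring transformation
s <- ifelse(x < med, 1 - med / x, x / med - 1)
q <- quantile(s, c(0.25, 0.5, 0.75))
C <- 4; A <- 0.05                         # tuning constants (arbitrary here)
lower <- q[2] - C * max(q[2] - q[1], abs(A * q[2]))
upper <- q[2] + C * max(q[3] - q[2], abs(A * q[2]))
head(sort(x[s < lower | s > upper], decreasing = TRUE))
```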
## Feature selection
Feature Selection (FS) aims at identifying the most relevant attributes in a dataset. It is important for several reasons:
* A reduced volume of data allows different data mining or searching techniques to be applied.
* Irrelevant and redundant attributes can generate less accurate and more complex models; removing them also allows data mining algorithms to run faster.
* It avoids the collection of data for those irrelevant and redundant attributes in the future.
The problem of FS has received a thorough treatment in pattern recognition and machine learning. Most FS algorithms tackle the task as a _search_ problem, where each state in the search specifies a distinct subset of the possible attributes [@BL97]. The search procedure is combined with a criterion to evaluate the merit of each candidate subset of attributes. There are multiple possible combinations between search procedures and attribute measures [@LY05].
There are two major approaches in FS from the method's output point of view:
* _Feature subset selection_ (FSS), in which a subset of the original attributes is returned.
* _Feature ranking_, in which attributes are ordered according to an evaluation measure (a subset of features is often selected from the top of the ranking list).
FSS algorithms designed with different evaluation criteria broadly fall into two categories:
* The _filter_ model relies on general characteristics of the data to evaluate and select feature subsets without involving any data mining algorithm.
* The _wrapper_ model requires one predetermined mining algorithm and uses its performance as the evaluation criterion. It searches for features better suited to the mining algorithm, aiming to improve mining performance, but it also tends to be more computationally expensive than the filter model [@KJ97; @Lan94].
Feature subset algorithms search through candidate feature subsets guided by a certain evaluation measure [@LM98] which captures the goodness of each subset. An optimal (or near optimal) subset is selected when the search stops.
Some existing evaluation measures that have been shown effective in removing both irrelevant and redundant features include the consistency measure [@DLM00], the correlation measure [@Hal99] and the estimated accuracy of a learning algorithm [@KJ97].
+ _Consistency_ measure attempts to find a minimum number of features that separate classes as consistently as the full set of features can. An inconsistency is defined as two instances having the same feature values but different class labels.
+ _Correlation_ measure evaluates the goodness of feature subsets based on the hypothesis that good feature subsets contain features highly correlated to the class, yet uncorrelated to each other.
+ _Wrapper-based_ attribute selection uses the target learning algorithm to estimate the worth of attribute subsets. The feature subset selection algorithm conducts a search for a good subset using
the induction algorithm itself as part of the evaluation function.
Langley [-@Lan94] notes that feature selection algorithms that search through the space of feature subsets must address four main issues: (i) the starting point of the search, (ii) the organization of the search, (iii) the evaluation of feature subsets and (iv) the criterion used to terminate the search. Different algorithms address these issues differently.
It is impractical to look at all possible feature subsets, even with a small number of attributes. Feature selection algorithms usually proceed greedily and can be classified into those that add features to an initially empty set (_forward selection_) and those that remove features from an initially complete set (_backward elimination_). Hybrid approaches both add and remove features as the algorithm progresses. Forward selection is much faster than backward elimination and therefore scales better to large data sets. A wide range of search strategies can be used: best-first, branch-and-bound, simulated annealing, genetic algorithms (see Kohavi and John [-@KJ97] for a review).
### FSelector package in R
The `FSelector` package in R implements many of the feature selection algorithms available in Weka.
```{r message=FALSE, warning=FALSE}
library(FSelector)
library(foreign)
cm1 <- read.arff("./datasets/defectPred/D1/CM1.arff")
# Attribute weights according to Random Forest importance
cm1RFWeights <- random.forest.importance(Defective ~ ., cm1)
# Keep the attributes that come before the biggest gap in the sorted weights
cutoff.biggest.diff(cm1RFWeights)
```
Using the Gain Ratio measure for ranking, and then the Chi-squared measure:
```{r message=FALSE, warning=FALSE}
cm1GRWeights <- gain.ratio(Defective ~ ., cm1)
cm1GRWeights
cutoff.biggest.diff(cm1GRWeights)
# We can also weight the attributes with the Chi-squared statistic and select a subset in the same way
cm1X2Weights <- chi.squared(Defective ~ ., cm1)
cutoff.biggest.diff(cm1X2Weights)
```
Using CFS (Correlation-based Feature Selection) attribute selection:
```{r message=FALSE, warning=FALSE}
library(FSelector)
library(foreign)
cm1 <- read.arff("./datasets/defectPred/D1/CM1.arff")
result <- cfs(Defective ~ ., cm1)
f <- as.simple.formula(result, "Defective")
f
```
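As a sketch of the greedy forward search and wrapper model discussed above, the following uses `FSelector`'s `forward.search` with a simple holdout evaluator built around an `rpart` decision tree (the split proportion and the choice of `rpart` are arbitrary choices for illustration):
```{r message=FALSE, warning=FALSE}
library(FSelector)
library(rpart)
set.seed(1)
# Wrapper evaluator: holdout accuracy of a decision tree built on the candidate subset
evaluator <- function(subset) {
  f <- as.simple.formula(subset, "Defective")
  idx <- sample(nrow(cm1), round(0.7 * nrow(cm1)))
  tree <- rpart(f, data = cm1[idx, ])
  preds <- predict(tree, cm1[-idx, ], type = "class")
  mean(preds == cm1[-idx, "Defective"])
}
selected <- forward.search(setdiff(names(cm1), "Defective"), evaluator)
as.simple.formula(selected, "Defective")
```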
Other packages for feature selection in R include `FSelectorRcpp`, which re-implements `FSelector` without the WEKA dependencies.
Another popular package is `Boruta`, which performs feature selection based on Random Forest.
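A minimal sketch with `Boruta` (the number of iterations, `maxRuns`, is kept low here just for speed):
```{r message=FALSE, warning=FALSE}
library(Boruta)
set.seed(1)
# Run the Boruta all-relevant feature selection wrapper around Random Forest
cm1Boruta <- Boruta(Defective ~ ., data = na.omit(cm1), maxRuns = 30)
# Attributes confirmed as relevant (tentative ones are excluded)
getSelectedAttributes(cm1Boruta, withTentative = FALSE)
```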
## Instance selection
Instance selection is the removal of samples (complementary to the removal of attributes) in order to scale down the dataset prior to learning a model, so that there is (almost) no performance loss.
There are two types of processes:
* _Prototype Selection_ (PS) [@GDCH12], when the selected subset is used with a distance-based method (e.g., $k$-NN).
* _Training Set Selection_ (TSS) [@CanoHL07], in which a model is learned from the selected subset.
It is also a search problem as with _feature selection_. Garcia et al. [-@GDCH12] provide a comprehensive overview of the topic.
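As an illustrative sketch only (not one of the algorithms surveyed by Garcia et al.), a naive cluster-based prototype selection keeps the instance closest to each $k$-means centroid; the number of clusters used here is an arbitrary choice:
```{r message=FALSE, warning=FALSE}
set.seed(1)
kc1complete <- na.omit(kc1num)                               # drop rows with missing values
kc1complete <- kc1complete[, apply(kc1complete, 2, sd) > 0]  # drop constant attributes
kc1scaled <- scale(kc1complete)                              # standardise before clustering
km <- kmeans(kc1scaled, centers = 50, nstart = 5, iter.max = 20)
# For each cluster, keep as prototype the instance closest to its centroid
prototype.idx <- sapply(seq_len(nrow(km$centers)), function(i) {
  members <- which(km$cluster == i)
  centroid <- km$centers[i, ]
  dists <- apply(kc1scaled[members, , drop = FALSE], 1,
                 function(row) sum((row - centroid)^2))
  members[which.min(dists)]
})
dim(kc1complete[prototype.idx, ])
```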
## Discretization
This process transforms continuous attributes into discrete ones by associating categorical values with intervals, thus turning quantitative data into qualitative data.
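A minimal sketch using base R's `cut`, with equal-width intervals and with (approximately) equal-frequency intervals based on quantiles:
```{r}
# Equal-width discretisation of LOC_TOTAL into 4 intervals
table(cut(kc1$LOC_TOTAL, breaks = 4))
# Equal-frequency discretisation using the quartiles as break points
q <- unique(quantile(kc1$LOC_TOTAL, probs = seq(0, 1, 0.25), na.rm = TRUE))
table(cut(kc1$LOC_TOTAL, breaks = q, include.lowest = TRUE))
```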
## Correlation Coefficient and Covariance for Numeric Data
Two random variables $x$ and $y$ are called independent if the probability distribution of one variable is not affected by the presence of the other. For categorical (or discretised) attributes, independence can be tested with Pearson's chi-squared test, which compares the observed frequencies $O_k$ with the frequencies $E_k$ expected under independence:
$\chi^2=\sum_{k=1}^{n} \frac{(O_k - E_k)^2}{E_k}$
```{r message=FALSE, warning=FALSE}
chisq.test(kc1$LOC_BLANK,kc1$BRANCH_TOTAL)
chisq.test(kc1$DESIGN_COMPLEXITY,kc1$CYCLOMATIC_COMPLEXITY)
```
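For numeric attributes, Pearson's correlation coefficient and the covariance can be computed directly with `cor` and `cov` (a minimal sketch on some of the KC1 metrics):
```{r}
cor(kc1$DESIGN_COMPLEXITY, kc1$CYCLOMATIC_COMPLEXITY, use = "complete.obs")
cov(kc1$DESIGN_COMPLEXITY, kc1$CYCLOMATIC_COMPLEXITY, use = "complete.obs")
# Correlation matrix of a few size-related metrics
cor(kc1[, c("LOC_TOTAL", "LOC_BLANK", "BRANCH_COUNT")], use = "complete.obs")
```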
## Normalization
### Min-Max Normalization
$z_i=\frac{x_i-\min(x)}{\max(x)-\min(x)}$
```{r message=FALSE, warning=FALSE}
library(caret)
# Min-max scaling of the numeric attributes to [0, 1] (column 22 is the class)
rangeObj <- preProcess(kc1[, -22], method = "range")
kc1MinMax <- predict(rangeObj, kc1[, -22])
summary(kc1MinMax$LOC_TOTAL)
```
### Z-score normalization
$z_i=\frac{x_i-\bar{x}}{s_x}$
where $\bar{x}$ and $s_x$ are the mean and the standard deviation of the attribute.
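With `caret`, this corresponds to the `center` and `scale` preprocessing methods (a minimal sketch):
```{r message=FALSE, warning=FALSE}
library(caret)
# Centre each numeric attribute to mean 0 and scale it to standard deviation 1
zObj <- preProcess(kc1[, -22], method = c("center", "scale"))
kc1Z <- predict(zObj, kc1[, -22])
summary(kc1Z$LOC_TOTAL)
```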
## Transformations
### Linear Transformations and Quadratic Transformations
New attributes can be derived with a linear transformation of an existing one ($y = a x + b$) or a quadratic one ($y = a x^2 + b x + c$).
### Box-Cox transformation
The Box-Cox transformation applies a power transformation to a strictly positive attribute $x$, with $y(\lambda)=\frac{x^\lambda - 1}{\lambda}$ for $\lambda \neq 0$ and $y(\lambda)=\log(x)$ for $\lambda = 0$, where $\lambda$ is usually estimated by maximum likelihood. It is often used to make skewed attributes more symmetric.
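A minimal sketch with `caret`, which estimates $\lambda$ for each numeric attribute (the transformation only applies to strictly positive attributes; the rest should be left unchanged):
```{r message=FALSE, warning=FALSE}
library(caret)
# Estimate a Box-Cox transformation for each strictly positive numeric attribute
bcObj <- preProcess(kc1[, -22], method = "BoxCox")
kc1BC <- predict(bcObj, kc1[, -22])
```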
### Nominal to Binary transformations
Nominal attributes can be transformed into binary (dummy or one-hot) attributes, one per category, which is needed by algorithms that only accept numeric inputs.
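A minimal sketch with `caret`'s `dummyVars`, using a small hypothetical data frame (the `language` attribute is illustrative and not part of the KC1 dataset):
```{r message=FALSE, warning=FALSE}
library(caret)
# Hypothetical data frame with a nominal attribute
mods <- data.frame(language = factor(c("C", "Java", "C", "Python")),
                   LOC = c(120, 300, 80, 45))
dummies <- dummyVars(~ ., data = mods)
predict(dummies, newdata = mods)
```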
## Preprocessing in R
### The `dplyr` package
The *[dplyr](https://cran.r-project.org/web/packages/dplyr/index.html)* package, created by Hadley Wickham, provides a grammar of data manipulation; some of its functions are similar to SQL syntax. Key functions in dplyr include:
+ `select`: select columns from a dataframe
+ `filter`: select rows from a dataframe
+ `summarise`: compute summary statistics, typically on grouped data
+ `group_by`: group by a factor variable
+ `arrange`: order the dataset
+ joins: as in SQL (e.g., left join)
Tutorial:
[https://github.com/justmarkham/dplyr-tutorial](https://github.com/justmarkham/dplyr-tutorial)
Examples
```{r message=FALSE, warning=FALSE}
library(dplyr)
```
Describe the dataframe:
```{r}
str(kc1)
```
A tibble (a "local data frame", formerly created with `tbl_df`) is a wrapper around a data frame that provides better printing:
```{r}
kc1_tbl <- as_tibble(kc1)  # tbl_df() is deprecated in recent versions of dplyr
```
Filter:
```{r}
# Filter rows: use comma or & to represent AND condition
filter(kc1_tbl, Defective == "Y" & LOC_BLANK != 0)
```
Another operator is `%in%`.
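For example, to keep the rows whose value belongs to a given set:
```{r}
filter(kc1_tbl, CYCLOMATIC_COMPLEXITY %in% c(1, 2))
```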
Select:
```{r}
select(kc1_tbl, contains("LOC"), Defective)
```
The result keeps only the columns whose names contain "LOC", plus `Defective`.
Filter and Select together:
```{r}
# nesting method
filter(select(kc1_tbl, contains("LOC"), Defective), Defective == "Y")
```
It is easier using the chaining (pipe) method:
```{r}
# chaining method
kc1_tbl %>%
select(contains("LOC"), Defective) %>%
filter(Defective == "Y")
```
Arrange ascending (the default):
```{r}
# ascending order (default)
kc1_tbl %>%
select(LOC_TOTAL, Defective) %>%
arrange(LOC_TOTAL)
```
Arrange descending:
```{r}
kc1_tbl %>%
select(LOC_TOTAL, Defective) %>%
arrange(desc(LOC_TOTAL))
```
Mutate:
```{r}
kc1_tbl %>%
filter(Defective == "Y") %>%
select(NUM_OPERANDS, NUM_OPERATORS, Defective) %>%
mutate(HalsteadLength = NUM_OPERANDS + NUM_OPERATORS)
```
`summarise`: Reduce variables to values
```{r}
# Create a table grouped by Defective, and then summarise each group by taking the mean of loc
kc1_tbl %>%
group_by(Defective) %>%
summarise(avg_loc = mean(LOC_TOTAL, na.rm=TRUE))
```
```{r}
# Group by Defective and compute the mean, min and max of two metrics per group
kc1_tbl %>%
  group_by(Defective) %>%
  summarise(across(c(BRANCH_COUNT, LOC_TOTAL),
                   list(mean = mean, min = min, max = max)))
```
It seems that the _Defective_ modules tend to be larger than the _Non-Defective_ ones. We can count how many modules fall in each class with:
```{r}
# n() or tally
kc1_tbl %>%
group_by(Defective) %>%
tally()
```
The counts confirm that this is an imbalanced dataset...
```{r}
# randomly sample a fixed number of rows, without replacement
kc1_tbl %>% sample_n(2)
# randomly sample a fraction of rows, with replacement
kc1_tbl %>% sample_frac(0.05, replace=TRUE)
# Better formatting adapted to the screen width
glimpse(kc1_tbl)
```
## Other libraries and tricks
The `lubridate` package contains a number of functions that facilitate the conversion of text to POSIX dates, which is useful, for example, when working with time series. As an example, consider the following code.
See also the introduction to data cleaning with R: [https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf](https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf)
```{r message=FALSE, warning=FALSE}
library(lubridate)
dates <- c("15/02/2013", "15 Feb 13", "It happened on 15 02 '13")
dmy(dates)
```