# (PART) Exploratory and Descriptive Data Analysis {-}
# Exploratory Data Analysis
## Descriptive statistics
The first task with any dataset is to characterize it with summary statistics and graphics.
Displaying information graphically helps us identify the main characteristics of the data. To describe a distribution we usually want to know where it is centered (mean, median) and how spread out it is (standard deviation, quantiles).
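As a minimal sketch of these summaries, the chunk below computes them for a simulated right-skewed sample. The vector `demo_x` and its exponential distribution are assumptions made only for illustration; they are not part of the datasets used in this chapter.

```{r}
set.seed(1)
# Hypothetical data: 200 draws from an exponential distribution (right-skewed)
demo_x <- rexp(200, rate = 1/10)
# Center: mean and median
c(mean = mean(demo_x), median = median(demo_x))
# Spread: quartiles, standard deviation and interquartile range
quantile(demo_x, probs = c(0.25, 0.5, 0.75))
c(sd = sd(demo_x), IQR = IQR(demo_x))
```

For skewed data such as these, the mean and the median typically differ, which is one reason to report both.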
## Basic Plots
* _Histogram_ defines a sequence of breaks and then counts the number of observations in the bins formed by the breaks.
* _Boxplot_ is used to summarize data succinctly, quickly showing whether the data are symmetric or have suspected outliers.

* _Q-Q plot_ is used to determine whether the data are close to being normally distributed. The sample quantiles are plotted against the quantiles of the standard normal distribution, and a straight reference line shows where the points would fall under normality. Normality is evaluated by how closely the points follow the line. Quantiles are cut points dividing the range of a probability distribution into contiguous intervals with equal probabilities, or dividing the observations in a sample in the same way. When the data are normally distributed, the mean and the median should be approximately equal.

* _Scatterplot_ provides a graphical view of the relationship between two sets of numbers: one numerical variable against another.
* _Kernel Density_ plot visualizes the underlying distribution of a variable. Kernel density estimation is a non-parametric method of estimating the probability density function of a continuous random variable. It helps to identify the shape of the distribution of the variable.
* _Violin plot_ is a combination of a boxplot and a kernel density plot.
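To make the Q-Q plot less of a black box, the following sketch builds one by hand from sorted observations and standard normal quantiles. The sample `demo_qq` is simulated and purely illustrative, and the object names are hypothetical.

```{r}
set.seed(42)
# Hypothetical sample from a normal distribution with mean 10 and sd 2
demo_qq <- rnorm(200, mean = 10, sd = 2)
# Theoretical quantiles of the standard normal at the plotting positions
theo_q <- qnorm(ppoints(length(demo_qq)))
# Sample quantiles are simply the sorted observations
samp_q <- sort(demo_qq)
plot(theo_q, samp_q, xlab = "Theoretical quantiles",
     ylab = "Sample quantiles", main = "Hand-made Q-Q plot")
# Reference line through the first and third quartiles
qqline(demo_qq, col = 2)
```

The result should match what `qqnorm(demo_qq)` draws, since `qqnorm` uses the same plotting positions.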
## Normality
* A normal distribution is a distribution in which most values cluster symmetrically around the mean and become rarer toward the tails.
* A graphical representation of a normal distribution is sometimes called a *bell curve* because of its shape.
* Many procedures in statistics are based on this property. Many *parametric* procedures assume normality.
* In a normal distribution about 95% of the probability lies within 2 standard deviations of the mean.
* Two examples: one population with mean 60 and standard deviation 1, and another with mean 60 and $sd=4$ (in the plots, the means are shifted to 0).
```{r SDPlotExample, fig.cap="Plot example of the area within 2 SD of the mean for sd = 1 and sd = 4", tidy=TRUE}
# Area within 2 SD of the mean when sd = 1: vertical lines at -2 and 2
par(mfrow=c(1,2))
plot(function(x) dnorm(x, mean = 0, sd = 1),
     xlim = c(-3, 3), main = "SD 1", xlab = "x",
     ylab = "", cex = 2)
segments(-2, 0, -2, 0.4)
segments(2, 0, 2, 0.4)
# Area within 2 SD of the mean when sd = 4: vertical lines at -8 and 8
plot(function(x) dnorm(x, mean = 0, sd = 4),
     xlim = c(-12, 12), main = "SD 4", xlab = "x",
     ylab = "", cex = 2)
segments(-8, 0, -8, 0.1)
segments(8, 0, 8, 0.1)
```
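The "about 95% within 2 standard deviations" figure can be checked numerically with `pnorm`. The sketch below is an illustrative addition and the object names are hypothetical.

```{r}
# Probability within 2 SD of the mean for sd = 1 (limits -2 and 2)
p_within_sd1 <- pnorm(2, mean = 0, sd = 1) - pnorm(-2, mean = 0, sd = 1)
# Probability within 2 SD of the mean for sd = 4 (limits -8 and 8)
p_within_sd4 <- pnorm(8, mean = 0, sd = 4) - pnorm(-8, mean = 0, sd = 4)
c(sd1 = p_within_sd1, sd4 = p_within_sd4)
# Number of standard deviations that encloses exactly 95%
qnorm(0.975)
```

Both probabilities are the same (roughly 0.954) because the proportion depends only on the number of standard deviations, not on the value of the standard deviation itself.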
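- If we repeatedly draw samples of size 40 from the population with mean 60 and $sd=4$ and compute the mean of each sample, the sample means form "another population":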
```{r, tidy=TRUE}
#tidy uses the package formatR to format the code
sample.means <- rep(NA, 1000)
for (i in 1:1000) {
sample.40 <- rnorm(40, mean = 60, sd = 4)
#rnorm generates random numbers from normal distribution
sample.means[i] <- mean(sample.40)
}
means40 <- mean(sample.means)
sd40 <- sd(sample.means)
means40
sd40
```
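- These sample means are another "population". The sampling distribution of the sample mean is normally distributed (exactly so when the population is normal, approximately so for large samples by the central limit theorem) and centered on the population mean, which is why the "mean of a representative sample provides an estimate of the unknown population mean". Its standard deviation, the standard error, is $\sigma/\sqrt{n}$. This is shown in Figure \@ref(fig:sampleMeansExample).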
```{r sampleMeansExample, fig.cap="Sample means histogram"}
hist(sample.means)
```
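As a rough check, the standard deviation of the simulated sample means can be compared with the theoretical standard error $\sigma/\sqrt{n}$. The sketch below repeats the experiment with `replicate()`; the seed and the object names are assumptions chosen for illustration.

```{r}
set.seed(2024)
# 1000 means of samples of size 40 from a normal population (mean 60, sd 4)
demo_means <- replicate(1000, mean(rnorm(40, mean = 60, sd = 4)))
# Theoretical standard error sigma / sqrt(n)
theo_se <- 4 / sqrt(40)
c(empirical_sd = sd(demo_means), theoretical_se = theo_se)
```

The two values should be close, and the agreement improves as the number of simulated samples grows.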
## Using a Running Example to Visualise the Different Plots
As a running example, we do the following:
1. Set the path to the file
2. Read the _Telecom1_ dataset and print out the summary statistics with the command `summary`
```{r telecomExample, tidy=TRUE}
options(digits=3)
telecom1 <- read.table("./datasets/effortEstimation/Telecom1.csv", sep=",",header=TRUE, stringsAsFactors=FALSE, dec = ".") #read data
summary(telecom1)
```
* We see that this dataset has three variables (or parameters) and few data points (18)
+ *size*: the independent variable
+ *effort*: the dependent variable
+ *EstTotal*: the estimates coming from an estimation method
* Basic Plots
```{r message=FALSE, warning=FALSE, tidy=TRUE}
par(mfrow=c(1,2)) #n figures per row
size_telecom1 <- telecom1$size
effort_telecom1 <- telecom1$effort
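hist(size_telecom1, freq = FALSE, col="blue", xlab='size', ylab = 'Density', main = 'Histogram of project Size')
# freq = FALSE puts the histogram on the density scale so the kernel density curve overlays correctly
lines(density(size_telecom1, na.rm = TRUE, from = 0, to = max(size_telecom1)))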
plot(density(size_telecom1))
hist(effort_telecom1, col="blue")
plot(density(effort_telecom1))
boxplot(size_telecom1)
boxplot(effort_telecom1)
# violin plots for those two variables
library(vioplot)
vioplot(size_telecom1, names = '')
title("Violin Plot of Project Size")
vioplot(effort_telecom1, names = '')
title("Violin Plot of Project Effort")
par(mfrow=c(1,1))
qqnorm(size_telecom1, main="Q-Q Plot of 'size'")
qqline(size_telecom1, col=2, lwd=2, lty=2) #draws a line through the first and third quartiles
qqnorm(effort_telecom1, main="Q-Q Plot of 'effort'")
qqline(effort_telecom1)
```
* The histograms, boxplots and Q-Q plots suggest that the data are not normally distributed.
* We may look at the possible relationship between size and effort with a scatterplot.
```{r scatterplotExample, fig.cap="Scatterplot. Relationship between size and effort", tidy=TRUE}
plot(size_telecom1, effort_telecom1)
```
### Example with the China dataset
```{r message=FALSE, warning=FALSE}
library(foreign)
china <- read.arff("./datasets/effortEstimation/china.arff")
china_size <- china$AFP
summary(china_size)
china_effort <- china$Effort
summary(china_effort)
par(mfrow=c(1,2))
hist(china_size, col="blue", xlab="Adjusted Function Points", main="Distribution of AFP")
hist(china_effort, col="blue",xlab="Effort", main="Distribution of Effort")
boxplot(china_size)
boxplot(china_effort)
qqnorm(china_size)
qqline(china_size)
qqnorm(china_effort)
qqline(china_effort)
```
* As with _Telecom1_, we observe the non-normality of the data.
#### Normality. Galton data
These data come from Francis Galton's famous 1885 study of the relationship between the heights of adult children and the heights of their parents.
```{r galtonData, echo=FALSE, message=FALSE, warning=FALSE}
# library(UsingR); data(galton)
galton <- read.csv("./datasets/other/galton.csv")
par(mfrow=c(1,2))
hist(galton$child,col="blue") #,breaks=100)
hist(galton$parent,col="blue") #,breaks=100)
```
#### Normalization
Take logarithms of both variables (the independent variable, size, and the dependent variable, effort). For example, with the _China_ dataset:
```{r logExample, echo=FALSE, tidy=TRUE}
par(mfrow=c(1,2))
logchina_size <- log(china_size)
hist(logchina_size, col="blue", xlab="log Adjusted Function Points", main="Distribution of log AFP")
logchina_effort <- log(china_effort)
hist(logchina_effort, col="blue", xlab="log Effort", main="Distribution of log Effort")
qqnorm(logchina_size)
qqline(logchina_size) # reference line to judge normality after the transformation
qqnorm(logchina_effort)
qqline(logchina_effort)
```
* If the log transformation is applied to both variables, the fitted model is $\log(y) = b_0 + b_1 \log(x)$, so the estimation equation on the original scale is:
$$y = e^{b_0 + b_1 \log(x)}$$
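As a sketch of how $b_0$ and $b_1$ could be estimated, the chunk below fits a linear model on the log-log scale and back-transforms a prediction. The data are simulated, and the names `sim_size` and `sim_effort` and the true coefficients (1 and 0.8) are assumptions for illustration, not values from the _China_ dataset.

```{r}
set.seed(7)
# Hypothetical sizes, spread evenly on the log scale between 50 and 1000
sim_size <- exp(runif(100, log(50), log(1000)))
# Hypothetical efforts following y = exp(1 + 0.8 * log(x)) with multiplicative noise
sim_effort <- exp(1 + 0.8 * log(sim_size) + rnorm(100, sd = 0.3))
# Ordinary least squares on the log-log scale
log_fit <- lm(log(sim_effort) ~ log(sim_size))
log_coef <- coef(log_fit)
log_coef
# Estimation equation on the original scale, evaluated for a project of size 200
est_200 <- exp(log_coef[1] + log_coef[2] * log(200))
est_200
```

The same value can be obtained with `exp(predict(log_fit, newdata = data.frame(sim_size = 200)))`.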
## Correlation
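_Correlation_ is a statistical relationship between two sets of data. Using the whole dataset, we can check for linear correlation between the variables we are interested in. Pearson's coefficient (the default of `cor`) measures linear association, while Spearman's and Kendall's coefficients are rank-based and capture monotonic association.
As an example, with the _China_ dataset: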
```{r correlationChinaDataset}
par(mfrow=c(1,1))
plot(china_size,china_effort)
cor(china_size,china_effort)
cor.test(china_size,china_effort)
cor(china_size,china_effort, method="spearman")
cor(china_size,china_effort, method="kendall")
```
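The difference between the three coefficients is easiest to see on a relationship that is perfectly monotone but not linear. The following sketch uses made-up vectors (`demo_u`, `demo_v`), not the _China_ data.

```{r}
# Hypothetical, perfectly monotone but non-linear relationship
demo_u <- 1:50
demo_v <- exp(demo_u / 10)
demo_pearson <- cor(demo_u, demo_v)                        # linear association
demo_spearman <- cor(demo_u, demo_v, method = "spearman")  # rank-based
demo_kendall <- cor(demo_u, demo_v, method = "kendall")    # rank-based
c(pearson = demo_pearson, spearman = demo_spearman, kendall = demo_kendall)
```

The rank-based coefficients equal 1 because the ordering of the two variables is identical, while Pearson's coefficient is below 1 because the points do not lie on a straight line.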
## Confidence Intervals. Bootstrap
* Until now we have generated point estimates
* A _confidence interval_ (CI) is an interval estimate of a population parameter. The parameter can be the mean, the median or another quantity. The frequentist CI is an observed interval that differs from sample to sample. If the experiment is repeated many times, the intervals so constructed include the value of the unobservable parameter of interest in a fixed proportion of the repetitions. That proportion is the *confidence level*.
* Constructing a confidence interval with an exact confidence level requires assumptions about the distribution of the data. Usually, *normality* is one of the properties required for computing confidence intervals.
+ Not all confidence intervals contain the true value of the parameter.
+ Simulation of confidence intervals
An example from Ugarte et al. [@ugarte2015probability]
```{r confiedenceIntervals, echo=FALSE, tidy=TRUE}
# code from the book by Ugarte et al.
norsim <- function(sims = 100, n = 36, mu = 100, sigma = 18, conf.level = 0.95){
alpha <- 1 - conf.level
CL <- conf.level * 100
ll <- numeric(sims)
ul <- numeric(sims)
for (i in 1:sims){
xbar <- mean(rnorm(n , mu, sigma))
ll[i] <- xbar - qnorm(1 - alpha/2)*sigma/sqrt(n)
ul[i] <- xbar + qnorm(1 - alpha/2)*sigma/sqrt(n)
}
notin <- sum((ll > mu) + (ul < mu))
percentage <- round((notin/sims) * 100, 2)
SCL <- 100 - percentage
plot(ll, type = "n", ylim = c(min(ll), max(ul)), xlab = " ",
ylab = " ")
for (i in 1:sims) {
low <- ll[i]
high <- ul[i]
if (low < mu & high > mu) {
segments(i, low, i, high)
}
else if (low > mu & high > mu) {
segments(i, low, i, high, col = "red", lwd = 5)
}
else {
segments(i, low, i, high, col = "blue", lwd = 5)
}
}
abline(h = mu)
# cat(SCL, "\b% of the random confidence intervals contain Mu =", mu, "\b.", "\n")
}
```
```{r tidy=TRUE}
set.seed(10)
norsim(sims = 100, n = 36, mu = 100, sigma = 18, conf.level = 0.95)
```
* The range defined by the confidence interval varies from sample to sample because the sample mean varies. When the standard deviation is estimated from the sample (instead of being known, as in this simulation), the width of the interval varies too.
* 95% confidence level: it is the proportion of the hypothetical confidence intervals (computed from hypothetical repeated samples) that would contain the population mean.
* The particular interval that we compute from one sample does not mean that the population mean lies within that interval with a probability of 95%.
* Recommended reading: [@Hoekstra2014] *Robust misinterpretation of confidence intervals*
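The repeated-sampling interpretation can be checked numerically by counting how many simulated intervals contain the true mean. This is a minimal sketch using the same settings as the `norsim` call ($\mu = 100$, $\sigma = 18$, $n = 36$); the object names and the number of repetitions are assumptions.

```{r}
set.seed(99)
cover_mu <- 100
cover_sigma <- 18
cover_n <- 36
# Half-width of a 95% interval when sigma is known
half_width <- qnorm(0.975) * cover_sigma / sqrt(cover_n)
# One sample mean per hypothetical repetition of the experiment
cover_xbar <- replicate(10000, mean(rnorm(cover_n, cover_mu, cover_sigma)))
# Proportion of intervals [xbar - half_width, xbar + half_width] that contain mu
coverage <- mean(cover_xbar - half_width < cover_mu & cover_mu < cover_xbar + half_width)
coverage
```

The proportion should be close to 0.95, which is a statement about the procedure over many samples and not about any single interval.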
## Nonparametric Bootstrap
* To compute CIs, the important thing is to know the assumptions made in order to “know” the distribution of the statistic.
* There is a way to compute confidence intervals without meeting the requirements of parametric methods.
* **Resampling** or **bootstrapping** is a method to calculate estimates of a parameter by repeatedly drawing samples, with replacement, from the original data and using those *resamples* to calculate statistics. The spread of the statistic across the resamples approximates its sampling variability, which the original single sample alone cannot show.

- An example of bootstrap CI can be found in Chapter \@ref(evaluationSE), "Evaluation in Software Engineering"
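As a minimal sketch of the idea, the chunk below computes a percentile bootstrap interval for the median in base R. The data are simulated, the object names are hypothetical, and the percentile method is only one of several bootstrap intervals; it is not necessarily the one used in the referenced chapter.

```{r}
set.seed(123)
# Hypothetical right-skewed sample (log-normal), for illustration only
boot_x <- rlnorm(50, meanlog = 3, sdlog = 0.8)
n_boot <- 2000
# Resample with replacement and recompute the median each time
boot_medians <- replicate(n_boot, median(sample(boot_x, replace = TRUE)))
# Percentile bootstrap 95% confidence interval for the median
boot_ci <- quantile(boot_medians, probs = c(0.025, 0.975))
boot_ci
```

No normality assumption is needed here: the interval is read directly from the empirical distribution of the resampled medians.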