-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathModule1_June_2017.Rmd
573 lines (427 loc) · 18.4 KB
/
Module1_June_2017.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
---
title: "MODULE I. Descriptive Statistics & Intro to Probability. Hands-on practicum. "
author: "Julia Ponomarenko, CRG Bioinformatics core facility."
date: "June 6, 2017"
output:
html_document:
toc: true
toc_depth: 4
theme: readable
---
1. Descriptive statistics
--------------------------
"Descriptive statistics aim to summarize **a sample**, rather than use the data to learn about **the population** that the sample of data is thought to represent. This generally means that descriptive statistics, unlike inferential statistics, are not developed on the basis of probability theory." (Wikipedia)
### Explore data
We will use the package "car". For details on this package, see https://cran.r-project.org/web/packages/car/car.pdf
```{r, results="hide", warning = F}
# install.packages("car") - run this code if you do not have car" package installed
library(car)
```
<br/>
Let's explore the dataset "Davis" from the car package. It is called "Self-Reports of Height and Weight Description".
The Davis data frame has 200 rows and 5 columns. The subjects were men and women engaged in regular exercise. There are some missing data. This data frame contains the following columns:
<br/>sex - A factor with levels: F, female; M, male.
<br/>weight - Measured weight in kg.
<br/>height - Measured height in cm.
<br/>repwt - Reported weight in kg.
<br/>repht - Reported height in cm.
```{r, results="hide"}
data <- Davis
dim(data) # first thing to do to see how many rows and columns
names(data) # columns names
class(data) # should be "data.frame""
sapply(data[1,], class) # to see types of variables in each column
head(data)
summary(data) # shows quantiles for each column and how many NA's !!!!
quantile(data$repwt, na.rm = TRUE) # shows quantiles for values in a specific column, ignoring 'NA'
unique(data$sex) # shows unique values for a specified column
unique(data$weight)
length(unique(data$weight))
table(data$sex) # table() builds a contingency table of the counts at each combination of factor levels
table(data$sex, data$weight) # shows relationships between variables
table(data$sex, data$weight < 80) # gives the 2x2 contingency table
intervals <- cut(data$weight, quantile(data$weight)) # cut() assigns to each value a quantile - shows intervals in the order of data$weight
table(intervals, data$sex) # contigency table of sex by intervals (weight quantiles)
# contigency table of quantiles of weight versus repwt
x <- data$weight
y <- data$repwt
table(cut(x, quantile(x)), cut(y, quantile(y, na.rm = TRUE)))
```
<br/>
####Exercise with table(), cut(), and quantile()
**Who is more inclined to report lower weight, males or females?**
```{r }
#males only
d <- data[data$sex == "M",]
x <- d$weight
y <- d$repwt
table(cut(x, quantile(x)), cut(y, quantile(y, na.rm = TRUE)))
#males only
d <- data[data$sex == "F",]
x <- d$weight
y <- d$repwt
table(cut(x, quantile(x)), cut(y, quantile(y, na.rm = TRUE)))
```
**Answer**: This is a rough observation from looking at the contingency table, where rows represent quantiles for measured weight and columns, reported weight, it can be seen that females have a tendendy to report lower weight. Thus, 9 females versus 4 males reported their weight being in the 1st quantile whicle their measured weight was in the 2nd quantile. And in total, there were 20 females and 7 males whose weights were below diagonal.
<br/>
#### Looking at subset of data
```{r, results="hide"}
d <- data[data$weight < 60,] # rows with weight below 60 kg
summary(d)
summary(d$sex)
dim(d)
nrow(d)
data[data$weight < 60 & data$repwt >= 60,] # the results looks strange !!!
data[which(data$weight < 60 & data$repwt >= 60),] # use which() to exclude NA values
```
<br/>
#### Missing (NA) values
```{r}
sum(is.na(data$repwt)) # the number of missing values for repwt
table(is.na(data$repht)) # how many complete data (FALSE) and how many missing (TRUE) for data$repht
#by default table() doesn't show missing value; e.g.
table(data$repht)
table(data$repht, useNA = "ifany") #now it shows
#How many rows contain missing values (i.e., at least one 'NA')?
sum(!complete.cases(data))
data[!complete.cases(data), ] # here they are
```
<br/>
#### Exercise on missing values
**What is the proportion of females and males who didn't report the weight?**
```{r}
f <- data[which(data$sex == "F"), ]
n <- nrow(f) #total number of females in the original data frame
y <- sum(is.na(f$repwt))/n
round(y, digits = 3)
m <- data[which(data$sex == "M"), ]
n <- nrow(m) #total number of males in the original data frame
y <- sum(is.na(m$repwt))/n
round(y, digits = 3)
```
<br/>
<br/>
### Summary statistics
Basic R functions: mean, sd, var, min, max, median, range, IQR, and quantile. All these function require special treatment for missing values, using parameter na.rm = TRUE. While summary(data) does not.
Let's look at the row with the maximum weight.
```{r}
max_weight <- max(data$weight)
data[which(data$weight == max_weight), ] # full info (row) on a person with the max_weight
```
<br/>
#### Outliers
Can be defined (by the Tuley's test) as values in the sample that differ from the Q1 (25% quantile) or Q3 (75% quantile) by more than 1.5*IQR (where IQR = Q3 - Q1). <br/>However, there is no formal definition of outliers; they need to be treated subjectively.
```{r}
quantile(data$weight)
q <- unname(quantile(data$weight)) # unname() removes names and gives the vector of quantile values that we use below
iqr <- q[4] - q[2] # or use IQR(data$weight) instead
upper_limit <- q[4] + 1.5 * iqr # values above this limit are outliers
upper_limit
max_weight > upper_limit # Is the maximum weight an outlier? TRUE or FALSE?
```
<br/>
How many "formal" outliers by weight in the dataset?
```{r}
data[which(data$weight > upper_limit), ]
```
<br/>
Let's remove these outliers and check which of the summary statistics change.
```{r}
d <- data[which(!data$weight > upper_limit), ] # d is our data frame without outliers; we will use it further in this practicum
# Did mean change?
mean(d$weight)
mean(data$weight)
# Did median change?
median(data$weight)
median(d$weight)
# Did SD change?
sd(data$weight)
sd(d$weight)
# Did quantiles change?
quantile(data$weight)
quantile(d$weight)
```
<br/>
### Plots
#### Box-plot
```{r}
boxplot(data$weight) # we can clearly see three outliers as individual dots
data[which(data$weight > upper_limit), ] # here they are
#boxplots of weight values with and w/o outliers side by side
par(mfrow = c(1,2)) # par() allows to plot graphs in a table; e.g., two plots in one row
boxplot(data$weight)
boxplot(d$weight) # now we got another outliers because IQR changed
par(mfrow = c(1,1)) # restore the default plot layout
# Let's explore these "new" outliers
quantile(d$weight)
q <- unname(quantile(d$weight)) # the vector of quantile values
upper_limit <- q[4] + 1.5 * IQR(d$weight) # values above this limit are
d[which(d$weight > upper_limit), ]
#It is up to a researcher to remove those outliers or not. Let's not remove them.
```
<br/>
**How to plot box-plots side-by-side on one graph**
```{r}
boxplot(data$weight ~ data$sex)
boxplot(data$weight ~ data$sex, outline = FALSE) # boxplot has a parameter outline to not show outliers! (But it doesn't change your data)
```
<br/>
**Bonus: How to plot together data from two or more vectors of different lengths?**
```{r, results="hide", warning = FALSE}
#install.packages("reshape2")
library(reshape2) # for melt() to use below for transforming a data frame from the wide to the long format
```
<br/>
```{r}
male <- d[which(d$sex == "M"),] # we use data with outliers removed
female <- d[which(d$sex == "F"),]
m <- male$weight
f <- female$weight
length(m)
length(f)
# to make a data frame from these vectors, we first need to make them of the same length; that is, fill missing values in a smaller vector with 'NA'
len = max(length(m), length(f))
Female = c(f, rep(NA, len - length(f)))
Male = c(m, rep(NA, len - length(m)))
df <- structure(data.frame(Male, Female))
meltdf <- melt(df) # uses reshape2 package; transforms a data frame from the wide to the long format
# "No id variables; using all as measure variables" is a message that can be ignored
boxplot(data = meltdf, value ~ variable)
```
<br/>
**Bonus: Now to make the above plot more informative and look "publication-ready"?**
```{r, results="hide"}
# Define the plot parameters
y_limits <- c(30, 130)
colors <- c("blue", "red")
ylab <- "Weight, kg"
title <- "Distribution of weight by sex"
boxplot(data = meltdf, value ~ variable, outline = FALSE,
#pars = list(boxwex = .4),
ylab = ylab,
cex.lab = 1.5, #to change (multiply) the font size of the axes legends
cex.axis = 1.2, #to change (multiply) the font size of the axes
ylim = y_limits,
border = colors, #color the boxplot borders
boxwex = 0.6,
staplewex = 0.4,
frame.plot = FALSE, #this removes upper and right borders on the plot area
outwex = 0.5,
cex.main = 1.5, #to change (multiply) the size of the title
main = title # the title of the graph
)
stripchart(data = meltdf, value ~ variable,
col = colors,
method = "jitter", jitter = .2,
pch = c(16, 15), cex = c(1.0, 1.0), #different points and of different size can be used
vertical = TRUE, add = TRUE)
# let's show the (yet unknown) significance level on the plot
text <- "p-value < ???"
y <- 120 # y position of the horizontal line
offset <- 5 # length of vertical segments
x <- 1
segments(x, y, x + 1, y)
segments(x, y - offset, x, y)
segments(x + 1, y - offset, x + 1, y)
text(x + 0.5, y + offset, paste(text), cex = 1) #cex defines the font size of the text
# you can also plot the means as sick (lwd = 3) horizontal black lines
y <- mean(Male, na.rm = T)
segments(x - 0.35, y, x + 0.35, y, lwd = 3)
y <- mean(Female, na.rm = T)
segments(x + 1 - 0.35, y, x + 1 + 0.35, y, lwd = 3)
```
<br/>
#### Histogram
```{r, results="hide"}
hist(data$weight)
hist(d$weight)
```
<br/>
You can control for the size of bins!
```{r}
bin_size <- 5
start <- 30
end <- 110
bins <- seq(start, end, by = bin_size)
hist(d$weight, breaks = bins, col = "blue")
```
<br/>
**Bonus: Two overlaying histograms on one graph**
```{r, results="hide"}
xlim = c(start, end)
title <- "Distribution of weights for males and females"
colors <- c("red", rgb(0, 0, 1, 0.7)) # the last number in rgb() is for the transparency of the second color
hist(Female, col = colors[1], xlim = xlim, xlab = "Weight, kg", main = title)
hist(Male, add = TRUE, col = colors[2], xlim = xlim)
legend("topright", legend = c("Females", "Males"), fill = colors, bty = "n", border = NA)
```
<br/>
####Exercise with hist()
**Draw the same graph as above with bins of size 2 and using colors <- c(rgb(1, 0, 1, 0.7), rgb(0, 0, 1, 0.5))**
```{r, echo = F, results = "hide"}
bin_size <- 2
start <- min(min(m), min(f)) - 3*bin_size
end <- max(max(m), max(f)) + 3*bin_size
bins <- seq(start, end, by = bin_size)
xlim = c(start, end)
title <- "Distribution of weights for males and females"
colors <- c(rgb(1, 0, 1, 0.7), rgb(0, 0, 1, 0.5)) # the last number in rgb() is for transparency
hist(f, breaks = bins, col = colors[1], xlim = xlim, xlab = "Weight, kg", main = title)
hist(m, breaks = bins, add = TRUE, col = colors[2], xlim = xlim)
legend("topright", legend = c("Females", "Males"), fill = colors, bty = "n", border = NA)
```
<br/>
#### Scatterplot
```{r, results="hide"}
plot(d$weight, d$repwt, pch = 20) # Just a simple plot of reported weights against measured weights
plot(d$weight, d$repwt, pch = 20, col = d$sex) # it can be colored by sex, for example
legend("topleft",legend = c("Females", "Males"), col = 1:2, pch = 20)
```
<br/>
Colors are taken in the order from the currently setup **palette()** .
```{r}
palette()
```
<br/>
It can be changed to control for colors using palette() and then reset back to default, using palette("default").
```{r, results="hide"}
palette(c("orange", "darkgreen")) # changing palette()
plot(d$weight, d$repwt, col = d$sex, pch = 20, xlab = "Measured weight, kg", ylab = "Reported weight, kg", main = "Reported versus measured weights by sex")
legend("bottomright",legend = c("Females", "Males"), col = 1:2, pch = 20, bty = "n")
palette("default") # reset back to the default
```
<br/>
#### Combining plots
Refer to http://www.statmethods.net/advgraphs/layout.html for using layout() function.
<br/>
#### Empirical cumulative distribution functions (eCDFs)
**eCDF(x) = the proportion of observations which values are <= x**
```{r, results="hide"}
weight.ecdf = ecdf(d$weight) # obtain empirical CDF values
plot(weight.ecdf, xlab = "Quantiles of weight, kg", main = "Empirical cumulative distribution of weights")
```
<br/>
Let's draw vertical lines depicting median, Q1 and Q2, using abline(v = ...).
```{r}
plot(weight.ecdf, xlab = "Quantiles of weight, kg", main = "Empirical cumulative distribution of weights")
summary(d$weight)
x <- unname(summary(d$weight)) # a vector of 6 elements {min, Q1, median, mean, Q3, max}
abline(v = median(d$weight), lwd = 2) # vertical line for median
abline(v = x[2], col = "blue", lwd = 2) # vertical line for Q1
abline(v = x[5], col = "red", lwd = 2) # vertical line for Q3
```
<br/>
Let's draw horizontal lines for quantiles 0.5, 0.25, 0.75, using abline(h = ...).
```{r, results="hide"}
plot(weight.ecdf, xlab = "Quantiles of weight, kg", main = "Empirical cumulative distribution of weights")
x <- unname(summary(d$weight)) # a vector of 6 elements {min, Q1, median, mean, Q3, max}
abline(v = median(d$weight), lwd = 2) # vertical line for median
abline(v = x[2], col = "blue", lwd = 2) # vertical line for Q1
abline(v = x[5], col = "red", lwd = 2) # vertical line for Q3
abline(h = 0.5) # median
abline(h = 0.25, col = "blue")
abline(h = 0.75, col = "red")
```
<br/>
It can be seen that the CDF graph allows to identify the weight in kg for any quantile. For example, the lowest 10% of observations (CDF < 0.1) have weight below ~50 kg. Likewise, the top 10% of heavest people (CDF > 0.9) have weight above ~82 kg.
<br/>
####Exercise with eCDF
**Plot eCDFs for male and female weights in different colors on the same graph**
```{r, include = F}
plot(ecdf(Male), col = "blue", xlim = c(35, 110), xlab = "Quantiles of weight, kg", main = "Empirical cumulative distribution of weights")
plot(ecdf(Female), xlim = c(35, 110), col = "red", add = T)
legend(x = 90, y = 0.5, legend = c("Females", "Males"), col = c("red", "blue"), pch = 20, bty = "n")
```
<br/>
2. Distributions
--------------------------
### Normal distribution
**We will learn four important functions dnorm(), rnorm(), pnorm(), and qnorm().**
See in addition https://www.r-bloggers.com/normal-distribution-functions/
<br/>
####Probability Density function (PDF)
Let's look at three normal distributions with means and standard deviations of (0,1), and (8,0.5).
<br/>We will use the following function:
<br/>**dnorm() - Probability Density Function (PDF)**
```{r, results="hide"}
x <- seq(-5, 15, length.out = 200)
y1 <- dnorm(x, mean = 0, sd = 1) # dnorm() !!!
y2 <- dnorm(x, mean = 8, sd = 0.5)
plot(x = x, y = y1, type = "l", lwd = 3, col = "black", ylim = c(0, 1), xlab = "X", ylab = "Density", yaxs = "i") # the last parameter forces the x-axes to meet the y-axes at 0
lines(x = x, y = y2, lwd = 3, col = "blue")
```
<br/>
####Z-transformation of a random sample from a normal distribution
<br/>We will use the following function:
<br/>**rnorm() - generates random sample from a normal distribution**
```{r, results="hide"}
mean <- 8
sd <- 0.5
x.sample <- rnorm(n = 5000, mean = mean, sd = sd) # rnorm() generates a random sample of size 500 from a normal distribution N(mean,sd)
z.sample <- (x.sample - mean) / sd # Z-transformation of x.sample
hist(x.sample, freq = F, ylim = c(0, 1), xlim = c(-5, 15), col = rgb(0,0,1,0.3), yaxs = "i")
hist(z.sample, freq = F, add = T)
```
<br/>
Now if we overlay these histograms with the density functions plotted for normal distributions obtained before, we will see that sample distributions can be modelled by normal distributions: N(mean=8, sd=0.5) for x.sample and N(mean=0, sd=1) for a transformed z.sample. The black distribution is for the **standard normal distribution**.
```{r, results="hide"}
hist(x.sample, freq = F, ylim = c(0, 1), xlim = c(-5, 15), col = rgb(0,0,1,0.3), yaxs = "i")
hist(z.sample, freq = F, add = T)
lines(x = x, y = y1, type = "l", lwd = 3)
lines(x = x, y = y2, lwd = 3, col = "blue")
```
<br/>
####Cumulative Distribution Function (CDF)
**pnorm(z, mu, sd) - gives an area under the standard normal curve to the left of z**
```{r}
z <- seq(-4, 4, 0.01)
cdf <- pnorm(z, 0, 1)
plot(z, cdf, col = "blue", xlab = "z", ylab = "Cumulative Probability", type = "l", lwd = 3, cex = 2, main = "CDF of Standard Normal Distribution", cex.axis = .8, yaxs = "i")
```
<br/>
####The 68-95-99.7 rule
For the Normal distribution with mean **mu** and standard deviation **sd**,
**~68% of the observations fall within the range [mu - sd; mu + sd].**
```{r}
z <- seq(-4, 4, 0.01)
dens <- dnorm(z)
plot(x = z, y = dens, type = "l", ylim = c(0,0.45), lwd = 3, xlab = "z", ylab = "Standard Normal density", yaxs = "i")
sd <- 1 # !!!!
xs <- c(seq(-sd, sd, length.out = 200))
ys <- dnorm(xs)
polygon(c(-sd, xs, sd), c(0, ys, 0), col = "blue")
# What is the area of the blue area?
pnorm(sd) - pnorm(-sd)
```
<br/>
**~95.5% of the observations fall within the range [mu - 2sd; mu + 2sd].**
```{r}
plot(x = z, y = dens, type = "l", ylim = c(0,0.45), lwd = 3, xlab = "z", ylab = "Standard Normal density", yaxs = "i")
sd <- 2
xs <- c(seq(-sd, sd, length.out = 200))
ys <- dnorm(xs)
polygon(c(-sd, xs, sd), c(0, ys, 0), col = "green")
# What is the area of the green area?
pnorm(sd) - pnorm(-sd)
```
<br/>
**~99.7% of the observations fall within the range [mu - 3sd; mu + 3sd].**
```{r}
plot(x = z, y = dens, type = "l", ylim = c(0,0.45), lwd = 3, xlab = "z", ylab = "Standard Normal density", yaxs = "i")
sd <- 3
xs <- c(seq(-sd, sd, length.out = 200))
ys <- dnorm(xs)
polygon(c(-sd, xs, sd), c(0, ys, 0), col = "red")
# What is the area of the red area?
pnorm(sd) - pnorm(-sd)
```
<br/>
####Quantile function, or How to obtain the critical values of -z and z for a specified area under the standard normal curve.
**qnorm(p, mu, sd) - gives the value of z at which CDF (area on the left of z) is equal p**
```{r}
#Find the critical values -z and z for the area under the standard normal curve of 95.5%
p <- 0.955
qnorm(p) #gives z such that P(X < z) = p
qnorm(p + (1 - p) / 2) #gives a value of z such that P(-z < X < z) = p
```
It can be seen that this z = 2, or two standard deviations (sd = 1).