-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathweek9-data-wrangling.Rmd
568 lines (447 loc) · 13.7 KB
/
week9-data-wrangling.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
---
title: "Week 9: Intro to piping and tidyverse data wrangling"
output:
html_document:
toc: true
include:
after_body: footer.html
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
This week I will give a brief introduction to piping and data wrangling with tidyr and tidyverse verbs. If you want to learn more, there are many resources on the web. I like [this course book](https://dcl-wrangle.stanford.edu/) as a beginning, [this course](https://cengel.github.io/R-data-wrangling/) goes into more depth, and [R for Data Science](https://r4ds.had.co.nz/) has much more.
Today's goal is to teach you 6 data "verbs" that are easy to use and that will cover many of your data-wrangling needs.
```{r message=FALSE, warning=FALSE}
library(dplyr)
library(tidyr)
library(ggplot2)
```
# Piping
You already use the concept of "piping", but you do it with parentheses. Your pipe in this case goes from inside to outside.
```
fun4(fun3(fun2(fun1(x))))
```
`x` is our initial object. We apply `fun1()` to that. Then that result goes into `fun2()`. The result of that goes into `fun3()`. etc. "Pipe" is referring to the flow of information, the "goes into" part.
This has many problems.
* You will lose track of your parentheses and waste much time sorting out where you lost one of your parentheses.
* You are limited in how many levels you can add without your code being unreadable as it stretches quickly to multiple lines.
* We don't read inside to outside though from our math education, we are used to that (sort of).
Piping is just a different way to write the same thing but left to right like we read (in English).
```
x %>% fun1() %>% fun2() %>% fun3() %>% fun4()
```
Example:
```{r}
1:10 %>% sqrt() %>% mean() %>% round(digits=1)
```
That's the same as
```{r}
round(mean(sqrt(1:10)),digits=1)
```
## Upsides
* It is used a lot in the tidyverse so you should be aware of how it works.
* It works well in the tidyverse.
* Can be more readable.
* Makes it easy to avoid intermediate variables (`tmp` and `foo`).
## Downsides
* It's is easy to write code that is hard to understand. It tends to hide where the object on the left is going to in the function on the right.
So you see code like this
```{r}
fun <- function(a=1, b=10){c(max(a), min(b))}
a <- 2:10
a %>% fun(a^2)
```
and you have to go research `fun` to discover that this is `fun(a=a, b=a^2)`.
This kind of mistake is so easy to make. You want say, `rnorm(n=a)`, and accidentally write that in your pipe.
```{r}
a <- 3
a %>% rnorm(n=a)
a %>% rnorm()
```
The left flows into the first function spot, UNLESS you specified that and then it flows to the second or third or whatever is "open".
* You can spend hours trying to solve your piping issue when you could immediately write down code using intermediate values.
* It's really slow.
```{r message=FALSE}
library(microbenchmark)
bm <- microbenchmark(1:10 %>% sqrt() %>% mean() %>% round(digits=1), round(mean(sqrt(1:10)),digits=1) )
ggplot2::autoplot(bm)
```
* You must stay hyper-alert as to the names in the data frame you are working with. You might not have created this data frame. This may be a function you wrote to take any data frame from a user.... Otherwise, bad things like this can happen.
```{r}
fun <- function(x, cyl=1){
x %>% transform(new = mpg^cyl) %>% select(mpg, new)
}
```
Works great!
```{r}
df <- data.frame(mpg=runif(10, 10, 30))
fun(df, cyl=0.5)
```
Why, why? It didn't complain but this is wrong!
```{r}
fun(mtcars[1:5,], cyl=0.5)
```
# Tidyverse verbs
Read up on tidyverse verbs [here](https://dplyr.tidyverse.org/).
NOTE:
* I am using piping because that is how you will see these verbs used in online material. They are just functions and you do not need to use them with piping.
* They are very slow. For loops are not the place for these functions. `apply()` in base R is much faster.
* Tidyverse is maturing and the semantics across functions is not quite the same. Careful.
## Workflow
```
data.frame %>%
select columns or filter rows %>%
summarize or mutate
```
```
data.frame %>%
group data %>%
summarize or mutate
```
Verbs that we will learn:
* `select()` select columns
* `filter()` select rows
* `group_by()` apply the function differently within groups.
* `summarize()` apply function over columns -> fewer rows. Like `apply(df,2,function)`
* `mutate()` apply function to columns and make new column. Like `apply(df,1,function)`
What we won't learn: applying a function across a group of columns. See ?across to learn about that.
## Selecting and filtering
```
data.frame %>%
select or filter
```
### `select()`
Select columns.
```{r}
head(mtcars)
```
```{r}
mtcars %>% select(mpg)
```
```{r}
mtcars %>% select(-mpg, -qsec, -gear, -cyl)
```
```{r}
mtcars %>% select(disp:qsec)
```
```{r}
mtcars %>% select(!disp:qsec)
```
```{r}
df %>% select(where(is.numeric))
```
### `filter()`
Select rows.
```{r}
mtcars %>% filter(cyl==4)
```
Same as `subset(df, cyl==4)` or `df[df$cyl==4,]`
```{r}
mtcars %>% filter(cyl==4, hp>90)
```
Same as `subset(df, cyl==4 & hp>90)` or `df[df$cyl==4 & df$hp>90,]`.
## `summarize()`
`summarize()` summarizes down columns and leads to fewer rows. **You must tell it the columns to work on.**
```
data.frame %>%
summarize
```
Apply a function to columns.
```{r}
mtcars %>%
summarize(mean.wt=mean(wt))
```
I can add on more functions.
```{r}
mtcars %>%
summarize(mean.wt=mean(wt)) %>%
round(digits=2)
```
Same as
```{r}
round(mean(mtcars$wt), digits=2)
```
You can specify multiple columns. You have to put in `c()` in this case.
```{r}
mtcars %>%
summarize(mean.wt=mean(c(mpg, hp)))
```
Same as
```{r}
mean(c(mtcars$mpg, mtcars$hp))
```
Sadly this does NOT work for specifying a range. It computes something but not what you'd think. **This is the `apply across columns` case.** Read up on that in `?across`.
```{r}
mtcars %>%
summarize(mean.wt=mean(mpg:hp))
```
This is computing the mean of this:
```{r}
mtcars[1,"mpg"]:mtcars[1,"hp"]
```
which is
```{r}
mean(mtcars[1,"mpg"]:mtcars[1,"hp"])
```
## `summarize()` by column
`summarize_all()` and `summarize_at()`
Apply to a range of columns or all columns. Note the `_at()` and `_all()` functions have been superseded with `across()` but I find `across()` really cumbersome. I personally would use `select()` and `summarize_all()`.
```{r}
mtcars %>% summarize_at(vars(mpg:hp), mean)
```
```
data.frame %>%
select columns %>%
use summarize_all()
```
```{r}
mtcars %>%
select(mpg:hp) %>%
summarize_all(mean)
```
```{r}
apply(mtcars[,1:4],2,mean)
```
`summarize_all()` is less picky than `apply()`
```{r}
df <- mtcars[,1:3]; df$label="test"
df %>% summarize_all(mean)
```
while `apply()` gives all NA.
```{r}
apply(df,2,mean)
```
Another example.
```
data.frame %>%
select rows %>%
select columns %>%
use summarize_all() verb to apply %>%
format with round()
```
```{r}
mtcars %>%
filter(cyl==4) %>%
select(mpg:hp) %>%
summarize_all(mean) %>%
round(digits=2) %>%
paste(collapse=" -- ")
```
## `summarize()` by group
```
data.frame %>%
define groups %>%
apply summarize within groups
```
```{r}
df2 <- data.frame(id=paste0(rep("s",8),1:2),
pop=paste0(rep("r",8),rep(1:2,each=4)),
Week1=rnorm(8), Week2=rnorm(8), Week3=rnorm(8), Week4=rnorm(8))
head(df2)
```
The `group_by()` function allows you to do more complicated things. Take the mean by pop.
```{r}
df2 %>% group_by(pop) %>% summarize_all(mean)
```
```{r}
apply(df2[df2$pop=="r1",3:6],2,mean)
```
Take the mean for all weeks by pop and id. Notice that I have to specify columns for `summarize()`.
```{r}
df2 %>%
group_by(pop, id) %>%
summarize(mean=mean(c(Week1, Week2, Week3, Week4)))
```
## Combine group and select
Select based on properties. For some reason you have to wrap in `where()`.
```{r}
df2 %>% group_by(pop) %>%
select(where(is.numeric)) %>%
summarize_all(mean)
```
Select based on name.
```{r}
df2 %>% group_by(pop) %>%
select(ends_with("2")) %>%
summarize_all(mean)
```
## `mutate()`
Apply a function across rows and return a new column. Like `apply(df,1,fun)`.
```{r}
mtcars %>%
select(mpg:wt) %>%
mutate(wt.hp=wt/hp)
```
You can write custom functions to apply to the groups.
```{r}
fun <- function(x,y){fit <- lm(y~x); coef(fit)[1]} #return interept
mtcars %>%
group_by(carb) %>%
summarize(lmm=fun(mpg,hp))
```
Use `mutate()` to add column. **This will not change the number of rows.**
```{r}
mtcars %>%
group_by(carb) %>%
mutate(lmm=fun(mpg,hp)) %>%
select(mpg, hp, lmm)
```
### Careful `mutate()` guesses
That works.
```{r}
mtcars %>%
select(mpg:wt) %>%
mutate(mean=sqrt(wt))
```
Yipes! It decided *you must want me to take the mean in the whole column*.
```{r}
mtcars %>%
select(mpg:wt) %>%
mutate(mean=mean(wt))
```
### Careful regarding *across*
It is super easy to try to use this to apply functions across multiple columns. Doesn't work. You need to use `?across` (or `group_by()` with a row factor).
```{r}
mtcars %>%
select(mpg:wt) %>%
mutate(mean1 = (wt+hp)/2, mean2=mean(c(wt,hp)))
```
## Tidyverse verbs are slow
For loops are not the place for these verbs!
```{r message=FALSE}
bm <- microbenchmark(
apply(mtcars,2,sum),
summarize_all(mtcars, sum)
)
ggplot2::autoplot(bm)
```
```{r message=FALSE}
bm <- microbenchmark(
mtcars[mtcars$cyl==4,],
filter(mtcars, cyl==4),
subset(mtcars, cyl==4)
)
ggplot2::autoplot(bm)
```
# Long and wide data frames
Wide is what you need for say `lm()`. You have your response variable in one column and then your predictor variables in other columns.
```{r}
df <- subset(mtcars, select=c(mpg, wt, hp, carb))
df$model <- rownames(mtcars)
head(df)
```
Fit with `lm()`.
```{r}
fit <- lm(mpg ~ wt + hp, data=df)
summary(fit)
```
Long form is what ggplot wants. Usually I have to ponder a bit before I decide how to make my long form data frame. There is not just "one" that you could make from your wide form data frame.
Let's make a plot of the red, green and blue columns. We want the points to be red, green or blue.
```{r}
plot(df$mpg, col="red", ylim=c(0,200))
points(df$wt, col="green")
points(df$hp, col="blue")
```
Same idea but let's plot versus the carburator number.
Let's make a plot of the red, green and blue columns. We want the points to be red, green or blue.
```{r}
plot(df$carb, df$mpg, col="red", ylim=c(0,200))
points(df$carb, df$wt, col="green")
points(df$carb, df$hp, col="blue")
```
Now let's do this in ggplot. What values do we want in the `aes()` function? Those need to be columns. `aes(x=..., y=..., col=....)`
```
data.frame(
col=name.of.thing.we.plot,
y=value.of.thing.we.plot,
x=x.value
)
```
So we'll have a column of x (carb or model or 1:n), a column for our y (mpg,wt,hp) and a column for names ("mpg","wt","hb").
```{r}
df.long <- data.frame()
a <- data.frame(model=df$model, val=df$mpg, name="mpg", carb=df$carb)
df.long <- rbind(df.long, a)
a <- data.frame(model=df$model, val=df$wt, name="wt", carb=df$carb)
df.long <- rbind(df.long, a)
a <- data.frame(model=df$model, val=df$hp, name="hp", carb=df$carb)
df.long <- rbind(df.long, a)
df.long
```
Now we can plot with ggplot.
```{r}
ggplot(df.long, aes(x=model, y=val, col=name)) + geom_point()
```
And we can easily do panels.
```{r}
ggplot(df.long, aes(x=model, y=val, col=name)) + geom_point() +
facet_wrap(~name)
```
## Pivoting
So creating your long data frames like that would get tidieous and inefficient.
The tidyr package has [functions to make this easier](https://dcl-wrangle.stanford.edu/pivot-basic.html).
### `pivot_longer`
```
pivot_longer(
data frame name,
cols = what columns will you stack or pivot,
names_to = colname for the column with the column names that your stacked,
values_to = colname for the column with the values
)
```
```{r}
df.long2 <-
pivot_longer(
df,
cols = c("mpg", "wt", "hp"),
names_to = "name",
values_to = "val"
)
```
```{r}
df.long2
```
It changed the order our names but that's ok. We can fix that if needed.
```{r}
ggplot(df.long, aes(x=model, y=val, col=name)) + geom_point() +
facet_wrap(~name)
```
It changed the order our names but that's ok. We can fix that if needed.
```{r}
df.long$name <- factor(df.long$name, levels=c("mpg", "wt", "hp"))
ggplot(df.long, aes(x=model, y=val, col=name)) + geom_point() +
facet_wrap(~name) +
theme(axis.text.x=element_blank())
```
Read `?tidyr::pivot_longer` to learn more options but this will be all you need most of the time.
### `pivot_wider`
```
pivot_wider(
data frame name,
names_from = what column has the column names
values_from = what column has the values
)
```
```{r}
pivot_wider(
df.long,
names_from = "name",
values_from = "val"
)
```
BTW, you will normally see the function called with piping.
```{r results="hide"}
df.long %>%
pivot_wider(
names_from = "name",
values_from = "val"
)
```
# Comments
I rarely use piping. Why?
* Piping is fine but is super slow. Also I develop packages and the R package checks often through errors with piping or any type of implicit reference to data columns in data frames. So like `subset(df, x==1)` is implicit because it is implying that there is an `x` column in `df` but that code will actually work if `x` is not in `df`. If you are a package writer, the check will complain about that vagueness.
* I have a hard time following long strings of pipes and lose track of what's going on.
And I rarely use dplyr, tidyr, or tidyverse verbs.
* tidyverse is "maturing" which means the functions keep changing...which means my code rots quickly.
* I often have to do something that takes me a long time to figure out how to do with tidyverse.
Whereas learning ggplot2 has made my plotting much, much faster, I can't say the same with data-wrangling with tidyverse except for a few functions that I use to help out with ggplot.