---
title: "Data Types"
output: html_notebook
---
```{r setup}
library(tidyverse)
library(babynames)
library(nycflights13)
library(stringr)
library(forcats)
library(lubridate)
library(hms)
```
## Your Turn 1
Use `flights` to create `delayed`, a logical variable that indicates whether a flight was delayed (`arr_delay > 0`).

Then, remove all rows that contain an `NA` in `delayed`.

Finally, create a summary table that shows:

1. How many flights were delayed
2. What proportion of flights were delayed

```{r}
```
## Your Turn 2
In your group, fill in the blanks to:

1. Isolate the last letter of every name and create a logical variable that indicates whether the last letter is one of "a", "e", "i", "o", "u", or "y".
2. Use a weighted mean to calculate the proportion of children whose name ends in a vowel (by `year` and `sex`).
3. Display the results as a line plot.

```{r}
babynames %>%
  _______(last = _________,
          vowel = __________) %>%
  group_by(__________) %>%
  _________(p_vowel = weighted.mean(vowel, n)) %>%
  _________ +
  __________
```
## Your Turn 3
Repeat the previous exercise, part of whose code is shown below, to make a sensible graph of average TV consumption by marital status.

```{r}
gss_cat %>%
  drop_na(________) %>%
  group_by(________) %>%
  summarise(_________________) %>%
  ggplot() +
    geom_point(mapping = aes(x = _______, y = _________________________))
```
## Your Turn 4
Do you think liberals or conservatives watch more TV?

Compute average TV hours by party ID and then plot the results.

```{r}
```
## Your Turn 5
What is the best time of day to fly?

Use the `hour` and `minute` variables in `flights` to compute the time of day for each flight as an `hms` value. Then use a smooth line to plot the relationship between time of day and `arr_delay`.

```{r}
```
## Your Turn 6
Fill in the blanks to:

1. Extract the day of the week of each flight (as a full name) from `time_hour`.
2. Calculate the average `arr_delay` by day of the week.
3. Plot the results as a column chart (bar chart) with `geom_col()`.

```{r}
flights %>%
  mutate(weekday = _______________________________) %>%
  __________________ %>%
  drop_na(arr_delay) %>%
  summarise(avg_delay = _______________) %>%
  ggplot() +
    ___________(mapping = aes(x = weekday, y = avg_delay))
```
***
# Take Aways
The dplyr package gives you three _general_ functions for manipulating data: `mutate()`, `summarise()`, and `group_by()`. Augment these with functions from the packages below, which focus on specific types of data.

Package   | Data Type
--------- | ---------------
stringr   | strings
forcats   | factors
hms       | times
lubridate | dates and times
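
As a rough illustration of how the type-specific packages slot into the general dplyr verbs, here is a minimal sketch. The `toy` data frame and all of its column names are invented for this example and are not part of the exercises above. The helper functions shown are just one plausible choice from each package.

```{r}
library(tidyverse)
library(lubridate)
library(hms)

# Hypothetical toy data, invented for this sketch
toy <- tibble(
  name   = c("Ada", "Grace", "Alan", "Edsger"),
  team   = c("b", "a", "b", "c"),
  depart = c("05:15:00", "17:30:00", "09:00:00", "23:45:00"),
  day    = c("2013-01-01", "2013-01-02", "2013-02-05", "2013-02-06"),
  score  = c(4, 1, 3, 1)
)

# mutate() is the general verb; each helper handles one data type
typed <- toy %>%
  mutate(
    starts_with_a = str_detect(name, "^A"),  # stringr: pattern match on a string
    team          = fct_infreq(team),        # forcats: order levels by frequency
    depart        = hms::as_hms(depart),     # hms: parse a time of day
    month         = month(ymd(day))          # lubridate: parse a date, extract the month
  )

# group_by() and summarise() then work on the typed columns as usual
by_month <- typed %>%
  group_by(month) %>%
  summarise(avg_score = mean(score))
```

The pattern is the same in every case: the dplyr verb decides which rows and columns are touched, and the type-specific function decides what happens to each value.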