-
Notifications
You must be signed in to change notification settings - Fork 7
/
Copy path06-time.Rmd
355 lines (275 loc) · 17 KB
/
06-time.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
# Temporal data {#time}
Time is ubiquitous in road safety data, since collisions and road safety implementations always happen at some point in time.
This section will show how you can analyse the temporal dimensions of the real world `crashes_2019` object we created in Section \@ref(pkgs), and then demonstrate how to handle time series data in base R, as well as with `hms` and `lubridate` packages.
The aim is to get you up-to-speed with how data analysis with time data 'feels' before learning the details in subsequent sections.
If you are the kind of person who likes to know the details first, feel free to skip this section and return to it later.
## Temporal analysis of crash data
To get a feel for temporal data analysis in R, let's start by reading-in crash data for 2019 with the `stats19` package by typing the following into the Source Editor and running the code, line-by-line, as taught in Section \@ref(rstudio):
```{r, message=FALSE}
library(stats19)
crashes_2019 = get_stats19(2019)
```
Note that, unlike the longer `crashes_2019 = get_stats19(year = 2019, type = "accidents")` used in Section \@ref(pkgs), we did not use *named arguments* in this code chunk.
Instead of `year = 2019`, we simply typed 2019.
That is possible because R functions can be specified by name matching or order: the first argument of `get_stats()` is `year`, so the function is expecting a year value.
Also, although we didn't explicitly specify the `accidents` table, `type = "accidents"` is the default value, so `type` only needs to be specified when importing casualty and vehicle datasets.
With that educational aside out of the way, we will now take a look at the time variables that are actually in our newly read-in dataset:
```{r}
library(tidyverse)
crashes_2019 %>%
select(matches("time|date")) %>%
names()
```
Building on the previous section and a bit of guesswork, it should be clear what just happened:
we selected variables that *match* (with the `matches()` function) the character strings `"time"` or (as indicated by the `|` vertical pipe symbol) `"date"` and returned the matching variable names.
This shows that the `stats19` package gives you not one, not two, but three temporal variables.
**Exercises:**
1. Print the first 6 and then the first 10 elements of each of the three temporal variables in `crashes_2019`.
2. What is the class of each variable (technically, of each vector)?
3. **Bonus:** Extract the weekday from the variable called `date`.
4. **Bonus:** How many crashes happened on Monday?
```{r, eval=FALSE, echo=FALSE}
library(stats19)
crashes_2019 = stats19::get_stats19(year = 2019, type = "ac")
crashes_2019
```
<!-- **Advanced challenge:** calculate how many crashes occurred for each day of the week. Then plot it with ggplot2. Repeat the same exercises extracting the hour of the car accident from the variable called time. How would you combine the two informations in a single plot? -->
```{r, eval=FALSE, echo=FALSE}
# solutions
crashes %>% filter(casualty_hour >= 12)
crashes %>% filter(casualty_hour > 15 & casualty_hour < 19)
crashes_2019 %>%
mutate(my_weekdays = weekdays(date)) %>%
filter(my_weekdays == "Monday") %>%
nrow()
crashes_2019 %>%
mutate(my_weekdays = weekdays(date)) %>%
filter(my_weekdays == "Friday") %>%
nrow()
crashes_2019 %>%
mutate(my_weekdays = weekdays(date)) %>%
group_by(my_weekdays) %>%
summarize(n = n()) %>%
ggplot() +
geom_col(aes(x = my_weekdays, y = n))
crashes_2019 %>%
mutate(my_hours = hour(hm(time))) %>%
group_by(my_hours) %>%
summarize(n = n()) %>%
ggplot() +
geom_col(aes(x = my_hours, y = n))
crashes_2019 %>%
mutate(my_weekdays = weekdays(date), my_hours = hour(hm(time))) %>%
group_by(my_weekdays, my_hours) %>%
summarise(n = n()) %>%
ggplot() +
geom_line(aes(x = my_hours, y = n, col = my_weekdays), size = 1.05)
# the legend needs some reordering
```
```{r, echo=FALSE, eval=FALSE}
crashes_2019 %>%
select(matches("time|date")) %>%
lapply(head, 10)
crashes_2019 %>%
select(matches("time|date")) %>%
sapply(class)
```
Of the three time variables, it should be clear from the outcome of previous exercises that `datetime` contains the most useful information.
To consolidate the plotting know-how learned in Section \@ref(pkgs), we shall start by simply plotting the `datetime` object (Figure 6.1). Plotting data is a good way of understanding new datasets and the variables they contain.
Create the following three plots to show how `date` and `time` vary as a function of `datetime`:
```{r ggtime1, warning=FALSE, fig.cap="Three plots of the datetime (x axis) in relation to the date and time axis.", out.width="30%", fig.show='hold'}
library(ggplot2)
ggplot(crashes_2019) + geom_point(aes(datetime, date))
ggplot(crashes_2019) + geom_point(aes(datetime, time))
b = c("07:00", "09:00", "12:00", "17:00", "19:00")
ggplot(crashes_2019) + geom_point(aes(datetime, time), alpha = 0.01) +
scale_y_discrete(breaks = b)
```
The three figures above tell us many things about the contents of the three temporal variables. It even provides insight into the temporal distribution of road casualties in Great Britain.
The first two plots (Figure 6.1 and 6.2) show: 1) that the `date` variable is identical to the `datetime` variable (at least on the daily resolution than can be seen on the graph); and 2) that `time` values repeat regularly for the range of dates in `datetime` (from the start of Jan 2019 to end of Dec 2019).
Figure 6.3 makes use of `ggplot2`'s functionality to show only certain labels on the Y axis and reduced opacity, so that overlapping points are not completely black. This by far is the most useful of the three plots, showing that most crashes happen between around 7am and 7pm, with a 'long tail' of crashes in the evening, and that for most of the year there is a clear weekly cycle, reflecting the uptick in crashes during the rush hour commute on weekdays, a pattern that is greatly diminished during several weeks in summer (perhaps corresponding with summer holidays).
The 52 weeks of the year can be distinguished even in this small and simple plot, highlighting the ability of visualisation to help understand data.
Next, let's look at how the time-of-day that crashes occur varies as a function of season, severity and day of week.
From the `datetime` object of class `POSIXct`, any type of time information can be extracted. This includes the minute, hour, day of week and month of the crash (or other) event that the object records.
```{r, eval=FALSE, echo=FALSE}
library(lubridate)
weekdays(min(crashes_2019$date) + 0:6)
table(factor(weekdays(crashes_2019$date), levels = weekdays(min(crashes_2019$date) + 0:6)))
```
Building on the time series plot we created in Section \@ref(ggplot2), let's create a graph showing how the hourly distribution of crash numbers changes during the course of a working week.
We will do this first by *preprocessing* the data, creating a new object called `crashes_dow`, containing `hour` and `day` columns, then filtering out weekend, and plotting the results, as shown in the code chunk below and Figure \@ref(fig:ggfacetwkday):
```{r ggfacetrwkay, fig.cap="Facetted time series showing how the number of crashes increases during the working week.", warning=FALSE, message=FALSE, fig.height=2}
# days of the week:
dow = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
crashes_dow = crashes_2019 %>%
mutate(hour = lubridate::hour(datetime)) %>%
mutate(day = factor(weekdays(date), levels = dow))
crashes_dow %>%
filter(!is.na(hour) & !day %in% c("Saturday", "Sunday")) %>%
ggplot(aes(hour)) +
geom_bar(width = 1.01) +
facet_wrap(~day, nrow = 1)
```
The result in Figure \@ref(fig:ggfacetwkday) is useful, but if we're interested in the number of crashes per hour on different days of the week *relative to the average*, we need to undertake more preprocessing steps.
We will count the number of crashes per hour for all 5 working days and then divide by 5 to get the average number of crashes per hour during weekdays. Then we will count the number of crashes per hour/week combination. Finally we will divide the latter by the former.
These steps are shown in the code chunk below, which results in Figure \@ref(fig:ggfacetrwkayrel).
```{r ggfacetrwkayrel, fig.cap="Facetted time series showing relative number of crashes per hour by day in the working week.", warning=FALSE, message=FALSE, fig.height=2}
crashes_day_rel = crashes_dow %>% # create 'day of week relative' object
filter(!is.na(hour) & !day %in% c("Saturday", "Sunday")) %>% # none on weekends
select(day, hour) %>% # keep only time columns
group_by(hour) %>% # group by hour
mutate(n_per_hour = n()/5) %>% # number per hour (divide by 5 for n. days)
group_by(day, hour) %>% # group by day and hour
summarise(n_hday = n(), n_h = first(n_per_hour)) %>% # summarise results
mutate(hday_relative = n_hday / n_h) # calculate relative n. crashes per hour/day
summary(crashes_day_rel)
crashes_day_rel %>%
ggplot() +
geom_col(aes(hour, hday_relative)) +
facet_wrap(~day, nrow = 1)
```
The results clearly show that Friday is a dangerous day as many of the columns are above 1 (*NB as this is a relative calculation, columns that are less than 1 indicate that there are less crashes per hour on that day than average whereas those above 1 indicate that there are more crashes per hour on that day than average*).
The extent to which the high relative number of crashes in the most anomalous hours (Friday evening) is due increased exposure vs increased risk per km travelled cannot be ascertained by this plot but it certainly suggests that Friday afternoon and evening is a worthy focus of road safety research.
**Exercises:**
1. Building on the code above, show the absolute and relative number of crashes per hour on Saturday and Sunday.
2. Filter the dataset so it contains only data from two police forces of your choice (e.g. `West Yorkshire` and `Metropolitan Police`).
3. Try creating plots similar to those shown above but facetted by police force rather than by day of the week.
## Handling dates and date-times
It is worth remembering that base R already has decent support for dates and `datetimes`, although the base R functions are not particularly intuitive.
This is shown in the code chunk below, which creates objects representing the date and time of a fictitious crash event on a cold winter's morning, 1^st^ January 2020, and a subsequent road safety intervention on the 20^th^ October 2020:
```{r}
crash_datetime_character = "2020-01-01 08:35" # creates date/time as a character
crash_datetime = as.POSIXct(crash_datetime_character) # converts date/time to a object of the POSIXct type
class(crash_datetime)
intervention_date_character = "2020-10-20"
intervention_date = as.Date(intervention_date_character) # converts date/time to a object of the Date type
class(intervention_date)
# see ?as.POSIXct for more examples
```
'POSIXct', 'POSIXt' and 'Date' are data types for dates and time that enable easy manipulation of such data. Fortunately for most users, there are easier ways to work with time series data, starting with the `hms` package.
## Hours, minutes seconds with hms
The `hms` library in the `tidyverse` can be used to process hours, minutes and seconds, as shown below.
See a very basic demo of the package and links to the package's help pages with the following commands in which we use the package without loading it with the `library()` function, demonstrating the `package::function()` syntax taught in Section \@ref(pkgs):
```{r}
library(tidyverse)
```
```{r}
crash_time_character = "08:35:00"
crash_time_hms = hms::as_hms(crash_time_character)
class(crash_time_hms)
```
```{r, eval=FALSE}
?hms::`hms-package`
```
As the package's name suggests, it is used for dealing with hours, minutes and seconds.
It can round time objects of class `hms` to the nearest second (or any multiple of a second):
```{r}
hms::round_hms(crash_time_hms, 1) # time to the nearest second
hms::round_hms(crash_time_hms, 1 * 60 * 60) # time to the nearest hour
hms::round_hms(crash_time_hms, 1 * 30 * 60) # time to the nearest half hour
```
It can also convert simple text strings into time objects, e.g. as follows (**Note:** we do not need to include the `:00`):
```{r}
hms::parse_hm("08:35")
```
## The lubridate package
In many cases the most useful and easy to use package when working with temporal data is `lubridate`.
Having installed it, load it as follows:
```{r, message=FALSE}
library(lubridate)
```
The simplest example of a Date object that we can analyze is just the current date, i.e.:
```{r}
today()
```
We can manipulate this object using several `lubridate` functions to extract the current day, month, year, weekday and so on...
```{r, eval=FALSE}
x = today()
day(x)
wday(x)
wday(x) %in% c(1, 6) # is it the weekend?
month(x)
year(x)
```
Base R can also be used to extract data e.g. `# Base R function to get the day of week
weekdays(x)`.
**Exercises: **
1. Look at the help page of the `lubridate` function `month` to see how it is possible to extract the current month as a character vector.
2. Look at other functions in `lubridate` to extract the current weekday as a number, the week of year and the day of the year.
Date variables are often stored simply as character vectors.
This is a problem, since R is not always smart enough to distinguish between character vectors representing Dates.
`lubridate` provides functions that can translate a wide range of date encodings such as `ymd()`, which extracts the Year, Month and Day from a character string, as demonstrated below.
```{r, eval=FALSE}
as.Date("2019-10-17") # works
as.Date("2019 10 17") # fails
ymd("2019 10 17") # works
dmy("17/10/2019") # works
```
Import functions, such as `read_csv`, try to recognize the Date variables.
Sometimes this fails.
You can manually create Date objects, as shown below:
```{r}
x = c("2009-01-01", "2009-02-02", "2009-03-03")
x_date = ymd(x)
x_date
```
**Exercises:**
1. Extract the day, the year-day, the month and the weekday (as a non-abbreviated character vector) of each element of `x_date`.
2. Convert `"09/09/93"` into a date object and extract its weekday.
3. **Bonus:** Read the help page of `as.Date` and `strptime` for further details on base R functions for dates.
4. **Bonus:** Read the Chapter 16 of [R for Data Science book](https://r4ds.had.co.nz/dates-and-times.html) for further details on `lubridate` package.
```{r, echo=FALSE, eval=FALSE}
# 1. Extract the day, the year-day, the month and the weekday (as a non-abbreviated character vector) of each element of `x_date`.
day(x_date)
yday(x_date)
month(x_date)
weekdays(x_date, abbreviate = FALSE)
# 1. Modify the previous example to parse the following character string: `"09/09/1993"` and extract its weekday.
weekdays(dmy("09/09/93"))
wday(dmy("09/09/93"))
```
## Dates in a data frame
We can use Dates for subsetting events in a dataframe. For example, if we define `x_date` as before and add it to the `crash` dataset, i.e.:
```{r}
crashes$casualty_day = x_date
```
Then we can subset events using Dates. For example:
```{r}
filter(crashes, day(casualty_day) < 7) # the events that ocurred in the first week of the month
filter(crashes, weekdays(casualty_day) == "Monday") # the events occurred on monday
```
**Exercises:**
1. Select only the events (rows in `crashes`) that occurred in January.
2. Select only the events that ocurred in an odd year-day.
3. Select only the events that ocurred in a leap-year (**Hint:** check the function `leap_year`).
4. Select only the events that ocurred during the weekend or in June.
5. Select only the events that ocurred during the weekend and in June.
6. Count how many events ocurred during each day of the week.
## Components of time objects
Now we'll take a look at the time components of a Date. Using the function `hms` (acronym for Hour, Minutes, Seconds) and its subfunctions such as `hm` or `ms`, we can parse a character vector representing several times into an Hour object (which is technically called a 'period object').
```{r}
x = c("18:23:35", "00:00:01", "12:34:56")
x_hour = hms(x)
x_hour
```
We can manipulate these objects using several `lubridate` functions to extract the hour component, the minutes, and so on:
```{r}
hour(x_hour)
minute(x_hour)
second(x_hour)
```
If the Hour data does not specify the seconds, we just use a subfunction of `hms`, namely `hm`, to get the hours and minutes, rather than hours, minutes and seconds.
```{r}
x = c("18:23", "00:00", "12:34")
(x_hour = hm(x))
```
We can use Hour data also for subsetting events, like we did for Dates. Let's add a new column for hour to the crashes data:
```{r}
crashes$casualty_hms = hms(c("18:23:35", "00:00:01", "12:34:56"))
crashes$casualty_hour = hour(crashes$casualty_hms)
```
**Exercises:**
1. Filter only the events that occurred after midday (i.e. the PM events). **Hint:** your answer may include `>= 12`.
2. Filter only the events that ocurred between 15:00 and 19:00.
<!-- 1. Round all hours to the next hour. Hint: Look at the help page of the `round_date` function. -->
<!-- 1. **Bonus (difficult):** run the following code, which downloads data for car crashes that occurred during 2019. -->