---
output:
  html_document:
    keep_md: yes
    toc: yes
---
# Peer-graded Assignment: Course Project 1
```{r Run Date, include=FALSE}
Runtime <- Sys.Date()
```
---
title: "PA1_template.Rmd"
author: "Julian Buhagiar"
date: 2017-04-07
output: html_document
---
date last modified: `r Runtime`
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# remove Runtime, which is only used for the header of this document
rm(Runtime)
```
## Questions
[What is mean total number of steps taken per day?][1]
[What is the average daily activity pattern?][2]
[Are there differences in activity patterns between weekdays and weekends?][3]
## Description
This assignment makes use of data from a personal activity monitoring device.
This device collects data at 5 minute intervals throughout the day. The data
consists of two months of observations from an anonymous individual, collected
during October and November 2012, and includes the number of steps taken
in 5 minute intervals each day.
```{r source file, echo=TRUE}
URL = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip"
```
## Variables
### pre-process
Included in the source dataset are:
• `steps`: Number of steps taken in a 5-minute interval (missing values are coded as NA)
• `date`: The date on which the measurement was taken in YYYY-MM-DD format
• `interval`: Identifier for the 5-minute interval in which measurement was taken
### post-process
Included in the final dataset are:
• `sourcedData`: The data.frame that was read into R
• `timeOnly`: A constructed variable which converts the interval into a time.
Note that the current system date is stored with this value and is to be
disregarded for the analysis
• `day`: The day of the week
• `time`: A variable constructed which joins the observation date with the time
• `constructedData`: This is a transformed version of the `sourcedData` data.frame
where all NAs have been replaced with the average value for the interval.
• `Q1`: This is the data.frame associated with Question 1 output.
• `Average.StepsPerDay`: A list containing the median and mean in answer to Question 1.
• `Q2`: This is the data.frame associated with Question 2 output.
• `activityPeak`: A list containing the lower and upper bounds for the most active
interval in answer to Question 2.
• `Q3`: This is a list of two data.frames, `weekend` and `weekday`, associated with
Question 3 output.
## System Info and Library Prerequisites
### System
```{r prerequisites, echo=FALSE}
message("R version\t",version$version.string)
message("OS\t",version$os)
```
### Libraries
```{r load libraries, include=FALSE}
# 1.0 prerequisite
library(dplyr)
library(stringr)
library(ggplot2)
```
```{r library status, echo=FALSE}
# 1.0 prerequisite: confirm that a function from each required package is available
if(!exists("tbl_df")){
  stop("dplyr 0.4.0 or greater is required")
}
message("dplyr ",packageVersion("dplyr")," was used for the production of this analysis")
if(!exists("str_pad")){
  stop("stringr 1.1.0 or greater is required")
}
message("stringr ",packageVersion("stringr")," was used for the production of this analysis")
if(!exists("ggplot")){
  stop("ggplot2 2.1.0 or greater is required")
}
message("ggplot2 ",packageVersion("ggplot2")," was used for the production of this analysis")
```
### Prerequisite Common Functions
#### Workspace Housekeeping
`keep()` is an internal function which keeps a record of the important variables
that the script will retain in the workspace after the code has run.
`cleanWorkspace()` is a finalising function which removes the `<<-` variables that
aren't intended to be retained after the code has run.
```{r keep, include=TRUE}
keep<-function(...){
  #>DESCRIPTION----
  # keeps a record of starting globals
  # and appends the desired variables that we wish to keep by the end of the
  # script, thereby making all other variables and functions that are made during
  # the running of this function as temporary.
  if(!exists("GlobalEnvKeep")){
    GlobalEnvKeep<<-c(ls(envir = as.environment(globalenv())))
  }
  GlobalEnvKeep<<-c(GlobalEnvKeep,...)
}
keep()
cleanWorkspace<-function(){
  rm(list= ls(envir=as.environment(globalenv()))[!(ls(envir=as.environment(globalenv())) %in% GlobalEnvKeep )],
     envir=as.environment(globalenv()))
  rm(keep,envir=as.environment(globalenv()))
}
```
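The same bookkeeping idea can be sketched on a throwaway environment, which makes it easier to see that `keep()` is meant to receive variable names as character strings. This is only an illustration under assumed names: `workspace`, `keepList` and `keepName()` are hypothetical and are not used anywhere in the script.

```r
# hypothetical stand-alone illustration; none of these names are used by the script
workspace <- new.env()     # stands in for the global environment
keepList <- character(0)   # stands in for GlobalEnvKeep

keepName <- function(...) {
  # append the supplied names (character strings) to the keep list
  keepList <<- c(keepList, ...)
}

assign("result", 42, envir = workspace)
assign("scratch", 1:10, envir = workspace)
keepName("result")

# remove everything that was not registered, as cleanWorkspace() does globally
rm(list = setdiff(ls(workspace), keepList), envir = workspace)
remaining <- ls(workspace)
```

Only the registered name survives the clean-up; passing the object itself instead of its name would append the object's value to the keep list rather than protect the variable.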
#### Retrieve data files
`sourceData()` is used to download the data to the workspace and the working folder.
```{r sourceData function, include=TRUE}
sourceData<-function(src=URL,expectedfiles="activity.csv",workspaceNames="sourcedData"){
  if(!file.exists(expectedfiles)){
    message("downloading files")
    # mode="wb" keeps the zip archive intact on Windows
    download.file(src,"data.zip",mode="wb")
    unzip("data.zip")
    file.remove("data.zip")
  }
  columnClassPreDefined<-c("numeric","character","character")
  ## reading the file will likely take a few seconds
  if (!exists(workspaceNames)){
    sourcedData<<- read.csv(expectedfiles,
                            colClasses=columnClassPreDefined,
                            header = TRUE,
                            stringsAsFactors = FALSE,
                            na.strings = "NA")
    #change the date column to a date type field
    sourcedData$date<<-as.Date(sourcedData$date)
    #pads the interval to become a 4 digit time
    sourcedData$interval<<-stringr::str_pad(sourcedData$interval,width=4,side="left",pad="0")
    # combines date and time into a single column called time
    sourcedData$time<<-paste(sourcedData$date,sourcedData$interval)
    sourcedData$time<<-as.POSIXct(strptime(sourcedData$time,"%Y-%m-%d %H%M"))
    # time of day only; strptime attaches the current system date
    sourcedData$timeOnly<<-as.POSIXct(strptime(sourcedData$interval,"%H%M"))
    # day of the week; assigned with <<- so the global copy is updated
    sourcedData$day<<-weekdays(sourcedData$time)
    ## retains the sourcedData variable for the project workspace
    keep("sourcedData")
  }
  # organising data----
  sourcedData<<-dplyr::tbl_df(sourcedData)
  message("Data loaded successfully into the workspace as 'sourcedData'")
  ##creates a container to keep useful metadata as a list
  metadata<<-list(NULL)
  metadata$head<<-(head(sourcedData))
  metadata$summary<<-summary(sourcedData)
  str(sourcedData)
}
```
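The interval handling can be illustrated on a handful of values. The sketch below assumes intervals are encoded as `HHMM` without leading zeros (so `5` means 00:05 and `835` means 08:35), and it uses base R `sprintf()` in place of `stringr::str_pad()` purely to stay self-contained. The variable names are illustrative only.

```r
# toy interval identifiers in the assumed HHMM-without-leading-zeros encoding
interval <- c(0, 5, 55, 100, 835, 2355)

# left-pad to four digits (base R stand-in for stringr::str_pad)
padded <- sprintf("%04d", as.integer(interval))

# split the padded string into hours and minutes
hours <- as.integer(substr(padded, 1, 2))
minutes <- as.integer(substr(padded, 3, 4))

# minutes elapsed since midnight for each interval
minutesSinceMidnight <- hours * 60L + minutes
```

Padding matters because `"%H%M"` needs four digits to be parsed unambiguously; without it, `55` and `100` are only one step apart as numbers even though they are consecutive 5 minute intervals.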
## Analysis Run Sequence
The call to the function `sourceData()` loads the data into R.
```{r load data}
sourceData(URL)
```
Once loaded, the data contains `r nrow(sourcedData)` observations
and `r length(sourcedData)-3` source variables. The `interval` variable represents a
24 hour time, and a new variable called `time` has been introduced which joins the date
and time together into a single POSIXct date time. Another variable called `day`
has also been included, which is needed for [question 3][3]. The [variables][4] are
called `r names(sourcedData)`.
Running `summary()` on the loaded data shows there are `r metadata$summary[7,1]` in the `steps` column.
The observations were taken between `r min(sourcedData$date)` and `r max(sourcedData$date)`.
### Question 1 - What is mean total number of steps taken per day?
```{r Q1, echo=TRUE, fig.width=10}
Q1<-as.data.frame(summarise(group_by(sourcedData,date),
                            sum(steps,na.rm = TRUE)
))
names(Q1)<-c("date","steps.Total")
keep("Q1")
hist(as.matrix(Q1$steps.Total[Q1$steps.Total!=0]),
     xlab="Total number of steps in a day",
     main="Plot 1: Distribution for the number of steps taken in a day",
     breaks= 20)
rug(as.matrix(Q1$steps.Total[Q1$steps.Total!=0]),col="grey")
abline(v=mean(Q1$steps.Total[Q1$steps.Total!=0]),col="blue",lty=3,lwd=5)
abline(v=median(Q1$steps.Total[Q1$steps.Total!=0]),col="red",lty=5,lwd=1)
legend("topright",legend=c("mean","median"),col=c("blue","red"),lty=c(3,5),lwd=c(5,1))
```
The total number of steps per day can be seen in the `Q1` data.frame.
Plot 1 shows an evenly balanced distribution in which the most common
number of steps per day is between 10000 and 11000. The plot also shows that the
mean and median are close to each other, indicating there is very little skew for
this set of observations.
```{r, echo=TRUE}
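Average.StepsPerDay <- list(NULL)
# keep() expects the variable name as a character string
keep("Average.StepsPerDay")
# round() gives the nearest whole number; as.integer() alone would truncate
Average.StepsPerDay$mean <- as.integer(round(mean(Q1$steps.Total[Q1$steps.Total!=0])))
Average.StepsPerDay$median <- as.integer(round(median(Q1$steps.Total[Q1$steps.Total!=0])))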
```
Excluding the days where no data was collected:
The **mean number** of steps taken per day is
`r Average.StepsPerDay$mean` (to nearest whole number).
The **median number** of steps taken per day is
`r Average.StepsPerDay$median` (to nearest whole number).
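
The reason zero-total days are treated as days without data can be seen on a toy example: summing an all-`NA` day with `na.rm = TRUE` gives 0, which would otherwise pull the averages down. The data below is invented for illustration and `toy`, `totals` and `observed` are hypothetical names.

```r
# invented data: the first day has no recorded values at all
toy <- data.frame(
  date = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2),
  steps = c(NA, NA, 100, 200, 50, 150),
  stringsAsFactors = FALSE
)

# daily totals; the all-NA day sums to 0 because na.rm = TRUE drops every value
totals <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

# exclude the zero-total day before averaging, as the analysis does
observed <- totals[totals != 0]
meanAll <- mean(totals)         # includes the artificial 0
meanObserved <- mean(observed)  # only days with recorded data
```

One caveat of this filter is that a genuine day with zero recorded steps would be excluded as well; the approach assumes no such day exists in the data.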
### Question 2 - What is the average daily activity pattern?
Running `summary()` on the loaded data shows there are `r metadata$summary[7,1]` in the `steps` column.
The NA values occur for whole days, so the script will attempt to impute values
based on the average for each interval.
```{r, echo=TRUE}
# average number of steps for each interval, ignoring NAs
Q2<-as.data.frame(summarise(group_by(sourcedData,timeOnly,interval),
                            steps.Mean=mean(steps,na.rm = TRUE)))
#create a subset of the data where NAs occur
constructedData<-subset(sourcedData,is.na(sourcedData$steps))
#impute the average of the matching interval where there is the NA
#(matching on interval avoids relying on vector recycling and row order)
constructedData$steps<-Q2$steps.Mean[match(constructedData$interval,Q2$interval)]
#bind the two dataframes together
constructedData<-dplyr::bind_rows(constructedData,subset(sourcedData,!is.na(sourcedData$steps)))
#reorder by time
constructedData<-dplyr::arrange(constructedData, time)
str(constructedData)
keep("constructedData")
NAcount<-sum(is.na(constructedData$steps))
```
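The imputation step can be sketched in base R on a tiny invented dataset. This is a minimal illustration of replacing each `NA` with the mean of its interval, not the chunk above: `toy`, `intervalMean` and `imputed` are hypothetical names, and `tapply()` stands in for the dplyr summary.

```r
# invented data: three intervals on three days, the second day entirely missing
toy <- data.frame(
  interval = rep(c("0000", "0005", "0010"), times = 3),
  steps = c(0, 10, 20, NA, NA, NA, 4, 30, 40),
  stringsAsFactors = FALSE
)

# mean steps per interval, ignoring missing values
intervalMean <- tapply(toy$steps, toy$interval, mean, na.rm = TRUE)

# look up the interval mean for every missing row by interval name
imputed <- toy
missing <- is.na(imputed$steps)
imputed$steps[missing] <- as.vector(intervalMean[imputed$interval[missing]])
```

Looking the mean up by interval name keeps the result correct even if missing values do not cover whole days or the rows are not sorted.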
There were **`r sum(is.na(sourcedData$steps))` NAs** in the `sourcedData` data.
There are **`r NAcount` NAs** in the `constructedData` data.
```{r Q2, echo=TRUE, fig.width=10}
Q2<-as.data.frame(summarise(group_by(constructedData,timeOnly,interval),
                            mean(steps,na.rm = TRUE)
))
names(Q2)<-c("time","interval","steps.Mean")
keep("Q2")
with(constructedData,
     plot(x=timeOnly,
          y=steps,
          col=rgb(.5,.5,.5,.1),
          pch=20,
          ylab="Number of Steps",
          xlab="Time of Day",
          main = "Plot 2: Activity over a day"))
# overlay the interval averages as a line
# (ylim is dropped because it is not a graphical parameter for an overlay)
with(Q2,lines(x=time,
              y=steps.Mean,col="red"))
legend("topright",
       legend=c("average number of steps at time interval"),
       col=c("red"),
       lty=c(1),
       lwd=c(1),
       box.col="transparent",
       bg = "transparent" )
```
The average activity across a day for the study period can be seen in Plot 2.
```{r activityPeak, echo=TRUE}
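activityPeak<-list(NULL)
# keep() expects the variable name as a character string
keep("activityPeak")
# Q2$time is POSIXct, so format() is used; 299 seconds later is the end of the interval
activityPeak$most.lower<-format(Q2$time[Q2$steps.Mean==max(Q2$steps.Mean)],"%T")
activityPeak$most.upper<-format(Q2$time[Q2$steps.Mean==max(Q2$steps.Mean)]+299,"%T")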
```
The **most active time interval** is in the morning at **`r activityPeak$most.lower`** - **`r activityPeak$most.upper`**.
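
How the bounds of the peak interval are derived can be sketched with invented numbers. The values and the names `intervalStart`, `meanSteps`, `lower` and `upper` are illustrative assumptions; a fixed `UTC` time zone is used only so the example does not depend on the local clock.

```r
# four consecutive 5 minute intervals with invented average step counts
intervalStart <- as.POSIXct("2012-10-01 08:30:00", tz = "UTC") + (0:3) * 300
meanSteps <- c(12, 48, 30, 7)

# position of the interval with the highest average
peak <- which.max(meanSteps)

# an interval spans 300 seconds, so its last second is start + 299
lower <- format(intervalStart[peak], "%H:%M:%S", tz = "UTC")
upper <- format(intervalStart[peak] + 299, "%H:%M:%S", tz = "UTC")
```

`which.max()` returns only the first maximum, whereas comparing against `max()` returns every tied interval; which is preferable depends on whether ties are expected.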
### Question 3 - Are there differences in activity patterns between weekdays and weekends?
A new factor is made in the column `weekdayType` within the `constructedData` data.frame.
```{r}
constructedData$weekdayType<-as.factor(
  ifelse(constructedData$day=="Saturday"|constructedData$day=="Sunday",
         "weekend","weekday"))
Q3<-split(constructedData,constructedData$weekdayType)
keep("Q3")
str(Q3)
```
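Because `weekdays()` returns day names in the language of the current locale, comparing against `"Saturday"` and `"Sunday"` assumes an English locale. A locale-independent alternative, sketched here on a few dates rather than on `constructedData`, is to use the numeric weekday from `as.POSIXlt()`; `dates`, `wday` and `weekdayType` are illustrative names.

```r
# Friday to Monday in the first week of the study period
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# numeric day of week: 0 is Sunday, 6 is Saturday, regardless of locale
wday <- as.POSIXlt(dates)$wday

weekdayType <- factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
                      levels = c("weekday", "weekend"))
```

Fixing the factor levels explicitly also guarantees that `split()` returns the `weekday` and `weekend` elements even if one of the groups happens to be empty.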
```{r, fig.height=7, fig.width=10}
par( mfrow= c(2,1), mar = c(4,4,3,1) )
#plots the weekday points
plot(x=Q3$weekday$timeOnly,y=Q3$weekday$steps,
     main="Plot3A - Activity on the weekdays",
     xlab="time of day",
     ylab="number of steps",
     col=rgb(.5,.5,.5,.1),
     pch=20
)
#get the average for the weekdays
Q3$avg.Weekday<-as.data.frame(summarise(group_by(Q3$weekday,timeOnly,interval),
                                        mean(steps,na.rm = TRUE)
))
names(Q3$avg.Weekday)<-c("time","interval","steps.Mean")
with(Q3$avg.Weekday,lines(x=time,
                          y=steps.Mean,col="red"))
#the legend is drawn before the next plot() call so it lands on the weekday panel
legend("topleft",
       legend=c("average number of steps at time interval"),
       col=c("red"),
       lty=c(1),
       lwd=c(1),
       box.col="transparent",
       bg = "transparent" )
#plots the weekend points
plot(x=Q3$weekend$timeOnly,y=Q3$weekend$steps,
     main="Plot3B - Activity on the weekend days",
     xlab="time of day",
     ylab="number of steps",
     col=rgb(.5,.5,.5,.1),
     pch=20)
#get the average for the weekends
Q3$avg.weekend<-as.data.frame(summarise(group_by(Q3$weekend,timeOnly,interval),
                                        mean(steps,na.rm = TRUE)
))
names(Q3$avg.weekend)<-c("time","interval","steps.Mean")
with(Q3$avg.weekend,lines(x=time,
                          y=steps.Mean,col="red"))
legend("topleft",
       legend=c("average number of steps at time interval"),
       col=c("red"),
       lty=c(1),
       lwd=c(1),
       box.col="transparent",
       bg = "transparent" )
```
Plot 3 separates out the weekly routine into weekends and weekdays. It shows that
there is a slightly different pattern for weekends.
It appears that around 600 steps within a 5 minute period is the subject's
natural walking pace, and when they are in a rush they can achieve 800 steps
within an interval.
On weekends the peak values occur around mid-afternoon and there is
generally more activity throughout the day compared to weekdays. The
morning period around 9:00 am is active on both weekends and weekdays, but weekdays
are faster paced around this period.
Weekdays also show occasional short, fast-paced spikes around lunch time and at
around 15:00. Activity begins to die down around 19:00, whereas on weekends it can
be up to an hour later.
```{r tidyup, include=FALSE}
cleanWorkspace()
```
[1]: https://github.com/JulesBuh/RepData_PeerAssessment/blob/master/PA1_template.md#question-1---what-is-mean-total-number-of-steps-taken-per-day
[2]: https://github.com/JulesBuh/RepData_PeerAssessment/blob/master/PA1_template.md#question-2---what-is-the-average-daily-activity-pattern
[3]: https://github.com/JulesBuh/RepData_PeerAssessment/blob/master/PA1_template.md#question-3---are-there-differences-in-activity-patterns-between-weekdays-and-weekends
[4]: https://github.com/JulesBuh/RepData_PeerAssessment/blob/master/PA1_template.md#variables