---
title: "caret_tidymodels"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Comparing `caret` versus `tidymodels` on the same dataset
### Reference
- [Caret vs Tidymodels, Yu En Hsu](https://github.com/yuenhsu/Machine-Learning-Projects)
- [Bike Sharing Dataset from UCI repository](https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset)
- [Caret vs. tidymodels - comparing the old and new](https://konradsemsch.netlify.app/2019/08/caret-vs-tidymodels-comparing-the-old-and-new/)
### Data
Bike sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has become automatic. Through these systems, a user can easily rent a bike at one location and return it at another. Currently, there are over 500 bike-sharing programs around the world, comprising over 500 thousand bicycles. Today, there is great interest in these systems because of their important role in traffic, environmental, and health issues.

Apart from the interesting real-world applications of bike sharing systems, the characteristics of the data they generate make them attractive for research. Unlike other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded. This feature turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most important events in the city could be detected by monitoring these data.
### Attributes
- instant: record index
- dteday : date
- season : season (1:winter, 2:spring, 3:summer, 4:fall)
- yr : year (0: 2011, 1:2012)
- mnth : month ( 1 to 12)
- hr : hour (0 to 23)
- holiday : whether the day is a holiday or not (extracted from [Web Link])
- weekday : day of the week
- workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0
- weathersit : weather situation
    - 1: Clear, Few clouds, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp : Normalized temperature in Celsius. The values are divided by 41 (max)
- atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)
- hum: Normalized humidity. The values are divided by 100 (max)
- windspeed: Normalized wind speed. The values are divided by 67 (max)
- casual: count of casual users
- registered: count of registered users
- cnt: count of total rental bikes including both casual and registered
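
As a rough illustration of the normalization described in the list above, the scaled weather columns can be mapped back to their original units by multiplying by the stated maximum. The sketch below uses a small made-up tibble (`toy_weather` is hypothetical, not rows from the bike data) and assumes the divisors listed above (41, 50, 100, 67) are the ones actually used in the file.

```{r denormalize_sketch}
library(dplyr)

# Hypothetical values for illustration only; not rows from the bike data
toy_weather <- tibble(
  temp      = c(0.24, 0.50, 1.00),
  atemp     = c(0.30, 0.50, 1.00),
  hum       = c(0.81, 0.50, 1.00),
  windspeed = c(0.00, 0.50, 1.00)
)

# Multiply each normalized column by the maximum it was divided by
toy_weather %>%
  mutate(
    temp_c        = temp * 41,      # degrees Celsius
    atemp_c       = atemp * 50,     # feeling temperature, degrees Celsius
    hum_pct       = hum * 100,      # percent
    windspeed_raw = windspeed * 67  # original wind speed units
  ) ->
  toy_weather_raw

toy_weather_raw
```

The new column names (`temp_c`, `atemp_c`, `hum_pct`, `windspeed_raw`) are chosen here for illustration and do not appear elsewhere in this document.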
***
### Load libraries
```{r libraries, message=FALSE, collapse=TRUE}
library(tidymodels)
library(caret)
library(lubridate)
library(tidyverse)
library(moments)
library(corrr)
library(randomForest)
```
```{r loaddata, message=FALSE, collapse=TRUE}
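bike <- read_csv("datasets/other/DataBikeSharing.csv")
bike %>% dim() # tidyverse pipe: number of rows and columns
dim(bike)      # base R: number of rows and columns
nrow(bike)     # number of rows
ncol(bike)     # number of columns
length(bike)   # also the number of columns (a data frame is a list of columns)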
```
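The size checks in the chunk above overlap, which is easier to see on a tiny example. The sketch below uses a hypothetical 3 x 2 data frame (`toy` is not part of the bike data) to show what each call returns.

```{r dims_sketch}
# Hypothetical 3 x 2 data frame for illustration
toy <- data.frame(hr = c(0, 1, 2), cnt = c(16, 40, 32))

dim(toy)     # rows and columns together: 3 2
nrow(toy)    # number of rows: 3
ncol(toy)    # number of columns: 2
length(toy)  # a data frame is a list of columns, so this equals ncol(toy): 2
```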
### Remove and rename columns
```{r bike}
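# Tidyverse alternative (commented out): drop `instant`, convert `yr` to the
# calendar year, and rename the columns in a single pipeline.
# bike %>%
#   mutate(instant = NULL, yr = yr + 2011) %>%
#   rename(
#     date = dteday,
#     year = yr,
#     month = mnth,
#     hour = hr,
#     weather = weathersit,
#     humidity = hum,
#     total = cnt
#   ) ->
#   bike
head(bike)
# Remove the first column (the record index)
bike <- subset(bike, select = -instant)
# Convert the year code to the calendar year (0 = 2011, 1 = 2012)
bike$yr <- bike$yr + 2011
# Rename single columns with base R
names(bike)[names(bike) == "dteday"] <- "date"
colnames(bike)[colnames(bike) == "yr"] <- "year"
# Rename with data.table::setnames(), which changes the names in place
library(data.table)
setnames(bike, "mnth", "month")
# Rename all 16 remaining columns at once; the order must match the columns
nms <- c("date", "season", "year", "month", "hour", "holiday", "weekday", "workingday", "weather", "temp", "atemp", "humidity", "windspeed", "casual", "registered", "total")
setnames(bike, nms)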
```
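The chunk above mixes three renaming styles. As a minimal sketch of how they compare, the example below applies each one to a hypothetical three-column stand-in (`toy_raw`, made up for illustration) and checks that the resulting column names agree. The object names `toy_dplyr`, `toy_base`, and `toy_dt` are likewise assumptions of this sketch.

```{r rename_sketch}
library(dplyr)
library(data.table)  # used here only for as.data.table() and setnames()

# Hypothetical two-row stand-in for the raw bike data
toy_raw <- tibble(
  dteday = c("2011-01-01", "2012-01-01"),
  yr     = c(0, 1),
  cnt    = c(16, 40)
)

# 1. dplyr: rename(new_name = old_name)
toy_dplyr <- toy_raw %>%
  mutate(yr = yr + 2011) %>%
  rename(date = dteday, year = yr, total = cnt)

# 2. Base R: overwrite the matching entries of names()
toy_base <- as.data.frame(toy_raw)
toy_base$yr <- toy_base$yr + 2011
names(toy_base)[names(toy_base) == "dteday"] <- "date"
names(toy_base)[names(toy_base) == "yr"] <- "year"
names(toy_base)[names(toy_base) == "cnt"] <- "total"

# 3. data.table: setnames(x, old, new) renames in place, without reassignment
toy_dt <- as.data.table(toy_raw)
toy_dt$yr <- toy_dt$yr + 2011
setnames(toy_dt, c("dteday", "yr", "cnt"), c("date", "year", "total"))

# All three approaches end with the same column names
identical(names(toy_dplyr), names(toy_base))
identical(names(toy_dplyr), names(toy_dt))
```

Passing both the old and the new names to `setnames()` is less fragile than passing only a full vector of new names, because it does not depend on the column order.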
```{r bike_long}
bike %>%
  pivot_longer(
    cols = c(casual, registered, total),
    names_to = "usertype",
    values_to = "count"
  ) ->
  bike_long
head(bike_long)
tail(bike_long)
# The same reshape could be done with the data.table package
```
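The closing comment in the chunk above mentions data.table as an alternative. One way this might look is with `melt()`, data.table's wide-to-long reshape. The sketch below runs on a hypothetical two-row table (`toy_wide`, made up for illustration) rather than on `bike`, and assumes the same three count columns.

```{r melt_sketch}
library(data.table)

# Hypothetical two-row stand-in for the renamed bike data
toy_wide <- data.table(
  hour       = c(0, 1),
  casual     = c(3, 8),
  registered = c(13, 32),
  total      = c(16, 40)
)

# Stack the three count columns into one `count` column,
# with the source column recorded in `usertype`
toy_long <- melt(
  toy_wide,
  id.vars       = "hour",
  measure.vars  = c("casual", "registered", "total"),
  variable.name = "usertype",
  value.name    = "count"
)

toy_long
```

Two differences from `pivot_longer()` are worth noting: `melt()` stacks the result column by column rather than row by row, and by default it stores `usertype` as a factor rather than a character vector.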
## Exploring and plotting data using ggplot {.tabset}
### Target variable
```{r plot_target}
# Rental count
bike_long %>%
  ggplot(aes(count, colour = usertype)) +
  geom_density() +
  labs(
    title = "Distribution of the number of rental bikes",
    x = "Number per hour", y = "Density"
  ) +
  scale_colour_discrete(
    name = "User type",
    breaks = c("casual", "registered", "total"),
    labels = c("Non-registered", "Registered", "Total")
  )
```
### By year
```{r plot_year}
# Compare casual and registered counts by year (the total is left out)
bike_long %>%
  filter(usertype != "total") %>%
  ggplot(aes(as.factor(year), count)) +
  geom_violin(aes(fill = usertype))
```
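A violin plot shows the shape of each distribution but not exact figures. As a possible complement, the sketch below computes a few summary statistics per year and user type with `group_by()` and `summarise()`. It runs on a hypothetical long-format tibble (`toy_counts`, with made-up values) and assumes the same `year`, `usertype`, and `count` columns as `bike_long`.

```{r summary_sketch}
library(dplyr)

# Hypothetical long-format rows for illustration; not the real counts
toy_counts <- tibble(
  year     = c(2011, 2011, 2012, 2012, 2011, 2011, 2012, 2012),
  usertype = rep(c("casual", "registered"), each = 4),
  count    = c(3, 8, 5, 14, 13, 32, 20, 60)
)

# One row per year and user type
toy_summary <- toy_counts %>%
  group_by(year, usertype) %>%
  summarise(
    median_count = median(count),
    mean_count   = mean(count),
    max_count    = max(count),
    .groups = "drop"
  )

toy_summary
```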