---
title: "Exploratory Data Analysis 1"
subtitle: "R Notes"
author: "Luc Anselin and Grant Morrison^[University of Chicago, Center for Spatial Data Science -- [email protected],[email protected]]"
date: "08/06/2018"
output:
  html_document:
    fig_caption: yes
    self_contained: no
    toc: yes
    toc_depth: 4
    css: tutor.css
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
<br>
## Introduction
This notebook covers the functionality of the [Exploratory Data Analysis 1](https://geodacenter.github.io/workbook/2a_eda/lab2a.html) section of the GeoDa workbook. We refer to that document for details on the methodology, references, etc. The goal of these notes is to approximate as closely as possible the operations carried out using GeoDa by means of a range of R packages.
The notes are written with R beginners in mind; more seasoned R users can probably skip most of the comments
on data structures and other R particulars. Also, as always in R, there are typically several ways to achieve a specific objective, so what is shown here is just one way that works, but there often are others (that may even be more elegant, work faster, or scale better).
For this notebook, we will use socioeconomic data about NYC from the GeoDa website. Our goal in this lab is to show how to implement exploratory data analysis methods with one and two variables.
### Objectives
After completing the notebook, you should know how to carry out the following tasks:
- Creating basic univariate plots
- Creating scatterplots
- Implementing different regression methods (linear, loess, and lowess)
- Creating interactive plots
- Taking advantage of shiny functionality for more advanced interactions with the data
#### R Packages used
- **ggplot2**: To make statistical plots. We use this rather than base R for increased functionality and more aesthetically pleasing plots.
- **gap**: To run the Chow test in our shiny application.
- **plotly**: This is used to make our scatterplot interactive, so we can select data directly from the scatterplot for the Chow test.
- **shiny**: To make a reactive application for the Chow test.
#### R Commands used
Below follows a list of the commands used in this notebook. For further details
and a comprehensive list of options, please consult the
[R documentation](https://www.rdocumentation.org).
- **Base R**: `install.packages`, `library`, `head`, `summary`, `print`, `lm`, `lines`, `plot`, `read.csv`, `lowess`
- **ggplot2**: `ggplot`, `geom_boxplot`, `geom_histogram`, `geom_point`, `geom_smooth`
- **gap**: `chow.test`
- **plotly**: `plot_ly`, `layout`, `event_data`
- **shiny**: `renderPlotly`, `renderPrint`, `shinyApp`, `fluidPage`, `plotlyOutput`, `verbatimTextOutput`
## Preliminaries
Before starting, make sure to have the latest version of R and of packages that are compiled for the matching version of R (this document was created using R 3.5.1 of 2018-07-02). Also, optionally, set a working directory, even though we will not
actually be saving any files.^[Use `setwd(directorypath)` to specify the working directory.]
### Load packages
First, we load all the required packages using the `library` command. If you don't have some of these in your system, make sure to install them first as well as
their dependencies.^[Use
`install.packages(packagename)`.] You will get an error message if something is missing. If needed, just install the missing piece and everything will work after that.
```{r}
library(ggplot2)
library(shiny)
library(plotly)
library(gap)
```
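If you are not sure which of these packages are already installed, a small check along the following lines may help. This is only a minimal sketch and not part of the original workflow: the names `pkgs`, `installed` and `missing_pkgs` are illustrative, and the install call is left commented out so that nothing is installed by accident.

```r
# Packages used in this notebook
pkgs <- c("ggplot2", "shiny", "plotly", "gap")

# requireNamespace() returns FALSE (instead of stopping) when a package is absent
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install them
} else {
  message("All required packages are installed.")
}
```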
## Obtaining the Data from the GeoDa website
To get the data for this notebook, you will need to go to [NYC Data](https://geodacenter.github.io/data-and-lab/nyc/). The download is a zip file, so you will need to unzip it, for example by double-clicking on the file in your file browser. From there, move the resulting folder, titled `nyc`, into your working directory to continue. Once that is done, you can use the base R function `read.csv` to read the data into your R environment. There are faster table-reading functions, but for small datasets such as ours, `read.csv` is sufficient.
```{r}
nyc.data <- read.csv("nyc/nyc.csv")
head(nyc.data)
```
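If `read.csv` fails, the usual cause is that the `nyc` folder is not where R expects it. The following sketch, which assumes the folder layout described above, checks for the file first and then prints the dimensions and variable types instead of only the first rows. The name `csv_path` is just illustrative.

```r
# Path assumed from the instructions above: the unzipped "nyc" folder
# sits directly inside the working directory
csv_path <- file.path("nyc", "nyc.csv")

if (file.exists(csv_path)) {
  nyc.data <- read.csv(csv_path)
  # dim() gives the number of rows and columns; str() lists each variable and its type
  print(dim(nyc.data))
  str(nyc.data)
} else {
  message("Could not find ", csv_path, " under ", getwd())
}
```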
## Univariate Data Exploration
Before we begin using **ggplot2**, it is important to get a sense of how plots are built with this library. All **ggplot2** plots start with the `ggplot` function, where the dataset is specified and the axes are set. This gives a base layer, and you then add to it with `+` following the command. You can add points, lines, and many other layers to this base layer. This approach is intuitive and makes the code easier to read.
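To make the layering idea concrete, here is a small sketch that uses the built-in `mtcars` data as a stand-in for our own dataset. The object names `base`, `with_points` and `with_line` are illustrative only.

```r
library(ggplot2)

# Base layer: data and aesthetic mappings only, nothing is drawn yet
base <- ggplot(data = mtcars, aes(x = wt, y = mpg))

# Each + adds one layer on top of what is already there
with_points <- base + geom_point()
with_line <- with_points + geom_smooth(method = "lm", se = FALSE)

# The number of layers grows as geoms are added
length(base$layers)
length(with_points$layers)
length(with_line$layers)
```

Because each step returns a new plot object, you can keep the base layer around and try out different geoms on it without retyping the data and axis specification.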
### Box Plot
Our first plot will be a box plot. We start with the base specifications, then add `geom_boxplot`. Inside the `ggplot` function we specify the data with `data =` and the axes of the plot with `aes`. We set the y axis to our chosen variable to get a vertical box plot.
There is no convenient way to put summary statistics on the plot itself, so we use base R functionality in conjunction with our plot. The `summary` command gives us summary statistics for our chosen variable.
```{r}
ggplot(data = nyc.data, aes(x = "", y = kids2009)) +
  geom_boxplot()
summary(nyc.data$kids2009)
```
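The points drawn beyond the whiskers are determined by the interquartile range. As a rough illustration of the 1.5 × IQR rule that `geom_boxplot` uses by default, the sketch below computes the fences by hand in base R. The vector `x` is made up for the example and is not taken from the NYC data.

```r
# Hypothetical values, with one unusually large observation
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 30)

q1 <- unname(quantile(x, 0.25))
q3 <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Fences at 1.5 times the interquartile range beyond the quartiles
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Observations outside the fences are the ones drawn as individual points
outliers <- x[x < lower_fence | x > upper_fence]
outliers
```

The same steps can be applied to `nyc.data$kids2009` to list the outlying observations numerically.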
### Histogram
We will now make a histogram. It follows the same form as the command for a box plot, with a few differences. We still specify the dataset and the aesthetics, but in this case we do not give both an x and a y; we just enter our chosen variable in `aes`. A histogram describes a single variable, so only the x aesthetic is needed, and `geom_histogram` computes the bar heights (the counts) itself.
```{r}
ggplot(data = nyc.data, aes(kids2009)) +
  geom_histogram()
summary(nyc.data$kids2009)
```
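When called without arguments, `geom_histogram` falls back on a default number of bins and prints a message suggesting that you pick a better value. The sketch below, again with `mtcars` standing in for our data, sets the number of bins explicitly and then inspects the computed counts. The choice of 7 bins is only an example.

```r
library(ggplot2)

# bins sets the number of bars; binwidth would set their width instead
p <- ggplot(data = mtcars, aes(mpg)) +
  geom_histogram(bins = 7, colour = "white")
p

# layer_data() returns the values ggplot2 computed for the first layer
bar_data <- layer_data(p, 1)
bar_data[, c("xmin", "xmax", "count")]
```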
## Bivariate Data Exploration
In this section we will do some bivariate data exploration. This will be done through scatterplots and regression. We will implement **lm**, **loess**, and **lowess** regression in this section. **lm** fits the line of best fit by ordinary least squares, and its summary reports an R-squared value. **loess** is a local polynomial regression. **lowess** is a locally weighted scatterplot smoother.
We begin with the `ggplot` function, specifying the dataset and choosing the x and y variables. For the x axis we choose **kids2000**, which is the percentage of households with kids under the age of 18. For the y axis we choose **pubast00**, the percentage of households receiving public assistance. We then add `geom_point` to get a scatterplot of our two variables.
```{r}
ggplot(data = nyc.data, aes(x=kids2000, y=pubast00)) +
  geom_point(shape=1)
```
### lm regression
Adding a regression line to our scatterplot is very simple. We just add the `geom_smooth` function to our command from above. This gives us a line of best fit with a shaded region that indicates a 95% confidence interval for the line.
```{r}
ggplot(data = nyc.data, aes(x=kids2000, y=pubast00)) +
  geom_point(shape=1) +
  geom_smooth(method=lm)
```
To turn off the shaded region, we just set `se=FALSE` in `geom_smooth`.
```{r}
ggplot(data = nyc.data, aes(x=kids2000, y=pubast00)) +
  geom_point(shape=1) +
  geom_smooth(method=lm,se=FALSE)
```
**ggplot2** doesn't have a convenient way to put the regression statistics in our plot, so we will use base R to get these statistics separately. To do this we need the `lm` command, in which we specify the dataset and the two variables for the regression.
```{r}
linear_mod <- lm(pubast00 ~ kids2000, data=nyc.data)
print(linear_mod)
```
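Printing an `lm` object only shows the estimated coefficients. To also see standard errors, p-values and the R-squared value, you can pass the fitted model to `summary`. The sketch below does this with the built-in `mtcars` data as a stand-in; the names `fit`, `fit_summary` and `r_squared` are illustrative.

```r
# Built-in data stands in for nyc.data here
fit <- lm(mpg ~ wt, data = mtcars)
fit_summary <- summary(fit)

# Coefficient table: estimates, standard errors, t values, p values
coef(fit_summary)

# Share of the variance in mpg explained by the fitted line
r_squared <- fit_summary$r.squared
r_squared

# With a single predictor, R-squared equals the squared correlation
all.equal(r_squared, cor(mtcars$mpg, mtcars$wt)^2)
```

Applied to our model, `summary(linear_mod)` gives the same kind of output for the regression of `pubast00` on `kids2000`.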
### loess regression
Now we will implement a nonparametric regression. This is simple to do. We just change the method from lm to loess in `geom_smooth`.
```{r}
ggplot(data = nyc.data, aes(x=kids2000, y=pubast00)) +
  geom_point(shape=1) +
  geom_smooth(method=loess)
```
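How smooth the loess curve looks depends mainly on the span, that is, the fraction of observations used in each local fit. In `geom_smooth` this can be set with the `span` argument. The sketch below uses base R's `loess` on simulated data (not the NYC data) to show the effect; all object names are illustrative.

```r
# Simulated data: a sine curve plus noise (purely illustrative)
set.seed(1)
x <- seq(0, 10, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.3)

# span is the fraction of observations used in each local fit
fit_wide <- loess(y ~ x, span = 0.75)    # the default span
fit_narrow <- loess(y ~ x, span = 0.25)  # follows local wiggles more closely

# Compare how closely each curve tracks the observed points
rss_wide <- sum(residuals(fit_wide)^2)
rss_narrow <- sum(residuals(fit_narrow)^2)
c(wide = rss_wide, narrow = rss_narrow)
```

A smaller span follows the data more closely, at the risk of chasing noise, so it is worth trying a few values and comparing the resulting curves by eye.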
### lowess regression
**ggplot2** doesn't have lowess regression. We will use base R to do this instead. It is less aesthetically pleasing, but still gets the job done. To do this, we start with the `plot` command and then add lines with the `lines` command. For `plot`, we specify the x variable, then the y variable, and then use `main =` to give it a title.
```{r}
plot(nyc.data$kids2000, nyc.data$pubast00, main="lowess(nyc_data)")
lines(lowess(nyc.data$kids2000, nyc.data$pubast00), col=2)
lines(lowess(nyc.data$kids2000, nyc.data$pubast00, f=.2), col=3)
```
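The two `lines` calls differ only in the smoother span `f`: the first uses the default of 2/3, the second uses 0.2, which gives a more wiggly curve. The following sketch looks at what `lowess` actually returns, using the built-in `cars` data as a stand-in. The object names are illustrative.

```r
# lowess() returns a list with sorted x values and the smoothed y values
smooth_default <- lowess(cars$speed, cars$dist)         # f = 2/3 by default
smooth_tight <- lowess(cars$speed, cars$dist, f = 0.2)  # smaller f, less smoothing

names(smooth_default)

# Side-by-side comparison of the two smoothed curves
head(data.frame(speed = smooth_default$x,
                default = smooth_default$y,
                tight = smooth_tight$y))
```

Because the result is just a pair of `x` and `y` vectors, it can be passed directly to `lines`, as in the chunk above.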
## shiny applications
In this section we will implement an interactive plot that outputs Chow test statistics. We will take advantage of **shiny** reactive expressions to make this application. This won't be a complete guide to **shiny** applications, as there is far too much to cover for that to be feasible. A **shiny** app can be useful in situations where you want to show aspects of data through interactive visuals on the web. Otherwise, it is normally easier to use other software. For instance, we are building an app here that allows the user to select points from the scatterplot and outputs Chow test statistics. This is already implemented in the GeoDa software. Instead of building a whole app, you can just download the software and have the functionality at your fingertips instantly.
**shiny** apps consist of three parts: a user interface, a server, and the command that launches the app by using the server and ui as arguments.
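Before turning to our own app, the three parts can be seen in the smallest possible form in the sketch below. It is a toy example, separate from the Chow test app, and the names `ui_demo`, `server_demo`, `app_demo` and the output id `greeting` are made up for illustration.

```r
library(shiny)

# 1. User interface: what the page contains
ui_demo <- fluidPage(
  verbatimTextOutput("greeting")
)

# 2. Server: how each output is computed
server_demo <- function(input, output, session) {
  output$greeting <- renderPrint({
    "Hello from the server"
  })
}

# 3. App object: ties the two together; printing it (or calling runApp) launches the app
app_demo <- shinyApp(ui_demo, server_demo)
class(app_demo)
```

The id used in the interface (`"greeting"`) must match the name assigned on `output` in the server; that match is how shiny connects the two halves.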
### user interface
The user interface is where you structure the layout of your application. In our case, it is fairly simple as we just need some text output and a plot.
```{r}
ui <- fluidPage(
  plotlyOutput("plot"),
  verbatimTextOutput("brush")
)
```
### server
This is the second part of a shiny app: the server. This will be the hardest part of the code to navigate. For this app, it consists of two parts: rendering the plot and rendering the text. Rendering the plot is relatively simple; we just need to use the `plot_ly` function and specify our x, y, and key variables. The key variable is important for identifying which observations have been selected, and which have not, for the Chow test.
```{r}
server <- function(input, output, session) {
  output$plot <- renderPlotly({
    # use the key aesthetic/argument to help uniquely identify selected observations
    plot_ly(nyc.data, x = ~kids2000, y = ~pubast00, key = ~subborough) %>%
      layout(dragmode = "select")
  })
  output$brush <- renderPrint({
    # d is event data, gained from selecting points on the plot
    d <- event_data("plotly_selected")
    if (is.null(d)) {
      "Select data for the Chow Test"
    } else {
      # m holds the observations that were not selected
      m <- nyc.data[!(nyc.data$subborough %in% unlist(d$key)), ]
      # run the Chow test: unselected observations versus selected ones
      chow.test(m$pubast00, m$kids2000, d$y, d$x, x = NULL)
    }
  })
}
```
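To see what the Chow test is measuring, it can help to compute the statistic by hand once. The sketch below uses the textbook form of the test on simulated data with two clearly different regimes: it compares the residual sum of squares of one pooled regression with that of two separate regressions. This is an illustration under those assumptions, not the internals of `chow.test`, and all names in it are made up for the example.

```r
# Two simulated groups with different intercepts and slopes (illustrative only)
set.seed(42)
x1 <- runif(40)
y1 <- 1 + 2 * x1 + rnorm(40, sd = 0.2)
x2 <- runif(40)
y2 <- 3 - 1 * x2 + rnorm(40, sd = 0.2)

# Residual sum of squares of a simple linear regression of y on x
rss <- function(y, x) sum(residuals(lm(y ~ x))^2)

k <- 2                                       # parameters per regression
n <- length(y1) + length(y2)                 # total number of observations
rss_pooled <- rss(c(y1, y2), c(x1, x2))      # one line for all observations
rss_split <- rss(y1, x1) + rss(y2, x2)       # a separate line for each group

# F statistic: improvement from splitting, relative to the unexplained variation
chow_f <- ((rss_pooled - rss_split) / k) / (rss_split / (n - 2 * k))
p_value <- pf(chow_f, k, n - 2 * k, lower.tail = FALSE)
c(F = chow_f, p = p_value)
```

A small p-value suggests that the selected and unselected observations follow different regression lines, which is the question the app asks each time you brush a set of points.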
The third part of the app is the `shinyApp` command, which launches the app. If everything is in order with the server and ui portions of the code, the app should work.
```{r}
shinyApp(ui, server)
```