04-dataR.Rmd

---
---
# (APPENDIX) Appendix {-} 

# Data Wrangling in R

In this part of our toolkit, we're going to learn how to do the same things we did with Chapter 4 - Spatial Data Wrangling, but this time, we'll use R code to handle our spatial data. 

### Getting Started {-}

R is a great choice for starting in data science because it's built for it. It's not just a programming language, it is a whole system with tools and libraries made to help you think and work like a data scientist easily.

We assume a basic knowledge of R and coding languages for these toolkits. For most of the tutorials in this toolkit, you’ll need to have R and RStudio downloaded and installed on your system. You should be able to install packages, know how to find the address to a folder on your computer system, and have very basic familiarity with R.

### Tutorials for R {-}

If you are new to R, we recommend the following [intro-level tutorials](https://learn.datacamp.com/courses/free-introduction-to-r) provided through [installation guides](https://rspatial.org/intr/1-introduction.html). You can also refer to this [R for Social Scientists](https://datacarpentry.org/r-socialsci/) tutorial developed by Data Carpentry for a refresher.

You can also visit the [RStudio Education](https://education.rstudio.com/learn/) page to select a learning path tailored to your experience level ([Beginners](https://education.rstudio.com/learn/beginner/), [Intermediates](https://education.rstudio.com/learn/intermediate/), [Experts](https://education.rstudio.com/learn/expert/)). They offer detailed instructions to learners at different stages of their R journey.


## Environmental Setup

Getting started with data analysis in R involves a few preliminary steps, including downloading datasets and setting up a working directory. This introduction will guide you through these essential steps to ensure a smooth start to your data analysis journey in R.

:::tools
**Download the Activity Datasets**

Please download and unzip this file to get started: [SDOHPlace-DataWrangling.zip](https://github.com/healthyregions/sdohplace-toolkit/raw/main/data/SDOHPlace-DataWrangling.zip)
:::

### Setting Up the Working Directory {-}

Setting up a working directory in R is crucial as it defines the location on your computer where your files and scripts will be saved and accessed. You can set the working directory to any folder on your system where you plan to store your datasets and R scripts. To set your working directory, use the `setwd("/path/to/your/directory")` and specify the path to your desired directory. 

### Installing & Working with R Libraries {-}

Before starting operations related to spatial data, we need to complete an environmental setup. This workshop requires several packages, which can be installed from CRAN:

- `sf`: simplifies spatial data manipulation
- `tmap`: streamlines thematic map creation
- `dplyr`: facilitates data manipulation
- `tidygeocoder`: converts addresses to coordinates easily

Uncomment to install packages with code snippet below. You only need to install packages once in an R environment.

```{r eval=FALSE}
#install.packages("sf", "tmap", "tidygeocoder", "dplyr")
```

:::tip
**Installation Tip**

For Mac users, check out https://github.com/r-spatial/sf for additional tips if you run into errors when installing the `sf` package. Using homebrew to install `gdal` usually fixes any remaining issues.
:::

Now, loading the required libraries for further steps:

```{r load-libraries, warning=FALSE, message=FALSE}
library(sf)
library(dplyr)
library(tmap)
```


## Intro to Spatial Data

Spatial data analysis in R provides a robust framework for understanding geographical information, enabling users to explore, visualize, and model spatial relationships directly within their data. Through the integration of specialized packages like sf for spatial data manipulation, ggplot2 and tmap for advanced mapping, and tidygeocoder for geocoding, R becomes a powerful tool for geographic data science. This ecosystem allows researchers and analysts to uncover spatial patterns, analyze geographic trends, and produce detailed maps that convey complex information intuitively.

### Load Spatial Data {-}

We need to load the spatial data (shapefile). Remember, this type of data is actually comprised of multiple files. All need to be present in order to read correctly. Let's use **chicagotracts.shp** for practice, which includes the census tracts boundary in Chicago.

First, we need to read the shapefile data from where you save it.

```{r eval=FALSE}
Chi_tracts = st_read("SDOHPlace-DataWrangling/chicagotracts.shp")
```

Your output will look something like:

```
## Reading layer `chicagotracts' from data source `./SDOHPlace-DataWrangling/chicagotracts.shp' using driver `ESRI Shapefile'
## Simple feature collection with 801 features and 9 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -87.94025 ymin: 41.64429 xmax: -87.52366 ymax: 42.02392
## Geodetic CRS:  WGS 84
```

Always inspect data when loading in. Let's look at a non-spatial view.

```{r non-spatial-view}
head(Chi_tracts)
```

Check out the data structure of this file.

```{r data-structure}
str(Chi_tracts)

```

The data is no longer a shapefile but an `sf` object, comprised of polygons. The `plot()` command in R help to quickly visualizes the geometric shapes of Chicago's census tracts. The output includes multiple maps because the `sf` framework enables previews of each attribute in our spatial file. 

```{r spatial-view}
plot(Chi_tracts)
```


### Adding a Basemap

Then, we can use `tmap`, a mapping library, in interactive mode to add a basemap layer. It plots the spatial data from `Chi_tracts`, applies a minimal theme for clarity, and labels the map with a title, offering a straightforward visualization of the area's census tracts. 

We stylize the borders of the tract boundaries by making it transparent at 50% (which is equal to an alpha level of 0.5). 

```{r tmap-view}
library(tmap)

tmap_mode("view")

tm_shape(Chi_tracts) + tm_borders(alpha=0.5) +
  tm_layout(title = "Census Tract Map of Chicago")
```

Still in the interactive mode (`view'), we can switch to a different basemap. Here we bring in a "Voyager" style map from Carto, a cartographic firm. We'll make the borders less transparent by adjusting the alpha level.


```{r add-basemap}

tmap_mode("view")

tm_basemap("CartoDB.Voyager") + 
  tm_shape(Chi_tracts) + tm_borders(alpha=0.8, col = "gray40") +
  tm_layout(title = "Census Tract Map of Chicago")
```

::: tip

**Tip**

For additional options, you can preview basemaps at the [Leaflet Providers Demo](https://leaflet-extras.github.io/leaflet-providers/preview/). Some basemaps we recommend that work consistently are by:

- CartoDB (Carto)
- Open Street Map
- ESRI

Not all basemaps are available anymore, and some require keys that you'd need to add on your own.

:::


## Coordinate Reference Systems

For this exercise we will use `chicagotracts.shp` to explore how to change the projection of a spatial dataset in R. First, let's check out the current coordinate reference system.

```{r inspect-crs}
st_crs(Chi_tracts)
```

We can use the `st_transform` function to transform CRS. When projecting a dataset of Illinois, the most appropriate NAD83 projection would be NAD83 UTM zone 16N. Chicago sits within the area best covered by **NAD83 / Illinois East (ftUS)** (EPSG:3435). 

```{r }
Chi_tracts.3435 <- st_transform(Chi_tracts, "EPSG:3435")
# Chi_tracts.3435 <- st_transform(Chi_tracts, 3435)
st_crs(Chi_tracts.3435)
```

After change the projection, we can plot the map. We'll swith to the static version of tmap, using the `plot` mode.

```{r}

tmap_mode("plot")

tm_shape(Chi_tracts.3435) + tm_borders(alpha=0.5) + 
  tm_layout(main.title ="EPSG:3435 (ft)",
             main.title.position = "center")
 
```

What if we had used the wrong EPSG code, referencing the wrong projection? Here we'll transform and plot EPSG Code 3561, a coordinate reference system in Haw'aii. 


```{r transform-3561}

Chi_tracts.3561 <- st_transform(Chi_tracts, "EPSG:3561")


tm_shape(Chi_tracts.3561) + tm_borders(alpha=0.5) + 
  tm_layout(main.title ="EPSG:3561 (ft)",
             main.title.position = "center")
 
```

It's obviously not correct -- the wrong CRS can cause a load of trouble. Make sure you specify carefully!

### Refine Basic Map {-}

Let's take a deeper look at the cartographic mapping package, `tmap`. We approach mapping with one layer at a time. Always start with the object you want to map by calling it with the `tm_shape` function. Then, at least one descriptive/styling function follows. There are hundreds of variations and paramater specification.

Here we style the tracts with some semi-transparent borders.

```{r }
library(tmap)

tm_shape(Chi_tracts) + tm_borders(alpha=0.5) 
```

Next we fill the tracts with a light gray, and adjust the color and transparency of borders. We also add a scale bar, positioning it to the left and having a thickness of 0.8 units, and turn off the frame.

```{r }
tm_shape(Chi_tracts) + tm_fill(col = "gray90") + tm_borders(alpha=0.2, col = "gray10") +
  tm_scale_bar(position = ("left"), lwd = 0.8) +
  tm_layout(frame = F)
```

Check out https://rdrr.io/cran/tmap/man/tm_polygons.html for more ideas.

## Converting to Spatial Data


### Convert CSVs to Spatial Data

We are using the `Affordable_Rental_Housing_Developments.csv` in the dataset to show how to convert a csv Lat/Long data to points. First, we need to load the CSV data.

```{r load-csv-data}
housing = read.csv("SDOHPlace-DataWrangling/Affordable_Rental_Housing_Developments.csv")
```

Then, we need to ensure that no column (intended to be used as a coordinate) is entirely empty or filled with NA values.

```{r }
cleaned_housing <- na.omit(housing)
```

Inspect the data to confirm it's doing what you expect it to be doing. What columns will you use to specify the coordinates? In this dataset, we have multiple coordinate options. We'll use latitude and longitude, or rather, longitude as our X value, and latitude as our Y value.
In the data, it's specified as "Longitude" and "Latitude." 

```{r }
head(cleaned_housing)
```

Finally, we start to convert it to points. Be sure you use the CRS of the original coordinates recorded. In this case we weren't sure what CRS that was, so we use `EPSG:4326` to test.

```{r }
points_housing <- st_as_sf(cleaned_housing, coords = c("Longitude", "Latitude"), crs = 4326)
```

View the resulting sf object with a basemap to confirm they are in the right place. Overlay them on top of the tract data, to confirm they are plotting correctly.

```{r }

### First Layer
tm_shape(Chi_tracts) + tm_borders(lwd = 0.5) + 
  
  ### Second Layer
  tm_shape(points_housing) + tm_dots(size = 0.1 )

```

You can change the `tmap_mode` to "view" to add a basemap in an interactive setting, and then switch back to "plot" when complete. Because we'e plotting dots using *tmap*, we'll use the `tm_dots` parameter for styling.

```{r }

tmap_mode("view")

tm_shape(Chi_tracts) + tm_borders(lwd = 0.5) + 
  tm_shape(points_housing) + tm_dots(size = 0.01)

tmap_mode("plot")

```

We'll reproject to `EPSG:3435`, our system standard

```{r }
housing.3435 <- st_transform(points_housing, "EPSG:3435")

```

#### Write Spatial Data

Finally, we can save our points as a spatial dataset. Use 'st_write' to write your spatial object in R to a data format of your choice. Here, we'll write to a geojson file. 

Uncomment to run this line.

```{r }

#st_write(housing.3435, "housing.geojson", driver = "GeoJSON")

```

You could also save as a "housing.shapefile" to get a shapefile format, however you'll get an error noting that some column names are too long and must be shortened. Shapefile formats have a limit of 10 characters for field names. 

```{r }

#st_write(housing.3435, "housing.shp", driver = "ESRI Shapefile")

```
The file may still write, but the column names that were too long may be shortened automatically. 

To change column or field names in R objects, there are dozens of options. Try searching and "googling" different search terms to identify solutions on your own. 

### Geocode Addresses

Here, we will use `chicago_methadone_nogeometry.csv` for practice, which includes methadone centers in Chicago (center names and addresses). First we load the `tidygeocoder` to get our geocoding done. 

```{r warning=FALSE}
library(tidygeocoder)
```

Let's read in and inspect data for methadone maintenance providers. Note, these addresses were made available by SAMSHA, and are known as publicly available information. An additional analysis could call each service to check on access to medication during COVID in Septmber 2020, and the list would be updated further.

```{r}
methadoneClinics <- read.csv("SDOHPlace-DataWrangling/chicago_methadone_nogeometry.csv")

head(methadoneClinics)
```

Let's geocode one address first, just to make sure our system is working. We'll use the "cascade" method which use the US Census and OpenStreetMap geocoders. These two services are the main options with `tidygeocoder`.

```{r}
sample <- geo("2260 N. Elston Ave. Chicago, IL", lat = latitude, long = longitude, method = 'cascade')
head(sample)
```

As we prepare for geocoding, check out the structure of the dataset. The data should be a character to be read properly. 

```{r}
str(methadoneClinics)
```

We need to clean the data a bit. We'll add a new column for a full address, as required by the geocoding service. When you use a geocoding service, be sure to read the documentation and understand how the data needs to be formatted for input.

```{r}
methadoneClinics$fullAdd <- paste(as.character(methadoneClinics$Address), 
                                  as.character(methadoneClinics$City),
                                  as.character(methadoneClinics$State), 
                                  as.character(methadoneClinics$Zip))
```

We're ready to go! Batch geocode with one function, and inspect:

```{r}
geoCodedClinics <-  geocode(methadoneClinics,
address = 'fullAdd', lat = latitude, long = longitude, method = 'cascade')

head(geoCodedClinics)
```

There were two that didn't geocode correctly. You can inspect further. This could involve a quick check for spelling issues; or, searching the address and pulling the lat/long using Google Maps and inputting manually. Or, if we are concerned it's a human or unknown error, we could omit. For this exercise we'll just omit the two clinics that didn't geocode correctly.

```{r}
geoCodedClinics2 <- na.omit(geoCodedClinics)
```

## Convert to Spatial Data

This is not spatial data yet! To convert a static file to spatial data, we use the powerful `st_as_sf` function from `sf`. Indicate the x,y parameters (=longitude, latitude) and the coordinate reference system used. Our geocoding service used the standard **EPSG:4326**, so we input that here.

```{r warning = FALSE}

methadoneSf <- st_as_sf(geoCodedClinics2, 
                        coords = c( "longitude", "latitude"),
                        crs = 4326)
```

### Basic Map of Points {-}

For a really simple map of points -- to ensure they were geocoded and converted to spatial data correctly, we use `tmap`. We'll use the interactive version to view.

```{r warning = FALSE, message=FALSE}

tmap_mode("view")

tm_shape(methadoneSf) + tm_dots() 
```

If your points didn't plot correctly:

- Did you flip the longitude/latitude values?
- Did you input the correct CRS?

Those two issues are the most common errors.


## Merge Data sets

### Reshape Data {-}

Here, we are trying to use the `COVID-19_Cases__Tests__and_Deaths_by_ZIP_Code.csv` dataset to practice how to convert long data to a wide data format.

We subset to the first two columns, and the sixth column. That gives us the zip code, the reporting week, and cumulative cases of Covid-19. We want each zip code to be a unique row, with cases by week as a columns. Choose whatever subset function you prefer best!

```{r}

covid = read.csv("SDOHPlace-DataWrangling/COVID-19_Cases__Tests__and_Deaths_by_ZIP_Code.csv")

covid_clean = covid[,c(1:2, 6)]

head(covid_clean) 
```

Now, we are trying to create a wide data set with the cumulative cases for each week for each zip code. Enter the code and you will see the new wide data format.

```{r}
covid_wide <- reshape(covid_clean, direction = "wide",
                      idvar = "ZIP.Code", timevar = "Week.Number") 
head(covid_wide)
```

### Join by Attribute {-}

Here, we’ll merge data sets with a common variable in R. Merging the cumulative case data set you created in the last section to zip code spatial data (`ChiZipMaster1.geojson`) will allow you to map the case data. You’ll be merging the case data and spatial data using the zip codes field of each dataset.

We've cleaned our covid case data already, but not all values under the zip code column are valid. There is a row has a value of "unknown", so let's remove that.

```{r}
covid_wide_clean <- covid_wide %>%
  filter(ZIP.Code != "unknown" & !is.na(ZIP.Code))
```

Then, we need to load the zip code data.

```{r eval=FALSE}
zipcode <- st_read("SDOHPlace-DataWrangling/ChiZipMaster1.geojson")
```

Your output will look something like:

```
## Reading layer `ChiZipMaster1' from data source `./SDOHPlace-DataWrangling/ChiZipMaster1.geojson' using driver `GeoJSON'
## Simple feature collection with 540 features and 31 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -87.87596 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
## Geodetic CRS:  WGS 84
```

You’ll notice that the zip codes are repeated in the zip code data set, and needs to be cleaned before we can continue with merging the data.

```{r}

zipcode_unique <- distinct(zipcode)

zipcode_unique <- zipcode %>%
  group_by(zip) %>%
  slice(1) %>%
  ungroup()
```

Now, the two datasets are ready to join together by the zip code. Make sure to check they have been joined successully.

We'll have these joined datasets be called Chi_Zipsf, to denote a final zip code master dataset.

```{r}
Chi_Zipsf <- zipcode_unique %>%
  left_join(covid_wide_clean, by = c("zip" = "ZIP.Code"))
```

We'll reproject to EPSG:3435, the standard used in our study area.

```{r}
Chi_Zipsf.3435 <- st_transform(Chi_Zipsf, 3435)
```

### Join by Location {-}

We’ll create a spatial join with the housing and zip code data we've brought in.

In this example, we want to join zip-level data to the Rental Housing Developments, so we can identify which zips they are within.

First, let's try "sticking" all the zip code data too the housing points, intersecting zip codes with housing developments.

To do this, both datasets will need to be in the same CRS. We have already standardized both using EPSG:3435.

```{r}

Housing.f <- st_join(housing.3435, Chi_Zipsf.3435, join = st_intersects)
```

Don't forget to inspect the data. Uncomment to explore!

```{r}
#head(Housing.f)
```

We could also flip things around, and try to count how many developments intersect each zip code. We can use lengths() to find out how many items are present in a vector. Here,

```{r}

Chi_Zipsf.3435$TotHousing <- lengths(st_intersects(Chi_Zipsf.3435, housing.3435))

head(Chi_Zipsf.3435)


```


## Inspect Data


### Thematic Maps

To inspect data from a spatial perspective, we can create a series of choropleth maps.


#### Example 1. Number of Affordable Housing Developments per Zip Code {-}

Choose the variable "TotHousing" to map total developments per zip coode, as we calculated previously. Here we'll map using Jenks data breaks, with a Blue to Purple palette, and four bins (or groups of data to be plotted). A histogram of the data is plotted, visualizing where breaks occured in the data to generate the map.

```{r, message=FALSE, warning=FALSE}

tmap_mode("plot")

tm_shape(Chi_Zipsf.3435) + tm_polygons("TotHousing", legend.hist = TRUE, style="jenks", pal="BuPu", n=4, title = "Housing Dev.") + 

  tm_layout(legend.outside = TRUE, legend.outside.position = "right")
```

#### Example 2. Number of COVID-19 Cases per Zip Code {-}

Let's do the same, but plut a different variable. Select a different variable name as your parameter in the 'tm_fill' parameter.

```{r, message=FALSE, warning=FALSE}

tm_shape(Chi_Zipsf.3435) + tm_polygons("Case.Rate...Cumulative",
              legend.hist = TRUE, style="jenks", 
              pal="BuPu", n=4, title = "COVID Case Rate") + 

  tm_layout(legend.outside = TRUE, legend.outside.position = "right")
```

### Map Overlay

#### Example 1. Afforfable Housing Developments & Zip Code Boundaries {-}

We previously translated the housing dataset from a CSV to a spatial object. Let's take an attribute connect with each point, the total number of units per housing development, and visualize as a graduated symbology. Points with more units will be bigger, and not all places are weighted the same visually. 

We use the "style" parameter to add a standard deviation data classification break. Check out tmap documentation for more options, like quantiles, natural jenks, or other options.


```{r}
tm_shape(housing.3435) + tm_bubbles("Units", col = "purple", style = "sd") 
```

Then, let's overlay that layer to the zip code boundary.

```{r}
tm_shape(Chi_Zipsf.3435) + tm_polygons(col = "gray80") +
  
  tm_shape(housing.3435) + tm_bubbles("Units", col = "purple") 

```

You can also color-code according to the total number of units. Here, we'll add a palette using a "viridis" color scheme, as a graduate color point map. For extra style, we'll add labels to each zip code, update with a new basemap, and make the whole map interactive.

```{r}
tmap_mode("view")

#Add Basemap
tm_basemap("Esri.WorldGrayCanvas") + 
  
  #Add First Layer, Style
  tm_shape(Chi_Zipsf.3435) +  tm_borders(col = "gray10") + 
    tm_text("zip", size = 0.7) +

  #Add Second Layer, Style
  tm_shape(housing.3435) + 
    tm_bubbles( col = "Units", style = "quantile", 
                pal = "viridis", size = 0.1) 
  
```

#### Example 2. COVID-19 & Methadone {-}

In the first example, let create a map showing both COVID-19 and methadone clinic data (used in A.3). First, let's add our zip code map.

With this overlay, we'll add a "hack" to include the methadone clinic points in a legend.

```{r}
tmap_mode("plot")

##Add and style First Layer
tm_shape(Chi_Zipsf) + tm_polygons("Case.Rate...Cumulative", 
              style="jenks", pal="BuPu", n=4, title = "COVID Rt") + 
  
  ##Add Second Layer
  tm_shape(methadoneSf) + tm_dots(size = 0.2, col = "gray20") +
  
  ##"Hack" a manual symbology for dots in the legend
  tm_add_legend("symbol", col = "gray20", size = .2, labels = "Methadone MOUD") +
  
  ##Final Cartographic Styling
  tm_layout(legend.outside = TRUE, legend.outside.position = "right")
```


## Resources {-}
For a deeper dive on topics discussed in this chapter, please check out the following. If you have a resource to add, feel free to suggest one by submitting an [issue](https://github.com/healthyregions/sdhoplace-toolkit/issues) to our toolkit repository.

- For tips on using `tmap`, check out the online text, [Elegant and informative maps with tmap](https://r-tmap.github.io/tmap-book/index.html) by Tennekes and Nowosad.

- Try out more mapping with the ggplot2 library. The [Maps](https://ggplot2-book.org/maps) chapter will give you a head start.  

- We highly recommend Chapters 3-5 as mandatory reading in this classic, [Geocomputation with R](https://r.geocompx.org/) by Lovelace, Nowosad, and Muenchow. Perfecting selections and filters in the *Attribute Data Operations* chapter will help you become a data wrangling master. Perfect distance metrics and essential GIS operations in subsequent chapters.

- The Appendix in [Gimond's Intro to GIS](https://mgimond.github.io/Spatial/introGIS.html) online book has a super overview of R examples, not to be missed.

- Another superb resource is [Analyzing US Census Data](https://walker-data.com/census-r/index.html) by Kyle Walker, with some of our favorite examples of extracing & reshaping data directly from the Census using code. Highly recommended!