forked from data-science-chats/web_scrape_data
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathscrape_subsidies_data.Rmd
207 lines (170 loc) · 7.66 KB
/
scrape_subsidies_data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
---
title: "scrape_subsidies_data"
author: "Haley Epperly"
date: '2022-04-19'
output: html_document
editor_options:
chunk_output_type: console
---
Objective: Web scrape subsidies data (4 categories of subsidies plus total) for each year
available (2010-2019, 2020 not finished) and every county in the US from
the [EWG website](https://farm.ewg.org/index.php).
```{r, load libraries}
library(tidyverse)
library(rvest)
library(janitor)
library(stringr)
library(here)
library(tidycensus)
library(usmap)
```
First go through step by step without any pipes
```{r, without pipes}
url <- 'https://farm.ewg.org/region.php?fips=53000&progcode=total&yr='
year <- 2017
#Read in the content from a .html file. This is generalized, reading in all body text.
#Only need to input url.
step1 <- read_html(paste0(url,year))
#find HTML element using CSS selectors:
#read in identifying code from website for exact portion of website that you
#are interested in. (use inspect to find this path)
step2 <- html_elements(step1, "#main_content_area > script:nth-child(7)")
#convert html to text (retrieves text from specified element)
step3 <- html_text2(step2)
step3
#can see that each county start with their fips code in the form of C00000
#we need to split up this big chunk of html code by county
#code below says split the text every time you see a C with 5 numbers after it that
#range from 0 to 9
step3_5 <- strsplit(step3, "(C[0-9]{5})", perl = TRUE)
#however, that code above actually removes the fips codes when it splits by them
#in order to keep the fips code, we need to add some additional symbols
step4 <- strsplit(step3, "(?<=.)(?=(C[0-9]{5}))", perl = TRUE)
#convert to dataframe to visualize better
step5 <- as.data.frame(step4)
#can see first row does not correspond to any fips code
#this was the html text that was before the first fips code
#delete that first row
#step6 <- step5[-1,]
#test9 <- as.data.frame(test9)
step6 <- as.data.frame(step5[-1,])
#now we want to create two columns, one with fips codes and the other one with
#all the other data
#look at rows of html text in step 6 - can see we need to separate by ",value
#this time we are okay with it removing those characters when we separate by them
#we have to use \\ to "escape" certain characters - this lets us use a " in code
#without it thinking it means something in the code
step7 <- str_split_fixed(step6[,1], '\\",value', 2)
step8 <- as.data.frame(step7)
#look at rows of html text in step 7 - can see that the values I want (subsidies)
#all start with a $, so now we need to split the second column wherever it encounters
#a dollar sign. We know we have 5 subsidies (can see 5 $) and that the text before
#the first dollar sign will also be kept, so we need to specify 6 columns.
#Again we have to "escape" the character used ($)
step9 <- separate(step8, col = V2, sep='\\$', into=c('x','total','commodity','conservation', 'disaster','insurance'), remove=TRUE)
#We can get rid of that first new column (x) with the html text before the first subsidy
step10 <- step9[,-2]
#now need to get rid of all the text after the subsidy amount in each column
#can look at columns and see what comes after the subsidy amount
#only first column had "</b>" right after the subsidy amount, but all other columns had </td>.
#use gsub to replace the pattern and everything after the pattern with nothing
step10[,c(3:6)] <-lapply(step10[,c(3:6)], gsub, pattern = '</td>.*', replacement = "")
step10[,2] <- gsub(step10[,2], pattern = '</b>.*', replacement = "")
#the subsidy values include commas, which we need to remove to read as.numeric
#use gsub again to replace all commas with nothing
step10[,c(2:6)] <-lapply(step10[,c(2:6)], gsub, pattern = ',', replacement = "")
#add in year variable
step10$year <- 2017
#rename first column to fips
step11 <- rename(step10, fips = V1)
#remove leading C on fips values and convert to numeric
step11[,1] <- gsub(step11[,1], pattern = 'C', replacement = "")
```
Combine that all up into a function with pipes
```{r, function}
sub_scrape <- function(url, year){
read_html(paste0(url,year)) %>%
html_elements("#main_content_area > script:nth-child(7)") %>%
html_text2() %>%
strsplit("(?<=.)(?=(C[0-9]{5}))",perl = TRUE) %>%
as.data.frame() %>%
.[-1,] %>%
str_split_fixed('\\",value', 2) %>%
as.data.frame() %>%
separate(V2, sep='\\$', into=c('x','total','commodity','conservation',
'disaster','insurance'), remove=TRUE) %>%
.[,-2] %>%
mutate(total = gsub('</b>.*', "", total)) %>%
mutate(commodity = gsub('</td>.*', "", commodity)) %>%
mutate(conservation = gsub('</td>.*', "", conservation)) %>%
mutate(disaster = gsub('</td>.*', "", disaster)) %>%
mutate(insurance = gsub('</td>.*', "", insurance)) %>%
mutate(total = as.numeric(gsub(',', "", total))) %>%
mutate(commodity = as.numeric(gsub(',', "", commodity))) %>%
mutate(conservation = as.numeric(gsub(',', "", conservation))) %>%
mutate(disaster = as.numeric(gsub(',', "", disaster))) %>%
mutate(insurance = as.numeric(gsub(',', "", insurance))) %>%
add_column(Year = year) %>%
rename(fips = V1) %>%
mutate(fips = as.numeric(gsub('C', "", fips)))
}
#test function
tax2 <- sub_scrape('https://farm.ewg.org/region.php?fips=53000&progcode=total&yr=', 2017)
```
Use function to scrape data for all states and years
First, the EWG subsidy website url includes the fips code and year, so we
need to create a vector with all state fip codes + 000 to run through the loop
```{r}
#load in fips codes for each county
#this dataset is in the tidycensus package
fips_ids <- fips_codes
#add new column with just state fips (add 000 after state fips ID)
states_fips <- paste0(fips_ids$state_code, "000") %>%
#remove duplicates (duplicates are because of multiple counties per state)
unique() %>%
#remove state codes > 60 (US territories that don't have EWG subsidies data)
.[. < 60000]
```
Run for loop to iterate through all years for WA
```{r}
#example using WA - loop through all available years
url <- 'https://farm.ewg.org/region.php?fips=53000&progcode=total&yr='
subsidies_data <- data.frame()
for(year in 2010:2019){
subsidies_data_year <- sub_scrape(url,year)
subsidies_data <- rbind(subsidies_data_year, subsidies_data)
print(subsidies_data)
}
```
Expand on for loop to iterate through all combos of years and fips codes (500 webpages).
Don't run this - takes too long (~5-10 minutes), get 32,000 rows of data.
```{r}
#create empty dataframe to populate
subsidies_data <- data.frame()
#run through all possible fips codes in fips vector we previously created
for(state_fip in states_fips){
#sprintf is a wrapper for a C function, %s says input string here
#including the '%s' in the url below allows us to input different fips codes
url <- sprintf('https://farm.ewg.org/region.php?fips=%s&progcode=total&yr=', state_fip)
#within each fips option, run through all possible years
for(year in 2010:2019){
subsidies_data_year <- sub_scrape(url,year)
subsidies_data <- rbind(subsidies_data_year, subsidies_data)
print(subsidies_data)
}
}
```
Create a plot using example of WA data
```{r, plot function}
#read in full subsidies data
sub_data <- read_csv(here("2010_19_us_county_subsidies_data.csv"))
sub_map <- function(year, subtype) {
plot_usmap(data = sub_data[sub_data$Year==year,], values = subtype, size = .1) +
labs(title = str_to_title(paste(subtype, "subsidies received in", year))) +
scale_fill_continuous(low = "white", high = "blue", name = str_to_title(paste(subtype, "subsidies")), label = scales::comma) +
theme(legend.position = "right", plot.title = element_text(size=14), legend.title = element_text(size=12))
}
sub_map(2019, "total")
sub_map(2019, "conservation")
sub_map(2019, "insurance")
```