Add Marin County Scraper #80

Merged on Sep 3, 2020 (50 commits)
Changes from 10 commits
Commits
95f93c2
i think I got series data for the cases
kwonangela7 May 19, 2020
46d23cf
tried a variety of things to download csvs, eventually selected the r…
kwonangela7 May 26, 2020
e3e2357
added in Rob's suggested code
kwonangela7 May 26, 2020
9893f63
revised csv parsing logic now that I'm working with a csv_string
kwonangela7 May 27, 2020
a5a66d8
finished breakdown parsings
kwonangela7 Jun 2, 2020
a27073f
finalized series and test scraping methods with function annotations …
kwonangela7 Jun 10, 2020
30ccf08
fixed the bug so that only chart notes from the charts I'm looking at…
kwonangela7 Jun 10, 2020
cffeaea
Merge branch 'master' of https://github.com/sfbrigade/data-covid19-sf…
kwonangela7 Jun 10, 2020
cb9947c
moved marin scraper to folder
kwonangela7 Jun 10, 2020
b53c84e
deleted extra copy of marin_scraper.py
kwonangela7 Jun 11, 2020
ac14dfb
pulled new files
kwonangela7 Jun 20, 2020
c108ac1
converted tab to 4 spaces
kwonangela7 Jun 20, 2020
c6a969f
raised error for wrong kind of href
kwonangela7 Jun 20, 2020
c3b6f1a
Merge branch 'master' into marin-county
kwonangela7 Jun 24, 2020
0156f85
Update covid19_sfbayarea/data/marin_scraper.py
kwonangela7 Jun 28, 2020
029f367
changed module name
kwonangela7 Jun 28, 2020
cf1ddb3
deleted marin_scraper.py
kwonangela7 Jun 28, 2020
07760b2
Merge branch 'master' into marin-county
kwonangela7 Jun 28, 2020
a36b2c0
renamed file, will rename at the end lol
kwonangela7 Jun 28, 2020
0aadb08
renamed county function, added scraper to init file
kwonangela7 Jun 28, 2020
37880a6
pls ignore previous renaming commits, this is the actual commit to pr…
kwonangela7 Jun 28, 2020
9b90510
removing file with the wrong name
kwonangela7 Jun 28, 2020
8b2f8b9
added import to init statement
kwonangela7 Jun 29, 2020
39ce2bb
used soup.select('h4+p') instead of find_next_sibling + threw error
kwonangela7 Jul 1, 2020
7521dfb
fixed get_case_series to use csv modeul, not use numpy, and use the p…
kwonangela7 Jul 7, 2020
1e3fcbc
fixed case and deaths series data + breakdown functions to use csv mo…
kwonangela7 Jul 8, 2020
5b04be9
testing to get the most recent commits on this branch
Jul 11, 2020
850650e
Merge branch 'marin-county' of https://github.com/sfbrigade/data-covi…
Jul 11, 2020
2ba5273
simplified test logic
kwonangela7 Jul 15, 2020
d574680
fixed testing data logic, fixed age mappings. The raw counts for age …
kwonangela7 Jul 16, 2020
0b94cc4
fixed linter errors
kwonangela7 Jul 17, 2020
f7f532b
Merge branch 'master' into marin-county
kwonangela7 Jul 17, 2020
eb079be
ready to write up code in context managers tomorrow
kwonangela7 Jul 17, 2020
0fc2903
Merge branch 'marin-county' of https://github.com/sfbrigade/data-covi…
kwonangela7 Jul 17, 2020
862a240
rewrote metadata and extract csv functions using context managers
kwonangela7 Jul 18, 2020
153d379
fixed half of metadata function, not sure what's wrong with the other…
kwonangela7 Jul 23, 2020
90e75ee
fixed metadata function - finallygit add covid19_sfbayarea/data/marin…
kwonangela7 Aug 18, 2020
176cbd7
Merge remote-tracking branch 'origin/master' into marin-county
kwonangela7 Aug 18, 2020
6d12de7
added data points to data model needed for marin, updated README, and…
kwonangela7 Aug 22, 2020
04e62e1
fixed linter issue
kwonangela7 Aug 22, 2020
ef46d85
Update covid19_sfbayarea/data/marin.py
kwonangela7 Aug 29, 2020
b4826cc
removed instances of inmate as that data is not collected by marin co…
kwonangela7 Aug 29, 2020
e4f586f
updated README - inmate section
kwonangela7 Aug 29, 2020
c1e3be8
Merge branch 'marin-county' of https://github.com/sfbrigade/data-covi…
kwonangela7 Aug 29, 2020
e4b185b
updated Race and Ethnicity README
kwonangela7 Aug 29, 2020
d35f453
made sure to use 4-space indentation, fixed test series function to o…
kwonangela7 Aug 29, 2020
eb7c0c1
updated meta_from_baypd
kwonangela7 Aug 29, 2020
f5d8ae1
updated meta_from_source
kwonangela7 Aug 29, 2020
3a03180
updated meta_from_source about testing data nuances
kwonangela7 Aug 29, 2020
6d55284
Delete inmates from population_totals
elaguerta Sep 3, 2020
303 changes: 303 additions & 0 deletions covid19_sfbayarea/data/marin_scraper.py
@@ -0,0 +1,303 @@
#!/usr/bin/env python3
import csv
import json
import numpy as np
from typing import List, Dict, Tuple
from selenium import webdriver
from bs4 import BeautifulSoup
from urllib.parse import unquote_plus
from datetime import datetime
import re

from .utils import get_data_model

def get_county_data() -> Dict:
    """Main method for populating county data"""

    url = 'https://coronavirus.marinhhs.org/surveillance'
    model = get_data_model()

    chart_ids = {"cases": "Eq6Es", "deaths": "Eq6Es", "tests": '2Hgir', "age": "VOeBm", "gender": "FEciW", "race_eth": "aBeEd"}
    # population totals and transmission data missing.
    model['name'] = "Marin County"
    model['update_time'] = datetime.today().isoformat()
    # The source website shows no update time. Most charts are updated daily, so this scrape-time timestamp is only approximately correct.
    model['source_url'] = url
    model['meta_from_source'] = get_metadata(url, chart_ids)
    model["series"]["cases"] = get_case_series(chart_ids["cases"], url)
    model["series"]["deaths"] = get_death_series(chart_ids["deaths"], url)
    model["series"]["tests"] = get_test_series(chart_ids["tests"], url)
    model["case_totals"]["age_group"], model["death_totals"]["age_group"] = get_breakdown_age(chart_ids["age"], url)
    model["case_totals"]["gender"], model["death_totals"]["gender"] = get_breakdown_gender(chart_ids["gender"], url)
    model["case_totals"]["race_eth"], model["death_totals"]["race_eth"] = get_breakdown_race_eth(chart_ids["race_eth"], url)

    print(model)
    # Return the model so the function matches its Dict annotation
    return model

def extract_csvs(chart_id: str, url: str) -> str:
    """This method extracts the csv string from the data wrapper charts."""
    driver = webdriver.Chrome('/Users/angelakwon/Downloads/chromedriver')
    # need to figure out how to change the webdriver

    driver.implicitly_wait(30)
    driver.get(url)

    frame = driver.find_element_by_css_selector(f'iframe[src*="//datawrapper.dwcdn.net/{chart_id}/"]')

    driver.switch_to.frame(frame)
    # Grab the raw data out of the link's href attribute
    csv_data = driver.find_element_by_class_name('dw-data-link').get_attribute('href')
    # Switch back to the parent frame to "reset" the context
    driver.switch_to.parent_frame()

    # Deal with the data
    if csv_data.startswith('data:'):
        media, data = csv_data[5:].split(',', 1)
        # Will likely always have this kind of data type
        if media != 'application/octet-stream;charset=utf-8':
            raise ValueError(f'Cannot handle media type "{media}"')
        csv_string = unquote_plus(data)
    else:
        # Without this branch, csv_string would be unbound at the return below
        raise ValueError('Cannot handle this kind of href.')

    # Then leave the iframe and close the browser so it does not linger after each call
    driver.switch_to.default_content()
    driver.quit()

    return csv_string

def get_metadata(url: str, chart_ids: Dict[str, str]) -> Tuple:
    driver = webdriver.Chrome('/Users/angelakwon/Downloads/chromedriver') # change this to point to Github one
    driver.implicitly_wait(30)
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html5lib')
    metadata = []

    to_be_matched = ['Total Cases, Recovered, Hospitalizations and Deaths by Date Reported', 'Daily Count of Positive Results and Total Tests for Marin County Residents by Test Date ', 'Cases, Hospitalizations, and Deaths by Age, Gender and Race/Ethnicity ']
    chart_metadata = []

    for text in to_be_matched:
        target = soup.find('h4',text=text)
        if not target:
            raise ValueError('Cannot handle this header.')
        for sib in target.find_next_siblings()[:1]: # I only want the first paragraph tag
            # Is it more efficient to use something like (soup object).select('h1 + p') to grab the first paragraph that follows?
            metadata += [sib.text]

    # Metadata for each chart visualizing the data of the csv file I'll pull. There's probably a better way to organize this.
    for chart_id in chart_ids.values():
        frame = driver.find_element_by_css_selector(f'iframe[src*="//datawrapper.dwcdn.net/{chart_id}/"]')
        driver.switch_to.frame(frame)
        # The metadata for the charts is located in elements with the class `dw-chart-notes`
        for c in driver.find_elements_by_class_name('dw-chart-notes'):
            chart_metadata.append(c.text)

        # Switch back to the parent frame to "reset" the context
        driver.switch_to.parent_frame()

    driver.quit()

    # Return the metadata. I take the set of the chart_metadata since there are repeating metadata strings.
    return metadata, list(set(chart_metadata))

def get_case_series(chart_id: str, url: str) -> List:
    """This method extracts the date, number of cumulative cases, and new cases."""
    csv_ = extract_csvs(chart_id, url)
    series = []

    csv_strs = csv_.splitlines()
    keys = csv_strs[0].split(',')

    if keys != ['Date', 'Total Cases', 'Total Recovered*', 'Total Hospitalized', 'Total Deaths']:
        raise ValueError('The headers have changed')

    case_history = []

    for row in csv_strs[1:]:
        daily = {}
        # Grab the date in the first column
        date_time_obj = datetime.strptime(row.split(',')[0], '%m/%d/%Y')
        daily["date"] = date_time_obj.isoformat()
        # Collect the case totals in order to compute the change in cases per day
        case_history.append(int(row.split(',')[1]))
        # Grab the cumulative number in the second column
        daily["cumul_cases"] = int(row.split(',')[1])
        series.append(daily)

    case_history_diff = np.diff(case_history)
    # there will be no calculated difference for the first day, so adding it in manually
    case_history_diff = np.insert(case_history_diff, 0, 0)
    # adding the case differences into the series
    for val, case_num in enumerate(case_history_diff):
        # Convert from a numpy integer to a plain int so the value can be serialized as JSON
        series[val]["cases"] = int(case_num)
    return series

def get_death_series(chart_id: str, url: str) -> List:
    """This method extracts the date, number of cumulative deaths, and new deaths."""
    csv_ = extract_csvs(chart_id, url)
    series = []

    csv_strs = csv_.splitlines()
    keys = csv_strs[0].split(',')
    if keys != ['Date', 'Total Cases', 'Total Recovered*', 'Total Hospitalized', 'Total Deaths']:
        raise ValueError('The headers have changed.')

    death_history = []

    for row in csv_strs[1:]:
        daily = {}
        # Grab the date in the first column
        date_time_obj = datetime.strptime(row.split(',')[0], '%m/%d/%Y')
        daily["date"] = date_time_obj.isoformat()
        # Collect the death totals in order to compute the change in deaths per day
        death_history.append(int(row.split(',')[4]))
        # Grab the cumulative number in the fifth column
        daily["cumul_deaths"] = int(row.split(',')[4])
        series.append(daily)

    death_history_diff = np.diff(death_history)
    # there will be no calculated difference for the first day, so adding it in manually
    death_history_diff = np.insert(death_history_diff, 0, 0)
    # adding the death differences into the series
    for val, death_num in enumerate(death_history_diff):
        # Convert from a numpy integer to a plain int so the value can be serialized as JSON
        series[val]["deaths"] = int(death_num)
    return series

def get_breakdown_age(chart_id: str, url: str) -> Tuple:
    """This method gets the breakdown of cases and deaths by age."""
    csv_ = extract_csvs(chart_id, url)
    c_brkdown = []
    d_brkdown = []

    csv_strs = csv_.splitlines()
    keys = csv_strs[0].split(',')

    if keys != ['Age Category', 'POPULATION', 'Cases', 'Hospitalizations', 'Deaths']:
        raise ValueError('The headers have changed')

    ages = ['0-18', '19-34', '35-49', '50-64', '65+']
    for row in csv_strs[1:]:
        c_age = {}
        d_age = {}
        # Extracting the age group and the raw count (the 3rd and 5th columns, respectively) for both cases and deaths.
        # Each new row has data for a different age group.
        c_age["group"] = row.split(',')[0]
        if c_age["group"] not in ages:
            raise ValueError('The age groups have changed.')
        c_age["raw_count"] = int(row.split(',')[2])
        d_age["group"] = row.split(',')[0]
        d_age["raw_count"] = int(row.split(',')[4])
        c_brkdown.append(c_age)
        d_brkdown.append(d_age)

    return c_brkdown, d_brkdown

def get_breakdown_gender(chart_id: str, url: str) -> Tuple:
    """This method gets the breakdown of cases and deaths by gender."""
    csv_ = extract_csvs(chart_id, url)

    csv_strs = csv_.splitlines()
    keys = csv_strs[0].split(',')
    if keys != ['Gender', 'POPULATION', 'Cases', 'Hospitalizations', 'Deaths']:
        raise ValueError('The headers have changed.')

    genders = ['male', 'female']
    c_gender = {}
    d_gender = {}

    for row in csv_strs[1:]:
        # Extracting the gender and the raw count (the 3rd and 5th columns, respectively) for both cases and deaths.
        # Each new row has data for a different gender.
        split = row.split(',')
        gender = split[0].lower()
        if gender not in genders:
            # Raise (rather than return) the error so callers cannot mistake it for data
            raise ValueError('The genders have changed.')
        c_gender[gender] = int(split[2])
        d_gender[gender] = int(split[4])

    return c_gender, d_gender

def get_breakdown_race_eth(chart_id: str, url: str) -> Tuple:
    """This method gets the breakdown of cases and deaths by race/ethnicity."""

    csv_ = extract_csvs(chart_id, url)

    csv_strs = csv_.splitlines()
    keys = csv_strs[0].split(',')

    if keys != ['Race/Ethnicity', 'COUNTY POPULATION', 'Case Count', 'Percent of Cases', 'Hospitalization Count', 'Percent of Hospitalizations', 'Death Count', 'Percent of Deaths']:
        raise ValueError("The headers have changed.")

    key_mapping = {"black/african american":"African_Amer", "hispanic/latino": "Latinx_or_Hispanic",
                   "american indian/alaska native": "Native_Amer", "native hawaiian/pacific islander": "Pacific_Islander", "white": "White", "asian": "Asian", "multi or other race": "Multi or Other Race"}
    # "Multiple_Race", "Other" are not separate in this data set - they are one value under "Multi or Other Race"

    c_race_eth = {}
    d_race_eth = {}

    for row in csv_strs[1:]:
        split = row.split(',')
        race_eth = split[0].lower()
        if race_eth not in key_mapping:
            raise ValueError("The race_eth groups have changed.")
        else:
            c_race_eth[key_mapping[race_eth]] = int(split[2])
            d_race_eth[key_mapping[race_eth]] = int(split[6])

    return c_race_eth, d_race_eth

def get_test_series(chart_id: str, url: str) -> List:
    """This method gets the date, the number of positive and negative tests on that date, and the number of cumulative positive and negative tests."""

    csv_ = extract_csvs(chart_id, url)
    series = []

    csv_strs = csv_.splitlines()
    # Row 0 holds the dates, row 1 the positive tests, and row 2 the negative tests
    date_row = csv_strs[0].split(',')
    positive_row = csv_strs[1].split(',')
    negative_row = csv_strs[2].split(',')

    # Grab the dates, which are in the header; skip the very first item, which is the value 'Date'
    for entry in date_row[1:]:
        daily = {}
        date_time_obj = datetime.strptime(entry, '%m/%d/%Y')
        daily["date"] = date_time_obj.isoformat()
        series.append(daily)

    # Raise if either row label differs from what we expect
    if positive_row[0] != 'Positive Tests' or negative_row[0] != 'Negative Tests':
        raise ValueError('The kinds of tests have changed.')

    # Drop the row labels so that 'Positive Tests' and 'Negative Tests' are not captured.
    p_entries = positive_row[1:]
    n_entries = negative_row[1:]

    get_test_series_helper(series, p_entries, ['positive', 'cumul_pos'])
    get_test_series_helper(series, n_entries, ['negative', 'cumul_neg'])

    return series

def get_test_series_helper(series: list, entries: list, keys: list) -> List:
    """This method helps get the pos/neg test count and the cumulative pos/neg test count."""

    # Initialize the cumulative number, the positive/negative and cumul_pos/neg values for the first day, and the index needed for the while loop.

    # there's probably a more efficient way to do all of this, but I just wasn't sure.
    cumul = int(entries[0])
    series[0][keys[0]] = int(entries[0])
    series[0][keys[1]] = cumul
    index = 1

    while index < len(series):
        # get a particular day
        day = series[index]
        curr = int(entries[index])
        # get pos/neg test count
        day[keys[0]] = curr
        # add that day's pos/neg test count to get the cumulative number of pos/neg tests
        cumul += curr
        day[keys[1]] = cumul
        index += 1
    return series


get_county_data()
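
Later commit messages in this PR mention moving the series functions to the csv module and dropping numpy. The diff for that change is not shown here, so the following is only a minimal sketch of what such a version might look like, not the merged implementation. The function name `parse_cumulative_series`, its parameters, and the sample rows are hypothetical; the header list is copied from `get_case_series` above. Unlike the code above, it emits date-only strings to match the "yyyy-mm-dd" placeholder in the data model.

```python
import csv
from datetime import datetime
from io import StringIO
from typing import Dict, List

# Header row copied from get_case_series / get_death_series in the diff above.
EXPECTED_HEADERS = ['Date', 'Total Cases', 'Total Recovered*',
                    'Total Hospitalized', 'Total Deaths']


def parse_cumulative_series(csv_string: str, column: str,
                            daily_key: str, cumul_key: str) -> List[Dict]:
    """Hypothetical helper: build a daily/cumulative series from one CSV column."""
    reader = csv.DictReader(StringIO(csv_string))
    if reader.fieldnames != EXPECTED_HEADERS:
        raise ValueError('The headers have changed')

    series = []
    previous = None
    for row in reader:
        total = int(row[column])
        series.append({
            'date': datetime.strptime(row['Date'], '%m/%d/%Y').date().isoformat(),
            # No difference can be computed for the first day, so it is set to 0,
            # mirroring the np.insert(..., 0, 0) step in the original code.
            daily_key: 0 if previous is None else total - previous,
            cumul_key: total,
        })
        previous = total
    return series


# Made-up sample rows, for illustration only.
sample = (
    'Date,Total Cases,Total Recovered*,Total Hospitalized,Total Deaths\n'
    '03/20/2020,10,0,2,0\n'
    '03/21/2020,15,1,3,1\n'
)
cases = parse_cumulative_series(sample, 'Total Cases', 'cases', 'cumul_cases')
deaths = parse_cumulative_series(sample, 'Total Deaths', 'deaths', 'cumul_deaths')
```

Because the differences are computed with plain Python integers, the values need no conversion before being serialized as JSON, and one helper can serve both the case and death series.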
4 changes: 2 additions & 2 deletions data_models/data_model.json
@@ -6,12 +6,12 @@
"meta_from_baypd": "STORE IMPORTANT NOTES ABOUT OUR METHODS HERE",
"series": {
"cases": [
- { "date": "yyyy-mm-dd", "cases": -1, "cumul_cases": -1},
+ { "date": "yyyy-mm-dd", "cases": -1, "cumul_cases": -1 },
{ "date": "yyyy-mm-dd", "cases": -1, "cumul_cases": -1 }
],
"deaths": [
{ "date": "yyyy-mm-dd", "deaths": -1, "cumul_deaths": -1 },
- { "date": "yyyy-mm-dd", "deaths": -1, "cumul_deaths": -1}
+ { "date": "yyyy-mm-dd", "deaths": -1, "cumul_deaths": -1 }
],
"tests": [
{