Add Sonoma Data Scraper #57
Merged
Changes from all commits (70 commits)
10f0dfe
organization Merge CDM readme into readme
e5bceba
Merge branch 'master' of github.com:sfbrigade/data-covid19-sfbayarea …
7db78f7
organization Move data models to own folder
05dbee7
organization Replace tabs with spaces
998800b
sonoma Get top level metadata
3fcb13f
Merge branch 'master' of github.com:sfbrigade/data-covid19-sfbayarea …
ed855bd
sonoma Move scraper and collect metadata
fdab8a4
sonoma Add transmission types
8bd2081
sonoma Get cases, active, recovered, and death series
bd72db8
sonoma Get case data by age
ee5a8b7
sonoma Fix table numbers
7745e5b
sonoma Add test getter
e7ab26f
sonoma Factor out some common code
dc9b9fe
sonoma Add cases by race
af8bfe2
sonoma Add hospitalizations
adbe419
sonoma Add hospitalizations by gender
6b71193
sonoma Fix type error
627e82a
sonoma Redo definitions getter
a565a83
sonoma Add get_county function
358a441
sonoma Add docstrings
7dc3beb
sonoma Comment out hospitalizations by gender
6a4ead9
sonoma Add docstring for gender hospitalization
336e5ac
sonoma Remove unused variable
5297eeb
sonoma Replace findAll with find_all
5093fe3
sonoma Make newlines clearer
48dd3c1
sonoma Comment out hospitalizations
2a76315
Merge branch 'master' of github.com:sfbrigade/data-covid19-sfbayarea …
a8ce742
sonoma Use better date parser
c9f3500
sonoma Improve transform cases function
058a555
sonoma Fix date formats, table selection, and number parsing
fd4e135
sonoma Use custom int parse function
8310ca0
sonoma Create custom FormatError exception
148eec8
Merge branch 'master' of github.com:sfbrigade/data-covid19-sfbayarea …
ada9b2a
sonoma use template defaults for race
40b84e0
sonoma Fix test breakage
eb1a489
sonoma Use unique functions for age and gender
fdb2045
sonoma Transform age group names
1e1b0a8
sonoma Add error handling for gender and age transformations
1f3755a
sonoma Rename scraper file
96b81b5
sonoma Fix error handling for age
fb339b4
sonoma Fix typing errors
5d96031
sonoma Factor out getting section by title
fd09e5e
sonoma Correct deaths and cases aggregation
7770c89
sonoma Raise error for hospitalization change
6b4b69b
sonoma Add error for getting section by title
f1c7f05
sonoma Fix typing issue for age
5c9a9ed
sonoma Write parse table function
41d61c4
Fix typo
06163e2
sonoma Comment and typing fixes
ba6df28
Use raw string for regex
ad0e174
Merge branch 'sonoma' of github.com:sfbrigade/data-covid19-sfbayarea …
2bf3faf
sonoma Remove commented out code
bac1b5b
sonoma Remove unused variable
6a0ef8c
sonoma Add sonoma to __init__.py
15456e1
Merge branch 'master' of github.com:sfbrigade/data-covid19-sfbayarea …
329f92d
sonoma Correct conventions for sonoma
3822877
Merge branch 'master' of github.com:sfbrigade/data-covid19-sfbayarea …
f070125
Fix conflicts
d1aec84
Fix error import
1deaa9c
Merge branch 'master' of github.com:sfbrigade/data-covid19-sfbayarea …
869418a
Fix linter errors and import
6ef13b4
Add type aliases
5fdc2aa
Use get cell function for cases
ldtcooper aed862f
Remove data model readme from main readme
ldtcooper 898672d
Add readme link
ldtcooper a549ea4
Refactor test and gender functions
ldtcooper 97b72c1
Refactor all transform functions but cases
ldtcooper 28df7be
Fix types
ldtcooper 6ddf682
Add docstrings
ldtcooper 4a92856
Use datetime attribute
@@ -0,0 +1,261 @@
import requests
import json
import re
import dateutil.parser
from typing import List, Dict, Union
from bs4 import BeautifulSoup, element  # type: ignore
from ..errors import FormatError

TimeSeriesItem = Dict[str, Union[str, int]]
TimeSeries = List[TimeSeriesItem]
UnformattedSeriesItem = Dict[str, str]
UnformattedSeries = List[UnformattedSeriesItem]

def get_section_by_title(header: str, soup: BeautifulSoup) -> element.Tag:
    """
    Takes in a header string and returns the parent element of that header
    """
    header_tag = soup.find(lambda tag: tag.name == 'h3' and header in tag.get_text())
    if not header_tag:
        raise FormatError('The header "{0}" no longer corresponds to a section'.format(header))

    return header_tag.find_parent()

def get_table(header: str, soup: BeautifulSoup) -> element.Tag:
    """
    Takes in a header and a BeautifulSoup object and returns the table under
    that header
    """
    tables = get_section_by_title(header, soup).find_all('table')
    # this lets us get the second cases table
    return tables[-1]

def get_cells(row: element.ResultSet) -> List[str]:
    """
    Gets all th and td elements within a single tr element
    """
    return [el.text for el in row.find_all(['th', 'td'])]

def row_list_to_dict(row: List[str], headers: List[str]) -> UnformattedSeriesItem:
    """
    Takes in a list of headers and a corresponding list of cells
    and returns a dictionary associating the headers with the cells
    """
    return dict(zip(headers, row))

def parse_table(tag: element.Tag) -> UnformattedSeries:
    """
    Takes in a BeautifulSoup table tag and returns a list of dictionaries
    where the keys correspond to header names and the values to corresponding cell values
    """
    rows = tag.find_all('tr')
    header = rows[0]
    body = rows[1:]
    header_cells = get_cells(header)
    body_cells = (get_cells(row) for row in body)
    return [row_list_to_dict(row, header_cells) for row in body_cells]

def parse_int(text: str) -> int:
    """
    Takes in a number in string form and returns that string in integer form
    and handles zeroes represented as dashes
    """
    text = text.strip()
    if text == '-':
        return 0
    else:
        return int(text.replace(',', ''))

def generate_update_time(soup: BeautifulSoup) -> str:
    """
    Returns an ISO 8601 timestamp string (e.g. 2020-05-06T10:00:00) taken from
    the page's 'updated' time element
    """
    update_time_text = soup.find('time', {'class': 'updated'})['datetime']
    try:
        date = dateutil.parser.parse(update_time_text)
    except ValueError:
        raise ValueError(f'Date is not in ISO 8601 format: "{update_time_text}"')
    return date.isoformat()

def get_source_meta(soup: BeautifulSoup) -> str:
    """
    Finds the 'Definitions' header on the page and gets all of the text in it.
    """
    definitions_section = get_section_by_title('Definitions', soup)
    definitions_text = definitions_section.text
    return definitions_text.replace('\n', '/').strip()

def transform_cases(cases_tag: element.Tag) -> Dict[str, TimeSeries]:
    """
    Takes in a BeautifulSoup tag for the cases table and returns all cases
    (historic and active) and deaths in the form:
    { 'cases': [], 'deaths': [] }
    Where each list contains dictionaries (representing each day's data)
    of form (example for cases):
    { 'date': '', 'cases': -1, 'cumul_cases': -1 }
    """
    cases = []
    cumul_cases = 0
    deaths = []
    cumul_deaths = 0

    rows = list(reversed(parse_table(cases_tag)))
Review comment: Oh, forgot to comment on this in the review: I don’t think there’s any reason to convert this to a list, since you’re only iterating over it once and not returning it.

Suggested change:
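A minimal sketch of the change the reviewer is suggesting (simply dropping the list() call; the loop below iterates the rows once, and reversed() accepts the list that parse_table returns):

    rows = reversed(parse_table(cases_tag))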

    for row in rows:
        date = dateutil.parser.parse(row['Date']).date().isoformat()
        new_infected = parse_int(row['New'])
        dead = parse_int(row['Deaths'])

        cumul_cases += new_infected
        case_dict: TimeSeriesItem = { 'date': date, 'cases': new_infected, 'cumul_cases': cumul_cases }
        cases.append(case_dict)

        new_deaths = dead - cumul_deaths
        cumul_deaths = dead
        death_dict: TimeSeriesItem = { 'date': date, 'deaths': new_deaths, 'cumul_deaths': dead }
        deaths.append(death_dict)

    return { 'cases': cases, 'deaths': deaths }

def transform_transmission(transmission_tag: element.Tag) -> Dict[str, int]:
    """
    Takes in a BeautifulSoup tag for the transmissions table and breaks it into
    a dictionary of type:
    {'community': -1, 'from_contact': -1, 'travel': -1, 'unknown': -1}
    """
    transmissions = {}
    rows = parse_table(transmission_tag)
    # turns the transmission categories on the page into the ones we're using
    transmission_type_conversion = {'Community': 'community', 'Close Contact': 'from_contact', 'Travel': 'travel', 'Under Investigation': 'unknown'}
    for row in rows:
        type = row['Source']
        number = parse_int(row['Cases'])
        if type not in transmission_type_conversion:
            raise FormatError(f'The transmission type {type} was not found in transmission_type_conversion')
        type = transmission_type_conversion[type]
        transmissions[type] = number
    return transmissions

def transform_tests(tests_tag: element.Tag) -> Dict[str, int]:
    """
    Transform function for the tests table.
    Takes in a BeautifulSoup tag for a table and returns a dictionary
    """
    tests = {}
    rows = parse_table(tests_tag)
    for row in rows:
        lower_res = row['Results'].lower()
        tests[lower_res] = parse_int(row['Number'])
    return tests

def transform_gender(tag: element.Tag) -> Dict[str, int]:
    """
    Transform function for the cases by gender table.
    Takes in a BeautifulSoup tag for a table and returns a dictionary
    in which the keys are strings and the values integers
    """
    genders = {}
    rows = parse_table(tag)
    gender_string_conversions = {'Males': 'male', 'Females': 'female'}
    for row in rows:
        gender = row['Gender']
        cases = parse_int(row['Cases'])
        if gender not in gender_string_conversions:
            raise FormatError('An unrecognized gender has been added to the gender table')
        genders[gender_string_conversions[gender]] = cases
    return genders

def transform_age(tag: element.Tag) -> TimeSeries:
    """
    Transform function for the cases by age group table.
    Takes in a BeautifulSoup tag for a table and returns a list of
    dictionaries in which the keys are strings and the values integers
    """
    categories: TimeSeries = []
    rows = parse_table(tag)
    for row in rows:
        raw_count = parse_int(row['Cases'])
        group = row['Age Group']
        # named `item` to avoid shadowing the bs4 `element` import
        item: TimeSeriesItem = {'group': group, 'raw_count': raw_count}
        categories.append(item)
    return categories

def get_unknown_race(race_eth_tag: element.Tag) -> int:
    """
    Gets the notes under the 'Cases by race and ethnicity' table to find the
    number of cases where the person's race is unknown
    """
    parent = race_eth_tag.parent
    note = parent.find('p').text
    matches = re.search(r'(\d+) \(\d{1,3}%\) missing race/ethnicity', note)
    if not matches:
        raise FormatError('The format of the note with unknown race data has changed')
    return parse_int(matches.groups()[0])

def transform_race_eth(race_eth_tag: element.Tag) -> Dict[str, int]:
    """
    Takes in the BeautifulSoup tag for the cases by race/ethnicity table and
    transforms it into an object of form:
    'race_eth': {'Asian': -1, 'Latinx_or_Hispanic': -1, 'Other': -1, 'White': -1, 'Unknown': -1}
    NB: These are the only races reported separately by Sonoma County at this time
    """
    race_cases = {
        'Asian': 0,
        'Latinx_or_Hispanic': 0,
        'Other': 0,
        'White': 0,
        'Unknown': 0
    }
    race_transform = {'Asian/Pacific Islander, non-Hispanic': 'Asian', 'Hispanic/Latino': 'Latinx_or_Hispanic', 'Other*, non-Hispanic': 'Other', 'White, non-Hispanic': 'White'}
    rows = parse_table(race_eth_tag)
    for row in rows:
        group_name = row['Race/Ethnicity']
        cases = parse_int(row['Cases'])
        if group_name not in race_transform:
            raise FormatError('The racial group {0} is new in the data -- please adjust the scraper accordingly'.format(group_name))
        internal_name = race_transform[group_name]
        race_cases[internal_name] = cases
    race_cases['Unknown'] = get_unknown_race(race_eth_tag)
    return race_cases

def get_table_tags(soup: BeautifulSoup) -> List[element.Tag]:
    """
    Takes in a BeautifulSoup object and returns an array of the tables we need
    """
    headers = ['Cases by Date', 'Test Results', 'Cases by Source', 'Cases by Age Group', 'Cases by Gender', 'Cases by Race']
    return [get_table(header, soup) for header in headers]

def get_county() -> Dict:
    """
    Main method for populating county data .json
    """
    url = 'https://socoemergency.org/emergency/novel-coronavirus/coronavirus-cases/'
    # need this to avoid 403 error ¯\_(ツ)_/¯
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
    page = requests.get(url, headers=headers)
    page.raise_for_status()
    sonoma_soup = BeautifulSoup(page.content, 'html5lib')

    hist_cases, total_tests, cases_by_source, cases_by_age, cases_by_gender, cases_by_race = get_table_tags(sonoma_soup)

    model = {
        'name': 'Sonoma County',
        'update_time': generate_update_time(sonoma_soup),
        'source': url,
        'meta_from_source': get_source_meta(sonoma_soup),
        'meta_from_baypd': 'Racial "Other" category includes "Black/African American, American Indian/Alaska Native, and Other"',
        'series': transform_cases(hist_cases),
        'case_totals': {
            'transmission_cat': transform_transmission(cases_by_source),
            'age_group': transform_age(cases_by_age),
            'race_eth': transform_race_eth(cases_by_race),
            'gender': transform_gender(cases_by_gender)
        },
        'tests_totals': {
            'tests': transform_tests(total_tests),
        },
    }
    return model

if __name__ == '__main__':
    print(json.dumps(get_county(), indent=4))
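For quick local checks, a minimal sketch that exercises the scraper and verifies the top-level keys the model dict above defines (it assumes network access to the Sonoma County page):

# Sanity-check sketch: call get_county() and confirm the top-level keys
# defined in the model dict are present in the result.
county = get_county()
assert county['name'] == 'Sonoma County'
for key in ('update_time', 'source', 'meta_from_source', 'meta_from_baypd',
            'series', 'case_totals', 'tests_totals'):
    assert key in county, f'missing key: {key}'
print('All top-level keys present')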
Review comment: Minor nit: one of the important things about naming (or aliasing) types like this is to change how you conceptualize your values and functions (e.g. you shouldn’t be thinking of UnformattedSeriesItem like a shortcut for Dict[str, str] here; you should be thinking of it like a subclass of dict, conceptually its own separate thing). So if you’re changing the return type to something named UnformattedSeriesItem, it’s probably a good idea to change the function name so it doesn’t talk about making a dict, and instead call it something like row_list_to_series_item or something.
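A minimal sketch of the rename this comment is suggesting (the name row_list_to_series_item comes from the comment itself; the signature and body are unchanged from row_list_to_dict above):

def row_list_to_series_item(row: List[str], headers: List[str]) -> UnformattedSeriesItem:
    """
    Takes in a list of headers and a corresponding list of cells
    and returns an UnformattedSeriesItem associating the headers with the cells
    """
    return dict(zip(headers, row))

The return statement in parse_table would then call row_list_to_series_item instead of row_list_to_dict.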