Add Sonoma Data Scraper #57

Merged
70 commits merged on Aug 18, 2020
Commits
10f0dfe
organization Merge CDM readme into readme
May 2, 2020
e5bceba
Merge branch 'master' of github.com:sfbrigade/data-covid19-sfbayarea …
May 2, 2020
7db78f7
organization Move data models to own folder
May 2, 2020
05dbee7
organization Replace tabs with spaces
May 7, 2020
998800b
sonoma Get top level metadata
May 7, 2020
3fcb13f
Merge branch 'master' of github.com:sfbrigade/data-covid19-sfbayarea …
May 9, 2020
ed855bd
sonoma Move scraper and collect metadata
May 10, 2020
fdab8a4
sonoma Add transmission types
May 10, 2020
8bd2081
sonoma Get cases, active, recovered, and death series
May 12, 2020
bd72db8
sonoma Get case data by age
May 12, 2020
ee5a8b7
sonoma Fix table numbers
May 16, 2020
7745e5b
sonoma Add test getter
May 16, 2020
e7ab26f
sonoma Factor out some common code
May 16, 2020
dc9b9fe
sonoma Add cases by race
May 16, 2020
af8bfe2
sonoma Add hospitalizations
May 17, 2020
adbe419
sonoma Add hospitalizations by gender
May 17, 2020
6b71193
sonoma Fix type error
May 17, 2020
627e82a
sonoma Redo definitions getter
May 17, 2020
a565a83
sonoma Add get_county function
May 17, 2020
358a441
sonoma Add docstrings
May 17, 2020
7dc3beb
sonoma Comment out hospitalizations by gender
May 17, 2020
6a4ead9
sonoma Add docstring for gender hospitalization
May 17, 2020
336e5ac
sonoma Remove unused variable
May 17, 2020
5297eeb
sonoma Replace findAll with find_all
May 19, 2020
5093fe3
sonoma Make newlines clearer
May 19, 2020
48dd3c1
sonoma Comment out hospitalizations
May 21, 2020
2a76315
Merge branch 'master' of github.com:sfbrigade/data-covid19-sfbayarea …
May 21, 2020
a8ce742
sonoma Use better date parser
May 21, 2020
c9f3500
sonoma Improve transform cases function
May 21, 2020
058a555
sonoma Fix date formats, table selection, and number parsing
May 21, 2020
fd4e135
sonoma Use custom int parse function
May 21, 2020
8310ca0
sonoma Create custom FormatError exception
May 21, 2020
148eec8
Merge branch 'master' of github.com:sfbrigade/data-covid19-sfbayarea …
May 23, 2020
ada9b2a
sonoma use template defaults for race
May 23, 2020
40b84e0
sonoma Fix test breakage
May 23, 2020
eb1a489
sonoma Use unique functions for age and gender
May 23, 2020
fdb2045
sonoma Transform age group names
May 23, 2020
1e1b0a8
sonoma Add error handling for gender and age transformations
May 23, 2020
1f3755a
sonoma Rename scraper file
May 23, 2020
96b81b5
sonoma Fix error handling for age
May 23, 2020
fb339b4
sonoma Fix typing errors
May 23, 2020
5d96031
sonoma Factor out getting section by title
May 28, 2020
fd09e5e
sonoma Correct deaths and cases aggregation
May 28, 2020
7770c89
sonoma Raise error for hospitalization change
May 28, 2020
6b4b69b
sonoma Add error for getting section by title
May 28, 2020
f1c7f05
sonoma Fix typing issue for age
May 28, 2020
5c9a9ed
sonoma Write parse table function
May 31, 2020
41d61c4
Fix typo
Jun 7, 2020
06163e2
sonoma Comment and typing fixes
Jun 7, 2020
ba6df28
Use raw string for regex
Jun 7, 2020
ad0e174
Merge branch 'sonoma' of github.com:sfbrigade/data-covid19-sfbayarea …
Jun 7, 2020
2bf3faf
sonoma Remove commented out code
Jun 7, 2020
bac1b5b
sonoma Remove unused variable
Jun 7, 2020
6a0ef8c
sonoma Add sonoma to init.py
Jun 8, 2020
15456e1
Merge branch 'master' of github.com:sfbrigade/data-covid19-sfbayarea …
Jun 17, 2020
329f92d
sonoma Correct conventions for sonoma
Jun 17, 2020
3822877
Merge branch 'master' of github.com:sfbrigade/data-covid19-sfbayarea …
Jul 30, 2020
f070125
Fix conflicts
Jul 30, 2020
d1aec84
Fix error import
Jul 30, 2020
1deaa9c
Merge branch 'master' of github.com:sfbrigade/data-covid19-sfbayarea …
Aug 5, 2020
869418a
Fix linter errors and import
Aug 6, 2020
6ef13b4
Add type aliases
Aug 8, 2020
5fdc2aa
Use get cell function for cases
ldtcooper Aug 8, 2020
aed862f
Remove data model readme from main readme
ldtcooper Aug 11, 2020
898672d
Add readme link
ldtcooper Aug 11, 2020
a549ea4
Refactor test and gender functions
ldtcooper Aug 13, 2020
97b72c1
Refactor all transform functions but cases
ldtcooper Aug 13, 2020
28df7be
Fix types
ldtcooper Aug 13, 2020
6ddf682
Add docstrings
ldtcooper Aug 13, 2020
4a92856
Use datetime attribute
ldtcooper Aug 13, 2020
4 changes: 2 additions & 2 deletions README.md
@@ -9,8 +9,8 @@ To install this project, you can simply run `./install.sh` in your terminal. Thi
## Running the scraper
To run the scraper, you can use the run script by typing `sh run_scraper.sh` into your terminal. This will enable the virtual environment and run `scraper.py`. Once again, the virtual environment will not stay active after the script finishes running. If you want to run the scraper without the run script, enable the virtual environment, then run `python3 scraper.py`.

## Running the API
The best way to run the API right now is to run the command `FLASK_APP="app.py" FLASK_ENV=development flask run;`. Note that this is not the best way to run the scraper at this time.
## Data Models
The data models are in JSON format and are located in the `data_models` directory. For more information, see the [data model readme](./data_models/README.md).

## Development

3 changes: 2 additions & 1 deletion covid19_sfbayarea/data/__init__.py
@@ -1,6 +1,7 @@
from typing import Dict, Any
from . import alameda
from . import san_francisco
from . import sonoma
from . import solano

scrapers: Dict[str, Any] = {
@@ -11,6 +12,6 @@
'san_francisco': san_francisco,
# 'san_mateo': None,
# 'santa_clara': None,
'sonoma': sonoma,
'solano': solano,
# 'sonoma': None,
}
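
Not part of this diff, but for orientation: a minimal sketch of how a caller could dispatch through the `scrapers` registry above to the newly registered sonoma module. The registry and sonoma's `get_county()` entry point come from this PR; the dispatch wrapper itself is only an illustrative assumption, not code from `scraper.py`.

```python
# Illustrative only: dispatching through the scrapers registry shown above.
# sonoma exposes get_county() (defined in sonoma.py below); this sketch assumes
# the other registered county modules follow the same interface.
from typing import Any, Dict

from covid19_sfbayarea.data import scrapers


def fetch_county(county: str) -> Dict[str, Any]:
    scraper = scrapers.get(county)
    if scraper is None:
        raise ValueError(f'No scraper registered for county: {county!r}')
    return scraper.get_county()


if __name__ == '__main__':
    # e.g. print when Sonoma County's source page was last updated
    print(fetch_county('sonoma')['update_time'])
```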
261 changes: 261 additions & 0 deletions covid19_sfbayarea/data/sonoma.py
@@ -0,0 +1,261 @@
import requests
import json
import re
import dateutil.parser
from typing import List, Dict, Union
from bs4 import BeautifulSoup, element # type: ignore
from ..errors import FormatError

TimeSeriesItem = Dict[str, Union[str, int]]
TimeSeries = List[TimeSeriesItem]
UnformattedSeriesItem = Dict[str, str]
UnformattedSeries = List[UnformattedSeriesItem]

def get_section_by_title(header: str, soup: BeautifulSoup) -> element.Tag:
"""
Takes in a header string and returns the parent element of that header
"""
header_tag = soup.find(lambda tag: tag.name == 'h3' and header in tag.get_text())
if not header_tag:
raise FormatError('The header "{0}" no longer corresponds to a section'.format(header))

return header_tag.find_parent()

def get_table(header: str, soup: BeautifulSoup) -> element.Tag:
"""
Takes in a header and a BeautifulSoup object and returns the table under
that header
"""
tables = get_section_by_title(header, soup).find_all('table')
# this lets us get the second cases table
return tables[-1]

def get_cells(row: element.ResultSet) -> List[str]:
"""
    Gets the text of all th and td cells within a single tr element
"""
return [el.text for el in row.find_all(['th', 'td'])]

def row_list_to_dict(row: List[str], headers: List[str]) -> UnformattedSeriesItem:
Review comment from a collaborator (a sketch of the suggested rename appears right after this function):
Minor nit: one of the important things about naming (or aliasing) types like this is to change how you conceptualize your values and functions (e.g. you shouldn’t be thinking of UnformattedSeriesItem like a shortcut for Dict[str, str] here; you should be thinking of it like a subclass of dict — it should conceptually be its own separate thing).

So if you’re changing the return type to something named UnformattedSeriesItem, it’s probably a good idea to change the function name to not talk about making a dict and instead call it something like row_list_to_series_item or something.

"""
Takes in a list of headers and a corresponding list of cells
and returns a dictionary associating the headers with the cells
"""
return dict(zip(headers, row))
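
Following up on the review comment above, here is a minimal sketch of the suggested rename. The name `row_list_to_series_item` comes from the comment itself; whether the rename was adopted in a later commit is not shown in this diff, so treat it as illustrative only.

```python
from typing import Dict, List

# Same alias as defined near the top of sonoma.py
UnformattedSeriesItem = Dict[str, str]


def row_list_to_series_item(row: List[str], headers: List[str]) -> UnformattedSeriesItem:
    """
    Associates one table row's cell text with its column headers,
    producing a single UnformattedSeriesItem.
    """
    # Identical behavior to row_list_to_dict, renamed to match the alias it returns
    return dict(zip(headers, row))
```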

def parse_table(tag: element.Tag) -> UnformattedSeries:
"""
Takes in a BeautifulSoup table tag and returns a list of dictionaries
where the keys correspond to header names and the values to corresponding cell values
"""
rows = tag.find_all('tr')
header = rows[0]
body = rows[1:]
header_cells = get_cells(header)
body_cells = (get_cells(row) for row in body)
return [row_list_to_dict(row, header_cells) for row in body_cells]

def parse_int(text: str) -> int:
"""
    Takes in a number in string form and returns it as an integer,
    treating a dash ('-') as zero
"""
text = text.strip()
if text == '-':
return 0
else:
return int(text.replace(',', ''))

def generate_update_time(soup: BeautifulSoup) -> str:
"""
    Returns the page's last-updated time (from the <time class="updated"> element) as an ISO 8601 timestamp string
"""
update_time_text = soup.find('time', {'class': 'updated'})['datetime']
try:
date = dateutil.parser.parse(update_time_text)
except ValueError:
        raise ValueError('Date is not in ISO 8601 '
                         f'format: "{update_time_text}"')
return date.isoformat()

def get_source_meta(soup: BeautifulSoup) -> str:
"""
    Finds the 'Definitions' section on the page and returns all of its text.
"""
definitions_section = get_section_by_title('Definitions', soup)
definitions_text = definitions_section.text
return definitions_text.replace('\n', '/').strip()

def transform_cases(cases_tag: element.Tag) -> Dict[str, TimeSeries]:
"""
    Takes in a BeautifulSoup tag for the cases table and returns daily and
    cumulative counts of cases and deaths in the form:
{ 'cases': [], 'deaths': [] }
Where each list contains dictionaries (representing each day's data)
of form (example for cases):
{ 'date': '', 'cases': -1, 'cumul_cases': -1 }
"""
cases = []
cumul_cases = 0
deaths = []
cumul_deaths = 0

rows = list(reversed(parse_table(cases_tag)))
Review comment from a collaborator:
Oh, forgot to comment on this in the review: I don’t think there’s any reason to convert this to a list, since you’re only iterating over it once and not returning it:

Suggested change
rows = list(reversed(parse_table(cases_tag)))
rows = reversed(parse_table(cases_tag))

for row in rows:
date = dateutil.parser.parse(row['Date']).date().isoformat()
new_infected = parse_int(row['New'])
dead = parse_int(row['Deaths'])

cumul_cases += new_infected
case_dict: TimeSeriesItem = { 'date': date, 'cases': new_infected, 'cumul_cases': cumul_cases }
cases.append(case_dict)

new_deaths = dead - cumul_deaths
cumul_deaths = dead
death_dict: TimeSeriesItem = { 'date': date, 'deaths': new_deaths, 'cumul_deaths': dead }
deaths.append(death_dict)

return { 'cases': cases, 'deaths': deaths }

def transform_transmission(transmission_tag: element.Tag) -> Dict[str, int]:
"""
Takes in a BeautifulSoup tag for the transmissions table and breaks it into
a dictionary of type:
{'community': -1, 'from_contact': -1, 'travel': -1, 'unknown': -1}
"""
transmissions = {}
rows = parse_table(transmission_tag)
# turns the transmission categories on the page into the ones we're using
transmission_type_conversion = {'Community': 'community', 'Close Contact': 'from_contact', 'Travel': 'travel', 'Under Investigation': 'unknown'}
for row in rows:
type = row['Source']
number = parse_int(row['Cases'])
if type not in transmission_type_conversion:
raise FormatError(f'The transmission type {type} was not found in transmission_type_conversion')
type = transmission_type_conversion[type]
transmissions[type] = number
return transmissions

def transform_tests(tests_tag: element.Tag) -> Dict[str, int]:
"""
Transform function for the tests table.
Takes in a BeautifulSoup tag for a table and returns a dictionary
"""
tests = {}
rows = parse_table(tests_tag)
for row in rows:
lower_res = row['Results'].lower()
tests[lower_res] = parse_int(row['Number'])
    return tests

def transform_gender(tag: element.Tag) -> Dict[str, int]:
"""
Transform function for the cases by gender table.
Takes in a BeautifulSoup tag for a table and returns a dictionary
in which the keys are strings and the values integers
"""
genders = {}
rows = parse_table(tag)
gender_string_conversions = {'Males': 'male', 'Females': 'female'}
for row in rows:
gender = row['Gender']
cases = parse_int(row['Cases'])
if gender not in gender_string_conversions:
raise FormatError('An unrecognized gender has been added to the gender table')
genders[gender_string_conversions[gender]] = cases
return genders

def transform_age(tag: element.Tag) -> TimeSeries:
"""
Transform function for the cases by age group table.
Takes in a BeautifulSoup tag for a table and returns a list of
dictionaries in which the keys are strings and the values integers
"""
categories: TimeSeries = []
rows = parse_table(tag)
for row in rows:
raw_count = parse_int(row['Cases'])
group = row['Age Group']
element: TimeSeriesItem = {'group': group, 'raw_count': raw_count}
categories.append(element)
return categories

def get_unknown_race(race_eth_tag: element.Tag) -> int:
"""
Gets the notes under the 'Cases by race and ethnicity' table to find the
number of cases where the person's race is unknown
"""
parent = race_eth_tag.parent
note = parent.find('p').text
matches = re.search(r'(\d+) \(\d{1,3}%\) missing race/ethnicity', note)
if not matches:
raise FormatError('The format of the note with unknown race data has changed')
    return parse_int(matches.groups()[0])

def transform_race_eth(race_eth_tag: element.Tag) -> Dict[str, int]:
"""
Takes in the BeautifulSoup tag for the cases by race/ethnicity table and
transforms it into an object of form:
'race_eth': {'Asian': -1, 'Latinx_or_Hispanic': -1, 'Other': -1, 'White':-1, 'Unknown': -1}
    NB: These are the only races reported separately by Sonoma County at this time
"""
race_cases = {
'Asian': 0,
'Latinx_or_Hispanic': 0,
'Other': 0,
'White': 0,
'Unknown': 0
}
race_transform = {'Asian/Pacific Islander, non-Hispanic': 'Asian', 'Hispanic/Latino': 'Latinx_or_Hispanic', 'Other*, non-Hispanic': 'Other', 'White, non-Hispanic': 'White'}
rows = parse_table(race_eth_tag)
for row in rows:
group_name = row['Race/Ethnicity']
cases = parse_int(row['Cases'])
if group_name not in race_transform:
            raise FormatError('The racial group {0} is new in the data -- please adjust the scraper accordingly'.format(group_name))
internal_name = race_transform[group_name]
race_cases[internal_name] = cases
race_cases['Unknown'] = get_unknown_race(race_eth_tag)
return race_cases

def get_table_tags(soup: BeautifulSoup) -> List[element.Tag]:
"""
Takes in a BeautifulSoup object and returns an array of the tables we need
"""
headers = ['Cases by Date', 'Test Results', 'Cases by Source', 'Cases by Age Group', 'Cases by Gender', 'Cases by Race']
return [get_table(header, soup) for header in headers]

def get_county() -> Dict:
"""
    Main function for populating the county data JSON
"""
url = 'https://socoemergency.org/emergency/novel-coronavirus/coronavirus-cases/'
# need this to avoid 403 error ¯\_(ツ)_/¯
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
page = requests.get(url, headers=headers)
page.raise_for_status()
sonoma_soup = BeautifulSoup(page.content, 'html5lib')

hist_cases, total_tests, cases_by_source, cases_by_age, cases_by_gender, cases_by_race = get_table_tags(sonoma_soup)

model = {
'name': 'Sonoma County',
'update_time': generate_update_time(sonoma_soup),
'source': url,
'meta_from_source': get_source_meta(sonoma_soup),
'meta_from_baypd': 'Racial "Other" category includes "Black/African American, American Indian/Alaska Native, and Other"',
'series': transform_cases(hist_cases),
'case_totals': {
'transmission_cat': transform_transmission(cases_by_source),
'age_group': transform_age(cases_by_age),
'race_eth': transform_race_eth(cases_by_race),
'gender': transform_gender(cases_by_gender)
},
'tests_totals': {
'tests': transform_tests(total_tests),
},
}
return model

if __name__ == '__main__':
print(json.dumps(get_county(), indent=4))