Add Sonoma Data Scraper #57

Merged
merged 70 commits into from
Aug 18, 2020
Changes from 61 commits
10f0dfe
organization Merge CDM readme into readme
May 2, 2020
e5bceba
Merge branch 'master' of github.com:sfbrigade/data-covid19-sfbayarea …
May 2, 2020
7db78f7
organization Move data models to own folder
May 2, 2020
05dbee7
organization Replace tabs with spaces
May 7, 2020
998800b
sonoma Get top level metadata
May 7, 2020
3fcb13f
Merge branch 'master' of github.com:sfbrigade/data-covid19-sfbayarea …
May 9, 2020
ed855bd
sonoma Move scraper and collect metadata
May 10, 2020
fdab8a4
sonoma Add transmission types
May 10, 2020
8bd2081
sonoma Get cases, active, recovered, and death series
May 12, 2020
bd72db8
sonoma Get case data by age
May 12, 2020
ee5a8b7
sonoma Fix table numbers
May 16, 2020
7745e5b
sonoma Add test getter
May 16, 2020
e7ab26f
sonoma Factor out some common code
May 16, 2020
dc9b9fe
sonoma Add cases by race
May 16, 2020
af8bfe2
sonoma Add hospitalizations
May 17, 2020
adbe419
sonoma Add hospitalizations by gender
May 17, 2020
6b71193
sonoma Fix type error
May 17, 2020
627e82a
sonoma Redo definitions getter
May 17, 2020
a565a83
sonoma Add get_county function
May 17, 2020
358a441
sonoma Add docstrings
May 17, 2020
7dc3beb
sonoma Comment out hospitalizations by gender
May 17, 2020
6a4ead9
sonoma Add docstring for gender hospitalization
May 17, 2020
336e5ac
sonoma Remove unused variable
May 17, 2020
5297eeb
sonoma Replace findAll with find_all
May 19, 2020
5093fe3
sonoma Make newlines clearer
May 19, 2020
48dd3c1
sonoma Comment out hospitalizations
May 21, 2020
2a76315
Merge branch 'master' of github.com:sfbrigade/data-covid19-sfbayarea …
May 21, 2020
a8ce742
sonoma Use better date parser
May 21, 2020
c9f3500
sonoma Improve transform cases function
May 21, 2020
058a555
sonoma Fix date formats, table selection, and number parsing
May 21, 2020
fd4e135
sonoma Use custom int parse function
May 21, 2020
8310ca0
sonoma Create custom FormatError exception
May 21, 2020
148eec8
Merge branch 'master' of github.com:sfbrigade/data-covid19-sfbayarea …
May 23, 2020
ada9b2a
sonoma use template defaults for race
May 23, 2020
40b84e0
sonoma Fix test breakage
May 23, 2020
eb1a489
sonoma Use unique functions for age and gender
May 23, 2020
fdb2045
sonoma Transform age group names
May 23, 2020
1e1b0a8
sonoma Add error handling for gender and age transformations
May 23, 2020
1f3755a
sonoma Rename scraper file
May 23, 2020
96b81b5
sonoma Fix error handling for age
May 23, 2020
fb339b4
sonoma Fix typing errors
May 23, 2020
5d96031
sonoma Factor out getting section by title
May 28, 2020
fd09e5e
sonoma Correct deaths and cases aggregation
May 28, 2020
7770c89
sonoma Raise error for hospitalization change
May 28, 2020
6b4b69b
sonoma Add error for getting section by title
May 28, 2020
f1c7f05
sonoma Fix typing issue for age
May 28, 2020
5c9a9ed
sonoma Write parse table function
May 31, 2020
41d61c4
Fix typo
Jun 7, 2020
06163e2
sonoma Comment and typing fixes
Jun 7, 2020
ba6df28
Use raw string for regex
Jun 7, 2020
ad0e174
Merge branch 'sonoma' of github.com:sfbrigade/data-covid19-sfbayarea …
Jun 7, 2020
2bf3faf
sonoma Remove commented out code
Jun 7, 2020
bac1b5b
sonoma Remove unused variable
Jun 7, 2020
6a0ef8c
sonoma Add sonoma to init.py
Jun 8, 2020
15456e1
Merge branch 'master' of github.com:sfbrigade/data-covid19-sfbayarea …
Jun 17, 2020
329f92d
sonoma Correct conventions for sonoma
Jun 17, 2020
3822877
Merge branch 'master' of github.com:sfbrigade/data-covid19-sfbayarea …
Jul 30, 2020
f070125
Fix conflicts
Jul 30, 2020
d1aec84
Fix error import
Jul 30, 2020
1deaa9c
Merge branch 'master' of github.com:sfbrigade/data-covid19-sfbayarea …
Aug 5, 2020
869418a
Fix linter errors and import
Aug 6, 2020
6ef13b4
Add type aliases
Aug 8, 2020
5fdc2aa
Use get cell function for cases
ldtcooper Aug 8, 2020
aed862f
Remove data model readme from main readme
ldtcooper Aug 11, 2020
898672d
Add readme link
ldtcooper Aug 11, 2020
a549ea4
Refactor test and gender functions
ldtcooper Aug 13, 2020
97b72c1
Refactor all transforn functions but cases
ldtcooper Aug 13, 2020
28df7be
Fix types
ldtcooper Aug 13, 2020
6ddf682
Add docstrings
ldtcooper Aug 13, 2020
4a92856
Use datetime attribute
ldtcooper Aug 13, 2020
153 changes: 153 additions & 0 deletions README.md
@@ -12,6 +12,159 @@ To run the scraper, you can use the run script by typing `sh run_scraper.sh` int
## Running the API
The best way to run the API right now is to run the command `FLASK_APP="app.py" FLASK_ENV=development flask run;`. Note that this is not the best way to run the scraper at this time.

## Data Model
The following sections document the differences between counties that we expect to see in the common data model (see the `data_models` directory) as we begin to get data from them.

### Ages

Please make sure to use the following age brackets for the different counties. Note that the brackets may also vary by whether you are scraping cases or deaths data:


#### San Francisco
##### Cases
"age": [
{"group": "18_and_under", "raw_count": -1 },
{"group": "18_to_30", "raw_count": -1 },
{"group": "31_to_40", "raw_count": -1 },
{"group": "41_to_50", "raw_count": -1 },
{"group": "51_to_60", "raw_count": -1 },
{"group": "61_to_70", "raw_count": -1 },
{"group": "71_to_80", "raw_count": -1 },
{"group": "81_and_older", "raw_count": -1}
]
##### Deaths
Data broken down by age is not available in the JSON files, only on the dashboard.


#### Alameda
##### Cases
"age": [
{"group": "18_and_under", "raw_count": -1 },
{"group": "18_to_30", "raw_count": -1 },
{"group": "31_to_40", "raw_count": -1 },
{"group": "41_to_50", "raw_count": -1 },
{"group": "51_to_60", "raw_count": -1 },
{"group": "61_to_70", "raw_count": -1 },
{"group": "71_to_80", "raw_count": -1 },
{"group": "81_and_older", "raw_count": -1 },
{"group": "Unknown", "raw_count": -1 }
]
##### Deaths
Data broken down by age is not available.


#### Sonoma
##### Cases
"age": [
{"group": "0_to_17", "raw_count": -1 },
{"group": "18_to_49", "raw_count": -1 },
{"group": "50_to_64", "raw_count": -1 },
{"group": "65_and_older", "raw_count": -1 },
{"group": "Unknown", "raw_count": -1 }
]
##### Deaths
Data broken down by age is not available.
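
As a rough illustration of how a scraper might map a county's own age labels onto the group names listed above, here is a minimal sketch. The raw labels (`'0-17'`, `'65+'`, and so on), the `SONOMA_AGE_GROUPS` table, and the `transform_age` function are all assumptions made for this example; they are not taken from the Sonoma scraper in this pull request.

```python
from typing import Dict, List, Union

# Hypothetical raw labels: the labels actually shown on the county page
# are an assumption made for this sketch.
SONOMA_AGE_GROUPS: Dict[str, str] = {
    '0-17': '0_to_17',
    '18-49': '18_to_49',
    '50-64': '50_to_64',
    '65+': '65_and_older',
    'Unknown': 'Unknown',
}

AgeEntry = Dict[str, Union[str, int]]


def transform_age(raw_counts: Dict[str, int]) -> List[AgeEntry]:
    """Convert {raw label: count} into the data model's list of age entries."""
    entries: List[AgeEntry] = []
    for label, count in raw_counts.items():
        if label not in SONOMA_AGE_GROUPS:
            # An unknown label most likely means the source page changed.
            raise ValueError(f'Unexpected age group label: {label!r}')
        entries.append({'group': SONOMA_AGE_GROUPS[label], 'raw_count': count})
    return entries
```

Failing loudly on an unrecognized label, rather than silently dropping it, keeps a change in the county's brackets from going unnoticed.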


#### Santa Clara
##### Cases
"age": [
{"group": "20_and_under", "raw_count": -1 },
{"group": "21_to_30", "raw_count": -1 },
{"group": "31_to_40", "raw_count": -1 },
{"group": "41_to_50", "raw_count": -1 },
{"group": "51_to_60", "raw_count": -1 },
{"group": "61_to_70", "raw_count": -1 },
{"group": "71_to_80", "raw_count": -1 },
{"group": "81_to_90", "raw_count": -1 },
{"group": "90_and_older", "raw_count": -1 },
{"group": "Unknown", "raw_count": -1 }
]
##### Deaths
"age": [
{"group": "20_and_under", "raw_count": -1 },
{"group": "21_to_30", "raw_count": -1 },
{"group": "31_to_40", "raw_count": -1 },
{"group": "41_to_50", "raw_count": -1 },
{"group": "51_to_60", "raw_count": -1 },
{"group": "61_to_70", "raw_count": -1 },
{"group": "71_to_80", "raw_count": -1 },
{"group": "81_to_90", "raw_count": -1 },
{"group": "90_and_older", "raw_count": -1 }
]


#### San Mateo
##### Cases
"age": [
{"group": "0_to_19", "raw_count": -1 },
{"group": "20_to_29", "raw_count": -1 },
{"group": "30_to_39", "raw_count": -1 },
{"group": "40_to_49", "raw_count": -1 },
{"group": "50_to_59", "raw_count": -1 },
{"group": "60_to_69", "raw_count": -1 },
{"group": "70_to_79", "raw_count": -1 },
{"group": "80_to_89", "raw_count": -1 },
{"group": "90_and_older", "raw_count": -1 }
]
##### Deaths
"age": [
{"group": "0_to_19", "raw_count": -1 },
{"group": "20_to_29", "raw_count": -1 },
{"group": "30_to_39", "raw_count": -1 },
{"group": "40_to_49", "raw_count": -1 },
{"group": "50_to_59", "raw_count": -1 },
{"group": "60_to_69", "raw_count": -1 },
{"group": "70_to_79", "raw_count": -1 },
{"group": "80_to_89", "raw_count": -1 },
{"group": "90_and_older", "raw_count": -1 }
]


#### Contra Costa
##### Cases
"age": [
{"group": "0_to_20", "raw_count": -1 },
{"group": "21_to_40", "raw_count": -1 },
{"group": "41_to_60", "raw_count": -1 },
{"group": "61_to_80", "raw_count": -1 },
{"group": "81_to_100", "raw_count": -1 }
]
##### Deaths
Data broken down by age is not available.


#### Marin
##### Cases and Deaths
"age": [
{"group": "0_to_18", "raw_count": -1 },
{"group": "19_to_34", "raw_count": -1 },
{"group": "35_to_49", "raw_count": -1 },
{"group": "50_to_64", "raw_count": -1 },
{"group": "65_and_older", "raw_count": -1 }
]



#### Solano
##### Cases and Deaths
"age": [
{"group": "0_to_18", "raw_count": -1 },
{"group": "19_to_64", "raw_count": -1 },
{"group": "65_and_older", "raw_count": -1 }
]


#### Napa
##### Cases
"age": [
{"group": "0_to_17", "raw_count": -1 },
{"group": "18_to_49", "raw_count": -1 },
{"group": "50_to_64", "raw_count": -1 },
{"group": "Over_64", "raw_count": -1 }
]
##### Deaths
Data broken down by age is not available.
## Development

We use CircleCI to lint the code and run tests in this repository, but you can (and should!) also run tests locally.
3 changes: 3 additions & 0 deletions covid19_sfbayarea/data/__init__.py
@@ -1,6 +1,7 @@
from typing import Dict, Any
from . import alameda
from . import san_francisco
from . import sonoma
from . import solano

scrapers: Dict[str, Any] = {
@@ -11,6 +12,8 @@
'san_francisco': san_francisco,
# 'san_mateo': None,
# 'santa_clara': None,
# 'solano': None,
'sonoma': sonoma,
'solano': solano,
# 'sonoma': None,
}
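
A registry like `scrapers` lets a caller pick a county's scraper by name. The sketch below shows one way that lookup might work. The `run_scraper` function is hypothetical, the `sonoma` object is a stand-in for the real module, and the value returned by `get_county` is invented for illustration (the commit list mentions a `get_county` function, but its actual output is not shown here).

```python
from types import SimpleNamespace
from typing import Any, Dict

# Stand-in for the real `sonoma` module; the returned dictionary is an
# invented placeholder, not the scraper's actual output.
sonoma = SimpleNamespace(get_county=lambda: {'name': 'Sonoma County'})

scrapers: Dict[str, Any] = {
    'sonoma': sonoma,
}


def run_scraper(county: str) -> Dict[str, Any]:
    """Look up a county's scraper in the registry and run it (hypothetical)."""
    try:
        scraper = scrapers[county]
    except KeyError:
        raise ValueError(f'No scraper registered for {county!r}') from None
    return scraper.get_county()
```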
7 changes: 7 additions & 0 deletions covid19_sfbayarea/data/format_error.py
@@ -0,0 +1,7 @@
class FormatError(Exception):
    """
    A custom error to raise whenever a scraper runs into something in an
    unexpected format. This usually means that the website the scraper is
    accessing has changed
    """
    pass
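
To show how an exception like this might be used, here is a minimal sketch of a number parser that raises it when a scraped cell is not a number. The `parse_int` helper is hypothetical: the commit list mentions a custom int parse function, but its real name and behavior are not shown in this diff. `FormatError` is redefined locally only so the example is self-contained.

```python
class FormatError(Exception):
    """Raised when scraped content is not in the expected format."""


def parse_int(text: str) -> int:
    """Hypothetical helper: parse a count such as '1,234' into an int."""
    # Drop surrounding whitespace and thousands separators before parsing.
    cleaned = text.strip().replace(',', '')
    try:
        return int(cleaned)
    except ValueError:
        # Convert the generic error into one that points at the data source.
        raise FormatError(f'Expected a whole number, got {text!r}') from None
```

A caller can then catch `FormatError` separately from other failures and treat it as a signal that the county's page layout has probably changed.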