Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Sonoma Data Scraper #57

Merged
merged 70 commits into from
Aug 18, 2020
Merged

Add Sonoma Data Scraper #57

merged 70 commits into from
Aug 18, 2020

Conversation

ldtcooper
Copy link
Collaborator

No description provided.

@ldtcooper ldtcooper marked this pull request as draft May 17, 2020 01:57
@ldtcooper
Copy link
Collaborator Author

The code is mostly done now. At this point, I'm just doing some clean-up and clarification, hence the draft status.
I know my implementation is somewhat repetitive, and I'm open to suggestions to fix that. I played around with the idea of having a generic function to get the rows and cells from a tag, loop through them, and take a function to do any extra cleaning/transformation on them, but that made things more complicated and hard-to-read IMO.

@ldtcooper ldtcooper marked this pull request as ready for review May 17, 2020 04:06
@ldtcooper
Copy link
Collaborator Author

It looks like it does. I'll try to fix those conflicts and linter errors tonight

@ldtcooper
Copy link
Collaborator Author

Looks like all the linter and merge problems have been fixed @Mr0grog @elaguerta

README.md Outdated Show resolved Hide resolved
import dateutil.parser
from typing import List, Dict, Union
from bs4 import BeautifulSoup, element # type: ignore
from format_error import FormatError # type: ignore
Copy link
Collaborator

@Mr0grog Mr0grog Jun 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is broken for me. If I run ./run_scraper_data.sh sonoma or python scraper_data.py sonoma, I get:

Traceback (most recent call last):
  File "scraper_data.py", line 4, in <module>
    from covid19_sfbayarea import data as data_scrapers
  File "/Users/rbrackett/Dev/sfbrigade/data-covid19-sfbayarea/covid19_sfbayarea/data/__init__.py", line 4, in <module>
    from . import sonoma
  File "/Users/rbrackett/Dev/sfbrigade/data-covid19-sfbayarea/covid19_sfbayarea/data/sonoma.py", line 8, in <module>
    from format_error import FormatError
ModuleNotFoundError: No module named 'format_error'

Does it work for you? Seems like it should be:

Suggested change
from format_error import FormatError # type: ignore
from .format_error import FormatError

(It also shouldn’t need need the # type: ignore comment.)


Also, while you’re here, maybe it’s worthwhile to make this a shared module up in covid19_sfbayarea/errors.py instead of covid19_sfbayarea/data/format_error.py? Some of the news scrapers also use an identical class, and we should really only have one implementation. Then this line would be:

Suggested change
from format_error import FormatError # type: ignore
from ..errors import FormatError

@elaguerta
Copy link
Collaborator

elaguerta commented Jul 7, 2020

In addition to the HTML tables, they are also using ArcGIS dashboards. I found some endpoints:

README.md Outdated Show resolved Hide resolved
covid19_sfbayarea/data/__init__.py Outdated Show resolved Hide resolved
covid19_sfbayarea/data/format_error.py Outdated Show resolved Hide resolved
covid19_sfbayarea/data/sonoma.py Outdated Show resolved Hide resolved
covid19_sfbayarea/data/sonoma.py Outdated Show resolved Hide resolved
covid19_sfbayarea/data/sonoma.py Outdated Show resolved Hide resolved
covid19_sfbayarea/data/sonoma.py Outdated Show resolved Hide resolved
covid19_sfbayarea/data/sonoma.py Outdated Show resolved Hide resolved
covid19_sfbayarea/data/sonoma.py Outdated Show resolved Hide resolved
covid19_sfbayarea/data/sonoma.py Outdated Show resolved Hide resolved
@ldtcooper ldtcooper requested a review from Mr0grog August 8, 2020 21:05
Copy link
Collaborator

@Mr0grog Mr0grog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just calling out these notes from the last review, which didn’t seem to be addressed one way or the other. Not sure if you missed them or if you’re disagreeing (which is also fine; it just helps to be clear about it):

@ldtcooper
Copy link
Collaborator Author

I did just miss those. Let me take a look

@ldtcooper ldtcooper requested a review from Mr0grog August 13, 2020 03:10
@ldtcooper
Copy link
Collaborator Author

Okay, I think I addressed those three points

Copy link
Collaborator

@Mr0grog Mr0grog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left one “food for thought” note inline, but this looks good to me overall. 👍

"""
return [el.text for el in row.find_all(['th', 'td'])]

def row_list_to_dict(row: List[str], headers: List[str]) -> UnformattedSeriesItem:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: one of the important things about naming (or aliasing) types like this is to change how you conceptualize your values and functions (e.g. you shouldn’t be thinking of UnformattedSeriesItem like a shortcut for Dict[str, str] here; you should be thinking of it like a subclass of dict — it should conceptually be its own separate thing).

So if you’re changing the return type to something named UnformattedSeriesItem, it’s probably a good idea to change the function name to not talk about making a dict and instead call it something like row_list_to_series_item or something.

deaths = []
cumul_deaths = 0

rows = list(reversed(parse_table(cases_tag)))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, forgot to comment on this in the review: I don’t think there’s any reason to convert this to a list, since you’re only iterating over it once and not returning it:

Suggested change
rows = list(reversed(parse_table(cases_tag)))
rows = reversed(parse_table(cases_tag))

@ldtcooper ldtcooper merged commit 8fed7ed into master Aug 18, 2020
@ldtcooper ldtcooper deleted the sonoma branch August 18, 2020 23:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants