CivicPlusSite needs better handling for names of downloaded files #96

Open
zstumgoren opened this issue Aug 7, 2021 · 0 comments
Labels
bug Something isn't working

zstumgoren commented Aug 7, 2021

A test scrape of Belvedere, CA, covering roughly June through early August, generated less-than-helpful names for downloaded files:

/tmp/civic_scraper/
├── assets
│   ├── civicplus_www_06142021-577_agenda.html
│   ├── civicplus_www_06142021-577_agenda.pdf
│   ├── civicplus_www_06142021-577_agenda_packet.pdf
│   ├── civicplus_www_06142021-577_minutes.pdf
│   ├── civicplus_www_06152021-578_agenda.html
<<< snipped >>>

This appears to stem from our handling of the meeting_id variable, which is used in Asset.download to generate the file name.

We need to debug this for the locale, adopt an alternate convention for standardizing file names in CivicPlusSite (and generally), or both.

An ideal solution would be storing file artifacts based on a combination of place, agency, date of meeting, committee type, document type and document format (i.e. the file suffix). For example:

# Note, place may need more careful handling
/tmp/civic_scraper/assets/ca_belvedere/20210604_city_council_agenda_packet.pdf
/tmp/civic_scraper/assets/ca_belvedere/20210604_city_council_agenda_packet.html
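A minimal sketch of how such a name could be assembled, assuming the state, place, meeting date, committee and document type are all available for a given asset. The helpers `slugify` and `build_asset_path` are hypothetical and not part of civic-scraper; agency is left out here because the example paths above don't include it.

```python
import re
from datetime import date
from pathlib import Path


def slugify(text: str) -> str:
    """Lowercase text and collapse runs of non-alphanumerics into underscores."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def build_asset_path(base_dir, state, place, meeting_date, committee, doc_type, suffix):
    """Hypothetical helper: build <base>/<state>_<place>/<date>_<committee>_<doc_type>.<suffix>."""
    place_dir = f"{slugify(state)}_{slugify(place)}"
    ext = suffix.lstrip(".").lower()
    file_name = f"{meeting_date:%Y%m%d}_{slugify(committee)}_{slugify(doc_type)}.{ext}"
    return Path(base_dir) / place_dir / file_name


path = build_asset_path(
    "/tmp/civic_scraper/assets",
    "CA",
    "Belvedere",
    date(2021, 6, 4),
    "City Council",
    "agenda_packet",
    "pdf",
)
print(path.as_posix())
```

Platforms that can't supply one of these fields would need their own fallback, e.g. a placeholder segment or a platform-specific builder.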

We likely won't have all of this information available for every platform, so we may need platform-specific solutions.

Or we can go in a totally different direction and just generate unique names based on a file hash, then use asset metadata (e.g. stored in the metadata CSV) to link each file to its unique name.
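A rough sketch of the hash-based option, assuming the downloaded bytes are in memory when the name is chosen. The function names and the metadata columns (`url`, `asset_type`, `file_name`) are hypothetical and not the actual civic-scraper CSV schema; SHA-256 is just one reasonable choice of hash.

```python
import csv
import hashlib
import io


def hashed_asset_name(content: bytes, suffix: str) -> str:
    """Hypothetical helper: name a file by the SHA-256 hex digest of its bytes."""
    digest = hashlib.sha256(content).hexdigest()
    return f"{digest}.{suffix.lstrip('.').lower()}"


def metadata_row(content: bytes, suffix: str, url: str, asset_type: str) -> dict:
    """Build one metadata record linking the source URL to the hashed file name."""
    return {
        "url": url,
        "asset_type": asset_type,
        "file_name": hashed_asset_name(content, suffix),
    }


# Write a single illustrative record to an in-memory CSV.
row = metadata_row(b"%PDF-1.4 example bytes", "pdf", "https://example.com/agenda", "agenda")
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["url", "asset_type", "file_name"])
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

Identical files get identical names under this scheme, so repeat downloads deduplicate naturally, at the cost of names that mean nothing without the metadata file.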

@zstumgoren zstumgoren added the bug Something isn't working label Aug 7, 2021