[FEATURE] Improve 50-a Data Collection #388
Comments
So with the inclusion of the officer data on their page, it looks like the ingestion of officer data should be split into two parts: first, downloading the CSV and storing that data; second, scraping the other data that isn't in the CSV, such as the list of complaint numbers, gender, age, URL, etc. We could then enrich the previously stored data in the ingestion layer, or incorporate pandas in the scraper repo and do some data processing there to enrich the CSV and output a single JSONL file. Not sure which approach is better. Page with the officer CSV: https://www.50-a.org/about @DMalone87 you mentioned the complaints page but I'm not sure how to find it, could you link it here?
Sure thing!
PR Here: National-Police-Data-Coalition/police-data-trust-scrapers#17 EDIT: Closed, need to add tests. Will create another PR.
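For reference, the "enrich the CSV and output a single JSONL file" idea from the first comment could look roughly like the sketch below. It is only an illustration under assumptions: the join key `officer_id`, the column `last_name`, the sample rows, and the helper names `enrich_officers` and `to_jsonl` are all hypothetical and not taken from the 50-a CSV or the scraper repo. It uses the standard library to stay self-contained; a pandas merge on the same key would serve the same purpose.

```python
import csv
import io
import json


def enrich_officers(csv_text, scraped_records, key="officer_id"):
    """Merge scraped fields into the CSV rows that share the same key.

    `key` is a hypothetical join column; the real CSV may use another name.
    """
    # Index the scraped records by key for constant-time lookup.
    scraped_by_key = {rec[key]: rec for rec in scraped_records}
    enriched = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Rows without a scraped counterpart are kept unchanged.
        extra = scraped_by_key.get(row[key], {})
        enriched.append({**row, **extra})
    return enriched


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(rec) for rec in records)


# Hypothetical sample data, for illustration only.
csv_text = "officer_id,last_name\nA1,Smith\nB2,Jones\n"
scraped = [
    {
        "officer_id": "A1",
        "url": "https://example.org/officer/A1",
        "complaints": ["c1", "c2"],
    }
]

records = enrich_officers(csv_text, scraped)
print(to_jsonl(records))
```

Doing the merge in the scraper repo keeps the output a single file, while doing it in the ingestion layer keeps the scraper simpler; the sketch applies to either location.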
Is your feature request related to a problem? Please describe.
Currently, our 50-a Scraper does not properly capture officer data. First, officers are not being associated with their Units. We are collecting the unit names, but we aren't taking the step of connecting each officer to the unit(s) that they've worked for. Second, we aren't properly collecting the complaints associated with each officer. We are collecting the dispositions of the complaints, but we aren't associating complaint data with individual officers.
Describe the solution you'd like
When scraping officer data from 50-a.org, make the following adjustments:
This means an entry in the JSON output might change from this:
To this:
When scraping command data, make the following adjustments:
Therefore this:
{"scraped_at": "2024-05-15 14:17:28", "name": "24th Precinct", "url": "https://www.50-a.org/command/24pct"}
Will become this:
{"scraped_at": "2024-05-15 14:17:28", "name": "24th Precinct", "url": "https://www.50-a.org/command/24pct", "website_url": "https://www1.nyc.gov/site/nypd/bureaus/patrol/precincts/24th-precinct.page", "commanding_officer": "https://www.50-a.org/officer/KYGH", "address": "151 W 100th St, New York, NY 10025", "description": "The 24th Precinct is located on the Upper West Side of Manhattan and encompasses Manhattan Valley and a portion of Riverside Park. It is a residential and commercial community of multiple dwelling homes and one major housing development.", "officers": [{"url": "https://www.50-a.org/officer/WHJ5", "most_recent": 2024}, {"url": "https://www.50-a.org/officer/4JJ9", "most_recent": 2024}, {"url": "https://www.50-a.org/officer/J7Y3", "most_recent": 2023}]}
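As a rough sketch of how the scraper might assemble the expanded command record, the snippet below merges the extra fields into the existing one. The field names come from the example in this issue; the function name `build_command_record` and the way the inputs are split between `details` and `officers` are assumptions for illustration, not the scraper's actual code. The description is shortened here for brevity.

```python
import json


def build_command_record(base, details, officers):
    """Return a new command record with extra details and an officer list.

    `base` is the record the scraper already emits; `details` and
    `officers` are hypothetical inputs holding the newly scraped data.
    """
    record = dict(base)      # copy so the original record is not mutated
    record.update(details)   # website_url, commanding_officer, address, ...
    record["officers"] = [
        {"url": o["url"], "most_recent": o["most_recent"]} for o in officers
    ]
    return record


base = {
    "scraped_at": "2024-05-15 14:17:28",
    "name": "24th Precinct",
    "url": "https://www.50-a.org/command/24pct",
}
details = {
    "website_url": "https://www1.nyc.gov/site/nypd/bureaus/patrol/precincts/24th-precinct.page",
    "commanding_officer": "https://www.50-a.org/officer/KYGH",
    "address": "151 W 100th St, New York, NY 10025",
    "description": "The 24th Precinct is located on the Upper West Side of Manhattan.",
}
officers = [
    {"url": "https://www.50-a.org/officer/WHJ5", "most_recent": 2024},
    {"url": "https://www.50-a.org/officer/J7Y3", "most_recent": 2023},
]

command = build_command_record(base, details, officers)
line = json.dumps(command)  # one JSONL line per command
print(line)
```

Serializing with `json.dumps` rather than building the string by hand guarantees that every emitted line is a single valid JSON object.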
Additional context