[FEATURE] Improve 50-a Data Collection #388

Open
DMalone87 opened this issue May 29, 2024 · 5 comments

@DMalone87
Collaborator

Is your feature request related to a problem? Please describe.
Currently, our 50-a scraper does not capture officer data properly. First, officers are not associated with their units: we collect the unit names, but we don't connect each officer to the unit(s) they have worked for. Second, complaints are not associated with officers: we collect the dispositions of complaints, but we don't link complaint data to individual officers.

Describe the solution you'd like
When scraping officer data from 50-a.org, make the following adjustments:

  • Include a list of complaint numbers associated with the officer.
  • Include the tax number for each officer.

This means an entry in the JSON output might change from this:

{"scraped_at": "2024-05-15 00:05:13", "url": "https://www.50-a.org/officer/TK8M", "name": "Benjamin F. Colecchia", "badge": "Badge #3490", "race": "White", "gender": "Male", "complaints": [{"name": "complaints", "count": 1}, {"name": "allegations", "count": 1}, {"name": "substantiated", "count": 0}, {"name": "Exonerated", "count": 1}], "age": null}
{"scraped_at": "2024-05-15 00:05:13", "url": "https://www.50-a.org/officer/7G3P", "name": "Ernesto Nieves", "badge": "Badge #4684", "race": "Hispanic", "gender": "Male", "complaints": [{"name": "complaints", "count": 2}, {"name": "allegations", "count": 2}, {"name": "substantiated", "count": 0}, {"name": "Complaint Withdrawn", "count": 1}, {"name": "Exonerated", "count": 1}], "age": "23"}

To this:

{"scraped_at": "2024-05-15 00:05:13", "url": "https://www.50-a.org/officer/TK8M", "name": "Benjamin F. Colecchia", "badge": "Badge #3490", "race": "White", "gender": "Male", "complaints": [9800290], "age": null, "taxnum": "918638"}
{"scraped_at": "2024-05-15 00:05:13", "url": "https://www.50-a.org/officer/7G3P", "name": "Ernesto Nieves", "badge": "Badge #4684", "race": "Hispanic", "gender": "Male", "complaints": [200410455, 200207742], "age": "23", "taxnum": "922871"}

When scraping command data, make the following adjustments:

  • Collect the officers who have worked for that command and each officer's most recent employment with that command.
  • Collect the commanding officer of each command.
  • Collect the official website URL for each command.
  • Collect the description and address of each command.

Therefore this:

{"scraped_at": "2024-05-15 14:17:28", "name": "24th Precinct", "url": "https://www.50-a.org/command/24pct"}

Will become this:

{"scraped_at": "2024-05-15 14:17:28", "name": "24th Precinct", "url": "https://www.50-a.org/command/24pct"}, "website_url": "https://www1.nyc.gov/site/nypd/bureaus/patrol/precincts/24th-precinct.page", "commanding_officer": "https://www.50-a.org/officer/KYGH", "address": "151 W 100th St, New York, NY 10025", "description": "The 24th Precinct is located on the Upper West Side of Manhattan and encompasses Manhattan Valley and a portion of Riverside Park. It is a residential and commercial community of multiple dwelling homes and one major housing development.", "officers": [{"url": "https://www.50-a.org/officer/WHJ5", "most_recent": 2024}, {"url": "https://www.50-a.org/officer/4JJ9", "most_recent": 2024}, {"url": "https://www.50-a.org/officer/J7Y3", "most_recent": 2023}]}

Additional context

DMalone87 added the enhancement and backend labels May 29, 2024
@aasnani

aasnani commented Jun 30, 2024

image

@aasnani

aasnani commented Jun 30, 2024

Since the officer data is included on their page as a CSV, it looks like the ingestion of officer data should be split into two parts: first, downloading the CSV and storing that data; second, scraping the data that isn't in the CSV, such as the list of complaint numbers, gender, age, url, etc. We could then either enrich the previously stored data in the ingestion layer, or incorporate pandas in the scraper repo and do the processing there to enrich the CSV and output a single JSONL file. I'm not sure which approach is better.

Page with the officer CSV: https://www.50-a.org/about
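To make the "enrich in the scraper repo" option concrete, here is a minimal sketch using only the standard library in place of pandas. It assumes both sources can be joined on the tax number; the CSV column name `tax_id` and the function name `enrich_officers` are hypothetical, since the actual CSV header has not been checked here.

```python
import csv
import io
import json


def enrich_officers(csv_text, scraped_jsonl, csv_key="tax_id", scraped_key="taxnum"):
    """Merge scraped fields into CSV rows sharing a tax number; return JSONL text.

    csv_key is an assumed column name, not the confirmed CSV header.
    """
    # Index the scraped records by tax number for constant-time lookup.
    scraped = {}
    for line in scraped_jsonl.splitlines():
        if line.strip():
            record = json.loads(line)
            scraped[str(record[scraped_key])] = record

    out = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        merged = dict(row)
        # Scraped values win on conflicts; CSV-only rows pass through unchanged.
        merged.update(scraped.get(row[csv_key], {}))
        out.append(json.dumps(merged))
    return "\n".join(out)
```

The same join could be done with a pandas merge; the trade-off is mostly whether the scraper repo should take on that dependency or leave enrichment to the ingestion layer.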

@DMalone87 you mentioned the complaints page but I'm not sure how to find it, could you link it here?

@DMalone87
Collaborator Author

@aasnani

aasnani commented Jul 21, 2024

PR Here: National-Police-Data-Coalition/police-data-trust-scrapers#17

EDIT: Closed, need to add tests. Will create another PR.

@aasnani

aasnani commented Jul 23, 2024
