PDAP metadata made easier to access and work with, beginning with data sources and agencies.
This mirrors our Data Sources what's that? from Airtable to our PostgreSQL database. Airtable is the entry point for our data, but using Postgres allows us to make an API that's decoupled from Airtable.
This script runs daily from the Automation Manager, using the Dockerfile and Jenkinsfile to pull the latest changes, install any requirements, and run the script from a container environment.
git clone https://github.com/Police-Data-Accessibility-Project/data-sources-mirror.git
cd data-sources-mirror
If you don't already have virtualenv, install the package:
pip install virtualenv
Then run the following command to create a virtual environment:
virtualenv -p python3.9 mirror_venv
source mirror_venv/bin/activate
pip install -r requirements.txt
5. Create a file named .env in the same directory containing secrets for DO_DATABASE_URL, AIRTABLE_BASE_ID, and AIRTABLE_TOKEN
The app should have a DO_DATABASE_URL, AIRTABLE_BASE_ID, and AIRTABLE_TOKEN for PDAP's Data Sources DigitalOcean. Reach out to [email protected] or make noise in Discord if you'd like access to these keys.
DO_DATABASE_URL=postgres://data_sources_app:<password>@db-postgresql-nyc3-38355-do-user-8463429-0.c.db.ondigitalocean.com:25060/defaultdb
AIRTABLE_TOKEN=<airtable_token>
AIRTABLE_BASE_ID=<airtable_base_id>
python3 mirror.py
Hourly
We're taking advantage of a Python wrapper around the Airtable API to provide flat data joinable on an agency's airtable_uid
. The script is scheduled to run via crontab on the Automation Manager droplet in DigitalOcean, on an hourly basis. On each run, the latest code from the main branch of the repo is fetched and current requirements are installed in order to make sure the script is up to date on the droplet. During execution, only rows from each table that are new or have been updated in the last two hours are fetched to update our Postgres database.
The intent is to do as little transformation as we can, erring on the side of being true to what the API wrapper returns, while also being transparent about how we're doing things.
Some caveats:
- As with all data, this is human-entered and human-devised and subject to imperfections. Corrections and suggestions and questions welcome.
- We'd love to have more fields filled out, or new records added, related to any agencies or data sources.