LMGTDFY is a web-based utility to catalog all open data file formats found on a given domain name. It finds CSV, XML, JSON, XLS, XLSX, XML, and Shapefiles, and makes the resulting inventory available for download as a CSV file. It does this using Bing’s API.
This is intended for people who need to inventory all data files on a given domain name—these are generally employees of state and municipal government, who are creating an open data repository, and performing the initial step of figuring out what data is already being emitted by their government.
LMGTDFY powers U.S. Open Data’s LMGTDFY site, but anybody can install the software and use it to create their own inventory. You might want to do this if you have more than 2,000 data files on your site. U.S. Open Data’s LMGTDFY site caps the number of results at 2,000, in order to avoid winding up with an untenably large invoice for using Bing’s API. (Microsoft allows 5,000 searches/month for free, but has kindly provided U.S. Open Data with a substantial sponsorship for this service, in the form of API credits.)
In Ubuntu/Debian—on RedHat-based platforms, use yum instead. This will automatically start the RabbitMQ messaging service when the server starts.
sudo apt-get install rabbitmq-server build-essential autoconf libtool pkg-config python-dev libxml2-dev libxslt1-dev zlib1g-dev
Within the webroot directory (e.g., /var/www/example.com/
), install the Python dependencies.
sudo pip install -r requirements.txt
That’s the Django WSGI web server we'll be running, which Nginx or Apache will proxy to.
sudo pip install gunicorn
Set up your site's config file. The default site config is in opendata/
, and an example settings file is at settings_example.py
. Rename it to settings.py
and edit the config options to suit your purposes. At a minimum, you must sign up for and provide a Bing API key.
cd opendata
mv settings_example.py settings.py
pico settings.py
You will probably want to edit TIME_ZONE
and SEARCH_PER_QUERY
(the maximum number of results to get from Bing, per site), and you have to provie your Bing API key as BING_KEY
. If you’re deploying a live site, you’ll want to set DEBUG
to False
.
Follow the prompts as appropriate to set up the database. This will also set up the superuser you use to log into the site’s /admin
page.
python manage.py syncdb
Make sure to specify the PID file, so you can stop it again later.
gunicorn -D -p gunicorn.pid --access-logfile access.log --error-logfile error.log opendata.wsgi
```
### Launch Celery
This will manage the search API requests. Use screen to run this `screen` to keep it running in the background, and reattach to Celery to stop it with Ctrl-C. The `--autoscale` option specifies how many (and how few) copies of Celery you want running at one time—basically, how many people who think are going to be trying to request data on your site at once. `10,2` means that it allows a maximum of 10, and always has at least 2 instances running.
```
celery -A opendata worker -l info --autoscale=10,2
```
### Proxy it through Apache/Nginx
Pass all requests through to Gunicorn, except for any requests to `static/`. For instance, in Apache, you'd set up something like this:
```
ProxyPass /static !
ProxyPass / http://localhost:8000/
ProxyPassReverse / http://localhost:8000/
```
### Final configuration
By default, no domains are whitelisted—you have to specify which TLDs or domains that you want to permit people to get data for. That's because Bing API queries cost money, so you might want to restrict it. Log in to `http://example.com/admin` (using your site’s domain name) and add any TLDs (e.g., `gov`) or domain names that you want to allow site visitors to index.
### Also
Note that if you need to stop the currently running Gunicorn instance, you can run:
```
kill `cat gunicorn.pid`
```
### Finally
You'll probably want to set up Gunicorn and Celery to run continuously, and to start and stop along with your server. Here are [instructions on daemonizing Gunicorn](https://gunicorn-docs.readthedocs.org/en/latest/deploy.html) and [instructions on daemonizing Celery](https://celery.readthedocs.org/en/latest/tutorials/daemonizing.html).
## Colophon
This tool was made by [Ted Han](https://github.com/knowtheory) and [SVSG](http://svsg.co/) for [U.S. Open Data](https://usopendata.org/).