-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathguide.txt
566 lines (452 loc) · 21.3 KB
/
guide.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
---
title: Database Setup Guide
author: Ben Hancock
lang: en-US
---
Overview
========
This document covers retrieval of the data to build the Ancient Places
Country Search database, and setting up the database itself. This
project uses PostgreSQL (version 14 or later is assumed) and the
PostGIS extension. This tutorial assumes that you are setting up the
database yourself, rather than using a managed service.
If you are viewing this guide online, you may also wish to view the
code repository for additional materials and context. It can be found at:
<https://github.com/benghancock/ancient-places>
Retrieving the Data
===================
As noted in the project README, this project utilizes data from two
main sources: [Pleiades] for data on archaelogical sites, and
geographical data from [Natural Earth]. This section covers how to
retrieve and prepare the data prior to setting up the database that
will eventually store it for use with the application.
Pleiades is a public repository or “gazetteer” of geographic
information about the ancient world. It offers its data in a variety
of formats, all of which are available at this URL:
<https://pleiades.stoa.org/downloads>
For this project, we will use the [GIS package], and specifically the
`"places*"` tables that it contains. For easy retrieval of the
necessary tables, use the `fetch_pleiades_places.sh` script in the
`scripts/` directory, found in the root of this project.
Here is a useful excerpt about "places" from the Pleiades website:
> Pleiades places are the primary organizational construct of the
> gazetteer. They are conceptual entities: the term "place" applies to
> any locus of human attention, material or intellectual, in a
> real-world geographic context.
Note: The document mentions that places are "entirely abstract, conceptual
entities [...] objects of thought, speech, or writing, not tangible,
mappable points on the earth’s surface." This may introduce some
wrinkles into our project further down the line, but we'll visit those
when we get to them.
Running this script will leave three files in the working directory:
* `places.csv`: Pleiades Places.
* `place_types.csv`: terms from the Place Types
* `places_place_types.csv`: matches place ids (join to places.csv:id)
to placetype ids (join to place_types.csv:key).
Once we've retrieved these, we're ready for the next step.
[Pleiades]: https://pleiades.stoa.org/
[Natural Earth]: https://www.naturalearthdata.com/
[GIS package]: https://atlantides.org/downloads/pleiades/gis/
Setting up the Database
=======================
Configuring Roles
-----------------
When starting out configuring your database, it's a good idea to think
about your security model -- even if this is just a hobby project with
public data. Again, in this example I will assume that you are setting
up the database yourself, and that we will be running both the
Postgres server and the application on the same host. If you are using
a managed service in the cloud, this will almost certainly not be the
case, so your mileage may vary.
Our security model will be pretty basic: we will create one role to
administer the database (create the tables, etc.), and another role
for the application with `SELECT`-only privileges to read the tables
or views that it needs.
For this document, I will leave aside how to install and set up the
database server itself. This likely varies depending on your operating
system anyway; I built this project on a local installation of Fedora
Linux on my laptop, so if you’re in a similar environment, you may
find this tutorial helpful:
<https://docs.fedoraproject.org/en-US/quick-docs/postgresql/>
Once you have your database server installed and running, connect using the
`psql` command line tool. You will first need to to this as the Postgres
"superuser", named `postgres`. Below is an example session; note that the
statements that were executed are echoed after they succeed:
```
$ sudo -u postgres psql
# ... prompted for password ...
psql (14.3)
Type "help" for help.
postgres=# CREATE DATABASE archaia;
CREATE DATABASE
postgres=# CREATE ROLE archaia_admin NOINHERIT;
CREATE ROLE
postgres=# ALTER DATABASE archaia OWNER TO archaia_admin;
ALTER DATABASE
postgres=# GRANT archaia_admin TO {your user here};
GRANT ROLE
```
Here we are making use of Postgres' [role-based permissions model], by
creating a database and a dedicated admin role for that database that
has no other special permissions. In the final statement, we make our
user (that is, the username you use on your operating system) a
*member* of the `archaia_admin` group, so that we can administer this
database from our regular user account without creating a new special
password. We could also make others members of this group, and remove
them as appropriate.
Now, we are ready to exit our superuser session and log back in:
```
postgres=# \q
$ psql -d archaia
psql (14.3)
Type "help" for help.
archaia=>
```
[role-based permissions model]: https://www.postgresql.org/docs/14/user-manag.html
Creating Tables and Loading in Data
-----------------------------------
Once we have done this, we can now create the necessary tables to hold
the "places" data we fetched from Pleiades. Please refer to the file
`create_places_tables.sql` in the ``sql/`` directory of this project.
With the tables created, we can now load data into them using the
`COPY` statement, or `psql`'s analogous `\copy` command (the latter of
which typically bypasses permissions issues).
```
archaia=> \copy places from '/path/to/places.csv' with (format csv, header);
COPY 38953
archaia=> \copy places_types from '/path/to/place_types.csv' (format csv, header);
COPY 180
archaia=> \copy places_place_types from '/path/to/places_place_types.csv' (format csv, header);
ERROR: insert or update on table "places_place_types" violates foreign key constraint "places_place_types_place_type_fkey"
DETAIL: Key (place_type)=(quarry-group) is not present in table "places_types".
```
In our attempt to load data into the last table, we've run into an error; it
looks like although we have a place with a "quarry-group" type, that type
isn't in our "places_types" table. We can dig into this issue with a few
shell commands on the CSV data.
First, let's see how many records this impacts:
```
$ cat places_place_types.csv | grep 'quarry-group' | wc -l
3
```
Ok, just three records -- not bad. We can drop those, or maybe we can find
a close-enough type that would be appropriate?
```
$ cat place_types.csv | grep '^quarry'
quarry,quarry,A quarry as defined by the Getty Art and Architecture Thesaurus: Open-air excavations from which stone for building or other purposes is or has been obtained by cutting or blasting.
```
That seems close enough. Let's update those three records in our data file
and try loading again:
```
$ sed 's/quarry-group/quarry/' places_place_types.csv > places_place_types1.csv
$ fg
psql -d archaia
archaia=> \copy places_place_types from '/path/to/places_place_types1.csv' (format csv, header);
ERROR: insert or update on table "places_place_types" violates foreign key constraint "places_place_types_place_type_fkey"
DETAIL: Key (place_type)=(labeled feature) is not present in table "places_types".
```
Darn, we've hit another missing key. This one affects a higher number
of records (more than 250), and there's nothing obviously analogous in
the `place_types.csv` file. So let's drop the constraint, and move on:
```
archaia=> ALTER TABLE places_place_types
archaia-> DROP CONSTRAINT places_place_types_place_type_fkey;
ALTER TABLE
archaia=> \copy places_place_types from '/home/bgh/projects/code/ancient-places/places_place_types1.csv' (format csv, header);
COPY 43963
```
Now that we're set up, let's test our table setup with a query. (At
this point, you may want to consider switching to a tool like
[`pgadmin`] for running queries and viewing the output.)
``` sql
SELECT
places.title,
places.id,
places_place_types.place_type,
places_types.definition
FROM places
LEFT JOIN places_place_types
ON places.id = places_place_types.place_id
LEFT JOIN places_types
ON places_place_types.place_type = places_types.key
ORDER BY places.title ASC
LIMIT 200;
```
Creating PostGIS 'Geometry' Columns
===================================
Next, we'll want to make sure that we have the PostGIS extension for
Postgres installed and enabled; this will allow us to work with the
geographical data in the Pleiades dataset. You can enable the
extension using the `CREATE EXTENSION` statement.
The first time I tried this, though, I ran into an error:
```
ancient_places=> CREATE EXTENSION postgis;
ERROR: could not open extension control file "/usr/share/pgsql/extension/postgis.control": No such file or directory
```
I found that several standalone packages were provided for PostGIS
by Fedora's package manager, and installed them:
```
$ sudo dnf install postgis postgis-docs postgis-upgrade postgis-utils
```
Now, running the create extension statement works, though note that
you must be the database superuser (by default, `postgres`) in order
to successfully execute it. To check the version of PostGIS installed,
execute this statement:
``` sql
SELECT postgis_full_version();
```
Throughout this document, I'm running PostGIS 3.2.2.
The next thing for us to do is to create columns for the data using one
of PostGIS' supported geospatial data types. There are two main types
to choose from: `geography` and `geometry`. The PostGIS documentation
on [why to choose one over the other] is helpful on this topic, but to briefly
paraphrase: the `geography` type is appropriate for geographically dispersed
data, whereas `geometry` is generally appropriate for more geographically
compact data. That said, PostGIS offers some useful functions for dealing
with `geometry` data, and casting from one to the other is trivial.
::: Note
**A Word (or Two) About Spatial Reference Systems**
One important concept when working with spatial data is the idea of
*coordinates systems*, or *spatial reference systems*. Explained
simply, these are systems that humans can use to reflect the
location of places on the earth on a map. As expressed more artfully
by the [PostGIS documentation](https://www.postgis.net/workshops/postgis-intro/projection.html):
> The earth is not flat, and there is no simple way of putting it down
> on a flat paper map (or computer screen), so people have come up
> with all sorts of ingenious solutions, each with pros and cons. Some
> projections preserve area, so all objects have a relative size to
> each other; other projections preserve angles (conformal) like the
> Mercator projection; some projections try to find a good
> intermediate mix with only little distortion on several
> parameters. Common to all projections is that they transform the
> (spherical) world onto a flat Cartesian coordinate system, and which
> projection to choose depends on how you will be using the data.
The thing to underline here is that, when working with geographic
data, you should know which spatial reference system it uses. In
PostGIS and other GIS systems, these are referred by their spatial
reference identifier (SRID). More on that here:
<https://postgis.net/workshops/postgis-intro/loading_data.html#srid-26918-what-s-with-that>
Setting the appropriate SRID ensures that when using spatial functions
to calculate distance, etc., your results will be correct. This is
even more important when working with data that utilizes different
spatial reference systems. The most common SRID for geopgraphic
coordinates is SRID 4326, which corresponds to
[“longitude/latitude on the WGS84 spheroid”](https://www.postgis.net/workshops/postgis-intro/projection.html#transforming-data).
Luckily that's what all of our data uses throughout these exercises.
:::
We will use the representative lat/lon coordinates in our `places` table
in order to construct a column of the `geography` data type.
``` sql
ALTER TABLE places
ADD COLUMN repr_geog geography(POINT, 4326);
UPDATE places
SET repr_geog =
ST_SetSRID(
ST_MakePoint(representative_longitude, representative_latitude),
4326)::geography;
```
[why to choose one over the other]: https://www.postgis.net/workshops/postgis-intro/geography.html#why-not-use-geography
Importing Data from Natural Earth
=================================
Now that we have our data on archaeological sites in the shape we want
it, we need to import data on modern-day geopolitical borders in order
to support our goal of enabling search by country. For this, I found
the "Cultural Vectors" data from Natural Earth to fit the bill:
<https://www.naturalearthdata.com/downloads/10m-cultural-vectors/>
Specifically, I downloaded the "Admin 0 - Countries" dataset. Country
boundaries are, of course, a political construct and may be fluid or
disputed. To this point, Natural Earth notes that it "shows de facto
boundaries by default according to who controls the territory, versus
de jure." (It also offers "point-of-view" border data.)
This dataset is offered as a [shapefile], and so to import it into
Postgres, we'll want to use a tool called `shp2pgsql`. This is a
command-line tool that is probably included with the `postgis-client`
package in your package manager. This tool has a lot of command-line
options, and it's worth exploring them by running `shp2pgsql -?`. But
for our use-case, no flags are necessary:
```
$ wget 'https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_0_countries.zip'
$ unzip ne_10m_admin_0_countries.zip
$ shp2pgsql ne_10m_admin_0_countries.shp countries_political | psql -d archaia
```
You'll see a lot of output fly by in your terminal as the data is
loaded into a new table in the `archaia` database, called
`countries_political`. Back in your query tool (either pgadmin or
psql), you can get information on this new table like so:
``` sql
SELECT column_name
FROM information_schema.columns
WHERE table_name = 'countries_political';
```
There are *a lot* of columns in this dataset, but there are really
only two that we need to be concerned with for now: `sovereignt`
and `geom`, which are the names of the countries (in English) and
their geometries, respectively. Take a quick look:
``` sql
SELECT sovereignt, geom
FROM countries_political
ORDER BY sovereignt ASC
LIMIT 10;
```
We're now almost ready to do a join with our Pleiades `places` table, in
order to place the sites in the modern-day borders where they can now
be found. But a couple additional considerations need to be made first.
First, we should consider that these places may often *not* fall
neatly within current-day geopolitical borders. It's always possible
that they could fall _on_ a border, and we would like to at least try
to accomodate this scenario.
This means that our places may have a one-to-many relationship with
our countries data, and we should choose a join method that would
support this. Furthermore, in line with the Pleiades note that places
are often "conceptual" rather than actual mappable points on the
earth, it follows that many "places" will actually not have
coordinates at all, and thus won't fall into a "country" as such --
though they may have cultural ties to actual, geographic locations.
In practical terms, this means that we probably want to use the
PostGIS function [`ST_Intersects`] to perform our join, rather than
[`ST_Contains`], since we will want to join places on all of the
countries that they may touch. Even this is not perfect ; Pleiades'
cautions that often the coordinates given may just be the centroid of
a very large bounding box covering a large area. Some of the places
are also along roads or routes that traverse long distances.
Before we give this a try, we'll also want to create indexes on our
spatial data columns, in order to make our query reasonably
performant, and also make sure that we set the SRID for the `geom`
column from the Natural Earth Data.
``` sql
CREATE INDEX countries_geom_idx ON countries_political
USING GIST (geom);
CREATE INDEX places_repr_geog_idx ON places
USING GIST (repr_geog);
```
Now we're ready to join our tables in a way that provides all the data
we need for our service.
[shapefile]: https://en.wikipedia.org/wiki/Shapefile
[`ST_Intersects`]: https://postgis.net/docs/manual-dev/en/ST_Intersects.html
[`ST_Contains`]: https://postgis.net/docs/manual-dev/en/ST_Contains.html
Optimizing Our Data for Querying
================================
Our full query will be a bit complex, since we need to perform
multiple joins and also cast our `geography` type to a `geometry` type
in order to use the proper function. Here we go:
``` sql
SELECT
cp.sovereignt as country_name,
pt.place_id,
pt.place_name,
pt.pleiades_uri,
pt.place_type,
pt.place_type_def,
pt.descrip,
pt.repr_lon,
pt.repr_lat
FROM (
SELECT
places.id as place_id,
places.repr_geog as place_geog,
places.title as place_name,
places.description as descrip,
places.uri as pleiades_uri,
places.representative_longitude as repr_lon,
places.representative_latitude as repr_lat,
places_place_types.place_type,
places_types.definition as place_type_def
FROM places LEFT JOIN places_place_types
ON places.id = places_place_types.place_id
LEFT JOIN places_types
ON places_place_types.place_type = places_types.key
ORDER BY places.title ASC
) as pt
LEFT JOIN
countries_political cp
ON ST_Intersects(
cp.geom,
pt.place_geog::geometry
)
WHERE pt.place_geog IS NOT NULL
ORDER BY cp.sovereignt ASC;
```
On my laptop, this query takes just over a minute to complete, which is
not terrible. But this is a complex query we don't want to have to
re-type often, and if we're going to be querying *this* data, we'll want
to have it go much quicker. So let's create a [materialized view] of
this result, and then put an index on the "country_name" column:
``` sql
CREATE MATERIALIZED VIEW IF NOT EXISTS countries_places AS
-- ... long query above goes here
CREATE INDEX country_name_lower_idx ON countries_places ((lower(country_name)));
```
Now, a query like this runs in about 100ms, bringing back ~2,800 records.
``` sql
SELECT * FROM countries_places
WHERE lower(country_name) LIKE 'greece';
```
Pretty neat!
That concludes the database setup portion of the documentation. Next,
we'll work on building the application that exposes a search API for
our database.
[materialized view]: https://www.postgresql.org/docs/current/sql-creatematerializedview.html
Appendix A: Adjusting Results By Using WKT
==========================================
The approach above works pretty well, but there are some issues. A
closer inspection of our new materialized view reveals that we
actually have more than a couple thousand "places" _without_ any known
country -- even when we factor out those that have no
coordinates. That doesn't seem right.
There seem to be a couple things going on here. One is that the place
coordinates in the GIS data I retrieved were not always precise (this
appears to be either a fluke affecting some Pleiades records or, more
likely, an error in how I handled them). The other is that some of our
sites land very close to bodies of water, and thus don't seem to fall
within the boundaries of the country borders in the Natural Earth
shapefile data. NE does provide maritime boundary data, but utilizing
that was out of scope for this project.
The other, more fundamental problem, though, is that some of the
places in the data are huge; they are regions that overlap many
modern-day countries, or they are routes that cross many modern-day
borders. Saying that a "place" lands in a particular country because
that's where the centroid of the bounding box is seems pretty
arbitrary.
To remedy this, we can use the ["Well-known text"] (WKT)
representation of the places that is present in the Pleiades data when
performing our join. This has the additional benefit that we don't
need to create a new geometry column in our table for the
representative lat/lon; we can just cast the WKT string that is
already in the data, using the PostGIS [ST_GeomFromText] function.
This means we'll change the `JOIN` clause in our SQL query to create
the view to the following:
```sql
-- snip ...
LEFT JOIN
countries_political cp
ON ST_Intersects(
ST_SetSRID(cp.geom, 4326),
ST_GeomFromText(pt.bounding_box_wkt, 4326)
)
WHERE pt.place_geog IS NOT NULL
ORDER BY cp.sovereignt ASC;
```
Note how I've also explicitly set the SRID for both geometries. This
should ensure we get more correct, consistent results.
Without indexes on these columns, this query would take a long time to
complete; it's better update the tables first, create indexes, and
then run it again for performance.
```sql
ALTER TABLE places ADD COLUMN bb_geom geometry;
UPDATE places SET bb_geom = ST_GeomFromText(places.bounding_box_wkt, 4326);
CREATE INDEX places_bb_geom_idx ON places
USING GIST (bb_geom);
UPDATE countries_political
SET geom = ST_SetSRID(geom, 4326);
REINDEX INDEX countries_geom_idx;
```
It's still not *fast*, but it does complete -- in about 2 minutes on
my laptop. Plus, that's the point of the materialized view. To see the
full updated query, see the file `create_view.sql` in the `sql/`
directory of this repository.
This approach is also still not perfect: it brings back almost 4,000
places without a matching country, often for reasons that were not
clear to me on inspection. But it is rational, and takes advantage of
the richer WKT data available from Pleiades.
["Well-known text"]: https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry
[ST_GeomFromText]: https://postgis.net/docs/en/ST_GeomFromText.html