Commit
🗺 Include GeoNames to verify admin boundaries
- Download cities with +15,000 inhabitants from GeoNames
- Import GeoNames cities into Postgres
- Disable automatic builds for draft PRs
- Match GeoName city names with OSM admin boundaries
- Add city name to POI list based on newly created `cities` table
- Additional Google category tags from scraping Finland
- Materialize some sub-queries as tables to improve building time of POI model
1 parent 44d73c2, commit 99f3f21
Showing 18 changed files with 1,479 additions and 42 deletions.
@@ -0,0 +1,8 @@
import pandas


def txt_to_csv(file_path):
    read_file = pandas.read_csv(
        file_path, delimiter="\t", header=None, low_memory=False
    )
    read_file.to_csv(file_path.replace(".txt", ".csv"), index=None, header=False)
9 changes: 9 additions & 0 deletions
.../core/database/transformer/dbt/macros/admin_boundaries/get_all_versions_of_city_names.sql
@@ -0,0 +1,9 @@
{% macro get_all_versions_of_city_names(country_code) %}
    {% set query %}
        SELECT DISTINCT geoname_id, unnest(ascii_name || alternate_names) AS name
        FROM admin_boundary_geonames_cities
        WHERE country_code = '{{ country_code }}'
    {% endset %}

    {{ return(query) }}
{% endmacro %}
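To make the effect of `unnest(ascii_name || alternate_names)` concrete, here is a minimal Python sketch of the same expansion: the ASCII name is prepended to the list of alternate names, and each spelling becomes its own `(geoname_id, name)` row. The function name, the dictionary layout, and the sample row (including its ID) are assumptions made for illustration; they are not part of the commit.

```python
def all_versions_of_city_names(rows, country_code):
    """Return distinct (geoname_id, name) pairs for one country.

    Hypothetical stand-in for the macro's query; `rows` mimics records of
    admin_boundary_geonames_cities as plain dictionaries.
    """
    seen = set()
    versions = []
    for row in rows:
        if row["country_code"] != country_code:
            continue  # mirrors WHERE country_code = '...'
        # ascii_name || alternate_names: prepend the ASCII name to the array
        for name in [row["ascii_name"], *row["alternate_names"]]:
            pair = (row["geoname_id"], name)
            if pair not in seen:  # mirrors SELECT DISTINCT
                seen.add(pair)
                versions.append(pair)
    return versions


# Made-up sample rows, for illustration only.
sample_rows = [
    {"geoname_id": 1, "country_code": "DE", "ascii_name": "Koln",
     "alternate_names": ["Köln", "Cologne", "Koln"]},
    {"geoname_id": 2, "country_code": "FI", "ascii_name": "Helsinki",
     "alternate_names": ["Helsingfors"]},
]
german_names = all_versions_of_city_names(sample_rows, "DE")
```

Unlike this sketch, the SQL query does not guarantee any row order.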
14 changes: 14 additions & 0 deletions
...tabase/transformer/dbt/macros/admin_boundaries/match_admin_boundaries_with_city_names.sql
@@ -0,0 +1,14 @@
{% macro match_admin_boundaries_with_city_names(country_code) %}
    {% set all_versions_of_city_names = get_all_versions_of_city_names(country_code) %}

    {% set query %}
        SELECT id, geoname_id, levenshtein(ab.name, avocn.name) AS levenshtein_distance
        FROM admin_boundary AS ab LEFT JOIN ({{ all_versions_of_city_names }}) AS avocn ON
            ab.name = avocn.name OR
            unaccent(ab.name) = avocn.name OR
            ab.name LIKE avocn.name || '%' OR
            unaccent(ab.name) LIKE avocn.name || '%'
    {% endset %}

    {{ return(query) }}
{% endmacro %}
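The join condition and the distance column of this macro can be sketched in plain Python. This is only an approximation under stated assumptions: `strip_accents` uses Unicode decomposition, which is not identical to the dictionary-based `unaccent` function in Postgres, and `str.startswith` stands in for `LIKE name || '%'` without treating `%` or `_` inside the city name as wildcards. All function names are hypothetical.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit costs for insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def strip_accents(text: str) -> str:
    """Rough stand-in for unaccent(): drop combining marks after NFKD."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def names_match(boundary_name: str, city_name: str) -> bool:
    """Mirror the four OR-ed join conditions of the macro."""
    plain = strip_accents(boundary_name)
    return (
        boundary_name == city_name
        or plain == city_name
        or boundary_name.startswith(city_name)
        or plain.startswith(city_name)
    )
```

Because the prefix conditions also accept boundaries such as "Frankfurt am Main" for the city name "Frankfurt", the distance is what later separates close matches from loose ones.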
10 changes: 10 additions & 0 deletions
kuwala/core/database/transformer/dbt/models/admin_boundaries/cities.sql
@@ -0,0 +1,10 @@
SELECT city_name, geometry
FROM {{ ref('city_candidates') }}
INNER JOIN (
    SELECT city_name AS city_name_best_match, MIN(min_levenshtein_distance) AS min_levenshtein_distance_best_match
    FROM {{ ref('city_candidates') }}
    GROUP BY city_name
) AS best_city_candidates ON
    city_candidates.city_name = best_city_candidates.city_name_best_match AND
    city_candidates.min_levenshtein_distance = best_city_candidates.min_levenshtein_distance_best_match
WHERE st_contains(geometry, st_setsrid(st_makepoint(candidate_longitude, candidate_latitude), 4326))
22 changes: 22 additions & 0 deletions
kuwala/core/database/transformer/dbt/models/admin_boundaries/city_candidates.sql
@@ -0,0 +1,22 @@
SELECT
    abgc.ascii_name AS city_name,
    ab.name AS admin_boundary_name,
    min_levenshtein_distance,
    ab.geometry AS geometry,
    abgc.latitude AS candidate_latitude,
    abgc.longitude AS candidate_longitude
FROM (
    SELECT id, geoname_id, levenshtein_distance
    FROM ({{ match_admin_boundaries_with_city_names(var('country')) }}) AS mabwcn
    WHERE geoname_id IS NOT NULL AND levenshtein_distance < 10
) AS matched_cities
INNER JOIN (
    SELECT id AS id_best_match, MIN(levenshtein_distance) AS min_levenshtein_distance
    FROM ({{ match_admin_boundaries_with_city_names(var('country')) }}) AS mabwcn
    WHERE geoname_id IS NOT NULL AND levenshtein_distance < 10
    GROUP BY id
) AS matched_cities_min_levenshtein_distances ON
    matched_cities.id = matched_cities_min_levenshtein_distances.id_best_match AND
    matched_cities.levenshtein_distance = matched_cities_min_levenshtein_distances.min_levenshtein_distance
LEFT JOIN admin_boundary AS ab USING (id)
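The self-join in this model keeps, for every admin boundary, only the matched city names with the smallest Levenshtein distance, after discarding unmatched rows and distances of 10 or more. A minimal Python sketch of that selection follows; the tuple layout `(boundary_id, geoname_id, distance)`, the function name, and the sample values are assumptions for illustration.

```python
def best_matches(matches, max_distance=10):
    """Keep the closest city match(es) per admin boundary.

    `matches` is an iterable of (boundary_id, geoname_id, distance) tuples,
    a hypothetical stand-in for the rows produced by the matching macro.
    """
    # WHERE geoname_id IS NOT NULL AND levenshtein_distance < 10
    kept = [m for m in matches if m[1] is not None and m[2] < max_distance]

    # MIN(levenshtein_distance) ... GROUP BY id
    best = {}
    for boundary_id, _, distance in kept:
        best[boundary_id] = min(distance, best.get(boundary_id, distance))

    # INNER JOIN on id and distance; ties are kept, as in the SQL
    return [m for m in kept if m[2] == best[m[0]]]


# Made-up sample rows, for illustration only.
sample_matches = [
    (1, 100, 0), (1, 101, 4),   # boundary 1: two candidates, 100 is closer
    (2, None, None),            # boundary 2: no city matched (LEFT JOIN miss)
    (3, 102, 12),               # boundary 3: match too distant
    (4, 103, 2), (4, 104, 2),   # boundary 4: tie, both rows survive
]
closest = best_matches(sample_matches)
```

Ties are one reason the downstream `cities` model filters again, by distance per city name and by checking that the GeoNames coordinates fall inside the boundary geometry.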
36 changes: 36 additions & 0 deletions
kuwala/core/database/transformer/dbt/models/admin_boundaries/schema.yml
@@ -0,0 +1,36 @@

version: 2

models:
  # Cities
  - name: cities
    description: 'List of matched city names from GeoNames against OSM admin boundaries'
    columns:
      - name: city_name
        description: 'City name from GeoNames'
        tests:
          - not_null
      - name: geometry
        description: 'Geometry based on matched admin boundary'
        tests:
          - not_null
  # City candidates
  - name: city_candidates
    description: 'List of matched city names from GeoNames against OSM admin boundaries'
    columns:
      - name: city_name
        description: 'City name from GeoNames'
        tests:
          - not_null
      - name: admin_boundary_name
        description: 'Name of matched admin boundary'
        tests:
          - not_null
      - name: min_levenshtein_distance
        description: 'Levenshtein distance of GeoNames city name and admin boundary name'
        tests:
          - not_null
      - name: geometry
        description: 'Geometry of matched admin boundary'
        tests:
          - not_null
8 changes: 3 additions & 5 deletions
kuwala/core/database/transformer/dbt/models/poi/poi_address_city.sql
@@ -1,5 +1,3 @@
-SELECT poi_id, name, id
-FROM admin_boundary AS ab, {{ ref('poi_matched') }} AS poi
-WHERE
-    kuwala_admin_level = (SELECT max(kuwala_admin_level) FROM admin_boundary) AND
-    st_contains(ab.geometry, st_setsrid(st_makepoint(poi.longitude, poi.latitude), 4326))
+SELECT DISTINCT poi_id, city_name
+FROM {{ ref('cities') }} AS cities, {{ ref('poi_matched') }} AS poi
+WHERE st_contains(cities.geometry, st_setsrid(st_makepoint(poi.longitude, poi.latitude), 4326))
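The new query assigns a city name to each POI whose coordinates lie inside a city geometry. As a rough illustration of that containment join, the sketch below uses a ray-casting test on a single polygon ring. This is a deliberate simplification and not what PostGIS does internally: `st_contains` handles multipolygons, holes, and boundary cases that this sketch ignores. All names and sample coordinates are hypothetical.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test for a point against one closed ring of (lon, lat)."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def assign_cities(pois, cities):
    """Return distinct (poi_id, city_name) pairs, like SELECT DISTINCT."""
    pairs = set()
    for poi_id, lon, lat in pois:
        for city_name, ring in cities:
            if point_in_polygon(lon, lat, ring):
                pairs.add((poi_id, city_name))
    return pairs


# Made-up geometry and POIs, for illustration only.
square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
assigned = assign_cities(
    pois=[("poi_a", 1.0, 1.0), ("poi_b", 3.0, 1.0)],
    cities=[("Squareville", square)],
)
```

The nested loop mirrors the implicit cross join in the `FROM` clause; a real database would typically rely on a spatial index instead.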
@@ -1,3 +1,5 @@
 pandas==1.4.1
 pyspark==3.2.1
-Shapely==1.8.0
+requests==2.28.1
+Shapely==1.8.0
+tqdm==4.64.0