# MWM size prediction model

The web application uses a data-science-based model to predict the size of MWM files
compiled from borders. Described here are the efforts undertaken to build such a
prediction model. The serialized model resides at `web/app/data/` in the
`model.pkl` and `scaler.pkl` files. Its first variant was trained only on
county-level data and is valid only within a limited parameter range (see
`web/app/config.py` for the model limitations). Now we are trying to extend the
model to also predict province-level regions.

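For orientation, here is a minimal sketch of how such pickled artifacts are
typically consumed at prediction time. The feature names, order and values below
are illustrative assumptions only, not the actual layout baked into `model.pkl`
(scikit-learn-style objects are also an assumption):

```python
import pickle

# Load the serialized scaler and model (assumed to be scikit-learn-style objects).
with open('web/app/data/scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)
with open('web/app/data/model.pkl', 'rb') as f:
    model = pickle.load(f)

# Hypothetical feature row for one region; the real feature names, order and
# units are defined by the training code, not here.
features = [[
    1234.5,    # land_area
    25,        # city_cnt
    800_000,   # city_pop
    110,       # hamlet_cnt
    50_000,    # hamlet_pop
]]

predicted_mwm_size = model.predict(scaler.transform(features))[0]
```
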
## Data gathering

We chose countries/regions with dense OSM data and took them as the training
dataset. As a first try, Germany, Austria, Belgium and the Netherlands were taken,
giving about 950 borders of different admin levels. The sample was found to
be too small for good training.
Then Norway, Switzerland, Ile-de-France (France), Japan, the United Kingdom, Belarus,
and four states of the United States (California, Texas, New York, Washington)
were added.

#### Geographic data gathering

First, with the help of the web app, I split the aforementioned countries/regions
down to "county"-level subregions, in the general sense of a "county": an admin
level that is too small for MWMs, while regions one level higher are too big,
so that a usual MWM would be a cluster of "counties".

Japan was a special case, so the `extract_mwm_geo_data.py` script contains a
dedicated function to split that country into subregions.

The `extract_mwm_geo_data.py` script, given a valid connection
to the database with borders, gathers information about all borders of a
given country/region and its descendants: id, parent_id, admin level, name,
full area, land area (so a table with the land borders of the planet is necessary),
city/town count and population, hamlet/village count and population.

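For illustration, one record of the resulting `<Country>_regions.json` file
might look like the following. The key is the negated OSM relation id (matching
the poly-file naming convention used below), the field names follow the columns
consumed by `combine_data.py`, and the concrete values are invented:

```python
# A hypothetical entry of Japan_regions.json, keyed by region id:
region_record = {
    "-3565917": {
        "parent_id": -1234567,  # invented id of the enclosing region
        "al": 7,                # admin level
        "name": "Numata",
        "full_area": 443.5,     # sq. km, includes water
        "land_area": 441.2,     # sq. km
        "city_cnt": 1,
        "city_pop": 45000,
        "hamlet_cnt": 12,
        "hamlet_pop": 6000,
    }
}
```
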
One should keep in mind that some borders may be absent in OSM, so a region may
not be fully covered by its subregions. Hence a region's area (or place count,
or population) may be greater than the sum over its subregions.
One way is to fix the borders by hand. Another way, which I followed, is to select
areas, cities and population from the database even for upper-level regions
(except countries, for which the calculation would run too long and is not very useful).

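Once the data is in CSV form, such under-covered regions are easy to spot. A
sketch, assuming the `7countries.csv` layout produced below:

```python
import pandas as pd

data = pd.read_csv('data/7countries.csv', sep=';')

# Sum the children's land_area per parent and compare with the parent's own value.
child_sums = data.groupby('parent_id')['land_area'].sum().rename('children_land_area')
check = data.set_index('id').join(child_sums)
gap = check['land_area'] - check['children_land_area']
# Regions whose subregions cover noticeably less area than the region itself:
print(check.loc[gap > 0.05 * check['land_area'], ['name', 'land_area', 'children_land_area']])
```
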
#### MWM size data gathering

Having the borders division of the training countries in the web app, I downloaded all
borders, changing the poly-file naming procedure so that the name contains
the region id. The id is the link between the files with geo data and the files with
mwm size data. So we get many border files with names like
`03565917_Japan_Gunma Prefecture_Numata.poly` that I placed into the
`omim/data/borders/` directory instead of the original borders.

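The convention encodes the negated, zero-padded OSM relation id as the first
underscore-separated field, which `combine_data.py` later relies on. A minimal
sketch of both directions (the zero-padding width is an assumption based on the
example name; these helpers are illustrative, not the actual web-app code):

```python
def compose_poly_name(region_id, country, *name_parts):
    # -3565917, "Japan", "Gunma Prefecture", "Numata"
    #   -> "03565917_Japan_Gunma Prefecture_Numata.poly"
    return '_'.join([f'{-region_id:08d}', country, *name_parts]) + '.poly'


def region_id_from_name(file_name):
    # "03565917_Japan_..." -> -3565917, the key into <Country>_regions.json
    return -int(file_name.split('_')[0])
```
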
Also, I made a *.o5m extract for each country, so as not to feed the whole
planet-latest.o5m file to the maps_generator. I used https://boundingbox.klokantech.com
to find a polygon for an extract, first getting a geojson of ten points or so at the
website and then composing a *.poly file in a text editor. With this
`country.poly` file I got a country extract with the `osmconvert` tool:
```bash
osmctools/osmconvert planet-latest.o5m -B=country.poly -o=country.o5m
```

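Composing the *.poly file by hand is easy to automate. Here is a rough sketch
that converts a single-polygon geojson (as downloaded from the site) into the
poly format that `osmconvert -B=` expects; it assumes one outer ring and no holes:

```python
import json


def geojson_to_poly(geojson_path, poly_path, name='country'):
    # Take the outer ring of the first (and only) Polygon feature.
    with open(geojson_path) as f:
        ring = json.load(f)['features'][0]['geometry']['coordinates'][0]

    with open(poly_path, 'w') as f:
        f.write(f'{name}\n1\n')
        for lon, lat in ring:
            f.write(f'    {lon} {lat}\n')
        f.write('END\nEND\n')
```
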
Then I computed its md5 checksum:
```bash
md5sum country.o5m > country.o5m.md5
```
In `maps_generation.ini` I changed the path to the planet and to the md5sum file and ran
the MWM generation with
```bash
nohup python -m maps_generator --order="" --skip="Routing,RoutingTransit" \
    --without_countries="World*" --countries="*_Switzerland_*" &
```

For the asterisk to work at the beginning of the mask in the `--countries` option,
I made some changes to `omim/tools/generator/maps_generator/__main__.py`:

```python
def end_star_compare(prefix, full):
    # Mask like "World*": match by prefix.
    return full.startswith(prefix)


def start_star_compare(suffix, full):
    # Mask like "*_Japan": match by suffix.
    return full.endswith(suffix)


def both_star_compare(substr, full):
    # Mask like "*_Switzerland_*": match by substring.
    return substr in full

...
cmp = compare
_raw_country = country_item[:]
if _raw_country:
    if all(_raw_country[i] == "*" for i in (0, -1)):
        _raw_country = _raw_country.replace("*", "")
        cmp = both_star_compare
    elif _raw_country[-1] == "*":
        _raw_country = _raw_country.replace("*", "")
        cmp = end_star_compare
    elif _raw_country[0] == "*":
        _raw_country = _raw_country.replace("*", "")
        cmp = start_star_compare
```

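In effect, each mask is stripped of its asterisks and matched against mwm names
with the appropriate comparator. For example (the ids in these names are made up):

```python
assert both_star_compare("_Switzerland_", "03565917_Switzerland_Zurich")   # mask "*_Switzerland_*"
assert end_star_compare("World", "WorldCoasts")                            # mask "World*"
assert start_star_compare("_Numata", "03565917_Japan_Gunma Prefecture_Numata")  # mask "*_Numata"
```
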
After all mwms for a country had been generated in a directory like
`maps_build/2021_01_20__18_06_38/210120`,
I got their sizes (in KiB) with this command:

```bash
du maps_build/2021_01_20__18_06_38/210120/*.mwm | sort -k2 > Norway.sizes
```

In fact, I renamed the directory to something like `2021_01_20__18_06_38-Norway` and used the command
```bash
du maps_build/*-Norway/[0-9]*/*.mwm | sort -k2 > Norway.sizes
```

#### Combining geo data with size data

Now I had a set of `<Country>_regions.json` and `<Country>.sizes` files
with geo and size data, respectively, for several large regions with subregions.
I used the `combine_data.py` script (listed at the end of this document)
to generate one big `7countries.csv`.

Yet another file, `4countries.csv`, with Germany, Austria, Belgium and Netherlands
subregions, had been prepared earlier; it has an `excluded=1` flag for those
Netherlands subregions which contain much water (inland waters, not ocean). Also,
it had no data for upper-level regions, so there the values of area, cities,
population and mwm_size were obtained as the sum over the subregions defined by
the parent_id column.

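A sketch of that aggregation, assuming the same column layout (this is an
illustration, not the script actually used back then):

```python
import pandas as pd

data4 = pd.read_csv('data/4countries.csv', sep=';')

sum_fields = ['full_area', 'land_area', 'city_cnt', 'city_pop',
              'hamlet_cnt', 'hamlet_pop', 'mwm_size']

# Roll leaf-level values up to each parent region via the parent_id column.
parent_sums = data4.groupby('parent_id')[sum_fields].sum()
```
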
Since data for country-level regions was not collected (due to long SQL queries and
mwm generation time), we enrich the `7countries.csv` dataset with country-level rows
by summing up the data of their subregions:
```python
import pandas as pd

data7 = pd.read_csv('data/7countries.csv', sep=';')

# Drop country-level rows if already present
data7 = data7[data7['al'] != 2]

countries = {'id': [-59065, -2978650, -51701, -382313, -62149],
             'name': ['Belarus', 'Norway', 'Switzerland', 'Japan', 'United Kingdom'],
             'excluded': [0]*5,
             'al': [2]*5,
            }
sum_fields = ('full_area', 'land_area', 'city_cnt', 'hamlet_cnt', 'city_pop', 'hamlet_pop', 'mwm_size_sum')

for field in sum_fields:
    countries[field] = [data7[data7['parent_id'] == c_id][field].sum() for c_id in countries['id']]

countries_df = pd.DataFrame(countries)
data7 = pd.concat([data7, countries_df])
data7.to_csv('data/7countries-1.csv', index=False, sep=';')

# Check, and if all is right, do
# import os; os.rename('data/7countries-1.csv', 'data/7countries.csv')
```

The union of the `4countries.csv` and `7countries.csv` data is the
dataset for data science experiments on mwm size prediction. Keep in mind
that the `mwm_size` field may be NULL (for countries), or `mwm_size_sum` may be NULL
(in `4countries.csv`). Make corrections when building the combined dataset:

```python
import pandas as pd
import numpy as np


def fit_mwm_size(df):
    # For rows without their own mwm (countries), fall back to the sum over subregions.
    df['mwm_size'] = np.where(df['mwm_size'].isnull(), df['mwm_size_sum'], df['mwm_size'])


data1 = pd.read_csv('data/4countries.csv', sep=';')  # Austria, Belgium, Netherlands, Germany
data2 = pd.read_csv('data/7countries.csv', sep=';')  # Norway, UK, US (4 states), Switzerland, Japan, Belarus, Ile-de-France

data = pd.concat([data1, data2])

data = data[data.excluded.eq(0) & data.id.notnull()].copy()

fit_mwm_size(data)
```
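
To close the loop with the `model.pkl`/`scaler.pkl` files mentioned at the top,
here is a generic sketch of how such artifacts could be produced from the combined
`data` frame above. The actual feature set and estimator behind the shipped model
are not documented here; scikit-learn and the concrete choices below are assumptions:

```python
import pickle

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

features = ['land_area', 'city_cnt', 'city_pop', 'hamlet_cnt', 'hamlet_pop']
train = data.dropna(subset=features + ['mwm_size'])

# Fit the scaler on the raw features, then the model on the scaled ones.
scaler = StandardScaler().fit(train[features])
model = LinearRegression().fit(scaler.transform(train[features]), train['mwm_size'])

with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
with open('scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)
```
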
---

**`combine_data.py`** — the script, referenced above, that combines the per-country geo data with the mwm sizes:

```python
import csv
import json


def get_combined_info(region_name):
    """Merge geo data from <region>_regions.json with mwm sizes from <region>.sizes."""
    with open(f'data/{region_name}_regions.json') as f:
        regions = json.load(f)
    regions = {int(k): v for k, v in regions.items()}

    with open(f'data/{region_name}.sizes') as sizes_file:
        for line in sizes_file:
            # A line looks like "11208\t<path>/03565917_Japan_Gunma Prefecture_Numata.mwm".
            # Strip the newline before cutting off ".mwm", otherwise a trailing dot remains.
            mwm_name = line.rstrip('\n').split('/')[-1][:-4]
            r_id = -int(mwm_name.split('_')[0])
            if r_id not in regions:
                raise Exception(f'id {r_id} not in {region_name} data')
            size = int(line.split()[0])  # du size in KiB
            name = mwm_name.split('_')[-1]
            country = mwm_name.split('_')[1]

            regions[r_id].update({
                'mwm_name': mwm_name,
                'country': country,
                'mwm_size': size,
            })

    admin_levels = set(x['al'] for x in regions.values())

    # Far overseas regions may be present in the geo data although no mwm was generated for them.
    ids_to_remove = []
    for al in sorted(admin_levels, reverse=True):
        for r_id, r_data in ((r_id, r_data) for r_id, r_data in regions.items() if r_data['al'] == al):
            children = [ch for ch in regions.values() if ch['parent_id'] == r_id]
            is_leaf = not children
            r_data['is_leaf'] = int(is_leaf)
            r_data['excluded'] = 0
            if is_leaf:
                if 'mwm_size' not in r_data:
                    print(f"Mwm not generated for {r_data['name']}")
                    ids_to_remove.append(r_id)
                else:
                    r_data['mwm_size_sum'] = r_data['mwm_size']
            else:
                # Children without a generated mwm contribute 0 to the parent's sum.
                r_data['mwm_size_sum'] = sum(ch.get('mwm_size', 0) for ch in children)

    return {k: v for k, v in regions.items() if k not in ids_to_remove}


def main():
    region_names = [
        'Belarus', 'Switzerland', 'Ile-de-France',
        'United Kingdom', 'Norway', 'Japan', 'United States'
    ]

    # full_area includes ocean.
    fieldnames = ['id', 'parent_id', 'al', 'is_leaf', 'excluded', 'name', 'mwm_name', 'country',
                  'city_cnt', 'city_pop', 'hamlet_cnt', 'hamlet_pop',
                  'full_area', 'land_area', 'mwm_size', 'mwm_size_sum']

    with open('data/7countries.csv', 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, delimiter=';', fieldnames=fieldnames)
        writer.writeheader()

        for region_name in region_names:
            regions = get_combined_info(region_name)
            rows = sorted(regions.values(), key=lambda reg: (reg['al'], reg['name']))
            writer.writerows(rows)


if __name__ == '__main__':
    main()
```