restructure data handling for non-Danish use cases #18

Open · anastassiavybornova opened this issue Feb 1, 2024 · 10 comments

@anastassiavybornova (Owner)

@anerv what do you think of this?

general setup:

  • data/raw subfolders: this is where the user provides input.
  • data/processed mirroring the data/raw structure. Contains all the gpkg output of the evaluation (currently saved to results) and all files that are currently saved to data/processed/workflow_steps (distributed across subfolders; e.g., all the nodes.. and edges.. files from data/processed/workflow_steps will instead go into data/processed/network)
  • results to contain only plots and stats subfolders (the gpkg outputs are saved into corresponding data/processed subfolders)

Repo will look like so:

├── data
│   ├── processed
│   │   ├── elevation
│   │   ├── linestring
│   │   ├── network
│   │   ├── point
│   │   └── polygon
│   └── raw
│       ├── elevation
│       ├── linestring
│       ├── network
│       ├── point
│       ├── polygon
│       └── studyarea
├── results
│   ├── plots
│   └── stats
├── scripts
└── src

Required user input

  • data/raw/studyarea study area polygon
  • data/raw/polygon polygon layers to evaluate (hardcoded options: nature, culture, agriculture, tourism, verify). Each layer is optional.
  • data/raw/point point layers to evaluate (hardcoded options: facility, service, poi)
  • ??? data/raw/linestring potentially: feature-match (like BikeDNA) to another network provided in linestring format? But let's rather leave that as a future feature request.
  • data/raw/elevation elevation data for study area

User configurations

In config file:

  • study area name
  • projected CRS to use
  • buffer to use for polygon layers
  • buffer to use for point layers
  • segment lengths
  • elevation bands
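
A minimal sketch of these user configurations, shown as a Python dict purely for illustration (all key names, values, and units below are assumptions, not a final config format):

```python
# Illustrative config sketch; key names, units, and the dict representation are assumptions.
config = {
    "study_area_name": "my_study_area",  # label used for outputs
    "proj_crs": "EPSG:25832",            # projected CRS to use for all layers
    "polygon_buffer": 100,               # buffer (in CRS units) for polygon layers
    "point_buffer": 50,                  # buffer (in CRS units) for point layers
    "segment_length": 100,               # length (in CRS units) of network segments
    "elevation_bands": [2, 4, 6],        # elevation/slope band breakpoints
}
```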

Denmark use case:

We say that, in general, the user needs to generate the input themselves. However, in the case of DK, we provide (in a separate repo) all the code and data necessary to generate the inputs to data/raw for a user-provided region (defined by municipality codes). Output: after running the single script ("merge study layers") for the DK municipalities indicated by the user in the config file, the output folder will contain exactly the folders and data that are needed as user input for the "general" repo above.

DK-repo will look like so:

├── data
│   ├── elevation
│   │   ├── 1234
│   │   ├── 5678
│   │   └── 9012
│   ├── linestring
│   │   ├── 1234
│   │   ├── 5678
│   │   └── 9012
│   ├── network
│   │   └── technical_network_allDK.gpkg
│   ├── point
│   │   ├── 1234
│   │   ├── 5678
│   │   └── 9012
│   ├── polygon
│   │   ├── 1234
│   │   ├── 5678
│   │   └── 9012
│   └── studyarea
│       └── municipality_boundaries.gpkg
├── output
│   ├── elevation
│   ├── linestring
│   ├── network
│   ├── point
│   ├── polygon
│   └── studyarea
├── scripts
│   └── merge_layers_for_study_area.py
└── src
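
A rough sketch of what scripts/merge_layers_for_study_area.py could do, assuming per-municipality inputs are simply collected into the output/ folders (the municipality codes, file handling, and any actual geopandas merging are assumptions, not the final implementation):

```python
# Hypothetical sketch: collect per-municipality raw data into output/ folders
# that match the data/raw/ input structure expected by the general repo.
import shutil
from pathlib import Path

municipality_codes = ["1234", "5678"]  # assumption: read from the config file
layer_types = ["elevation", "linestring", "point", "polygon"]

for layer in layer_types:
    out_dir = Path("output") / layer
    out_dir.mkdir(parents=True, exist_ok=True)
    for code in municipality_codes:
        for src in (Path("data") / layer / code).glob("*"):
            if src.is_file():
                # assumption: files are collected as-is; a real implementation would
                # likely merge/clip the per-municipality layers with geopandas
                shutil.copy(src, out_dir / f"{code}_{src.name}")
```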

anastassiavybornova self-assigned this Feb 1, 2024
@anastassiavybornova (Owner, Author)

and re technical vs. communication layer:

  1. technical layer should be provided in data/raw/network
  2. communication layer is created by script and saved into data/processed/network

TBD: what to do if the user provides the communication layer directly.

@anerv (Collaborator) commented Feb 5, 2024

looks good!
The only thing I can think of off the top of my head is to have more than one nature category (per the feedback from Faxe).
Potentially we could avoid hardcoding the categories and instead have a script that generates folders according to the desired categories? And then of course also incorporate that in the evaluation part.

@anerv (Collaborator) commented Feb 5, 2024

For the last comment (what to do if the user provides the comm layer directly), the script for creating the communication layer could just check if a comm layer already exists/is provided? And if not, then it should use the tech layer to create a comm layer.
Potentially the comm script should also check whether a provided comm layer fulfills the specification.

@anastassiavybornova (Owner, Author)

Re not hardcoding categories: very good point. Then what about just having the subfolders polygon, point (and linestring, TBD), and saying that the evaluation will look at whichever gpkg files are in there, each one separately?

@anastassiavybornova (Owner, Author)

Re comm script: yes, cool, so what about this:

  • The input folders for the network will be data/raw/network/technical and data/raw/network/communication.
  • For users who have a technical network as starting point: the comm script checks, sees that there is no file in data/raw/network/communication/, then takes data/raw/network/technical/*.gpkg and creates a communication network which is saved into data/raw/network/communication/ (not 100% clean, since the communication network has actually been processed, but I find this the least confusing?).
  • For users who have a communication network as starting point: the comm script checks, sees that there is already a communication network in data/raw/network/communication/, and takes it from there.
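
A minimal sketch of that check, assuming the folder layout above; make_communication_network is a hypothetical placeholder for the actual derivation logic:

```python
from pathlib import Path
import geopandas as gpd

def make_communication_network(tech_network):
    # hypothetical placeholder: the real logic would derive the communication
    # layer from the technical layer (e.g. by merging/simplifying edges)
    return tech_network

comm_dir = Path("data/raw/network/communication")
tech_dir = Path("data/raw/network/technical")

comm_files = sorted(comm_dir.glob("*.gpkg"))
if comm_files:
    # user provided a communication network: take it from there
    comm_network = gpd.read_file(comm_files[0])
else:
    # no communication network provided: derive it from the technical network
    tech_network = gpd.read_file(sorted(tech_dir.glob("*.gpkg"))[0])
    comm_network = make_communication_network(tech_network)
    comm_network.to_file(comm_dir / "communication.gpkg", driver="GPKG")
```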

@anastassiavybornova (Owner, Author)

Re checking specifications: I think, since we are asking for a lot of hand-crafted data sets, we could have a script to be run in the very beginning which checks whether all data sets are there and whether they are in the right format, and if not, tells you what is missing / what is wrong?
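
A minimal sketch of such an upfront check, assuming the folder structure above (which folders are required vs. optional, and the expected file formats, are assumptions):

```python
from pathlib import Path

RAW = Path("data/raw")
required = ["studyarea", "network"]                          # assumption
optional = ["polygon", "point", "linestring", "elevation"]   # assumption

problems = []
for folder in required:
    if not any((RAW / folder).glob("*")):
        problems.append(f"data/raw/{folder}: missing or empty (required input)")
for folder in optional:
    path = RAW / folder
    if path.exists() and not any(path.glob("*")):
        problems.append(f"data/raw/{folder}: folder exists but is empty")

if problems:
    print("Input check failed:")
    for p in problems:
        print(" -", p)
else:
    print("All required input data found.")
```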

@anerv (Collaborator) commented Feb 5, 2024

> Re not hardcoding categories: very good point. Then what about just having the subfolders polygon, point (and linestring, TBD), and saying that the evaluation will look at whichever gpkg files are in there, each one separately?

Yes, that makes sense - but maybe also provide some list of names/labels in a config? I was thinking something like providing a list of polygon types ['nature', 'forest', 'bad', 'culture'] and then asking that the gpkg files are named accordingly (i.e. nature.gpkg, forest.gpkg, etc.).

@anastassiavybornova (Owner, Author)

> Re not hardcoding categories: very good point. Then what about just having the subfolders polygon, point (and linestring, TBD), and saying that the evaluation will look at whichever gpkg files are in there, each one separately?

> Yes, that makes sense - but maybe also provide some list of names/labels in a config? I was thinking something like providing a list of polygon types ['nature', 'forest', 'bad', 'culture'] and then asking that the gpkg files are named accordingly (i.e. nature.gpkg, forest.gpkg, etc.).

Could we also just drop the whole name/label list in the config, and use the filename as the name of the category? Like, as a user you provide in the polygon folder:
nature.gpkg, culture.gpkg, whatever.gpkg, and then the polygon evaluation is done for 3 layers with the names "nature", "culture", and "whatever".
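
For example, the polygon evaluation could simply glob the folder and take each file stem as the category name (a small sketch; the geopandas usage is an assumption):

```python
from pathlib import Path
import geopandas as gpd

# each gpkg in data/raw/polygon becomes its own category, named after the file
polygon_layers = {
    fp.stem: gpd.read_file(fp)   # "nature.gpkg" -> category "nature"
    for fp in sorted(Path("data/raw/polygon").glob("*.gpkg"))
}
for name, gdf in polygon_layers.items():
    print(f"evaluating polygon layer '{name}' ({len(gdf)} features)")
```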

@anerv (Collaborator) commented Feb 5, 2024

Yes, absolutely (and then just state that the file names are used as category names, i.e. rename your files if you want plots with nice and clean labels).

@anastassiavybornova (Owner, Author)

Change raw and processed into input and output.
