restructure data handling for non-Danish use cases #18

Open · anastassiavybornova opened this issue Feb 1, 2024 · 10 comments

@anastassiavybornova (Owner)

@anerv what do you think of this?

general setup:

  • data/raw subfolders: this is where the user provides input.
  • data/processed mirroring the data/raw structure. Contains all the gpkg output of the evaluation (currently saved to results) and all files that are currently saved to data/processed/workflow_steps (distributed across subfolders; e.g., all the nodes.. and edges.. files from data/processed/workflow_steps will instead go into data/processed/network)
  • results to contain only plots and stats subfolders (the gpkg outputs are saved into corresponding data/processed subfolders)

Repo will look like so:

├── data
│   ├── processed
│   │   ├── elevation
│   │   ├── linestring
│   │   ├── network
│   │   ├── point
│   │   └── polygon
│   └── raw
│       ├── elevation
│       ├── linestring
│       ├── network
│       ├── point
│       ├── polygon
│       └── studyarea
├── results
│   ├── plots
│   └── stats
├── scripts
└── src

Required user input

  • data/raw/studyarea study area polygon
  • data/raw/polygon polygon layers to evaluate (hardcoded options: nature, culture, agriculture, tourism, verify). Each layer is optional.
  • data/raw/point point layers to evaluate (hardcoded options: facility, service, poi)
  • ??? data/raw/linestring potentially: feature-match (like BikeDNA) to another network provided in linestring format? But let's rather leave that as a future feature request.
  • data/raw/elevation elevation data for study area

User configurations

In config file:

  • study area name
  • projected CRS to use
  • buffer to use for polygon layers
  • buffer to use for point layers
  • segment lengths
  • elevation bands
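
A minimal sketch of these user configurations, shown as a Python dict purely for illustration (all key names, values, and units below are assumptions, not a final config format):

```python
# Illustrative config sketch; key names, units, and the dict representation are assumptions.
config = {
    "study_area_name": "my_study_area",  # label used for outputs
    "proj_crs": "EPSG:25832",            # projected CRS to use for all layers
    "polygon_buffer": 100,               # buffer (in CRS units) for polygon layers
    "point_buffer": 50,                  # buffer (in CRS units) for point layers
    "segment_length": 100,               # length (in CRS units) of network segments
    "elevation_bands": [2, 4, 6],        # elevation/slope band breakpoints
}
```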

Denmark use case:

We say that, in general, the user needs to generate the input themselves. However, in the case of DK, we provide (in a separate repo) all the code and data necessary to generate the inputs to data/raw for a user-provided region (defined by municipality codes). Output: after running the single script ("merge study layers") for the DK municipalities indicated by the user in the config file, the output folder will contain exactly the folders and data that are needed as user input for the "general" repo above.

DK-repo will look like so:

├── data
│   ├── elevation
│   │   ├── 1234
│   │   ├── 5678
│   │   └── 9012
│   ├── linestring
│   │   ├── 1234
│   │   ├── 5678
│   │   └── 9012
│   ├── network
│   │   └── technical_network_allDK.gpkg
│   ├── point
│   │   ├── 1234
│   │   ├── 5678
│   │   └── 9012
│   ├── polygon
│   │   ├── 1234
│   │   ├── 5678
│   │   └── 9012
│   └── studyarea
│       └── municipality_boundaries.gpkg
├── output
│   ├── elevation
│   ├── linestring
│   ├── network
│   ├── point
│   ├── polygon
│   └── studyarea
├── scripts
│   └── merge_layers_for_study_area.py
└── src
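
A rough sketch of what scripts/merge_layers_for_study_area.py could do, assuming per-municipality inputs are simply collected into the output/ folders (the municipality codes, file handling, and any actual geopandas merging are assumptions, not the final implementation):

```python
# Hypothetical sketch: collect per-municipality raw data into output/ folders
# that match the data/raw/ input structure expected by the general repo.
import shutil
from pathlib import Path

municipality_codes = ["1234", "5678"]  # assumption: read from the config file
layer_types = ["elevation", "linestring", "point", "polygon"]

for layer in layer_types:
    out_dir = Path("output") / layer
    out_dir.mkdir(parents=True, exist_ok=True)
    for code in municipality_codes:
        for src in (Path("data") / layer / code).glob("*"):
            if src.is_file():
                # assumption: files are collected as-is; a real implementation would
                # likely merge/clip the per-municipality layers with geopandas
                shutil.copy(src, out_dir / f"{code}_{src.name}")
```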

anastassiavybornova self-assigned this Feb 1, 2024
@anastassiavybornova (Owner, Author)

and re technical vs. communication layer:

  1. technical layer should be provided in data/raw/network
  2. communication layer is created by script and saved into data/processed/network

TBD: what to do if the user provides the communication layer directly.

@anerv (Collaborator) commented Feb 5, 2024

looks good!
The only thing I can think of off the top of my head is to have more than one nature category (per the feedback from Faxe).
Potentially we could avoid hardcoding the categories and instead have a script that generates folders according to the desired categories? And then of course also incorporate that in the evaluation part.

@anerv (Collaborator) commented Feb 5, 2024

For the last comment (what to do if the user provides the comm layer directly), the script for creating the communication layer could just check if a comm layer already exists/is provided? And if not, then it should use the tech layer to create a comm layer.
Potentially the comm script should also check whether a provided comm layer fulfills the specification.

@anastassiavybornova (Owner, Author)

Re not hardcoding categories: very good point. Then what about just having the subfolders polygon, point (and linestring, TBD), and saying that the evaluation will look at whichever gpkg files are in there, each one separately?

@anastassiavybornova (Owner, Author)

Re comm script: yes, cool, so what about this:

  • The input folders for the network will be data/raw/network/technical and data/raw/network/communication.
  • For users who have a technical network as starting point: the comm script checks, sees that there is no file in data/raw/network/communication/, then takes data/raw/network/technical/*.gpkg and creates a communication network which is saved into data/raw/network/communication/ (not 100% clean, since the communication network has actually been processed, but I find this the least confusing?).
  • For users who have a communication network as starting point: the comm script checks, sees that there is already a communication network in data/raw/network/communication/, and takes it from there.
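
A minimal sketch of that check, assuming the folder layout above; make_communication_network is a hypothetical placeholder for the actual derivation logic:

```python
from pathlib import Path
import geopandas as gpd

def make_communication_network(tech_network):
    # hypothetical placeholder: the real logic would derive the communication
    # layer from the technical layer (e.g. by merging/simplifying edges)
    return tech_network

comm_dir = Path("data/raw/network/communication")
tech_dir = Path("data/raw/network/technical")

comm_files = sorted(comm_dir.glob("*.gpkg"))
if comm_files:
    # user provided a communication network: take it from there
    comm_network = gpd.read_file(comm_files[0])
else:
    # no communication network provided: derive it from the technical network
    tech_network = gpd.read_file(sorted(tech_dir.glob("*.gpkg"))[0])
    comm_network = make_communication_network(tech_network)
    comm_network.to_file(comm_dir / "communication.gpkg", driver="GPKG")
```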

@anastassiavybornova (Owner, Author)

Re checking specifications: I think, since we are asking for a lot of hand-crafted data sets, we could have a script to be run in the very beginning which checks whether all data sets are there and whether they are in the right format, and if not, tells you what is missing / what is wrong?
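
A minimal sketch of such an upfront check, assuming the folder structure above (which folders are required vs. optional, and the expected file formats, are assumptions):

```python
from pathlib import Path

RAW = Path("data/raw")
required = ["studyarea", "network"]                          # assumption
optional = ["polygon", "point", "linestring", "elevation"]   # assumption

problems = []
for folder in required:
    if not any((RAW / folder).glob("*")):
        problems.append(f"data/raw/{folder}: missing or empty (required input)")
for folder in optional:
    path = RAW / folder
    if path.exists() and not any(path.glob("*")):
        problems.append(f"data/raw/{folder}: folder exists but is empty")

if problems:
    print("Input check failed:")
    for p in problems:
        print(" -", p)
else:
    print("All required input data found.")
```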

@anerv (Collaborator) commented Feb 5, 2024

> Re not hardcoding categories: very good point. Then what about just having the subfolders polygon, point (and linestring, TBD), and saying that the evaluation will look at whichever gpkg files are in there, each one separately?

Yes, that makes sense - but maybe also provide some list of names/labels in a config? I was thinking something like providing a list of polygon types ['nature', 'forest', 'bad', 'culture'] and then asking that the gpkg files are named accordingly (i.e. nature.gpkg, forest.gpkg, etc.).

@anastassiavybornova (Owner, Author)

> Re not hardcoding categories: very good point. Then what about just having the subfolders polygon, point (and linestring, TBD), and saying that the evaluation will look at whichever gpkg files are in there, each one separately?

> Yes, that makes sense - but maybe also provide some list of names/labels in a config? I was thinking something like providing a list of polygon types ['nature', 'forest', 'bad', 'culture'] and then asking that the gpkg files are named accordingly (i.e. nature.gpkg, forest.gpkg, etc.).

Could we also just drop the whole name/label list in the config, and use the filename as the name of the category? Like, as a user you provide in the polygon folder:
nature.gpkg, culture.gpkg, whatever.gpkg, and then the polygon evaluation is done for 3 layers with the names "nature", "culture", and "whatever".
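
For example, the polygon evaluation could simply glob the folder and take each file stem as the category name (a small sketch; the geopandas usage is an assumption):

```python
from pathlib import Path
import geopandas as gpd

# each gpkg in data/raw/polygon becomes its own category, named after the file
polygon_layers = {
    fp.stem: gpd.read_file(fp)   # "nature.gpkg" -> category "nature"
    for fp in sorted(Path("data/raw/polygon").glob("*.gpkg"))
}
for name, gdf in polygon_layers.items():
    print(f"evaluating polygon layer '{name}' ({len(gdf)} features)")
```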

@anerv (Collaborator) commented Feb 5, 2024

Yes, absolutely (and then just state that the file names are used as category names, i.e. rename your files if you want plots with nice and clean labels).

@anastassiavybornova (Owner, Author)

Change raw and processed into input and output.
