Extract, transform and ingest planetary surface metadata into a STAC API Catalog.
The PDSSP Crawler is the software component responsible for the extraction, transformation and ingestion of planetary surface data products metadata into the PDSSP STAC Catalog service (RESTO). Through an Airflow web interface, it also provides a way to orchestrate and manage the PDSSP Crawler's tasks and behaviour.
Collection and product metadata are extracted from OGC data catalog services (WFS, Features API, STAC API), defined in the PDSSP Services Registry. Metadata can also be extracted from non-OGC data catalog services (PDS ODE API, ENP-TAP, HTTP GET, PDS API, ...), locally defined by the PDSSP operator (see data/services).
In both cases, metadata ingested into the PDSSP STAC Catalog comply with the PDSSP Data Model, which consists of the STAC data model extended through existing and new STAC extensions, including the Solar System STAC extension.
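For illustration, here is roughly what an Item's metadata might look like once mapped to the PDSSP Data Model, with the Solar System STAC extension's ssys:targets property alongside core STAC fields (a hypothetical, minimal sketch; the identifier and values are made up):

```python
# Hypothetical STAC Item after transformation to the PDSSP Data Model.
# Core STAC fields plus the Solar System extension's `ssys:targets`
# property; the identifier and values below are illustrative only.
stac_item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-mro-hirise-product",
    "properties": {
        "datetime": "2010-01-01T00:00:00Z",
        "ssys:targets": ["Mars"],  # target body, from the Solar System STAC extension
    },
    "geometry": None,
    "assets": {},
}
```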
⚠️ The PDSSP Crawler is under development, and the current version is not fully functional nor stable. See the online documentation for more information.
In a future version, installation via Docker will be possible, enabling deployment to the PDSSP server. For now, follow these steps:
- Set up conda environment
Create environment:
conda create --name pdssp-env python=3.9
Activate environment:
conda activate pdssp-env
- Create and go to your working directory
mkdir -p </path/to/pdssp/crawler>
cd </path/to/pdssp/crawler>
- Download and install package
git clone https://github.com/pdssp/pdssp-crawler.git
pip install -e pdssp-crawler
pip install -r pdssp-crawler/tests/requirements.txt
pip install -r pdssp-crawler/docs/requirements.txt
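To verify the installation, you can try importing the package (a minimal sketch, assuming the top-level module is named crawler, as the crawler/config.py layout below suggests):

```python
# Sanity check: the package should be importable after `pip install -e`.
# Assumes the top-level module is named `crawler` (see crawler/config.py below).
import crawler

print("PDSSP Crawler imported from:", crawler.__file__)
```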
The configuration mechanism will be improved in future versions. For now:
- Create the incoming source and the outgoing STAC directories in your working directory. For example:
mkdir crawler-data
mkdir pdssp-stac-repo
- Edit the crawler/config.py file to change the following variables accordingly. In the following example, /Users/nmanaud/workspace/pdssp is the working directory, and the source and STAC data directories are named crawler-data and pdssp-stac-repo, respectively.
SOURCE_DATA_DIR = '/Users/nmanaud/workspace/pdssp/crawler-data'
STAC_DATA_DIR = '/Users/nmanaud/workspace/pdssp/pdssp-stac-repo'
PDSSP_REGISTRY_ENDPOINT = 'https://pdssp.ias.universite-paris-saclay.fr/registry/services'
LOCAL_REGISTRY_DIRECTORY = '/Users/nmanaud/workspace/pdssp/pdssp-crawler/data/services'
STAC_CATALOG_PARENT_ENDPOINT = 'https://pdssp.ias.universite-paris-saclay.fr'
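A quick way to sanity-check these settings before running the crawler (a minimal sketch, assuming they are importable from crawler.config as edited above):

```python
# Verify that the configured local directories exist.
# Assumes the settings live in crawler/config.py, as edited above.
import os

from crawler import config

for name in ("SOURCE_DATA_DIR", "STAC_DATA_DIR", "LOCAL_REGISTRY_DIRECTORY"):
    path = getattr(config, name)
    print(f"{name}: {path} [{'OK' if os.path.isdir(path) else 'MISSING'}]")
```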
Set the RESTO_ADMIN_AUTH_TOKEN environment variable, which is required for ingestion POST requests to the PDSSP STAC Catalog API (RESTO).
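Before ingesting, you can check that the token is visible to the crawler process (a minimal sketch; the variable name comes from the step above):

```python
# Fail early if the RESTO admin token is missing from the environment;
# without it, ingestion POST requests to the STAC API will be rejected.
import os

if not os.environ.get("RESTO_ADMIN_AUTH_TOKEN"):
    raise RuntimeError("RESTO_ADMIN_AUTH_TOKEN is not set")
```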
Airflow configuration [TBD]
List CLI commands and get help:
crawler --help
Initialise data store with available source collections (run once):
crawler initds
Display/filter source collections status:
crawler collections --target=mars
Extract, transform and ingest:
crawler extract --id='MRO_HIRISE_RDRV11'
crawler transform --id='MRO_HIRISE_RDRV11'
crawler ingest --id='MRO_HIRISE_RDRV11'
or just:
crawler ingest --id='MRO_HIRISE_RDRV11'
Process all source collections associated with Mars:
crawler process --target=mars
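These commands can also be scripted, for example by driving the CLI from Python (a sketch, assuming the crawler executable installed above is on your PATH):

```python
# Run the extract/transform/ingest pipeline for a single collection
# by calling the `crawler` CLI documented above.
import subprocess

for step in ("extract", "transform", "ingest"):
    subprocess.run(["crawler", step, "--id=MRO_HIRISE_RDRV11"], check=True)
```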
The Crawler can also be used directly from Python. For example:
from crawler.crawler import Crawler

# Instantiate the crawler and query the registered catalog services.
crawler = Crawler()
services = crawler.get_registered_services()
See the Crawler Python API Reference.
https://pdssp.ias.universite-paris-saclay.fr/crawler (in development)
Keeping in mind that this project is in active development, if you are interested in the general topic of planetary geospatial data catalog interoperability, or in the PDSSP Crawler in particular, feel free to reach out to us and raise your questions, suggestions, or issues using PDSSP Crawler GitHub Issues.
- Nicolas Manaud (initial design/implementation work)
- Jérôme Gasperi ("stac2resto", Dockerizing)
- Jean-Christophe Malapert (project initiator/management)
See also the list of contributors who are participating in the development of the PDSSP Crawler.
This project is licensed under the Apache License 2.0 [TBC].