PDSSP Crawler

Extract, transform and ingest planetary surface metadata into a STAC API Catalog.

The PDSSP Crawler is the software component responsible for the extraction, transformation and ingestion of planetary surface data products metadata into the PDSSP STAC Catalog service (RESTO). Through an Airflow web interface, it also provides a way to orchestrate and manage the PDSSP Crawler's tasks and behaviour.

Collection and product metadata are extracted from OGC data catalog services (WFS, Features API, STAC API) defined in the PDSSP Services Registry. Metadata can also be extracted from non-OGC data catalog services (PDS ODE API, EPN-TAP, HTTP GET, PDS API, ...), locally defined by the PDSSP operator (see data/services).
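As a rough illustration, a locally defined service entry could be described as a small record like the Python sketch below. The field names and values here are illustrative assumptions only, not the actual registry schema; the real definitions live under data/services.

```python
# Hypothetical sketch of a local (non-OGC) service definition. Field names
# are assumptions for illustration; the actual schema is in data/services.
ode_service = {
    "title": "PDS ODE REST API",   # human-readable name (assumed field)
    "type": "PDS_ODE",             # non-OGC interface type (assumed field)
    "url": "https://oderest.rsl.wustl.edu/live2",  # service endpoint
    "targets": ["mars"],           # planetary bodies served (assumed field)
}
```

A crawler could then dispatch on the interface type (here "PDS_ODE") to select the matching extractor for that service.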

In both cases, metadata ingested into the PDSSP STAC Catalog are compliant with the PDSSP Data Model, which consists of the STAC data model extended through existing and new STAC extensions, including the Solar System STAC extension.
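For a rough idea of what this looks like, the sketch below shows a minimal STAC item carrying a Solar System extension field (ssys:targets) alongside the standard STAC properties. All values are made up for illustration, and the stac_extensions entry is abbreviated: in a real item it would be the extension's JSON schema URL.

```python
# Minimal sketch of a STAC item extended with the Solar System ("ssys")
# STAC extension. Values are illustrative, not taken from a real collection.
stac_item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    # Shortened for readability; a real item lists the extension's
    # JSON schema URL here.
    "stac_extensions": ["ssys"],
    "id": "example-product",
    "geometry": None,
    "properties": {
        "datetime": "2010-01-01T00:00:00Z",
        "ssys:targets": ["Mars"],  # target body, per the ssys extension
    },
    "assets": {},
}
```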

⚠️ The PDSSP Crawler is under development, and the current version is neither fully functional nor stable. See the online documentation for more information.

Installation

Installation via Docker will be possible in a future version, enabling deployment to the PDSSP server. For now, follow these steps:

  1. Set up conda environment

Create environment:

conda create --name pdssp-env python=3.9

Activate environment:

conda activate pdssp-env
  2. Create and go to your working directory
mkdir -p </path/to/pdssp/crawler>
cd </path/to/pdssp/crawler>
  3. Download and install package
git clone https://github.com/pdssp/pdssp-crawler.git
pip install -e pdssp-crawler
pip install -r pdssp-crawler/tests/requirements.txt
pip install -r pdssp-crawler/docs/requirements.txt

Configuration

The configuration mechanism will be improved in future versions. For now:

  1. Create the incoming source and the outgoing STAC directories in your working directory. For example:
mkdir crawler-data
mkdir pdssp-stac-repo
  2. Edit the crawler/config.py file to change the following variables accordingly. In the following example, /Users/nmanaud/workspace/pdssp is the working directory and the source and STAC data directories are respectively named crawler-data and pdssp-stac-repo.
SOURCE_DATA_DIR = '/Users/nmanaud/workspace/pdssp/crawler-data'
STAC_DATA_DIR = '/Users/nmanaud/workspace/pdssp/pdssp-stac-repo'
PDSSP_REGISTRY_ENDPOINT = 'https://pdssp.ias.universite-paris-saclay.fr/registry/services'
LOCAL_REGISTRY_DIRECTORY = '/Users/nmanaud/workspace/pdssp/pdssp-crawler/data/services'
STAC_CATALOG_PARENT_ENDPOINT = 'https://pdssp.ias.universite-paris-saclay.fr'

Set the RESTO_ADMIN_AUTH_TOKEN environment variable, which is required for ingestion POST requests to the PDSSP STAC Catalog service (RESTO).
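For example, the token could be read from the environment at ingestion time and turned into an Authorization header, as in the minimal sketch below. The Bearer scheme is an assumption about what RESTO expects, and the helper name is hypothetical; the point is only to fail early when the variable is missing.

```python
import os

def resto_auth_headers(env=os.environ):
    """Build the Authorization header for ingestion POST requests.

    Hypothetical helper: the Bearer scheme is an assumption about the
    RESTO API; adjust if the service expects a different header format.
    """
    token = env.get("RESTO_ADMIN_AUTH_TOKEN")
    if token is None:
        # Fail early rather than sending an unauthenticated request.
        raise RuntimeError("RESTO_ADMIN_AUTH_TOKEN is not set")
    return {"Authorization": f"Bearer {token}"}
```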

Airflow configuration [TBD]

Usage

Crawler CLI

List CLI commands and get help:

crawler --help

Initialise data store with available source collections (run once):

crawler initds

Display/filter source collections status:

crawler collections --target=mars

Extract, transform and ingest:

crawler extract --id='MRO_HIRISE_RDRV11'
crawler transform --id='MRO_HIRISE_RDRV11'
crawler ingest --id='MRO_HIRISE_RDRV11'

or just:

crawler ingest --id='MRO_HIRISE_RDRV11'

Process all source collections associated with Mars:

crawler process --target=mars

See Crawler CLI Reference

Crawler Python API

For example:

from crawler.crawler import Crawler

crawler = Crawler()
services = crawler.get_registered_services()

See Crawler Python API Reference

Crawler Web Interface (Airflow)

https://pdssp.ias.universite-paris-saclay.fr/crawler (in development)

Contributing

Keeping in mind that this project is in active development: if you are interested in the general topic of planetary geospatial data catalog interoperability, or in the PDSSP Crawler in particular, feel free to reach out to us and raise your questions, suggestions, or issues using PDSSP Crawler GitHub Issues.

Authors

See also the list of contributors participating in the development of the PDSSP Crawler.

License

This project is licensed under the Apache License 2.0 [TBC].