PDSSP Crawler

Extract, transform and ingest planetary surface metadata into a STAC API Catalog.

The PDSSP Crawler is the software component responsible for the extraction, transformation and ingestion of planetary surface data products metadata into the PDSSP STAC Catalog service (RESTO). Through an Airflow web interface, it also provides a way to orchestrate and manage the PDSSP Crawler's tasks and behaviour.

Collection and product metadata are extracted from OGC data catalog services (WFS, Features API, STAC API) defined in the PDSSP Services Registry. Metadata can also be extracted from non-OGC data catalog services (PDS ODE API, EPN-TAP, HTTP GET, PDS API, ...), locally defined by the PDSSP operator (see data/services).
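As a rough illustration, a locally defined service entry could be described as a small record like the Python sketch below. The field names and values here are illustrative assumptions only, not the actual registry schema; the real definitions live under data/services.

```python
# Hypothetical sketch of a local (non-OGC) service definition. Field names
# are assumptions for illustration; the actual schema is in data/services.
ode_service = {
    "title": "PDS ODE REST API",   # human-readable name (assumed field)
    "type": "PDS_ODE",             # non-OGC interface type (assumed field)
    "url": "https://oderest.rsl.wustl.edu/live2",  # service endpoint
    "targets": ["mars"],           # planetary bodies served (assumed field)
}
```

A crawler could then dispatch on the interface type (here "PDS_ODE") to select the matching extractor for that service.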

In both cases, metadata ingested into the PDSSP STAC Catalog are compliant with the PDSSP Data Model, which consists of the STAC data model extended through existing and new STAC extensions, including the Solar System STAC extension.
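For a rough idea of what this looks like, the sketch below shows a minimal STAC item carrying a Solar System extension field (ssys:targets) alongside the standard STAC properties. All values are made up for illustration, and the stac_extensions entry is abbreviated: in a real item it would be the extension's JSON schema URL.

```python
# Minimal sketch of a STAC item extended with the Solar System ("ssys")
# STAC extension. Values are illustrative, not taken from a real collection.
stac_item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    # Shortened for readability; a real item lists the extension's
    # JSON schema URL here.
    "stac_extensions": ["ssys"],
    "id": "example-product",
    "geometry": None,
    "properties": {
        "datetime": "2010-01-01T00:00:00Z",
        "ssys:targets": ["Mars"],  # target body, per the ssys extension
    },
    "assets": {},
}
```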

⚠️ The PDSSP Crawler is under development, and the current version is neither fully functional nor stable. See the online documentation for more information.

Installation

Installation via Docker will be possible in a future version, enabling deployment to the PDSSP server. For now, follow these steps:

  1. Set up conda environment

Create environment:

conda create --name pdssp-env python=3.9

Activate environment:

conda activate pdssp-env
  2. Create and go to your working directory
mkdir -p </path/to/pdssp/crawler>
cd </path/to/pdssp/crawler>
  3. Download and install package
git clone https://github.com/pdssp/pdssp-crawler.git
pip install -e pdssp-crawler
pip install -r pdssp-crawler/tests/requirements.txt
pip install -r pdssp-crawler/docs/requirements.txt

Configuration

The configuration mechanism will be improved in future versions. For now:

  1. Create the incoming source and the outgoing STAC directories in your working directory. For example:
mkdir crawler-data
mkdir pdssp-stac-repo
  2. Edit the crawler/config.py file to change the following variables accordingly. In the following example, /Users/nmanaud/workspace/pdssp is the working directory and the source and STAC data directories are respectively named crawler-data and pdssp-stac-repo.
SOURCE_DATA_DIR = '/Users/nmanaud/workspace/pdssp/crawler-data'
STAC_DATA_DIR = '/Users/nmanaud/workspace/pdssp/pdssp-stac-repo'
PDSSP_REGISTRY_ENDPOINT = 'https://pdssp.ias.universite-paris-saclay.fr/registry/services'
LOCAL_REGISTRY_DIRECTORY = '/Users/nmanaud/workspace/pdssp/pdssp-crawler/data/services'
STAC_CATALOG_PARENT_ENDPOINT = 'https://pdssp.ias.universite-paris-saclay.fr'

Set the RESTO_ADMIN_AUTH_TOKEN environment variable, which is required for ingestion POST requests to the PDSSP STAC Catalog service (RESTO).
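For example, the token could be read from the environment at ingestion time and turned into an Authorization header, as in the minimal sketch below. The Bearer scheme is an assumption about what RESTO expects, and the helper name is hypothetical; the point is only to fail early when the variable is missing.

```python
import os

def resto_auth_headers(env=os.environ):
    """Build the Authorization header for ingestion POST requests.

    Hypothetical helper: the Bearer scheme is an assumption about the
    RESTO API; adjust if the service expects a different header format.
    """
    token = env.get("RESTO_ADMIN_AUTH_TOKEN")
    if token is None:
        # Fail early rather than sending an unauthenticated request.
        raise RuntimeError("RESTO_ADMIN_AUTH_TOKEN is not set")
    return {"Authorization": f"Bearer {token}"}
```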

Airflow configuration [TBD]

Usage

Crawler CLI

List CLI commands and get help:

crawler --help

Initialise data store with available source collections (run once):

crawler initds

Display/filter source collections status:

crawler collections --target=mars

Extract, transform and ingest:

crawler extract --id='MRO_HIRISE_RDRV11'
crawler transform --id='MRO_HIRISE_RDRV11'
crawler ingest --id='MRO_HIRISE_RDRV11'

or just:

crawler ingest --id='MRO_HIRISE_RDRV11'

Process all source collections associated with Mars:

crawler process --target=mars

See Crawler CLI Reference

Crawler Python API

For example:

from crawler.crawler import Crawler

crawler = Crawler()
services = crawler.get_registered_services()

See Crawler Python API Reference

Crawler Web Interface (Airflow)

https://pdssp.ias.universite-paris-saclay.fr/crawler (in development)

Contributing

Keeping in mind that this project is in active development: if you are interested in the general topic of planetary geospatial data catalog interoperability, or in the PDSSP Crawler in particular, feel free to reach out to us and raise your questions, suggestions, or issues using PDSSP Crawler GitHub Issues.

Authors

See also the list of contributors participating in the development of the PDSSP Crawler.

License

This project is licensed under the Apache License 2.0 [TBC].