QC through SensorThings API

Description

The goal of this project is to perform basic quality checks on data. If configured, the resulting quality flags are added to the database. The project is intended to be structured so that adding a new algorithm is as simple as changing a config file (yaml).

Requirements

(Optional) SeaVox db

The SeaVox database (seavox db) is used to query the sea region of a location.

Config file

To configure the setup parameters, the tests to perform and their values, a configuration file in yaml format is required. An example, config.yaml, can be found on GitHub.

The conf folder of this repository can be cloned, without the source code, with:

mkdir quality_assurance_tool
cd quality_assurance_tool
git init
git remote add -f origin https://github.com/naturalsciences/qualityAssuranceTool.git
git config core.sparseCheckout true
echo "conf/" >> .git/info/sparse-checkout
git pull origin main

Installation/executing

Proposed workflow (docker, keyctl)

Create a .env file with CONFIG_FOLDER, OUTPUT_FOLDER and QAT_TAG

An overview of the available tags can be found on Docker Hub. It is recommended to use a specific tag (for example v0.3) instead of latest.
```
CONFIG_FOLDER=$(pwd)/conf/
OUTPUT_FOLDER=$(pwd)/outputs/
QAT_TAG=v0.3
```
Use keyctl to store SENSORS_USER and SENSORS_PASS

A leading space in front of a command generally keeps it out of the shell history (in bash, this requires HISTCONTROL to contain ignorespace or ignoreboth). Verify this on your system!

Instead of the default user keyring, a persistent keyring can be used: keyctl get_persistent @u.
```
 keyctl add user SENSORS_USER XXXXX @u
 keyctl add user SENSORS_PASS XXXXX @u
```

Create and run the docker container

It is recommended to specify the exact version and not latest. A list of versions can be found on Docker Hub.

source .env && docker run --rm --network=host --user "$(id -u):$(id -g)" --name qat --workdir /app -v $CONFIG_FOLDER:/app/conf -v $OUTPUT_FOLDER:/app/outputs -e DEV_SENSORS_USER=$(keyctl print $(keyctl search @u user SENSORS_USER)) -e DEV_SENSORS_PASS=$(keyctl print $(keyctl search @u user SENSORS_PASS)) rbinsbmdc/quality_assurance_tool:$QAT_TAG "time.start=$(date --date='16 minutes ago' +'%Y-%m-%d %H:%M')" "time.end=$(date --date='1 minute ago' +'%Y-%m-%d %H:%M')"

If the run command can only be executed once

Change the ENTRYPOINT in the docker run command, so that the container keeps running in the background:

source .env && docker run -d --rm --entrypoint tail --network=host --user "$(id -u):$(id -g)" --name qat --workdir /app -v $CONFIG_FOLDER:/app/conf -v $OUTPUT_FOLDER:/app/outputs -e DEV_SENSORS_USER=$(keyctl print $(keyctl search @u user SENSORS_USER)) -e DEV_SENSORS_PASS=$(keyctl print $(keyctl search @u user SENSORS_PASS)) rbinsbmdc/quality_assurance_tool -f /dev/null

Use `docker exec` to run the tool inside the running container (named qat by the --name option of the run command):

docker exec -u "$(id -u):$(id -g)" qat /bin/bash -c  "python src/main.py \"time.start=$(date --date='160 minutes ago' +'%Y-%m-%d %H:%M')\" \"time.end=$(date --date='1 minute ago' +'%Y-%m-%d %H:%M')\""

Removing entries from keyctl

Docker

Getting the image: a docker image is available from the docker hub registry. docker pull rbinsbmdc/quality_assurance_tool:latest will pull the latest available image.
Running container: a couple of options and flags need to be provided through the command line.

Generic docker run command

docker run --rm --network=host --user "$(id -u):$(id -g)" --workdir /app -v CONFIG_FOLDER:/app/conf -v OUTPUT_FOLDER:/app/outputs -e DEV_SENSORS_USER=$DEV_SENSORS_USER -e DEV_SENSORS_PASS=$DEV_SENSORS_PASS rbinsbmdc/quality_assurance_tool:latest CONFIG_OVERRIDES

--network=host: lets the container use the host network. In the future, this will be changed for better security.
--user "$(id -u):$(id -g)": sets the user and group id of the user in the docker container. Without this option, the logs will not be readable by a normal user.
--workdir /app: sets the working directory within the container (this should not be changed)
-v CONFIG_FOLDER:/app/conf: mounts a folder from the host in the container. CONFIG_FOLDER needs to be adapted to the path with the yaml config file(s).
-v OUTPUT_FOLDER:/app/outputs: mounts a folder for the output.
-e DEV_SENSORS_USER=$DEV_SENSORS_USER -e DEV_SENSORS_PASS=$DEV_SENSORS_PASS: passes the sensorthings user and password from the env to the container.
rbinsbmdc/quality_assurance_tool:latest: the image
CONFIG_OVERRIDES: override parameters through the cli.

Example docker run command

docker run --rm --network=host --user "$(id -u):$(id -g)" --workdir /app -v ./conf:/app/conf -v ./outputs:/app/outputs -e DEV_SENSORS_USER=$DEV_SENSORS_USER -e DEV_SENSORS_PASS=$DEV_SENSORS_PASS rbinsbmdc/quality_assurance_tool:latest "time.start=2023-06-01 00:00" "time.end=2023-07-01 00:00"
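
The overrides in the example above are `key=value` pairs with dotted keys. As a rough illustration of how such pairs map onto a nested configuration, here is a minimal sketch in plain Python. It is not hydra's implementation, and `apply_overrides` is a hypothetical helper name; values are kept as strings.

```python
from copy import deepcopy


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of config with dotted key=value overrides applied."""
    result = deepcopy(config)
    for item in overrides:
        dotted_key, _, value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for part in parents:
            # Walk down the nested dicts, creating levels that do not exist yet.
            node = node.setdefault(part, {})
        node[leaf] = value
    return result


# Hypothetical base config, mirroring the time section of the config file.
base = {"time": {"format": "%Y-%m-%d %H:%M", "start": None, "end": None}}
updated = apply_overrides(
    base, ["time.start=2023-06-01 00:00", "time.end=2023-07-01 00:00"]
)
print(updated["time"])
```

The base config is left untouched, so the same defaults can be reused for several runs with different time ranges.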

From source

Python

As it is a python project, no real installation is needed, but a python runtime environment with the required packages needs to be created. The required packages are listed in the file requirements.txt.

python /app/src/main.py OPTIONS

Build image

docker buildx build -t TAG .

or

docker build  --no-cache -t TAG .

Run periodically

There are multiple options here:

(host) systemd --user: see systemd_user/README.adoc for more information
(host) cron
(container) cron
- requires adapting the image
- no parallel processing if the interval is shorter than the execution time

Quality flags

The available flags are listed here. The order/priority of the flags is determined by the order in which they are sorted in the enum definition.
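
To illustrate how the definition order of an enum can act as a priority, here is a minimal sketch. The member names, their order, and the rule that a later member wins are all assumptions made for this example; the project's own enum may differ.

```python
from enum import Enum


class QualityFlag(Enum):
    # Hypothetical members and order: later members take priority here.
    NO_QUALITY_CONTROL = "no quality control"
    GOOD = "good"
    PROBABLY_GOOD = "probably good"
    PROBABLY_BAD = "probably bad"
    BAD = "bad"


# Iterating an Enum follows definition order, which gives each member a rank.
_RANK = {flag: rank for rank, flag in enumerate(QualityFlag)}


def combine(*flags: QualityFlag) -> QualityFlag:
    """Return the flag with the highest priority (defined last in the enum)."""
    return max(flags, key=_RANK.__getitem__)


print(combine(QualityFlag.GOOD, QualityFlag.PROBABLY_BAD))
```

With this approach, changing the priority only requires reordering the members in the enum definition.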

Additional tools/scripts/files

QC historical (folder)

A script and env file to QC data within a time range, using fixed-width, overlapping time windows.

Usage qc_historical.sh

./qc_historical.sh  -s START -e END -d total_time_window -o time_window_overlap [-i IMAGE_TAG ] [-c CONFIG_NAME ] [ -t ]

-s: start date time (+%Y-%m-%d %H:%M:%S)
-e: end date time
-d: total width of the time window (integer followed by a unit, e.g. "60min")
-o: time window overlap; the overlap with the previous window, in the same unit as the total width of the time window above
-i: tag of the docker image (see Docker Hub)
-c: name of the config file (e.g. "config.yaml")
-t: flag (no argument) to turn on test mode, which appends "_testing" to the env source file names (hardcoded in the script)

Example qc_historical.sh usage

./qc_historical.sh -s "2023-05-24 09:30:00" -e "2023-05-24 10:30:00" -d "60min" -o "10" -i "tmp" -c "config.yaml" -t >> qc_historical_$(date "+%Y%m%d").log 2>&1
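
How a time range can be split into fixed, overlapping windows is sketched below. This is not the logic of qc_historical.sh itself; the function name `overlapping_windows` and the choice to clip the last window at the end of the range are assumptions for illustration.

```python
from datetime import datetime, timedelta


def overlapping_windows(start, end, width, overlap):
    """Yield (window_start, window_end) pairs that cover start..end."""
    if overlap >= width:
        raise ValueError("overlap must be smaller than the window width")
    window_start = start
    while window_start < end:
        # Clip the last window so it never runs past the requested end.
        window_end = min(window_start + width, end)
        yield window_start, window_end
        if window_end >= end:
            break
        # The next window starts `overlap` before the end of this one.
        window_start = window_end - overlap


windows = list(
    overlapping_windows(
        start=datetime(2023, 5, 24, 9, 0),
        end=datetime(2023, 5, 24, 11, 0),
        width=timedelta(minutes=60),
        overlap=timedelta(minutes=10),
    )
)
for window_start, window_end in windows:
    print(f"time.start={window_start:%Y-%m-%d %H:%M} time.end={window_end:%Y-%m-%d %H:%M}")
```

Each printed pair corresponds to the time.start and time.end overrides of one run of the tool.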

Crontab (folder)

Possible quality checks

Regions

The location associated with each observation is compared with the SeaVox database. A region and sub-region (lowest level found) are associated with the location.
The region name is checked for the word mainland. Locations on the mainland are marked as bad.

Warning

The layers don’t seem to follow the coastlines very accurately. For internal waters, for example in Iceland and Greenland, a lot of locations return None. These locations get a probably bad flag.

Locations

Velocity

The velocity, calculated from the distance travelled between the current point and the next, is compared with a maximum (allowed) velocity. When a single record is flagged, the cause is possibly the timestamp. If two or more records are flagged, the cause is possibly the GPS location.
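
A minimal sketch of such a velocity check, in plain Python. The haversine distance, the record layout `(t_seconds, lat, lon)`, the function names and the threshold of 10 m/s are assumptions for illustration, not the project's implementation.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def flag_velocity(records, max_velocity):
    """Return one bool per record: True if the speed to the next point is too high."""
    flags = [False] * len(records)  # the last record has no next point
    for i in range(len(records) - 1):
        (t0, lat0, lon0), (t1, lat1, lon1) = records[i], records[i + 1]
        dt = t1 - t0
        if dt <= 0:
            flags[i] = True  # non-increasing timestamps cannot give a valid speed
            continue
        flags[i] = haversine_m(lat0, lon0, lat1, lon1) / dt > max_velocity
    return flags


# One record per minute; the third record has an incorrect latitude.
track = [
    (0, 51.000, 2.0),
    (60, 51.001, 2.0),
    (120, 51.101, 2.0),
    (180, 51.003, 2.0),
    (240, 51.004, 2.0),
]
velocity_flags = flag_velocity(track, max_velocity=10.0)
print(velocity_flags)
```

In this example the single incorrect location flags two records: the one before it (jump towards the wrong point) and the wrong point itself (jump back).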

Acceleration

The acceleration, calculated from the difference between consecutive distances (each calculated between a point and the next), is compared with a maximum acceleration value. One incorrect location can give rise to multiple flagged records.
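
The following sketch shows one way to derive an acceleration from consecutive distances and compare it with a maximum. It assumes strictly increasing timestamps in seconds and distances in metres that were computed beforehand; the function name and the threshold of 0.5 m/s² are hypothetical.

```python
def flag_acceleration(times_s, distances_m, max_acceleration):
    """Flag records whose change in speed per second exceeds max_acceleration.

    times_s: n timestamps in seconds (strictly increasing).
    distances_m: n - 1 distances, entry i is the distance from point i to i + 1.
    """
    dts = [t1 - t0 for t0, t1 in zip(times_s, times_s[1:])]
    speeds = [d / dt for d, dt in zip(distances_m, dts)]
    flags = [False] * len(times_s)
    for i in range(len(speeds) - 1):
        acceleration = (speeds[i + 1] - speeds[i]) / dts[i]
        flags[i] = abs(acceleration) > max_acceleration
    return flags


# One record per minute; the location of the third record (index 2) is wrong,
# which makes the distances to and from that record far too large.
times = [0, 60, 120, 180, 240, 300]
distances = [111.0, 11119.0, 10897.0, 111.0, 111.0]
acceleration_flags = flag_acceleration(times, distances, max_acceleration=0.5)
print(acceleration_flags)
```

Here the single wrong location leads to two flagged records (index 0 and index 2), which illustrates why one incorrect location can give rise to multiple flags.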

Outliers

A rolling window (see the pandas documentation for more information) is used to calculate the median latitude and longitude. Each location is then compared with the median location, and the resulting distance is compared with the maximum distance within the considered window.

Note: this solution is not ideal. Calculating the angle between each line segment and comparing it with a threshold might be better. This value will however be a function of the sample frequency and velocity.
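
The rolling-median comparison can be sketched as follows. To keep the example free of dependencies, the pandas rolling window is replaced by a plain Python loop over a centred time window. The maximum distance is assumed to be the maximum velocity multiplied by the window length; that rule, the function names and the record layout `(t_seconds, lat, lon)` are assumptions for illustration.

```python
from math import asin, cos, radians, sin, sqrt
from statistics import median

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    a = (
        sin((phi2 - phi1) / 2) ** 2
        + cos(phi1) * cos(phi2) * sin(radians(lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def flag_outliers(records, window_s, max_velocity):
    """Flag records that lie too far from the median location of their window."""
    max_distance = max_velocity * window_s  # assumed definition of the limit
    flags = []
    for t, lat, lon in records:
        # Centred window: all records within half a window of this timestamp.
        nearby = [(la, lo) for ti, la, lo in records if abs(ti - t) <= window_s / 2]
        median_lat = median(la for la, _ in nearby)
        median_lon = median(lo for _, lo in nearby)
        flags.append(haversine_m(lat, lon, median_lat, median_lon) > max_distance)
    return flags


# One record per minute, slowly moving north; the fourth record is an outlier.
track = [(60 * i, 51.0 + 0.001 * i, 2.0) for i in range(7)]
track[3] = (180, 51.5, 2.0)
outlier_flags = flag_outliers(track, window_s=300, max_velocity=10.0)
print(outlier_flags)
```

Because the median is robust against a single extreme value, the neighbours of the outlier are not flagged in this example.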

Bedrock height

The sea region detection described in Regions sometimes fails to label points close to the coast, in a harbour or in internal waters (Iceland and Greenland). Therefore a second test is included that determines the bedrock height at all points. Doing so, one can for example set the flag to Probably good if no region is identified, but the depth is below a threshold value.
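
The combination of the region test and the bedrock height test could look like the sketch below. The flag names as strings, the region names, the default threshold, the convention that heights below sea level are negative, and the choice to return good for a recognised sea region are assumptions for illustration.

```python
def location_flag(region_name, bedrock_height_m, max_height_m=0.0):
    """Combine the region lookup and the bedrock height into one flag."""
    if region_name is not None:
        # A region was found: mainland is bad, anything else is assumed good.
        return "bad" if "mainland" in region_name.lower() else "good"
    # No region found: fall back on the bedrock height (negative below sea level).
    if bedrock_height_m is not None and bedrock_height_m < max_height_m:
        return "probably good"
    return "probably bad"


print(location_flag("North Sea", -30.0))
print(location_flag("MAINLAND EUROPE", 15.0))
print(location_flag(None, -25.0))
print(location_flag(None, 12.0))
```

The third call shows the case described above: no region is identified, but the point lies below the threshold, so it is not flagged as probably bad.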

Range

This test verifies that the range (min/max) of the measurement is correct. It is planned to allow for location dependent ranges.
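
A range test of this kind reduces to a comparison with a (min, max) pair. The sketch below assumes inclusive bounds; the function name and the example range are hypothetical.

```python
def flag_range(values, valid_range):
    """Return one bool per value: True if the value lies outside valid_range."""
    lower, upper = valid_range
    return [not (lower <= value <= upper) for value in values]


# Hypothetical water temperature readings with an allowed range of -2 to 35.
range_flags = flag_range([12.5, 40.1, -3.0, 35.0], valid_range=(-2.0, 35.0))
print(range_flags)
```

Written this way, a NaN value is also flagged, because every comparison with NaN is false.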

Gradient

The gradient over time is calculated. If the gradient is outside of a given range, the result is flagged.
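
A minimal sketch of a gradient test, assuming a backward difference (each value compared with the previous one), timestamps in seconds, and strictly increasing time. The function name and the example range are hypothetical.

```python
def flag_gradient(times_s, values, valid_range):
    """Flag values whose rate of change since the previous value is out of range."""
    lower, upper = valid_range
    flags = [False] * len(values)  # the first value has no previous value
    for i in range(1, len(values)):
        gradient = (values[i] - values[i - 1]) / (times_s[i] - times_s[i - 1])
        flags[i] = not (lower <= gradient <= upper)
    return flags


# The jump from 10.1 to 15.0 within 10 seconds exceeds 0.1 units per second.
gradient_flags = flag_gradient(
    times_s=[0, 10, 20, 30],
    values=[10.0, 10.1, 15.0, 15.1],
    valid_range=(-0.1, 0.1),
)
print(gradient_flags)
```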

Dependent

The accuracy, quality or validity of some measurements depends on other quantities. To link the independent and dependent values, a difference between the timestamps of maximum 0.5 seconds is allowed.

There are two possible dependencies:

Directly linked flags: the measurement of the dependent quantity needs to assume the same flag as the independent quantity measurement (at the same time), if this flag is different from Good or No Quality Control. If the measured water temperature is impossible, the dependent salinity measurement can’t possibly be correct.
Quality check: the flag of the dependent quantity measurement needs to be set according to the value of the independent quantity measurement (at the same time). The difference with the first dependent qc is that the flags themselves are not linked. The flow of a scientific water circuit can be measured correctly to be zero (flagged as Good), but the dependent quantity measurements can’t possibly be correct!
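
The first dependency (directly linked flags) can be sketched as follows: each dependent measurement is matched with the nearest independent measurement, and the flag is copied only if the timestamps differ by at most 0.5 seconds. The data layout `(t_seconds, flag)`, the flag names as strings and the function name are assumptions for illustration.

```python
from bisect import bisect_left

MAX_DT_S = 0.5  # maximum allowed difference between the timestamps
NOT_PROPAGATED = {"good", "no quality control"}


def propagate_flags(independent, dependent):
    """Return the new flags of the dependent series.

    Both inputs are lists of (t_seconds, flag), sorted by time.
    """
    times = [t for t, _ in independent]
    result = []
    for t, flag in dependent:
        i = bisect_left(times, t)
        # The nearest independent record is one of the two neighbours of t.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
        if candidates:
            nearest = min(candidates, key=lambda j: abs(times[j] - t))
            linked_flag = independent[nearest][1]
            close_enough = abs(times[nearest] - t) <= MAX_DT_S
            if close_enough and linked_flag not in NOT_PROPAGATED:
                flag = linked_flag
        result.append(flag)
    return result


# Hypothetical temperature (independent) and salinity (dependent) flags.
temperature = [(0.0, "good"), (10.0, "bad"), (20.0, "bad")]
salinity = [(0.2, "good"), (10.3, "good"), (20.9, "good")]
salinity_flags = propagate_flags(temperature, salinity)
print(salinity_flags)
```

The last salinity record keeps its flag, because the nearest temperature record is 0.9 seconds away and therefore not linked.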

Configuration

This project uses hydra for (most of) its configuration, which is done through a yaml file. All config files can be found in the conf folder.

time

format: input format of date/time
start: datetime (formatted according to time.format) used as left boundary
end: datetime (formatted according to time.format) used as right boundary
date
- format: format for the date used in the output folder

hydra

verbose: Log level (True or __main__)
run
- dir: output dir

data_api

base_url

url to the sensorthings instance

things

id: thing identifier (integer)

filter

phenomenonTime

format:

expression describing how the time/date is formatted (for example "%Y-%m-%d %H:%M")

range: start and end date/time following the specified format

location

connection

database: postgresql database name
user: user name
host: hostname
port: port that is used
passphrase: passphrase for user

crs

crs of the db (EPSG:4326)

time_window

The time window used for the rolling median.

max_dx_dt

The maximal velocity of the vessel, used for outlier detection.

QC_dependent

List of quantity dependency relations. Two checks can be performed. If the independent quantity has a quality flag different from good, the dependent quantity will get the same label (in the default use case; this can also be changed in the main file).

independent

identifier (sensorthings) of independent quantity

dependent

identifier (sensorthings) of dependent quantity

QC

type of quality check (only range is implemented)

range: list of 2 values (min, max)

QC

Normal quality checks. Only two are defined: range and gradient.

name: the name of the observed feature
range: expected range of the feature values
gradient: expected range of the gradient.

License

License file
