naturalsciences/qualityAssuranceTool

qc through sensorthings API

Description

The goal of this project is to perform basic quality checks on data. The project adds quality flags to the database (if configured). The project is intended to be structured so that adding a new algorithm is as simple as changing a config file (YAML).

Requirements

(Optional) SeaVox db

The SeaVox database is used to look up the sea region of a location.

Config file

A configuration file in YAML format is required to set the setup parameters, the tests to perform and their values. An example, config.yaml, can be found on GitHub.

The config folder from this repository can be cloned, without the source code, with

mkdir quality_assurance_tool
cd quality_assurance_tool
git init
git remote add -f origin https://github.com/naturalsciences/qualityAssuranceTool.git
git config core.sparseCheckout true
echo "conf/" >> .git/info/sparse-checkout
git pull origin main

Installation/executing

Proposed workflow (docker, keyctl)

  1. create .env file with CONFIG_FOLDER, OUTPUT_FOLDER and QAT_TAG

    An overview of the tags can be found here. It is recommended to use a specific tag (v0.3 instead of latest).

    CONFIG_FOLDER=$(pwd)/conf/
    OUTPUT_FOLDER=$(pwd)/outputs/
    QAT_TAG=v0.3
  2. use keyctl to store SENSORS_USER and SENSORS_PASS

    A leading space in front of a command generally excludes it from the shell history, which keeps the credentials out of it. Verify this on your system!

    Instead of the default user keyring, a persistent keyring can be used: keyctl get_persistent @u.

     keyctl add user SENSORS_USER XXXXX @u
     keyctl add user SENSORS_PASS XXXXX @u
  3. create and run docker container

    It is recommended to specify the exact version and not latest. A list can be found here.

    docker run --rm --network=host --user "$(id -u):$(id -g)" --name qat --workdir /app -v $CONFIG_FOLDER:/app/conf -v $OUTPUT_FOLDER:/app/outputs -e DEV_SENSORS_USER=$(keyctl print $(keyctl search @u user SENSORS_USER)) -e DEV_SENSORS_PASS=$(keyctl print $(keyctl search @u user SENSORS_PASS)) rbinsbmdc/quality_assurance_tool:$QAT_TAG "time.start=$(date --date=$now-'16minutes' +'%Y-%m-%d %H:%M')" "time.end=$(date --date=$now-'1minute' +'%Y-%m-%d %H:%M')"
  4. if the run command can only be executed once, keep a container running and execute the checks inside it

    • override the ENTRYPOINT in the docker run command so that the container stays alive

      source .env && docker run -d --rm --entrypoint tail --network=host --user "$(id -u):$(id -g)" --name qat --workdir /app -v $CONFIG_FOLDER:/app/conf -v $OUTPUT_FOLDER:/app/outputs -e DEV_SENSORS_USER=$(keyctl print $(keyctl search @u user SENSORS_USER)) -e DEV_SENSORS_PASS=$(keyctl print $(keyctl search @u user SENSORS_PASS)) rbinsbmdc/quality_assurance_tool -f /dev/null
    • use `docker exec` (replace c8c4a820caeb with the id or name of your own container)

      docker exec -u "$(id -u):$(id -g)" c8c4a820caeb /bin/bash -c  "python src/main.py \"time.start=$(date --date=$now-'160minutes' +'%Y-%m-%d %H:%M')\" \"time.end=$(date --date=$now-'1minute' +'%Y-%m-%d %H:%M')\""
  5. remove the entries from the keyring with keyctl

Docker

Getting the image

A Docker image is available from the Docker Hub registry. docker pull rbinsbmdc/quality_assurance_tool:latest pulls the latest available image.

Running container

A couple of options and flags need to be provided on the command line.

Generic docker run command
docker run --rm --network=host --user "$(id -u):$(id -g)" --workdir /app -v CONFIG_FOLDER:/app/conf -v OUTPUT_FOLDER:/app/outputs -e DEV_SENSORS_USER=$DEV_SENSORS_USER -e DEV_SENSORS_PASS=$DEV_SENSORS_PASS rbinsbmdc/quality_assurance_tool:latest CONFIG_OVERRIDES
--network=host

lets the container use the host network. In the future, this will be changed for better security.

--user "$(id -u):$(id -g)"

sets the user and group id of the user in the docker container. Without this option, the logs will not be readable by a normal user.

--workdir /app

sets the working directory within the container (this should not be changed)

-v CONFIG_FOLDER:/app/conf

mounts a folder from the host in the container. CONFIG_FOLDER needs to be adapted to the path with the yaml config file(s).

-v OUTPUT_FOLDER:/app/outputs

mounts a folder for the output.

-e DEV_SENSORS_USER=$DEV_SENSORS_USER -e DEV_SENSORS_PASS=$DEV_SENSORS_PASS

passes the sensorthings user and password from the env to the container.

rbinsbmdc/quality_assurance_tool:latest

the image

CONFIG_OVERRIDES

override parameters through the cli.

Example docker run command
docker run --rm --network=host --user "$(id -u):$(id -g)" --workdir /app -v ./conf:/app/conf -v ./outputs:/app/outputs -e DEV_SENSORS_USER=$DEV_SENSORS_USER -e DEV_SENSORS_PASS=$DEV_SENSORS_PASS rbinsbmdc/quality_assurance_tool:latest "time.start=2023-06-01 00:00" "time.end=2023-07-01 00:00"
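
The time.start and time.end overrides can also be generated programmatically, for example when the container is started from a scheduler. The following is a minimal sketch, assuming the time format "%Y-%m-%d %H:%M" used in the examples above; the function name and its defaults are hypothetical and not part of the tool.

```python
from datetime import datetime, timedelta

# Assumed to match time.format in the config file.
TIME_FORMAT = "%Y-%m-%d %H:%M"


def time_overrides(now, lookback_minutes=16, lag_minutes=1):
    """Build the time.start/time.end overrides for a window ending shortly before `now`."""
    start = now - timedelta(minutes=lookback_minutes)
    end = now - timedelta(minutes=lag_minutes)
    return [
        f"time.start={start.strftime(TIME_FORMAT)}",
        f"time.end={end.strftime(TIME_FORMAT)}",
    ]


# Each string is passed as one (quoted) argument after the image name.
overrides = time_overrides(datetime(2023, 6, 1, 12, 0))
print(overrides)
# ['time.start=2023-06-01 11:44', 'time.end=2023-06-01 11:59']
```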

From source

Python

Because this is a Python project, no real installation is needed. A Python runtime environment in which the required packages are available is sufficient. The required packages are listed in requirements.txt.

python /app/src/main.py OPTIONS

Build image

docker buildx build -t TAG .

or

docker build  --no-cache -t TAG .

Run periodically

There are multiple options here:

  1. (host) systemd --user: see systemd_user/README.adoc for more information

  2. (host) cron

  3. (container) cron

    • requires adapting the image

    • no parallel processing if the interval is shorter than the execution time

Quality flags

The available flags are listed here. The order/priority of the flags is determined by the order in which they appear in the enum definition.
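
How an enum's definition order can act as a priority is sketched below. The member names, their order and the rule that later members win are assumptions made for illustration only; the enum in the source code is authoritative.

```python
from enum import Enum


class QualityFlag(Enum):
    # Hypothetical members and order, for illustration only.
    NO_QUALITY_CONTROL = "no quality control"
    GOOD = "good"
    PROBABLY_GOOD = "probably good"
    PROBABLY_BAD = "probably bad"
    BAD = "bad"


# Iterating an Enum yields its members in definition order,
# so the position in that iteration can serve as the priority.
PRIORITY = {flag: rank for rank, flag in enumerate(QualityFlag)}


def combine(*flags):
    """Return the flag with the highest priority (assumption: later members win)."""
    return max(flags, key=PRIORITY.__getitem__)


print(combine(QualityFlag.GOOD, QualityFlag.PROBABLY_BAD))
# QualityFlag.PROBABLY_BAD
```

With this layout, changing the priority only requires reordering the members of the enum.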

Additional tools/scripts/files

QC historical (folder)

Script and env file to quality check data within a date range, using fixed-width, overlapping time windows.

Usage qc_historical.sh
./qc_historical.sh  -s START -e END -d total_time_window -o time_window_overlap [-i IMAGE_TAG ] [-c CONFIG_NAME ] [ -t ]
-s

start date time (+%Y-%m-%d %H:%M:%S)

-e

end date time

-d

total width of the time window (integer followed by unit, e.g. "60min")

-o

time window overlap; the time overlap (same units as the total width of the time windows above) with previous window

-i

tag of the docker image (see docker hub)

-c

name of the config file (for example config.yaml)

-t

flag (no argument) to turn on test mode, which appends "_testing" to the names of the env source files (hardcoded in the script)
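
The idea of fixed-width, overlapping windows can be sketched in Python as follows. This is an illustration of the windowing logic only, not a transcription of qc_historical.sh; the function name and the choice to clip the last window at the end time are assumptions.

```python
from datetime import datetime, timedelta


def overlapping_windows(start, end, width, overlap):
    """Yield (window_start, window_end) pairs that cover [start, end].

    Each window starts `width - overlap` after the previous one, so
    consecutive windows share `overlap` of time.
    """
    if start >= end:
        raise ValueError("start must be before end")
    if overlap >= width:
        raise ValueError("overlap must be smaller than the window width")
    step = width - overlap
    window_start = start
    while True:
        window_end = min(window_start + width, end)  # clip the last window
        yield window_start, window_end
        if window_end >= end:
            break
        window_start += step


windows = list(
    overlapping_windows(
        datetime(2023, 5, 24, 9, 30),
        datetime(2023, 5, 24, 10, 30),
        width=timedelta(minutes=20),
        overlap=timedelta(minutes=5),
    )
)
for w_start, w_end in windows:
    print(w_start.time(), w_end.time())
# 09:30:00 09:50:00
# 09:45:00 10:05:00
# 10:00:00 10:20:00
# 10:15:00 10:30:00
```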

Example qc_historical.sh usage
./qc_historical.sh -s "2023-05-24 09:30:00" -e "2023-05-24 10:30:00" -d "60min" -o "10" -i "tmp" -c "config.yaml" -t >> qc_historical_$(date "+%Y%m%d").log 2>&1

Crontab (folder)

Possible quality checks

Regions

  1. The location associated with each observation is compared with the SeaVox database. A region and sub-region (lowest found level) are associated with the location.

  2. The region name is checked for the term mainland. Locations whose region name contains it are flagged as bad.

Warning
The layers do not seem to follow the coastlines very accurately. For internal waters, for example in Iceland and Greenland, many locations return None. These locations get a probably bad flag.

Locations

Velocity

The velocity, calculated from the distance travelled between the current point and the next, is compared with a maximum allowed velocity. When a single record is flagged, the timestamp is a likely cause. When two or more records are flagged, the GPS location is the more likely cause.
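
A minimal sketch of such a velocity check is given below. It uses the haversine formula for the distance between two positions; the function names, the input layout and the handling of non-increasing timestamps are assumptions and may differ from the project's implementation.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def flag_velocity(track, max_velocity):
    """Flag each point whose velocity towards the next point exceeds max_velocity (m/s).

    track: list of (time_in_seconds, latitude, longitude), sorted by time.
    The last point has no successor and is never flagged.
    """
    flags = [False] * len(track)
    for i in range(len(track) - 1):
        t0, lat0, lon0 = track[i]
        t1, lat1, lon1 = track[i + 1]
        dt = t1 - t0
        if dt <= 0:
            flags[i] = True  # assumption: non-increasing timestamps are suspicious
            continue
        flags[i] = haversine_m(lat0, lon0, lat1, lon1) / dt > max_velocity
    return flags


track = [(0, 51.0, 2.000), (60, 51.0, 2.001), (120, 51.5, 2.001), (180, 51.5, 2.002)]
print(flag_velocity(track, max_velocity=10.0))
# [False, True, False, False]
```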

Acceleration

The acceleration, calculated from the difference between consecutive distances (each calculated between a point and the next), is compared with a maximum acceleration value. One incorrect location can give rise to multiple flagged records.
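
The sketch below shows why one incorrect location can flag several records. It assumes that the velocity of each segment (from a point to the next) is already available; the function name and the input layout are hypothetical.

```python
def flag_acceleration(times, velocities, max_acceleration):
    """Flag points where the change in velocity per second is too large.

    times:      timestamps in seconds, one per point, sorted.
    velocities: velocities[i] is the velocity (m/s) from point i to point i + 1,
                so it has one element fewer than `times`.
    Returns one boolean per point; the last two points cannot be evaluated.
    """
    flags = [False] * len(times)
    for i in range(len(velocities) - 1):
        dt = times[i + 1] - times[i]
        if dt <= 0:
            continue  # skipped here; non-increasing timestamps need a separate check
        acceleration = (velocities[i + 1] - velocities[i]) / dt
        flags[i] = abs(acceleration) > max_acceleration
    return flags


# Point 3 has a wrong location, so both segments touching it show a huge velocity.
times = [0, 60, 120, 180, 240, 300]
velocities = [1.0, 1.0, 900.0, 900.0, 1.0]
print(flag_acceleration(times, velocities, max_acceleration=1.0))
# [False, True, False, True, False, False]
```

The jump into and out of the wrong position each produce a flagged record, although only one location is incorrect.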

Outliers

A rolling window (see the pandas documentation for more information) is used to calculate the median latitude and longitude. Each location is then compared with the median location, and the resulting distance is compared with the maximum distance within the considered window.

Note
This solution is not ideal. Calculating the angle between each line segment and comparing with a threshold might be better. This value will however be a function of the sample frequency and velocity.
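
The rolling median comparison can be sketched with the standard library only (the project itself refers to pandas rolling windows). The sketch assumes a trailing time window and interprets the maximum distance as maximum velocity multiplied by the window length; both are assumptions, as are all names.

```python
from math import cos, hypot, radians
from statistics import median

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius


def approx_distance_m(lat1, lon1, lat2, lon2):
    """Equirectangular approximation of the distance in metres (fine for short distances)."""
    x = radians(lon2 - lon1) * cos(radians((lat1 + lat2) / 2))
    y = radians(lat2 - lat1)
    return EARTH_RADIUS_M * hypot(x, y)


def flag_outliers(track, window_s, max_velocity):
    """Flag points that lie too far from the median position of their trailing window.

    track: list of (time_in_seconds, latitude, longitude), sorted by time.
    A point is flagged when its distance to the median position exceeds
    max_velocity * window_s (assumed meaning of the maximum distance).
    """
    max_distance = max_velocity * window_s
    flags = []
    for t, lat, lon in track:
        # Quadratic scan for clarity; a real implementation would slide the window.
        window = [(la, lo) for ts, la, lo in track if t - window_s < ts <= t]
        med_lat = median(la for la, _ in window)
        med_lon = median(lo for _, lo in window)
        flags.append(approx_distance_m(lat, lon, med_lat, med_lon) > max_distance)
    return flags


track = [(0, 51.0, 2.000), (60, 51.0, 2.001), (120, 52.0, 2.002),
         (180, 51.0, 2.003), (240, 51.0, 2.004)]
print(flag_outliers(track, window_s=300, max_velocity=10.0))
# [False, False, True, False, False]
```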

Bedrock height

The sea region detection described in Regions sometimes fails to label points close to the coast, in a harbour or in internal waters (Iceland and Greenland). Therefore a second test is included that determines the bedrock height at all points. Doing so, one can for example set the flag to Probably good if no region is identified, but the depth is below a threshold value.
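
How the region test and the bedrock height test might be combined is sketched below. The flag strings, the threshold of 0 m and the good flag for an identified sea region are assumptions for illustration; negative bedrock heights are taken to mean below sea level.

```python
def location_flag(region_name, bedrock_height_m, max_height_m=0.0):
    """Combine the region lookup and the bedrock height into one flag.

    region_name:      region found for the location, or None if no region was found.
    bedrock_height_m: bedrock height at the location in metres, or None if unknown.
    max_height_m:     hypothetical threshold; lower bedrock is treated as water.
    """
    if region_name is not None:
        if "mainland" in region_name.lower():
            return "bad"
        return "good"  # assumption: an identified sea region is acceptable
    # No region found: fall back on the bedrock height.
    if bedrock_height_m is not None and bedrock_height_m < max_height_m:
        return "probably good"
    return "probably bad"


print(location_flag("North Sea", -25.0))  # good
print(location_flag("MAINLAND EUROPE", 12.0))  # bad
print(location_flag(None, -40.0))  # probably good
print(location_flag(None, 150.0))  # probably bad
```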

Range

This test verifies that the range (min/max) of the measurement is correct. It is planned to allow for location dependent ranges.
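
A range test of this kind could look like the following sketch. The function name is hypothetical, and the (min, max) pair mirrors the range entry of the configuration.

```python
def flag_range(values, value_range):
    """Return True for every value outside the inclusive (min, max) range."""
    low, high = value_range
    return [not (low <= value <= high) for value in values]


# Hypothetical water temperatures in degrees Celsius.
print(flag_range([-3.0, 12.5, 41.0], (-2.0, 35.0)))
# [True, False, True]
```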

Gradient

The gradient over time is calculated. If the gradient is outside of a given range, the result is flagged.
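
As an illustration, the gradient can be approximated by the difference between consecutive values divided by the elapsed time. In this sketch the result is assigned to the later of the two records; that choice, like the function name, is an assumption.

```python
def flag_gradient(times, values, gradient_range):
    """Flag records whose gradient with respect to the previous record is out of range.

    times:          timestamps in seconds, sorted.
    values:         measured values, same length as `times`.
    gradient_range: (min, max) of the allowed gradient, in value units per second.
    The first record has no predecessor and is never flagged.
    """
    low, high = gradient_range
    flags = [False] * len(values)
    for i in range(1, len(values)):
        dt = times[i] - times[i - 1]
        if dt <= 0:
            continue  # gradient undefined for non-increasing timestamps
        gradient = (values[i] - values[i - 1]) / dt
        flags[i] = not (low <= gradient <= high)
    return flags


print(flag_gradient([0, 10, 20, 30], [10.0, 10.1, 15.0, 15.1], (-0.1, 0.1)))
# [False, False, True, False]
```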

Dependent

The accuracy, quality or validity of some measurements depends on other quantities. To link the independent and dependent values, a difference between the timestamps of maximum 0.5 seconds is allowed.

There are two possible dependencies:

Directly linked flags

the measurement of the dependent quantity needs to assume the same flag as the independent quantity measurement (at the same time), if this flag is different from Good or No Quality Control. If the measured water temperature is impossible, the dependent salinity measurement can’t possibly be correct.

Quality check

the flag of the dependent quantity measurement needs to be set according to the value of the independent quantity measurement (at the same time). The difference with the first dependent QC is that the flags themselves are not linked. The flow of a scientific water circuit can be measured correctly to be zero (flagged as Good), but the dependent quantity measurements can’t possibly be correct!
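
Both dependencies can be sketched together as follows. Records are matched on the nearest timestamp within 0.5 seconds. The flag strings, the data layout and the use of bad for a failed quality check are assumptions made for this illustration.

```python
from bisect import bisect_left

MAX_DT_S = 0.5  # maximum timestamp difference for linking two records
PASS_THROUGH = {"good", "no quality control"}  # flags that are not propagated


def nearest_index(sorted_times, t):
    """Index of the timestamp closest to t, or None if it is more than MAX_DT_S away."""
    i = bisect_left(sorted_times, t)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(sorted_times)]
    if not candidates:
        return None
    best = min(candidates, key=lambda j: abs(sorted_times[j] - t))
    return best if abs(sorted_times[best] - t) <= MAX_DT_S else None


def dependent_flags(independent, dependent, independent_range):
    """Return the new flags of the dependent records.

    independent:       list of (time_s, value, flag), sorted by time.
    dependent:         list of (time_s, flag).
    independent_range: (min, max) the independent value must lie in.
    """
    times = [t for t, _, _ in independent]
    low, high = independent_range
    result = []
    for t, flag in dependent:
        i = nearest_index(times, t)
        if i is not None:
            _, value, independent_flag = independent[i]
            if independent_flag not in PASS_THROUGH:
                flag = independent_flag  # directly linked flags
            elif not (low <= value <= high):
                flag = "bad"  # quality check on the independent value (assumed flag)
        result.append(flag)
    return result


flow = [(0.0, 5.0, "good"), (10.0, 0.0, "good"), (20.0, 5.0, "bad")]
salinity = [(0.2, "good"), (10.4, "good"), (20.1, "good"), (30.0, "good")]
print(dependent_flags(flow, salinity, independent_range=(1.0, 10.0)))
# ['good', 'bad', 'bad', 'good']
```

The second record is flagged because the flow is zero although that flow measurement itself is good; the third inherits the bad flag of the flow; the last has no flow record within 0.5 seconds and is left unchanged.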

Configuration

This project uses hydra for (most of) its configuration, which is provided through a YAML file. All config files can be found in the conf folder.

time
format

input format of date/time

start

datetime (formatted according to time.format) used as left boundary

end

datetime (formatted according to time.format) used as right boundary

date
format

format for the date used in the output folder

hydra
verbose

Log level (True or __main__)

run
dir

output dir

data_api
base_url

url to the sensorthings instance

things
id

thing identifier (integer)

filter
phenomenonTime
format

expression describing how the time/date is formatted (for example "%Y-%m-%d %H:%M")

range

start and end date/time following the specified format

location
connection
database

postgresql database name

user

user name

host

hostname

port

port that is used

passphrase

passphrase for user

crs

crs of db (EPSG:4326)

time_window

The time window used for the rolling median.

max_dx_dt

The maximal velocity of the vessel, used for outlier detection.

QC_dependent

list of dependency relations between quantities. Two checks can be performed. If the independent quantity has a quality flag different from good, the dependent quantity will get the same flag (in the default use case; this can also be changed in the main file).

independent

identifier (sensorthings) of independent quantity

dependent

identifier (sensorthings) of dependent quantity

QC

type of quality check (only range is implemented)

range

list of 2 values (min, max)

QC

regular quality checks. Only two are defined: range and gradient

name

the name of the observed feature

range

expected range of the feature values

gradient

expected range of the gradient.

License
