The goal of this project is to perform basic quality checks on data. The project adds quality flags to the database (if configured). The project should/will be structured that adding a new algorithm is as simple as changing a config file (yaml).
To configure the setup parameters, performed tests and values, a configuration file in yaml
format is required.
An example can be found on github config.yaml
The config
folder from this repository can be cloned, without the source code, with
mkdir quality_assurance_tool
cd quality_assurance_tool
git init
git remote add -f origin https://github.com/naturalsciences/qualityAssuranceTool.git
git config core.sparseCheckout true
echo "conf/" >> .git/info/sparse-checkout
git pull origin main
-
create
.env
file withCONFIG_FOLDER
,OUTPUT_FOLDER
andQAT_TAG
An overview of the tags can be found here. It is recommended to use a specific tag (v0.3 instead of latest).
CONFIG_FOLDER=$(pwd)/conf/ OUTPUT_FOLDER=$(pwd)/outputs/ QAT_TAG=v0.3
-
use
keyctl
to storeSENSORS_USER
andSENSORS_PASS
A space in front of the command, generally, excludes it from the
history
. Verify this on your system!Instead of the default user keyring, a persistent keyring can be used:
keyctl get_persistent @u
.keyctl add user SENSORS_USER XXXXX @u keyctl add user SENSORS_PASS XXXXX @u
-
create and run docker container
It is recommended to specify the exact version and not
latest
. A list can be found here.docker run --rm --network=host --user "$(id -u):$(id -g)" --name qat --workdir /app -v $CONFIG_FOLDER:/app/conf -v $OUTPUT_FOLDER:/app/outputs -e DEV_SENSORS_USER=$(keyctl print $(keyctl search @u user SENSORS_USER)) -e DEV_SENSORS_PASS=$(keyctl print $(keyctl search @u user SENSORS_PASS)) rbinsbmdc/quality_assurance_tool:$QAT_TAG "time.start=$(date --date=$now-'16minutes' +'%Y-%m-%d %H:%M')" "time.end=$(date --date=$now-'1minute' +'%Y-%m-%d %H:%M')"
-
IF the run command can only be executed once
-
change ENTRYPOINT in the
docker run
commandsource .env && docker run -d --rm --entrypoint tail --network=host --user "$(id -u):$(id -g)" --name qat --workdir /app -v $CONFIG_FOLDER:/app/conf -v $OUTPUT_FOLDER:/app/outputs -e DEV_SENSORS_USER=$(keyctl print $(keyctl search @u user SENSORS_USER)) -e DEV_SENSORS_PASS=$(keyctl print $(keyctl search @u user SENSORS_PASS)) rbinsbmdc/quality_assurance_tool -f /dev/null
-
user `docker exec `
docker exec -u "$(id -u):$(id -g)" c8c4a820caeb /bin/bash -c "python src/main.py \"time.start=$(date --date=$now-'160minutes' +'%Y-%m-%d %H:%M')\" \"time.end=$(date --date=$now-'1minute' +'%Y-%m-%d %H:%M')\""
-
-
removing entries from
keyctl
#keyctl
- Getting the image
-
a docker image is available from the docker hub registry.
docker pull rbinsbmdc/quality_assurance_tool:latest
will pull the latest available image. - Running container
-
a couple of options and flags need to be provided through the command line.
docker run --rm --network=host --user "$(id -u):$(id -g)" --workdir /app -v CONFIG_FOLDER:/app/conf -v OUTPUT_FOLDER:/app/outputs -e DEV_SENSORS_USER=$DEV_SENSORS_USER -e DEV_SENSORS_PASS=$DEV_SENSORS_PASS rbinsbmdc/quality_assurance_tool:latest CONFIG_OVERRIDES
- --network=host
-
let container use the host network. In the future, this will be changed for better security.
- --user "$(id -u):$(id -g)"
-
sets the user and group id of the user in the docker container. Without this option, the logs will not be readable by a normal user.
- --workdir /app
-
sets the working directory within the container (this should not be changed)
- --v CONFIG_FOLDER:/app/conf
-
mounts a folder from the host in the container.
CONFIG_FOLDER
needs to be adapted to the path with theyaml
config file(s). - --v OUTPUT_FOLDER:/app/outputs
-
mounts a folder for the output.
- -e DEV_SENSORS_USER=$DEV_SENSORS_USER -e DEV_SENSORS_PASS=$DEV_SENSORS_PASS
-
passes the sensorthings user and password from the env to the container.
- rbinsbmdc/quality_assurance_tool:latest
-
the image
- CONFIG_OVERRIDES
-
override parameters through the cli.
docker run --rm --network=host --user "$(id -u):$(id -g)" --workdir /app -v ./conf:/app/conf -v ./outputs:/app/outputs -e DEV_SENSORS_USER=$DEV_SENSORS_USER -e DEV_SENSORS_PASS=$DEV_SENSORS_PASS rbinsbmdc/quality_assurance_tool:latest "time.start=2023-06-01 00:00" "time.end=2023-07-01 00:00"
As it is a python project, no real installation is needed, but a runtime python environment needs to be created where the needed packages are available.
The needed packages are listed in the file requirements.txt
.
python /app/src/main.py OPTIONS
The available flags are listed here. The order/priority of the flags are determined by the order in which they are sorted in the enum definition.
Script and env file in order to QC data within a range in fixed, overlapping time windows.
./qc_historical.sh -s START -e END -d total_time_window -o time_window_overlap [-i IMAGE_TAG ] [-c CONFIG_NAME ] [ -t ]
- -s
-
start date time (+%Y-%m-%d %H:%M:%S)
- -e
-
end date time
- -d
-
total width of the time window (integer followed by unit, i.g. "60min")
- -o
-
time window overlap; the time overlap (same units as the total width of the time windows above) with previous window
- -i
-
tag of the docker image (see docker hub)
- -t
-
flag (no argument) to turn on test-mode, appending the env source file names with the "_testing" (is hardcoded in script)
./qc_historical.sh -s "2023-05-24 09:30:00" -e "2023-05-24 10:30:00" -d "60min" -o "10" -i "tmp" -c "config.yaml" -t >> qc_historical_$(date "+%Y%m%d").log 2>&1
-
The location associated with each observation is compared with the SeaVox database. A region and sub-region (lowest found level) are associated with the location.
-
The name is verified to not contain mainland. These are marked as bad
Warning
|
the layers don’t seem to follow the coastlines very accurately. For internal waters in for example Iceland and Greenland, a lot of location return None. These location get a probably bad flag. |
The velocity, calculated based on the distance traveled from the current point to the next is compared with a maximal (allowed) velocity. When a single record is flagged, it is possibly an issue with the timestamp. If two or more records are flagged, it is possibly related to the gps location.
The acceleration, calculated from the difference between consecutive distances (calculated between this and the next point) are compared with a maximal acceleration value. One incorrect location, can give rise to multiple flagged records.
A rolling windows (see pandas documentation for more information) is used to calculate the median latitude and longitude. Then each location is compared with the median location. This distance is compared with the max distance within the considered window.
Note
|
This solution is not ideal. Calculating the angle between each line segment and comparing with a threshold might be better. This value will however be a function of the sample frequency and velocity. |
The sea region detection described in Regions sometimes fails to label points close to the coast, in a harbour or in internal waters (Iceland and Greenland). Therefore a second test is included that determines the bedrock height at all points. Doing so, one can for example set the flag to Probably good if no region is identified, but the depth is below a threshold value.
This test verifies that the range (min/max) of the measurement is correct. It is planned to allow for location dependent ranges.
The gradient over time is calculated. If the gradient is outside of a given range, the result is flagged.
The accuracy, quality or validity of some measurements depends on other quantities. To link the independent and dependent values, a difference between the timestamps of maximum 0.5 seconds is allowed.
There are two possible dependencies:
- Directly linked flags
-
the measurement of the dependent quantity need to assume the same flag as the independent quantity measurement (at the same time), if this flag is different from
Good
orNo Quality Control
. If the measured water temperature is impossible, the dependent salinity measurement can’t possibly be correct. - Quality check
-
the measurement of the dependent quantity needs to be set according to the value of the independent quantity measurement (at the same time). The difference with the first dependent qc, is that the flags themselves are not linked. The flow of a scientific water circuit can be measured correctly to be zero (flagged as
Good
), but the dependent quantity measurements can’t possible be correct!
This project uses hydra for (most) configurations and is done through a yaml file.
All config files can be found in the conf
folder.
- time
-
- format
-
input format of date/time
- start
-
datetime (formatted according to time.format) used as left boundary
- end
-
datetime (formatted according to time.format) used as right boundary
- date
- format
-
format for the date used in the output folder
- hydra
-
- verbose
-
Log level (True or __main__)
- run
- dir
-
output dir
- data_api
-
- base_url
-
url to the sensorthings instance
- things
-
- id
-
thing identifier (integer)
- filter
-
- phenomenonTime
-
- format:
-
expression how time/date is formatted (for example"%Y-%m-%d %H:%M" )
- range:
-
start and end date/time following specified format
- location
-
- connection
-
- database
-
postgresql database name
- user
-
user name
- host
-
hostname
- port
-
port that is used
- passphrase
-
passphrase for user
- crs
-
crs of db (EPS:4326)
- time_window
-
The time window used for the rolling median.
- max_dx_dt
-
The maximal velocity of the vessel, used for outlier detection.
- QC_dependent
-
list if quantity dependent relations. 2 checks can be performed. If the independent quantity has a quality flag different from good, the dependent quantity wil get the same label (in the default use case. This can also be changed in the main file).
- independent
-
identifier (sensorthings) of independent quantity
- dependent
-
identifier (sensorthings) of dependent quantity
- QC
-
type of quality check (only range is implemented)
- range
-
list of 2 values (min, max)
- QC
-
normal quality checks. only two are defined: range and gradient
- name
-
the name of the observed feature
- range
-
expected range of the feature values
- gradient
-
expected range of the gradient.