# Database
This database is used for storing configuration data (roughly 1 MB) and also logging messages and information about current alarms (which will reach gigabytes after a few years of use).
You'll need one account with read-write permissions that Doberman will use. It's also a good idea to have an admin/root account for various maintenance-related tasks, but Doberman doesn't need to know about it. Export the connection info as an environment variable so Doberman knows how to connect:

```
export DOBERMAN_MONGO_URI="mongodb://${username}:${password}@${host}:${port}/admin"
```

There's also a database setup script in `scripts` that will create some of the things for you, so have a look at that as well.
Doberman uses one database per experiment, with a number of collections containing various things. This is not alphabetically organized because some things are used more often, and I put those closer to the top.
Each readout device needs one entry in the `devices` collection. Each entry should look like this:
```
{
    "name" : <device name here>,
    "address" : {
        "ip" : "192.168.131.22",
        "port" : 5000
    },
    "params": {
        "key": "value"
    },
    "commands" : [
        {
            "pattern" : "set setpoint [value]",
            "example" : "set setpoint 182.1"
        }
    ],
    "host" : "calliope",
    "sensors" : [
        "T_LAB_01",
        "T_LAB_02",
        "W_LAB_01"
    ]
}
```
What the fields mean:
- `name`: a unique name for this device. If you have multiple identical devices, append numbers. Doberman knows that both `iseries1` and `iseries2` use the `iseries` plugin.
- `address`: how Doberman should connect to this device. Plugins inheriting from `Doberman.LANDevice` must specify `ip` and `port` fields; plugins inheriting from `Doberman.SerialDevice` must specify `tty` and `baud` fields. Plugins inheriting from other base classes don't need to specify anything.
- `params`: optional, any additional parameters for the plugin. These are loaded as attributes, so the plugin code can reference `self.key`.
- `commands`: optional, a list of dictionaries that contain info about the commands this device accepts. Not used by Doberman, but really useful for displaying on the website.
- `host`: the hostname of the computer where the Doberman instance that reads this device out runs.
- `sensors`: a list of all the sensors this device has. These should all correspond to entries in the `sensors` collection.

Doberman will add a few extra fields for internal use, but these are the minimum that must be specified. Index this collection on the `name` field.
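As a sketch of how you might create such an entry with `pymongo` (the collection name `devices` matches the description above; `pancake` is just a stand-in for your experiment's database name):

```python
import os
from pymongo import MongoClient

# connect using the URI exported earlier; one database per experiment
client = MongoClient(os.environ["DOBERMAN_MONGO_URI"])
db = client["pancake"]

db.devices.insert_one({
    "name": "iseries1",
    "address": {"ip": "192.168.131.22", "port": 5000},
    "params": {},
    "commands": [{"pattern": "set setpoint [value]", "example": "set setpoint 182.1"}],
    "host": "calliope",
    "sensors": ["T_LAB_01", "T_LAB_02", "W_LAB_01"],
})

# index on the name field, as recommended above
db.devices.create_index("name")
```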
Each sensor needs an entry in the `sensors` collection. Each entry should look like this:
```
{
    "name" : "T_LAB_01",
    "description" : "Lab temperature, by door",
    "units" : "C",
    "status" : "online",
    "topic" : "temperature",
    "subsystem" : "lab",
    "readout_interval" : 5,
    "alarm_recurrence" : 3,
    "alarm_thresholds" : [13, 28],
    "alarm_level" : 0,
    "alarm_is_triggered": false,
    "pipelines" : [ ],
    "device" : <device name>,
    "value_xform": [0, 1],
    "readout_command": <command>
}
```
What the fields mean:
- `name`: a unique name for this sensor. We found `<quantity>_<subsystem>_<number>` to be a scheme that scaled well.
- `description`: a text description of what this sensor measures.
- `units`: measurement units.
- `status`: either `online` or `offline`. Should this sensor be read out if its owning device is online?
- `topic`: the physical quantity being measured (temperature, pressure, etc). This determines where in Influx the values go.
- `subsystem`: a larger grouping of sensors that proved convenient for us. Things like `lab` or `inner_cryostat` or `gas_system`, etc.
- `readout_interval`: how often (in seconds) this sensor should be read out.
- `alarm_recurrence`: how many subsequent values outside of the alarm thresholds must occur before an alarm state is entered.
- `alarm_thresholds`: low and high thresholds that demarcate the "safe" or "acceptable" range of values.
- `alarm_level`: the base alarm level.
- `alarm_is_triggered`: boolean showing whether the sensor is in a state of alarm. Only updated when the corresponding AlarmNode is running.
- `pipelines`: a list of pipelines that require this sensor. Used by the website, not by Doberman.
- `device`: the name of the readout device.
- `value_xform`: optional, you can have a polynomial transformation applied to the raw number a sensor returns before it gets sent downstream. This is useful for converting from ADC units into physical quantities. Coefficients are given in little-endian form, and the result is calculated by `sum(a_i*x**i for i, a_i in enumerate(value_xform))`, so a value of `[0, 1]` means no change (see the sketch after this list).
- `readout_command`: the bit of text passed to the corresponding Device to read this quantity out. Might be something like `read:ch1` or some such; see the hardware manual.
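To make the little-endian convention for `value_xform` concrete, here is that formula as plain Python (the function name and values are just for illustration):

```python
def apply_xform(raw, value_xform):
    # value_xform[i] is the coefficient of x**i (little-endian)
    return sum(a_i * raw**i for i, a_i in enumerate(value_xform))

apply_xform(512, [0, 1])     # 512: the identity transform
apply_xform(512, [0, 0.25])  # 128.0: e.g. ADC counts into a physical unit
```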
Here is another example where the sensor produces integer values and uses a corresponding integer alarm:
```
{
    "name" : "S_LAB_01",
    "description" : "Oxygen level",
    "units" : "",
    "status" : "online",
    "topic" : "status",
    "subsystem" : "lab",
    "readout_interval" : 1,
    "alarm_recurrence" : 3,
    "alarm_level" : 2,
    "alarm_is_triggered": false,
    "pipelines" : [ ],
    "device" : <device name>,
    "value_xform": [0, 1],
    "readout_command": <command>,
    "is_int": 1,
    "alarm_values": {
        "1": "The oxygen level is low, flashing light is on",
        "2": "The oxygen level is low, siren is on"
    }
}
```
- `is_int`: add this field to a sensor when it produces integer values. Note that you should only set `is_int` when it is an integer sensor (i.e. don't do something like `is_int: false` in your float sensors). This is used to create the right type of alarm on the website and to decide in which format the values are stored in InfluxDB. Note that changing this field after the first entries are written to Influx is not possible without removing the old values from InfluxDB, so make sure you set it correctly before starting your sensor.
- `alarm_values`: a dictionary containing the integers that shall trigger an alarm (it doesn't matter whether they are ints or strings, as in the example above) and the corresponding alarm messages.
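As an illustrative sketch (not Doberman's actual code), acting on such a reading could look like this, normalizing to a string since the keys may be either type:

```python
alarm_values = {"1": "The oxygen level is low, flashing light is on",
                "2": "The oxygen level is low, siren is on"}

reading = 2
# keys may be ints or strings, so normalize before the lookup
msg = alarm_values.get(str(reading))
if msg is not None:
    print(f"ALARM: {msg}")
```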
Some sensors will not read a single quantity but a whole list of quantities (e.g. a levelmeter box with six inputs returns an array of those six values given a single `readout_command`). This is realised through the `MultiSensor` class. To configure a `MultiSensor` in the database, you need to add the following fields:
```
{
    "name" : "L_XS_01",
    "description" : "N2 levelmeter 1",
    ...,
    "multi_sensor": ["L_XS_01", "L_XS_02", "L_XS_03", "L_XS_04", "L_XS_05", "L_XS_06"]
}
{
    "name" : "L_XS_02",
    "description" : "N2 levelmeter 2",
    ...,
    "multi_sensor": "L_XS_01"
}
```
- `multi_sensor`: for the primary sensor, a list containing all sub-sensors; for all others, the name of the primary sensor. Note that `readout_interval`, `readout_command`, and `status` of all sub-sensors except the primary one will be ignored (see the sketch below).
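Conceptually, one readout of the primary sensor returns one array whose entries belong to the sub-sensors in order. A sketch with hypothetical levelmeter values:

```python
# one readout_command on the primary sensor returns all six values
multi_sensor = ["L_XS_01", "L_XS_02", "L_XS_03", "L_XS_04", "L_XS_05", "L_XS_06"]
values = [41.2, 38.7, 40.1, 39.5, 42.0, 37.8]  # hypothetical readings

# pair each value with the sub-sensor it belongs to, in order
readings = dict(zip(multi_sensor, values))
```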
See the alarm page for more info about alarms and how they work. Index this collection on the `name` field. These values are regularly refreshed by the things in Doberman that use them, so expect fairly prompt responses to changes.
There are a variety of things where you just need to store a few pieces of system-wide information, and the `experiment_config` collection is designed for this. There isn't any fixed schema, but you'll need three documents, described in turn below. The first holds the InfluxDB connection info:
```
{
    "name" : "influx",
    "url" : "http://192.168.131.2:8096",
    "bucket" : "data_bucket",
    "org" : "pancake",
    "precision" : "ms",
    "token" : <token>,
    "db" : "slowdata"
}
```
What the fields mean:
- `name`: `influx`
- `url`: the URL where things on your subnet can access the InfluxDB instance.
- `bucket`: the name of the bucket where you want the sensor data stored.
- `org`: the name of the organization you set up in Influx.
- `precision`: the precision you want timestamps to use.
- `token`: the access token you generated for Influx.
- `db`: the name of the mapped database used to access the specified bucket.
See below for more info on InfluxDB setup.
The hypervisor is the Monitor responsible for making sure the system is running and communicating with itself properly. Its entry should look like this:
```
{
    "name" : "hypervisor",
    "period" : 60,
    "processes" : {
        "managed" : [],
        "active" : []
    },
    "restart_timeout" : 300,
    "status" : "offline",
    "path" : "/global/software/doberman/scripts",
    "remote_heartbeat" : [
        {
            "address" : "user@host",
            "port" : 1234,
            "directory": "/global"
        }
    ],
    "host" : "apollo",
    "username": "doberman",
    "startup_sequence": {
        "apollo": [],
        "calliope": [ "[ -e /global/software ] || mount /global" ],
        ...
    },
    "comms": {
        "data": {"send": 8904, "recv": 8905},
        "command": {"send": 8906, "recv": 8907}
    }
}
```
What the fields mean:
- `name`: `hypervisor`
- `period`: how often (in seconds) the main logic loop runs. This number determines how often everything in the entire system checks in with the database to say that it's still alive.
- `processes`: a dictionary containing `managed` and `active` lists. Devices add and remove themselves from the `active` list as they start and stop. The hypervisor will do everything it can to make sure devices in the `managed` list are running.
- `restart_timeout`: how often (in seconds) the hypervisor can restart crashing or otherwise non-responsive device readouts.
- `status`: either `online` or `offline`.
- `path`: the directory where the `start_process.sh` script lives on your global network drive.
- `remote_heartbeat`: who watches the watchmen? If your lab has a power cut, there's a good chance that networking infrastructure goes down with it, and without this any alarms won't reach you. The hypervisor will send remote heartbeats to all of the machines in this list. This is done by putting a timestamp and the phone numbers of the current shifters into a file called `remote_hb_<experiment_name>` in the given `directory` (defaults to `/scratch` if not defined) on the remote machine, accessed via ssh at `address`:`port` (`port` defaults to `22` if not defined). See REF on how to check remote heartbeats; a sketch of such a check follows this list.
- `host`: the hostname of the machine where the hypervisor runs.
- `username`: the linux username on this and all other machines (see user management).
- `startup_sequence`: these commands will be executed locally or via ssh on the given hosts when you start the hypervisor (e.g. to make sure a directory is properly mounted).
- `comms`: you'll need to define four ports here that will be used for all communication between monitors. You can use pretty much anything you want here; just make sure these ports aren't used by someone else.
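The real procedure is covered at REF, but as a rough sketch, a cron-driven check on the remote machine could look like the following. Assumptions: the first line of the heartbeat file is a unix timestamp (the exact format may differ), and `pancake` stands in for the experiment name.

```python
import time
from pathlib import Path

# hypothetical checker, run periodically on the remote machine
HB_FILE = Path("/scratch/remote_hb_pancake")
MAX_AGE = 300  # seconds without a heartbeat before we worry

lines = HB_FILE.read_text().splitlines()
age = time.time() - float(lines[0])  # assumes line 1 is a unix timestamp
if age > MAX_AGE:
    shifter_numbers = lines[1:]  # phone numbers of the current shifters
    print(f"No heartbeat for {age:.0f}s, call {shifter_numbers}")
```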
Global settings about alarms, like connection details and message routing, go in the `alarm` document:
```
{
    "name" : "alarm",
    "silence_duration" : [3600, 900, 300],
    "silence_duration_cant_send" : 60,
    "escalation_config" : [24, 8, 2],
    "max_reading_delay" : 30,
    "connection_details" : {
        "email" : {
            "server" : "smtp.gmail.com",
            "port" : 587,
            "fromaddr" : <email>,
            "password" : <password>
        },
        "sms" : {
            "server" : "sms.smscreator.de",
            "identification" : <identification>,
            "contactaddr" : <email>
        },
        "twilio" : {
            "url" : <url>,
            "auth" : [<auth code 1>, <auth code 2>],
            "fromnumber" : <mobile number>,
            "maxmessagelength" : 1500
        }
    },
    "recipients": [
        ["shifters"],
        ["shifters"],
        ["shifters", "experts"],
        ["everyone"]
    ],
    "protocols": [
        ["email"],
        ["email", "sms"],
        ["email", "sms", "phone"],
        ["email", "sms", "phone"]
    ]
}
```
What the fields mean:
- `name`: `alarm`
- `silence_duration`: pipelines that generate alarms will automatically silence themselves (suppressing further alarms) for this many seconds per level, so you don't get spammed.
- `silence_duration_cant_send`: if there is an exception while sending an alarm, the corresponding pipeline will be silenced for this many seconds before it tries again.
- `escalation_config`: how long an alarm stays at a given level (in number of messages) before it's escalated.
- `max_reading_delay`: for the DeviceResponding nodes: trigger an alarm when the reading is delayed by more than this many seconds.
- `connection_details`: a dict with all the info the AlarmMonitor needs to send emails, SMS messages, and phone calls.
- `recipients`: an array that defines who gets messages in case of an alarm for a given level (base level + escalation). Allowed values are `shifters`, `experts`, and `everyone`.
- `protocols`: an array that defines how recipients are contacted for a given level (base level + escalation). Allowed values are `email`, `sms`, and `phone`.
See the alarm page for more info about alarms.
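To make the level indexing concrete, here is an illustrative sketch (not Doberman's actual code) of how a base alarm level plus its escalation could select entries from these arrays:

```python
recipients = [["shifters"], ["shifters"], ["shifters", "experts"], ["everyone"]]
protocols = [["email"], ["email", "sms"], ["email", "sms", "phone"], ["email", "sms", "phone"]]

def routing(base_level, escalation):
    # effective level is base + escalation, clamped to the arrays' length
    level = min(base_level + escalation, len(recipients) - 1)
    return recipients[level], protocols[level]

routing(0, 0)  # (['shifters'], ['email'])
routing(2, 1)  # (['everyone'], ['email', 'sms', 'phone'])
```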
If you want the best performance from the website, you'll need a document here with a few experiment-specific things.
```
{
    "name": "doberview_config",
    "subsystems": [
        ["gas_system", "GS"],
        ["inner_cryostat", "IC"],
        ...
    ],
    "topics": [
        "current",
        "flow",
        "pressure",
        "status",
        "temperature",
        "voltage",
        "weight"
    ]
}
```
What the fields mean:
- `name`: `doberview_config`
- `subsystems`: a list of `[name, abbreviation]` entries for each distinct subsystem you want the website to support. We recommend `snake_case` names, but nothing should break if you `CamelCase` instead.
- `topics`: a list of the specific kinds of values your system supports. Most are self-explanatory; `status` is a catch-all for integer quantities.
This collection doesn't really need indexing as it only has a few entries.
Pipelines are sufficiently complicated that they get their own page here.
Each computer where Doberman will run needs to have an entry in the `hosts` collection. Entries should look like this:
```
{
    "name": "apollo",
    "plugin_dir": [
        "/global/software/doberman_pancake"
    ]
}
```
`name` is the hostname and `plugin_dir` is a list of directories where the plugin code can be found. You can add an index on the `name` field, but this collection gets very little traffic.
It's useful to push some log messages to the database, and these go into the `log_messages` collection. An entry probably looks like this:
```
{
    "msg": "Some logging message here",
    "level": "WARNING",
    "name": "who made this message",
    "funcname": "which function made the log call",
    "lineno": "the line number in the source where the log call was made",
    "date": "the Date when the log call was made"
}
```
The field names should be self-explanatory. Index the `date` field and whatever else you think will be useful.
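Doberman writes these entries itself, but if you want to push messages from your own scripts in the same shape, a minimal `logging.Handler` sketch might look like this (the experiment database name is again a stand-in):

```python
import datetime
import logging
import os
from pymongo import MongoClient

class MongoLogHandler(logging.Handler):
    # hypothetical handler writing records in the log_messages shape above
    def __init__(self, db):
        super().__init__()
        self.coll = db.log_messages

    def emit(self, record):
        self.coll.insert_one({
            "msg": record.getMessage(),
            "level": record.levelname,
            "name": record.name,
            "funcname": record.funcName,
            "lineno": record.lineno,
            "date": datetime.datetime.utcnow(),
        })

db = MongoClient(os.environ["DOBERMAN_MONGO_URI"])["pancake"]
logger = logging.getLogger("my_script")
logger.addHandler(MongoLogHandler(db))
logger.warning("Some logging message here")
```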
The `contacts` collection contains contact information that Doberman will use to distribute alarms. Entries look like this:
```
{
    "name": <unique name>,
    "sms": <mobile number>,
    "email": <email address>,
    "phone": <phone number>,
    "first_name": <first_name>,
    "last_name": <last_name>,
    "on_shift": false,
    "expert": true
}
```
The fields should mostly be self-explanatory. `name` could be the first name and last initial, or anything that's unique; if too many people have too-similar names, complain to their parents about their lack of creativity. `on_shift` and `expert` are boolean fields that determine whether a person belongs to one of the recipient groups (see alarm configuration).
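For example, pulling the current recipient groups out of this collection with `pymongo` might look like this (the experiment database name is a stand-in):

```python
import os
from pymongo import MongoClient

db = MongoClient(os.environ["DOBERMAN_MONGO_URI"])["pancake"]

# recipient groups, per the boolean fields above
shifters = [c["name"] for c in db.contacts.find({"on_shift": True})]
experts = [c["name"] for c in db.contacts.find({"expert": True})]
```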
The setup for Influx is much simpler. Follow the steps here and you'll probably have most things ready. Make sure to generate an access token. You will need two buckets: one "default" bucket for the bulk of the data, and another `sysmon` bucket for the system monitors. We use an infinite retention policy for "normal" data and 30 days for system monitor data. Also, map both of these buckets and their retention policies to database names so you can query them using the InfluxDB v1 query style, which is better; see here. Put the info about the "default" bucket into the `influx` doc described above under `experiment_config`.
Data are written to the bucket with the following schema. The measurement is the sensor's topic; tags are the name of the sensor, the name of the readout device, and the name of the subsystem; fields are the value itself, plus the low and high alarm thresholds. A line-format example with ms precision on the timestamp:

```
temperature,device=DeviceNameHere,sensor=T_LAB_01,subsystem=lab value=23.5,alarm_low=13,alarm_high=28 1643200124097
```
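Doberman does the writing for you, but for reference, pushing one such point over InfluxDB's standard v2 HTTP write endpoint might look like the following sketch (the URL, org, and bucket mirror the example `influx` document above; your setup may differ):

```python
import os
import time
import requests

# endpoint parameters mirror the influx document above
url = "http://192.168.131.2:8096/api/v2/write"
params = {"org": "pancake", "bucket": "data_bucket", "precision": "ms"}
headers = {"Authorization": f"Token {os.environ['INFLUX_TOKEN']}"}

# one point in the schema described above, timestamped in ms
line = ("temperature,device=DeviceNameHere,sensor=T_LAB_01,subsystem=lab "
        f"value=23.5,alarm_low=13,alarm_high=28 {int(time.time() * 1000)}")
r = requests.post(url, params=params, headers=headers, data=line)
r.raise_for_status()
```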
While Influx has a python driver, it's much easier to use the HTTP interface. You can use `curl` from the command line or the `requests` library in python. Here's how to do it with curl:

```
curl "http://${host}:${port}/query?db=${db}&org=${org}" --header "Authorization: Token $INFLUX_TOKEN" --header "Accept: application/csv" --data-urlencode "q=${query}"
```

If your database or organization names have fancy characters in them, use additional `--data-urlencode` arguments rather than direct url parameters. Querying from python:
```python
import os
import requests

# fill these in for your setup; values here mirror the examples above
host = "192.168.131.2"
port = 8096
params = {
    "db": "slowdata",
    "org": "pancake",
    "q": "SELECT last(value) FROM temperature WHERE sensor='T_LAB_01';",
}
headers = {
    "Authorization": f"Token {os.environ['INFLUX_TOKEN']}",
    "Accept": "application/csv",
}
r = requests.get(f"http://{host}:{port}/query", params=params, headers=headers)
blob = r.content.decode()
for line in blob.split('\n'):
    # the first line is headers
    print(line)
```
This will get you results in csv format; you can remove the `Accept` header to get json format instead, but I find csv easier to parse. Keep in mind that the timestamp you get back in csv format is in nanoseconds. Fill in the appropriate things for the `host`, `port`, `db`, and `org` fields. If your Influx access token is stored under a different environment variable, then change that as well.
There are usually two kinds of queries that you'll do. One is "give me the most recent value", the other is "give me the trend of this value over some time range". These queries look like this:
```
SELECT last(value) FROM <topic> WHERE sensor='<sensor>';
```

Or:

```
SELECT mean(value) FROM <topic> WHERE sensor='<sensor>' AND time > now()-5d GROUP BY time(10m) FILL(none);
```
Some commentary on this is in order. First, replace `<topic>` and `<sensor>` with the topic and sensor name. Second, the single quotes around `<sensor>` are important; don't use double quotes. Third, the `5d` and `10m` values represent how far into the past you want to look and how fine a binning you want. Also, you can replace `mean` with `median` or `max`, etc, as necessary. See this page for more examples.