# Database
This database is used for storing configuration data (roughly 1 MB) and also logging messages and information about current alarms (which will reach gigabytes after a few years of use).
You'll need one account with read-write permissions that Doberman will use. It's also a good idea to have an admin/root account for various maintenance-related tasks, but Doberman doesn't need to know about it. Export the connection info as an environment variable so Doberman knows how to connect:

```
export DOBERMAN_MONGO_URI="mongodb://${username}:${password}@${host}:${port}/admin"
```

There's also a database setup script in `scripts` that will create some of the things for you, so have a look at that as well.
Doberman uses one database per experiment, with a number of collections containing various things. This is not alphabetically organized because some things are used more often, and I put those closer to the top.
Each readout device needs one entry in the `devices` collection. Each entry should look like this:
```
{
    "name" : <device name here>,
    "address" : {
        "ip" : "192.168.131.22",
        "port" : 5000
    },
    "params": {
        "key": "value"
    },
    "commands" : [
        {
            "pattern" : "set setpoint [value]",
            "example" : "set setpoint 182.1"
        }
    ],
    "host" : "calliope",
    "sensors" : [
        "T_LAB_01",
        "T_LAB_02",
        "W_LAB_01"
    ]
}
```
What the fields mean:
- `name`: a unique name for this device. If you have multiple identical devices, append numbers. Doberman knows that both `iseries1` and `iseries2` use the `iseries` plugin.
- `address`: how Doberman should connect to this device. Plugins inheriting from `Doberman.LANDevice` must specify `ip` and `port` fields; plugins inheriting from `Doberman.SerialDevice` must specify `tty` and `baud` fields. Plugins inheriting from other base classes don't need to specify anything.
- `params`: optional, any additional parameters for the plugin. These are loaded as attributes, so the plugin code can reference `self.key`.
- `commands`: optional, a list of dictionaries that contain info about the commands this device accepts. Not used by Doberman, but really useful for displaying on the website.
- `host`: the hostname of the computer where the Doberman instance that reads this device out runs.
- `sensors`: a list of all the sensors this device has. These should all correspond to entries in the `sensors` collection.

Doberman will add a few extra fields for internal use, but these are the minimum that must be specified. Index this collection on the `name` field.
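As a sketch of how you might create such an entry with `pymongo` (the collection name `devices` matches the description above; `pancake` is just a stand-in for your experiment's database name):

```python
import os
from pymongo import MongoClient

# connect using the URI exported earlier; one database per experiment
client = MongoClient(os.environ["DOBERMAN_MONGO_URI"])
db = client["pancake"]

db.devices.insert_one({
    "name": "iseries1",
    "address": {"ip": "192.168.131.22", "port": 5000},
    "params": {},
    "commands": [{"pattern": "set setpoint [value]", "example": "set setpoint 182.1"}],
    "host": "calliope",
    "sensors": ["T_LAB_01", "T_LAB_02", "W_LAB_01"],
})

# index on the name field, as recommended above
db.devices.create_index("name")
```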
Each sensor needs an entry in the `sensors` collection. Each entry should look like this:
```
{
    "name" : "T_LAB_01",
    "description" : "Lab temperature, by door",
    "units" : "C",
    "status" : "online",
    "topic" : "temperature",
    "subsystem" : "lab",
    "readout_interval" : 5,
    "alarm_recurrence" : 3,
    "alarm_thresholds" : [13, 28],
    "alarm_level" : 0,
    "alarm_is_triggered": false,
    "pipelines" : [ ],
    "device" : <device name>,
    "value_xform": [0, 1],
    "readout_command": <command>
}
```
What the fields mean:
- `name`: a unique name for this sensor. We found `<quantity>_<subsystem>_<number>` to be a scheme that scaled well.
- `description`: a text description of what this sensor measures.
- `units`: measurement units.
- `status`: either `online` or `offline`. Should this sensor be read out if its owning device is online?
- `topic`: the physical quantity being measured (temperature, pressure, etc). This determines where in Influx the values go.
- `subsystem`: a larger grouping of sensors that proved convenient for us. Things like `lab` or `inner_cryostat` or `gas_system`, etc.
- `readout_interval`: how often (in seconds) this sensor should be read out.
- `alarm_recurrence`: how many subsequent values outside of the alarm thresholds must occur before an alarm state is entered.
- `alarm_thresholds`: low and high thresholds that demarcate the "safe" or "acceptable" range of values.
- `alarm_level`: the base alarm level.
- `alarm_is_triggered`: boolean showing whether the sensor is in a state of alarm. Only updated when the corresponding AlarmNode is running.
- `pipelines`: a list of pipelines that require this sensor. Used by the website, not by Doberman.
- `device`: the name of the readout device.
- `value_xform`: optional, you can have a polynomial transformation applied to the raw number a sensor returns before it gets sent downstream. This is useful for converting from ADC units into physical quantities. Coefficients are given in little-endian form, and the result is calculated by `sum(a_i*x**i for i, a_i in enumerate(value_xform))`, so a value of `[0, 1]` means no change (see the sketch after this list).
- `readout_command`: the bit of text passed to the corresponding Device to read this quantity out. Might be something like `read:ch1` or some such; see the hardware manual.
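To make the little-endian convention for `value_xform` concrete, here is that formula as plain Python (the function name and values are just for illustration):

```python
def apply_xform(raw, value_xform):
    # value_xform[i] is the coefficient of x**i (little-endian)
    return sum(a_i * raw**i for i, a_i in enumerate(value_xform))

apply_xform(512, [0, 1])     # 512: the identity transform
apply_xform(512, [0, 0.25])  # 128.0: e.g. ADC counts into a physical unit
```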
Here is another example where the sensor produces integer values and uses a corresponding integer alarm:
```
{
    "name" : "S_LAB_01",
    "description" : "Oxygen level",
    "units" : "",
    "status" : "online",
    "topic" : "status",
    "subsystem" : "lab",
    "readout_interval" : 1,
    "alarm_recurrence" : 3,
    "alarm_level" : 2,
    "alarm_is_triggered": false,
    "pipelines" : [ ],
    "device" : <device name>,
    "value_xform": [0, 1],
    "readout_command": <command>,
    "is_int": 1,
    "alarm_values": {
        "1": "The oxygen level is low, flashing light is on",
        "2": "The oxygen level is low, siren is on"
    }
}
```
- `is_int`: add this field to a sensor when it produces integer values. Note that you should only set `is_int` when it is an integer sensor (i.e. don't do something like `is_int: false` in your float sensors). This is used to create the right type of alarm on the website and to decide in which format the values are stored in InfluxDB. Note that changing this field after the first entries are written to Influx is not possible without removing the old values from InfluxDB, so make sure you set it correctly before starting your sensor.
- `alarm_values`: a dictionary containing the integers that shall trigger an alarm (it doesn't matter whether they are ints or strings, as in the example above) and the corresponding alarm messages.
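As an illustrative sketch (not Doberman's actual code), acting on such a reading could look like this, normalizing to a string since the keys may be either type:

```python
alarm_values = {"1": "The oxygen level is low, flashing light is on",
                "2": "The oxygen level is low, siren is on"}

reading = 2
# keys may be ints or strings, so normalize before the lookup
msg = alarm_values.get(str(reading))
if msg is not None:
    print(f"ALARM: {msg}")
```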
Some sensors will not read a single quantity but a whole list of quantities (e.g. a levelmeter box with six inputs returns an array of those six values given a single `readout_command`). This is realised through the `MultiSensor` class. To configure a `MultiSensor` in the database, you need to add the following fields:
```
{
    "name" : "L_XS_01",
    "description" : "N2 levelmeter 1",
    ...,
    "multi_sensor": ["L_XS_01", "L_XS_02", "L_XS_03", "L_XS_04", "L_XS_05", "L_XS_06"]
}
{
    "name" : "L_XS_02",
    "description" : "N2 levelmeter 2",
    ...,
    "multi_sensor": "L_XS_01"
}
```
- `multi_sensor`: for the primary sensor, a list containing all sub-sensors; for all others, the name of the primary sensor. Note that `readout_interval`, `readout_command`, and `status` of all sub-sensors except the primary one will be ignored (see the sketch below).
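Conceptually, one readout of the primary sensor returns one array whose entries belong to the sub-sensors in order. A sketch with hypothetical levelmeter values:

```python
# one readout_command on the primary sensor returns all six values
multi_sensor = ["L_XS_01", "L_XS_02", "L_XS_03", "L_XS_04", "L_XS_05", "L_XS_06"]
values = [41.2, 38.7, 40.1, 39.5, 42.0, 37.8]  # hypothetical readings

# pair each value with the sub-sensor it belongs to, in order
readings = dict(zip(multi_sensor, values))
```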
See the alarm page for more info about alarms and how they work. Index this collection on the `name` field. These values are regularly refreshed by the things in Doberman that use them, so expect fairly prompt responses to changes.
There are a variety of things where you just need to store a few pieces of system-wide information, and the `experiment_config` collection is designed for this. There isn't any fixed schema, but you'll need three documents, described in turn below. The first holds the InfluxDB connection info:
```
{
    "name" : "influx",
    "url" : "http://192.168.131.2:8096",
    "bucket" : "data_bucket",
    "org" : "pancake",
    "precision" : "ms",
    "token" : <token>,
    "db" : "slowdata"
}
```
What the fields mean:
- `name`: `influx`
- `url`: the URL where things on your subnet can access the InfluxDB instance.
- `bucket`: the name of the bucket where you want the sensor data stored.
- `org`: the name of the organization you set up in Influx.
- `precision`: the precision you want timestamps to use.
- `token`: the access token you generated for Influx.
- `db`: the name of the mapped database used to access the specified bucket.
See below for more info on InfluxDB setup.
The hypervisor is the Monitor responsible for making sure the system is running and communicating with itself properly. Its entry should look like this:
```
{
    "name" : "hypervisor",
    "period" : 60,
    "processes" : {
        "managed" : [],
        "active" : []
    },
    "restart_timeout" : 300,
    "status" : "offline",
    "path" : "/global/software/doberman/scripts",
    "remote_heartbeat" : [
        {
            "address" : "user@host",
            "port" : 1234,
            "directory": "/global"
        }
    ],
    "host" : "apollo",
    "username": "doberman",
    "startup_sequence": {
        "apollo": [],
        "calliope": [ "[ -e /global/software ] || mount /global" ],
        ...
    },
    "comms": {
        "data": {"send": 8904, "recv": 8905},
        "command": {"send": 8906, "recv": 8907}
    }
}
```
What the fields mean:
- `name`: `hypervisor`
- `period`: how often (in seconds) the main logic loop runs. This number determines how often everything in the entire system checks in with the database to say that it's still alive.
- `processes`: a dictionary containing `managed` and `active` lists. Devices add and remove themselves from the `active` list as they start and stop. The hypervisor will do everything it can to make sure devices in the `managed` list are running.
- `restart_timeout`: how often (in seconds) the hypervisor can restart crashing or otherwise non-responsive device readouts.
- `status`: either `online` or `offline`.
- `path`: the directory where the `start_process.sh` script lives on your global network drive.
- `remote_heartbeat`: who watches the watchmen? If your lab has a power cut, there's a good chance that networking infrastructure goes down with it, and without this any alarms won't reach you. The hypervisor will send remote heartbeats to all of the machines in this list. This is done by putting a timestamp and the phone numbers of the current shifters into a file called `remote_hb_<experiment_name>` in the given `directory` (defaults to `/scratch` if not defined) on the remote machine, accessed via ssh at `address`:`port` (`port` defaults to `22` if not defined). See REF on how to check remote heartbeats; a sketch of such a check follows this list.
- `host`: the hostname of the machine where the hypervisor runs.
- `username`: the linux username on this and all other machines (see user management).
- `startup_sequence`: these commands will be executed locally or via ssh on the given hosts when you start the hypervisor (e.g. to make sure a directory is properly mounted).
- `comms`: you'll need to define four ports here that will be used for all communication between monitors. You can use pretty much anything you want here; just make sure these ports aren't used by someone else.
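The real procedure is covered at REF, but as a rough sketch, a cron-driven check on the remote machine could look like the following. Assumptions: the first line of the heartbeat file is a unix timestamp (the exact format may differ), and `pancake` stands in for the experiment name.

```python
import time
from pathlib import Path

# hypothetical checker, run periodically on the remote machine
HB_FILE = Path("/scratch/remote_hb_pancake")
MAX_AGE = 300  # seconds without a heartbeat before we worry

lines = HB_FILE.read_text().splitlines()
age = time.time() - float(lines[0])  # assumes line 1 is a unix timestamp
if age > MAX_AGE:
    shifter_numbers = lines[1:]  # phone numbers of the current shifters
    print(f"No heartbeat for {age:.0f}s, call {shifter_numbers}")
```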
Global settings about alarms, like connection details and message routing, go in the `alarm` document:
```
{
    "name" : "alarm",
    "silence_duration" : [3600, 900, 300],
    "silence_duration_cant_send" : 60,
    "escalation_config" : [24, 8, 2],
    "max_reading_delay" : 30,
    "connection_details" : {
        "email" : {
            "server" : "smtp.gmail.com",
            "port" : 587,
            "fromaddr" : <email>,
            "password" : <password>
        },
        "sms" : {
            "server" : "sms.smscreator.de",
            "identification" : <identification>,
            "contactaddr" : <email>
        },
        "twilio" : {
            "url" : <url>,
            "auth" : [<auth code 1>, <auth code 2>],
            "fromnumber" : <mobile number>,
            "maxmessagelength" : 1500
        }
    },
    "recipients": [
        ["shifters"],
        ["shifters"],
        ["shifters", "experts"],
        ["everyone"]
    ],
    "protocols": [
        ["email"],
        ["email", "sms"],
        ["email", "sms", "phone"],
        ["email", "sms", "phone"]
    ]
}
```
What the fields mean:
- `name`: `alarm`
- `silence_duration`: pipelines that generate alarms will automatically silence themselves (suppressing further alarms) for this many seconds per level, so you don't get spammed.
- `silence_duration_cant_send`: if there is an exception while sending an alarm, the corresponding pipeline will be silenced for this many seconds before it tries again.
- `escalation_config`: how long an alarm stays at a given level (in number of messages) before it's escalated.
- `max_reading_delay`: for the DeviceResponding nodes: trigger an alarm when the reading is delayed by more than this many seconds.
- `connection_details`: a dict with all the info the AlarmMonitor needs to send emails, SMS messages, and phone calls.
- `recipients`: an array that defines who gets messages in case of an alarm for a given level (base level + escalation). Allowed values are `shifters`, `experts`, and `everyone`.
- `protocols`: an array that defines how recipients are contacted for a given level (base level + escalation). Allowed values are `email`, `sms`, and `phone`.
See the alarm page for more info about alarms.
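To make the level indexing concrete, here is an illustrative sketch (not Doberman's actual code) of how a base alarm level plus its escalation could select entries from these arrays:

```python
recipients = [["shifters"], ["shifters"], ["shifters", "experts"], ["everyone"]]
protocols = [["email"], ["email", "sms"], ["email", "sms", "phone"], ["email", "sms", "phone"]]

def routing(base_level, escalation):
    # effective level is base + escalation, clamped to the arrays' length
    level = min(base_level + escalation, len(recipients) - 1)
    return recipients[level], protocols[level]

routing(0, 0)  # (['shifters'], ['email'])
routing(2, 1)  # (['everyone'], ['email', 'sms', 'phone'])
```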
If you want the best performance from the website, you'll need a document here with a few experiment-specific things.
```
{
    "name": "doberview_config",
    "subsystems": [
        ["gas_system", "GS"],
        ["inner_cryostat", "IC"],
        ...
    ],
    "topics": [
        "current",
        "flow",
        "pressure",
        "status",
        "temperature",
        "voltage",
        "weight"
    ]
}
```
What the fields mean:
- `name`: `doberview_config`
- `subsystems`: a list of `[name, abbreviation]` entries for each distinct subsystem you want the website to support. We recommend `snake_case` names, but nothing should break if you `CamelCase` instead.
- `topics`: a list of the specific kinds of values your system supports. Most are self-explanatory; `status` is a catch-all for integer quantities.
This collection doesn't really need indexing as it only has a few entries.
Pipelines are sufficiently complicated that they get their own page here.
Each computer where Doberman will run needs to have an entry in the `hosts` collection. Entries should look like this:
```
{
    "name": "apollo",
    "plugin_dir": [
        "/global/software/doberman_pancake"
    ]
}
```
`name` is the hostname and `plugin_dir` is a list of directories where the plugin code can be found. You can add an index on the `name` field, but this collection gets very little traffic.
It's useful to push some log messages to the database, and these go into the `log_messages` collection. An entry probably looks like this:
```
{
    "msg": "Some logging message here",
    "level": "WARNING",
    "name": "who made this message",
    "funcname": "which function made the log call",
    "lineno": "the line number in the source where the log call was made",
    "date": "the Date when the log call was made"
}
```
The field names should be self-explanatory. Index the `date` field and whatever else you think will be useful.
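Doberman writes these entries itself, but if you want to push messages from your own scripts in the same shape, a minimal `logging.Handler` sketch might look like this (the experiment database name is again a stand-in):

```python
import datetime
import logging
import os
from pymongo import MongoClient

class MongoLogHandler(logging.Handler):
    # hypothetical handler writing records in the log_messages shape above
    def __init__(self, db):
        super().__init__()
        self.coll = db.log_messages

    def emit(self, record):
        self.coll.insert_one({
            "msg": record.getMessage(),
            "level": record.levelname,
            "name": record.name,
            "funcname": record.funcName,
            "lineno": record.lineno,
            "date": datetime.datetime.utcnow(),
        })

db = MongoClient(os.environ["DOBERMAN_MONGO_URI"])["pancake"]
logger = logging.getLogger("my_script")
logger.addHandler(MongoLogHandler(db))
logger.warning("Some logging message here")
```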
The `contacts` collection contains contact information that Doberman will use to distribute alarms. Entries look like this:
```
{
    "name": <unique name>,
    "sms": <mobile number>,
    "email": <email address>,
    "phone": <phone number>,
    "first_name": <first_name>,
    "last_name": <last_name>,
    "on_shift": false,
    "expert": true
}
```
The fields should mostly be self-explanatory. `name` could be the first name and last initial, or anything that's unique; if too many people have too-similar names, complain to their parents about their lack of creativity. `on_shift` and `expert` are boolean fields that determine whether a person belongs to one of the recipient groups (see alarm configuration).
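For example, pulling the current recipient groups out of this collection with `pymongo` might look like this (the experiment database name is a stand-in):

```python
import os
from pymongo import MongoClient

db = MongoClient(os.environ["DOBERMAN_MONGO_URI"])["pancake"]

# recipient groups, per the boolean fields above
shifters = [c["name"] for c in db.contacts.find({"on_shift": True})]
experts = [c["name"] for c in db.contacts.find({"expert": True})]
```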
The setup for Influx is much simpler. Follow the steps here and you'll probably have most things ready. Make sure to generate an access token. You will need two buckets: one "default" bucket for the bulk of the data, and another `sysmon` bucket for the system monitors. We use an infinite retention policy for "normal" data and 30 days for system monitor data. Also, map both of these buckets and their retention policies to database names so you can query them using the InfluxDB v1 query style, which is better; see here. Put the info about the "default" bucket into the `influx` doc described above under `experiment_config`.
Data are written to the bucket with the following schema. The measurement is the sensor's topic; tags are the name of the sensor, the name of the readout device, and the name of the subsystem; fields are the value itself, plus the low and high alarm thresholds. A line-format example with ms precision on the timestamp:

```
temperature,device=DeviceNameHere,sensor=T_LAB_01,subsystem=lab value=23.5,alarm_low=13,alarm_high=28 1643200124097
```
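Doberman does the writing for you, but for reference, pushing one such point over InfluxDB's standard v2 HTTP write endpoint might look like the following sketch (the URL, org, and bucket mirror the example `influx` document above; your setup may differ):

```python
import os
import time
import requests

# endpoint parameters mirror the influx document above
url = "http://192.168.131.2:8096/api/v2/write"
params = {"org": "pancake", "bucket": "data_bucket", "precision": "ms"}
headers = {"Authorization": f"Token {os.environ['INFLUX_TOKEN']}"}

# one point in the schema described above, timestamped in ms
line = ("temperature,device=DeviceNameHere,sensor=T_LAB_01,subsystem=lab "
        f"value=23.5,alarm_low=13,alarm_high=28 {int(time.time() * 1000)}")
r = requests.post(url, params=params, headers=headers, data=line)
r.raise_for_status()
```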
While Influx has a python driver, it's much easier to use the HTTP interface. You can use `curl` from the command line or the `requests` library in python. Here's how to do it with curl:

```
curl "http://${host}:${port}/query?db=${db}&org=${org}" --header "Authorization: Token $INFLUX_TOKEN" --header "Accept: application/csv" --data-urlencode "q=${query}"
```

If your database or organization names have fancy characters in them, use additional `--data-urlencode` arguments rather than direct url parameters. Querying from python:
```python
import os
import requests

# fill these in for your setup; values here mirror the examples above
host = "192.168.131.2"
port = 8096
params = {
    "db": "slowdata",
    "org": "pancake",
    "q": "SELECT last(value) FROM temperature WHERE sensor='T_LAB_01';",
}
headers = {
    "Authorization": f"Token {os.environ['INFLUX_TOKEN']}",
    "Accept": "application/csv",
}
r = requests.get(f"http://{host}:{port}/query", params=params, headers=headers)
blob = r.content.decode()
for line in blob.split('\n'):
    # the first line is headers
    print(line)
```
This will get you results in csv format; you can remove the `Accept` header to get json format instead, but I find csv easier to parse. Keep in mind that the timestamp you get back in csv format is in nanoseconds. Fill in the appropriate things for the `host`, `port`, `db`, and `org` fields. If your Influx access token is stored under a different environment variable, then change that as well.
There are usually two kinds of queries that you'll do. One is "give me the most recent value", the other is "give me the trend of this value over some time range". These queries look like this:
```
SELECT last(value) FROM <topic> WHERE sensor='<sensor>';
```

Or:

```
SELECT mean(value) FROM <topic> WHERE sensor='<sensor>' AND time > now()-5d GROUP BY time(10m) FILL(none);
```
Some commentary on this is in order. First, replace `<topic>` and `<sensor>` with the topic and sensor name. Second, the single quotes around `<sensor>` are important; don't use double quotes. Third, the `5d` and `10m` values represent how far into the past you want to look and how fine a binning you want. Also, you can replace `mean` with `median` or `max`, etc, as necessary. See this page for more examples.