# Usage

Doberman is built around a single executable, `Monitor.py`, which serves several roles.
We'll go over them here.
The core principle behind Doberman is that you want some function called at some interval.
This function probably asks a device to read out one sensor, but there is also a large variety of other possibilities.
Doberman is written around this concept; see this page for a discussion of it.
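The "function called at some interval" idea can be sketched as below. This is an illustrative loop, not Doberman's actual internals (the real Monitor classes handle threading, errors, and scheduling):

```python
import time

def run_readout_loop(readout_func, interval, should_run):
    """Call readout_func every `interval` seconds until should_run() is False.

    Minimal sketch of the core principle; names are illustrative.
    """
    while should_run():
        start = time.time()
        readout_func()
        # sleep for whatever remains of the interval
        time.sleep(max(0.0, interval - (time.time() - start)))
```

Note the sleep accounts for the time the readout itself took, so the cadence stays close to `interval` rather than drifting.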
All Monitors respond quickly to ctrl-c and should shut down promptly.
Sometimes it can take a bit longer if the Monitor is in the middle of something that can't be interrupted, but we're still talking a few seconds.
We initially had everything running in systemd, but this proved to be difficult in some situations, mainly in handling some error states, and in starting things remotely.
After a wildly successful experiment involving `screen` and an automated software controller in the XENONnT DAQ system, we changed the operational paradigm to have everything run in screen sessions.
This makes automated control much simpler.
This is discussed more in the Hypervisor section below.
All Monitors periodically heartbeat with the database, and the hypervisor periodically checks the heartbeats of all running Monitors. If something that is supposed to be running isn't, it gets started.

All Monitors also listen for internal communications. The port numbers for this start at whatever value you specify for the hypervisor in the global dispatch; each device on each host gets its own port. You don't need to specify anything beyond the hypervisor's port, the rest are assigned automatically.
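The port scheme amounts to counting up from the hypervisor's port, one per device per host. A rough sketch (the actual assignment logic lives inside Doberman and its ordering may differ):

```python
def assign_ports(hypervisor_port, devices_by_host):
    """Give each device on each host its own port, counting up from
    the hypervisor's port. Illustrative only; Doberman assigns these
    automatically and the real ordering may differ."""
    ports = {}
    next_port = hypervisor_port + 1
    for host in sorted(devices_by_host):
        for device in sorted(devices_by_host[host]):
            ports[(host, device)] = next_port
            next_port += 1
    return ports
```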
All communication routes through the dispatcher, which runs as part of the hypervisor.
The dispatcher receives all messages, holds onto them for as long as is necessary, and then sends them to their destination (assuming the recipient is online).
The message buffering exists to support scheduling messages for some point in the future; usually this is a pipeline scheduling state changes, but other uses also exist.
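The hold-then-deliver behavior can be sketched with a priority queue keyed on delivery time. This is a hypothetical illustration of the buffering idea, not the dispatcher's real implementation:

```python
import heapq
import time

class MessageBuffer:
    """Sketch of dispatcher-style buffering: messages are held until
    their due time, then released to their recipients."""

    def __init__(self):
        self._heap = []  # (due_time, seq, recipient, message)
        self._seq = 0    # tiebreaker so the heap never compares payloads

    def schedule(self, recipient, message, due_time=None):
        due = time.time() if due_time is None else due_time
        heapq.heappush(self._heap, (due, self._seq, recipient, message))
        self._seq += 1

    def pop_due(self, now=None):
        """Release every message whose due time has passed."""
        now = time.time() if now is None else now
        out = []
        while self._heap and self._heap[0][0] <= now:
            _, _, recipient, message = heapq.heappop(self._heap)
            out.append((recipient, message))
        return out
```

An immediate message just gets a due time of "now", so both cases flow through the same queue.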
Sending commands is done via a call to the database API (`Database.log_command`).
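As a rough sketch of the pattern, a command ends up as a database document naming a recipient and a delivery time. The field names and arguments below are assumptions for illustration; check `Database.log_command` itself for the actual interface:

```python
import time

def log_command(command, target, delay=0.0):
    """Hypothetical sketch: represent a command as a document naming
    the recipient and when it should be delivered. The real
    Database.log_command signature and schema may differ."""
    return {
        "command": command,
        "to": target,
        "due": time.time() + delay,  # dispatcher holds it until then
    }
```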
Each Monitor accepts different commands, so we'll cover them in their sections.
A simple bash script is provided in the `scripts` subdirectory.
It acts as a convenient way to automatically start things inside screen sessions.
The main user of this script is the hypervisor, but it's also really useful for humans.
## Device Monitors

You'll have one of these for each device you want read out.
Start one using the provided helper script (`./start_process.sh -d <device>`) or manually with `./Monitor.py --device <device>`.
The helper script will start the process in a screen session.
The monitor will start up, connect to its device, and begin reading out all configured sensors.
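Under the hood, the helper amounts to launching `Monitor.py` inside a detached, named screen session, roughly like this (the exact session-name convention here is an assumption, not necessarily what the script uses):

```python
def screen_command(device):
    """Build a command that runs Monitor.py for `device` inside a
    detached, named screen session -- roughly what the helper script
    does. The session name is illustrative."""
    return f"screen -dmS {device} ./Monitor.py --device {device}"
```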
Device monitors accept the following commands:

- `stop`: stop and shut down
- `set <quantity> <value>`: tell the device to set `<quantity>` to `<value>`, whatever those are. `<quantity>` and `<value>` are forwarded to the device driver for it to deal with. Please note that `<value>` may not contain spaces but `<quantity>` can, so `set valve 3 open` will split into `valve 3` and `open`, respectively, but `set heater 1 max power` will split into `heater 1 max` and `power`, which probably isn't what you wanted.
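The splitting rule described above amounts to: the last whitespace-separated token is the value, and everything between `set` and it is the quantity. A minimal sketch:

```python
def split_set_command(cmd):
    """Split 'set <quantity> <value>' as described above:
    <value> is the final token, <quantity> is everything between
    'set' and it (so <quantity> may contain spaces, <value> may not)."""
    tokens = cmd.split()
    if len(tokens) < 3 or tokens[0] != "set":
        raise ValueError(f"not a valid set command: {cmd!r}")
    return " ".join(tokens[1:-1]), tokens[-1]
```

This reproduces both examples from the text, including the `heater 1 max power` pitfall.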
## Pipeline Monitors

There are three kinds of pipelines: alarm pipelines that handle alarm states, control pipelines that make changes to what the system is doing, and convert pipelines that perform some mathematical operation on measurements (or combinations of them) and put the results into the database.
There are three kinds of pipeline monitors, one for each kind of pipeline.
The hypervisor will start each of these when it starts up, but you can do it manually via `./start_process.sh --pipeline pl_<flavor>` or `./Monitor.py --pipeline pl_<flavor>`, where `<flavor>` is one of `alarm`, `control`, or `convert`.
There should only be one of each of these running at once.
Pipeline monitors take the following commands:

- `pipelinectl_start <name>`: start the specified pipeline
- `pipelinectl_stop <name>`: stop the specified pipeline
- `pipelinectl_restart <name>`: restart the specified pipeline
- `pipelinectl_silent <name>`: silence the specified pipeline
- `pipelinectl_active <name>`: activate the specified pipeline
- `stop`: stop the monitor and shut down all owned pipelines

Note that the `pipelinectl` commands require that the specified pipeline is actually owned by the monitor handling the command (`start` obviously excluded).
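The ownership requirement can be sketched like this (a hypothetical handler; the real monitor's logic lives in Doberman):

```python
def should_handle(command, name, owned_pipelines):
    """Return True if this pipeline monitor should act on the command.

    pipelinectl_start may name a pipeline this monitor doesn't own yet;
    every other pipelinectl command must target an owned pipeline.
    Illustrative sketch only.
    """
    if command == "pipelinectl_start":
        return True
    return name in owned_pipelines
```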
## Alarm Monitor

This is a pipeline monitor that specializes in alarm pipelines.
When alarm states are detected, alarm messages are created and distributed via the specified methods.
See the alarm page for more details on alarm distribution.
Start with `./start_process.sh --alarm` or `./Monitor.py --alarm`.
It takes no additional commands.
## Hypervisor

If something goes wrong and one of your readout machines crashes and reboots in the small hours of the morning (this is exceedingly rare, but stay with me), do you want to get woken up by an alarm, only find out when you get to the lab after coffee, or have something automatically restart everything?
This is the job of the hypervisor.
It makes sure that everything that's supposed to be running is running.
It does this via commands over ssh, so be sure to have your ssh permissions set up.
This is also why `screen` is more convenient than systemd: slow control doesn't need sudo-level permissions but systemd does, and running commands as root via ssh without a password is a staggeringly massive security risk.
Screen doesn't have this limitation.
The hypervisor keeps a list of everything that's currently running (specifically, Monitors add and remove themselves from this list on startup and shutdown), and also a list of things that are supposed to be running (the "managed" things).
Things that are supposed to be running but aren't will get started.
Note that the hypervisor isn't a panacea and if the problem is the readout device itself then there's not much it can do, but it is still very useful.
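The bookkeeping reduces to a set difference: anything managed but not currently running needs to be started. A sketch (illustrative; the hypervisor then does the actual starting over ssh):

```python
def needs_restart(managed, running):
    """Things that are supposed to be running (managed) but aren't.
    Illustrative sketch of the hypervisor's bookkeeping."""
    return sorted(set(managed) - set(running))
```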
The hypervisor also acts as the central dispatcher for interprocess communication, so while you don't need to give it devices to manage, it does still need to run.
Also, the hypervisor will compress all logfiles older than one week.
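That compression step looks roughly like the following; the file layout and naming here are assumptions, not Doberman's actual scheme:

```python
import gzip
import shutil
import time
from pathlib import Path

def compress_old_logs(log_dir, max_age_days=7):
    """Gzip logfiles older than max_age_days and remove the originals.
    Sketch of the behavior described above; the real hypervisor's
    file layout and naming may differ."""
    cutoff = time.time() - max_age_days * 86400
    for path in Path(log_dir).glob("*.log"):
        if path.stat().st_mtime < cutoff:
            with open(path, "rb") as src, gzip.open(f"{path}.gz", "wb") as dst:
                shutil.copyfileobj(src, dst)
            path.unlink()  # drop the uncompressed original
```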
Start with `./start_process.sh --hypervisor` or `./Monitor.py --hypervisor`.
The obvious next question is 'quis custodiet ipsos custodes?'.
Pretty much everything the hypervisor does is wrapped in `try`/`except` blocks, so it's difficult for the hypervisor itself to crash.
The machine hosting it can crash, but the average server running Linux should be able to put out continuous years of uptime; if it randomly crashes on a semiregular basis, the underlying hardware is probably faulty and you should replace it.
It should go without saying that there should only be one hypervisor running at any one time.
The hypervisor accepts the following commands:

- `start <name>`: start whatever `<name>` is. If it isn't a device it's assumed to be a pipeline. If it's neither, then what are you doing?
- `manage <name>`: add `<name>` to the list of managed devices.
- `unmanage <name>`: remove `<name>` from the list of managed devices.
- `kill <name>`: whatever `<name>` is, take it out back and unceremoniously get rid of it. This forces an unclean shutdown. If `<name>` isn't a device, it's assumed to be the name of a screen session running on localhost.

Note that `stop` commands are issued directly to the Monitor in question, so the hypervisor doesn't need to get involved.
## Bringing the System Online

Here's how to actually bring the system online.
This assumes you've already configured the databases appropriately; if you haven't, do that now.
It also assumes all your databases restart automatically and you don't need to do any manual networking nonsense.

Suppose your UPS runs out before the power comes back up: this is a dirty shutdown.
You'll need to do one thing: start the hypervisor, if it didn't start automatically.
This will return the system to whatever it was doing when it lost power.