No, U dev

This week in Liminix: also, last week and for several weeks preceding, it has been all about the “device database”.

To recap, we wish to run certain services only under particular conditions. The particular use case I have here is my backup server, which is a GL.iNet travel router that runs rsyncd, with a USB disk plugged into it. (There’s historical resonance here: a lightweight backup server was the original reason I started writing NixWRT.) The system shouldn’t try to mount the external drive unless it’s plugged in - if it starts the mount service at boot, the service startup will hang, and that means the machine can’t be cleanly rebooted - nor any other change made to the running services - until the disk is attached. Ugly.

The disk is present when there’s a node in sysfs with a uevent file containing the attributes DEVTYPE=partition and PARTNAME=backup-disk.

# cat /sys/class/block/sda1/uevent
MAJOR=8
MINOR=1
DEVNAME=sda1
DEVTYPE=partition
DISKSEQ=7
PARTN=1
PARTNAME=backup-disk

But we don’t know where under /sys that file is: the kernel allocates sdX devices as it sees them, so the name might depend on how many other storage devices are plugged in.
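
To make the matching rule concrete, here is a minimal sketch of the check in Python. It is only an illustration: the names parse_uevent, matches, SAMPLE and WANTED are made up for this example and are not part of Liminix.

```python
def parse_uevent(text):
    """Turn the KEY=VALUE lines of a sysfs uevent file into a dict."""
    attrs = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore blank or malformed lines
            attrs[key] = value
    return attrs


def matches(attrs, criteria):
    """True when every wanted attribute is present with the wanted value."""
    return all(attrs.get(key) == value for key, value in criteria.items())


# The uevent contents shown above, and the two attributes we care about.
SAMPLE = """MAJOR=8
MINOR=1
DEVNAME=sda1
DEVTYPE=partition
DISKSEQ=7
PARTN=1
PARTNAME=backup-disk
"""

WANTED = {"DEVTYPE": "partition", "PARTNAME": "backup-disk"}

if __name__ == "__main__":
    print(matches(parse_uevent(SAMPLE), WANTED))
```

Note that the check never mentions sda1: it depends only on the attributes, which is the whole point.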

The naive solution (don’t do this) would be to recursively walk the whole of /sys every few seconds. Thankfully we don’t have to, because the kernel sends “uevent” netlink messages whenever anything changes. So we built “devout”, a service that maintains a model of all the hardware by listening to these messages and updating a database (using the term in its loosest sense) of the state. Then we can have a client (or many clients) connect to the database service and say “send me all the events matching some criteria I am interested in”. Devout will send it the messages for all matching devices it knows about at connection time, then relay further relevant netlink messages to it until it disconnects again.
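
The shape of that service can be sketched in a few lines of Python. This is not devout’s actual code: it leaves out the netlink socket and the client protocol entirely, and the class name, method names and the choice to replay known devices as synthetic “add” events are all assumptions made for illustration.

```python
def matches(attrs, criteria):
    """True when every wanted attribute is present with the wanted value."""
    return all(attrs.get(key) == value for key, value in criteria.items())


class DeviceDatabase:
    """In-memory model of the hardware, updated from uevent-like messages."""

    def __init__(self):
        self.devices = {}      # device path -> latest known attributes
        self.subscribers = []  # (criteria, callback) pairs

    def subscribe(self, criteria, callback):
        # First replay every matching device we already know about
        # (assumption: presented to the client as an "add") ...
        for devpath, attrs in self.devices.items():
            if matches(attrs, criteria):
                callback("add", devpath, attrs)
        # ... then remember the client so later events are relayed to it.
        self.subscribers.append((criteria, callback))

    def handle_event(self, action, devpath, attrs):
        # Merge with what we already knew, so a sparse message can still
        # be matched against the device's full set of attributes.
        merged = {**self.devices.get(devpath, {}), **attrs}
        if action == "remove":
            self.devices.pop(devpath, None)
        else:
            self.devices[devpath] = merged
        for criteria, callback in self.subscribers:
            if matches(merged, criteria):
                callback(action, devpath, merged)
```

A client that connects after the disk has been plugged in still hears about it, because of the replay in subscribe. That is what saves each client from having to walk /sys for itself.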

The client then starts its controlled service when it gets “add” or “change” events and stops it again on “remove”.
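
As a rough sketch of the client side, again in Python with made-up names: the Trigger class below takes a pair of hypothetical start and stop callables, and it assumes only one matching device at a time. Making it idempotent (a “change” after an “add” does not start the service twice) is my assumption about sensible behaviour, not a description of the real client.

```python
class Trigger:
    """Start a controlled service on add/change, stop it on remove."""

    def __init__(self, start, stop):
        self.start = start    # callable that brings the service up
        self.stop = stop      # callable that takes it down again
        self.running = False

    def on_event(self, action):
        if action in ("add", "change"):
            if not self.running:
                self.start()
                self.running = True
        elif action == "remove":
            if self.running:
                self.stop()
                self.running = False


if __name__ == "__main__":
    trigger = Trigger(lambda: print("mount"), lambda: print("unmount"))
    for action in ["add", "change", "remove"]:
        trigger.on_event(action)
```

Every start is paired with exactly one stop, which is the symmetry that is hard to guarantee with free-form rules.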

“Why go to all this trouble when udev already exists?” It’s a fair question, and I keep asking it myself as well.

The short version is that udev rules afford a level of generality which is (so far, for our purposes) unnecessary, easy to get wrong, and hard (for me, at least) to reason about. Rules can run arbitrary commands on each event and there is no symmetry: it’s hard to be confident that all the changes to the system which were introduced by some “add” rule are undone correctly when the device is removed. The udev rule language also has jumps and conditions, which means it’s not simple to know what a rule will do when considered in isolation.

Another consideration is that I am hoping this general pattern - a trigger service which subscribes to events from another service - will be applicable to other event sources - e.g. SNMP, or rtnetlink messages, or collectd, or (fill in your application here).

Levitate me

Where we are right now, though, is that I have reinstalled my backup server, and this time I have enabled it for levitate: the mechanism for rebooting into a maintenance system that I implemented last year. I promptly discovered that for it to be of any actual use, it needed some rejiggering: the tl;dr is that levitate now expects to be passed a whole config fragment, not just a list of services. So maybe now you can use levitate (and maybe I should write some documentation for it).

Failover is the mother of success

Next thing on the list is a mechanism for failover to a secondary WAN connection when the primary link goes down.