(Just a small heads-up: I don't blog as much as I used to, I nowadays update my Google+ page a lot more frequently. You might want to subscribe that if you are interested in more frequent technical updates on what we are working on.)
In the past weeks we have been working on a couple of features for systemd that enable a number of new usecases I'd like to shed some light on. Taking benefit of the /usr merge that a number of distributions have completed we want to bring runtime behaviour of Linux systems to the next level. With the /usr merge completed most static vendor-supplied OS data is found exclusively in /usr, only a few additional bits in /var and /etc are necessary to make a system boot. On this we can build to enable a couple of new features:
- A mechanism we call Factory Reset shall flush out /etc and /var, but keep the vendor-supplied /usr, bringing the system back into a well-defined, pristine vendor state with no local state or configuration. This functionality is useful across the board from servers, to desktops, to embedded devices.
- A Stateless System goes one step further: a system like this never stores /etc or /var on persistent storage, but always comes up with pristine vendor state. On systems like this every reboot acts as factor reset. This functionality is particularly useful for simple containers or systems that boot off the network or read-only media, and receive all configuration they need during runtime from vendor packages or protocols like DHCP or are capable of discovering their parameters automatically from the available hardware or periphery.
- Reproducible Systems multiply a vendor image into many containers or systems. Only local configuration or state is stored per-system, while the vendor operating system is pulled in from the same, immutable, shared snapshot. Each system hence has its private /etc and /var for receiving local configuration, however the OS tree in /usr is pulled in via bind mounts (in case of containers) or technologies like NFS (in case of physical systems), or btrfs snapshots from a golden master image. This is particular interesting for containers where the goal is to run thousands of container images from the same OS tree. However, it also has a number of other usecases, for example thin client systems, which can boot the same NFS share a number of times. Furthermore this mechanism is useful to implement very simple OS installers, that simply unserialize a /usr snapshot into a file system, install a boot loader, and reboot.
- Verifiable Systems are closely related to stateless systems: if the underlying storage technology can cryptographically ensure that the vendor-supplied OS is trusted and in a consistent state, then it must be made sure that /etc or /var are either included in the OS image, or simply unnecessary for booting.
Concepts
A number of Linux-based operating systems have tried to implement some of the schemes described out above in one way or another. Particularly interesting are GNOME's OSTree, CoreOS and Google's Android and ChromeOS. They generally found different solutions for the specific problems you have when implementing schemes like this, sometimes taking shortcuts that keep only the specific case in mind, and cannot cover the general purpose. With systemd now being at the core of so many distributions and deeply involved in bringing up and maintaining the system we came to the conclusion that we should attempt to add generic support for setups like this to systemd itself, to open this up for the general purpose distributions to build on. We decided to focus on three kinds of systems:
- The stateful system, the traditional system as we know it with machine-specific /etc, /usr and /var, all properly populated.
- Startup without a populated /var, but with configured /etc. (We will call these volatile systems.)
- Startup without either /etc or /var. (We will call these stateless systems.)
A factory reset is just a special case of the latter two modes, where the system boots up without /var and /etc but the next boot is a normal stateful boot like like the first described mode. Note that a mode where /etc is flushed, but /var is not is nothing we intend to cover (why? well, the user ID question becomes much harder, see below, and we simply saw no usecase for it worth the trouble).
Problems
Booting up a system without a populated /var is relatively straight-forward. With a few lines of tmpfiles configuration it is possible to populate /var with its basic structure in a way that is sufficient to make a system boot cleanly. systemd version 214 and newer ship with support for this. Of course, support for this scheme in systemd is only a small part of the solution. While a lot of software reconstructs the directory hierarchy it needs in /var automatically, many software does not. In case like this it is necessary to ship a couple of additional tmpfiles lines that setup up at boot-time the necessary files or directories in /var to make the software operate, similar to what RPM or DEB packages would set up at installation time.
Booting up a system without a populated /etc is a more difficult task. In /etc we have a lot of configuration bits that are essential for the system to operate, for example and most importantly system user and group information in /etc/passwd and /etc/group. If the system boots up without /etc there must be a way to replicate the minimal information necessary in it, so that the system manages to boot up fully.
To make this even more complex, in order to support "offline" updates of /usr that are replicated into a number of systems possessing private /etc and /var there needs to be a way how these directories can be upgraded transparently when necessary, for example by recreating caches like /etc/ld.so.cache or adding missing system users to /etc/passwd on next reboot.
Starting with systemd 215 (yet unreleased, as I type this) we will ship with a number of features in systemd that make /etc-less boots functional:
A new tool systemd-sysusers as been added. It introduces a new drop-in directory /usr/lib/sysusers.d/. Minimal descriptions of necessary system users and groups can be placed there. Whenever the tool is invoked it will create these users in /etc/passwd and /etc/group should they be missing. It is only suitable for creating system users and groups, not for normal users. It will write to the files directly via the appropriate glibc APIs, which is the right thing to do for system users. (For normal users no such APIs exist, as the users might be stored centrally on LDAP or suchlike, and they are out of focus for our usecase.) The major benefit of this tool is that system user definition can happen offline: a package simply has to drop in a new file to register a user. This makes system user registration declarative instead of imperative -- which is the way how system users are traditionally created from RPM or DEB installation scripts. By being declarative it is easy to replicate the users on next boot to a number of system instances.
To make this new tool interesting for packaging scripts we make it easy to alternatively invoke it during package installation time, thus being a good alternative to invocations of useradd -r and groupadd -r.
Some OS designs use a static, fixed user/group list stored in /usr as primary database for users/groups, which fixed UID/GID mappings. While this works for specific systems, this cannot cover the general purpose. As the UID/GID range for system users/groups is very small (only containing 998 users and groups on most systems), the best has to be made from this space and only UIDs/GIDs necessary on the specific system should be allocated. This means allocation has to be dynamic and adjust to what is necessary.
Also note that this tool has one very nice feature: in addition to fully dynamic, and fully static UID/GID assignment for the users to create, it supports reading UID/GID numbers off existing files in /usr, so that vendors can make use of setuid/setgid binaries owned by specific users.
- We also added a default user definition list which creates the most basic users the system and systemd need. Of course, very likely downstream distributions might need to alter this default list, add new entries and possibly map specific users to particular numeric UIDs.
- A new condition ConditionNeedsUpdate= has been added. With this mechanism it is possible to conditionalize execution of services depending on whether /usr is newer than /etc or /var. The idea is that various services that need to be added into the boot process on upgrades make use of this to not delay boot-ups on normal boots, but run as necessary should /usr have been update since the last boot. This is implemented based on the mtime timestamp of the /usr: if the OS has been updated the packaging software should touch the directory, thus informing all instances that an upgrade of /etc and /var might be necessary.
- We added a number of service files, that make use of the new ConditionNeedsUpdate= switch, and run a couple of services after each update. Among them are the aforementiond systemd-sysusers tool, as well as services that rebuild the udev hardware database, the journal catalog database and the library cache in /etc/ld.so.cache.
- If systemd detects an empty /etc at early boot it will now use the unit preset information to enable all services by default that the vendor or packager declared. It will then proceed booting.
- We added a new tmpfiles snippet that is able to reconstruct the most basic structure of /etc if it is missing.
- tmpfiles also gained the ability copy entire directory trees into place should they be missing. This is particularly useful for copying certain essential files or directories into /etc without which the system refuses to boot. Currently the most prominent candidates for this are /etc/pam.d and /etc/dbus-1. In the long run we hope that packages can be fixed so that they always work correctly without configuration in /etc. Depending on the software this means that they should come with compiled-in defaults that just work should their configuration file be missing, or that they should fall back to static vendor-supplied configuration in /usr that is used whenever /etc doesn't have any configuration. Both the PAM and the D-Bus case are probably candidates for the latter. Given that there are probably many cases like this we are working with a number of folks to introduce a new directory called /usr/share/etc (name is not settled yet) to major distributions, that always contain the full, original, vendor-supplied configuration of all packages. This is very useful here, so that there's an obvious place to copy the original configuration from, but it is also useful completely independently as this provides administrators with an easy place to diff their own configuration in /etc against to see what local changes are in place.
We added a new --tmpfs= switch to systemd-nspawn to make testing of systems with unpopulated /etc and /var easy. For example, to run a fully state-less container, use a command line like this:
# system-nspawn -D /srv/mycontainer --read-only --tmpfs=/var --tmpfs=/etc -b
This command line will invoke the container tree stored in /srv/mycontainer in a read-only way, but with a (writable) tmpfs mounted to /var and /etc. With a very recent git snapshot of systemd invoking a Fedora rawhide system should mostly work OK, modulo the D-Bus and PAM problems mentioned above. A later version of systemd-nspawn is likely to gain a high-level switch --mode={stateful|volatile|stateless} that sets combines this into simple switches reusing the vocabulary introduced earlier.
What's Next
Pulling this all together we are very close to making boots with empty /etc and /var on general purpose Linux operating systems a reality. Of course, while doing the groundwork in systemd gets us some distance, there's a lot of work left. Most importantly: the majority of Linux packages are simply incomptible with this scheme the way they are currently set up. They do not work without configuration in /etc or state directories in /var; they do not drop system user information in /usr/lib/sysusers.d. However, we believe it's our job to do the groundwork, and to start somewhere.
So what does this mean for the next steps? Of course, currently very little of this is available in any distribution (simply already because 215 isn't even released yet). However, this will hopefully change quickly. As soon as that is accomplished we can start working on making the other components of the OS work nicely in this scheme. If you are an upstream developer, please consider making your software work correctly if /etc and/or /var are not populated. This means:
- When you need a state directory in /var and it is missing, create it first. If you cannot do that, because you dropped priviliges or suchlike, please consider dropping in a tmpfiles snippet that creates the directory with the right permissions early at boot, should it be missing.
- When you need configuration files in /etc to work properly, consider changing your application to work nicely when these files are missing, and automatically fall back to either built-in defaults, or to static vendor-supplied configuration files shipped in /usr, so that administrators can override configuration in /etc but if they don't the default configuration counts.
- When you need a system user or group, consider dropping in a file into /usr/lib/sysusers.d describing the users. (Currently documentation on this is minimal, we will provide more docs on this shortly.)
If you are a packager, you can also help on making this all work:
- Ask upstream to implement what we describe above, possibly even preparing a patch for this.
- If upstream will not make these changes, then consider dropping in tmpfiles snippets that copy the bare minimum of configuration files to make your software work from somewhere in /usr into /etc.
- Consider moving from imperative useradd commands in packaging scripts, to declarative sysusers files. Ideally, this is shipped upstream too, but if that's not possible then simply adding this to packages should be good enough.
Of course, before moving to declarative system user definitions you should consult with your distribution whether their packaging policy even allows that. Currently, most distributions will not, so we have to work to get this changed first.
Anyway, so much about what we have been working on and where we want to take this.
Conclusion
Before we finish, let me stress again why we are doing all this:
- For end-user machines like desktops, tablets or mobile phones, we want a generic way to implement factory reset, which the user can make use of when the system is broken (saves you support costs), or when he wants to sell it and get rid of his private data, and renew that "fresh car smell".
- For embedded machines we want a generic way how to reset devices. We also want a way how every single boot can be identical to a factory reset, in a stateless system design.
- For all kinds of systems we want to centralize vendor data in /usr so that it can be strictly read-only, and fully cryptographically verified as one unit.
- We want to enable new kinds of OS installers that simply deserialize a vendor OS /usr snapshot into a new file system, install a boot loader and reboot, leaving all first-time configuration to the next boot.
- We want to enable new kinds of OS updaters that build on this, and manage a number of vendor OS /usr snapshots in verified states, and which can then update /etc and /var simply by rebooting into a newer version.
- We wanto to scale container setups naturally, by sharing a single golden master /usr tree with a large number of instances that simply maintain their own private /etc and /var for their private configuration and state, while still allowing clean updates of /usr.
- We want to make thin clients that share /usr across the network work by allowing stateless bootups. During all discussions on how /usr was to be organized this was fequently mentioned. A setup like this so far only worked in very specific cases, with this scheme we want to make this work in general case.
Of course, we have no illusions, just doing the groundwork for all of this in systemd doesn't make this all a real-life solution yet. Also, it's very unlikely that all of Fedora (or any other general purpose distribution) will support this scheme for all its packages soon, however, we are quite confident that the idea is convincing, that we need to start somewhere, and that getting the most core packages adapted to this shouldn't be out of reach.
Oh, and of course, the concepts behind this are really not new, we know that. However, what's new here is that we try to make them available in a general purpose OS core, instead of special purpose systems.
Anyway, let's get the ball rolling! Late's make stateless systems a reality!
And that's all I have for now. I am sure this leaves a lot of questions open. If you have any, join us on IRC on #systemd on freenode or comment on Google+.