Fitting Everything Together

TLDR: Hermetic /usr/ is awesome; let's popularize image-based OSes with modernized security properties built around immutability, SecureBoot, TPM2, adaptability, auto-updating, factory reset, uniformity – built from traditional distribution packages, but deployed via images.

Over the past years, systemd gained a number of components for building Linux-based operating systems. While these components individually have been adopted by many distributions and products for specific purposes, we did not publicly communicate a broader vision of how they should all fit together in the long run. In this blog story I hope to provide that from my personal perspective, i.e. explain how I personally would build an OS and where I personally think OS development with Linux should go.

I figure this is going to be a longer blog story, but I hope it will be equally enlightening. Please understand though that everything I write about OS design here is my personal opinion, and not one of my employer.

For the last 12 years or so I have been working on Linux OS development, mostly around systemd. In all those years I had a lot of time thinking about the Linux platform, and specifically traditional Linux distributions and their strengths and weaknesses. I have seen many attempts to reinvent Linux distributions in one way or another, to varying success. After all this most would probably agree that the traditional RPM or dpkg/apt-based distributions still define the Linux platform more than others (for 25+ years now), even though some Linux-based OSes (Android, ChromeOS) probably outnumber the installations overall.

And over all those 12 years I kept wondering, how would I actually build an OS for a system or for an appliance, and what are the components necessary to achieve that. And most importantly, how can we make these components generic enough so that they are useful in generic/traditional distributions too, and in other use cases than my own.

The Project

Before figuring out how I would build an OS it's probably good to figure out what type of OS I actually want to build, what purpose I intend to cover. I think a desktop OS is probably the most interesting. Why is that? Well, first of all, I use one of these for my job every single day, so I care immediately, it's my primary tool of work. But more importantly: I think building a desktop OS is one of the most complex overall OS projects you can work on, simply because desktops are so much more versatile and variable than servers or embedded devices. If one figures out the desktop case, I think there's a lot more to learn from, and reuse in the server or embedded case, then going the other way. After all, there's a reason why so much of the widely accepted Linux userspace stack comes from people with a desktop background (including systemd, BTW).

So, let's see how I would build a desktop OS. If you press me hard, and ask me why I would do that given that ChromeOS already exists and more or less is a Linux desktop OS: there's plenty I am missing in ChromeOS, but most importantly, I am lot more interested in building something people can easily and naturally rebuild and hack on, i.e. Google-style over-the-wall open source with its skewed power dynamic is not particularly attractive to me. I much prefer building this within the framework of a proper open source community, out in the open, and basing all this strongly on the status quo ante, i.e. the existing distributions. I think it is crucial to provide a clear avenue to build a modern OS based on the existing distribution model, if there shall ever be a chance to make this interesting for a larger audience.

(Let me underline though: even though I am going to focus on a desktop here, most of this is directly relevant for servers as well, in particular container host OSes and suchlike, or embedded devices, e.g. car IVI systems and so on.)

Design Goals

First and foremost, I think the focus must be on an image-based design rather than a package-based one. For robustness and security it is essential to operate with reproducible, immutable images that describe the OS or large parts of it in full, rather than operating always with fine-grained RPM/dpkg style packages. That's not to say that packages are not relevant (I actually think they matter a lot!), but I think they should be less of a tool for deploying code but more one of building the objects to deploy. A different way to see this: any OS built like this must be easy to replicate in a large number of instances, with minimal variability. Regardless if we talk about desktops, servers or embedded devices: focus for my OS should be on "cattle", not "pets", i.e that from the start it's trivial to reuse the well-tested, cryptographically signed combination of software over a large set of devices the same way, with a maximum of bit-exact reuse and a minimum of local variances.
The trust chain matters, from the boot loader all the way to the apps. This means all code that is run must be cryptographically validated before it is run. All storage must be cryptographically protected: public data must be integrity checked; private data must remain confidential.

This is in fact where big distributions currently fail pretty badly. I would go as far as saying that SecureBoot on Linux distributions is mostly security theater at this point, if you so will. That's because the initrd that unlocks your FDE (i.e. the cryptographic concept that protects the rest of your system) is not signed or protected in any way. It's trivial to modify for an attacker with access to your hard disk in an undetectable way, and collect your FDE passphrase. The involved bureaucracy around the implementation of UEFI SecureBoot of the big distributions is to a large degree pointless if you ask me, given that once the kernel is assumed to be in a good state, as the next step the system invokes completely unsafe code with full privileges.

This is a fault of current Linux distributions though, not of SecureBoot in general. Other OSes use this functionality in more useful ways, and we should correct that too.
Pretty much the same thing: offline security matters. I want my data to be reasonably safe at rest, i.e. cryptographically inaccessible even when I leave my laptop in my hotel room, suspended.
Everything should be cryptographically measured, so that remote attestation is supported for as much software shipped on the OS as possible.
Everything should be self descriptive, have single sources of truths that are closely attached to the object itself, instead of stored externally.
Everything should be self-updating. Today we know that software is never bug-free, and thus requires a continuous update cycle. Not only the OS itself, but also any extensions, services and apps running on it.
Everything should be robust in respect to aborted OS operations, power loss and so on. It should be robust towards hosed OS updates (regardless if the download process failed, or the image was buggy), and not require user interaction to recover from them.
There must always be a way to put the system back into a well-defined, guaranteed safe state ("factory reset"). This includes that all sensitive data from earlier uses becomes cryptographically inaccessible.
The OS should enforce clear separation between vendor resources, system resources and user resources: conceptually and when it comes to cryptographical protection.
Things should be adaptive: the system should come up and make the best of the system it runs on, adapt to the storage and hardware. Moreover, the system should support execution on bare metal equally well as execution in a VM environment and in a container environment (i.e. systemd-nspawn).
Things should not require explicit installation. i.e. every image should be a live image. For installation it should be sufficient to dd an OS image onto disk. Thus, strong focus on "instantiate on first boot", rather than "instantiate before first boot".
Things should be reasonably minimal. The image the system starts its life with should be quick to download, and not include resources that can as well be created locally later.
System identity, local cryptographic keys and so on should be generated locally, not be pre-provisioned, so that there's no leak of sensitive data during the transport onto the system possible.
Things should be reasonably democratic and hackable. It should be easy to fork an OS, to modify an OS and still get reasonable cryptographic protection. Modifying your OS should not necessarily imply that your "warranty is voided" and you lose all good properties of the OS, if you so will.
Things should be reasonably modular. The privileged part of the core OS must be extensible, including on the individual system. It's not sufficient to support extensibility just through high-level UI applications.
Things should be reasonably uniform, i.e. ideally the same formats and cryptographic properties are used for all components of the system, regardless if for the host OS itself or the payloads it receives and runs.
Even taking all these goals into consideration, it should still be close to traditional Linux distributions, and take advantage of what they are really good at: integration and security update cycles.

Now that we know our goals and requirements, let's start designing the OS along these lines.

Hermetic `/usr/`

First of all the OS resources (code, data files, …) should be hermetic in an immutable /usr/. This means that a /usr/ tree should carry everything needed to set up the minimal set of directories and files outside of /usr/ to make the system work. This /usr/ tree can then be mounted read-only into the writable root file system that then will eventually carry the local configuration, state and user data in /etc/, /var/ and /home/ as usual.

Thankfully, modern distributions are surprisingly close to working without issues in such a hermetic context. Specifically, Fedora works mostly just fine: it has adopted the /usr/ merge and the declarative systemd-sysusers and systemd-tmpfiles components quite comprehensively, which means the directory trees outside of /usr/ are automatically generated as needed if missing. In particular /etc/passwd and /etc/group (and related files) are appropriately populated, should they be missing entries.

In my model a hermetic OS is hence comprehensively defined within /usr/: combine the /usr/ tree with an empty, otherwise unpopulated root file system, and it will boot up successfully, automatically adding the strictly necessary files, and resources that are necessary to boot up.

Monopolizing vendor OS resources and definitions in an immutable /usr/ opens multiple doors to us:

We can apply dm-verity to the whole /usr/ tree, i.e. guarantee structural, cryptographic integrity on the whole vendor OS resources at once, with full file system metadata.
We can implement updates to the OS easily: by implementing an A/B update scheme on the /usr/ tree we can update the OS resources atomically and robustly, while leaving the rest of the OS environment untouched.
We can implement factory reset easily: erase the root file system and reboot. The hermetic OS in /usr/ has all the information it needs to set up the root file system afresh — exactly like in a new installation.

Initial Look at the Partition Table

So let's have a look at a suitable partition table, taking a hermetic /usr/ into account. Let's conceptually start with a table of four entries:

An UEFI System Partition (required by firmware to boot)
Immutable, Verity-protected, signed file system with the /usr/ tree in version A
Immutable, Verity-protected, signed file system with the /usr/ tree in version B
A writable, encrypted root file system

(This is just for initial illustration here, as we'll see later it's going to be a bit more complex in the end.)

The Discoverable Partitions Specification provides suitable partition types UUIDs for all of the above partitions. Which is great, because it makes the image self-descriptive: simply by looking at the image's GPT table we know what to mount where. This means we do not need a manual /etc/fstab, and a multitude of tools such as systemd-nspawn and similar can operate directly on the disk image and boot it up.

Booting

Now that we have a rough idea how to organize the partition table, let's look a bit at how to boot into that. Specifically, in my model "unified kernels" are the way to go, specifically those implementing Boot Loader Specification Type #2. These are basically kernel images that have an initial RAM disk attached to them, as well as a kernel command line, a boot splash image and possibly more, all wrapped into a single UEFI PE binary. By combining these into one we achieve two goals: they become extremely easy to update (i.e. drop in one file, and you update kernel+initrd) and more importantly, you can sign them as one for the purpose of UEFI SecureBoot.

In my model, each version of such a kernel would be associated with exactly one version of the /usr/ tree: both are always updated at the same time. An update then becomes relatively simple: drop in one new /usr/ file system plus one kernel, and the update is complete.

The boot loader used for all this would be systemd-boot, of course. It's a very simple loader, and implements the aforementioned boot loader specification. This means it requires no explicit configuration or anything: it's entirely sufficient to drop in one such unified kernel file, and it will be picked up, and be made a candidate to boot into.

You might wonder how to configure the root file system to boot from with such a unified kernel that contains the kernel command line and is signed as a whole and thus immutable. The idea here is to use the usrhash= kernel command line option implemented by systemd-veritysetup-generator and systemd-fstab-generator. It does two things: it will search and set up a dm-verity volume for the /usr/ file system, and then mount it. It takes the root hash value of the dm-verity Merkle tree as the parameter. This hash is then also used to find the /usr/ partition in the GPT partition table, under the assumption that the partition UUIDs are derived from it, as per the suggestions in the discoverable partitions specification (see above).

systemd-boot (if not told otherwise) will do a version sort of the kernel image files it finds, and then automatically boot the newest one. Picking a specific kernel to boot will also fixate which version of the /usr/ tree to boot into, because — as mentioned — the Verity root hash of it is built into the kernel command line the unified kernel image contains.

In my model I'd place the kernels directly into the UEFI System Partition (ESP), in order to simplify things. (systemd-boot also supports reading them from a separate boot partition, but let's not complicate things needlessly, at least for now.)

So, with all this, we now already have a boot chain that goes something like this: once the boot loader is run, it will pick the newest kernel, which includes the initial RAM disk and a secure reference to the /usr/ file system to use. This is already great. But a /usr/ alone won't make us happy, we also need a root file system. In my model, that file system would be writable, and the /etc/ and /var/ hierarchies would be located directly on it. Since these trees potentially contain secrets (SSH keys, …) the root file system needs to be encrypted. We'll use LUKS2 for this, of course. In my model, I'd bind this to the TPM2 chip (for compatibility with systems lacking one, we can find a suitable fallback, which then provides weaker guarantees, see below). A TPM2 is a security chip available in most modern PCs. Among other things it contains a persistent secret key that can be used to encrypt data, in a way that only if you possess access to it and can prove you are using validated software you can decrypt it again. The cryptographic measuring I mentioned earlier is what allows this to work. But … let's not get lost too much in the details of TPM2 devices, that'd be material for a novel, and this blog story is going to be way too long already.

What does using a TPM2 bound key for unlocking the root file system get us? We can encrypt the root file system with it, and you can only read or make changes to the root file system if you also possess the TPM2 chip and run our validated version of the OS. This protects us against an evil maid scenario to some level: an attacker cannot just copy the hard disk of your laptop while you leave it in your hotel room, because unless the attacker also steals the TPM2 device it cannot be decrypted. The attacker can also not just modify the root file system, because such changes would be detected on next boot because they aren't done with the right cryptographic key.

So, now we have a system that already can boot up somewhat completely, and run userspace services. All code that is run is verified in some way: the /usr/ file system is Verity protected, and the root hash of it is included in the kernel that is signed via UEFI SecureBoot. And the root file system is locked to the TPM2 where the secret key is only accessible if our signed OS + /usr/ tree is used.

(One brief intermission here: so far all the components I am referencing here exist already, and have been shipped in systemd and other projects already, including the TPM2 based disk encryption. There's one thing missing here however at the moment that still needs to be developed (happy to take PRs!): right now TPM2 based LUKS2 unlocking is bound to PCR hash values. This is hard to work with when implementing updates — what we'd need instead is unlocking by signatures of PCR hashes. TPM2 supports this, but we don't support it yet in our systemd-cryptsetup + systemd-cryptenroll stack.)

One of the goals mentioned above is that cryptographic key material should always be generated locally on first boot, rather than pre-provisioned. This of course has implications for the encryption key of the root file system: if we want to boot into this system we need the root file system to exist, and thus a key already generated that it is encrypted with. But where precisely would we generate it if we have no installer which could generate while installing (as it is done in traditional Linux distribution installers). My proposed solution here is to use systemd-repart, which is a declarative, purely additive repartitioner. It can run from the initrd to create and format partitions on boot, before transitioning into the root file system. It can also format the partitions it creates and encrypt them, automatically enrolling an TPM2-bound key.

So, let's revisit the partition table we mentioned earlier. Here's what in my model we'd actually ship in the initial image:

An UEFI System Partition (ESP)
An immutable, Verity-protected, signed file system with the /usr/ tree in version A

And that's already it. No root file system, no B /usr/ partition, nothing else. Only two partitions are shipped: the ESP with the systemd-boot loader and one unified kernel image, and the A version of the /usr/ partition. Then, on first boot systemd-repart will notice that the root file system doesn't exist yet, and will create it, encrypt it, and format it, and enroll the key into the TPM2. It will also create the second /usr/ partition (B) that we'll need for later A/B updates (which will be created empty for now, until the first update operation actually takes place, see below). Once done the initrd will combine the fresh root file system with the shipped /usr/ tree, and transition into it. Because the OS is hermetic in /usr/ and contains all the systemd-tmpfiles and systemd-sysuser information it can then set up the root file system properly and create any directories and symlinks (and maybe a few files) necessary to operate.

Besides the fact that the root file system's encryption keys are generated on the system we boot from and never leave it, it is also pretty nice that the root file system will be sized dynamically, taking into account the physical size of the backing storage. This is perfect, because on first boot the image will automatically adapt to what it has been dd'ed onto.

Factory Reset

This is a good point to talk about the factory reset logic, i.e. the mechanism to place the system back into a known good state. This is important for two reasons: in our laptop use case, once you want to pass the laptop to someone else, you want to ensure your data is fully and comprehensively erased. Moreover, if you have reason to believe your device was hacked you want to revert the device to a known good state, i.e. ensure that exploits cannot persist. systemd-repart already has a mechanism for it. In the declarations of the partitions the system should have, entries may be marked to be candidates for erasing on factory reset. The actual factory reset is then requested by one of two means: by specifying a specific kernel command line option (which is not too interesting here, given we lock that down via UEFI SecureBoot; but then again, one could also add a second kernel to the ESP that is identical to the first, with only different that it lists this command line option: thus when the user selects this entry it will initiate a factory reset) — and via an EFI variable that can be set and is honoured on the immediately following boot. So here's how a factory reset would then go down: once the factory reset is requested it's enough to reboot. On the subsequent boot systemd-repart runs from the initrd, where it will honour the request and erase the partitions marked for erasing. Once that is complete the system is back in the state we shipped the system in: only the ESP and the /usr/ file system will exist, but the root file system is gone. And from here we can continue as on the original first boot: create a new root file system (and any other partitions), and encrypt/set it up afresh.

So now we have a nice setup, where everything is either signed or encrypted securely. The system can adapt to the system it is booted on automatically on first boot, and can easily be brought back into a well defined state identical to the way it was shipped in.

Modularity

But of course, such a monolithic, immutable system is only useful for very specific purposes. If /usr/ can't be written to, – at least in the traditional sense – one cannot just go and install a new software package that one needs. So here two goals are superficially conflicting: on one hand one wants modularity, i.e. the ability to add components to the system, and on the other immutability, i.e. that precisely this is prohibited.

So let's see what I propose as a middle ground in my model. First, what's the precise use case for such modularity? I see a couple of different ones:

For some cases it is necessary to extend the system itself at the lowest level, so that the components added in extend (or maybe even replace) the resources shipped in the base OS image, so that they live in the same namespace, and are subject to the same security restrictions and privileges. Exposure to the details of the base OS and its interface for this kind of modularity is at the maximum.

Example: a module that adds a debugger or tracing tools into the system. Or maybe an optional hardware driver module.
In other cases, more isolation is preferable: instead of extending the system resources directly, additional services shall be added in that bring their own files, can live in their own namespace (but with "windows" into the host namespaces), however still are system components, and provide services to other programs, whether local or remote. Exposure to the details of the base OS for this kind of modularity is restricted: it mostly focuses on the ability to consume and provide IPC APIs from/to the system. Components of this type can still be highly privileged, but the level of integration is substantially smaller than for the type explained above.

Example: a module that adds a specific VPN connection service to the OS.
Finally, there's the actual payload of the OS. This stuff is relatively isolated from the OS and definitely from each other. It mostly consumes OS APIs, and generally doesn't provide OS APIs. This kind of stuff runs with minimal privileges, and in its own namespace of concepts.

Example: a desktop app, for reading your emails.

Of course, the lines between these three types of modules are blurry, but I think distinguishing them does make sense, as I think different mechanisms are appropriate for each. So here's what I'd propose in my model to use for this.

For the system extension case I think the systemd-sysext images are appropriate. This tool operates on system extension images that are very similar to the host's disk image: they also contain a /usr/ partition, protected by Verity. However, they just include additions to the host image: binaries that extend the host. When such a system extension image is activated, it is merged via an immutable overlayfs mount into the host's /usr/ tree. Thus any file shipped in such a system extension will suddenly appear as if it was part of the host OS itself. For optional components that should be considered part of the OS more or less this is a very simple and powerful way to combine an immutable OS with an immutable extension. Note that most likely extensions for an OS matching this tool should be built at the same time within the same update cycle scheme as the host OS itself. After all, the files included in the extensions will have dependencies on files in the system OS image, and care must be taken that these dependencies remain in order.
For adding in additional somewhat isolated system services in my model, Portable Services are the proposed tool of choice. Portable services are in most ways just like regular system services; they could be included in the system OS image or an extension image. However, portable services use RootImage= to run off separate disk images, thus within their own namespace. Images set up this way have various ways to integrate into the host OS, as they are in most ways regular system services, which just happen to bring their own directory tree. Also, unlike regular system services, for them sandboxing is opt-out rather than opt-in. In my model, here too the disk images are Verity protected and thus immutable. Just like the host OS they are GPT disk images that come with a /usr/ partition and Verity data, along with signing.
Finally, the actual payload of the OS, i.e. the apps. To be useful in real life here it is important to hook into existing ecosystems, so that a large set of apps are available. Given that on Linux flatpak (or on servers OCI containers) are the established format that pretty much won they are probably the way to go. That said, I think both of these mechanisms have relatively weak properties, in particular when it comes to security, since immutability/measurements and similar are not provided. This means, unlike for system extensions and portable services a complete trust chain with attestation and per-app cryptographically protected data is much harder to implement sanely.

What I'd like to underline here is that the main system OS image, as well as the system extension images and the portable service images are put together the same way: they are GPT disk images, with one immutable file system and associated Verity data. The latter two should also contain a PKCS#7 signature for the top-level Verity hash. This uniformity has many benefits: you can use the same tools to build and process these images, but most importantly: by using a single way to validate them throughout the stack (i.e. Verity, in the latter cases with PKCS#7 signatures), validation and measurement is straightforward. In fact it's so obvious that we don't even have to implement it in systemd: the kernel has direct support for this Verity signature checking natively already (IMA).

So, by composing a system at runtime from a host image, extension images and portable service images we have a nicely modular system where every single component is cryptographically validated on every single IO operation, and every component is measured, in its entire combination, directly in the kernel's IMA subsystem.

(Of course, once you add the desktop apps or OCI containers on top, then these properties are lost further down the chain. But well, a lot is already won, if you can close the chain that far down.)

Note that system extensions are not designed to replicate the fine grained packaging logic of RPM/dpkg. Of course, systemd-sysext is a generic tool, so you can use it for whatever you want, but there's a reason it does not bring support for a dependency language: the goal here is not to replicate traditional Linux packaging (we have that already, in RPM/dpkg, and I think they are actually OK for what they do) but to provide delivery of larger, coarser sets of functionality, in lockstep with the underlying OS' life-cycle and in particular with no interdependencies, except on the underlying OS.

Also note that depending on the use case it might make sense to also use system extensions to modularize the initrd step. This is probably less relevant for a desktop OS, but for server systems it might make sense to package up support for specific complex storage in a systemd-sysext system extension, which can be applied to the initrd that is built into the unified kernel. (In fact, we have been working on implementing signed yet modular initrd support to general purpose Fedora this way.)

Note that portable services are composable from system extension too, by the way. This makes them even more useful, as you can share a common runtime between multiple portable service, or even use the host image as common runtime for portable services. In this model a common runtime image is shared between one or more system extensions, and composed at runtime via an overlayfs instance.

More Modularity: Secondary OS Installs

Having an immutable, cryptographically locked down host OS is great I think, and if we have some moderate modularity on top, that's also great. But oftentimes it's useful to be able to depart/compromise for some specific use cases from that, i.e. provide a bridge for example to allow workloads designed around RPM/dpkg package management to coexist reasonably nicely with such an immutable host.

For this purpose in my model I'd propose using systemd-nspawn containers. The containers are focused on OS containerization, i.e. they allow you to run a full OS with init system and everything as payload (unlike for example Docker containers which focus on a single service, and where running a full OS in it is a mess).

Running systemd-nspawn containers for such secondary OS installs has various nice properties. One of course is that systemd-nspawn supports the same level of cryptographic image validation that we rely on for the host itself. Thus, to some level the whole OS trust chain is reasonably recursive if desired: the firmware validates the OS, and the OS can validate a secondary OS installed within it. In fact, we can run our trusted OS recursively on itself and get similar security guarantees! Besides these security aspects, systemd-nspawn also has really nice properties when it comes to integration with the host. For example the --bind-user= permits binding a host user record and their directory into a container as a simple one step operation. This makes it extremely easy to have a single user and $HOME but share it concurrently with the host and a zoo of secondary OSes in systemd-nspawn containers, which each could run different distributions even.

Developer Mode

Superficially, an OS with an immutable /usr/ appears much less hackable than an OS where everything is writable. Moreover, an OS where everything must be signed and cryptographically validated makes it hard to insert your own code, given you are unlikely to possess access to the signing keys.

To address this issue other systems have supported a "developer" mode: when entered the security guarantees are disabled, and the system can be freely modified, without cryptographic validation. While that's a great concept to have I doubt it's what most developers really want: the cryptographic properties of the OS are great after all, it sucks having to give them up once developer mode is activated.

In my model I'd thus propose two different approaches to this problem. First of all, I think there's value in allowing users to additively extend/override the OS via local developer system extensions. With this scheme the underlying cryptographic validation would remain in tact, but — if this form of development mode is explicitly enabled – the developer could add in more resources from local storage, that are not tied to the OS builder's chain of trust, but a local one (i.e. simply backed by encrypted storage of some form).

The second approach is to make it easy to extend (or in fact replace) the set of trusted validation keys, with local ones that are under the control of the user, in order to make it easy to operate with kernel, OS, extension, portable service or container images signed by the local developer without involvement of the OS builder. This is relatively easy to do for components down the trust chain, i.e. the elements further up the chain should optionally allow additional certificates to allow validation with.

(Note that systemd currently has no explicit support for a "developer" mode like this. I think we should add that sooner or later however.)

Democratizing Code Signing

Closely related to the question of developer mode is the question of code signing. If you ask me, the status quo of UEFI SecureBoot code signing in the major Linux distributions is pretty sad. The work to get stuff signed is massive, but in effect it delivers very little in return: because initrds are entirely unprotected, and reside on partitions lacking any form of cryptographic integrity protection any attacker can trivially easily modify the boot process of any such Linux system and freely collected FDE passphrases entered. There's little value in signing the boot loader and kernel in a complex bureaucracy if it then happily loads entirely unprotected code that processes the actually relevant security credentials: the FDE keys.

In my model, through use of unified kernels this important gap is closed, hence UEFI SecureBoot code signing becomes an integral part of the boot chain from firmware to the host OS. Unfortunately, code signing – and having something a user can locally hack, is to some level conflicting. However, I think we can improve the situation here, and put more emphasis on enrolling developer keys in the trust chain easily. Specifically, I see one relevant approach here: enrolling keys directly in the firmware is something that we should make less of a theoretical exercise and more something we can realistically deploy. See this work in progress making this more automatic and eventually safe. Other approaches are thinkable (including some that build on existing MokManager infrastructure), but given the politics involved, are harder to conclusively implement.

Running the OS itself in a container

What I explain above is put together with running on a bare metal system in mind. However, one of the stated goals is to make the OS adaptive enough to also run in a container environment (specifically: systemd-nspawn) nicely. Booting a disk image on bare metal or in a VM generally means that the UEFI firmware validates and invokes the boot loader, and the boot loader invokes the kernel which then transitions into the final system. This is different for containers: here the container manager immediately calls the init system, i.e. PID 1. Thus the validation logic must be different: cryptographic validation must be done by the container manager. In my model this is solved by shipping the OS image not only with a Verity data partition (as is already necessary for the UEFI SecureBoot trust chain, see above), but also with another partition, containing a PKCS#7 signature of the root hash of said Verity partition. This of course is exactly what I propose for both the system extension and portable service image. Thus, in my model the images for all three uses are put together the same way: an immutable /usr/ partition, accompanied by a Verity partition and a PKCS#7 signature partition. The OS image itself then has two ways "into" the trust chain: either through the signed unified kernel in the ESP (which is used for bare metal and VM boots) or by using the PKCS#7 signature stored in the partition (which is used for container/systemd-nspawn boots).

Parameterizing Kernels

A fully immutable and signed OS has to establish trust in the user data it makes use of before doing so. In the model I describe here, for /etc/ and /var/ we do this via disk encryption of the root file system (in combination with integrity checking). But the point where the root file system is mounted comes relatively late in the boot process, and thus cannot be used to parameterize the boot itself. In many cases it's important to be able to parameterize the boot process however.

For example, for the implementation of the developer mode indicated above it's useful to be able to pass this fact safely to the initrd, in combination with other fields (e.g. hashed root password for allowing in-initrd logins for debug purposes). After all, if the initrd is pre-built by the vendor and signed as whole together with the kernel it cannot be modified to carry such data directly (which is in fact how parameterizing of the initrd to a large degree was traditionally done).

In my model this is achieved through system credentials, which allow passing parameters to systems (and services for the matter) in an encrypted and authenticated fashion, bound to the TPM2 chip. This means that we can securely pass data into the initrd so that it can be authenticated and decrypted only on the system it is intended for and with the unified kernel image it was intended for.

Swap

In my model the OS would also carry a swap partition. For the simple reason that only then systemd-oomd.service can provide the best results. Also see In defence of swap: common misconceptions

Updating Images

We have a rough idea how the system shall be organized now, let's next focus on the deployment cycle: software needs regular update cycles, and software that is not updated regularly is a security problem. Thus, I am sure that any modern system must be automatically updated, without this requiring avoidable user interaction.

In my model, this is the job for systemd-sysupdate. It's a relatively simple A/B image updater: it operates either on partitions, on regular files in a directory, or on subdirectories in a directory. Each entry has a version (which is encoded in the GPT partition label for partitions, and in the filename for regular files and directories): whenever an update is initiated the oldest version is erased, and the newest version is downloaded.

With the setup described above a system update becomes a really simple operation. On each update the systemd-sysupdate tool downloads a /usr/ file system partition, an accompanying Verity partition, a PKCS#7 signature partition, and drops it into the host's partition table (where it possibly replaces the oldest version so far stored there). Then it downloads a unified kernel image and drops it into the EFI System Partition's /EFI/Linux (as per Boot Loader Specification; possibly erase the oldest such file there). And that's already the whole update process: four files are downloaded from the server, unpacked and put in the most straightforward of ways into the partition table or file system. Unlike in other OS designs there's no mechanism required to explicitly switch to the newer version, the aforementioned systemd-boot logic will automatically pick the newest kernel once it is dropped in.

Above we talked a lot about modularity, and how to put systems together as a combination of a host OS image, system extension images for the initrd and the host, portable service images and systemd-nspawn container images. I already emphasized that these image files are actually always the same: GPT disk images with partition definitions that match the Discoverable Partition Specification. This comes very handy when thinking about updating: we can use the exact same systemd-sysupdate tool for updating these other images as we use for the host image. The uniformity of the on-disk format allows us to update them uniformly too.

Boot Counting + Assessment

Automatic OS updates do not come without risks: if they happen automatically, and an update goes wrong this might mean your system might be automatically updated into a brick. This of course is less than ideal. Hence it is essential to address this reasonably automatically. In my model, there's systemd's Automatic Boot Assessment for that. The mechanism is simple: whenever a new unified kernel image is dropped into the system it will be stored with a small integer counter value included in the filename. Whenever the unified kernel image is selected for booting by systemd-boot, it is decreased by one. Once the system booted up successfully (which is determined by userspace) the counter is removed from the file name (which indicates "this entry is known to work"). If the counter ever hits zero, this indicates that it tried to boot it a couple of times, and each time failed, thus is apparently "bad". In this case systemd-boot will not consider the kernel anymore, and revert to the next older (that doesn't have a counter of zero).

By sticking the boot counter into the filename of the unified kernel we can directly attach this information to the kernel, and thus need not concern ourselves with cleaning up secondary information about the kernel when the kernel is removed. Updating with a tool like systemd-sysupdate remains a very simple operation hence: drop one old file, add one new file.

Picking the Newest Version

I already mentioned that systemd-boot automatically picks the newest unified kernel image to boot, by looking at the version encoded in the filename. This is done via a simple strverscmp() call (well, truth be told, it's a modified version of that call, different from the one implemented in libc, because real-life package managers use more complex rules for comparing versions these days, and hence it made sense to do that here too). The concept of having multiple entries of some resource in a directory, and picking the newest one automatically is a powerful concept, I think. It means adding/removing new versions is extremely easy (as we discussed above, in systemd-sysupdate context), and allows stateless determination of what to use.

If systemd-boot can do that, what about system extension images, portable service images, or systemd-nspawn container images that do not actually use systemd-boot as the entrypoint? All these tools actually implement the very same logic, but on the partition level: if multiple suitable /usr/ partitions exist, then the newest is determined by comparing the GPT partition label of them.

This is in a way the counterpart to the systemd-sysupdate update logic described above: we always need a way to determine which partition to actually then use after the update took place: and this becomes very easy each time: enumerate possible entries, pick the newest as per the (modified) strverscmp() result.

Home Directory Management

In my model the device's users and their home directories are managed by systemd-homed. This means they are relatively self-contained and can be migrated easily between devices. The numeric UID assignment for each user is done at the moment of login only, and the files in the home directory are mapped as needed via a uidmap mount. It also allows us to protect the data of each user individually with a credential that belongs to the user itself. i.e. instead of binding confidentiality of the user's data to the system-wide full-disk-encryption each user gets their own encrypted home directory where the user's authentication token (password, FIDO2 token, PKCS#11 token, recovery key…) is used as authentication and decryption key for the user's data. This brings a major improvement for security as it means the user's data is cryptographically inaccessible except when the user is actually logged in.

It also allows us to correct another major issue with traditional Linux systems: the way how data encryption works during system suspend. Traditionally on Linux the disk encryption credentials (e.g. LUKS passphrase) is kept in memory also when the system is suspended. This is a bad choice for security, since many (most?) of us probably never turn off their laptop but suspend it instead. But if the decryption key is always present in unencrypted form during the suspended time, then it could potentially be read from there by a sufficiently equipped attacker.

By encrypting the user's home directory with the user's authentication token we can first safely "suspend" the home directory before going to the system suspend state (i.e. flush out the cryptographic keys needed to access it). This means any process currently accessing the home directory will be frozen for the time of the suspend, but that's expected anyway during a system suspend cycle. Why is this better than the status quo ante? In this model the home directory's cryptographic key material is erased during suspend, but it can be safely reacquired on resume, from system code. If the system is only encrypted as a whole however, then the system code itself couldn't reauthenticate the user, because it would be frozen too. By separating home directory encryption from the root file system encryption we can avoid this problem.

Partition Setup

So we discussed the organization of the partitions OS images multiple times in the above, each time focusing on a specific aspect. Let's now summarize how this should look like all together.

In my model, the initial, shipped OS image should look roughly like this:

(1) An UEFI System Partition, with systemd-boot as boot loader and one unified kernel
(2) A /usr/ partition (version "A"), with a label fooOS_0.7 (under the assumption we called our project fooOS and the image version is 0.7).
(3) A Verity partition for the /usr/ partition (version "A"), with the same label
(4) A partition carrying the Verity root hash for the /usr/ partition (version "A"), along with a PKCS#7 signature of it, also with the same label

On first boot this is augmented by systemd-repart like this:

(5) A second /usr/ partition (version "B"), initially with a label _empty (which is the label systemd-sysupdate uses to mark partitions that currently carry no valid payload)
(6) A Verity partition for that (version "B"), similar to the above case, also labelled _empty
(7) And ditto a Verity root hash partition with a PKCS#7 signature (version "B"), also labelled _empty
(8) A root file system, encrypted and locked to the TPM2
(9) A home file system, integrity protected via a key also in TPM2 (encryption is unnecessary, since systemd-homed adds that on its own, and it's nice to avoid duplicate encryption)
(10) A swap partition, encrypted and locked to the TPM2

Then, on the first OS update the partitions 5, 6, 7 are filled with a new version of the OS (let's say 0.8) and thus get their label updated to fooOS_0.8. After a boot, this version is active.

On a subsequent update the three partitions fooOS_0.7 get wiped and replaced by fooOS_0.9 and so on.

On factory reset, the partitions 8, 9, 10 are deleted, so that systemd-repart recreates them, using a new set of cryptographic keys.

Here's a graphic that hopefully illustrates the partition stable from shipped image, through first boot, multiple update cycles and eventual factory reset:

Trust Chain

So let's summarize the intended chain of trust (for bare metal/VM boots) that ensures every piece of code in this model is signed and validated, and any system secret is locked to TPM2.

First, firmware (or possibly shim) authenticates systemd-boot.
Once systemd-boot picks a unified kernel image to boot, it is also authenticated by firmware/shim.
The unified kernel image contains an initrd, which is the first userspace component that runs. It finds any system extensions passed into the initrd, and sets them up through Verity. The kernel will validate the Verity root hash signature of these system extension images against its usual keyring.
The initrd also finds credentials passed in, then securely unlocks (which means: decrypts + authenticates) them with a secret from the TPM2 chip, locked to the kernel image itself.
The kernel image also contains a kernel command line which contains a usrhash= option that pins the root hash of the /usr/ partition to use.
The initrd then unlocks the encrypted root file system, with a secret bound to the TPM2 chip.
The system then transitions into the main system, i.e. the combination of the Verity protected /usr/ and the encrypted root files system. It then activates two more encrypted (and/or integrity protected) volumes for /home/ and swap, also with a secret tied to the TPM2 chip.

Here's an attempt to illustrate the above graphically:

This is the trust chain of the basic OS. Validation of system extension images, portable service images, systemd-nspawn container images always takes place the same way: the kernel validates these Verity images along with their PKCS#7 signatures against the kernel's keyring.

File System Choice

In the above I left the choice of file systems unspecified. For the immutable /usr/ partitions squashfs might be a good candidate, but any other that works nicely in a read-only fashion and generates reproducible results is a good choice, too. The home directories as managed by systemd-homed should certainly use btrfs, because it's the only general purpose file system supporting online grow and shrink, which systemd-homed can take benefit of, to manage storage.

For the root file system btrfs is likely also the best idea. That's because we intend to use LUKS/dm-crypt underneath, which by default only provides confidentiality, not authenticity of the data (unless combined with dm-integrity). Since btrfs (unlike xfs/ext4) does full data checksumming it's probably the best choice here, since it means we don't have to use dm-integrity (which comes at a higher performance cost).

OS Installation vs. OS Instantiation

In the discussion above a lot of focus was put on setting up the OS and completing the partition layout and such on first boot. This means installing the OS becomes as simple as dd-ing (i.e. "streaming") the shipped disk image into the final HDD medium. Simple, isn't it?

Of course, such a scheme is just too simple for many setups in real life. Whenever multi-boot is required (i.e. co-installing an OS implementing this model with another unrelated one), dd-ing a disk image onto the HDD is going to overwrite user data that was supposed to be kept around.

In order to cover for this case, in my model, we'd use systemd-repart (again!) to allow streaming the source disk image into the target HDD in a smarter, additive way. The tool after all is purely additive: it will add in partitions or grow them if they are missing or too small. systemd-repart already has all the necessary provisions to not only create a partition on the target disk, but also copy blocks from a raw installer disk. An install operation would then become a two stop process: one invocation of systemd-repart that adds in the /usr/, its Verity and the signature partition to the target medium, populated with a copy of the same partition of the installer medium. And one invocation of bootctl that installs the systemd-boot boot loader in the ESP. (Well, there's one thing missing here: the unified OS kernel also needs to be dropped into the ESP. For now, this can be done with a simple cp call. In the long run, this should probably be something bootctl can do as well, if told so.)

So, with this scheme we have a simple scheme to cover all bases: we can either just dd an image to disk, or we can stream an image onto an existing HDD, adding a couple of new partitions and files to the ESP.

Of course, in reality things are more complex than that even: there's a good chance that the existing ESP is simply too small to carry multiple unified kernels. In my model, the way to address this is by shipping two slightly different systemd-repart partition definition file sets: the ideal case when the ESP is large enough, and a fallback case, where it isn't and where we then add in an addition XBOOTLDR partition (as per the Discoverable Partitions Specification). In that mode the ESP carries the boot loader, but the unified kernels are stored in the XBOOTLDR partition. This scenario is not quite as simple as the XBOOTLDR-less scenario described first, but is equally well supported in the various tools. Note that systemd-repart can be told size constraints on the partitions it shall create or augment, thus to implement this scheme it's enough to invoke the tool with the fallback partition scheme if invocation with the ideal scheme fails.

Either way: regardless how the partitions, the boot loader and the unified kernels ended up on the system's hard disk, on first boot the code paths are the same again: systemd-repart will be called to augment the partition table with the root file system, and properly encrypt it, as was already discussed earlier here. This means: all cryptographic key material used for disk encryption is generated on first boot only, the installer phase does not encrypt anything.

Live Systems vs. Installer Systems vs. Installed Systems

Traditionally on Linux three types of systems were common: "installed" systems, i.e. that are stored on the main storage of the device and are the primary place people spend their time in; "installer" systems which are used to install them and whose job is to copy and setup the packages that make up the installed system; and "live" systems, which were a middle ground: a system that behaves like an installed system in most ways, but lives on removable media.

In my model I'd like to remove the distinction between these three concepts as much as possible: each of these three images should carry the exact same /usr/ file system, and should be suitable to be replicated the same way. Once installed the resulting image can also act as an installer for another system, and so on, creating a certain "viral" effect: if you have one image or installation it's automatically something you can replicate 1:1 with a simple systemd-repart invocation.

Building Images According to this Model

The above explains how the image should look like and how its first boot and update cycle will modify it. But this leaves one question unanswered: how to actually build the initial image for OS instances according to this model?

Note that there's nothing too special about the images following this model: they are ultimately just GPT disk images with Linux file systems, following the Discoverable Partition Specification. This means you can use any set of tools of your choice that can put together GPT disk images for compliant images.

I personally would use mkosi for this purpose though. It's designed to generate compliant images, and has a rich toolset for SecureBoot and signed/Verity file systems already in place.

What is key here is that this model doesn't depart from RPM and dpkg, instead it builds on top of that: in this model they are excellent for putting together images on the build host, but deployment onto the runtime host does not involve individual packages.

I think one cannot underestimate the value traditional distributions bring, regarding security, integration and general polishing. The concepts I describe above are inherited from this, but depart from the idea that distribution packages are a runtime concept and make it a build-time concept instead.

Note that the above is pretty much independent from the underlying distribution.

Final Words

I have no illusions, general purpose distributions are not going to adopt this model as their default any time soon, and it's not even my goal that they do that. The above is my personal vision, and I don't expect people to buy into it 100%, and that's fine. However, what I am interested in is finding the overlaps, i.e. work with people who buy 50% into this vision, and share the components.

My goals here thus are to:

Get distributions to move to a model where images like this can be built from the distribution easily. Specifically this means that distributions make their OS hermetic in /usr/.
Find the overlaps, share components with other projects to revisit how distributions are put together. This is already happening, see systemd-tmpfiles and systemd-sysuser support in various distributions, but I think there's more to share.
Make people interested in building actual real-world images based on general purpose distributions adhering to the model described above. I'd love a "GnomeBook" image with full trust properties, that is built from true Linux distros, such as Fedora or ArchLinux.

FAQ

What about ostree? Doesn't ostree already deliver what this blog story describes?

ostree is fine technology, but in respect to security and robustness properties it's not too interesting I think, because unlike image-based approaches it cannot really deliver integrity/robustness guarantees over the whole tree easily. To be able to trust an ostree setup you have to establish trust in the underlying file system first, and the complexity of the file system makes that challenging. To provide an effective offline-secure trust chain through the whole depth of the stack it is essential to cryptographically validate every single I/O operation. In an image-based model this is trivially easy, but in ostree model it's with current file system technology not possible and even if this is added in one way or another in the future (though I am not aware of anyone doing on-access file-based integrity that spans a whole hierarchy of files that was compatible with ostree's hardlink farm model) I think validation is still at too high a level, since Linux file system developers made very clear their implementations are not robust to rogue images. (There's this stuff planned, but doing structural authentication ahead of time instead of on access makes the idea to weak — and I'd expect too slow — in my eyes.)

With my design I want to deliver similar security guarantees as ChromeOS does, but ostree is much weaker there, and I see no perspective of this changing. In a way ostree's integrity checks are similar to RPM's and enforced on download rather than on access. In the model I suggest above, it's always on access, and thus safe towards offline attacks (i.e. evil maid attacks). In today's world, I think offline security is absolutely necessary though.

That said, ostree does have some benefits over the model described above: it naturally shares file system inodes if many of the modules/images involved share the same data. It's thus more space efficient on disk (and thus also in RAM/cache to some degree) by default. In my model it would be up to the image builders to minimize shipping overly redundant disk images, by making good use of suitably composable system extensions.
What about configuration management?

At first glance immutable systems and configuration management don't go that well together. However, do note, that in the model I propose above the root file system with all its contents, including /etc/ and /var/ is actually writable and can be modified like on any other typical Linux distribution. The only exception is /usr/ where the immutable OS is hermetic. That means configuration management tools should work just fine in this model – up to the point where they are used to install additional RPM/dpkg packages, because that's something not allowed in the model above: packages need to be installed at image build time and thus on the image build host, not the runtime host.
What about non-UEFI and non-TPM2 systems?

The above is designed around the feature set of contemporary PCs, and this means UEFI and TPM2 being available (simply because the PC is pretty much defined by the Windows platform, and current versions of Windows require both).

I think it's important to make the best of the features of today's PC hardware, and then find suitable fallbacks on more limited hardware. Specifically this means: if there's desire to implement something like the this on non-UEFI or non-TPM2 hardware we should look for suitable fallbacks for the individual functionality, but generally try to add glue to the old systems so that conceptually they behave more like the new systems instead of the other way round. Or in other words: most of the above is not strictly tied to UEFI or TPM2, and for many cases already there are reasonably fallbacks in place for more limited systems. Of course, without TPM2 many of the security guarantees will be weakened.
How would you name an OS built that way?

I think a desktop OS built this way if it has the GNOME desktop should of course be called GnomeBook, to mimic the ChromeBook name. ;-)

But in general, I'd call hermetic, adaptive, immutable OSes like this "particles".

How can you help?

Help making Distributions Hermetic in /usr/!

One of the core ideas of the approach described above is to make the OS hermetic in /usr/, i.e. make it carry a comprehensive description of what needs to be set up outside of it when instantiated. Specifically, this means that system users that are needed are declared in systemd-sysusers snippets, and skeleton files and directories are created via systemd-tmpfiles. Moreover additional partitions should be declared via systemd-repart drop-ins.

At this point some distributions (such as Fedora) are (probably more by accident than on purpose) already mostly hermetic in /usr/, at least for the most basic parts of the OS. However, this is not complete: many daemons require to have specific resources set up in /var/ or /etc/ before they can work, and the relevant packages do not carry systemd-tmpfiles descriptions that add them if missing. So there are two ways you could help here: politically, it would be highly relevant to convince distributions that an OS that is hermetic in /usr/ is highly desirable and it's a worthy goal for packagers to get there. More specifically, it would be desirable if RPM/dpkg packages would ship with enough systemd-tmpfiles information so that configuration files the packages strictly need for operation are symlinked (or copied) from /usr/share/factory/ if they are missing (even better of course would be if packages from their upstream sources on would just work with an empty /etc/ and /var/, and create themselves what they need and default to good defaults in absence of configuration files).

Note that distributions that adopted systemd-sysusers, systemd-tmpfiles and the /usr/ merge are already quite close to providing an OS that is hermetic in /usr/. These were the big, the major advancements: making the image fully hermetic should be less controversial – at least that's my guess.

Also note that making the OS hermetic in /usr/ is not just useful in scenarios like the above. It also means that stuff like this and like this can work well.
Fill in the gaps!

I already mentioned a couple of missing bits and pieces in the implementation of the overall vision. In the systemd project we'd be delighted to review/merge any PRs that fill in the voids.
Build your own OS like this!

Of course, while we built all these building blocks and they have been adopted to various levels and various purposes in the various distributions, no one so far built an OS that puts things together just like that. It would be excellent if we had communities that work on building images like what I propose above. i.e. if you want to work on making a secure GnomeBook as I suggest above a reality that would be more than welcome.

How could this look like specifically? Pick an existing distribution, write a set of mkosi descriptions plus some additional drop-in files, and then build this on some build infrastructure. While doing so, report the gaps, and help us address them.

Further Documentation of Used Components and Concepts

Earlier Blog Stories Related to this Topic

And that's all for now.

Category: projects