Index | Archives | Atom Feed | RSS Feed

Authenticated Boot and Disk Encryption on Linux

The Strange State of Authenticated Boot and Disk Encryption on Generic Linux Distributions

TL;DR: Linux has been supporting Full Disk Encryption (FDE) and technologies such as UEFI SecureBoot and TPMs for a long time. However, the way they are set up by most distributions is not as secure as they should be, and in some ways quite frankly weird. In fact, right now, your data is probably more secure if stored on current ChromeOS, Android, Windows or MacOS devices, than it is on typical Linux distributions.

Generic Linux distributions (i.e. Debian, Fedora, Ubuntu, …) adopted Full Disk Encryption (FDE) more than 15 years ago, with the LUKS/cryptsetup infrastructure. It was a big step forward to a more secure environment. Almost ten years ago the big distributions started adding UEFI SecureBoot to their boot process. Support for Trusted Platform Modules (TPMs) has been added to the distributions a long time ago as well — but even though many PCs/laptops these days have TPM chips on-board it's generally not used in the default setup of generic Linux distributions.

How these technologies currently fit together on generic Linux distributions doesn't really make too much sense to me — and falls short of what they could actually deliver. In this story I'd like to have a closer look at why I think that, and what I propose to do about it.

The Basic Technologies

Let's have a closer look what these technologies actually deliver:

  1. LUKS/dm-crypt/cryptsetup provide disk encryption, and optionally data authentication. Disk encryption means that reading the data in clear-text form is only possible if you possess a secret of some form, usually a password/passphrase. Data authentication means that no one can make changes to the data on disk unless they possess a secret of some form. Most distributions only enable the former though — the latter is a more recent addition to LUKS/cryptsetup, and is not used by default on most distributions (though it probably should be). Closely related to LUKS/dm-crypt is dm-verity (which can authenticate immutable volumes) and dm-integrity (which can authenticate writable volumes, among other things).

  2. UEFI SecureBoot provides mechanisms for authenticating boot loaders and other pre-OS binaries before they are invoked. If those boot loaders then authenticate the next step of booting in a similar fashion there's a chain of trust which can ensure that only code that has some level of trust associated with it will run on the system. Authentication of boot loaders is done via cryptographic signatures: the OS/boot loader vendors cryptographically sign their boot loader binaries. The cryptographic certificates that may be used to validate these signatures are then signed by Microsoft, and since Microsoft's certificates are basically built into all of today's PCs and laptops this will provide some basic trust chain: if you want to modify the boot loader of a system you must have access to the private key used to sign the code (or to the private keys further up the certificate chain).

  3. TPMs do many things. For this text we'll focus one facet: they can be used to protect secrets (for example for use in disk encryption, see above), that are released only if the code that booted the host can be authenticated in some form. This works roughly like this: every component that is used during the boot process (i.e. code, certificates, configuration, …) is hashed with a cryptographic hash function before it is used. The resulting hash is written to some small volatile memory the TPM maintains that is write-only (the so called Platform Configuration Registers, "PCRs"): each step of the boot process will write hashes of the resources needed by the next part of the boot process into these PCRs. The PCRs cannot be written freely: the hashes written are combined with what is already stored in the PCRs — also through hashing and the result of that then replaces the previous value. Effectively this means: only if every component involved in the boot matches expectations the hash values exposed in the TPM PCRs match the expected values too. And if you then use those values to unlock the secrets you want to protect you can guarantee that the key is only released to the OS if the expected OS and configuration is booted. The process of hashing the components of the boot process and writing that to the TPM PCRs is called "measuring". What's also important to mention is that the secrets are not only protected by these PCR values but encrypted with a "seed key" that is generated on the TPM chip itself, and cannot leave the TPM (at least so goes the theory). The idea is that you cannot read out a TPM's seed key, and thus you cannot duplicate the chip: unless you possess the original, physical chip you cannot retrieve the secret it might be able to unlock for you. Finally, TPMs can enforce a limit on unlock attempts per time ("anti-hammering"): this makes it hard to brute force things: if you can only execute a certain number of unlock attempts within some specific time then brute forcing will be prohibitively slow.

How Linux Distributions use these Technologies

As mentioned already, Linux distributions adopted the first two of these technologies widely, the third one not so much.

So typically, here's how the boot process of Linux distributions works these days:

  1. The UEFI firmware invokes a piece of code called "shim" (which is stored in the EFI System Partition — the "ESP" — of your system), that more or less is just a list of certificates compiled into code form. The shim is signed with the aforementioned Microsoft key, that is built into all PCs/laptops. This list of certificates then can be used to validate the next step of the boot process. The shim is measured by the firmware into the TPM. (Well, the shim can do a bit more than what I describe here, but this is outside of the focus of this article.)

  2. The shim then invokes a boot loader (often Grub) that is signed by a private key owned by the distribution vendor. The boot loader is stored in the ESP as well, plus some other places (i.e. possibly a separate boot partition). The corresponding certificate is included in the list of certificates built into the shim. The boot loader components are also measured into the TPM.

  3. The boot loader then invokes the kernel and passes it an initial RAM disk image (initrd), which contains initial userspace code. The kernel itself is signed by the distribution vendor too. It's also validated via the shim. The initrd is not validated, though (!). The kernel is measured into the TPM, the initrd sometimes too.

  4. The kernel unpacks the initrd image, and invokes what is contained in it. Typically, the initrd then asks the user for a password for the encrypted root file system. The initrd then uses that to set up the encrypted volume. No code authentication or TPM measurements take place.

  5. The initrd then transitions into the root file system. No code authentication or TPM measurements take place.

  6. When the OS itself is up the user is prompted for their user name, and their password. If correct, this will unlock the user account: the system is now ready to use. At this point no code authentication, no TPM measurements take place. Moreover, the user's password is not used to unlock any data, it's used only to allow or deny the login attempt — the user's data has already been decrypted a long time ago, by the initrd, as mentioned above.

What you'll notice here of course is that code validation happens for the shim, the boot loader and the kernel, but not for the initrd or the main OS code anymore. TPM measurements might go one step further: the initrd is measured sometimes too, if you are lucky. Moreover, you might notice that the disk encryption password and the user password are inquired by code that is not validated, and is thus not safe from external manipulation. You might also notice that even though TPM measurements of boot loader/OS components are done nothing actually ever makes use of the resulting PCRs in the typical setup.

Attack Scenarios

Of course, before determining whether the setup described above makes sense or not, one should have an idea what one actually intends to protect against.

The most basic attack scenario to focus on is probably that you want to be reasonably sure that if someone steals your laptop that contains all your data then this data remains confidential. The model described above probably delivers that to some degree: the full disk encryption when used with a reasonably strong password should make it hard for the laptop thief to access the data. The data is as secure as the password used is strong. The attacker might attempt to brute force the password, thus if the password is not chosen carefully the attacker might be successful.

Two more interesting attack scenarios go something like this:

  1. Instead of stealing your laptop the attacker takes the harddisk from your laptop while you aren't watching (e.g. while you went for a walk and left it at home or in your hotel room), makes a copy of it, and then puts it back. You'll never notice they did that. The attacker then analyzes the data in their lab, maybe trying to brute force the password. In this scenario you won't even know that your data is at risk, because for you nothing changed — unlike in the basic scenario above. If the attacker manages to break your password they have full access to the data included on it, i.e. everything you so far stored on it, but not necessarily on what you are going to store on it later. This scenario is worse than the basic one mentioned above, for the simple fact that you won't know that you might be attacked. (This scenario could be extended further: maybe the attacker has a chance to watch you type in your password or so, effectively lowering the password strength.)

  2. Instead of stealing your laptop the attacker takes the harddisk from your laptop while you aren't watching, inserts backdoor code on it, and puts it back. In this scenario you won't know your data is at risk, because physically everything is as before. What's really bad though is that the attacker gets access to anything you do on your laptop, both the data already on it, and whatever you will do in the future.

I think in particular this backdoor attack scenario is something we should be concerned about. We know for a fact that attacks like that happen all the time (Pegasus, industry espionage, …), hence we should make them hard.

Are we Safe?

So, does the scheme so far implemented by generic Linux distributions protect us against the latter two scenarios? Unfortunately not at all. Because distributions set up disk encryption the way they do, and only bind it to a user password, an attacker can easily duplicate the disk, and then attempt to brute force your password. What's worse: since code authentication ends at the kernel — and the initrd is not authenticated anymore —, backdooring is trivially easy: an attacker can change the initrd any way they want, without having to fight any kind of protections. And given that FDE unlocking is implemented in the initrd, and it's the initrd that asks for the encryption password things are just too easy: an attacker could trivially easily insert some code that picks up the FDE password as you type it in and send it wherever they want. And not just that: since once they are in they are in, they can do anything they like for the rest of the system's lifecycle, with full privileges — including installing backdoors for versions of the OS or kernel that are installed on the device in the future, so that their backdoor remains open for as long as they like.

That is sad of course. It's particular sad given that the other popular OSes all address this much better. ChromeOS, Android, Windows and MacOS all have way better built-in protections against attacks like this. And it's why one can certainly claim that your data is probably better protected right now if you store it on those OSes then it is on generic Linux distributions.

(Yeah, I know that there are some niche distros which do this better, and some hackers hack their own. But I care about general purpose distros here, i.e. the big ones, that most people base their work on.)

Note that there are more problems with the current setup. For example, it's really weird that during boot the user is queried for an FDE password which actually protects their data, and then once the system is up they are queried again – now asking for a username, and another password. And the weird thing is that this second authentication that appears to be user-focused doesn't really protect the user's data anymore — at that moment the data is already unlocked and accessible. The username/password query is supposed to be useful in multi-user scenarios of course, but how does that make any sense, given that these multiple users would all have to know a disk encryption password that unlocks the whole thing during the FDE step, and thus they have access to every user's data anyway if they make an offline copy of the harddisk?

Can we do better?

Of course we can, and that is what this story is actually supposed to be about.

Let's first figure out what the minimal issues we should fix are (at least in my humble opinion):

  1. The initrd must be authenticated before being booted into. (And measured unconditionally.)

  2. The OS binary resources (i.e. /usr/) must be authenticated before being booted into. (But don't need to be encrypted, since everyone has the same anyway, there's nothing to hide here.)

  3. The OS configuration and state (i.e. /etc/ and /var/) must be encrypted, and authenticated before they are used. The encryption key should be bound to the TPM device; i.e system data should be locked to a security concept belonging to the system, not the user.

  4. The user's home directory (i.e. /home/lennart/ and similar) must be encrypted and authenticated. The unlocking key should be bound to a user password or user security token (FIDO2 or PKCS#11 token); i.e. user data should be locked to a security concept belonging to the user, not the system.

Or to summarize this differently:

  1. Every single component of the boot process and OS needs to be authenticated, i.e. all of shim (done), boot loader (done), kernel (done), initrd (missing so far), OS binary resources (missing so far), OS configuration and state (missing so far), the user's home (missing so far).

  2. Encryption is necessary for the OS configuration and state (bound to TPM), and for the user's home directory (bound to a user password or user security token).

In Detail

Let's see how we can achieve the above in more detail.

How to Authenticate the initrd

At the moment initrds are generated on the installed host via scripts (dracut and similar) that try to figure out a minimal set of binaries and configuration data to build an initrd that contains just enough to be able to find and set up the root file system. What is included in the initrd hence depends highly on the individual installation and its configuration. Pretty likely no two initrds generated that way will be fully identical due to this. This model clearly has benefits: the initrds generated this way are very small and minimal, and support exactly what is necessary for the system to boot, and not less or more. It comes with serious drawbacks too though: the generation process is fragile and sometimes more akin to black magic than following clear rules: the generator script natively has to understand a myriad of storage stacks to determine what needs to be included and what not. It also means that authenticating the image is hard: given that each individual host gets a different specialized initrd, it means we cannot just sign the initrd with the vendor key like we sign the kernel. If we want to keep this design we'd have to figure out some other mechanism (e.g. a per-host signature key – that is generated locally; or by authenticating it with a message authentication code bound to the TPM). While these approaches are certainly thinkable, I am not convinced they actually are a good idea though: locally and dynamically generated per-host initrds is something we probably should move away from.

If we move away from locally generated initrds, things become a lot simpler. If the distribution vendor generates the initrds on their build systems then it can be attached to the kernel image itself, and thus be signed and measured along with the kernel image, without any further work. This simplicity is simply lovely. Besides robustness and reproducibility this gives us an easy route to authenticated initrds.

But of course, nothing is really that simple: working with vendor-generated initrds means that we can't adjust them anymore to the specifics of the individual host: if we pre-build the initrds and include them in the kernel image in immutable fashion then it becomes harder to support complex, more exotic storage or to parameterize it with local network server information, credentials, passwords, and so on. Now, for my simple laptop use-case these things don't matter, there's no need to extend/parameterize things, laptops and their setups are not that wildly different. But what to do about the cases where we want both: extensibility to cover for less common storage subsystems (iscsi, LVM, multipath, drivers for exotic hardware…) and parameterization?

Here's a proposal how to achieve that: let's build a basic initrd into the kernel as suggested, but then do two things to make this scheme both extensible and parameterizable, without compromising security.

  1. Let's define a way how the basic initrd can be extended with additional files, which are stored in separate "extension images". The basic initrd should be able to discover these extension images, authenticate them and then activate them, thus extending the initrd with additional resources on-the-fly.

  2. Let's define a way how we can safely pass additional parameters to the kernel/initrd (and actually the rest of the OS, too) in an authenticated (and possibly encrypted) fashion. Parameters in this context can be anything specific to the local installation, i.e. server information, security credentials, certificates, SSH server keys, or even just the root password that shall be able to unlock the root account in the initrd …

In such a scheme we should be able to deliver everything we are looking for:

  1. We'll have a full trust chain for the code: the boot loader will authenticate and measure the kernel and basic initrd. The initrd extension images will then be authenticated by the basic initrd image.

  2. We'll have authentication for all the parameters passed to the initrd.

This so far sounds very unspecific? Let's make it more specific by looking closer at the components I'd suggest to be used for this logic:

  1. The systemd suite since a few months contains a subsystem implementing system extensions (v248). System extensions are ultimately just disk images (for example a squashfs file system in a GPT envelope) that can extend an underlying OS tree. Extending in this regard means they simply add additional files and directories into the OS tree, i.e. below /usr/. For a longer explanation see systemd-sysext(8). When a system extension is activated it is simply mounted and then merged into the main /usr/ tree via a read-only overlayfs mount. Now what's particularly nice about them in this context we are talking about here is that the extension images may carry dm-verity authentication data, and PKCS#7 signatures (once this is merged, that is, i.e. v250).

  2. The systemd suite also contains a concept called service "credentials". These are small pieces of information passed to services in a secure way. One key feature of these credentials is that they can be encrypted and authenticated in a very simple way with a key bound to the TPM (v250). See LoadCredentialEncrypted= and systemd-creds(1) for details. They are great for safely storing SSL private keys and similar on your system, but they also come handy for parameterizing initrds: an encrypted credential is just a file that can only be decoded if the right TPM is around with the right PCR values set.

  3. The systemd suite contains a component called systemd-stub(7). It's an EFI stub, i.e. a small piece of code that is attached to a kernel image, and turns the kernel image into a regular EFI binary that can be directly executed by the firmware (or a boot loader). This stub has a number of nice features (for example, it can show a boot splash before invoking the Linux kernel itself and such). Once this work is merged (v250) the stub will support one more feature: it will automatically search for system extension image files and credential files next to the kernel image file, measure them and pass them on to the main initrd of the host.

Putting this together we have nice way to provide fully authenticated kernel images, initrd images and initrd extension images; as well as encrypted and authenticated parameters via the credentials logic.

How would a distribution actually make us of this? A distribution vendor would pre-build the basic initrd, and glue it into the kernel image, and sign that as a whole. Then, for each supposed extension of the basic initrd (e.g. one for iscsi support, one for LVM, one for multipath, …), the vendor would use a tool such as mkosi to build an extension image, i.e. a GPT disk image containing the files in squashfs format, a Verity partition that authenticates it, plus a PKCS#7 signature partition that validates the root hash for the dm-verity partition, and that can be checked against a key provided by the boot loader or main initrd. Then, any parameters for the initrd will be encrypted using systemd-creds encrypt -T. The resulting encrypted credentials and the initrd extension images are then simply placed next to the kernel image in the ESP (or boot partition). Done.

This checks all boxes: everything is authenticated and measured, the credentials also encrypted. Things remain extensible and modular, can be pre-built by the vendor, and installation is as simple as dropping in one file for each extension and/or credential.

How to Authenticate the Binary OS Resources

Let's now have a look how to authenticate the Binary OS resources, i.e. the stuff you find in /usr/, i.e. the stuff traditionally shipped to the user's system via RPMs or DEBs.

I think there are three relevant ways how to authenticate this:

  1. Make /usr/ a dm-verity volume. dm-verity is a concept implemented in the Linux kernel that provides authenticity to read-only block devices: every read access is cryptographically verified against a top-level hash value. This top-level hash is typically a 256bit value that you can either encode in the kernel image you are using, or cryptographically sign (which is particularly nice once this is merged). I think this is actually the best approach since it makes the /usr/ tree entirely immutable in a very simple way. However, this also means that the whole of /usr/ needs to be updated as once, i.e. the traditional rpm/apt based update logic cannot work in this mode.

  2. Make /usr/ a dm-integrity volume. dm-integrity is a concept provided by the Linux kernel that offers integrity guarantees to writable block devices, i.e. in some ways it can be considered to be a bit like dm-verity while permitting write access. It can be used in three ways, one of which I think is particularly relevant here. The first way is with a simple hash function in "stand-alone" mode: this is not too interesting here, it just provides greater data safety for file systems that don't hash check their files' data on their own. The second way is in combination with dm-crypt, i.e. with disk encryption. In this case it adds authenticity to confidentiality: only if you know the right secret you can read and make changes to the data, and any attempt to make changes without knowing this secret key will be detected as IO error on next read by those in possession of the secret (more about this below). The third way is the one I think is most interesting here: in "stand-alone" mode, but with a keyed hash function (e.g. HMAC). What's this good for? This provides authenticity without encryption: if you make changes to the disk without knowing the secret this will be noticed on the next read attempt of the data and result in IO errors. This mode provides what we want (authenticity) and doesn't do what we don't need (encryption). Of course, the secret key for the HMAC must be provided somehow, I think ideally by the TPM.

  3. Make /usr/ a dm-crypt (LUKS) + dm-integrity volume. This provides both authenticity and encryption. The latter isn't typically needed for /usr/ given that it generally contains no secret data: anyone can download the binaries off the Internet anyway, and the sources too. By encrypting this you'll waste CPU cycles, but beyond that it doesn't hurt much. (Admittedly, some people might want to hide the precise set of packages they have installed, since it of course does reveal a bit of information about you: i.e. what you are working on, maybe what your job is – think: if you are a hacker you have hacking tools installed – and similar). Going this way might simplify things in some cases, as it means you don't have to distinguish "OS binary resources" (i.e /usr/) and "OS configuration and state" (i.e. /etc/ + /var/, see below), and just make it the same volume. Here too, the secret key must be provided somehow, I think ideally by the TPM.

All three approach are valid. The first approach has my primary sympathies, but for distributions not willing to abandon client-side updates via RPM/dpkg this is not an option, in which case I would propose the other two approaches for these cases.

The LUKS encryption key (and in case of dm-integrity standalone mode the key for the keyed hash function) should be bound to the TPM. Why the TPM for this? You could also use a user password, a FIDO2 or PKCS#11 security token — but I think TPM is the right choice: why that? To reduce the requirement for repeated authentication, i.e. that you first have to provide the disk encryption password, and then you have to login, providing another password. It should be possible that the system boots up unattended and then only one authentication prompt is needed to unlock the user's data properly. The TPM provides a way to do this in a reasonably safe and fully unattended way. Also, when we stop considering just the laptop use-case for a moment: on servers interactive disk encryption prompts don't make much sense — the fact that TPMs can provide secrets without this requiring user interaction and thus the ability to work in entirely unattended environments is quite desirable. Note that crypttab(5) as implemented by systemd (v248) provides native support for authentication via password, via TPM2, via PKCS#11 or via FIDO2, so the choice is ultimately all yours.

How to Encrypt/Authenticate OS Configuration and State

Let's now look at the OS configuration and state, i.e. the stuff in /etc/ and /var/. It probably makes sense to not consider these two hierarchies independently but instead just consider this to be the root file system. If the OS binary resources are in a separate file system it is then mounted onto the /usr/ sub-directory of the root file system.

The OS configuration and state (or: root file system) should be both encrypted and authenticated: it might contain secret keys, user passwords, privileged logs and similar. This data matters and contains plenty data that should remain confidential.

The encryption of choice here is dm-crypt (LUKS) + dm-integrity similar as discussed above, again with the key bound to the TPM.

If the OS binary resources are protected the same way it is safe to merge these two volumes and have a single partition for both (see above)

How to Encrypt/Authenticate the User's Home Directory

The data in the user's home directory should be encrypted, and bound to the user's preferred token of authentication (i.e. a password or FIDO2/PKCS#11 security token). As mentioned, in the traditional mode of operation the user's home directory is not individually encrypted, but only encrypted because FDE is in use. The encryption key for that is a system wide key though, not a per-user key. And I think that's problem, as mentioned (and probably not even generally understood by our users). We should correct that and ensure that the user's password is what unlocks the user's data.

In the systemd suite we provide a service systemd-homed(8) (v245) that implements this in a safe way: each user gets its own LUKS volume stored in a loopback file in /home/, and this is enough to synthesize a user account. The encryption password for this volume is the user's account password, thus it's really the password provided at login time that unlocks the user's data. systemd-homed also supports other mechanisms of authentication, in particular PKCS#11/FIDO2 security tokens. It also provides support for other storage back-ends (such as fscrypt), but I'd always suggest to use the LUKS back-end since it's the only one providing the comprehensive confidentiality guarantees one wants for a UNIX-style home directory.

Note that there's one special caveat here: if the user's home directory (e.g. /home/lennart/) is encrypted and authenticated, what about the file system this data is stored on, i.e. /home/ itself? If that dir is part of the the root file system this would result in double encryption: first the data is encrypted with the TPM root file system key, and then again with the per-user key. Such double encryption is a waste of resources, and unnecessary. I'd thus suggest to make /home/ its own dm-integrity volume with a HMAC, keyed by the TPM. This means the data stored directly in /home/ will be authenticated but not encrypted. That's good not only for performance, but also has practical benefits: it allows extracting the encrypted volume of the various users in case the TPM key is lost, as a way to recover from dead laptops or similar.

Why authenticate /home/, if it only contains per-user home directories that are authenticated on their own anyway? That's a valid question: it's because the kernel file system maintainers made clear that Linux file system code is not considered safe against rogue disk images, and is not tested for that; this means before you mount anything you need to establish trust in some way because otherwise there's a risk that the act of mounting might exploit your kernel.

Summary of Resources and their Protections

So, let's now put this all together. Here's a table showing the various resources we deal with, and how I think they should be protected (in my idealized world).

Resource Needs Authentication Needs Encryption Suggested Technology Validation/Encryption Keys/Certificates acquired via Stored where
Shim yes no SecureBoot signature verification firmware certificate database ESP
Boot loader yes no ditto firmware certificate database/shim ESP/boot partition
Kernel yes no ditto ditto ditto
initrd yes no ditto ditto ditto
initrd parameters yes yes systemd TPM encrypted credentials TPM ditto
initrd extensions yes no systemd-sysext with Verity+PKCS#7 signatures firmware/initrd certificate database ditto
OS binary resources yes no dm-verity root hash linked into kernel image, or firmware/initrd certificate database top-level partition
OS configuration and state yes yes dm-crypt (LUKS) + dm-integrity TPM top-level partition
/home/ itself yes no dm-integrity with HMAC TPM top-level partition
User home directories yes yes dm-crypt (LUKS) + dm-integrity in loopback files User password/FIDO2/PKCS#11 security token loopback file inside /home partition

This should provide all the desired guarantees: everything is authenticated, and the individualized per-host or per-user data is also encrypted. No double encryption takes place. The encryption keys/verification certificates are stored/bound to the most appropriate infrastructure.

Does this address the three attack scenarios mentioned earlier? I think so, yes. The basic attack scenario I described is addressed by the fact that /var/, /etc/ and /home/*/ are encrypted. Brute forcing the former two is harder than in the status quo ante model, since a high entropy key is used instead of one derived from a user provided password. Moreover, the "anti-hammering" logic of the TPM will make brute forcing prohibitively slow. The home directories are protected by the user's password or ideally a personal FIDO2/PKCS#11 security token in this model. Of course, a password isn't better security-wise then the status quo ante. But given the FIDO2/PKCS#11 support built into systemd-homed it should be easier to lock down the home directories securely.

Binding encryption of /var/ and /etc/ to the TPM also addresses the first of the two more advanced attack scenarios: a copy of the harddisk is useless without the physical TPM chip, since the seed key is sealed into that. (And even if the attacker had the chance to watch you type in your password, it won't help unless they possess access to to the TPM chip.) For the home directory this attack is not addressed as long as a plain password is used. However, since binding home directories to FIDO2/PKCS#11 tokens is built into systemd-homed things should be safe here too — provided the user actually possesses and uses such a device.

The backdoor attack scenario is addressed by the fact that every resource in play now is authenticated: it's hard to backdoor the OS if there's no component that isn't verified by signature keys or TPM secrets the attacker hopefully doesn't know.

For general purpose distributions that focus on updating the OS per RPM/dpkg the idealized model above won't work out, since (as mentioned) this implies an immutable /usr/, and thus requires updating /usr/ via an atomic update operation. For such distros a setup like the following is probably more realistic, but see above.

Resource Needs Authentication Needs Encryption Suggested Technology Validation/Encryption Keys/Certificates acquired via Stored where
Shim yes no SecureBoot signature verification firmware certificate database ESP
Boot loader yes no ditto firmware certificate database/shim ESP/boot partition
Kernel yes no ditto ditto ditto
initrd yes no ditto ditto ditto
initrd parameters yes yes systemd TPM encrypted credentials TPM ditto
initrd extensions yes no systemd-sysext with Verity+PKCS#7 signatures firmware/initrd certificate database ditto
OS binary resources, configuration and state yes yes dm-crypt (LUKS) + dm-integrity TPM top-level partition
/home/ itself yes no dm-integrity with HMAC TPM top-level partition
User home directories yes yes dm-crypt (LUKS) + dm-integrity in loopback files User password/FIDO2/PKCS#11 security token loopback file inside /home partition

This means there's only one root file system that contains all of /etc/, /var/ and /usr/.

Recovery Keys

When binding encryption to TPMs one problem that arises is what strategy to adopt if the TPM is lost, due to hardware failure: if I need the TPM to unlock my encrypted volume, what do I do if I need the data but lost the TPM?

The answer here is supporting recovery keys (this is similar to how other OSes approach this). Recovery keys are pretty much the same concept as passwords. The main difference being that they are computer generated rather than user-chosen. Because of that they typically have much higher entropy (which makes them more annoying to type in, i.e you want to use them only when you must, not day-to-day). By having higher entropy they are useful in combination with TPM, FIDO2 or PKCS#11 based unlocking: unlike a combination with passwords they do not compromise the higher strength of protection that TPM/FIDO2/PKCS#11 based unlocking is supposed to provide.

Current versions of systemd-cryptenroll(1) implement a recovery key concept in an attempt to address this problem. You may enroll any combination of TPM chips, PKCS#11 tokens, FIDO2 tokens, recovery keys and passwords on the same LUKS volume. When enrolling a recovery key it is generated and shown on screen both in text form and as QR code you can scan off screen if you like. The idea is write down/store this recovery key at a safe place so that you can use it when you need it. Note that such recovery keys can be entered wherever a LUKS password is requested, i.e. after generation they behave pretty much the same as a regular password.

TPM PCR Brittleness

Locking devices to TPMs and enforcing a PCR policy with this (i.e. configuring the TPM key to be unlockable only if certain PCRs match certain values, and thus requiring the OS to be in a certain state) brings a problem with it: TPM PCR brittleness. If the key you want to unlock with the TPM requires the OS to be in a specific state (i.e. that all OS components' hashes match certain expectations or similar) then doing OS updates might have the affect of making your key inaccessible: the OS updates will cause the code to change, and thus the hashes of the code, and thus certain PCRs. (Thankfully, you unrolled a recovery key, as described above, so this doesn't mean you lost your data, right?).

To address this I'd suggest three strategies:

  1. Most importantly: don't actually use the TPM PCRs that contain code hashes. There are actually multiple PCRs defined, each containing measurements of different aspects of the boot process. My recommendation is to bind keys to PCR 7 only, a PCR that contains measurements of the UEFI SecureBoot certificate databases. Thus, the keys will remain accessible as long as these databases remain the same, and updates to code will not affect it (updates to the certificate databases will, and they do happen too, though hopefully much less frequent then code updates). Does this reduce security? Not much, no, because the code that's run is after all not just measured but also validated via code signatures, and those signatures are validated with the aforementioned certificate databases. Thus binding an encrypted TPM key to PCR 7 should enforce a similar level of trust in the boot/OS code as binding it to a PCR with hashes of specific versions of that code. i.e. using PCR 7 means you say "every code signed by these vendors is allowed to unlock my key" while using a PCR that contains code hashes means "only this exact version of my code may access my key".

  2. Use LUKS key management to enroll multiple versions of the TPM keys in relevant volumes, to support multiple versions of the OS code (or multiple versions of the certificate database, as discussed above). Specifically: whenever an update is done that might result changing the relevant PCRs, pre-calculate the new PCRs, and enroll them in an additional LUKS slot on the relevant volumes. This means that the unlocking keys tied to the TPM remain accessible in both states of the system. Eventually, once rebooted after the update, remove the old slots.

  3. If these two strategies didn't work out (maybe because the OS/firmware was updated outside of OS control, or the update mechanism was aborted at the wrong time) and the TPM PCRs changed unexpectedly, and the user now needs to use their recovery key to get access to the OS back, let's handle this gracefully and automatically reenroll the current TPM PCRs at boot, after the recovery key checked out, so that for future boots everything is in order again.

Other approaches can work too: for example, some OSes simply remove TPM PCR policy protection of disk encryption keys altogether immediately before OS or firmware updates, and then reenable it right after. Of course, this opens a time window where the key bound to the TPM is much less protected than people might assume. I'd try to avoid such a scheme if possible.

Anything Else?

So, given that we are talking about idealized systems: I personally actually think the ideal OS would be much simpler, and thus more secure than this:

I'd try to ditch the Shim, and instead focus on enrolling the distribution vendor keys directly in the UEFI firmware certificate list. This is actually supported by all firmwares too. This has various benefits: it's no longer necessary to bind everything to Microsoft's root key, you can just enroll your own stuff and thus make sure only what you want to trust is trusted and nothing else. To make an approach like this easier, we have been working on doing automatic enrollment of these keys from the systemd-boot boot loader, see this work in progress for details. This way the Firmware will authenticate the boot loader/kernel/initrd without any further component for this in place.

I'd also not bother with a separate boot partition, and just use the ESP for everything. The ESP is required anyway by the firmware, and is good enough for storing the few files we need.

FAQ

Can I implement all of this in my distribution today?

Probably not. While the big issues have mostly been addressed there's a lot of integration work still missing. As you might have seen I linked some PRs that haven't even been merged into our tree yet, and definitely not been released yet or even entered the distributions.

Will this show up in Fedora/Debian/Ubuntu soon?

I don't know. I am making a proposal how these things might work, and am working on getting various building blocks for this into shape. What the distributions do is up to them. But even if they don't follow the recommendations I make 100%, or don't want to use the building blocks I propose I think it's important they start thinking about this, and yes, I think they should be thinking about defaulting to setups like this.

Work for measuring/signing initrds on Fedora has been started, here's a slide deck with some information about it.

But isn't a TPM evil?

Some corners of the community tried (unfortunately successfully to some degree) to paint TPMs/Trusted Computing/SecureBoot as generally evil technologies that stop us from using our systems the way we want. That idea is rubbish though, I think. We should focus on what it can deliver for us (and that's a lot I think, see above), and appreciate the fact we can actually use it to kick out perceived evil empires from our devices instead of being subjected to them. Yes, the way SecureBoot/TPMs are defined puts you in the driver seat if you want — and you may enroll your own certificates to keep out everything you don't like.

What if my system doesn't have a TPM?

TPMs are becoming quite ubiquitous, in particular as the upcoming Windows versions will require them. In general I think we should focus on modern, fully equipped systems when designing all this, and then find fall-backs for more limited systems. Frankly it feels as if so far the design approach for all this was the other way round: try to make the new stuff work like the old rather than the old like the new (I mean, to me it appears this thinking is the main raison d'être for the Grub boot loader).

More specifically, on the systems where we have no TPM we ultimately cannot provide the same security guarantees as for those which have. So depending on the resource to protect we should fall back to different TPM-less mechanisms. For example, if we have no TPM then the root file system should probably be encrypted with a user provided password, typed in at boot as before. And for the encrypted boot credentials we probably should simply not encrypt them, and place them in the ESP unencrypted.

Effectively this means: without TPM you'll still get protection regarding the basic attack scenario, as before, but not the other two.

What if my system doesn't have UEFI?

Many of the mechanisms explained above taken individually do not require UEFI. But of course the chain of trust suggested above requires something like UEFI SecureBoot. If your system lacks UEFI it's probably best to find work-alikes to the technologies suggested above, but I doubt I'll be able to help you there.

rpm/dpkg already cryptographically validates all packages at installation time (gpg), why would I need more than that?

This type of package validation happens once: at the moment of installation (or update) of the package, but not anymore when the data installed is actually used. Thus when an attacker manages to modify the package data after installation and before use they can make any change they like without this ever being noticed. Such package download validation does address certain attack scenarios (i.e. man-in-the-middle attacks on network downloads), but it doesn't protect you from attackers with physical access, as described in the attack scenarios above.

Systems such as ostree aren't better than rpm/dpkg regarding this BTW, their data is not validated on use either, but only during download or when processing tree checkouts.

Key really here is that the scheme explained here provides offline protection for the data "at rest" — even someone with physical access to your device cannot easily make changes that aren't noticed on next use. rpm/dpkg/ostree provide online protection only: as long as the system remains up, and all OS changes are done through the intended program code-paths, and no one has physical access everything should be good. In today's world I am sure this is not good enough though. As mentioned most modern OSes provide offline protection for the data at rest in one way or another. Generic Linux distributions are terribly behind on this.

This is all so desktop/laptop focused, what about servers?

I am pretty sure servers should provide similar security guarantees as outlined above. In a way servers are a much simpler case: there are no users and no interactivity. Thus the discussion of /home/ and what it contains and of user passwords doesn't matter. However, the authenticated initrd and the unattended TPM-based encryption I think are very important for servers too, in a trusted data center environment. It provides security guarantees so far not given by Linux server OSes.

I'd like to help with this, or discuss/comment on this

Submit patches or reviews through GitHub. General discussion about this is best done on the systemd mailing list.


The Wondrous World of Discoverable GPT Disk Images

TL;DR: Tag your GPT partitions with the right, descriptive partition types, and the world will become a better place.

A number of years ago we started the Discoverable Partitions Specification which defines GPT partition type UUIDs and partition flags for the various partitions Linux systems typically deal with. Before the specification all Linux partitions usually just used the same type, basically saying "Hey, I am a Linux partition" and not much else. With this specification the GPT partition type, flags and label system becomes a lot more expressive, as it can tell you:

  1. What kind of data a partition contains (i.e. is this swap data, a file system or Verity data?)
  2. What the purpose/mount point of a partition is (i.e. is this a /home/ partition or a root file system?)
  3. What CPU architecture a partition is intended for (i.e. is this a root partition for x86-64 or for aarch64?)
  4. Shall this partition be mounted automatically? (i.e. without specifically be configured via /etc/fstab)
  5. And if so, shall it be mounted read-only?
  6. And if so, shall the file system be grown to its enclosing partition size, if smaller?
  7. Which partition contains the newer version of the same data (i.e. multiple root file systems, with different versions)

By embedding all of this information inside the GPT partition table disk images become self-descriptive: without requiring any other source of information (such as /etc/fstab) if you look at a compliant GPT disk image it is clear how an image is put together and how it should be used and mounted. This self-descriptiveness in particular breaks one philosophical weirdness of traditional Linux installations: the original source of information which file system the root file system is, typically is embedded in the root file system itself, in /etc/fstab. Thus, in a way, in order to know what the root file system is you need to know what the root file system is. 🤯 🤯 🤯

(Of course, the way this recursion is traditionally broken up is by then copying the root file system information from /etc/fstab into the boot loader configuration, resulting in a situation where the primary source of information for this — i.e. /etc/fstab — is actually mostly irrelevant, and the secondary source — i.e. the copy in the boot loader — becomes the configuration that actually matters.)

Today, the GPT partition type UUIDs defined by the specification have been adopted quite widely, by distributions and their installers, as well as a variety of partitioning tools and other tools.

In this article I want to highlight how the various tools the systemd project provides make use of the concepts the specification introduces.

But before we start with that, let's underline why tagging partitions with these descriptive partition type UUIDs (and the associated partition flags) is a good thing, besides the philosophical points made above.

  1. Simplicity: in particular OS installers become simpler — adjusting /etc/fstab as part of the installation is not necessary anymore, as the partitioning step already put all information into place for assembling the system properly at boot. i.e. installing doesn't mean that you always have to get fdisk and /etc/fstab into place, the former suffices entirely.

  2. Robustness: since partition tables mostly remain static after installation the chance of corruption is much lower than if the data is stored in file systems (e.g. in /etc/fstab). Moreover by associating the metadata directly with the objects it describes the chance of things getting out of sync is reduced. (i.e. if you lose /etc/fstab, or forget to rerun your initrd builder you still know what a partition is supposed to be just by looking at it.)

  3. Programmability: if partitions are self-descriptive it's much easier to automatically process them with various tools. In fact, this blog story is mostly about that: various systemd tools can naturally process disk images prepared like this.

  4. Alternative entry points: on traditional disk images, the boot loader needs to be told which kernel command line option root= to use, which then provides access to the root file system, where /etc/fstab is then found which describes the rest of the file systems. Where precisely root= is configured for the boot loader highly depends on the boot loader and distribution used, and is typically encoded in a Turing complete programming language (Grub…). This makes it very hard to automatically determine the right root file system to use, to implement alternative entry points to the system. By alternative entry points I mean other ways to boot the disk image, specifically for running it as a systemd-nspawn container — but this extends to other mechanisms where the boot loader may be bypassed to boot up the system, for example qemu when configured without a boot loader.

  5. User friendliness: it's simply a lot nicer for the user looking at a partition table if the partition table explains what is what, instead of just saying "Hey, this is a Linux partition!" and nothing else.

Uses for the concept

Now that we cleared up the Why?, lets have a closer look how this is currently used and exposed in systemd's various components.

Use #1: Running a disk image in a container

If a disk image follows the Discoverable Partition Specification then systemd-nspawn has all it needs to just boot it up. Specifically, if you have a GPT disk image in a file foobar.raw and you want to boot it up in a container, just run systemd-nspawn -i foobar.raw -b, and that's it (you can specify a block device like /dev/sdb too if you like). It becomes easy and natural to prepare disk images that can be booted either on a physical machine, inside a virtual machine manager or inside such a container manager: the necessary meta-information is included in the image, easily accessible before actually looking into its file systems.

Use #2: Booting an OS image on bare-metal without /etc/fstab or kernel command line root=

If a disk image follows the specification in many cases you can remove /etc/fstab (or never even install it) — as the basic information needed is already included in the partition table. The systemd-gpt-auto-generator logic implements automatic discovery of the root file system as well as all auxiliary file systems. (Note that the former requires an initrd that uses systemd, some more conservative distributions do not support that yet, unfortunately). Effectively this means you can boot up a kernel/initrd with an entirely empty kernel command line, and the initrd will automatically find the root file system (by looking for a suitably marked partition on the same drive the EFI System Partition was found on).

(Note, if /etc/fstab or root= exist and contain relevant information they always takes precedence over the automatic logic. This is in particular useful to tweaks thing by specifying additional mount options and such.)

Use #3: Mounting a complex disk image for introspection or manipulation

The systemd-dissect tool may be used to introspect and manipulate OS disk images that implement the specification. If you pass the path to a disk image (or block device) it will extract various bits of useful information from the image (e.g. what OS is this? what partitions to mount?) and display it.

With the --mount switch a disk image (or block device) can be mounted to some location. This is useful for looking what is inside it, or changing its contents. This will dissect the image and then automatically mount all contained file systems matching their GPT partition description to the right places, so that you subsequently could chroot into it. (But why chroot if you can just use systemd-nspawn? 😎)

Use #4: Copying files in and out of a disk image

The systemd-dissect tool also has two switches --copy-from and --copy-to which allow copying files out of or into a compliant disk image, taking all included file systems and the resulting mount hierarchy into account.

Use #5: Running services directly off a disk image

The RootImage= setting in service unit files accepts paths to compliant disk images (or block device nodes), and can mount them automatically, running service binaries directly off them (in chroot() style). In fact, this is the base for the Portable Service concept of systemd.

Use #6: Provisioning disk images

systemd provides various tools that can run operations provisioning disk images in an "offline" mode. Specifically:

systemd-tmpfiles

With the --image= switch systemd-tmpfiles can directly operate on a disk image, and for example create all directories and other inodes defined in its declarative configuration files included in the image. This can be useful for example to set up the /var/ or /etc/ tree according to such configuration before first boot.

systemd-sysusers

Similar, the --image= switch of systemd-sysusers tells the tool to read the declarative system user specifications included in the image and synthesizes system users from it, writing them to the /etc/passwd (and related) files in the image. This is useful for provisioning these users before the first boot, for example to ensure UID/GID numbers are pre-allocated, and such allocations not delayed until first boot.

systemd-machine-id-setup

The --image= switch of systemd-machine-id-setup may be used to provision a fresh machine ID into /etc/machine-id of a disk image, before first boot.

systemd-firstboot

The --image= switch of systemd-firstboot may be used to set various basic system setting (such as root password, locale information, hostname, …) on the specified disk image, before booting it up.

Use #7: Extracting log information

The journalctl switch --image= may be used to show the journal log data included in a disk image (or, as usual, the specified block device). This is very useful for analyzing failed systems offline, as it gives direct access to the logs without any further, manual analysis.

Use #8: Automatic repartitioning/growing of file systems

The systemd-repart tool may be used to repartition a disk or image in an declarative and additive way. One primary use-case for it is to run during boot on physical or VM systems to grow the root file system to the disk size, or to add in, format, encrypt, populate additional partitions at boot.

With its --image= switch it the tool may operate on compliant disk images in offline mode of operation: it will then read the partition definitions that shall be grown or created off the image itself, and then apply them to the image. This is particularly useful in combination with the --size= which allows growing disk images to the specified size.

Specifically, consider the following work-flow: you download a minimized disk image foobar.raw that contains only the minimized root file system (and maybe an ESP, if you want to boot it on bare-metal, too). You then run systemd-repart --image=foo.raw --size=15G to enlarge the image to the 15G, based on the declarative rules defined in the repart.d/ drop-in files included in the image (this means this can grow the root partition, and/or add in more partitions, for example for /srv or so, maybe encrypted with a locally generated key or so). Then, you proceed to boot it up with systemd-nspawn --image=foo.raw -b, making use of the full 15G.

Versioning + Multi-Arch

Disk images implementing this specifications can carry OS executables in one of three ways:

  1. Only a root file system

  2. Only a /usr/ file system (in which case the root file system is automatically picked as tmpfs).

  3. Both a root and a /usr/file system (in which case the two are combined, the /usr/ file system mounted into the root file system, and the former possibly in read-only fashion`)

They may also contain OS executables for different architectures, permitting "multi-arch" disk images that can safely boot up on multiple CPU architectures. As the root and /usr/ partition type UUIDs are specific to architectures this is easily done by including one such partition for x86-64, and another for aarch64. If the image is now used on an x86-64 system automatically the former partition is used, on aarch64 the latter.

Moreover, these OS executables may be contained in different versions, to implement a simple versioning scheme: when tools such as systemd-nspawn or systemd-gpt-auto-generator dissect a disk image, and they find two or more root or /usr/ partitions of the same type UUID, they will automatically pick the one whose GPT partition label (a 36 character free-form string every GPT partition may have) is the newest according to strverscmp() (OK, truth be told, we don't use strverscmp() as-is, but a modified version with some more modern syntax and semantics, but conceptually identical).

This logic allows to implement a very simple and natural A/B update scheme: an updater can drop multiple versions of the OS into separate root or /usr/ partitions, always updating the partition label to the version included there-in once the download is complete. All of the tools described here will then honour this, and always automatically pick the newest version of the OS.

Verity

When building modern OS appliances, security is highly relevant. Specifically, offline security matters: an attacker with physical access should have a difficult time modifying the OS in a way that isn't noticed. i.e. think of a car or a cell network base station: these appliances are usually parked/deployed in environments attackers can get physical access to: it's essential that in this case the OS itself sufficiently protected, so that the attacker cannot just mount the OS file system image, make modifications (inserting a backdoor, spying software or similar) and the system otherwise continues to run without this being immediately detected.

A great way to implement offline security is via Linux' dm-verity subsystem: it allows to securely bind immutable disk IO to a single, short trusted hash value: if an attacker manages to offline modify the disk image the modified disk image won't match the trusted hash anymore, and will not be trusted anymore (depending on policy this then just result in IO errors being generated, or automatic reboot/power-off).

The Discoverable Partitions Specification declares how to include Verity validation data in disk images, and how to relate them to the file systems they protect, thus making if very easy to deploy and work with such protected images. For example systemd-nspawn supports a --root-hash= switch, which accepts the Verity root hash and then will automatically assemble dm-verity with this, automatically matching up the payload and verity partitions. (Alternatively, just place a .roothash file next to the image file).

Future

The above already is a powerful tool set for working with disk images. However, there are some more areas I'd like to extend this logic to:

bootctl

Similar to the other tools mentioned above, bootctl (which is a tool to interface with the boot loader, and install/update systemd's own EFI boot loader sd-boot) should learn a --image= switch, to make installation of the boot loader on disk images easy and natural. It would automatically find the ESP and other relevant partitions in the image, and copy the boot loader binaries into them (or update them).

coredumpctl

Similar to the existing journalctl --image= logic the coredumpctl tool should also gain an --image= switch for extracting coredumps from compliant disk images. The combination of journalctl --image= and coredumpctl --image= would make it exceptionally easy to work with OS disk images of appliances and extracting logging and debugging information from them after failures.

And that's all for now. Please refer to the specification and the man pages for further details. If your distribution's installer does not yet tag the GPT partition it creates with the right GPT type UUIDs, consider asking them to do so.

Thank you for your time.


File Descriptor Limits

TL;DR: don't use select() + bump the RLIMIT_NOFILE soft limit to the hard limit in your modern programs.

The primary way to reference, allocate and pin runtime OS resources on Linux today are file descriptors ("fds"). Originally they were used to reference open files and directories and maybe a bit more, but today they may be used to reference almost any kind of runtime resource in Linux userspace, including open devices, memory (memfd_create(2)), timers (timefd_create(2)) and even processes (with the new pidfd_open(2) system call). In a way, the philosophically skewed UNIX concept of "everything is a file" through the proliferation of fds actually acquires a bit of sensible meaning: "everything has a file descriptor" is certainly a much better motto to adopt.

Because of this proliferation of fds, non-trivial modern programs tend to have to deal with substantially more fds at the same time than they traditionally did. Today, you'll often encounter real-life programs that have a few thousand fds open at the same time.

Like on most runtime resources on Linux limits are enforced on file descriptors: once you hit the resource limit configured via RLIMIT_NOFILE any attempt to allocate more is refused with the EMFILE error — until you close a couple of those you already have open.

Because fds weren't such a universal concept traditionally, the limit of RLIMIT_NOFILE used to be quite low. Specifically, when the Linux kernel first invokes userspace it still sets RLIMIT_NOFILE to a low value of 1024 (soft) and 4096 (hard). (Quick explanation: the soft limit is what matters and causes the EMFILE issues, the hard limit is a secondary limit that processes may bump their soft limit to — if they like — without requiring further privileges to do so. Bumping the limit further would require privileges however.). A limit of 1024 fds made fds a scarce resource: APIs tried to be careful with using fds, since you simply couldn't have that many of them at the same time. This resulted in some questionable coding decisions and concepts at various places: often secondary descriptors that are very similar to fds — but were not actually fds — were introduced (e.g. inotify watch descriptors), simply to avoid for them the low limits enforced on true fds. Or code tried to aggressively close fds when not absolutely needing them (e.g. ftw()/nftw()), losing the nice + stable "pinning" effect of open fds.

Worse though is that certain OS level APIs were designed having only the low limits in mind. The worst offender being the BSD/POSIX select(2) system call: it only works with fds in the numeric range of 0…1023 (aka FD_SETSIZE-1). If you have an fd outside of this range, tough luck: select() won't work, and only if you are lucky you'll detect that and can handle it somehow.

Linux fds are exposed as simple integers, and for most calls it is guaranteed that the lowest unused integer is allocated for new fds. Thus, as long as the RLIMIT_NOFILE soft limit is set to 1024 everything remains compatible with select(): the resulting fds will also be below 1024. Yay. If we'd bump the soft limit above this threshold though and at some point in time an fd higher than the threshold is allocated, this fd would not be compatible with select() anymore.

Because of that, indiscriminately increasing the soft RLIMIT_NOFILE resource limit today for every userspace process is problematic: as long as there's userspace code still using select() doing so will risk triggering hard-to-handle, hard-to-debug errors all over the place.

However, given the nowadays ubiquitous use of fds for all kinds of resources (did you know, an eBPF program is an fd? and a cgroup too? and attaching an eBPF program to cgroup is another fd? …), we'd really like to raise the limit anyway. 🤔

So before we continue thinking about this problem, let's make the problem more complex (…uh, I mean… "more exciting") first. Having just one hard and one soft per-process limit on fds is boring. Let's add more limits on fds to the mix. Specifically on Linux there are two system-wide sysctls: fs.nr_open and fs.file-max. (Don't ask me why one uses a dash and the other an underscore, or why there are two of them...) On today's kernels they kinda lost their relevance. They had some originally, because fds weren't accounted by any other counter. But today, the kernel tracks fds mostly as small pieces of memory allocated on userspace requests — because that's ultimately what they are —, and thus charges them to the memory accounting done anyway.

So now, we have four limits (actually: five if you count the memory accounting) on the same kind of resource, and all of them make a resource artificially scarce that we don't want to be scarce. So what to do?

Back in systemd v240 already (i.e. 2019) we decided to do something about it. Specifically:

  • Automatically at boot we'll now bump the two sysctls to their maximum, making them effectively ineffective. This one was easy. We got rid of two pretty much redundant knobs. Nice!

  • The RLIMIT_NOFILE hard limit is bumped substantially to 512K. Yay, cheap fds! You may have an fd, and you, and you as well, everyone may have an fd!

  • But … we left the soft RLIMIT_NOFILE limit at 1024. We weren't quite ready to break all programs still using select() in 2019 yet. But it's not as bad as it might sound I think: given the hard limit is bumped every program can easily opt-in to a larger number of fds, by setting the soft limit to the hard limit early on — without requiring privileges.

So effectively, with this approach fds should be much less scarce (at least for programs that opt into that), and the limits should be much easier to configure, since there are only two knobs now one really needs to care about:

  • Configure the RLIMIT_NOFILE hard limit to the maximum number of fds you actually want to allow a process.

  • In the program code then either bump the soft to the hard limit, or not. If you do, you basically declare "I understood the problem, I promise to not use select(), drown me fds please!". If you don't then effectively everything remains as it always was.

Apparently this approach worked, since the negative feedback on change was even scarcer than fds traditionally were (ha, fun!). We got reports from pretty much only two projects that were bitten by the change (one being a JVM implementation): they already bumped their soft limit automatically to their hard limit during program initialization, and then allocated an array with one entry per possible fd. With the new high limit this resulted in one massive allocation that traditionally was just a few K, and this caused memory checks to be hit.

Anyway, here's the take away of this blog story:

  • Don't use select() anymore in 2021. Use poll(), epoll, iouring, …, but for heaven's sake don't use select(). It might have been all the rage in the 1990s but it doesn't scale and is simply not designed for today's programs. I wished the man page of select() would make clearer how icky it is and that there are plenty of more preferably APIs.

  • If you hack on a program that potentially uses a lot of fds, add some simple code somewhere to its start-up that bumps the RLIMIT_NOFILE soft limit to the hard limit. But if you do this, you have to make sure your code (and any code that you link to from it) refrains from using select(). (Note: there's at least one glibc NSS plugin using select() internally. Given that NSS modules can end up being loaded into pretty much any process such modules should probably be considered just buggy.)

  • If said program you hack on forks off foreign programs, make sure to reset the RLIMIT_NOFILE soft limit back to 1024 for them. Just because your program might be fine with fds >= 1024 it doesn't mean that those foreign programs might. And unfortunately RLIMIT_NOFILE is inherited down the process tree unless explicitly set.

And that's all I have for today. I hope this was enlightening.


Unlocking LUKS2 volumes with TPM2, FIDO2, PKCS#11 Security Hardware on systemd 248

TL;DR: It's now easy to unlock your LUKS2 volume with a FIDO2 security token (e.g. YubiKey, Nitrokey FIDO2, AuthenTrend ATKey.Pro). And TPM2 unlocking is easy now too.

Blogging is a lot of work, and a lot less fun than hacking. I mostly focus on the latter because of that, but from time to time I guess stuff is just too interesting to not be blogged about. Hence here, finally, another blog story about exciting new features in systemd.

With the upcoming systemd v248 the systemd-cryptsetup component of systemd (which is responsible for assembling encrypted volumes during boot) gained direct support for unlocking encrypted storage with three types of security hardware:

  1. Unlocking with FIDO2 security tokens (well, at least with those which implement the hmac-secret extension; most do). i.e. your YubiKeys (series 5 and above), Nitrokey FIDO2, AuthenTrend ATKey.Pro and such.

  2. Unlocking with TPM2 security chips (pretty ubiquitous on non-budget PCs/laptops/…)

  3. Unlocking with PKCS#11 security tokens, i.e. your smartcards and older YubiKeys (the ones that implement PIV). (Strictly speaking this was supported on older systemd already, but was a lot more "manual".)

For completeness' sake, let's keep in mind that the component also allows unlocking with these more traditional mechanisms:

  1. Unlocking interactively with a user-entered passphrase (i.e. the way most people probably already deploy it, supported since about forever)

  2. Unlocking via key file on disk (optionally on removable media plugged in at boot), supported since forever.

  3. Unlocking via a key acquired through trivial AF_UNIX/SOCK_STREAM socket IPC. (Also new in v248)

  4. Unlocking via recovery keys. These are pretty much the same thing as a regular passphrase (and in fact can be entered wherever a passphrase is requested) — the main difference being that they are always generated by the computer, and thus have guaranteed high entropy, typically higher than user-chosen passphrases. They are generated in a way they are easy to type, in many cases even if the local key map is misconfigured. (Also new in v248)

In this blog story, let's focus on the first three items, i.e. those that talk to specific types of hardware for implementing unlocking.

To make working with security tokens and TPM2 easy, a new, small tool was added to the systemd tool set: systemd-cryptenroll. It's only purpose is to make it easy to enroll your security token/chip of choice into an encrypted volume. It works with any LUKS2 volume, and embeds a tiny bit of meta-information into the LUKS2 header with parameters necessary for the unlock operation.

Unlocking with FIDO2

So, let's see how this fits together in the FIDO2 case. Most likely this is what you want to use if you have one of these fancy FIDO2 tokens (which need to implement the hmac-secret extension, as mentioned). Let's say you already have your LUKS2 volume set up, and previously unlocked it with a simple passphrase. Plug in your token, and run:

# systemd-cryptenroll --fido2-device=auto /dev/sda5

(Replace /dev/sda5 with the underlying block device of your volume).

This will enroll the key as an additional way to unlock the volume, and embeds all necessary information for it in the LUKS2 volume header. Before we can unlock the volume with this at boot, we need to allow FIDO2 unlocking via /etc/crypttab. For that, find the right entry for your volume in that file, and edit it like so:

myvolume /dev/sda5 - fido2-device=auto

Replace myvolume and /dev/sda5 with the right volume name, and underlying device of course. Key here is the fido2-device=auto option you need to add to the fourth column in the file. It tells systemd-cryptsetup to use the FIDO2 metadata now embedded in the LUKS2 header, wait for the FIDO2 token to be plugged in at boot (utilizing systemd-udevd, …) and unlock the volume with it.

And that's it already. Easy-peasy, no?

Note that all of this doesn't modify the FIDO2 token itself in any way. Moreover you can enroll the same token in as many volumes as you like. Since all enrollment information is stored in the LUKS2 header (and not on the token) there are no bounds on any of this. (OK, well, admittedly, there's a cap on LUKS2 key slots per volume, i.e. you can't enroll more than a bunch of keys per volume.)

Unlocking with PKCS#11

Let's now have a closer look how the same works with a PKCS#11 compatible security token or smartcard. For this to work, you need a device that can store an RSA key pair. I figure most security tokens/smartcards that implement PIV qualify. How you actually get the keys onto the device might differ though. Here's how you do this for any YubiKey that implements the PIV feature:

# ykman piv reset
# ykman piv generate-key -a RSA2048 9d pubkey.pem
# ykman piv generate-certificate --subject "Knobelei" 9d pubkey.pem
# rm pubkey.pem

(This chain of commands erases what was stored in PIV feature of your token before, be careful!)

For tokens/smartcards from other vendors a different series of commands might work. Once you have a key pair on it, you can enroll it with a LUKS2 volume like so:

# systemd-cryptenroll --pkcs11-token-uri=auto /dev/sda5

Just like the same command's invocation in the FIDO2 case this enrolls the security token as an additional way to unlock the volume, any passphrases you already have enrolled remain enrolled.

For the PKCS#11 case you need to edit your /etc/crypttab entry like this:

myvolume /dev/sda5 - pkcs11-uri=auto

If you have a security token that implements both PKCS#11 PIV and FIDO2 I'd probably enroll it as FIDO2 device, given it's the more contemporary, future-proof standard. Moreover, it requires no special preparation in order to get an RSA key onto the device: FIDO2 keys typically just work.

Unlocking with TPM2

Most modern (non-budget) PC hardware (and other kind of hardware too) nowadays comes with a TPM2 security chip. In many ways a TPM2 chip is a smartcard that is soldered onto the mainboard of your system. Unlike your usual USB-connected security tokens you thus cannot remove them from your PC, which means they address quite a different security scenario: they aren't immediately comparable to a physical key you can take with you that unlocks some door, but they are a key you leave at the door, but that refuses to be turned by anyone but you.

Even though this sounds a lot weaker than the FIDO2/PKCS#11 model TPM2 still bring benefits for securing your systems: because the cryptographic key material stored in TPM2 devices cannot be extracted (at least that's the theory), if you bind your hard disk encryption to it, it means attackers cannot just copy your disk and analyze it offline — they always need access to the TPM2 chip too to have a chance to acquire the necessary cryptographic keys. Thus, they can still steal your whole PC and analyze it, but they cannot just copy the disk without you noticing and analyze the copy.

Moreover, you can bind the ability to unlock the harddisk to specific software versions: for example you could say that only your trusted Fedora Linux can unlock the device, but not any arbitrary OS some hacker might boot from a USB stick they plugged in. Thus, if you trust your OS vendor, you can entrust storage unlocking to the vendor's OS together with your TPM2 device, and thus can be reasonably sure intruders cannot decrypt your data unless they both hack your OS vendor and steal/break your TPM2 chip.

Here's how you enroll your LUKS2 volume with your TPM2 chip:

# systemd-cryptenroll --tpm2-device=auto --tpm2-pcrs=7 /dev/sda5

This looks almost as straightforward as the two earlier sytemd-cryptenroll command lines — if it wasn't for the --tpm2-pcrs= part. With that option you can specify to which TPM2 PCRs you want to bind the enrollment. TPM2 PCRs are a set of (typically 24) hash values that every TPM2 equipped system at boot calculates from all the software that is invoked during the boot sequence, in a secure, unfakable way (this is called "measurement"). If you bind unlocking to a specific value of a specific PCR you thus require the system has to follow the same sequence of software at boot to re-acquire the disk encryption key. Sounds complex? Well, that's because it is.

For now, let's see how we have to modify your /etc/crypttab to unlock via TPM2:

myvolume /dev/sda5 - tpm2-device=auto

This part is easy again: the tpm2-device= option is what tells systemd-cryptsetup to use the TPM2 metadata from the LUKS2 header and to wait for the TPM2 device to show up.

Bonus: Recovery Key Enrollment

FIDO2, PKCS#11 and TPM2 security tokens and chips pair well with recovery keys: since you don't need to type in your password everyday anymore it makes sense to get rid of it, and instead enroll a high-entropy recovery key you then print out or scan off screen and store a safe, physical location. i.e. forget about good ol' passphrase-based unlocking, go for FIDO2 plus recovery key instead! Here's how you do it:

# systemd-cryptenroll --recovery-key /dev/sda5

This will generate a key, enroll it in the LUKS2 volume, show it to you on screen and generate a QR code you may scan off screen if you like. The key has highest entropy, and can be entered wherever you can enter a passphrase. Because of that you don't have to modify /etc/crypttab to make the recovery key work.

Future

There's still plenty room for further improvement in all of this. In particular for the TPM2 case: what the text above doesn't really mention is that binding your encrypted volume unlocking to specific software versions (i.e. kernel + initrd + OS versions) actually sucks hard: if you naively update your system to newer versions you might lose access to your TPM2 enrolled keys (which isn't terrible, after all you did enroll a recovery key — right? — which you then can use to regain access). To solve this some more integration with distributions would be necessary: whenever they upgrade the system they'd have to make sure to enroll the TPM2 again — with the PCR hashes matching the new version. And whenever they remove an old version of the system they need to remove the old TPM2 enrollment. Alternatively TPM2 also knows a concept of signed PCR hash values. In this mode the distro could just ship a set of PCR signatures which would unlock the TPM2 keys. (But quite frankly I don't really see the point: whether you drop in a signature file on each system update, or enroll a new set of PCR hashes in the LUKS2 header doesn't make much of a difference). Either way, to make TPM2 enrollment smooth some more integration work with your distribution's system update mechanisms need to happen. And yes, because of this OS updating complexity the example above — where I referenced your trusty Fedora Linux — doesn't actually work IRL (yet? hopefully…). Nothing updates the enrollment automatically after you initially enrolled it, hence after the first kernel/initrd update you have to manually re-enroll things again, and again, and again … after every update.

The TPM2 could also be used for other kinds of key policies, we might look into adding later too. For example, Windows uses TPM2 stuff to allow short (4 digits or so) "PINs" for unlocking the harddisk, i.e. kind of a low-entropy password you type in. The reason this is reasonably safe is that in this case the PIN is passed to the TPM2 which enforces that not more than some limited amount of unlock attempts may be made within some time frame, and that after too many attempts the PIN is invalidated altogether. Thus making dictionary attacks harder (which would normally be easier given the short length of the PINs).

Postscript

(BTW: Yubico sent me two YubiKeys for testing, Nitrokey a Nitrokey FIDO2, and AuthenTrend three ATKey.Pro tokens, thank you! — That's why you see all those references to YubiKey/Nitrokey/AuthenTrend devices in the text above: it's the hardware I had to test this with. That said, I also tested the FIDO2 stuff with a SoloKey I bought, where it also worked fine. And yes, you!, other vendors!, who might be reading this, please send me your security tokens for free, too, and I might test things with them as well. No promises though. And I am not going to give them back, if you do, sorry. ;-))


ASG! 2019 CfP Re-Opened!

The All Systems Go! 2019 Call for Participation Re-Opened for ONE DAY!

Due to popular request we have re-opened the Call for Participation (CFP) for All Systems Go! 2019 for one day. It will close again TODAY, on 15 of July 2019, midnight Central European Summit Time! If you missed the deadline so far, we’d like to invite you to submit your proposals for consideration to the CFP submission site quickly! (And yes, this is the last extension, there's not going to be any more extensions.)

ASG image

All Systems Go! is everybody's favourite low-level Userspace Linux conference, taking place in Berlin, Germany in September 20-22, 2019.

For more information please visit our conference website!


Walkthrough for Portable Services in Go

Portable Services Walkthrough (Go Edition)

A few months ago I posted a blog story with a walkthrough of systemd Portable Services. The example service given was written in C, and the image was built with mkosi. In this blog story I'd like to revisit the exercise, but this time focus on a different aspect: modern programming languages like Go and Rust push users a lot more towards static linking of libraries than the usual dynamic linking preferred by C (at least in the way C is used by traditional Linux distributions).

Static linking means we can greatly simplify image building: if we don't have to link against shared libraries during runtime we don't have to include them in the portable service image. And that means pretty much all need for building an image from a Linux distribution of some kind goes away as we'll have next to no dependencies that would require us to rely on a distribution package manager or distribution packages. In fact, as it turns out, we only need as few as three files in the portable service image to be fully functional.

So, let's have a closer look how such an image can be put together. All of the following is available in this git repository.

A Simple Go Service

Let's start with a simple Go service, an HTTP service that simply counts how often a page from it is requested. Here are the sources: main.go — note that I am not a seasoned Go programmer, hence please be gracious.

The service implements systemd's socket activation protocol, and thus can receive bound TCP listener sockets from systemd, using the $LISTEN_PID and $LISTEN_FDS environment variables.

The service will store the counter data in the directory indicated in the $STATE_DIRECTORY environment variable, which happens to be an environment variable current systemd versions set based on the StateDirectory= setting in service files.

Two Simple Unit Files

When a service shall be managed by systemd a unit file is required. Since the service we are putting together shall be socket activatable, we even have two: portable-walkthrough-go.service (the description of the service binary itself) and portable-walkthrough-go.socket (the description of the sockets to listen on for the service).

These units are not particularly remarkable: the .service file primarily contains the command line to invoke and a StateDirectory= setting to make sure the service when invoked gets its own private state directory under /var/lib/ (and the $STATE_DIRECTORY environment variable is set to the resulting path). The .socket file simply lists 8088 as TCP/IP port to listen on.

An OS Description File

OS images (and that includes portable service images) generally should include an os-release file. Usually, that is provided by the distribution. Since we are building an image without any distribution let's write our own version of such a file. Later on we can use the portablectl inspect command to have a look at this metadata of our image.

Putting it All Together

The four files described above are already every file we need to build our image. Let's now put the portable service image together. For that I've written a Makefile. It contains two relevant rules: the first one builds the static binary from the Go program sources. The second one then puts together a squashfs file system combining the following:

  1. The compiled, statically linked service binary
  2. The two systemd unit files
  3. The os-release file
  4. A couple of empty directories such as /proc/, /sys/, /dev/ and so on that need to be over-mounted with the respective kernel API file system. We need to create them as empty directories here since Linux insists on directories to exist in order to over-mount them, and since the image we are building is going to be an immutable read-only image (squashfs) these directories cannot be created dynamically when the portable image is mounted.
  5. Two empty files /etc/resolv.conf and /etc/machine-id that can be over-mounted with the same files from the host.

And that's already it. After a quick make we'll have our portable service image portable-walkthrough-go.raw and are ready to go.

Trying it out

Let's now attach the portable service image to our host system:

# portablectl attach ./portable-walkthrough-go.raw
(Matching unit files with prefix 'portable-walkthrough-go'.)
Created directory /etc/systemd/system.attached.
Created directory /etc/systemd/system.attached/portable-walkthrough-go.socket.d.
Written /etc/systemd/system.attached/portable-walkthrough-go.socket.d/20-portable.conf.
Copied /etc/systemd/system.attached/portable-walkthrough-go.socket.
Created directory /etc/systemd/system.attached/portable-walkthrough-go.service.d.
Written /etc/systemd/system.attached/portable-walkthrough-go.service.d/20-portable.conf.
Created symlink /etc/systemd/system.attached/portable-walkthrough-go.service.d/10-profile.conf  /usr/lib/systemd/portable/profile/default/service.conf.
Copied /etc/systemd/system.attached/portable-walkthrough-go.service.
Created symlink /etc/portables/portable-walkthrough-go.raw  /home/lennart/projects/portable-walkthrough-go/portable-walkthrough-go.raw.

The portable service image is now attached to the host, which means we can now go and start it (or even enable it):

# systemctl start portable-walkthrough-go.socket

Let's see if our little web service works, by doing an HTTP request on port 8088:

# curl localhost:8088
Hello! You are visitor #1!

Let's try this again, to check if it counts correctly:

# curl localhost:8088
Hello! You are visitor #2!

Nice! It worked. Let's now stop the service again, and detach the image again:

# systemctl stop portable-walkthrough-go.service portable-walkthrough-go.socket
# portablectl detach portable-walkthrough-go
Removed /etc/systemd/system.attached/portable-walkthrough-go.service.
Removed /etc/systemd/system.attached/portable-walkthrough-go.service.d/10-profile.conf.
Removed /etc/systemd/system.attached/portable-walkthrough-go.service.d/20-portable.conf.
Removed /etc/systemd/system.attached/portable-walkthrough-go.service.d.
Removed /etc/systemd/system.attached/portable-walkthrough-go.socket.
Removed /etc/systemd/system.attached/portable-walkthrough-go.socket.d/20-portable.conf.
Removed /etc/systemd/system.attached/portable-walkthrough-go.socket.d.
Removed /etc/portables/portable-walkthrough-go.raw.
Removed /etc/systemd/system.attached.

And there we go, the portable image file is detached from the host again.

A Couple of Notes

  1. Of course, this is a simplistic example: in real life services will be more than one compiled file, even when statically linked. But you get the idea, and it's very easy to extend the example above to include any additional, auxiliary files in the portable service image.

  2. The service is very nicely sandboxed during runtime: while it runs as regular service on the host (and you thus can watch its logs or do resource management on it like you would do for all other systemd services), it runs in a very restricted environment under a dynamically assigned UID that ceases to exist when the service is stopped again.

  3. Originally I wanted to make the service not only socket activatable but also implement exit-on-idle, i.e. add a logic so that the service terminates on its own when there's no ongoing HTTP connection for a while. I couldn't figure out how to do this race-freely in Go though, but I am sure an interested reader might want to add that? By combining socket activation with exit-on-idle we can turn this project into an excercise of putting together an extremely resource-friendly and robust service architecture: the service is started only when needed and terminates when no longer needed. This would allow to pack services at a much higher density even on systems with few resources.

  4. While the basic concepts of portable services have been around since systemd 239, it's best to try the above with systemd 241 or newer since the portable service logic received a number of fixes since then.

Further Reading

A low-level document introducing Portable Services is shipped along with systemd.

Please have a look at the blog story from a few months ago that did something very similar with a service written in C.

There are also relevant manual pages: portablectl(1) and systemd-portabled(8).


ASG! 2018 Tickets

All Systems Go! 2018 Tickets Selling Out Quickly!

Buy your tickets for All Systems Go! 2018 soon, they are quickly selling out! The conference takes place on September 28-30, in Berlin, Germany, in a bit over two weeks.

Why should you attend? If you are interested in low-level Linux userspace, then All Systems Go! is the right conference for you. It covers all topics relevant to foundational open-source Linux technologies. For details on the covered topics see our schedule for day #1 and for day #2.

For more information please visit our conference website!

See you in Berlin!


ASG! 2018 CfP Closes TODAY

The All Systems Go! 2018 Call for Participation Closes TODAY!

The Call for Participation (CFP) for All Systems Go! 2018 will close TODAY, on 30th of July! We’d like to invite you to submit your proposals for consideration to the CFP submission site quickly!

ASG image

All Systems Go! is everybody's favourite low-level Userspace Linux conference, taking place in Berlin, Germany in September 28-30, 2018.

For more information please visit our conference website!


ASG! 2018 CfP Closes Soon

The All Systems Go! 2018 Call for Participation Closes in One Week!

The Call for Participation (CFP) for All Systems Go! 2018 will close in one week, on 30th of July! We’d like to invite you to submit your proposals for consideration to the CFP submission site quickly!

ASG image

Notification of acceptance and non-acceptance will go out within 7 days of the closing of the CFP.

All topics relevant to foundational open-source Linux technologies are welcome. In particular, however, we are looking for proposals including, but not limited to, the following topics:

  • Low-level container executors and infrastructure
  • IoT and embedded OS infrastructure
  • BPF and eBPF filtering
  • OS, container, IoT image delivery and updating
  • Building Linux devices and applications
  • Low-level desktop technologies
  • Networking
  • System and service management
  • Tracing and performance measuring
  • IPC and RPC systems
  • Security and Sandboxing

While our focus is definitely more on the user-space side of things, talks about kernel projects are welcome, as long as they have a clear and direct relevance for user-space.

For more information please visit our conference website!


Walkthrough for Portable Services

Portable Services with systemd v239

systemd v239 contains a great number of new features. One of them is first class support for Portable Services. In this blog story I'd like to shed some light on what they are and why they might be interesting for your application.

What are "Portable Services"?

The "Portable Service" concept takes inspiration from classic chroot() environments as well as container management and brings a number of their features to more regular system service management.

While the definition of what a "container" really is is hotly debated, I figure people can generally agree that the "container" concept primarily provides two major features:

  1. Resource bundling: a container generally brings its own file system tree along, bundling any shared libraries and other resources it might need along with the main service executables.

  2. Isolation and sand-boxing: a container operates in a name-spaced environment that is relatively detached from the host. Besides living in its own file system namespace it usually also has its own user database, process tree and so on. Access from the container to the host is limited with various security technologies.

Of these two concepts the first one is also what traditional UNIX chroot() environments are about.

Both resource bundling and isolation/sand-boxing are concepts systemd has implemented to varying degrees for a longer time. Specifically, RootDirectory= and RootImage= have been around for a long time, and so have been the various sand-boxing features systemd provides. The Portable Services concept builds on that, putting these features together in a new, integrated way to make them more accessible and usable.

OK, so what precisely is a "Portable Service"?

Much like a container image, a portable service on disk can be just a directory tree that contains service executables and all their dependencies, in a hierarchy resembling the normal Linux directory hierarchy. A portable service can also be a raw disk image, containing a file system containing such a tree (which can be mounted via a loop-back block device), or multiple file systems (in which case they need to follow the Discoverable Partitions Specification and be located within a GPT partition table). Regardless whether the portable service on disk is a simple directory tree or a raw disk image, let's call this concept the portable service image.

Such images can be generated with any tool typically used for the purpose of installing OSes inside some directory, for example dnf --installroot= or debootstrap. There are very few requirements made on these trees, except the following two:

  1. The tree should carry systemd unit files for relevant services in them.

  2. The tree should carry /usr/lib/os-release (or /etc/os-release) OS release information.

Of course, as you might notice, OS trees generated from any of today's big distributions generally qualify for these two requirements without any further modification, as pretty much all of them adopted /usr/lib/os-release and tend to ship their major services with systemd unit files.

A portable service image generated like this can be "attached" or "detached" from a host:

  1. "Attaching" an image to a host is done through the new portablectl attach command. This command dissects the image, reading the os-release information, and searching for unit files in them. It then copies relevant unit files out of the images and into /etc/systemd/system/. After that it augments any copied service unit files in two ways: a drop-in adding a RootDirectory= or RootImage= line is added in so that even though the unit files are now available on the host when started they run the referenced binaries from the image. It also symlinks in a second drop-in which is called a "profile", which is supposed to carry additional security settings to enforce on the attached services, to ensure the right amount of sand-boxing.

  2. "Detaching" an image from the host is done through portable detach. It reverses the steps above: the unit files copied out are removed again, and so are the two drop-in files generated for them.

While a portable service is attached its relevant unit files are made available on the host like any others: they will appear in systemctl list-unit-files, you can enable and disable them, you can start them and stop them. You can extend them with systemctl edit. You can introspect them. You can apply resource management to them like to any other service, and you can process their logs like any other service and so on. That's because they really are native systemd services, except that they have 'twist' if you so will: they have tougher security by default and store their resources in a root directory or image.

And that's already the essence of what Portable Services are.

A couple of interesting points:

  1. Even though the focus is on shipping service unit files in portable service images, you can actually ship timer units, socket units, target units, path units in portable services too. This means you can very naturally do time, socket and path based activation. It's also entirely fine to ship multiple service units in the same image, in case you have more complex applications.

  2. This concept introduces zero new metadata. Unit files are an existing concept, as are os-release files, and — in case you opt for raw disk images — GPT partition tables are already established too. This also means existing tools to generate images can be reused for building portable service images to a large degree as no completely new artifact types need to be generated.

  3. Because the Portable Service concepts introduces zero new metadata and just builds on existing security and resource bundling features of systemd it's implemented in a set of distinct tools, relatively disconnected from the rest of systemd. Specifically, the main user-facing command is portablectl, and the actual operations are implemented in systemd-portabled.service. If you so will, portable services are a true add-on to systemd, just making a specific work-flow nicer to use than with the basic operations systemd otherwise provides. Also note that systemd-portabled provides bus APIs accessible to any program that wants to interface with it, portablectl is just one tool that happens to be shipped along with systemd.

  4. Since Portable Services are a feature we only added very recently we wanted to keep some freedom to make changes still. Due to that we decided to install the portablectl command into /usr/lib/systemd/ for now, so that it does not appear in $PATH by default. This means, for now you have to invoke it with a full path: /usr/lib/systemd/portablectl. We expect to move it into /usr/bin/ very soon though, and make it a fully supported interface of systemd.

  5. You may wonder which unit files contained in a portable service image are the ones considered "relevant" and are actually copied out by the portablectl attach operation. Currently, this is derived from the image name. Let's say you have an image stored in a directory /var/lib/portables/foobar_4711/ (or alternatively in a raw image /var/lib/portables/foobar_4711.raw). In that case the unit files copied out match the pattern foobar*.service, foobar*.socket, foobar*.target, foobar*.path, foobar*.timer.

  6. The Portable Services concept does not define any specific method how images get on the deployment machines, that's entirely up to administrators. You can just scp them there, or wget them. You could even package them as RPMs and then deploy them with dnf if you feel adventurous.

  7. Portable service images can reside in any directory you like. However, if you place them in /var/lib/portables/ then portablectl will find them easily and can show you a list of images you can attach and suchlike.

  8. Attaching a portable service image can be done persistently, so that it remains attached on subsequent boots (which is the default), or it can be attached only until the next reboot, by passing --runtime to portablectl.

  9. Because portable service images are ultimately just regular OS images, it's natural and easy to build a single image that can be used in three different ways:

    1. It can be attached to any host as a portable service image.

    2. It can be booted as OS container, for example in a container manager like systemd-nspawn.

    3. It can be booted as host system, for example on bare metal or in a VM manager.

    Of course, to qualify for the latter two the image needs to contain more than just the service binaries, the os-release file and the unit files. To be bootable an OS container manager such as systemd-nspawn the image needs to contain an init system of some form, for example systemd. To be bootable on bare metal or as VM it also needs a boot loader of some form, for example systemd-boot.

Profiles

In the previous section the "profile" concept was briefly mentioned. Since they are a major feature of the Portable Services concept, they deserve some focus. A "profile" is ultimately just a pre-defined drop-in file for unit files that are attached to a host. They are supposed to mostly contain sand-boxing and security settings, but may actually contain any other settings, too. When a portable service is attached a suitable profile has to be selected. If none is selected explicitly, the default profile called default is used. systemd ships with four different profiles out of the box:

  1. The default profile provides a medium level of security. It contains settings to drop capabilities, enforce system call filters, restrict many kernel interfaces and mount various file systems read-only.

  2. The strict profile is similar to the default profile, but generally uses the most restrictive sand-boxing settings. For example networking is turned off and access to AF_NETLINK sockets is prohibited.

  3. The trusted profile is the least strict of them all. In fact it makes almost no restrictions at all. A service run with this profile has basically full access to the host system.

  4. The nonetwork profile is mostly identical to default, but also turns off network access.

Note that the profile is selected at the time the portable service image is attached, and it applies to all service files attached, in case multiple are shipped in the same image. Thus, the sand-boxing restriction to enforce are selected by the administrator attaching the image and not the image vendor.

Additional profiles can be defined easily by the administrator, if needed. We might also add additional profiles sooner or later to be shipped with systemd out of the box.

What's the use-case for this? If I have containers, why should I bother?

Portable Services are primarily intended to cover use-cases where code should more feel like "extensions" to the host system rather than live in disconnected, separate worlds. The profile concept is supposed to be tunable to the exact right amount of integration or isolation needed for an application.

In the container world the concept of "super-privileged containers" has been touted a lot, i.e. containers that run with full privileges. It's precisely that use-case that portable services are intended for: extensions to the host OS, that default to isolation, but can optionally get as much access to the host as needed, and can naturally take benefit of the full functionality of the host. The concept should hence be useful for all kinds of low-level system software that isn't shipped with the OS itself but needs varying degrees of integration with it. Besides servers and appliances this should be particularly interesting for IoT and embedded devices.

Because portable services are just a relatively small extension to the way system services are otherwise managed, they can be treated like regular service for almost all use-cases: they will appear along regular services in all tools that can introspect systemd unit data, and can be managed the same way when it comes to logging, resource management, runtime life-cycles and so on.

Portable services are a very generic concept. While the original use-case is OS extensions, it's of course entirely up to you and other users to use them in a suitable way of your choice.

Walkthrough

Let's have a look how this all can be used. We'll start with building a portable service image from scratch, before we attach, enable and start it on a host.

Building a Portable Service image

As mentioned, you can use any tool you like that can create OS trees or raw images for building Portable Service images, for example debootstrap or dnf --installroot=. For this example walkthrough run we'll use mkosi, which is ultimately just a fancy wrapper around dnf and debootstrap but makes a number of things particularly easy when repetitively building images from source trees.

I have pushed everything necessary to reproduce this walkthrough locally to a GitHub repository. Let's check it out:

$ git clone https://github.com/systemd/portable-walkthrough.git

Let's have a look in the repository:

  1. First of all, walkthroughd.c is the main source file of our little service. To keep things simple it's written in C, but it could be in any language of your choice. The daemon as implemented won't do much: it just starts up and waits for SIGTERM, at which point it will shut down. It's ultimately useless, but hopefully illustrates how this all fits together. The C code has no dependencies besides libc.

  2. walkthroughd.service is a systemd unit file that starts our little daemon. It's a simple service, hence the unit file is trivial.

  3. Makefile is a short make build script to build the daemon binary. It's pretty trivial, too: it just takes the C file and builds a binary from it. It can also install the daemon. It places the binary in /usr/local/lib/walkthroughd/walkthroughd (why not in /usr/local/bin? because it's not a user-facing binary but a system service binary), and its unit file in /usr/local/lib/systemd/walkthroughd.service. If you want to test the daemon on the host we can now simply run make and then ./walkthroughd in order to check everything works.

  4. mkosi.default is file that tells mkosi how to build the image. We opt for a Fedora-based image here (but we might as well have used Debian here, or any other supported distribution). We need no particular packages during runtime (after all we only depend on libc), but during the build phase we need gcc and make, hence these are the only packages we list in BuildPackages=.

  5. mkosi.build is a shell script that is invoked during mkosi's build logic. All it does is invoke make and make install to build and install our little daemon, and afterwards it extends the distribution-supplied /etc/os-release file with an additional field that describes our portable service a bit.

Let's now use this to build the portable service image. For that we use the mkosi tool. It's sufficient to invoke it without parameter to build the first image: it will automatically discover mkosi.default and mkosi.build which tells it what to do. (Note that if you work on a project like this for a longer time, mkosi -if is probably the better command to use, as it that speeds up building substantially by using an incremental build mode). mkosi will download the necessary RPMs, and put them all together. It will build our little daemon inside the image and after all that's done it will output the resulting image: walkthroughd_1.raw.

Because we opted to build a GPT raw disk image in mkosi.default this file is actually a raw disk image containing a GPT partition table. You can use fdisk -l walkthroughd_1.raw to enumerate the partition table. You can also use systemd-nspawn -i walkthroughd_1.raw to explore the image quickly if you need.

Using the Portable Service Image

Now that we have a portable service image, let's see how we can attach, enable and start the service included within it.

First, let's attach the image:

# /usr/lib/systemd/portablectl attach ./walkthroughd_1.raw
(Matching unit files with prefix 'walkthroughd'.)
Created directory /etc/systemd/system/walkthroughd.service.d.
Written /etc/systemd/system/walkthroughd.service.d/20-portable.conf.
Created symlink /etc/systemd/system/walkthroughd.service.d/10-profile.conf → /usr/lib/systemd/portable/profile/default/service.conf.
Copied /etc/systemd/system/walkthroughd.service.
Created symlink /etc/portables/walkthroughd_1.raw → /home/lennart/projects/portable-walkthrough/walkthroughd_1.raw.

The command will show you exactly what is has been doing: it just copied the main service file out, and added the two drop-ins, as expected.

Let's see if the unit is now available on the host, just like a regular unit, as promised:

# systemctl status walkthroughd.service
● walkthroughd.service - A simple example service
   Loaded: loaded (/etc/systemd/system/walkthroughd.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/walkthroughd.service.d
           └─10-profile.conf, 20-portable.conf
   Active: inactive (dead)

Nice, it worked. We see that the unit file is available and that systemd correctly discovered the two drop-ins. The unit is neither enabled nor started however. Yes, attaching a portable service image doesn't imply enabling nor starting. It just means the unit files contained in the image are made available to the host. It's up to the administrator to then enable them (so that they are automatically started when needed, for example at boot), and/or start them (in case they shall run right-away).

Let's now enable and start the service in one step:

# systemctl enable --now walkthroughd.service
Created symlink /etc/systemd/system/multi-user.target.wants/walkthroughd.service → /etc/systemd/system/walkthroughd.service.

Let's check if it's running:

# systemctl status walkthroughd.service
● walkthroughd.service - A simple example service
   Loaded: loaded (/etc/systemd/system/walkthroughd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/walkthroughd.service.d
           └─10-profile.conf, 20-portable.conf
   Active: active (running) since Wed 2018-06-27 17:55:30 CEST; 4s ago
 Main PID: 45003 (walkthroughd)
    Tasks: 1 (limit: 4915)
   Memory: 4.3M
   CGroup: /system.slice/walkthroughd.service
           └─45003 /usr/local/lib/walkthroughd/walkthroughd

Jun 27 17:55:30 sigma walkthroughd[45003]: Initializing.

Perfect! We can see that the service is now enabled and running. The daemon is running as PID 45003.

Now that we verified that all is good, let's stop, disable and detach the service again:

# systemctl disable --now walkthroughd.service
Removed /etc/systemd/system/multi-user.target.wants/walkthroughd.service.
# /usr/lib/systemd/portablectl detach ./walkthroughd_1.raw
Removed /etc/systemd/system/walkthroughd.service.
Removed /etc/systemd/system/walkthroughd.service.d/10-profile.conf.
Removed /etc/systemd/system/walkthroughd.service.d/20-portable.conf.
Removed /etc/systemd/system/walkthroughd.service.d.
Removed /etc/portables/walkthroughd_1.raw.

And finally, let's see that it's really gone:

# systemctl status walkthroughd
Unit walkthroughd.service could not be found.

Perfect! It worked!

I hope the above gets you started with Portable Services. If you have further questions, please contact our mailing list.

Further Reading

A more low-level document explaining details is shipped along with systemd.

There are also relevant manual pages: portablectl(1) and systemd-portabled(8).

For further information about mkosi see its homepage.

© Lennart Poettering. Built using Pelican. Theme by Giulio Fidente on github. .