Running a Container off the Host /usr/

Apparently, in some parts of this world, the /usr/-merge transition is still ongoing. Let's take the opportunity to look at one specific way to benefit from the /usr/-merge (and associated work) IRL.

I develop system-level software as you might know. Oftentimes I want to run my development code on my PC but be reasonably sure it cannot destroy or otherwise negatively affect my host system. Now I could set up a container tree for that, and boot into that. But often I am too lazy for that; I don't want to bother with a slow package manager setting up a new OS tree for me. So here's what I often do instead, and this only works because of the /usr/-merge.

I run a command like the following (without any preparatory work):

systemd-nspawn \
        --directory=/ \
        --volatile=yes \
        -U \
        --set-credential=passwd.hashed-password.root:$(mkpasswd mysecret) \
        --set-credential=firstboot.locale:C.UTF-8 \
        --bind-user=lennart \
        -b

And then I very quickly get a login prompt on a container that runs the exact same software as my host — but is also isolated from the host. I do not need to prepare any separate OS tree or anything else. It just works. And my host user lennart is just there, ready for me to log into.

So here's what these systemd-nspawn options specifically do:

--directory=/ tells systemd-nspawn to run off the host OS' file hierarchy. That smells like danger of course, running two OS instances off the same directory hierarchy. But don't be scared, because:
--volatile=yes enables volatile mode. Specifically, this means that what we configured with --directory=/ as root file system is slightly rearranged. Instead of mounting that tree as it is, we'll mount a tmpfs instance as the actual root file system, and then mount the /usr/ subdirectory of the specified hierarchy into the /usr/ subdirectory of the container file hierarchy in read-only fashion – and only that directory. So now we have a container directory tree that is basically empty, but imports all host OS binaries and libraries into its /usr/ tree. All software installed on the host is also available in the container with no manual work. This mechanism only works because on /usr/-merged OSes vendor resources are monopolized at a single place: /usr/. It's sufficient to share that one directory with the container to get a second instance of the host OS running. Note that this means /etc/ and /var/ will be entirely empty initially when this second system boots up. Thankfully, forward-looking distributions (such as Fedora) have adopted systemd-tmpfiles and systemd-sysusers quite pervasively, so that system users and files/directories required for operation are created automatically should they be missing. Thus, even though at boot the mentioned directories are initially empty, once the system is booted up they are sufficiently populated for things to just work.
-U means we'll enable user namespacing, in fully automatic mode. This does three things: it picks a free host UID range dynamically for the container, then sets up user namespacing for the container processes, mapping that host UID range to UIDs 0…65534 in the container. It then sets up a similar UID mapped mount on the /usr/ tree of the container. Net effect: file ownerships as set on the host OS tree appear as if they belong to the very same users inside of the container environment, except that we use user namespacing for everything, and thus the users are actually neatly isolated from the host.
--set-credential=passwd.hashed-password.root:$(mkpasswd mysecret) passes a credential to the container. Credentials are bits of data that you can pass to systemd services and whole systems. They are actually an awesome concept (e.g. they support TPM2 authentication/encryption that just works!) but I am not going to go into details around that, given it's off-topic in this specific scenario. Here we just take advantage of the fact that systemd-sysusers looks for a credential called passwd.hashed-password.root to initialize the root password of the system from. We set it to the hash of mysecret. This means once the system is booted up we can log in as root with the supplied password. Yay. (Remember, /etc/ is initially empty on this container, and thus also carries no /etc/passwd or /etc/shadow, and thus has no root user record, and thus no root password.)

mkpasswd is a tool that converts a plain text password into a UNIX hashed password, which is what this specific credential expects.
Similarly, --set-credential=firstboot.locale:C.UTF-8 tells the systemd-firstboot service in the container to initialize /etc/locale.conf with this locale.
--bind-user=lennart binds the host user lennart into the container, also as user lennart. This does two things: it mounts the host user's home directory into the container, and it copies a minimal user record of the specified user into the container that nss-systemd then picks up and includes in the regular user database. This means, once the container is booted up I can log in as lennart with my regular password, and once I have logged in I will see my regular host home directory, and can make changes to it. Yippieh! (This does a couple more things, such as UID mapping, but let's not get lost in too much detail.)
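
To make the -U mapping a bit more concrete, here is a minimal sketch of the arithmetic behind it. The base value used below is made up for illustration; systemd-nspawn picks the real one dynamically (as far as I can tell from its documentation, at a multiple of 65536), so treat the number as an assumption, not something to rely on.

```shell
#!/bin/sh
# Hypothetical base of the host UID range assigned to the container.
# Assumption for illustration only: with -U, systemd-nspawn chooses this itself.
BASE=1234567168   # 18838 * 65536

# Translate a UID as seen inside the container into the host UID backing it.
container_to_host_uid() {
    echo $(( BASE + $1 ))
}

container_to_host_uid 0      # root inside the container
container_to_host_uid 1000   # some unprivileged UID inside the container
```

So a process that believes it is root in the container is just an unprivileged, otherwise unused UID from the host's point of view.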

So, if I run this, I will very quickly get a login prompt, where I can log in as my regular user. I have full access to my host home directory, but otherwise everything is nicely isolated from the host, and changes outside of the home directory are either prohibited or are volatile, i.e. go to a tmpfs instance whose lifetime is bound to the container's lifetime: when I shut down the container I just started, then any changes outside of my user's home directory are lost.

Note that while here I use --volatile=yes in combination with --directory=/ you can actually use it on any OS hierarchy, i.e. just about any directory that contains OS binaries.
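
As a rough sketch of that, the same switches can be wrapped in a small shell function that takes the OS tree as its argument. The function name, the NSPAWN override variable and the example path are my own placeholders here, not anything systemd provides; only the systemd-nspawn switches are the ones used above.

```shell
#!/bin/sh
# Boot an arbitrary OS tree in volatile mode: tmpfs root, read-only /usr/.
# NSPAWN is a hypothetical override hook (not a systemd feature), so that the
# command line can be inspected with NSPAWN=echo before running it for real.
boot_volatile() {
    tree=$1
    ${NSPAWN:-systemd-nspawn} \
        --directory="$tree" \
        --volatile=yes \
        -U \
        -b
}

# Example (hypothetical path; any directory holding a /usr/-merged OS tree):
#   boot_volatile /var/lib/machines/fedora
```

Running it with NSPAWN=echo first just prints the arguments that would be passed, which is handy for checking what is about to be booted.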

Similarly, the --bind-user= stuff works with any OS hierarchy too (but do note that only systemd 249 and newer will pick up the user records passed to the container that way, i.e. this requires at least v249 both on the host and in the container to work).

Or in short: the possibilities are endless!

Requirements

For this all to work, you need:

A recent kernel (5.15 should suffice, as it brings UID mapped mounts for the most common file systems, so that -U and --bind-user= can work well.)
A recent systemd (249 should suffice, which brings --bind-user=, and a -U switch backed by UID mapped mounts).
A distribution that adopted the /usr/-merge, systemd-tmpfiles and systemd-sysusers so that the directory hierarchy and user databases are automatically populated when empty at boot. (Fedora 35 should suffice.)
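
If you want a quick check before trying this, the three requirements can be probed roughly like this. This is a heuristic sketch, not an authoritative test: the function names are made up, the /usr/-merge check merely looks at whether /bin is a symlink, and it says nothing about how thoroughly your distribution has adopted systemd-tmpfiles and systemd-sysusers.

```shell
#!/bin/sh
# version_ge A B: succeed if dotted version A is >= B (relies on "sort -V").
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# is_usr_merged ROOT: succeed if ROOT/bin is a symlink, as on /usr/-merged systems.
is_usr_merged() {
    [ -L "${1%/}/bin" ]
}

check_host() {
    # Strip distribution suffixes such as "-200.fc35.x86_64" from the release.
    kernel=$(uname -r | cut -d- -f1)
    version_ge "$kernel" 5.15 && echo "kernel $kernel: ok" || echo "kernel $kernel: too old"

    # The first line of "systemctl --version" starts with "systemd <version>".
    sd=$(systemctl --version 2>/dev/null | awk 'NR==1 {print $2}')
    version_ge "${sd:-0}" 249 && echo "systemd $sd: ok" || echo "systemd ${sd:-missing}: too old"

    is_usr_merged / && echo "/usr/-merge: ok" || echo "/usr/-merge: not detected"
}

check_host
```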

Limitations

While a lot of today's software actually works well out of the box on systems that come up with an unpopulated /etc/ and /var/, and either falls back to reasonable built-in defaults or deploys systemd-tmpfiles to create what is missing, things aren't perfect: some software typically installed on desktop OSes will fail to start when invoked in such a container, and be visible as ugly failed services, but it won't stop me from logging in and using the system for what I want to use it for. It would be excellent to get that fixed, though. This can either be fixed in the relevant software upstream (i.e. if opening your configuration file fails with ENOENT, then just fall back to reasonable defaults), or in the distribution packaging (i.e. add a tmpfiles.d/ file that copies or symlinks in skeleton configuration from /usr/share/factory/etc/ via the C or L line types).
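
For the packaging route, a tmpfiles.d/ fragment along the following lines might do. The package name "foo" and its file names are purely hypothetical placeholders; the C and L line types are the ones mentioned above, and the little shell wrapper is only there to print the fragment.

```shell
#!/bin/sh
# Emit a tmpfiles.d fragment for a hypothetical package "foo".
# Package name and paths are placeholders, not taken from any real package.
write_foo_tmpfiles() {
    cat <<'EOF'
# Copy the skeleton configuration if /etc/foo.conf does not exist yet.
C /etc/foo.conf - - - - /usr/share/factory/etc/foo.conf
# Symlink a directory of defaults instead of copying it.
L /etc/foo.d - - - - /usr/share/factory/etc/foo.d
EOF
}

# A package would ship the output below /usr/, e.g.:
#   write_foo_tmpfiles > /usr/lib/tmpfiles.d/foo.conf
write_foo_tmpfiles
```

Since the fragment lives in /usr/, it is visible inside the volatile container too, and systemd-tmpfiles can recreate the configuration in the empty /etc/ at boot.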

And then there's certain software dealing with hardware management and similar that simply cannot reasonably work in a container (as device APIs on Linux are generally not virtualized for containers). It would be excellent if software like that were updated to carry ConditionVirtualization=!container or ConditionPathIsReadWrite=/sys conditionalization in its unit files, so that it is automatically – cleanly – skipped when executed in such a container environment.
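
Until upstream unit files carry such conditions, a drop-in can add them. Here's a minimal sketch for a hypothetical service called foo-hwd.service; the service name, the function name and the drop-in file name are invented for illustration, only the condition itself is the one named above.

```shell
#!/bin/sh
# Print a drop-in for a hypothetical hardware service, "foo-hwd.service".
print_dropin() {
    cat <<'EOF'
[Unit]
# Skip this unit cleanly when running inside any container manager.
# (ConditionPathIsReadWrite=/sys would be the alternative mentioned above.)
ConditionVirtualization=!container
EOF
}

# Only /usr/ is shared with the volatile container, so the drop-in has to
# live there to be seen inside it, e.g.:
#   mkdir -p /usr/lib/systemd/system/foo-hwd.service.d
#   print_dropin > /usr/lib/systemd/system/foo-hwd.service.d/10-container.conf
print_dropin
```

A unit whose condition fails is skipped rather than marked as failed, which is exactly the clean behavior we want here.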

And that's all for now.
