Apparently, in some parts of this world, the /usr/-merge transition
is still ongoing. Let's take the opportunity to have a look at one
specific way to take advantage of the /usr/-merge (and associated
work) IRL.
I develop system-level software as you might know. Oftentimes I want
to run my development code on my PC but be reasonably sure it cannot
destroy or otherwise negatively affect my host system. Now I could set
up a container tree for that, and boot into that. But often I am too
lazy for that: I don't want to bother with a slow package manager
setting up a new OS tree for me. So here's what I often do instead,
and this only works because of the /usr/-merge: I run a command like
the following (without any preparatory work):
    systemd-nspawn \
        --directory=/ \
        --volatile=yes \
        -U \
        --set-credential=passwd.hashed-password.root:$(mkpasswd mysecret) \
        --set-credential=firstboot.locale:C.UTF-8 \
        --bind-user=lennart \
        -b
And then I very quickly get a login prompt on a container that runs
the exact same software as my host — but is also isolated from the
host. I do not need to prepare any separate OS tree or anything
else. It just works. And my host user lennart is just there, ready
for me to log in.
So here's what these options specifically do:
--directory=/ tells systemd-nspawn to run off the host OS' file hierarchy. That smells like danger of course, running two OS instances off the same directory hierarchy. But don't be scared, because:
--volatile=yes enables volatile mode. Specifically this means what we configured with --directory=/ as root file system is slightly rearranged. Instead of mounting that tree as it is, we'll mount a tmpfs instance as actual root file system, and then mount the /usr/ subdirectory of the specified hierarchy into the /usr/ subdirectory of the container file hierarchy in read-only fashion, and only that directory. So now we have a container directory tree that is basically empty, but imports all host OS binaries and libraries into its /usr/ tree. All software installed on the host is also available in the container with no manual work. This mechanism only works because on /usr/-merged OSes vendor resources are monopolized at a single place: /usr/. It's sufficient to share that one directory with the container to get a second instance of the host OS running. Note that this means /etc/ and /var/ will be entirely empty initially when this second system boots up. Thankfully, forward-looking distributions (such as Fedora) have adopted systemd-tmpfiles and systemd-sysusers quite pervasively, so that system users and files/directories required for operation are created automatically should they be missing. Thus, even though at boot the mentioned directories are initially empty, once the system is booted up they are sufficiently populated for things to just work.
-U means we'll enable user namespacing, in fully automatic mode. This does three things: it picks a free host UID range dynamically for the container, then sets up user namespacing for the container processes, mapping that host UID range to UIDs 0…65534 in the container. It then sets up a similar UID mapped mount on the /usr/ tree of the container. Net effect: file ownerships as set on the host OS tree appear as if they belong to the very same users inside of the container environment, except that we use user namespacing for everything, and thus the users are actually neatly isolated from the host.
--set-credential=passwd.hashed-password.root:$(mkpasswd mysecret) passes a credential to the container. Credentials are bits of data that you can pass to systemd services and whole systems. They are actually awesome concepts (e.g. they support TPM2 authentication/encryption that just works!) but I am not going to go into details around that, given it's off-topic in this specific scenario. Here we just take advantage of the fact that systemd-sysusers looks for a credential called passwd.hashed-password.root to initialize the root password of the system from. We set it to mysecret. This means once the system is booted up we can log in as root with the supplied password. Yay. (Remember, /etc/ is initially empty on this container, and thus also carries no /etc/shadow, and thus has no root user record, and thus no root password.) mkpasswd is a tool that converts a plain text password into a UNIX hashed password, which is what this specific credential expects.
--set-credential=firstboot.locale:C.UTF-8 tells the systemd-firstboot service in the container to initialize /etc/locale.conf with this locale.
--bind-user=lennart binds the host user lennart into the container, also as user lennart. This does two things: it mounts the host user's home directory into the container. It also copies a minimal user record of the specified user into the container that nss-systemd then picks up and includes in the regular user database. This means, once the container is booted up I can log in as lennart with my regular password, and once I have logged in I will see my regular host home directory, and can make changes to it. Yippieh! (This does a couple more things, such as UID mapping and things, but let's not get lost in too much detail.)
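To make the UID mapping that -U sets up a bit more concrete: the kernel exposes the mapping that is in effect for a process in /proc/self/uid_map, one range per line, as "first UID inside the namespace, first UID outside, length". The following is only a minimal sketch. The helper name map_uid is made up for illustration, and nothing here is specific to systemd-nspawn. It translates a UID as seen inside the container into the host UID behind it:

```shell
#!/bin/sh
# map_uid CONTAINER_UID [UID_MAP_FILE]
# Print the outside (host) UID that CONTAINER_UID maps to, based on a file
# in /proc/<pid>/uid_map format. Returns non-zero if the UID is unmapped.
# (Hypothetical helper, for illustration only.)
map_uid() {
    awk -v uid="$1" '
        # Fields: $1 = first UID inside, $2 = first UID outside, $3 = length
        uid >= $1 && uid < $1 + $3 { print $2 + (uid - $1); found = 1; exit }
        END { if (!found) exit 1 }
    ' "${2:-/proc/self/uid_map}"
}

# Where is UID 0 of the current namespace mapped to?
if [ -r /proc/self/uid_map ]; then
    map_uid 0 || echo "UID 0 is not mapped"
fi
```

Run on the host this typically just prints 0, since the initial user namespace maps every UID to itself. Run inside the container it should print the start of whatever host UID range was picked for it.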
So, if I run this, I will very quickly get a login prompt, where I can
log in as my regular user. I have full access to my host home
directory, but otherwise everything is nicely isolated from the host,
and changes outside of the home directory are either prohibited or are
volatile, i.e. go to a tmpfs instance whose lifetime is bound to the
container's lifetime: when I shut down the container I just started,
then any changes outside of my user's home directory are lost.
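If you want to convince yourself of this from a shell inside the container, the mount table is a good place to look. Here is a small sketch, assuming the layout described above (a tmpfs as root file system, /usr/ mounted read-only). The function names fstype_of and is_read_only are made up for this example, and they simply parse /proc/self/mounts:

```shell
#!/bin/sh
# fstype_of MOUNTPOINT [MOUNTS_FILE]
# Print the file system type of the topmost mount at MOUNTPOINT.
fstype_of() {
    awk -v mp="$1" '
        $2 == mp { t = $3 }              # the last match is the topmost mount
        END { if (t != "") print t }
    ' "${2:-/proc/self/mounts}"
}

# is_read_only MOUNTPOINT [MOUNTS_FILE]
# Succeed if the topmost mount at MOUNTPOINT carries the "ro" option.
is_read_only() {
    awk -v mp="$1" '
        $2 == mp { opts = $4 }
        END { exit ((("," opts ",") ~ /,ro,/) ? 0 : 1) }
    ' "${2:-/proc/self/mounts}"
}

printf '/ is on: %s\n' "$(fstype_of /)"
if is_read_only /usr; then
    echo "/usr is mounted read-only"
else
    echo "/usr is writable, or not a separate mount"
fi
```

In a volatile container I would expect this to report tmpfs for / and a read-only /usr, while on the host it will usually report your regular root file system.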
Note that while here I use --volatile=yes in combination with
--directory=/ you can actually use it on any OS hierarchy, i.e. just
about any directory that contains OS binaries. The --bind-user= stuff
works with any OS hierarchy too (but do note that only systemd 249
and newer will pick up the user records passed to the container that
way, i.e. this requires at least v249 both on the host and in the
container to work).
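Since the same switches apply to other OS trees, it can be handy to keep them in one place. The following is a sketch only: the function name nspawn_volatile_args and the tree path /var/lib/machines/mytree are hypothetical, and the function merely prints the arguments (one per line) rather than booting anything, so it is safe to run as a dry run:

```shell
#!/bin/sh
# nspawn_volatile_args TREE USER HASH
# Print, one per line, the systemd-nspawn arguments used in this post, but
# for an arbitrary OS tree and user. HASH is a UNIX hashed password, such
# as the output of "mkpasswd mysecret".
# (Hypothetical helper, for illustration only.)
nspawn_volatile_args() {
    printf '%s\n' \
        "--directory=$1" \
        --volatile=yes \
        -U \
        "--set-credential=passwd.hashed-password.root:$3" \
        --set-credential=firstboot.locale:C.UTF-8 \
        "--bind-user=$2" \
        -b
}

# Dry run: show the arguments for a made-up tree and a placeholder hash.
nspawn_volatile_args /var/lib/machines/mytree lennart '<hashed-password>'
```

To actually boot such a tree you would pass these arguments to systemd-nspawn as root, with a real hash in place of the placeholder.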
Or in short: the possibilities are endless!
For this all to work, you need:
A recent kernel (5.15 should suffice, as it brings UID mapped mounts for the most common file systems, so that --bind-user= can work well.)
A recent systemd (249 should suffice, which brings --bind-user=, and a -U switch backed by UID mapped mounts).
A distribution that adopted the /usr/-merge, systemd-tmpfiles and systemd-sysusers so that the directory hierarchy and user databases are automatically populated when empty at boot. (Fedora 35 should suffice.)
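The first two requirements are easy to check mechanically. Below is a rough sketch of such a check, using the thresholds from the list above (kernel 5.15, systemd 249). The helper names kernel_at_least and systemd_at_least are made up, and the parsing assumes the usual "uname -r" and "systemd-nspawn --version" output formats:

```shell
#!/bin/sh
# kernel_at_least RELEASE MAJOR MINOR
# Succeed if a kernel release string such as "5.15.0-91-generic" is at
# least MAJOR.MINOR. (Hypothetical helper, for illustration only.)
kernel_at_least() {
    major=${1%%.*}             # "5.15.0-91-generic" -> "5"
    rest=${1#*.}               # -> "15.0-91-generic"
    minor=${rest%%[!0-9]*}     # -> "15"
    [ "$major" -gt "$2" ] ||
        { [ "$major" -eq "$2" ] && [ "${minor:-0}" -ge "$3" ]; }
}

# systemd_at_least VERSION_LINE WANTED
# VERSION_LINE is the first line of "systemd-nspawn --version", for
# example "systemd 249 (249.11-0ubuntu3)".
systemd_at_least() {
    ver=${1#systemd }          # drop the leading "systemd " word
    ver=${ver%%[!0-9]*}        # keep only the leading digits
    [ "${ver:-0}" -ge "$2" ]
}

if kernel_at_least "$(uname -r)" 5 15; then
    echo "kernel: ok"
else
    echo "kernel: older than 5.15"
fi

if command -v systemd-nspawn >/dev/null 2>&1 &&
   systemd_at_least "$(systemd-nspawn --version | head -n 1)" 249; then
    echo "systemd-nspawn: ok"
else
    echo "systemd-nspawn: missing or older than 249"
fi
```

Remember that for --bind-user= the systemd version inside the container matters too, which is automatically the same one when booting off the host's /usr/.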
While a lot of today's software actually works well out of the box on
systems that come up with an unpopulated /etc/ and /var/, and either
falls back to reasonable built-in defaults or deploys
systemd-tmpfiles to create what is missing, things aren't perfect:
some software typically installed on desktop OSes will fail to start
when invoked in such a container, and be visible as ugly failed
services, but it won't stop me from logging in and using the system
for what I want to use it for. It would be excellent to get that
fixed, though. This can either be fixed in the relevant software
upstream (i.e. if opening your configuration file fails because the
file does not exist, just fall back to reasonable defaults), or in
the distribution packaging (i.e. add a tmpfiles.d/ file that copies
or symlinks in skeleton configuration from /usr/share/factory/etc/
via the C and L line types).
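For the packaging route, such a file can be tiny. The sketch below prints an example fragment for a made-up package called myapp (the package name and its paths are hypothetical). It relies on the systemd-tmpfiles behaviour that C and L lines without a source argument take their source from the same path below /usr/share/factory/:

```shell
#!/bin/sh
# Print an example tmpfiles.d fragment that populates /etc/ from
# /usr/share/factory/etc/. "myapp" and its paths are hypothetical.
factory_fragment() {
    cat <<'EOF'
# Copy the skeleton configuration if /etc/myapp.conf does not exist yet.
# With no source argument, /usr/share/factory/etc/myapp.conf is used.
C /etc/myapp.conf - - - -
# Alternatively, symlink a directory instead of copying it:
L /etc/myapp.d - - - -
EOF
}

factory_fragment
```

A package would ship this output as something like /usr/lib/tmpfiles.d/myapp.conf, next to the skeleton files themselves under /usr/share/factory/etc/.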
And then there's certain software dealing with hardware management and
similar that simply cannot reasonably work in a container (as device
APIs on Linux are generally not virtualized for containers). It would
be excellent if software like that were updated to carry
ConditionPathIsReadWrite=/sys conditionalization in their unit
files, so that it is automatically and cleanly skipped when executed
in such a container environment.
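Until a unit carries that condition upstream, a drop-in can add it. This is just a sketch: the unit name hwmanager.service is made up, and the function writes into whatever directory you hand it, so the demonstration below uses a scratch directory and touches nothing on the system:

```shell
#!/bin/sh
# write_sys_condition_dropin UNIT_DIR UNIT
# Create UNIT_DIR/UNIT.d/10-skip-in-container.conf with the condition.
# (Hypothetical helper, for illustration only.)
write_sys_condition_dropin() {
    dir=$1/$2.d
    mkdir -p "$dir" &&
    cat > "$dir/10-skip-in-container.conf" <<'EOF'
[Unit]
# Skip this unit cleanly when /sys is not writable, as in a container.
ConditionPathIsReadWrite=/sys
EOF
}

# Demonstration in a scratch directory with a made-up unit name.
scratch=$(mktemp -d)
write_sys_condition_dropin "$scratch" hwmanager.service
cat "$scratch/hwmanager.service.d/10-skip-in-container.conf"
```

For the container setup described here the drop-in would have to live below /usr/lib/systemd/system/ to be visible inside the container, since only /usr/ is shared with it.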
And that's all for now.