Apparently, in some parts of this world, the /usr/-merge transition is
still ongoing. Let's take the opportunity to have a look at one specific
way to benefit from the /usr/-merge (and associated work) IRL.
I develop system-level software as you might know. Oftentimes I want
to run my development code on my PC but be reasonably sure it cannot
destroy or otherwise negatively affect my host system. Now I could set
up a container tree for that, and boot into that. But often I am too
lazy for that, I don't want to bother with a slow package manager
setting up a new OS tree for me. So here's what I often do instead, and
this only works because of the /usr/-merge.
I run a command like the following (without any preparatory work):
systemd-nspawn \
    --directory=/ \
    --volatile=yes \
    -U \
    --set-credential=passwd.hashed-password.root:$(mkpasswd mysecret) \
    --set-credential=firstboot.locale:C.UTF-8 \
    --bind-user=lennart \
    -b
And then I very quickly get a login prompt on a container that runs
the exact same software as my host — but is also isolated from the
host. I do not need to prepare any separate OS tree or anything
else. It just works. And my host user lennart is just there, ready for
me to log into.
So here's what these systemd-nspawn options specifically do:
- --directory=/ tells systemd-nspawn to run off the host OS' file hierarchy. That smells like danger of course, running two OS instances off the same directory hierarchy. But don't be scared, because:
- --volatile=yes enables volatile mode. Specifically this means what we configured with --directory=/ as root file system is slightly rearranged. Instead of mounting that tree as it is, we'll mount a tmpfs instance as actual root file system, and then mount the /usr/ subdirectory of the specified hierarchy into the /usr/ subdirectory of the container file hierarchy in read-only fashion, and only that directory. So now we have a container directory tree that is basically empty, but imports all host OS binaries and libraries into its /usr/ tree. All software installed on the host is also available in the container with no manual work. This mechanism only works because on /usr/-merged OSes vendor resources are monopolized at a single place: /usr/. It's sufficient to share that one directory with the container to get a second instance of the host OS running. Note that this means /etc/ and /var/ will be entirely empty initially when this second system boots up. Thankfully, forward looking distributions (such as Fedora) have adopted systemd-tmpfiles and systemd-sysusers quite pervasively, so that system users and files/directories required for operation are created automatically should they be missing. Thus, even though at boot the mentioned directories are initially empty, once the system is booted up they are sufficiently populated for things to just work.
- -U means we'll enable user namespacing, in fully automatic mode. This does three things: it picks a free host UID range dynamically for the container, then sets up user namespacing for the container processes, mapping that host UID range to UIDs 0…65534 in the container. It then sets up a similar UID mapped mount on the /usr/ tree of the container. Net effect: file ownerships as set on the host OS tree appear as if they belong to the very same users inside of the container environment, except that we use user namespacing for everything, and thus the users are actually neatly isolated from the host.
- --set-credential=passwd.hashed-password.root:$(mkpasswd mysecret) passes a credential to the container. Credentials are bits of data that you can pass to systemd services and whole systems. They are actually awesome concepts (e.g. they support TPM2 authentication/encryption that just works!) but I am not going to go into details around that, given it's off-topic in this specific scenario. Here we just benefit from the fact that systemd-sysusers looks for a credential called passwd.hashed-password.root to initialize the root password of the system from. We set it to mysecret. This means once the system is booted up we can log in as root with the supplied password. Yay. (Remember, /etc/ is initially empty on this container, and thus also carries no /etc/passwd or /etc/shadow, and thus has no root user record, and thus no root password.) mkpasswd is a tool that converts a plain text password into a UNIX hashed password, which is what this specific credential expects.
- Similarly, --set-credential=firstboot.locale:C.UTF-8 tells the systemd-firstboot service in the container to initialize /etc/locale.conf with this locale.
- --bind-user=lennart binds the host user lennart into the container, also as user lennart. This does two things: it mounts the host user's home directory into the container. It also copies a minimal user record of the specified user into the container that nss-systemd then picks up and includes in the regular user database. This means, once the container is booted up I can log in as lennart with my regular password, and once I am logged in I will see my regular host home directory, and can make changes to it. Yippieh! (This does a couple more things, such as UID mapping, but let's not get lost in too much detail.)
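The effect of -U can be observed from a shell inside the booted container by looking at /proc/self/uid_map, which lists the UID mapping of the current user namespace. The following is only a minimal sketch: describe_uid_map is a hypothetical helper name, and it inspects just the first line of the map.

```shell
# describe_uid_map [FILE]
# Hypothetical helper: summarizes the first line of a uid_map file
# (defaults to /proc/self/uid_map). Each line has three columns:
# first UID inside the namespace, first UID outside, range length.
describe_uid_map() (
    awk '
        NR == 1 {
            if ($1 == 0 && $2 == 0 && $3 == 4294967295) {
                # Initial user namespace: nothing is remapped.
                print "identity"
            } else {
                print "mapped inside=" $1 " outside=" $2 " length=" $3
            }
            exit
        }
    ' "${1:-/proc/self/uid_map}"
)
```

Run on the host without arguments this should print identity; inside the container it should instead show the dynamically picked host UID range as the outside value.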
So, if I run this, I will very quickly get a login prompt, where I can
log in as my regular user. I have full access to my host home
directory, but otherwise everything is nicely isolated from the host,
and changes outside of the home directory are either prohibited or are
volatile, i.e. go to a tmpfs instance whose lifetime is bound to the
container's lifetime: when I shut down the container I just started,
then any changes outside of my user's home directory are lost.
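This layout can be checked from a shell inside the container. A minimal sketch, assuming the usual /proc/self/mountinfo format (mount point in field 5, per-mount options in field 6, file system type right after the "-" separator); mount_summary is a hypothetical helper name:

```shell
# mount_summary MOUNTPOINT [MOUNTINFO_FILE]
# Hypothetical helper: prints "<fstype> <ro|rw>" for the topmost mount
# on MOUNTPOINT, or fails if nothing is mounted there.
mount_summary() (
    awk -v mp="$1" '
        $5 == mp {
            # The file system type is the first field after the "-" separator.
            for (i = 7; i <= NF; i++)
                if ($i == "-") { fstype = $(i + 1); break }
            # Per-mount options always start with "ro" or "rw".
            split($6, opts, ",")
            result = fstype " " opts[1]
        }
        END { if (result != "") print result; else exit 1 }
    ' "${2:-/proc/self/mountinfo}"
)

# Usage inside the container:
#   mount_summary /
#   mount_summary /usr
```

In the volatile container the first call should report a read-write tmpfs and the second a read-only mount; the file system type shown for /usr/ depends on what the host uses.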
Note that while here I use --volatile=yes in combination with
--directory=/ you can actually use it on any OS hierarchy, i.e. just
about any directory that contains OS binaries.
Similarly, the --bind-user= stuff works with any OS hierarchy too (but
do note that only systemd 249 and newer will pick up the user records
passed to the container that way, i.e. this requires at least v249
both on the host and in the container to work).
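To reuse the invocation with other OS hierarchies and users, it can be wrapped in a small function. This is a sketch only: nspawn_volatile and the DRY_RUN variable are hypothetical names, and the options are exactly the ones from the command shown earlier.

```shell
# nspawn_volatile DIRECTORY USER HASHED_PASSWORD
# Hypothetical wrapper around the invocation from this post. With
# DRY_RUN set it only prints the command, one word per line.
nspawn_volatile() (
    directory="$1" user="$2" hashed_password="$3"
    set -- \
        "--directory=${directory}" \
        --volatile=yes \
        -U \
        "--set-credential=passwd.hashed-password.root:${hashed_password}" \
        --set-credential=firstboot.locale:C.UTF-8 \
        "--bind-user=${user}" \
        -b
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' systemd-nspawn "$@"
    else
        systemd-nspawn "$@"
    fi
)

# Usage (as root; /var/lib/machines/mytree is a made-up example path):
#   nspawn_volatile /var/lib/machines/mytree lennart "$(mkpasswd mysecret)"
```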
Or in short: the possibilities are endless!
Requirements
For this all to work, you need:
- A recent kernel (5.15 should suffice, as it brings UID mapped mounts for the most common file systems, so that -U and --bind-user= can work well).
- A recent systemd (249 should suffice, which brings --bind-user=, and a -U switch backed by UID mapped mounts).
- A distribution that adopted the /usr/-merge, systemd-tmpfiles and systemd-sysusers so that the directory hierarchy and user databases are automatically populated when empty at boot. (Fedora 35 should suffice.)
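These requirements can be roughly checked up front. The sketch below uses hypothetical helper names; the version comparison only looks at the leading major and minor numbers, and the /usr/-merge test is a heuristic that merely checks whether /bin and /lib are symlinks.

```shell
# version_at_least VERSION MAJOR [MINOR]
# Succeeds if the leading "major.minor" of VERSION is >= MAJOR.MINOR.
version_at_least() (
    have="$1" want_major="$2" want_minor="${3:-0}"
    major="${have%%[!0-9]*}"            # leading digits
    [ -n "$major" ] || exit 1
    rest="${have#$major}"               # whatever follows the major number
    case "$rest" in
        .[0-9]*) rest="${rest#.}"; minor="${rest%%[!0-9]*}" ;;
        *)       minor=0 ;;
    esac
    [ "$major" -gt "$want_major" ] ||
        { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
)

# is_usr_merged [ROOT]
# Heuristic: on a /usr/-merged tree, /bin and /lib are symlinks.
is_usr_merged() (
    root="${1:-}"
    [ -L "${root}/bin" ] && [ -L "${root}/lib" ]
)

check_requirements() {
    version_at_least "$(uname -r)" 5 15 ||
        echo "kernel is older than 5.15"
    version_at_least "$(systemctl --version 2>/dev/null | awk 'NR == 1 { print $2 }')" 249 ||
        echo "systemd is older than 249 (or was not found)"
    is_usr_merged ||
        echo "/bin or /lib is not a symlink, so probably no /usr/-merge"
}
```

Calling check_requirements prints one line per unmet requirement and nothing if all three checks pass. It does not verify that the distribution actually populates /etc/ and /var/ via systemd-tmpfiles and systemd-sysusers.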
Limitations
While a lot of today's software works well out of the box on systems
that come up with an unpopulated /etc/ and /var/, and either falls back
to reasonable built-in defaults or deploys systemd-tmpfiles to create
what is missing, things aren't perfect: some software typically
installed on desktop OSes will fail to start when invoked in such a
container, and be visible as ugly failed services, but it won't stop me
from logging in and using the system for what I want to use it for. It
would be excellent to get that fixed, though. This can either be fixed
in the relevant software upstream (i.e. if opening your configuration
file fails with ENOENT, then just fall back to reasonable defaults), or
in the distribution packaging (i.e. add a tmpfiles.d/ file that copies
or symlinks in skeleton configuration from /usr/share/factory/etc/ via
the C or L line types).
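For the packaging route, a tmpfiles.d/ snippet could look roughly like the one written by the sketch below. It assumes an imaginary package foo that ships its skeleton configuration under /usr/share/factory/etc/; the file names and the helper name are made up for illustration.

```shell
# write_factory_snippet DEST_DIR
# Hypothetical helper: writes a tmpfiles.d/ snippet for an imaginary
# package "foo" into DEST_DIR (a package would ship it in
# /usr/lib/tmpfiles.d/).
write_factory_snippet() (
    dest_dir="$1"
    mkdir -p "$dest_dir"
    cat > "${dest_dir}/foo.conf" <<'EOF'
# Type Path          Mode User Group Age Argument
# Copy the skeleton file if /etc/foo.conf does not exist yet.
C /etc/foo.conf      -    -    -     -   /usr/share/factory/etc/foo.conf
# Symlink the skeleton directory if /etc/foo.d does not exist yet.
L /etc/foo.d         -    -    -     -   /usr/share/factory/etc/foo.d
EOF
)
```

If the argument column of a C or L line is left as -, systemd-tmpfiles uses the path of the same name below /usr/share/factory/ as the source, so the explicit arguments above are only spelled out for clarity.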
And then there's certain software dealing with hardware management and
similar that simply cannot reasonably work in a container (as device
APIs on Linux are generally not virtualized for containers). It would
be excellent if software like that were updated to carry
ConditionVirtualization=!container or ConditionPathIsReadWrite=/sys
conditionalization in their unit files, so that it is automatically,
and cleanly, skipped when executed in such a container environment.
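Until a unit carries such a condition itself, a packager could add it through a drop-in. The sketch below writes one for a hypothetical hardware-daemon.service; the unit name, drop-in file name and helper name are all assumptions. Since only /usr/ is shared with the volatile container, the drop-in would have to live below /usr/lib/systemd/system/ to be seen there.

```shell
# write_container_skip_dropin UNIT [UNIT_DIR]
# Hypothetical helper: writes a drop-in that makes UNIT get skipped
# when it is started inside a container.
write_container_skip_dropin() (
    unit="$1" unit_dir="${2:-/usr/lib/systemd/system}"
    dropin_dir="${unit_dir}/${unit}.d"
    mkdir -p "$dropin_dir"
    cat > "${dropin_dir}/10-skip-in-container.conf" <<'EOF'
[Unit]
# Skip this unit cleanly when running inside a container.
ConditionVirtualization=!container
EOF
)

# Usage (as root; the unit name is made up):
#   write_container_skip_dropin hardware-daemon.service
```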
And that's all for now.