TL;DR: Tag your GPT partitions with the right, descriptive partition types, and the world will become a better place.
A number of years ago we started the Discoverable Partitions Specification which defines GPT partition type UUIDs and partition flags for the various partitions Linux systems typically deal with. Before the specification all Linux partitions usually just used the same type, basically saying "Hey, I am a Linux partition" and not much else. With this specification the GPT partition type, flags and label system becomes a lot more expressive, as it can tell you:
- What kind of data a partition contains (e.g. is this swap data, a file system or Verity data?)
- What the purpose/mount point of a partition is (e.g. is this a /home/ partition or a root file system?)
- What CPU architecture a partition is intended for (e.g. is this a root partition for x86-64 or for aarch64?)
- Shall this partition be mounted automatically? (i.e. without specifically being configured via /etc/fstab)
- And if so, shall it be mounted read-only?
- And if so, shall the file system be grown to its enclosing partition size, if smaller?
- Which partition contains the newer version of the same data (e.g. multiple root file systems, with different versions)
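To give a feeling for how much a single partition table entry can say, here is a minimal Python sketch that turns a type UUID and the 64-bit GPT attribute field into a human-readable description. The handful of UUIDs and the flag bit positions (63 for "do not auto-mount", 60 for "read-only", 59 for "grow file system") are an excerpt of what the specification defines and should be double-checked there; the function and table names are made up for this illustration.

```python
# Excerpt of partition type UUIDs from the Discoverable Partitions
# Specification (verify against the specification before relying on them).
PARTITION_TYPES = {
    "4f68bce3-e8cd-4db1-96e7-fbcaf984b709": "root file system (x86-64)",
    "b921b045-1df0-41c3-af44-4c6f280d3fae": "root file system (aarch64)",
    "933ac7e1-2eb4-4f13-b844-0e14e2aef915": "/home/",
    "0657fd6d-a4ab-43c4-84e5-0933c84b4f4f": "swap",
    "0fc63daf-8483-4772-8e79-3d69d8477de4": "generic Linux data",
}

# Partition flags, stored in the 64-bit GPT attribute field.
FLAG_NO_AUTO = 1 << 63    # do not mount automatically
FLAG_READ_ONLY = 1 << 60  # mount read-only
FLAG_GROWFS = 1 << 59     # grow file system to partition size


def describe_partition(type_uuid: str, flags: int) -> str:
    """Hypothetical helper: summarize what a GPT entry says about itself."""
    kind = PARTITION_TYPES.get(type_uuid.lower(), "unknown type")
    notes = ["not auto-mounted" if flags & FLAG_NO_AUTO else "auto-mounted"]
    if flags & FLAG_READ_ONLY:
        notes.append("read-only")
    if flags & FLAG_GROWFS:
        notes.append("grow file system to partition size")
    return f"{kind}: " + ", ".join(notes)


print(describe_partition("933AC7E1-2EB4-4F13-B844-0E14E2AEF915", FLAG_READ_ONLY))
# prints "/home/: auto-mounted, read-only"
```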
By embedding all of this information inside the GPT partition table, disk images become self-descriptive: without requiring any other source of information (such as /etc/fstab), if you look at a compliant GPT disk image it is clear how the image is put together and how it should be used and mounted. This self-descriptiveness in particular breaks one philosophical weirdness of traditional Linux installations: the original source of information about which file system is the root file system is typically embedded in the root file system itself, in /etc/fstab. Thus, in a way, in order to know what the root file system is you need to know what the root file system is. 🤯 🤯 🤯
(Of course, the way this recursion is traditionally broken up is by then copying the root file system information from /etc/fstab into the boot loader configuration, resulting in a situation where the primary source of information for this, i.e. /etc/fstab, is actually mostly irrelevant, and the secondary source, i.e. the copy in the boot loader, becomes the configuration that actually matters.)
Today, the GPT partition type UUIDs defined by the specification have been adopted quite widely, by distributions and their installers, as well as a variety of partitioning tools and other tools.
In this article I want to highlight how the various tools the systemd project provides make use of the concepts the specification introduces.
But before we start with that, let's underline why tagging partitions with these descriptive partition type UUIDs (and the associated partition flags) is a good thing, besides the philosophical points made above.
- Simplicity: in particular OS installers become simpler. Adjusting /etc/fstab as part of the installation is not necessary anymore, as the partitioning step already puts all information into place for assembling the system properly at boot. I.e. installing doesn't mean that you always have to get fdisk and /etc/fstab into place; the former suffices entirely.
- Robustness: since partition tables mostly remain static after installation the chance of corruption is much lower than if the data is stored in file systems (e.g. in /etc/fstab). Moreover, by associating the metadata directly with the objects it describes the chance of things getting out of sync is reduced. (E.g. if you lose /etc/fstab, or forget to rerun your initrd builder, you still know what a partition is supposed to be just by looking at it.)
- Programmability: if partitions are self-descriptive it's much easier to automatically process them with various tools. In fact, this blog story is mostly about that: various systemd tools can naturally process disk images prepared like this.
- Alternative entry points: on traditional disk images, the boot loader needs to be told which kernel command line option root= to use, which then provides access to the root file system, where /etc/fstab is then found, which describes the rest of the file systems. Where precisely root= is configured for the boot loader highly depends on the boot loader and distribution used, and is typically encoded in a Turing complete programming language (Grub…). This makes it very hard to automatically determine the right root file system to use, to implement alternative entry points to the system. By alternative entry points I mean other ways to boot the disk image, specifically for running it as a systemd-nspawn container (but this extends to other mechanisms where the boot loader may be bypassed to boot up the system, for example qemu when configured without a boot loader).
- User friendliness: it's simply a lot nicer for the user looking at a partition table if the partition table explains what is what, instead of just saying "Hey, this is a Linux partition!" and nothing else.
Uses for the concept
Now that we cleared up the Why?, let's have a closer look at how this is currently used and exposed in systemd's various components.
Use #1: Running a disk image in a container
If a disk image follows the Discoverable Partitions Specification then systemd-nspawn has all it needs to just boot it up. Specifically, if you have a GPT disk image in a file foobar.raw and you want to boot it up in a container, just run systemd-nspawn -i foobar.raw -b, and that's it (you can specify a block device like /dev/sdb too if you like). It becomes easy and natural to prepare disk images that can be booted either on a physical machine, inside a virtual machine manager or inside such a container manager: the necessary meta-information is included in the image, easily accessible before actually looking into its file systems.
Use #2: Booting an OS image on bare-metal without /etc/fstab or kernel command line root=
If a disk image follows the specification, in many cases you can remove /etc/fstab (or never even install it), as the basic information needed is already included in the partition table. The systemd-gpt-auto-generator logic implements automatic discovery of the root file system as well as all auxiliary file systems. (Note that the former requires an initrd that uses systemd; some more conservative distributions do not support that yet, unfortunately.) Effectively this means you can boot up a kernel/initrd with an entirely empty kernel command line, and the initrd will automatically find the root file system (by looking for a suitably marked partition on the same drive the EFI System Partition was found on).
(Note, if /etc/fstab or root= exist and contain relevant information they always take precedence over the automatic logic. This is in particular useful to tweak things by specifying additional mount options and such.)
Use #3: Mounting a complex disk image for introspection or manipulation
The systemd-dissect tool may be used to introspect and manipulate OS disk images that implement the specification. If you pass the path to a disk image (or block device) it will extract various bits of useful information from the image (e.g. what OS is this? what partitions to mount?) and display it.
With the --mount switch a disk image (or block device) can be mounted to some location. This is useful for looking at what is inside it, or changing its contents. This will dissect the image and then automatically mount all contained file systems matching their GPT partition description to the right places, so that you subsequently could chroot into it. (But why chroot if you can just use systemd-nspawn? 😎)
Use #4: Copying files in and out of a disk image
The systemd-dissect tool also has two switches, --copy-from and --copy-to, which allow copying files out of or into a compliant disk image, taking all included file systems and the resulting mount hierarchy into account.
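The argument order of the two switches is easy to mix up, so here is a small Python sketch that merely assembles the command lines (it does not run them). It assumes the order documented in the systemd-dissect man page, i.e. IMAGE PATH TARGET for --copy-from and IMAGE SOURCE PATH for --copy-to; the helper names and the example file names are made up.

```python
import shlex


def copy_from_argv(image, path_in_image, target):
    """Command line for copying a file out of a disk image."""
    return ["systemd-dissect", "--copy-from", image, path_in_image, target]


def copy_to_argv(image, source, path_in_image):
    """Command line for copying a local file into a disk image."""
    return ["systemd-dissect", "--copy-to", image, source, path_in_image]


# Show the commands instead of executing them; running them for real
# typically requires privileges to set up loopback devices.
print(shlex.join(copy_from_argv("foobar.raw", "/etc/os-release", "./os-release")))
print(shlex.join(copy_to_argv("foobar.raw", "./motd", "/etc/motd")))
```

To actually execute one of them, pass the list to subprocess.run() with check=True.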
Use #5: Running services directly off a disk image
The RootImage= setting in service unit files accepts paths to compliant disk images (or block device nodes), and can mount them automatically, running service binaries directly off them (in chroot() style). In fact, this is the base for the Portable Service concept of systemd.
Use #6: Provisioning disk images
systemd provides various tools that can run operations provisioning disk images in an "offline" mode. Specifically:
systemd-tmpfiles
With the --image= switch systemd-tmpfiles can directly operate on a disk image, and for example create all directories and other inodes defined in its declarative configuration files included in the image. This can be useful for example to set up the /var/ or /etc/ tree according to such configuration before first boot.
systemd-sysusers
Similarly, the --image= switch of systemd-sysusers tells the tool to read the declarative system user specifications included in the image and synthesize system users from them, writing them to the /etc/passwd (and related) files in the image. This is useful for provisioning these users before the first boot, for example to ensure UID/GID numbers are pre-allocated, and such allocations are not delayed until first boot.
systemd-machine-id-setup
The --image= switch of systemd-machine-id-setup may be used to provision a fresh machine ID into /etc/machine-id of a disk image, before first boot.
systemd-firstboot
The --image= switch of systemd-firstboot may be used to set various basic system settings (such as root password, locale information, hostname, …) on the specified disk image, before booting it up.
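Put together, offline provisioning of an image could look like the following Python sketch, which only builds the four command lines. The order (users before tmpfiles, so that ownership in tmpfiles configuration can be resolved) and the choice of setting just the hostname are assumptions of this example, as are the function name and the example values.

```python
import shlex


def provisioning_commands(image, hostname):
    """Build the offline provisioning command lines for one disk image."""
    img = f"--image={image}"
    return [
        # Create system users first (assumed order, see lead-in).
        ["systemd-sysusers", img],
        # Then create directories and other inodes.
        ["systemd-tmpfiles", img, "--create"],
        # Provision a fresh machine ID.
        ["systemd-machine-id-setup", img],
        # Finally set basic system settings, here only the hostname.
        ["systemd-firstboot", img, f"--hostname={hostname}"],
    ]


for argv in provisioning_commands("foobar.raw", "appliance"):
    print(shlex.join(argv))
```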
Use #7: Extracting log information
The journalctl switch --image= may be used to show the journal log data included in a disk image (or, as usual, the specified block device). This is very useful for analyzing failed systems offline, as it gives direct access to the logs without any further, manual analysis.
Use #8: Automatic repartitioning/growing of file systems
The systemd-repart tool may be used to repartition a disk or image in a declarative and additive way. One primary use-case for it is to run during boot on physical or VM systems to grow the root file system to the disk size, or to add in, format, encrypt and populate additional partitions at boot.
With its --image= switch the tool may operate on compliant disk images in an offline mode of operation: it will then read the definitions of the partitions that shall be grown or created off the image itself, and then apply them to the image. This is particularly useful in combination with the --size= switch, which allows growing disk images to the specified size.
Specifically, consider the following work-flow: you download a minimized disk image foobar.raw that contains only the minimized root file system (and maybe an ESP, if you want to boot it on bare-metal, too). You then run systemd-repart --image=foobar.raw --size=15G to enlarge the image to 15G, based on the declarative rules defined in the repart.d/ drop-in files included in the image (this means this can grow the root partition, and/or add in more partitions, for example for /srv or so, maybe encrypted with a locally generated key or so). Then, you proceed to boot it up with systemd-nspawn --image=foobar.raw -b, making use of the full 15G.
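As a sketch, the two steps of this work-flow can be captured in a tiny Python helper that returns the command lines in order. The helper name is hypothetical and nothing is executed here; the switches are exactly the ones mentioned above.

```python
import shlex


def grow_and_boot_commands(image, size):
    """Return the repartition-then-boot command lines for a disk image."""
    return [
        # Step 1: grow the image and apply its own repart.d/ rules.
        ["systemd-repart", f"--image={image}", f"--size={size}"],
        # Step 2: boot the enlarged image in a container.
        ["systemd-nspawn", f"--image={image}", "-b"],
    ]


for argv in grow_and_boot_commands("foobar.raw", "15G"):
    print(shlex.join(argv))
```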
Versioning + Multi-Arch
Disk images implementing this specification can carry OS executables in one of three ways:
- Only a root file system
- Only a /usr/ file system (in which case the root file system is automatically picked as tmpfs).
- Both a root and a /usr/ file system (in which case the two are combined, the /usr/ file system mounted into the root file system, and the former possibly in read-only fashion)
They may also contain OS executables for different architectures, permitting "multi-arch" disk images that can safely boot up on multiple CPU architectures. As the root and /usr/ partition type UUIDs are specific to architectures this is easily done by including one such partition for x86-64, and another for aarch64. If the image is now used on an x86-64 system the former partition is automatically used, on aarch64 the latter.
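The selection logic boils down to a lookup from the local architecture to the matching type UUID. Here is a minimal Python sketch of the idea; it is not the code the systemd tools use. The two root partition type UUIDs are quoted from the specification and should be verified there, the mapping from platform.machine() strings is an assumption about what Linux reports, and the partition list format is made up.

```python
import platform

# Root partition type UUIDs per architecture (excerpt; verify against
# the specification). Keys are the strings platform.machine() is
# assumed to return on Linux.
ROOT_TYPE_BY_MACHINE = {
    "x86_64": "4f68bce3-e8cd-4db1-96e7-fbcaf984b709",
    "aarch64": "b921b045-1df0-41c3-af44-4c6f280d3fae",
}


def root_candidates(partitions, machine=None):
    """Return the partitions whose type is the root type for `machine`."""
    machine = machine or platform.machine()
    wanted = ROOT_TYPE_BY_MACHINE.get(machine)
    return [p for p in partitions if p["type"].lower() == wanted]


# Hypothetical multi-arch image with one root partition per architecture.
image = [
    {"node": "foobar.raw:p1", "type": "4f68bce3-e8cd-4db1-96e7-fbcaf984b709"},
    {"node": "foobar.raw:p2", "type": "b921b045-1df0-41c3-af44-4c6f280d3fae"},
]
print(root_candidates(image, machine="aarch64"))
```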
Moreover, these OS executables may be contained in different versions, to implement a simple versioning scheme: when tools such as systemd-nspawn or systemd-gpt-auto-generator dissect a disk image, and they find two or more root or /usr/ partitions of the same type UUID, they will automatically pick the one whose GPT partition label (a 36 character free-form string every GPT partition may have) is the newest according to strverscmp() (OK, truth be told, we don't use strverscmp() as-is, but a modified version with some more modern syntax and semantics, but conceptually identical).
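To illustrate what "newest according to a version comparison" means, here is a deliberately simplified Python sketch: it splits each label into digit and non-digit runs and compares the digit runs numerically. This is neither strverscmp() nor the modified comparison systemd uses, only the same basic idea; the labels are made-up examples.

```python
import re


def version_key(label):
    """Sort key: digit runs compare as numbers, everything else as text."""
    parts = re.findall(r"\d+|\D+", label)
    # Tag each chunk so that numbers and text are never compared directly.
    return [(1, int(p), "") if p.isdigit() else (0, 0, p) for p in parts]


def pick_newest(labels):
    """Return the label carrying the highest version."""
    return max(labels, key=version_key)


print(pick_newest(["fooOS_1.9", "fooOS_1.10", "fooOS_1.2"]))
# prints "fooOS_1.10"
```

A plain string comparison would pick fooOS_1.9 here, which is exactly why a version-aware comparison is needed.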
This logic allows implementing a very simple and natural A/B update scheme: an updater can drop multiple versions of the OS into separate root or /usr/ partitions, always updating the partition label to the version included therein once the download is complete. All of the tools described here will then honour this, and always automatically pick the newest version of the OS.
Verity
When building modern OS appliances, security is highly relevant. Specifically, offline security matters: an attacker with physical access should have a difficult time modifying the OS in a way that isn't noticed. Think of a car or a cell network base station: these appliances are usually parked/deployed in environments attackers can get physical access to. It's essential that in this case the OS itself is sufficiently protected, so that the attacker cannot just mount the OS file system image, make modifications (inserting a backdoor, spying software or similar) and have the system otherwise continue to run without this being immediately detected.
A great way to implement offline security is via Linux' dm-verity subsystem: it allows securely binding immutable disk IO to a single, short trusted hash value. If an attacker manages to modify the disk image offline, the modified disk image won't match the trusted hash anymore, and will not be trusted anymore (depending on policy this then just results in IO errors being generated, or automatic reboot/power-off).
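The idea of binding a whole image to one short hash can be sketched with a toy hash tree in Python. To be clear, this is not dm-verity's on-disk format (which, among other things, uses a salt and packs many digests into each hash block); the block size and the pairwise tree layout are assumptions chosen to keep the example short.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed data block size for this toy example


def root_hash(data: bytes) -> str:
    """Compute the root of a simple binary hash tree over `data`."""
    # Leaf level: one SHA-256 digest per data block.
    level = [
        hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
        for i in range(0, len(data), BLOCK_SIZE)
    ] or [hashlib.sha256(b"").digest()]
    # Hash pairs of digests together until a single digest remains.
    while len(level) > 1:
        level = [
            hashlib.sha256(b"".join(level[i:i + 2])).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


image = bytes(4 * BLOCK_SIZE)          # four all-zero blocks
tampered = bytearray(image)
tampered[BLOCK_SIZE + 7] ^= 0x01       # flip one bit in the second block
print(root_hash(image) == root_hash(bytes(tampered)))  # prints "False"
```

Because every block feeds into the root, flipping a single bit anywhere changes the root hash, and only the root hash needs to be trusted.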
The Discoverable Partitions Specification declares how to include Verity validation data in disk images, and how to relate them to the file systems they protect, thus making it very easy to deploy and work with such protected images. For example systemd-nspawn supports a --root-hash= switch, which accepts the Verity root hash and then will automatically assemble dm-verity with this, automatically matching up the payload and verity partitions. (Alternatively, just place a .roothash file next to the image file.)
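Which file name counts as "next to the image" can be sketched as follows in Python. The rule assumed here is the one described in the systemd-nspawn man page: same name plus a .roothash suffix, except that a trailing .raw suffix of the image is dropped first. Please double-check it there; the helper name is made up.

```python
from pathlib import PurePosixPath


def roothash_sidecar(image: str) -> PurePosixPath:
    """Derive the path of the .roothash file belonging to an image."""
    path = PurePosixPath(image)
    # Drop a trailing ".raw" suffix, then append ".roothash".
    base = path.with_suffix("") if path.suffix == ".raw" else path
    return base.with_name(base.name + ".roothash")


print(roothash_sidecar("/images/foobar.raw"))
# prints "/images/foobar.roothash"
```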
Future
The above already is a powerful tool set for working with disk images. However, there are some more areas I'd like to extend this logic to:
bootctl
Similar to the other tools mentioned above, bootctl (which is a tool to interface with the boot loader, and install/update systemd's own EFI boot loader sd-boot) should learn a --image= switch, to make installation of the boot loader on disk images easy and natural. It would automatically find the ESP and other relevant partitions in the image, and copy the boot loader binaries into them (or update them).
coredumpctl
Similar to the existing journalctl --image= logic the coredumpctl tool should also gain a --image= switch for extracting coredumps from compliant disk images. The combination of journalctl --image= and coredumpctl --image= would make it exceptionally easy to work with OS disk images of appliances and extract logging and debugging information from them after failures.
And that's all for now. Please refer to the specification and the man pages for further details. If your distribution's installer does not yet tag the GPT partitions it creates with the right GPT type UUIDs, consider asking them to do so.
Thank you for your time.