TL;DR: Tag your GPT partitions with the right, descriptive partition types, and the world will become a better place.
A number of years ago we started the Discoverable Partitions Specification which defines GPT partition type UUIDs and partition flags for the various partitions Linux systems typically deal with. Before the specification all Linux partitions usually just used the same type, basically saying "Hey, I am a Linux partition" and not much else. With this specification the GPT partition type, flags and label system becomes a lot more expressive, as it can tell you:
- What kind of data a partition contains (i.e. is this swap data, a file system or Verity data?)
- What the purpose/mount point of a partition is (i.e. is this a /home/ partition or a root file system?)
- What CPU architecture a partition is intended for (i.e. is this a root partition for x86-64 or for aarch64?)
- Shall this partition be mounted automatically? (i.e. without being specifically configured via /etc/fstab)
- And if so, shall it be mounted read-only?
- And if so, shall the file system be grown to its enclosing partition size, if smaller?
- Which partition contains the newer version of the same data (i.e. multiple root file systems, with different versions)
By embedding all of this information inside the GPT partition table
disk images become self-descriptive: without requiring any other
source of information (such as
/etc/fstab) if you look at a
compliant GPT disk image it is clear how an image is put together and
how it should be used and mounted. This self-descriptiveness in
particular breaks one philosophical weirdness of traditional Linux
installations: the original source of information about which file system
is the root file system is typically embedded in the root file system
itself, in /etc/fstab. Thus, in a way, in order to know what the
root file system is you need to know what the root file system is. 🤯
(Of course, the way this recursion is traditionally broken up is by
copying the root file system information from /etc/fstab into
the boot loader configuration, resulting in a situation where the
primary source of information for this — i.e. /etc/fstab — is
actually mostly irrelevant, and the secondary source — i.e. the copy
in the boot loader — becomes the configuration that actually matters.)
Today, the GPT partition type UUIDs defined by the specification have been adopted quite widely, by distributions and their installers, as well as a variety of partitioning tools and other tools.
In this article I want to highlight how the various tools the systemd project provides make use of the concepts the specification introduces.
But before we start with that, let's underline why tagging partitions with these descriptive partition type UUIDs (and the associated partition flags) is a good thing, besides the philosophical points made above.
Simplicity: in particular OS installers become simpler — adjusting
/etc/fstab as part of the installation is not necessary anymore, as the partitioning step already puts all information into place for assembling the system properly at boot. i.e. installing doesn't mean that you have to get both a partition table and
/etc/fstab into place; the former suffices entirely.
Robustness: since partition tables mostly remain static after installation the chance of corruption is much lower than if the data is stored in file systems (e.g. in
/etc/fstab). Moreover, by associating the metadata directly with the objects it describes, the chance of things getting out of sync is reduced. (i.e. if you lose
/etc/fstab, or forget to rerun your initrd builder, you still know what a partition is supposed to be just by looking at it.)
Programmability: if partitions are self-descriptive it's much easier to automatically process them with various tools. In fact, this blog story is mostly about that: various systemd tools can naturally process disk images prepared like this.
Alternative entry points: on traditional disk images, the boot loader needs to be told which
root= kernel command line option to use, which then provides access to the root file system, where
/etc/fstab is then found, which describes the rest of the file systems. Where precisely
root= is configured for the boot loader highly depends on the boot loader and distribution used, and is typically encoded in a Turing-complete programming language (GRUB…). This makes it very hard to automatically determine the right root file system to use when implementing alternative entry points to the system. By alternative entry points I mean other ways to boot the disk image, specifically running it as a
systemd-nspawn container — but this extends to other mechanisms where the boot loader may be bypassed to boot up the system, for example
qemu when configured without a boot loader.
User friendliness: it's simply a lot nicer for the user looking at a partition table if the partition table explains what is what, instead of just saying "Hey, this is a Linux partition!" and nothing else.
Uses for the concept
Now that we have cleared up the Why?, let's have a closer look at how this is
currently used and exposed in
systemd's various components.
Use #1: Running a disk image in a container
If a disk image follows the Discoverable Partitions Specification, then
systemd-nspawn has all it needs to just boot it up. Specifically, if you have a GPT
disk image in a file
foobar.raw and you want to boot it up in a
container, just run
systemd-nspawn -i foobar.raw -b, and that's it
(you can specify a block device like
/dev/sdb too if you like). It
becomes easy and natural to prepare disk images that can be booted
either on a physical machine, inside a virtual machine manager or
inside such a container manager: the necessary meta-information is
included in the image, easily accessible before actually looking into
its file systems.
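As a concrete sketch of the invocation described above (the image name is a stand-in; the options are systemd-nspawn's documented --image=/-i and --boot/-b):

```shell
# Boot a compliant GPT disk image directly as a container
systemd-nspawn --image=foobar.raw --boot

# A block device works just as well as an image file
systemd-nspawn --image=/dev/sdb --boot
```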
Use #2: Booting an OS image on bare-metal without
/etc/fstab or kernel command line
If a disk image follows the specification in many cases you can remove
/etc/fstab (or never even install it) — as the basic information
needed is already included in the partition table. The
systemd-gpt-auto-generator logic implements automatic discovery of the root file system as well
as all auxiliary file systems. (Note that the former requires an
initrd that uses systemd; some more conservative distributions do not
support that yet, unfortunately.) Effectively this means you can boot
up a kernel/initrd with an entirely empty kernel command line, and the
initrd will automatically find the root file system (by looking for a
suitably marked partition on the same drive the EFI System Partition
was found on).
(Note that if /etc/fstab or root= exist and contain relevant
information, they always take precedence over the automatic logic. This
is in particular useful for tweaking things by specifying additional mount
options and such.)
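For example, a kernel command line overriding the automatic discovery might look like this (device name and mount option hypothetical):

```
root=/dev/vda2 rootflags=noatime
```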
Use #3: Mounting a complex disk image for introspection or manipulation
The systemd-dissect tool may be used to introspect and manipulate OS disk images that
implement the specification. If you pass it the path to a disk image (or
block device) it will extract various bits of useful information from
the image (e.g. what OS is this? what partitions to mount?) and display it.
With the --mount switch a disk image (or block device) can be
mounted to some location. This is useful for looking at what is inside
it, or changing its contents. It will dissect the image and then
automatically mount all contained file systems matching their GPT
partition description to the right places, so that you can subsequently
chroot into it. (But why
chroot if you can just use systemd-nspawn?)
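As a sketch of that workflow (image name and mount point hypothetical; the switch shown is systemd-dissect's documented --mount):

```shell
# Show general information about the image: OS release, partitions, …
systemd-dissect foobar.raw

# Dissect the image and mount all its file systems at the right places
mkdir -p /mnt/img
systemd-dissect --mount foobar.raw /mnt/img

# Inspect or modify the contents, then detach again
ls /mnt/img/etc
umount -R /mnt/img
```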
Use #4: Copying files in and out of a disk image
The systemd-dissect tool also has two switches
--copy-from and --copy-to which allow
copying files out of or into a compliant disk image, taking all
included file systems and the resulting mount hierarchy into account.
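For example (file paths hypothetical):

```shell
# Copy a file out of the image, respecting the full mount hierarchy
systemd-dissect --copy-from foobar.raw /etc/os-release /tmp/os-release

# Copy a file from the host into the image
systemd-dissect --copy-to foobar.raw /tmp/local.conf /etc/local.conf
```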
Use #5: Running services directly off a disk image
The RootImage= setting in service unit files accepts paths to compliant disk images
(or block device nodes), and can mount them automatically, running
service binaries directly off them (in
chroot() style). In fact,
this is the basis for the Portable
Service concept of systemd.
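A minimal sketch of a unit using this (unit name and binary are hypothetical; RootImage= is the documented service setting):

```ini
# /etc/systemd/system/myapp.service — runs the daemon chroot()-style
# inside the file systems dissected from the disk image
[Service]
RootImage=/var/lib/machines/foobar.raw
ExecStart=/usr/bin/mydaemon
```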
Use #6: Provisioning disk images
systemd provides various tools that can run operations provisioning
disk images in an "offline" mode. Specifically:
The --image= switch of systemd-tmpfiles
lets the tool directly operate on a disk image, and for example create all
directories and other inodes defined in the declarative tmpfiles.d configuration
files included in the image. This can be useful for example to set up
the /etc/ tree according to such configuration before the first boot.
The --image= switch of systemd-sysusers
tells the tool to read the declarative system user specifications
included in the image and synthesize system users from them, writing
them to the
/etc/passwd (and related) files in the image. This is
useful for provisioning these users before the first boot, for example
to ensure UID/GID numbers are pre-allocated, and such allocations are not
delayed until first boot.
The --image= switch of systemd-firstboot
may be used to set various basic system settings (such as the root
password, locale information, hostname, …) on the specified disk
image, before booting it up.
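For example (the values are hypothetical; the switches are systemd-firstboot's documented options):

```shell
# Provision basic settings in the image before its first boot
systemd-firstboot --image=foobar.raw \
    --locale=en_US.UTF-8 \
    --hostname=appliance1 \
    --root-password=changeme
```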
Use #7: Extracting log information
The --image= switch of journalctl may be used to show the journal log data included in
a disk image (or, as usual, the specified block device). This is very
useful for analyzing failed systems offline, as it gives direct access
to the logs without any further manual analysis.
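For example:

```shell
# Read the journal of a (failed) system from its disk image,
# newest entries first, errors only
journalctl --image=foobar.raw -r -p err
```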
Use #8: Automatic repartitioning/growing of file systems
The systemd-repart tool may be used to repartition a disk or image in a declarative and
additive way. One primary use-case for it is to run during boot on
physical or VM systems to grow the root file system to the disk size,
or to add in, format, encrypt and populate additional partitions at boot.
With the --image= switch the tool may operate on compliant disk
images in an offline mode of operation: it will then read the partition
definitions that shall be grown or created from the image itself, and
then apply them to the image. This is particularly useful in
combination with the
--size= switch, which allows growing disk images to a specified size.
Specifically, consider the following work-flow: you download a
minimized disk image
foobar.raw that contains only the minimized
root file system (and maybe an ESP, if you want to boot it on
bare-metal, too). You then run systemd-repart --image=foobar.raw
--size=15G to enlarge the image to 15G, based on the declarative
rules defined in the repart.d/
drop-in files included in the image (this means this can grow the root
partition, and/or add in more partitions, for example for
/home/ or so, maybe encrypted with a locally generated key). Then, you
proceed to boot it up with
systemd-nspawn --image=foobar.raw -b, making
use of the full 15G.
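The whole work-flow described above is just two commands (image name hypothetical):

```shell
# Grow the downloaded image to 15G according to the repart.d
# drop-ins shipped inside the image itself
systemd-repart --image=foobar.raw --size=15G

# Boot the now full-size image in a container
systemd-nspawn --image=foobar.raw --boot
```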
Versioning + Multi-Arch
Disk images implementing this specification can carry OS executables in one of three ways:
- Only a root file system
- Only a
/usr/ file system (in which case the root file system is automatically picked as
tmpfs)
- Both a root and a
/usr/ file system (in which case the two are combined, the
/usr/ file system mounted into the root file system, and the former possibly in read-only fashion)
They may also contain OS executables for different architectures,
permitting "multi-arch" disk images that can safely boot up on
multiple CPU architectures. As the root and
/usr/ partition type
UUIDs are specific to architectures, this is easily done by including
one such partition for
x86-64 and another for
aarch64. If the
image is then used on an
x86-64 system, the former
partition is automatically used; on an
aarch64 system, the latter.
Moreover, these OS executables may be contained in different versions,
to implement a simple versioning scheme: when tools such as
systemd-gpt-auto-generator dissect a disk image,
and they find two or more root or
/usr/ partitions of the same type
UUID, they will automatically pick the one whose GPT partition label
(a 36 character free-form string every GPT partition may have) is the
newest according to
strverscmp().
(OK, truth be told, we don't use
strverscmp() as-is, but a modified
version with some more modern syntax and semantics, but conceptually it is very similar.)
This logic makes it possible to implement a very simple and natural A/B update
scheme: an updater can drop multiple versions of the OS into separate
/usr/ partitions, always updating the partition label to the
version included therein once the download is complete. All of the
tools described here will then honour this, and always automatically
pick the newest version of the OS.
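The effect of this label comparison can be illustrated in plain shell, using GNU sort -V as a rough stand-in for the (modified) strverscmp() ordering; the partition labels are hypothetical:

```shell
# Three /usr/ partitions carrying the same type UUID, distinguished
# only by their version-carrying GPT partition labels:
labels='foobar_0.2
foobar_0.9
foobar_0.10'

# Pick the newest: the comparison is version-aware, so 0.10 > 0.9
newest=$(printf '%s\n' "$labels" | sort -V | tail -n 1)
echo "$newest"   # -> foobar_0.10
```

Note how a plain lexicographic sort would wrongly rank 0.9 above 0.10; version-aware comparison gets this right, which is exactly why a scheme like strverscmp() is used for the labels.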
When building modern OS appliances, security is highly relevant. Specifically, offline security matters: an attacker with physical access should have a difficult time modifying the OS in a way that isn't noticed. Think of a car or a cell network base station: these appliances are usually parked/deployed in environments attackers can get physical access to. It's essential that in this case the OS itself is sufficiently protected, so that the attacker cannot just mount the OS file system image, make modifications (inserting a backdoor, spying software or similar) while the system otherwise continues to run without this being immediately detected.
A great way to implement offline security is via Linux'
dm-verity subsystem: it allows securely binding immutable disk IO to a single,
short trusted hash value: if an attacker manages to modify the
disk image offline, the modified disk image won't match the trusted hash
anymore, and will not be trusted anymore (depending on policy this
then just results in IO errors being generated, or an automatic reboot).
The Discoverable Partitions Specification declares how to include
Verity validation data in disk images, and how to relate it to the file
systems it protects, thus making it very easy to deploy and work with
such protected images. For example,
systemd-nspawn supports a
--root-hash= switch, which accepts the Verity root hash and then
will automatically set up
dm-verity with it, automatically
matching up the payload and verity partitions. (Alternatively, just drop a
.roothash file next to the image file.)
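For example (the root hash value is a placeholder here, to be supplied from wherever the Verity data for the image was generated):

```shell
# Boot the image with Verity enabled; the root hash pins the
# integrity of the payload partition
systemd-nspawn --image=foobar.raw --root-hash="$ROOT_HASH" --boot
```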
The above already is a powerful tool set for working with disk images. However, there are some more areas I'd like to extend this logic to:
Similar to the other tools mentioned above,
bootctl
(which is a tool to interface with the boot loader, and install/update
systemd's own EFI boot loader sd-boot)
should learn a
--image= switch, to make installation of the boot
loader on disk images easy and natural. It would automatically find
the ESP and other relevant partitions in the image, and copy the boot
loader binaries into them (or update them).
Similar to the existing
journalctl --image= logic, the
coredumpctl
tool should also gain an
--image= switch for extracting coredumps
from compliant disk images. The combination of
journalctl --image= and
coredumpctl --image= would make it exceptionally easy to work
with OS disk images of appliances and extracting logging and debugging
information from them after failures.
And that's all for now. Please refer to the specification and the man pages for further details. If your distribution's installer does not yet tag the GPT partitions it creates with the right GPT type UUIDs, consider asking them to do so.
Thank you for your time.