All Systems Go! is everybody's favourite low-level Userspace Linux conference, taking place in Berlin, Germany in September 28-30, 2018.
For more information please visit our conference website!
The Call for Participation (CFP) for All Systems Go! 2018 will close in one week, on the 30th of July! We'd like to invite you to submit your proposals for consideration via the CFP submission site soon!
Notification of acceptance and non-acceptance will go out within 7 days of the closing of the CFP.
All topics relevant to foundational open-source Linux technologies are welcome. In particular, however, we are looking for proposals including, but not limited to, the following topics:
While our focus is definitely more on the user-space side of things, talks about kernel projects are welcome, as long as they have a clear and direct relevance for user-space.
For more information please visit our conference website!
systemd v239 contains a great number of new features. One of them is first class support for Portable Services. In this blog story I'd like to shed some light on what they are and why they might be interesting for your application.
The "Portable Service" concept takes inspiration from classic
chroot() environments as well as container management and brings a
number of their features to more regular system service management.
While the definition of what a "container" really is is hotly debated, I figure people can generally agree that the "container" concept primarily provides two major features:
Resource bundling: a container generally brings its own file system tree along, bundling any shared libraries and other resources it might need along with the main service executables.
Isolation and sand-boxing: a container operates in a name-spaced environment that is relatively detached from the host. Besides living in its own file system namespace it usually also has its own user database, process tree and so on. Access from the container to the host is limited with various security technologies.
Of these two concepts the first one is also what traditional UNIX
chroot() environments are about.
Both resource bundling and isolation/sand-boxing are concepts systemd
has implemented to varying degrees for a longer time. Specifically,
the RootDirectory= and RootImage= settings have been around for a long
time, and so have the various sand-boxing features systemd
provides. The Portable Services concept builds on that,
putting these features together in a new, integrated way to make them
more accessible and usable.
Much like a container image, a portable service on disk can be just a directory tree that contains service executables and all their dependencies, in a hierarchy resembling the normal Linux directory hierarchy. A portable service can also be a raw disk image, containing a file system containing such a tree (which can be mounted via a loop-back block device), or multiple file systems (in which case they need to follow the Discoverable Partitions Specification and be located within a GPT partition table). Regardless of whether the portable service on disk is a simple directory tree or a raw disk image, let's call this concept the portable service image.
Such images can be generated with any tool typically used for the
purpose of installing OSes inside some directory, for example
debootstrap. There are very few requirements made
on these trees, except the following two:
The tree should carry systemd unit files for relevant services in it.
The tree should carry OS release information, i.e. an
/usr/lib/os-release (or /etc/os-release) file.
Of course, as you might notice, OS trees generated from any of today's
big distributions generally qualify for these two requirements without
any further modification, as pretty much all of them adopted
/usr/lib/os-release and tend to ship their major services with
systemd unit files.
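To illustrate, a minimal portable service image for a hypothetical service foo might be a directory tree like this (the layout is a sketch; beyond the usual hierarchy, only the unit file and the os-release file are strictly required):

```
foo_1/
├── etc/
│   └── os-release            (or a symlink to ../usr/lib/os-release)
├── usr/
│   ├── lib/
│   │   ├── os-release
│   │   └── systemd/system/
│   │       └── foo.service
│   └── local/lib/foo/foo     (the service binary)
└── var/
```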
A portable service image generated like this can be "attached" to or "detached" from a host:
"Attaching" an image to a host is done through the new
portablectl attach command. This command dissects the image, reading the
os-release information, and searching for unit files in it. It then copies
relevant unit files out of the image and into
/etc/systemd/system/. After that it augments any copied service
unit files in two ways: a drop-in adding a
RootImage= line is added in, so that even though the unit files
are now available on the host, when started they run the referenced
binaries from the image. It also symlinks in a second drop-in,
called a "profile", which is supposed to carry additional
security settings to enforce on the attached services, to ensure
the right amount of sand-boxing.
"Detaching" an image from the host is done through portablectl
detach. It reverses the steps above: the unit files copied out are
removed again, and so are the two drop-in files generated for them.
While a portable service is attached its relevant unit files are made
available on the host like any others: they will appear in systemctl
list-unit-files, you can enable and disable them, you can start them
and stop them. You can extend them with
systemctl edit. You can
introspect them. You can apply resource management to them like to any
other service, you can process their logs like any other service,
and so on. That's because they really are native systemd services,
except that they have a 'twist', if you so will: they have tougher
security by default and store their resources in a root directory or
root image.
And that's already the essence of what Portable Services are.
A couple of interesting points:
Even though the focus is on shipping service unit files in portable service images, you can actually ship timer units, socket units, target units, path units in portable services too. This means you can very naturally do time, socket and path based activation. It's also entirely fine to ship multiple service units in the same image, in case you have more complex applications.
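As a sketch, a socket unit shipped in an image next to a matching service unit could look like this (the names and the port are made up for illustration):

```ini
# foo.socket — enables socket activation for a hypothetical foo.service
[Unit]
Description=Socket for the foo service

[Socket]
ListenStream=1234

[Install]
WantedBy=sockets.target
```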
This concept introduces zero new metadata. Unit files are an
existing concept, as are
os-release files, and — in case you opt
for raw disk images — GPT partition tables are already established
too. This also means existing tools to generate images can be
reused for building portable service images to a large degree as no
completely new artifact types need to be generated.
Because the Portable Service concept introduces zero new metadata
and just builds on existing security and resource bundling
features of systemd, it's implemented in a set of distinct tools,
relatively disconnected from the rest of systemd. Specifically, the
main user-facing command is portablectl,
and the actual operations are implemented by the systemd-portabled
service. If you so will, portable services are a true add-on to systemd, just
making a specific work-flow nicer to use than with the basic
operations systemd otherwise provides. Also note that
systemd-portabled provides bus APIs accessible to any program
that wants to interface with it;
portablectl is just one tool
that happens to be shipped along with systemd.
Since Portable Services are a feature we only added very recently
we wanted to keep some freedom to make changes still. Due to that
we decided to install the
portablectl command into
/usr/lib/systemd/ for now, so that it does not appear in $PATH
by default. This means, for now you have to invoke it with the full path
/usr/lib/systemd/portablectl. We expect to move it into
/usr/bin/ very soon though, and make it a fully supported
interface of systemd.
You may wonder which unit files contained in a portable service
image are the ones considered "relevant" and are actually copied
out by the
portablectl attach operation. Currently, this is
derived from the image name. Let's say you have an image stored in
/var/lib/portables/foobar_4711/ (or alternatively in
a raw image
/var/lib/portables/foobar_4711.raw). In that case the
unit files copied out match the pattern foobar*, i.e. all unit files
whose names begin with the image name's prefix (the part before the
version suffix).
The Portable Services concept does not define any specific method
how images get on the deployment machines, that's entirely up to
administrators. You can just
scp them there, or
wget them. You
could even package them as RPMs and then deploy them with dnf, if
you feel adventurous.
Portable service images can reside in any directory you
like. However, if you place them in /var/lib/portables/,
portablectl will find them easily and can show you a list of
images you can attach and suchlike.
Attaching a portable service image can be done persistently, so
that it remains attached on subsequent boots (which is the default),
or it can be attached only until the next reboot, by passing
--runtime to portablectl.
Because portable service images are ultimately just regular OS images, it's natural and easy to build a single image that can be used in three different ways:
It can be attached to any host as a portable service image.
It can be booted as an OS container, for example in a container
manager such as systemd-nspawn.
It can be booted as host system, for example on bare metal or in a VM manager.
Of course, to qualify for the latter two the image needs to
contain more than just the service binaries, the os-release file
and the unit files. To be bootable in an OS container manager such as
systemd-nspawn the image needs to contain an init system of some
form, for example systemd. To
be bootable on bare metal or as VM it also needs a boot loader of
some form, for example systemd-boot.
In the previous section the "profile" concept was briefly
mentioned. Since they are a major feature of the Portable Services
concept, they deserve some focus. A "profile" is ultimately just a
pre-defined drop-in file for unit files that are attached to a
host. They are supposed to mostly contain sand-boxing and security
settings, but may actually contain any other settings, too. When a
portable service is attached a suitable profile has to be selected. If
none is selected explicitly, the default profile called default is
used. systemd ships with four different profiles out of the box:
The default
profile provides a medium level of security. It contains settings to
drop capabilities, enforce system call filters, restrict many kernel
interfaces and mount various file systems read-only.
The strict
profile is similar to the
default profile, but generally uses the
most restrictive sand-boxing settings. For example networking is turned
off and access to
AF_NETLINK sockets is prohibited.
The trusted
profile is the least strict of them all. In fact it makes almost no
restrictions at all. A service run with this profile has basically
full access to the host system.
The nonetwork
profile is mostly identical to
default, but also turns off network access.
Note that the profile is selected at the time the portable service image is attached, and it applies to all service files attached, in case multiple are shipped in the same image. Thus, the sand-boxing restrictions to enforce are selected by the administrator attaching the image, not by the image vendor.
Additional profiles can be defined easily by the administrator, if needed. We might also add additional profiles sooner or later to be shipped with systemd out of the box.
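For example, a custom profile is ultimately just another drop-in file. The attach output later in this story shows the default profile living at /usr/lib/systemd/portable/profile/default/service.conf; presumably an administrator-defined one goes into the corresponding directory under /etc/ (the path below is an assumption):

```ini
# /etc/systemd/portable/profile/mycustom/service.conf (hypothetical profile)
[Service]
ProtectSystem=strict
ProtectHome=yes
PrivateNetwork=yes
```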
Portable Services are primarily intended to cover use-cases where code should more feel like "extensions" to the host system rather than live in disconnected, separate worlds. The profile concept is supposed to be tunable to the exact right amount of integration or isolation needed for an application.
In the container world the concept of "super-privileged containers" has been touted a lot, i.e. containers that run with full privileges. It's precisely that use-case that portable services are intended for: extensions to the host OS, that default to isolation, but can optionally get as much access to the host as needed, and can naturally benefit from the full functionality of the host. The concept should hence be useful for all kinds of low-level system software that isn't shipped with the OS itself but needs varying degrees of integration with it. Besides servers and appliances this should be particularly interesting for IoT and embedded devices.
Because portable services are just a relatively small extension to the way system services are otherwise managed, they can be treated like regular services for almost all use-cases: they will appear alongside regular services in all tools that can introspect systemd unit data, and can be managed the same way when it comes to logging, resource management, runtime life-cycles and so on.
Portable services are a very generic concept. While the original use-case is OS extensions, it's of course entirely up to you and other users to use them in a suitable way of your choice.
Let's have a look how this all can be used. We'll start with building a portable service image from scratch, before we attach, enable and start it on a host.
As mentioned, you can use any tool you like that can create OS trees
or raw images for building Portable Service images, for example
debootstrap or dnf --installroot=. For this example walkthrough
we'll use
mkosi, which is
ultimately just a fancy wrapper around dnf and debootstrap, but
makes a number of things particularly easy when repetitively building
images from source trees.
I have pushed everything necessary to reproduce this walkthrough locally to a GitHub repository. Let's check it out:
$ git clone https://github.com/systemd/portable-walkthrough.git
Let's have a look in the repository:
First of all, walkthroughd.c
is the main source file of our little service. To keep things
simple it's written in C, but it could be in any language of your
choice. The daemon as implemented won't do much: it just starts up
and waits for
SIGTERM, at which point it will shut down. It's
ultimately useless, but hopefully illustrates how this all fits
together. The C code has no dependencies besides libc.
walkthroughd.service is a systemd unit file that starts our little daemon. It's a simple
service, hence the unit file is trivial.
The Makefile is a short make build script to build the daemon binary. It's
pretty trivial, too: it just takes the C file and builds a binary
from it. It can also install the daemon. It places the binary in
/usr/local/lib/walkthroughd/walkthroughd (why not in
/usr/local/bin? because it's not a user-facing binary but a system
service binary), and its unit file in
/usr/local/lib/systemd/walkthroughd.service. If you want to test
the daemon on the host you can now simply run
make and then
./walkthroughd in order to check everything works.
mkosi.default is the file that tells
mkosi how to build the image. We opt for a
Fedora-based image here (but we might as well have used Debian
here, or any other supported distribution). We need no particular
packages during runtime (after all we only depend on libc), but
during the build phase we need gcc and make, hence these are the
only packages we list in BuildPackages=.
mkosi.build is a shell script that is invoked during mkosi's build logic. All
it does is invoke
make install to build and install
our little daemon, and afterwards it extends the
/etc/os-release file with an additional
field that describes our portable service a bit.
Let's now use this to build the portable service image. For that we
use the mkosi tool. It's
sufficient to invoke it without parameters to build the first image: it
will automatically discover mkosi.default, which
tells it what to do. (Note that if you work on a project like this for
a longer time,
mkosi -if is probably the better command to use, as
that speeds up building substantially by using an incremental build
mode.)
mkosi will download the necessary RPMs, and put them all
together. It will build our little daemon inside the image and after
all that's done it will output the resulting image: walkthroughd_1.raw.
Because we opted to build a GPT raw disk image in mkosi.default, this
file is actually a raw disk image containing a GPT partition
table. You can use
fdisk -l walkthroughd_1.raw to enumerate the
partition table. You can also use systemd-nspawn -i
walkthroughd_1.raw to explore the image quickly if you need.
Now that we have a portable service image, let's see how we can attach, enable and start the service included within it.
First, let's attach the image:
# /usr/lib/systemd/portablectl attach ./walkthroughd_1.raw
(Matching unit files with prefix 'walkthroughd'.)
Created directory /etc/systemd/system/walkthroughd.service.d.
Written /etc/systemd/system/walkthroughd.service.d/20-portable.conf.
Created symlink /etc/systemd/system/walkthroughd.service.d/10-profile.conf → /usr/lib/systemd/portable/profile/default/service.conf.
Copied /etc/systemd/system/walkthroughd.service.
Created symlink /etc/portables/walkthroughd_1.raw → /home/lennart/projects/portable-walkthrough/walkthroughd_1.raw.
The command will show you exactly what it has been doing: it just copied the main service file out, and added the two drop-ins, as expected.
Let's see if the unit is now available on the host, just like a regular unit, as promised:
# systemctl status walkthroughd.service
● walkthroughd.service - A simple example service
   Loaded: loaded (/etc/systemd/system/walkthroughd.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/walkthroughd.service.d
           └─10-profile.conf, 20-portable.conf
   Active: inactive (dead)
Nice, it worked. We see that the unit file is available and that systemd correctly discovered the two drop-ins. The unit is neither enabled nor started, however. Yes, attaching a portable service image implies neither enabling nor starting. It just means the unit files contained in the image are made available to the host. It's up to the administrator to then enable them (so that they are automatically started when needed, for example at boot), and/or start them (in case they shall run right away).
Let's now enable and start the service in one step:
# systemctl enable --now walkthroughd.service
Created symlink /etc/systemd/system/multi-user.target.wants/walkthroughd.service → /etc/systemd/system/walkthroughd.service.
Let's check if it's running:
# systemctl status walkthroughd.service
● walkthroughd.service - A simple example service
   Loaded: loaded (/etc/systemd/system/walkthroughd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/walkthroughd.service.d
           └─10-profile.conf, 20-portable.conf
   Active: active (running) since Wed 2018-06-27 17:55:30 CEST; 4s ago
 Main PID: 45003 (walkthroughd)
    Tasks: 1 (limit: 4915)
   Memory: 4.3M
   CGroup: /system.slice/walkthroughd.service
           └─45003 /usr/local/lib/walkthroughd/walkthroughd

Jun 27 17:55:30 sigma walkthroughd: Initializing.
Perfect! We can see that the service is now enabled and running. The daemon is running as PID 45003.
Now that we verified that all is good, let's stop, disable and detach the service again:
# systemctl disable --now walkthroughd.service
Removed /etc/systemd/system/multi-user.target.wants/walkthroughd.service.
# /usr/lib/systemd/portablectl detach ./walkthroughd_1.raw
Removed /etc/systemd/system/walkthroughd.service.
Removed /etc/systemd/system/walkthroughd.service.d/10-profile.conf.
Removed /etc/systemd/system/walkthroughd.service.d/20-portable.conf.
Removed /etc/systemd/system/walkthroughd.service.d.
Removed /etc/portables/walkthroughd_1.raw.
And finally, let's see that it's really gone:
# systemctl status walkthroughd
Unit walkthroughd.service could not be found.
Perfect! It worked!
I hope the above gets you started with Portable Services. If you have further questions, please contact our mailing list.
A more low-level document explaining details is shipped along with systemd.
For further information about
mkosi see its homepage.
For those living under a rock, the videos from everybody's favourite Userspace Linux Conference All Systems Go! 2017 are now available online.
The videos for my own two talks are available here:
Of course, this is the stellar work of the CCC VOC folks, who are hard to beat when it comes to videotaping of community conferences.
The GNOME.Asia Summit 2017 organizers invited me to speak at their conference in Chongqing/China, and it was an excellent event! Here's my brief report:
Because we arrived one day early in Chongqing, my GNOME friends Sri, Matthias, Jonathan, David and I started our journey with an excursion to the Dazu Rock Carvings, a short bus trip from Chongqing, and an excellent (and sometimes quite surprising) sight. I mean, where else can you see a buddha with 1000+ hands, centuries old, holding a Nexus 5 cell phone? Here's proof:
The GNOME.Asia schedule was excellent, with various good talks, including some about Flatpak, Endless OS, rpm-ostree, Blockchains and more. My own talk was about The Path to a Fully Protected GNOME Desktop OS Image (Slides available here). In the hallway track I did my best to advocate casync to whoever was willing to listen, and I think enough were ;-). As we all know attending conferences is at least as much about the hallway track as about the talks, and GNOME.Asia was a fantastic way to meet the Chinese GNOME and Open Source communities.
The day after the conference the organizers of GNOME.Asia organized a Chongqing day trip. A particular highlight was the ubiquitous hot pot, sometimes with the local speciality: fresh pig brain.
Here are some random photos from the trip: sights, food, social event and more.
I'd like to thank the GNOME Foundation for funding my trip to GNOME.Asia. And that's all for now. But let me close with an old Chinese wisdom:
The Trials Of A Long Journey Always Feeling, Civilized Travel Pass Reputation.
TL;DR: systemd now can do per-service IP traffic accounting, as well as access control for IP address ranges.
Last Friday we released systemd 235. I already blogged about its Dynamic User feature in detail, but there's one more piece of new functionality that I think deserves special attention: IP accounting and access control.
Before v235 systemd already provided per-unit resource management hooks for a number of different kinds of resources: consumed CPU time, disk I/O, memory usage and number of tasks. With v235 another kind of resource can be controlled per-unit with systemd: network traffic (specifically IP).
Three new unit file settings have been added in this context:
IPAccounting= is a boolean setting. If enabled for a unit, all IP
traffic sent and received by processes associated with it is counted
both in terms of bytes and of packets.
IPAddressDeny= takes an IP address prefix (that means: an IP
address with a network mask). All traffic from and to this address will be
prohibited for processes of the service.
IPAddressAllow= is the matching positive counterpart to
IPAddressDeny=. All traffic matching this IP address/network mask
combination will be allowed, even if otherwise listed in IPAddressDeny=.
The three options are thin wrappers around kernel functionality
introduced with Linux 4.11: the control group eBPF hooks. The actual
work is done by the kernel, systemd just provides a number of new
settings to configure this facet of it. Note that cgroup/eBPF is
unrelated to classic Linux firewalling, i.e.
iptables. It's up to you whether you use one or the
other, or both in combination (or of course neither).
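The three settings can be combined in a single unit (or drop-in); a sketch with illustrative values:

```ini
# Hypothetical fragment combining the three new settings
[Service]
IPAccounting=yes
IPAddressDeny=any
IPAddressAllow=192.168.0.0/16
IPAddressAllow=127.0.0.0/8
```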
Let's have a closer look at the IP accounting logic mentioned
above. Let's write a simple unit
/etc/systemd/system/ip-accounting-test.service:

[Service]
ExecStart=/usr/bin/ping 8.8.8.8
IPAccounting=yes
This simple unit invokes the
ping(8) command to
send a series of ICMP/IP ping packets to the IP address 8.8.8.8 (which
is the Google DNS server IP; we use it for testing here, since it's
easy to remember, reachable everywhere and known to react to ICMP
pings; any other IP address responding to pings would be fine to use,
of course). The
IPAccounting= option is used to turn on IP accounting for
the unit.
Let's start this service after writing the file. Let's then have a
look at the status output of systemctl:

# systemctl daemon-reload
# systemctl start ip-accounting-test
# systemctl status ip-accounting-test
● ip-accounting-test.service
   Loaded: loaded (/etc/systemd/system/ip-accounting-test.service; static; vendor preset: disabled)
   Active: active (running) since Mon 2017-10-09 18:05:47 CEST; 1s ago
 Main PID: 32152 (ping)
       IP: 168B in, 168B out
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/ip-accounting-test.service
           └─32152 /usr/bin/ping 8.8.8.8

Okt 09 18:05:47 sigma systemd: Started ip-accounting-test.service.
Okt 09 18:05:47 sigma ping: PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
Okt 09 18:05:47 sigma ping: 64 bytes from 8.8.8.8: icmp_seq=1 ttl=59 time=29.2 ms
Okt 09 18:05:48 sigma ping: 64 bytes from 8.8.8.8: icmp_seq=2 ttl=59 time=28.0 ms
This shows the
ping command running — it's currently at its second
ping cycle as we can see in the logs at the end of the output. More
interesting however is the
IP: line further up showing the current
IP byte counters. It currently shows 168 bytes have been received, and
168 bytes have been sent. That the two counters are at the same value
is not surprising: ICMP ping requests and responses are supposed to
have the same size. Note that this line is shown only if
IPAccounting= is turned on for the service, as only then is this data
collected.
Let's wait a bit, and invoke
systemctl status again:
# systemctl status ip-accounting-test
● ip-accounting-test.service
   Loaded: loaded (/etc/systemd/system/ip-accounting-test.service; static; vendor preset: disabled)
   Active: active (running) since Mon 2017-10-09 18:05:47 CEST; 4min 28s ago
 Main PID: 32152 (ping)
       IP: 22.2K in, 22.2K out
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/ip-accounting-test.service
           └─32152 /usr/bin/ping 8.8.8.8

Okt 09 18:10:07 sigma ping: 64 bytes from 8.8.8.8: icmp_seq=260 ttl=59 time=27.7 ms
Okt 09 18:10:08 sigma ping: 64 bytes from 8.8.8.8: icmp_seq=261 ttl=59 time=28.0 ms
Okt 09 18:10:09 sigma ping: 64 bytes from 8.8.8.8: icmp_seq=262 ttl=59 time=33.8 ms
Okt 09 18:10:10 sigma ping: 64 bytes from 8.8.8.8: icmp_seq=263 ttl=59 time=48.9 ms
Okt 09 18:10:11 sigma ping: 64 bytes from 8.8.8.8: icmp_seq=264 ttl=59 time=27.2 ms
Okt 09 18:10:12 sigma ping: 64 bytes from 8.8.8.8: icmp_seq=265 ttl=59 time=27.0 ms
Okt 09 18:10:13 sigma ping: 64 bytes from 8.8.8.8: icmp_seq=266 ttl=59 time=26.8 ms
Okt 09 18:10:14 sigma ping: 64 bytes from 8.8.8.8: icmp_seq=267 ttl=59 time=27.4 ms
Okt 09 18:10:15 sigma ping: 64 bytes from 8.8.8.8: icmp_seq=268 ttl=59 time=29.7 ms
Okt 09 18:10:16 sigma ping: 64 bytes from 8.8.8.8: icmp_seq=269 ttl=59 time=27.6 ms
As we can see, after 269 pings the counters are much higher: at 22K.
Note that while
systemctl status shows only the byte counters,
packet counters are kept as well. Use the low-level systemctl show
command to query the current raw values of the in and out packet and
byte counters:

# systemctl show ip-accounting-test -p IPIngressBytes -p IPIngressPackets -p IPEgressBytes -p IPEgressPackets
IPIngressBytes=37776
IPIngressPackets=449
IPEgressBytes=37776
IPEgressPackets=449
Of course, the same information is also available via the D-Bus
APIs. If you want to process this data further consider talking proper
D-Bus, rather than scraping the output of systemctl show.
Now, let's stop the service again:
# systemctl stop ip-accounting-test
When a service with such accounting turned on terminates, a log line
about all its consumed resources is written to the logs. Let's check:

# journalctl -u ip-accounting-test -n 5
-- Logs begin at Thu 2016-08-18 23:09:37 CEST, end at Mon 2017-10-09 18:17:02 CEST. --
Okt 09 18:15:50 sigma ping: 64 bytes from 8.8.8.8: icmp_seq=603 ttl=59 time=26.9 ms
Okt 09 18:15:51 sigma ping: 64 bytes from 8.8.8.8: icmp_seq=604 ttl=59 time=27.2 ms
Okt 09 18:15:52 sigma systemd: Stopping ip-accounting-test.service...
Okt 09 18:15:52 sigma systemd: Stopped ip-accounting-test.service.
Okt 09 18:15:52 sigma systemd: ip-accounting-test.service: Received 49.5K IP traffic, sent 49.5K IP traffic
The last line shown is the interesting one, that shows the accounting data. It's actually a structured log message, and among its metadata fields it contains the more comprehensive raw data:
# journalctl -u ip-accounting-test -n 1 -o verbose
-- Logs begin at Thu 2016-08-18 23:09:37 CEST, end at Mon 2017-10-09 18:18:50 CEST. --
Mon 2017-10-09 18:15:52.649028 CEST [s=89a2cc877fdf4dafb2269a7631afedad;i=14d7;b=4c7e7adcba0c45b69d612857270716d3;m=137592e75e;t=55b1f81298605;x=c3c9b57b28c9490e]
    PRIORITY=6
    _BOOT_ID=4c7e7adcba0c45b69d612857270716d3
    _MACHINE_ID=e87bfd866aea4ae4b761aff06c9c3cb3
    _HOSTNAME=sigma
    SYSLOG_FACILITY=3
    SYSLOG_IDENTIFIER=systemd
    _UID=0
    _GID=0
    _TRANSPORT=journal
    _PID=1
    _COMM=systemd
    _EXE=/usr/lib/systemd/systemd
    _CAP_EFFECTIVE=3fffffffff
    _SYSTEMD_CGROUP=/init.scope
    _SYSTEMD_UNIT=init.scope
    _SYSTEMD_SLICE=-.slice
    CODE_FILE=../src/core/unit.c
    _CMDLINE=/usr/lib/systemd/systemd --switched-root --system --deserialize 25
    _SELINUX_CONTEXT=system_u:system_r:init_t:s0
    UNIT=ip-accounting-test.service
    CODE_LINE=2115
    CODE_FUNC=unit_log_resources
    MESSAGE_ID=ae8f7b866b0347b9af31fe1c80b127c0
    INVOCATION_ID=98a6e756fa9d421d8dfc82b6df06a9c3
    IP_METRIC_INGRESS_BYTES=50880
    IP_METRIC_INGRESS_PACKETS=605
    IP_METRIC_EGRESS_BYTES=50880
    IP_METRIC_EGRESS_PACKETS=605
    MESSAGE=ip-accounting-test.service: Received 49.6K IP traffic, sent 49.6K IP traffic
    _SOURCE_REALTIME_TIMESTAMP=1507565752649028
The interesting fields of this log message are of course
IP_METRIC_INGRESS_BYTES=, IP_METRIC_INGRESS_PACKETS=,
IP_METRIC_EGRESS_BYTES= and IP_METRIC_EGRESS_PACKETS=, which show the
raw accounting data.
The log message carries a message ID
that may be used to quickly search for all such resource log messages
(it's ae8f7b866b0347b9af31fe1c80b127c0). We can combine a search term for
messages of this ID with
journalctl's -u switch to quickly find
out about the resource usage of any invocation of a specific
service. Let's try:
# journalctl -u ip-accounting-test MESSAGE_ID=ae8f7b866b0347b9af31fe1c80b127c0
-- Logs begin at Thu 2016-08-18 23:09:37 CEST, end at Mon 2017-10-09 18:25:27 CEST. --
Okt 09 18:15:52 sigma systemd: ip-accounting-test.service: Received 49.6K IP traffic, sent 49.6K IP traffic
Of course, the output above shows only one message at the moment, since we started the service only once, but a new one will appear every time you start and stop it again.
The IP accounting logic is also hooked up with systemd-run,
which is useful for transiently running a command as a systemd service
with IP accounting turned on. Let's try it:
# systemd-run -p IPAccounting=yes --wait wget https://cfp.all-systems-go.io/en/ASG2017/public/schedule/2.pdf
Running as unit: run-u2761.service
Finished with result: success
Main processes terminated with: code=exited/status=0
Service runtime: 878ms
IP traffic received: 231.0K
IP traffic sent: 3.7K
This uses wget to download the
PDF version of the 2nd day schedule
of everybody's favorite Linux user-space conference All Systems Go!
2017 (BTW, have you already booked your
ticket? We are very close to
selling out, be quick!). The IP traffic this command generated was
231K ingress and 4K egress. In the
systemd-run command line two
parameters are important. First of all, we use -p IPAccounting=yes
to turn on IP accounting for the transient service (as above). And
secondly we use
--wait to tell
systemd-run to wait for the service
to exit. If
--wait is used,
systemd-run will also show you various
statistics about the service that just ran and terminated, including
the IP statistics you are seeing if IP accounting has been turned on.
It's fun to combine this sort of IP accounting with interactive transient units. Let's try that:
# systemd-run -p IPAccounting=1 -t /bin/sh
Running as unit: run-u2779.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4# dnf update
…
sh-4.4# dnf install firefox
…
sh-4.4# exit
Finished with result: success
Main processes terminated with: code=exited/status=0
Service runtime: 5.297s
IP traffic received: …B
IP traffic sent: …B
This uses systemd-run's --pty switch (or short:
-t), which opens
an interactive pseudo-TTY connection to the invoked service process,
which is a bourne shell in this case. Doing this means we have a full,
comprehensive shell with job control and everything. Since the shell
is running as part of a service with IP accounting turned on, all IP
traffic we generate or receive will be accounted for. And as soon as
we exit the shell, we'll see what it consumed. (For the sake of
brevity I actually didn't paste the whole output above, but truncated
core parts. Try it out for yourself, if you want to see the output in
full.)
Sometimes it might make sense to turn on IP accounting for a unit that
is already running. For that, use
systemctl set-property
foobar.service IPAccounting=yes, which will instantly turn on
accounting for it. Note that it won't count retroactively though: only
the traffic sent/received after the point in time you turned it on
will be collected. You may turn off accounting for the unit again with
the same command, substituting IPAccounting=no.
Of course, sometimes it's interesting to collect IP accounting data
for all services, and turning on
IPAccounting=yes in every single
unit is cumbersome. To deal with that there's a global option
DefaultIPAccounting= available which can be set in
/etc/systemd/system.conf.
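In sketch form, the global default goes into the [Manager] section:

```ini
# /etc/systemd/system.conf
[Manager]
# Enable IP accounting for all units that don't explicitly override it
DefaultIPAccounting=yes
```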
So much about IP accounting. Let's now have a look at IP access
control with systemd 235. As mentioned above, the two new unit file
settings IPAddressAllow= and
IPAddressDeny= may be used for
that. They operate in the following way:
If the source address of an incoming packet or the destination
address of an outgoing packet matches one of the IP addresses/network
masks in the relevant unit's
IPAddressAllow= setting then it will be
allowed to go through.
Otherwise, if a packet matches an
IPAddressDeny= entry configured
for the service it is dropped.
If the packet matches neither of the above it is allowed to go through.
Or in other words,
IPAddressDeny= implements a blacklist, but
IPAddressAllow= takes precedence.
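Written out as unit file settings, the three rules combine like this (the 192.168.0.0/16 prefix is purely illustrative):

```ini
[Service]
# Packets matching an allow-list entry always go through
IPAddressAllow=192.168.0.0/16
IPAddressAllow=127.0.0.0/8
# Everything else is dropped
IPAddressDeny=any
```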
Let's try that out. Let's modify our last example above in order to get a transient service running an interactive shell which has such an access list set:
# systemd-run -p IPAddressDeny=any -p IPAddressAllow=8.8.8.8 -p IPAddressAllow=127.0.0.0/8 -t /bin/sh
Running as unit: run-u2850.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4# ping 8.8.8.8 -c1
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=59 time=27.9 ms

--- 8.8.8.8 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 27.957/27.957/27.957/0.000 ms
sh-4.4# ping 8.8.4.4 -c1
PING 8.8.4.4 (8.8.4.4) 56(84) bytes of data.
ping: sendmsg: Operation not permitted
^C
--- 8.8.4.4 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
sh-4.4# ping 127.0.0.2 -c1
PING 127.0.0.2 (127.0.0.2) 56(84) bytes of data.
64 bytes from 127.0.0.2: icmp_seq=1 ttl=64 time=0.116 ms

--- 127.0.0.2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.116/0.116/0.116/0.000 ms
sh-4.4# exit
The access list we set up uses
IPAddressDeny=any in order to define
an IP white-list: all traffic will be prohibited for the session,
except for what is explicitly white-listed. In this command line, we
white-listed two address prefixes: 8.8.8.8 (with no explicit network
mask, which means the mask with all bits turned on is implied,
/32), and 127.0.0.0/8. Thus, the service can communicate with
Google's DNS server and everything on the local loop-back, but nothing
else. The commands run in this interactive shell show this: first we
try pinging 8.8.8.8, which happily responds. Then, we try to ping
8.8.4.4 (that's Google's other DNS server, but excluded from this
white-list), and as we see it is immediately refused with an Operation
not permitted error. As a last step we ping 127.0.0.2 (which is on the
local loop-back), and we see it works fine again, as expected.
In the example above we used
IPAddressDeny=any. The any
identifier is a shortcut for writing 0.0.0.0/0 ::/0, i.e. it's a
shortcut for everything, on both IPv4 and IPv6. A number of other
such shortcuts exist. For example, instead of spelling out
127.0.0.0/8 we could also have used the more descriptive shortcut
localhost which is expanded to 127.0.0.0/8 ::1/128, i.e. everything
on the local loopback device, on both IPv4 and IPv6.
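With the shortcuts, a tight allow-list that still permits loopback plus one external host might look like this as a sketch (the 8.8.8.8 address is just an illustrative external host):

```ini
[Service]
# "any" expands to 0.0.0.0/0 ::/0; "localhost" to 127.0.0.0/8 ::1/128
IPAddressDeny=any
IPAddressAllow=localhost
IPAddressAllow=8.8.8.8
```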
Being able to configure IP access lists individually for each unit is
pretty nice already. However, typically one wants to configure this
comprehensively, not just for individual units, but for a set of units
in one go or even the system as a whole. In systemd, that's possible
by making use of slice
units (for those who don't know systemd that well, slice units are a
concept for organizing services in a hierarchical tree for the purpose of
resource management): the IP access list in effect for a unit is the
combination of the individual IP access lists configured for the unit
itself and those of all slice units it is contained in.
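For example, a lock-down of all system services could be expressed as a drop-in on system.slice; slice units take these settings in their [Slice] section (the drop-in file name here is a hypothetical choice):

```ini
# /etc/systemd/system/system.slice.d/50-ip-lockdown.conf (hypothetical)
[Slice]
IPAddressDeny=any
IPAddressAllow=localhost
```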
By default, system services are assigned to
system.slice,
which in turn is a child of the root slice
-.slice. Either
of these two slice units is hence suitable for locking down all
system services at once. If an access list is configured on
system.slice it will only apply to system services; however, if
configured on -.slice it will apply to all user processes of the
system, including all user session processes (which are by
default assigned to
user.slice, which is a child of
-.slice), in
addition to the system services.
Let's make use of this:
# systemctl set-property system.slice IPAddressDeny=any IPAddressAllow=localhost
# systemctl set-property apache.service IPAddressAllow=10.0.0.0/8
The two commands above are a very powerful way to first turn off all IP communication for all system services (with the exception of loop-back traffic), followed by an explicit white-listing of 10.0.0.0/8 (which could refer to the local company network, you get the idea) but only for the Apache service.
After playing around a bit with this, let's talk about use-cases. Here are a few ideas:
The IP access list logic can in many ways provide a more modern replacement for the venerable TCP Wrappers, but unlike them it applies to all IP sockets of a service unconditionally, and requires no explicit support in the service's code: no patching required. On the other hand, TCP Wrappers have a number of features this scheme cannot cover, most importantly systemd's IP access lists operate solely on the level of IP addresses and network masks; there is no way to configure access by DNS name (though quite frankly, that is a very dubious feature anyway, as doing networking — unsecured networking even — in order to restrict networking sounds quite questionable, at least to me).
It can also replace (or augment) some facets of IP firewalling,
i.e. Linux NetFilter/
iptables. Right now, systemd's access lists are
of course a lot more minimal than NetFilter, but they have one major
benefit: they understand the service concept, and thus are a lot more
context-aware than NetFilter. Classic firewalls, such as NetFilter,
derive most service context from the IP port number alone, but we live
in a world where IP port numbers are a lot more dynamic than they used
to be. As one example, a BitTorrent client or server may use any IP
port it likes for its file transfer, and writing IP firewalling rules
matching that precisely is hence hard. With the systemd IP access list
implementing this is easy: just set the list for your BitTorrent
service unit, and all is good.
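As a sketch of that (transmission-daemon and the 10.0.0.0/8 prefix are illustrative placeholders):

```ini
# bittorrent.service (hypothetical)
[Service]
ExecStart=/usr/bin/transmission-daemon -f
# Applies to every socket of the service, whatever ports it picks
IPAddressDeny=any
IPAddressAllow=10.0.0.0/8
```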
Let me stress though that you should be careful when comparing NetFilter with systemd's IP address list logic, it's really like comparing apples and oranges: to start with, the IP address list logic has a clearly local focus, it only knows what a local service is and manages access of it. NetFilter on the other hand may run on border gateways, at a point where the traffic flowing through is pure IP, carrying no information about a systemd unit concept or anything like that.
It's a simple way to lock down distribution/vendor supplied system
services by default. For example, if you ship a service that you know
never needs to access the network, then simply set
IPAddressDeny=any
(possibly combined with
IPAddressAllow=localhost) for it, and it
will live in a very tight networking sand-box it cannot escape
from. systemd itself makes use of this for a number of its services by
default now. For example, the logging service
systemd-journald.service, the login manager
systemd-logind or the
core-dump processing unit
systemd-coredump@.service all have such a
rule set out-of-the-box, because we know that none of these
services should be able to access the network, under any
circumstances.
Because the IP access list logic can be combined with transient
units, it can be used to quickly and effectively sandbox arbitrary
commands, and even include them in shell pipelines and such. For
example, let's say we don't trust our
curl implementation (maybe it
got modified locally by a hacker, and phones home?), but want to use
it anyway to download the slides of my most recent casync
talk in order to
print them, but want to make sure it doesn't connect anywhere except
where we tell it to (and to make this even more fun, let's minimize
privileges further, by setting
DynamicUser=yes):
# systemd-resolve 0pointer.de
0pointer.de: 85.214.157.71
             2a01:238:43ed:c300:10c3:bcf3:3266:da74
-- Information acquired via protocol DNS in 2.8ms.
-- Data is authenticated: no
# systemd-run --pipe -p IPAddressDeny=any \
      -p IPAddressAllow=85.214.157.71 \
      -p IPAddressAllow=2a01:238:43ed:c300:10c3:bcf3:3266:da74 \
      -p DynamicUser=yes \
      curl http://0pointer.de/public/casync-kinvolk2017.pdf | lp
So much about use-cases. This is by no means a comprehensive list of what you can do with it, after all both IP accounting and IP access lists are very generic concepts. But I do hope the above inspires your fantasy.
IP accounting and IP access control are primarily concepts for the
local administrator. However, as suggested above, it's a very good
idea to ship services that by design have no network-facing
functionality with an access list of
IPAddressDeny=any (and possibly
IPAddressAllow=localhost), in order to improve the out-of-the-box
security of our systems.
An option for security-minded distributions might be a more radical
approach: ship the system with
IPAddressDeny=any by default, and ask the administrator to punch
holes into that for each network-facing service with
systemctl
set-property … IPAddressAllow=…. But of course, that's only an
option for distributions willing to break compatibility with what came
before.
A couple of additional notes:
IP accounting and access lists may be mixed with socket activation. In this case, it's a good idea to configure access lists and accounting for both the socket unit that activates and the service unit that is activated, as both units maintain fully separate settings. Note that IP accounting and access lists configured on the socket unit applies to all sockets created on behalf of that unit, and even if these sockets are passed on to the activated services, they will still remain in effect and belong to the socket unit. This also means that IP traffic done on such sockets will be accounted to the socket unit, not the service unit. The fact that IP access lists are maintained separately for the kernel sockets created on behalf of the socket unit and for the kernel sockets created by the service code itself enables some interesting uses. For example, it's possible to set a relatively open access list on the socket unit, but a very restrictive access list on the service unit, thus making the sockets configured through the socket unit the only way in and out of the service.
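A sketch of that last pattern, with hypothetical foo.socket/foo.service units: the socket unit carries a comparatively open list, while the service unit forbids its own sockets any traffic, so the activated socket is the only way in and out:

```ini
# foo.socket (hypothetical)
[Socket]
ListenStream=8080
IPAddressDeny=any
IPAddressAllow=10.0.0.0/8

# foo.service (hypothetical)
[Service]
ExecStart=/usr/bin/foo-daemon
# Sockets created by the service itself may talk to nobody
IPAddressDeny=any
```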
systemd's IP accounting and access lists apply to IP sockets only,
not to sockets of any other address families. That also means that
AF_PACKET (i.e. raw) sockets are not covered. This means it's a good
idea to combine IP access lists with
RestrictAddressFamilies=
in order to lock this down.
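A hedged sketch of that combination (whether a given daemon also needs AF_NETLINK or other families depends on the service):

```ini
[Service]
IPAddressDeny=any
IPAddressAllow=localhost
# IP access lists don't cover non-IP families such as AF_PACKET;
# restrict the allowed socket families as well
RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6
```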
You may wonder if the per-unit resource log message and
systemd-run --wait may also show you details about other types of
resources consumed by a service. The answer is yes: if you turn on
CPUAccounting= for a service, you'll also see a summary of consumed
CPU time in the log message and the command output. And we are
planning to hook-up
IOAccounting= the same way too, soon.
Note that IP accounting and access lists aren't entirely free. systemd inserts an eBPF program into the IP pipeline to make this functionality work. However, eBPF execution has already been optimized for speed in recent kernel versions, and given that it is currently a focus of interest for many, I'd expect it to be optimized even further, so that the cost for enabling these features will be negligible, if it isn't already.
IP accounting is currently not recursive. That means you cannot use a slice unit to join the accounting of multiple units into one. This is something we definitely want to add, but requires some more kernel work first.
You might wonder how the
PrivateNetwork=
setting relates to
IPAddressDeny=any. Superficially they have similar
effects: they make the network unavailable to services. However,
looking more closely there are a number of
differences.
PrivateNetwork= is implemented using Linux network
name-spaces. As such it entirely detaches all networking of a service
from the host, including non-IP networking. It does so by creating a
private little environment the service lives in where communication
with itself is still allowed though. In addition, using the
JoinsNamespaceOf=
dependency additional services may be added to the same environment,
thus permitting communication with each other but not with anything
outside of this group.
IPAddressAllow= and
IPAddressDeny= are much
less invasive. First of all they apply to IP networking only, and can
match against specific IP addresses. A service running with
PrivateNetwork= turned off but
IPAddressDeny=any turned on, may
enumerate the network interfaces and their IP configured even though
it cannot actually do any IP communication. On the other hand if you
turn on
PrivateNetwork= all network interfaces besides
lo
disappear. Long story short: depending on your use-case one, the other,
both or neither might be suitable for sand-boxing of your service. If
possible I'd always turn on both, for best security, and that's what
we do for all of systemd's own long-running services.
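As a sketch, turning on both for a service that needs no networking at all:

```ini
[Service]
# Detach the service into its own network name-space (loopback only)
PrivateNetwork=yes
# Additionally forbid all IP traffic except the loopback device
IPAddressDeny=any
IPAddressAllow=localhost
```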
And that's all for now. Have fun with per-unit IP accounting and access lists!
TL;DR: you may now configure systemd to dynamically allocate a UNIX user ID for service processes when it starts them and release it when it stops them. It's pretty secure, mixes well with transient services, socket activated services and service templating.
Today we released systemd 235. Among other improvements this greatly extends the dynamic user logic of systemd. Dynamic users are a powerful but little known concept, supported in its basic form since systemd 232. With this blog story I hope to make it a bit better known.
The UNIX user concept is the most basic and well-understood security concept in POSIX operating systems. It is UNIX/POSIX' primary security concept, the one everybody can agree on, and most security concepts that came after it (such as process capabilities, SELinux and other MACs, user name-spaces, …) in some form or another build on it, extend it or at least interface with it. If you build a Linux kernel with all security features turned off, the user concept is pretty much the one you'll still retain.
Originally, the user concept was introduced to make multi-user systems
a reality, i.e. systems enabling multiple human users to share the
same system at the same time, cleanly separating their resources and
protecting them from each other. The majority of today's UNIX systems
don't really use the user concept like that anymore though. Most of
today's systems probably have only one actual human user (or even
less!), but their user databases (
/etc/passwd) list a good number
more entries than that. Today, the majority of UNIX users in most
environments are system users, i.e. users that are not the technical
representation of a human sitting in front of a PC anymore, but the
security identity a system service — an executable program — runs
as. Even though traditional, simultaneous multi-user systems slowly
became less relevant, their ground-breaking basic concept became the
cornerstone of UNIX security. The OS is nowadays partitioned into
isolated services — and each service runs as its own system user, and
thus within its own, minimal security context.
The people behind the Android OS realized the relevance of the UNIX user concept as the primary security concept on UNIX, and took its use even further: on Android not only system services take benefit of the UNIX user concept, but each UI app gets its own, individual user identity too — thus neatly separating app resources from each other, and protecting app processes from each other, too.
Back in the more traditional Linux world things are a bit less advanced in this area. Even though users are the quintessential UNIX security concept, allocation and management of system users is still a pretty limited, raw and static affair. In most cases, RPM or DEB package installation scripts allocate a fixed number of (usually one) system users when you install the package of a service that wants to take benefit of the user concept, and from that point on the system user remains allocated on the system and is never deallocated again, even if the package is later removed again. Most Linux distributions limit the number of system users to 1000 (which isn't particularly a lot). Allocating a system user is hence expensive: the number of available users is limited, and there's no defined way to dispose of them after use. If you make use of system users too liberally, you are very likely to run out of them sooner rather than later.
You may wonder why system users are generally not deallocated when the package that registered them is uninstalled from a system (at least on most distributions). The reason for that is one relevant property of the user concept (you might even want to call this a design flaw): user IDs are sticky to files (and other objects such as IPC objects). If a service running as a specific system user creates a file at some location, and is then terminated and its package and user removed, then the created file still belongs to the numeric ID ("UID") the system user originally got assigned. When the next system user is allocated and — due to ID recycling — happens to get assigned the same numeric ID, then it will also gain access to the file, and that's generally considered a problem, given that the file belonged to a potentially very different service once upon a time, and likely should not be readable or changeable by anything coming after it. Distributions hence tend to avoid UID recycling which means system users remain registered forever on a system after they have been allocated once.
The above is a description of the status quo ante. Let's now focus on what systemd's dynamic user concept brings to the table, to improve the situation.
With systemd dynamic users we hope to make it easier and cheaper to allocate system users on-the-fly, thus substantially increasing the possible uses of this core UNIX security concept.
If you write a systemd service unit file, you may enable the dynamic
user logic for it by setting the
DynamicUser=
option in its
[Service] section to
yes. If you do, a system user is
dynamically allocated the instant the service binary is invoked, and
released again when the service terminates. The user is automatically
allocated from the UID range 61184–65519, by looking for a so far
unused UID.
Now you may wonder, how does this concept deal with the sticky user issue discussed above? In order to counter the problem, two strategies easily come to mind:
Prohibit the service from creating any files/directories or IPC objects
Automatically removing the files/directories or IPC objects the service created when it shuts down.
In systemd we implemented both strategies, but for different parts of the execution environment. Specifically:
The ProtectSystem= and ProtectHome=
sand-boxing options turn off write access to pretty much the whole OS
directory tree, with a few relevant exceptions, such as the API file
systems /proc,
/sys and so on, as well as
/tmp and
/var/tmp. (BTW: setting these two options on your regular services
that do not use
DynamicUser= is a good idea too, as it drastically
reduces the exposure of the system to exploited services.)
The PrivateTmp= option sets up
/tmp and
/var/tmp for the service in a way that it
gets its own, disconnected version of these directories, that are not
shared by other services, and whose life-cycle is bound to the
service's own life-cycle. Thus if the service goes down, the user is
removed and all its temporary files and directories with it. (BTW: as
above, consider setting this option for your regular services that do
DynamicUser= too, it's a great way to lock things down
security-wise.) The RemoveIPC=
option ensures that when the service goes down all SysV and POSIX IPC
objects (shared memory, message queues, semaphores) owned by the
service's user are removed. Thus, the life-cycle of the IPC objects is
bound to the life-cycle of the dynamic user and service, too. (BTW:
yes, here too, consider using this in your regular services, too!)
With these four settings in effect, services with dynamic users are
nicely sand-boxed. They cannot create files or directories, except in
/var/tmp, where they will be removed automatically when
the service shuts down, as will any IPC objects created. Sticky
ownership of files/directories and IPC objects is hence dealt with
effectively. The RuntimeDirectory=
option may be used to open up the sandbox a bit to external
programs. If you set it to a directory name of your choice, it will be
created below
/run when the service is started, and removed in its
entirety when it is terminated. The ownership of the directory is
assigned to the service's dynamic user. This way, a dynamic user
service can expose API interfaces (AF_UNIX sockets, …) to other
services at a well-defined place and again bind the life-cycle of it to
the service's own run-time. Example: set
RuntimeDirectory=foobar in
your service, and watch how a directory
/run/foobar appears at the
moment you start the service, and disappears the moment you stop
it again. (BTW: Much like the other settings discussed above,
RuntimeDirectory= may be used outside of the
DynamicUser= context
too, and is a nice way to run any service with a properly owned,
life-cycle-managed run-time directory.)
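Putting the pieces together, a minimal sketch of a dynamic-user service with a managed run-time directory (the unit and binary names are placeholders):

```ini
# foobar.service (hypothetical)
[Service]
ExecStart=/usr/bin/foobar-daemon
# Allocate a system user on start, release it on stop
DynamicUser=yes
# /run/foobar: created on start, owned by the dynamic user, removed on stop
RuntimeDirectory=foobar
```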
Of course, a service running in such an environment (although already very useful for many cases!), has a major limitation: it cannot leave persistent data around it can reuse on a later run. As pretty much the whole OS directory tree is read-only to it, there's simply no place it could put the data that survives from one service invocation to the next.
With systemd 235 this limitation is removed: there are now three new
settings: StateDirectory=, LogsDirectory= and
CacheDirectory=. In many ways they operate like
RuntimeDirectory=, but create sub-directories below
/var/lib, /var/log and
/var/cache, respectively. There's one major
difference beyond that however: directories created that way are
persistent, they will survive the run-time cycle of a service, and
thus may be used to store data that is supposed to stay around between
invocations of the service.
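In sketch form, a service using all three (the directory name is a placeholder):

```ini
[Service]
DynamicUser=yes
# Persistent across invocations, surfaced as /var/lib/foobar
StateDirectory=foobar
# Surfaced as /var/cache/foobar
CacheDirectory=foobar
# Surfaced as /var/log/foobar
LogsDirectory=foobar
```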
Of course, the obvious question to ask now is: how do these three settings deal with the sticky file ownership problem?
For that we lifted a concept from container managers. Container
managers have a very similar problem: each container and the host
typically end up using a very similar set of numeric UIDs, and unless
user name-spacing is deployed this means that host users might be able
to access the data of specific containers that also have a user by the
same numeric UID assigned, even though it actually refers to a very
different identity in a different context. (Actually, it's even worse
than just getting access, due to the existence of
setuid file bits,
access might translate to privilege elevation.) The way container
managers protect the container images from the host (and from each
other to some level) is by placing the container trees below a
boundary directory, with very restrictive access modes and ownership
(root:root or so). A host user hence cannot take advantage
of the files/directories of a container user of the same UID inside of
a local container tree, simply because the boundary directory makes it
impossible to even reference files in it. After all on UNIX, in order
to get access to a specific path you need access to every single
component of it.
How is that applied to dynamic user services? Let's say
StateDirectory=foobar is set for a service that has
DynamicUser=
turned off. The instant the service is started,
/var/lib/foobar is
created as state directory, owned by the service's user, and remains in
existence when the service is stopped. If the same service now is run
with DynamicUser= turned on, the implementation is slightly
altered. Instead of a directory
/var/lib/foobar a symbolic link by
the same path is created (owned by root), pointing to
/var/lib/private/foobar (the latter being owned by the service's
dynamic user). The
/var/lib/private directory is created as boundary
directory: it's owned by
root:root, and has a restrictive access
mode of 0700. Both the symlink and the service's state directory will
survive the service's life-cycle: the state directory will remain,
and continues to be owned by the now disposed dynamic UID — however it
is protected from other host users (and other services which might get
the same dynamic UID assigned due to UID recycling) by the boundary
directory.
The obvious question to ask now is: but if the boundary directory
prohibits access to the directory from unprivileged processes, how can
the service itself which runs under its own dynamic UID access it
anyway? This is achieved by invoking the service process in a slightly
modified mount name-space: it will see most of the file hierarchy the
same way as everything else on the system (modulo
/tmp and
/var/tmp as mentioned above), except for
/var/lib/private, which
is over-mounted with a read-only
tmpfs file system instance, with a
slightly more liberal access mode permitting the service read
access. Inside of this
tmpfs file system instance another mount is
placed: a bind mount to the host's real
/var/lib/private
directory, onto the same name. Putting this together, this means that
superficially everything looks the same and is available at the same
place on the host and from inside the service, but two important
changes have been made: the
/var/lib/private boundary directory lost
its restrictive character inside the service, and has been emptied of
the state directories of any other service, thus making the protection
complete. Note that the symlink
/var/lib/foobar hides the fact that
the boundary directory is used (making it little more than an
implementation detail), as the directory is available this way under
the same name as it would be if
DynamicUser= was not used. Long
story short: for the daemon, and from the host's point of view, the use of
/var/lib/private is mostly transparent.
This logic of course raises another question: what happens to the state directory if a dynamic user service is started with a state directory configured, gets UID X assigned on this first invocation, then terminates and is restarted and now gets UID Y assigned on the second invocation, with X ≠ Y? On the second invocation the directory — and all the files and directories below it — will still be owned by the original UID X so how could the second instance running as Y access it? Our way out is simple: systemd will recursively change the ownership of the directory and everything contained within it to UID Y before invoking the service's executable.
Of course, such recursive ownership changing (
chown()ing) of whole
directory trees can become expensive (though according to my
experiences, IRL and for most services it's much cheaper than you
might think), hence in order to optimize behavior in this regard, the
allocation of dynamic UIDs has been tweaked in two ways to avoid the
necessity to do this expensive operation in most cases: firstly, when
a dynamic UID is allocated for a service an allocation loop is
employed that starts out with a UID hashed from the service's
name. This means a service by the same name is likely to always use
the same numeric UID. That means that a stable service name translates
into a stable dynamic UID, and that means recursive file ownership
adjustments can be skipped (of course, after validation). Secondly, if
the configured state directory already exists, and is owned by a
suitable currently unused dynamic UID, it's preferably used above
everything else, thus maximizing the chance we can avoid the
chown()ing. (That all said, ultimately we have to face it, the
currently available UID space of 4K+ is very small still, and
conflicts are pretty likely sooner or later, thus a chown()ing has to
be expected every now and then when this feature is used extensively).
CacheDirectory= and
LogsDirectory= work very similarly to
StateDirectory=. The only difference is that they manage directories
below the /var/cache and
/var/log directories, and their boundary
directories hence are
/var/cache/private and /var/log/private, respectively.
So, after all this introduction, let's have a look how this all can be put together. Here's a trivial example:
# cat > /etc/systemd/system/dynamic-user-test.service <<EOF
[Service]
ExecStart=/usr/bin/sleep 4711
DynamicUser=yes
EOF
# systemctl daemon-reload
# systemctl start dynamic-user-test
# systemctl status dynamic-user-test
● dynamic-user-test.service
   Loaded: loaded (/etc/systemd/system/dynamic-user-test.service; static; vendor preset: disabled)
   Active: active (running) since Fri 2017-10-06 13:12:25 CEST; 3s ago
 Main PID: 2967 (sleep)
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/dynamic-user-test.service
           └─2967 /usr/bin/sleep 4711

Okt 06 13:12:25 sigma systemd: Started dynamic-user-test.service.
# ps -e -o pid,comm,user | grep 2967
 2967 sleep           dynamic-user-test
# id dynamic-user-test
uid=64642(dynamic-user-test) gid=64642(dynamic-user-test) groups=64642(dynamic-user-test)
# systemctl stop dynamic-user-test
# id dynamic-user-test
id: ‘dynamic-user-test’: no such user
In this example, we create a unit file with
DynamicUser= turned on,
start it, check if it's running correctly, have a look at the service
process' user (which is named like the service; systemd does this
automatically if the service name is suitable as user name, and you
didn't configure any user name to use explicitly), stop the service
and verify that the user ceased to exist too.
That's already pretty cool. Let's step it up a notch, by doing the
same in an interactive transient service (for those who don't know
systemd well: a transient service is a service that is defined and
started dynamically at run-time, for example via the
systemd-run
command from the shell. Think: run a service without having to write a
unit file first):
# systemd-run --pty --property=DynamicUser=yes --property=StateDirectory=wuff /bin/sh
Running as unit: run-u15750.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4$ id
uid=63122(run-u15750) gid=63122(run-u15750) groups=63122(run-u15750) context=system_u:system_r:initrc_t:s0
sh-4.4$ ls -al /var/lib/private/
total 0
drwxr-xr-x. 3 root       root        60  6. Okt 13:21 .
drwxr-xr-x. 1 root       root       852  6. Okt 13:21 ..
drwxr-xr-x. 1 run-u15750 run-u15750   8  6. Okt 13:22 wuff
sh-4.4$ ls -ld /var/lib/wuff
lrwxrwxrwx. 1 root root 12  6. Okt 13:21 /var/lib/wuff -> private/wuff
sh-4.4$ ls -ld /var/lib/wuff/
drwxr-xr-x. 1 run-u15750 run-u15750 0  6. Okt 13:21 /var/lib/wuff/
sh-4.4$ echo hello > /var/lib/wuff/test
sh-4.4$ exit
exit
# id run-u15750
id: ‘run-u15750’: no such user
# ls -al /var/lib/private
total 0
drwx------. 1 root  root   66  6. Okt 13:21 .
drwxr-xr-x. 1 root  root  852  6. Okt 13:21 ..
drwxr-xr-x. 1 63122 63122   8  6. Okt 13:22 wuff
# ls -ld /var/lib/wuff
lrwxrwxrwx. 1 root root 12  6. Okt 13:21 /var/lib/wuff -> private/wuff
# ls -ld /var/lib/wuff/
drwxr-xr-x. 1 63122 63122 8  6. Okt 13:22 /var/lib/wuff/
# cat /var/lib/wuff/test
hello
The above invokes an interactive shell as transient service
run-u15750.service (as you can see,
systemd-run picked that name automatically,
since we didn't specify anything explicitly) with a dynamic user whose
name is derived automatically from the service name. Because
StateDirectory=wuff is used, a persistent state directory for the
service is made available as
/var/lib/wuff. In the interactive shell
running inside the service, the
ls commands show the
/var/lib/private boundary directory and its contents, as well as the
symlink that is placed for the service. Finally, before exiting the
shell, a file is created in the state directory. Back in the original
command shell we check if the user is still allocated: it is not, of
course, since the service ceased to exist when we exited the shell and
with it the dynamic user associated with it. From the host we check
the state directory of the service, with similar commands as we did
from inside of it. We see that things are set up pretty much the same
way in both cases, except for two things: first of all the user/group
of the files is now shown as raw numeric UIDs instead of the
user/group names derived from the unit name. That's because the user
ceased to exist at this point, and "ls" shows the raw UID for files
owned by users that don't exist. Secondly, the access mode of the
boundary directory is different: when we look at it from outside of
the service it is not readable by anyone but root, while from inside
we saw it being world-readable.
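The directory layout described above can be sketched in plain shell, independent of systemd (a simplified illustration of the structure, not what systemd actually runs): a root-only boundary directory, the real state directory inside it, and a stable symlink pointing through it.

```shell
# Simplified sketch of the layout systemd sets up (illustration only):
# <root>/private       - boundary directory, accessible to root only (0700)
# <root>/private/wuff  - the actual state directory, owned by the service user
# <root>/wuff          - stable symlink under which the service finds its state
root=$(mktemp -d)
mkdir -m 0700 "$root/private"
mkdir "$root/private/wuff"
ln -s private/wuff "$root/wuff"
stat -c '%a %n' "$root/private"   # mode 700: unprivileged users can't traverse
readlink "$root/wuff"             # private/wuff
```

The boundary directory keeps the (possibly recycled) UID-owned state directory out of reach of other unprivileged users, while the symlink gives the service a stable, well-known path.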
Now, let's see how things look if we start another transient service, reusing the state directory from the first invocation:
# systemd-run --pty --property=DynamicUser=yes --property=StateDirectory=wuff /bin/sh
Running as unit: run-u16087.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4$ cat /var/lib/wuff/test
hello
sh-4.4$ ls -al /var/lib/wuff/
total 4
drwxr-xr-x. 1 run-u16087 run-u16087  8  6. Okt 13:22 .
drwxr-xr-x. 3 root       root       60  6. Okt 15:42 ..
-rw-r--r--. 1 run-u16087 run-u16087  6  6. Okt 13:22 test
sh-4.4$ id
uid=63122(run-u16087) gid=63122(run-u16087) groups=63122(run-u16087) context=system_u:system_r:initrc_t:s0
sh-4.4$ exit
exit
systemd-run picked a different auto-generated unit name, but
the used dynamic UID is still the same, as it was read from the
pre-existing state directory, and was otherwise unused. As we can see
the test file we generated earlier is accessible and still contains
the data we left in there. Do note that the user name is different
this time (as it is derived from the unit name, which is different),
but the UID it is assigned to is the same one as on the first
invocation. We can thus see that the mentioned optimization of the UID
allocation logic (i.e. that we start the allocation loop from the UID
owner of any existing state directory) took effect, so that no
chown()ing was required.
And that's the end of our example, which hopefully illustrated a bit how this concept and implementation works.
Now that we had a look at how to enable this logic for a unit and how it is implemented, let's discuss where this actually could be useful in real life.
One major benefit of dynamic user IDs is that running a privilege-separated service leaves no artifacts in the system. A system user is allocated and made use of, but it is discarded automatically in a safe and secure way after use, in a fashion that is safe for later recycling. Thus, quickly invoking a short-lived service for processing some job can be protected properly through a user ID without having to pre-allocate it and without this draining the available UID pool any longer than necessary.
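As a sketch of such a short-lived job, packaged as a one-shot unit (service name and binary are hypothetical here), the user exists only while the job runs:

```
# cleanup-job.service (hypothetical) — a short-lived, privilege-separated job.
# The dynamic user is allocated at start and released when the job exits.
[Unit]
Description=One-shot cleanup job under a dynamic user

[Service]
Type=oneshot
ExecStart=/usr/bin/cleanup-tool
DynamicUser=yes
StateDirectory=cleanup-job
```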
In many cases, starting a service no longer requires
package-specific preparation. Or in other words, quite often
useradd and mkdir/chown/chmod invocations in "postinst" package
scripts, as well as sysusers.d and tmpfiles.d
drop-ins become unnecessary, as the DynamicUser= and
StateDirectory= (and related CacheDirectory= and LogsDirectory=) logic can do the
necessary work automatically, on-demand and with a well-defined
life-cycle for each service invocation.
By combining dynamic user IDs with the transient unit concept, new
creative ways of sand-boxing are made available. For example, let's say
you don't trust the correct implementation of the
sort command. You
can now lock it into a simple, robust, dynamic UID sandbox with a
single systemd-run invocation and still integrate it into a shell pipeline like
any other command. Here's an example, showcasing a shell pipeline
whose middle element runs as a dynamically on-the-fly allocated UID
that is released when the pipeline ends.
# cat some-file.txt | systemd-run --pipe --property=DynamicUser=1 sort -u | grep -i foobar > some-other-file.txt
By combining dynamic user IDs with the systemd templating logic it
is now possible to do much more fine-grained and fully automatic UID
management. For example, let's say you have a template unit file
myfoobarserviced@.service:

[Service]
ExecStart=/usr/bin/myfoobarserviced
DynamicUser=1
StateDirectory=foobar/%i
Now, let's say you want to start one instance of this service for each of your customers. All you need to do now for that is:
# systemctl enable myfoobarserviced@customerxyz.service --now
And you are done. (Invoke this as many times as you like, each time
replacing customerxyz by some customer identifier; you get a new
dynamic user and state directory each time.)
By combining dynamic user IDs with socket activation you may easily
implement a system where each incoming connection is served by a
process instance running as a different, fresh, newly allocated UID
within its own sandbox. Here's an example waldo.socket:

[Socket]
ListenStream=2048
Accept=yes
With a matching waldo@.service:

[Service]
ExecStart=-/usr/bin/myservicebinary
DynamicUser=yes
With the two unit files above, systemd will listen on TCP/IP port
2048, and for each incoming connection invoke a fresh instance of
waldo@.service, each time utilizing a different, new,
dynamically allocated UID, neatly isolated from any other instance
of the service.
Dynamic user IDs combine very well with state-less systems,
i.e. systems that come up with an unpopulated /etc and /var. A
service using dynamic user IDs and StateDirectory= (and friends)
will implicitly allocate the users and directories it needs for
running, right at the moment where it needs it.
Dynamic users are a very generic concept, hence a multitude of other uses are thinkable; the list above is just supposed to trigger your imagination.
I am pretty sure that a large number of services shipped with today's
distributions could benefit from using
StateDirectory= (and related settings). It often allows removal of
post-inst packaging scripts altogether, as well as any
tmpfiles.d drop-ins by unifying the needed declarations in the
unit file itself. Hence, as a packager please consider switching your
unit files over. That said, there are a number of conditions where
StateDirectory= (and friends) cannot or should
not be used. To name a few:
Services that need to write to files outside of the service's own
runtime, state, cache and log directories (or /dev/shm) are generally incompatible with this
scheme. This rules out daemons that upgrade the system as one example,
as that involves writing to /usr.
Services that maintain a herd of processes with different user IDs. Some SMTP services are like this. If your service has such a super-server design, UID management needs to be done by the super-server itself, which rules out systemd doing its dynamic UID magic for it.
Services which run as root (obviously…) or are otherwise privileged.
Services that need to live in the same mount name-space as the host
system (for example, because they want to establish mount points
visible system-wide). As mentioned, DynamicUser= implies
ProtectSystem=, PrivateTmp= and related options, which all require
the service to run in its own mount name-space.
If your focus is older distributions, i.e. distributions that do not
have systemd 232 (for
DynamicUser=) or systemd 235 (for
StateDirectory= and friends) yet.
If your distribution's packaging guides don't allow it. Consult your packaging guides, and possibly start a discussion on your distribution's mailing list about this.
A couple of additional, random notes about the implementation and use of these features:
Do note that allocating or deallocating a dynamic user leaves
/etc/passwd untouched. A dynamic user is added into the user
database through the glibc NSS module nss-systemd,
and this information never hits the disk.
On traditional UNIX systems it was the job of the daemon process
itself to drop privileges, while the
DynamicUser= concept is
designed around the service manager (i.e. systemd) being responsible
for that. That said, since v235 there's a way to marry
DynamicUser= and such services which want to drop privileges on their own. For
that, turn on
DynamicUser= and set
User= to the user name the service wants to
setuid() to. This has the
effect that systemd will allocate the dynamic user under the specified
name when the service is started. Then, prefix the command line you
specify in ExecStart= with a single
! character. If you do, the user is allocated for the
service, but the daemon binary is invoked as
root instead of the
allocated user, under the assumption that the daemon changes its UID
on its own the right way. Note that after registration the user will
show up instantly in the user database, and is hence resolvable like
any other by the daemon process.
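A minimal unit sketch of this pattern (binary and user name are hypothetical): systemd allocates the dynamic user under the given name, but invokes the binary as root because of the leading "!", trusting it to setuid() by itself:

```
[Service]
# The leading "!" means: allocate the dynamic user, but start as root;
# the daemon is expected to drop to the allocated user on its own.
ExecStart=!/usr/bin/myfoobard
DynamicUser=yes
User=myfoobard
```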
You may wonder why systemd uses the UID range 61184–65519 for its
dynamic user allocations (side note: in hexadecimal this reads as
0xEF00–0xFFEF). That's because distributions (specifically Fedora)
tend to allocate regular users from below the 60000 range, and we
don't want to step into that. We also want to stay away from 65535 and
a bit around it, as some of these UIDs have special meanings (65535 is
often used as special value for "invalid" or "no" UID, as it is
identical to the 16bit value -1; 65534 is generally mapped to the
"nobody" user, and is where some kernel subsystems map unmappable
UIDs). Finally, we want to stay within the 16bit range. In a user
name-spacing world each container tends to have much less than the full
32bit UID range available that Linux kernels theoretically
provide. Everybody apparently can agree that a container should at
least cover the 16bit range though — already to include a nobody
user. (And quite frankly, I am pretty sure assigning 64K UIDs per
container is nicely systematic, as the higher 16bit of the 32bit
UID values this way become a container ID, while the lower 16bit
become the logical UID within each container, if you still follow what
I am babbling here…). And before you ask: no, this range cannot be
changed right now, it's compiled in. We might make it configurable
eventually, should this turn out to be a problem.
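The arithmetic behind this can be checked quickly in shell: the range bounds in hexadecimal, and how a 32-bit UID would split into a 16-bit container ID plus a 16-bit per-container UID (the container number used here is made up):

```shell
# The dynamic-UID range bounds, shown in hexadecimal.
printf 'range: 0x%X-0x%X\n' 61184 65519      # prints: range: 0xEF00-0xFFEF

# How a 32-bit UID splits under the "64K UIDs per container" scheme:
uid=$(( (5 << 16) | 1000 ))                  # hypothetical: container 5, inner UID 1000
echo "container=$(( uid >> 16 )) inner=$(( uid & 0xFFFF ))"
```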
You might wonder what happens if you already used UIDs from the 61184–65519 range on your system for other purposes. systemd should handle that mostly fine, as long as that usage is properly registered in the user database: when allocating a dynamic user we pick a UID, see if it is currently used somehow, and if yes pick a different one, until we find a free one. Whether a UID is used right now or not is checked through NSS calls. Moreover the IPC object lists are checked to see if there are any objects owned by the UID we are about to pick. This means systemd will avoid using UIDs you have assigned otherwise. Note however that this of course makes the pool of available UIDs smaller, and in the worst cases this means that allocating a dynamic user might fail because there simply are no unused UIDs in the range.
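The allocation strategy described above can be sketched as a simple loop (a simplification: real systemd additionally checks IPC object ownership and handles races): query the user database for each candidate UID and take the first one that is unused.

```shell
# Simplified sketch of the dynamic-UID allocation loop (illustration only;
# systemd additionally checks SysV IPC object ownership and handles races).
pick_free_uid() {
    uid=61184
    while [ "$uid" -le 65519 ]; do
        # getent consults all configured NSS sources, not just /etc/passwd
        if ! getent passwd "$uid" >/dev/null 2>&1; then
            echo "$uid"
            return 0
        fi
        uid=$((uid + 1))
    done
    return 1   # no unused UID left in the range: allocation fails
}
pick_free_uid
```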
If not specified otherwise the name for a dynamically allocated
user is derived from the service name. Not everything that's valid in
a service name is valid in a user-name however, and in some cases a
randomized name is used instead to deal with this. Often it makes
sense to pick the user names to register explicitly. For that use
User= and choose whatever you like.
If you pick a user name with
User= and combine it with
DynamicUser= and the user already exists statically it will be used
for the service and the dynamic user logic is automatically
disabled. This permits automatic up- and downgrades between static and
dynamic UIDs. For example, it provides a nice way to move a system
from static to dynamic UIDs in a compatible way: as long as you select
the same User= value before and after switching,
the service will continue to use the statically allocated user if it
exists, and only operates in the dynamic mode if it does not. This is
useful for other cases as well, for example to adapt a service that
normally would use a dynamic user to concepts that require statically
assigned UIDs, for example to marry classic UID-based file system
quota with such services.
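A minimal sketch of such a compatible unit (names are hypothetical): if a static user myfoobard exists, it is used as-is; otherwise a dynamic user is allocated under that name.

```
[Service]
ExecStart=/usr/bin/myfoobard
# If the user "myfoobard" exists statically, it is used and the dynamic
# logic is disabled; otherwise a dynamic user of that name is allocated.
DynamicUser=yes
User=myfoobard
StateDirectory=myfoobard
```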
systemd always allocates a pair of dynamic UID and GID at the same time, with the same numeric ID.
If the Linux kernel had a "shiftfs" or similar functionality,
i.e. a way to mount an existing directory to a second place, but map
the exposed UIDs/GIDs in some way configurable at mount time, this
would be excellent for the implementation of
DynamicUser=. It would make the recursive
chown()ing step unnecessary, as the host version of the state
directory could simply be mounted into the service's mount
name-space, with a shift applied that maps the directory's owner to the
service's UID/GID. But I don't have high hopes in this regard, as all
work being done in this area appears to be bound to user name-spacing
— which is a concept not used here (and I guess one could say user
name-spacing is probably more a source of problems than a solution to
one, but you are welcome to disagree on that).
And that's all for now. Enjoy your dynamic users!
I am happy to announce that we have published the All Systems Go! 2017 schedule! We are very happy with the large number and the quality of the submissions we got, and the resulting schedule is exceptionally strong.
Without further ado:
Here are a couple of keywords from the topics of the talks: 1password, azure, bluetooth, build systems, casync, cgroups, cilium, cockpit, containers, ebpf, flatpak, habitat, IoT, kubernetes, landlock, meson, OCI, rkt, rust, secureboot, skydive, systemd, testing, tor, varlink, virtualization, wifi, and more.
Our speakers are from all across the industry: Chef, CoreOS, Covalent, Facebook, Google, Intel, Kinvolk, Microsoft, Mozilla, Pantheon, Pengutronix, Red Hat, SUSE and more.
For further information about All Systems Go! visit our conference web site.
Make sure to buy your ticket for All Systems Go! 2017 now! A limited number of tickets are left at this point, so make sure you get yours before we are all sold out! Find all details here.
See you in Berlin!
Please make sure to get your presentation proposals for All Systems Go! 2017 in now! The CfP closes on Sunday!
In case you haven't heard about All Systems Go! yet, here's a quick reminder what kind of conference it is, and why you should attend and speak there:
All Systems Go! is an Open Source community conference focused on the projects and technologies at the foundation of modern Linux systems — specifically low-level user-space technologies. Its goal is to provide a friendly and collaborative gathering place for individuals and communities working to push these technologies forward. All Systems Go! 2017 takes place in Berlin, Germany on October 21st+22nd. All Systems Go! is a 2-day event with 2-3 talks happening in parallel. Full presentation slots are 30-45 minutes in length and lightning talk slots are 5-10 minutes.
In particular, we are looking for sessions including, but not limited to, the following topics:
While our focus is definitely more on the user-space side of things, talks about kernel projects are welcome too, as long as they have a clear and direct relevance for user-space.
To submit your proposal now please visit our CFP submission web site.
For further information about All Systems Go! visit our conference web site.
systemd.conf will not take place this year in lieu of All Systems Go!. All Systems Go! welcomes all projects that contribute to Linux user space, which, of course, includes systemd. Thus, anything you think was appropriate for submission to systemd.conf is also fitting for All Systems Go!