TL;DR: you may now configure systemd to dynamically allocate a UNIX user ID for service processes when it starts them and release it when it stops them. It's pretty secure, mixes well with transient services, socket activated services and service templating.
Today we released systemd 235. Among other improvements this greatly extends the dynamic user logic of systemd. Dynamic users are a powerful but little known concept, supported in its basic form since systemd 232. With this blog story I hope to make it a bit better known.
The UNIX user concept is the most basic and well-understood security concept in POSIX operating systems. It is UNIX/POSIX' primary security concept, the one everybody can agree on, and most security concepts that came after it (such as process capabilities, SELinux and other MACs, user name-spaces, …) in some form or another build on it, extend it or at least interface with it. If you build a Linux kernel with all security features turned off, the user concept is pretty much the one you'll still retain.
Originally, the user concept was introduced to make multi-user systems
a reality, i.e. systems enabling multiple human users to share the
same system at the same time, cleanly separating their resources and
protecting them from each other. The majority of today's UNIX systems
don't really use the user concept like that anymore though. Most of
today's systems probably have only one actual human user (or even
less!), but their user databases (/etc/passwd) list a good number
more entries than that. Today, the majority of UNIX users in most
environments are system users, i.e. users that are not the technical
representation of a human sitting in front of a PC anymore, but the
security identity a system service — an executable program — runs
as. Even though traditional, simultaneous multi-user systems slowly
became less relevant, their ground-breaking basic concept became the
cornerstone of UNIX security. The OS is nowadays partitioned into
isolated services — and each service runs as its own system user, and
thus within its own, minimal security context.
The people behind the Android OS realized the relevance of the UNIX user concept as the primary security concept on UNIX, and took its use even further: on Android not only system services take benefit of the UNIX user concept, but each UI app gets its own, individual user identity too — thus neatly separating app resources from each other, and protecting app processes from each other, too.
Back in the more traditional Linux world things are a bit less advanced in this area. Even though users are the quintessential UNIX security concept, allocation and management of system users is still a pretty limited, raw and static affair. In most cases, RPM or DEB package installation scripts allocate a fixed number of (usually one) system users when you install the package of a service that wants to take benefit of the user concept, and from that point on the system user remains allocated on the system and is never deallocated again, even if the package is later removed again. Most Linux distributions limit the number of system users to 1000 (which isn't particularly a lot). Allocating a system user is hence expensive: the number of available users is limited, and there's no defined way to dispose of them after use. If you make use of system users too liberally, you are very likely to run out of them sooner rather than later.
You may wonder why system users are generally not deallocated when the package that registered them is uninstalled from a system (at least on most distributions). The reason for that is one relevant property of the user concept (you might even want to call this a design flaw): user IDs are sticky to files (and other objects such as IPC objects). If a service running as a specific system user creates a file at some location, and is then terminated and its package and user removed, then the created file still belongs to the numeric ID ("UID") the system user originally got assigned. When the next system user is allocated and — due to ID recycling — happens to get assigned the same numeric ID, then it will also gain access to the file, and that's generally considered a problem, given that the file belonged to a potentially very different service once upon a time, and likely should not be readable or changeable by anything coming after it. Distributions hence tend to avoid UID recycling which means system users remain registered forever on a system after they have been allocated once.
The above is a description of the status quo ante. Let's now focus on what systemd's dynamic user concept brings to the table, to improve the situation.
Introducing Dynamic Users
With systemd dynamic users we hope to make it easier and cheaper to allocate system users on-the-fly, thus substantially increasing the possible uses of this core UNIX security concept.
If you write a systemd service unit file, you may enable the dynamic
user logic for it by setting the DynamicUser= option in its
[Service] section to yes. If you do, a system user is
dynamically allocated the instant the service binary is invoked, and
released again when the service terminates. The user is automatically
allocated from the UID range 61184–65519, by looking for a so far
unused UID.
Now you may wonder, how does this concept deal with the sticky user issue discussed above? In order to counter the problem, two strategies easily come to mind:
Prohibit the service from creating any files/directories or IPC objects
Automatically remove the files/directories or IPC objects the service created when it shuts down.
In systemd we implemented both strategies, but for different parts of the execution environment. Specifically, DynamicUser=yes implies the following settings:
ProtectSystem=strict and ProtectHome=read-only. These sand-boxing options turn off write access to pretty much the whole OS directory tree, with a few relevant exceptions, such as the API file systems /proc, /sys and so on, as well as /tmp and /var/tmp. (BTW: setting these two options on your regular services that do not use DynamicUser= is a good idea too, as it drastically reduces the exposure of the system to exploited services.)
PrivateTmp=yes. This option sets up /tmp and /var/tmp for the service in a way that it gets its own, disconnected version of these directories, that are not shared by other services, and whose life-cycle is bound to the service's own life-cycle. Thus if the service goes down, the user is removed and all its temporary files and directories with it. (BTW: as above, consider setting this option for your regular services that do not use DynamicUser= too, it's a great way to lock things down security-wise.)
RemoveIPC=yes. This option ensures that when the service goes down all SysV and POSIX IPC objects (shared memory, message queues, semaphores) owned by the service's user are removed. Thus, the life-cycle of the IPC objects is bound to the life-cycle of the dynamic user and service, too. (BTW: yes, here too, consider using this in your regular services, too!)
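The BTW remarks above suggest using the same settings for regular services too. Here's a minimal sketch of how that could look as a drop-in. The unit name myservice.service is made up, and the file is written to a scratch directory so the snippet is safe to run anywhere:

```shell
# Sketch: a drop-in that gives a regular (non-dynamic) service the same four
# sand-boxing settings. "myservice.service" is a hypothetical unit name.
# The drop-in is written to a scratch directory, not to /etc/systemd/system.
dropin_dir=$(mktemp -d)/myservice.service.d
mkdir -p "$dropin_dir"
cat > "$dropin_dir/sandbox.conf" <<'EOF'
[Service]
ProtectSystem=strict
ProtectHome=read-only
PrivateTmp=yes
RemoveIPC=yes
EOF
cat "$dropin_dir/sandbox.conf"
```

To actually apply it you would place the file below /etc/systemd/system/myservice.service.d/ and run systemctl daemon-reload. Keep in mind that with ProtectSystem=strict a regular service can only write where you explicitly allow it, for example via RuntimeDirectory= or ReadWritePaths=.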
With these four settings in effect, services with dynamic users are
nicely sand-boxed. They cannot create files or directories, except in
/tmp and /var/tmp, where they will be removed automatically when
the service shuts down, as will any IPC objects created. Sticky
ownership of files/directories and IPC objects is hence dealt with
effectively.
The RuntimeDirectory= option may be used to open up the sandbox a bit
to external programs. If you set it to a directory name of your choice,
it will be created below /run when the service is started, and removed in its
entirety when it is terminated. The ownership of the directory is
assigned to the service's dynamic user. This way, a dynamic user
service can expose API interfaces (AF_UNIX sockets, …) to other
services at a well-defined place and again bind the life-cycle of it to
the service's own run-time. Example: set RuntimeDirectory=foobar in
your service, and watch how a directory /run/foobar appears at the
moment you start the service, and disappears the moment you stop
it again. (BTW: Much like the other settings discussed above,
RuntimeDirectory= may be used outside of the DynamicUser= context
too, and is a nice way to run any service with a properly owned,
life-cycle-managed run-time directory.)
Of course, a service running in such an environment (although already very useful for many cases!), has a major limitation: it cannot leave persistent data around it can reuse on a later run. As pretty much the whole OS directory tree is read-only to it, there's simply no place it could put the data that survives from one service invocation to the next.
With systemd 235 this limitation is removed: there are now three new
settings: StateDirectory=, LogsDirectory= and CacheDirectory=. In many
ways they operate like RuntimeDirectory=, but create sub-directories below
/var/lib, /var/log and /var/cache, respectively. There's one major
difference beyond that however: directories created that way are
persistent, they will survive the run-time cycle of a service, and
thus may be used to store data that is supposed to stay around between
invocations of the service.
Of course, the obvious question to ask now is: how do these three settings deal with the sticky file ownership problem?
For that we lifted a concept from container managers. Container
managers have a very similar problem: each container and the host
typically end up using a very similar set of numeric UIDs, and unless
user name-spacing is deployed this means that host users might be able
to access the data of specific containers that also have a user by the
same numeric UID assigned, even though it actually refers to a very
different identity in a different context. (Actually, it's even worse
than just getting access: due to the existence of setuid file bits,
access might translate to privilege elevation.) The way container
managers protect the container images from the host (and from each
other to some level) is by placing the container trees below a
boundary directory, with very restrictive access modes and ownership
(0700 and root:root or so). A host user hence cannot take advantage
of the files/directories of a container user of the same UID inside of
a local container tree, simply because the boundary directory makes it
impossible to even reference files in it. After all on UNIX, in order
to get access to a specific path you need access to every single
component of it.
How is that applied to dynamic user services? Let's say
StateDirectory=foobar is set for a service that has DynamicUser=
turned off. The instant the service is started, /var/lib/foobar is
created as state directory, owned by the service's user and remains in
existence when the service is stopped. If the same service now is run
with DynamicUser= turned on, the implementation is slightly
altered. Instead of a directory /var/lib/foobar a symbolic link by
the same path is created (owned by root), pointing to
/var/lib/private/foobar (the latter being owned by the service's
dynamic user). The /var/lib/private directory is created as boundary
directory: it's owned by root:root, and has a restrictive access
mode of 0700. Both the symlink and the service's state directory will
survive the service's life-cycle. The state directory continues to be
owned by the now disposed dynamic UID; however, it
is protected from other host users (and other services which might get
the same dynamic UID assigned due to UID recycling) by the boundary
directory.
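To make the layout a bit more tangible, here's a small sketch that recreates it inside a scratch directory. It only mirrors the structure described above (boundary directory, state directory, symlink). The ownership by root and by the dynamic user is not reproduced, and nothing below /var/lib is touched:

```shell
# Sketch of the boundary-directory layout, rebuilt in a temporary directory.
# "$root" stands in for /var/lib; all files belong to the invoking user here.
root=$(mktemp -d)
mkdir -p "$root/private/foobar"
echo data > "$root/private/foobar/file"
chmod 0755 "$root/private/foobar"     # the state directory may be permissive
chmod 0700 "$root/private"            # the boundary: only its owner may traverse it
ln -s private/foobar "$root/foobar"   # stable name, like /var/lib/foobar
boundary_mode=$(stat -c '%a' "$root/private")
link_target=$(readlink "$root/foobar")
echo "boundary mode: $boundary_mode, symlink -> $link_target"
```

Any unprivileged user other than the owner of the boundary directory would now get a permission error when trying to read the file, regardless of the file's own access mode, simply because the path cannot be traversed.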
The obvious question to ask now is: but if the boundary directory
prohibits access to the directory from unprivileged processes, how can
the service itself which runs under its own dynamic UID access it
anyway? This is achieved by invoking the service process in a slightly
modified mount name-space: it will see most of the file hierarchy the
same way as everything else on the system (modulo /tmp and
/var/tmp as mentioned above), except for /var/lib/private, which
is over-mounted with a read-only tmpfs file system instance, with a
slightly more liberal access mode permitting the service read
access. Inside of this tmpfs file system instance another mount is
placed: a bind mount to the host's real /var/lib/private/foobar
directory, onto the same name. Putting this together, this means that
superficially everything looks the same and is available at the same
place on the host and from inside the service, but two important
changes have been made: the
/var/lib/private boundary directory lost
its restrictive character inside the service, and has been emptied of
the state directories of any other service, thus making the protection
complete. Note that the symlink
/var/lib/foobar hides the fact that
the boundary directory is used (making it little more than an
implementation detail), as the directory is available this way under
the same name as it would be if
DynamicUser= was not used. Long
story short: for the daemon and as seen from the host the
indirection through /var/lib/private is mostly transparent.
This logic of course raises another question: what happens to the state directory if a dynamic user service is started with a state directory configured, gets UID X assigned on this first invocation, then terminates and is restarted and now gets UID Y assigned on the second invocation, with X ≠ Y? On the second invocation the directory — and all the files and directories below it — will still be owned by the original UID X so how could the second instance running as Y access it? Our way out is simple: systemd will recursively change the ownership of the directory and everything contained within it to UID Y before invoking the service's executable.
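The ownership adjustment can be pictured roughly like this. It is a simplified sketch, not systemd's actual code: the helper name adjust_ownership is made up, and it only compares the owner of the top-level directory before deciding whether to recurse.

```shell
# Hypothetical helper: recursively chown a state directory to "UID:GID",
# but only if the top-level directory is not already owned by that pair.
adjust_ownership() {
    dir=$1
    owner=$2
    if [ "$(stat -c '%u:%g' "$dir")" = "$owner" ]; then
        echo skipped
    else
        chown -R "$owner" "$dir" && echo chowned
    fi
}

# Demo on a scratch directory that already has the "right" owner.
state_dir=$(mktemp -d)
touch "$state_dir/test"
current=$(stat -c '%u:%g' "$state_dir")
result=$(adjust_ownership "$state_dir" "$current")
echo "$result"
```

In the demo the directory already belongs to the requested UID/GID pair, so the recursive chown is skipped. Changing ownership to a different UID would require privileges, which the service manager has.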
Of course, such recursive ownership changing (chown()ing) of whole
directory trees can become expensive (though in my
experience, IRL and for most services it's much cheaper than you
might think), hence in order to optimize behavior in this regard, the
allocation of dynamic UIDs has been tweaked in two ways to avoid the
necessity to do this expensive operation in most cases: firstly, when
a dynamic UID is allocated for a service an allocation loop is
employed that starts out with a UID hashed from the service's
name. This means a service by the same name is likely to always use
the same numeric UID. That means that a stable service name translates
into a stable dynamic UID, and that means recursive file ownership
adjustments can be skipped (of course, after validation). Secondly, if
the configured state directory already exists, and is owned by a
suitable currently unused dynamic UID, it's preferably used above
everything else, thus maximizing the chance we can avoid the
chown()ing. (That all said, ultimately we have to face it, the
currently available UID space of 4K+ is very small still, and
conflicts are pretty likely sooner or later, thus a chown()ing has to
be expected every now and then when this feature is used extensively).
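Here's a rough sketch of the first optimization, i.e. an allocation loop that starts from a UID hashed from the service name. Treat it as an illustration only: cksum is a stand-in for whatever hash function systemd really uses, the linear probing is an assumption of this sketch, and the only "is it taken?" check done here is an NSS lookup via getent.

```shell
# Hypothetical stand-in for the allocation loop described above.
UID_MIN=61184
UID_MAX=65519

pick_dynamic_uid() {
    name=$1
    range=$(( UID_MAX - UID_MIN + 1 ))
    # Derive a starting point from the service name (cksum is an assumption).
    hash=$(printf '%s' "$name" | cksum | cut -d' ' -f1)
    candidate=$(( UID_MIN + hash % range ))
    tries=0
    # A UID counts as taken if NSS resolves it to a user entry.
    while getent passwd "$candidate" >/dev/null 2>&1; do
        tries=$(( tries + 1 ))
        [ "$tries" -ge "$range" ] && return 1   # pool exhausted
        candidate=$(( candidate + 1 ))
        [ "$candidate" -gt "$UID_MAX" ] && candidate=$UID_MIN
    done
    echo "$candidate"
}

uid_a=$(pick_dynamic_uid dynamic-user-test.service)
uid_b=$(pick_dynamic_uid dynamic-user-test.service)
echo "$uid_a $uid_b"
```

As long as the user database does not change in between, the same service name yields the same candidate UID on every call, which is exactly the property that lets the chown()ing be skipped.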
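CacheDirectory= and LogsDirectory= work very similar to
StateDirectory=. The only difference is that they manage directories
below the /var/cache and /var/log directories, and their boundary
directories hence are /var/cache/private and /var/log/private, respectively.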
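The mapping between the three settings and the paths involved is small enough to write down as a shell function. This is just a summary of the text above for a service with DynamicUser= turned on; the function name is made up.

```shell
# Hypothetical helper: print "symlink -> real directory" for a given
# directory setting and directory name, for a dynamic user service.
dynamic_dir_paths() {
    case $1 in
        StateDirectory) base=/var/lib ;;
        CacheDirectory) base=/var/cache ;;
        LogsDirectory)  base=/var/log ;;
        *) echo "unknown setting: $1" >&2; return 1 ;;
    esac
    echo "$base/$2 -> $base/private/$2"
}

state=$(dynamic_dir_paths StateDirectory foobar)
cache=$(dynamic_dir_paths CacheDirectory foobar)
logs=$(dynamic_dir_paths LogsDirectory foobar)
printf '%s\n' "$state" "$cache" "$logs"
```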
So, after all this introduction, let's have a look how this all can be put together. Here's a trivial example:
# cat > /etc/systemd/system/dynamic-user-test.service <<EOF
[Service]
ExecStart=/usr/bin/sleep 4711
DynamicUser=yes
EOF
# systemctl daemon-reload
# systemctl start dynamic-user-test
# systemctl status dynamic-user-test
● dynamic-user-test.service
   Loaded: loaded (/etc/systemd/system/dynamic-user-test.service; static; vendor preset: disabled)
   Active: active (running) since Fri 2017-10-06 13:12:25 CEST; 3s ago
 Main PID: 2967 (sleep)
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/dynamic-user-test.service
           └─2967 /usr/bin/sleep 4711
Okt 06 13:12:25 sigma systemd: Started dynamic-user-test.service.
# ps -e -o pid,comm,user | grep 2967
 2967 sleep           dynamic-user-test
# id dynamic-user-test
uid=64642(dynamic-user-test) gid=64642(dynamic-user-test) groups=64642(dynamic-user-test)
# systemctl stop dynamic-user-test
# id dynamic-user-test
id: ‘dynamic-user-test’: no such user
In this example, we create a unit file with
DynamicUser= turned on,
start it, check if it's running correctly, have a look at the service
process' user (which is named like the service; systemd does this
automatically if the service name is suitable as user name, and you
didn't configure any user name to use explicitly), stop the service
and verify that the user ceased to exist too.
That's already pretty cool. Let's step it up a notch, by doing the
same in an interactive transient service (for those who don't know
systemd well: a transient service is a service that is defined and
started dynamically at run-time, for example via the systemd-run
command from the shell. Think: run a service without having to write a
unit file first):
# systemd-run --pty --property=DynamicUser=yes --property=StateDirectory=wuff /bin/sh
Running as unit: run-u15750.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4$ id
uid=63122(run-u15750) gid=63122(run-u15750) groups=63122(run-u15750) context=system_u:system_r:initrc_t:s0
sh-4.4$ ls -al /var/lib/private/
total 0
drwxr-xr-x. 3 root       root        60  6. Okt 13:21 .
drwxr-xr-x. 1 root       root       852  6. Okt 13:21 ..
drwxr-xr-x. 1 run-u15750 run-u15750   8  6. Okt 13:22 wuff
sh-4.4$ ls -ld /var/lib/wuff
lrwxrwxrwx. 1 root root 12  6. Okt 13:21 /var/lib/wuff -> private/wuff
sh-4.4$ ls -ld /var/lib/wuff/
drwxr-xr-x. 1 run-u15750 run-u15750 0  6. Okt 13:21 /var/lib/wuff/
sh-4.4$ echo hello > /var/lib/wuff/test
sh-4.4$ exit
exit
# id run-u15750
id: ‘run-u15750’: no such user
# ls -al /var/lib/private
total 0
drwx------. 1 root  root   66  6. Okt 13:21 .
drwxr-xr-x. 1 root  root  852  6. Okt 13:21 ..
drwxr-xr-x. 1 63122 63122   8  6. Okt 13:22 wuff
# ls -ld /var/lib/wuff
lrwxrwxrwx. 1 root root 12  6. Okt 13:21 /var/lib/wuff -> private/wuff
# ls -ld /var/lib/wuff/
drwxr-xr-x. 1 63122 63122 8  6. Okt 13:22 /var/lib/wuff/
# cat /var/lib/wuff/test
hello
The above invokes an interactive shell as transient service
run-u15750.service (systemd-run picked that name automatically,
since we didn't specify anything explicitly) with a dynamic user whose
name is derived automatically from the service name. Because
StateDirectory=wuff is used, a persistent state directory for the
service is made available as
/var/lib/wuff. In the interactive shell
running inside the service, the
ls commands show the
/var/lib/private boundary directory and its contents, as well as the
symlink that is placed for the service. Finally, before exiting the
shell, a file is created in the state directory. Back in the original
command shell we check if the user is still allocated: it is not, of
course, since the service ceased to exist when we exited the shell and
with it the dynamic user associated with it. From the host we check
the state directory of the service, with similar commands as we did
from inside of it. We see that things are set up pretty much the same
way in both cases, except for two things: first of all the user/group
of the files is now shown as raw numeric UIDs instead of the
user/group names derived from the unit name. That's because the user
ceased to exist at this point, and "ls" shows the raw UID for files
owned by users that don't exist. Secondly, the access mode of the
boundary directory is different: when we look at it from outside of
the service it is not readable by anyone but root, when we looked from
inside we saw it being world readable.
Now, let's see how things look if we start another transient service, reusing the state directory from the first invocation:
# systemd-run --pty --property=DynamicUser=yes --property=StateDirectory=wuff /bin/sh
Running as unit: run-u16087.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4$ cat /var/lib/wuff/test
hello
sh-4.4$ ls -al /var/lib/wuff/
total 4
drwxr-xr-x. 1 run-u16087 run-u16087  8  6. Okt 13:22 .
drwxr-xr-x. 3 root       root       60  6. Okt 15:42 ..
-rw-r--r--. 1 run-u16087 run-u16087  6  6. Okt 13:22 test
sh-4.4$ id
uid=63122(run-u16087) gid=63122(run-u16087) groups=63122(run-u16087) context=system_u:system_r:initrc_t:s0
sh-4.4$ exit
exit
Here, systemd-run picked a different auto-generated unit name, but
the used dynamic UID is still the same, as it was read from the
pre-existing state directory, and was otherwise unused. As we can see
the test file we generated earlier is accessible and still contains
the data we left in there. Do note that the user name is different
this time (as it is derived from the unit name, which is different),
but the UID it is assigned to is the same one as on the first
invocation. We can thus see that the mentioned optimization of the UID
allocation logic (i.e. that we start the allocation loop from the UID
owner of any existing state directory) took effect, so that no
chown()ing was required.
And that's the end of our example, which hopefully illustrated a bit how this concept and implementation works.
Now that we had a look at how to enable this logic for a unit and how it is implemented, let's discuss where this actually could be useful in real life.
One major benefit of dynamic user IDs is that running a privilege-separated service leaves no artifacts in the system. A system user is allocated and made use of, but it is discarded automatically in a safe and secure way after use, in a fashion that is safe for later recycling. Thus, quickly invoking a short-lived service for processing some job can be protected properly through a user ID without having to pre-allocate it and without this draining the available UID pool any longer than necessary.
In many cases, starting a service no longer requires package-specific preparation. Or in other words, quite often useradd/mkdir/chown/chmod invocations in "post-inst" package scripts, as well as sysusers.d and tmpfiles.d drop-ins become unnecessary, as the DynamicUser= and StateDirectory=/CacheDirectory=/LogsDirectory= logic can do the necessary work automatically, on-demand and with a well-defined life-cycle.
By combining dynamic user IDs with the transient unit concept, new creative ways of sand-boxing are made available. For example, let's say you don't trust the correct implementation of the sort command. You can now lock it into a simple, robust, dynamic UID sandbox with a simple systemd-run and still integrate it into a shell pipeline like any other command. Here's an example, showcasing a shell pipeline whose middle element runs as a dynamically on-the-fly allocated UID, that is released when the pipeline ends.
# cat some-file.txt | systemd-run --pipe --property=DynamicUser=1 sort -u | grep -i foobar > some-other-file.txt
By combining dynamic user IDs with the systemd templating logic it is now possible to do much more fine-grained and fully automatic UID management. For example, let's say you have a template unit file /etc/systemd/system/foobard@.service:
[Service]
ExecStart=/usr/bin/myfoobarserviced
DynamicUser=1
StateDirectory=foobar/%i
Now, let's say you want to start one instance of this service for each of your customers. All you need to do now for that is:
# systemctl enable foobard@customerxyz.service --now
And you are done. (Invoke this as many times as you like, each time replacing customerxyz by some customer identifier, you get the idea.)
By combining dynamic user IDs with socket activation you may easily implement a system where each incoming connection is served by a process instance running as a different, fresh, newly allocated UID within its own sandbox. Here's an example waldo.socket:
[Socket]
ListenStream=2048
Accept=yes
With a matching waldo@.service:
[Service]
ExecStart=-/usr/bin/myservicebinary
DynamicUser=yes
With the two unit files above, systemd will listen on TCP/IP port 2048, and for each incoming connection invoke a fresh instance of waldo@.service, each time utilizing a different, new, dynamically allocated UID, neatly isolated from any other instance.
Dynamic user IDs combine very well with state-less systems, i.e. systems that come up with an unpopulated /var. A service using dynamic user IDs and the StateDirectory=, CacheDirectory=, LogsDirectory= and RuntimeDirectory= concepts will implicitly allocate the users and directories it needs for running, right at the moment where it needs them.
Dynamic users are a very generic concept, hence a multitude of other uses are thinkable; the list above is just supposed to trigger your imagination.
What does this mean for you as a packager?
I am pretty sure that a large number of services shipped with today's
distributions could benefit from using DynamicUser= and
StateDirectory= (and related settings). It often allows removal of
post-inst packaging scripts altogether, as well as any
sysusers.d and tmpfiles.d drop-ins by unifying the needed declarations in the
unit file itself. Hence, as a packager please consider switching your
unit files over. That said, there are a number of conditions where
DynamicUser= and StateDirectory= (and friends) cannot or should
not be used. To name a few:
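Services that need to write to files outside of /run/<package>, /var/lib/<package>, /var/cache/<package>, /var/log/<package>, /var/tmp, /tmp, /dev/shm are generally incompatible with this scheme. This rules out daemons that upgrade the system as one example, as that involves writing to /usr.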
Services that maintain a herd of processes with different user IDs. Some SMTP services are like this. If your service has such a super-server design, UID management needs to be done by the super-server itself, which rules out systemd doing its dynamic UID magic for it.
Services which run as root (obviously…) or are otherwise privileged.
Services that need to live in the same mount name-space as the host system (for example, because they want to establish mount points visible system-wide). As mentioned, DynamicUser= implies ProtectSystem=, PrivateTmp= and related options, which all require the service to run in its own mount name-space.
Your focus is older distributions, i.e. distributions that do not have systemd 232 (for DynamicUser=) or systemd 235 (for StateDirectory= and friends) yet.
If your distribution's packaging guides don't allow it. Consult your packaging guides, and possibly start a discussion on your distribution's mailing list about this.
A couple of additional, random notes about the implementation and use of these features:
Do note that allocating or deallocating a dynamic user leaves /etc/passwd untouched. A dynamic user is added into the user database through the glibc NSS module nss-systemd, and this information never hits the disk.
On traditional UNIX systems it was the job of the daemon process itself to drop privileges, while the DynamicUser= concept is designed around the service manager (i.e. systemd) being responsible for that. That said, since v235 there's a way to marry DynamicUser= and such services which want to drop privileges on their own. For that, turn on DynamicUser= and set User= to the user name the service wants to setuid() to. This has the effect that systemd will allocate the dynamic user under the specified name when the service is started. Then, prefix the command line you specify in ExecStart= with a single ! character. If you do, the user is allocated for the service, but the daemon binary is invoked as root instead of the allocated user, under the assumption that the daemon changes its UID on its own the right way. Note that after registration the user will show up instantly in the user database, and is hence resolvable like any other by the daemon process.
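A minimal sketch of such a unit could look as follows. The daemon /usr/bin/myfoobard and the user name myfoobard are made up for illustration, and the unit file is written to a scratch directory rather than to /etc/systemd/system:

```shell
# Sketch: unit for a daemon that drops privileges on its own.
# "myfoobard" (binary and user name) is hypothetical.
unit_dir=$(mktemp -d)
cat > "$unit_dir/myfoobard.service" <<'EOF'
[Service]
# The "!" prefix makes systemd start the binary as root; the daemon is
# expected to setuid() to the dynamically allocated user "myfoobard" itself.
ExecStart=!/usr/bin/myfoobard
DynamicUser=yes
User=myfoobard
EOF
cat "$unit_dir/myfoobard.service"
```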
You may wonder why systemd uses the UID range 61184–65519 for its dynamic user allocations (side note: in hexadecimal this reads as 0xEF00–0xFFEF). That's because distributions (specifically Fedora) tend to allocate regular users from below the 60000 range, and we don't want to step into that. We also want to stay away from 65535 and a bit around it, as some of these UIDs have special meanings (65535 is often used as special value for "invalid" or "no" UID, as it is identical to the 16bit value -1; 65534 is generally mapped to the "nobody" user, and is where some kernel subsystems map unmappable UIDs). Finally, we want to stay within the 16bit range. In a user name-spacing world each container tends to have much less than the full 32bit UID range available that Linux kernels theoretically provide. Everybody apparently can agree that a container should at least cover the 16bit range though, if only to include a nobody user. (And quite frankly, I am pretty sure assigning 64K UIDs per container is nicely systematic, as the higher 16bit of the 32bit UID values this way become a container ID, while the lower 16bit become the logical UID within each container, if you still follow what I am babbling here…). And before you ask: no this range cannot be changed right now, it's compiled in. We might change that eventually however.
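If you want to double-check the numbers, plain shell arithmetic is enough. The first part verifies the hexadecimal boundaries and the size of the pool, the second part shows the "container ID in the upper 16bit" idea with made-up values (container 7, UID 1000):

```shell
# The dynamic UID range, its hexadecimal form and its size.
UID_MIN=61184
UID_MAX=65519
hex_min=$(printf '0x%X' "$UID_MIN")
hex_max=$(printf '0x%X' "$UID_MAX")
pool_size=$(( UID_MAX - UID_MIN + 1 ))
echo "$hex_min-$hex_max, $pool_size UIDs"

# Splitting a 32bit UID into a container ID (upper 16bit) and the logical
# UID inside the container (lower 16bit). The values are invented.
full_uid=$(( (7 << 16) | 1000 ))
container_id=$(( full_uid >> 16 ))
local_uid=$(( full_uid & 0xFFFF ))
echo "$full_uid -> container $container_id, uid $local_uid"
```

The pool size of 4336 is the "4K+" mentioned earlier.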
You might wonder what happens if you already used UIDs from the 61184–65519 range on your system for other purposes. systemd should handle that mostly fine, as long as that usage is properly registered in the user database: when allocating a dynamic user we pick a UID, see if it is currently used somehow, and if yes pick a different one, until we find a free one. Whether a UID is used right now or not is checked through NSS calls. Moreover the IPC object lists are checked to see if there are any objects owned by the UID we are about to pick. This means systemd will avoid using UIDs you have assigned otherwise. Note however that this of course makes the pool of available UIDs smaller, and in the worst cases this means that allocating a dynamic user might fail because there simply are no unused UIDs in the range.
If not specified otherwise the name for a dynamically allocated user is derived from the service name. Not everything that's valid in a service name is valid in a user-name however, and in some cases a randomized name is used instead to deal with this. Often it makes sense to pick the user names to register explicitly. For that use User= and choose whatever you like.
If you pick a user name with User= and combine it with DynamicUser= and the user already exists statically it will be used for the service and the dynamic user logic is automatically disabled. This permits automatic up- and downgrades between static and dynamic UIDs. For example, it provides a nice way to move a system from static to dynamic UIDs in a compatible way: as long as you select the same User= value before and after switching DynamicUser= on, the service will continue to use the statically allocated user if it exists, and only operates in the dynamic mode if it does not. This is useful for other cases as well, for example to adapt a service that normally would use a dynamic user to concepts that require statically assigned UIDs, for example to marry classic UID-based file system quota with such services.
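To find out which of the two modes a given User= value would end up in on a particular machine, a quick lookup in the user database is enough. The following is a small sketch with a made-up helper name. Note that it is only meaningful while the service is stopped, since a currently allocated dynamic user is resolvable too:

```shell
# Hypothetical helper: report whether a user name already exists in the
# user database ("static") or would be allocated on the fly ("dynamic").
user_mode() {
    if id -u "$1" >/dev/null 2>&1; then
        echo static
    else
        echo dynamic
    fi
}

# "root" is used here only because it exists on practically every system;
# the second name is invented and assumed not to exist.
mode_root=$(user_mode root)
mode_new=$(user_mode no-such-user-4711)
echo "root: $mode_root, no-such-user-4711: $mode_new"
```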
systemd always allocates a pair of dynamic UID and GID at the same time, with the same numeric ID.
If the Linux kernel had a "shiftfs" or similar functionality, i.e. a way to mount an existing directory to a second place, but map the exposed UIDs/GIDs in some way configurable at mount time, this would be excellent for the implementation of StateDirectory= in conjunction with DynamicUser=. It would make the recursive chown()ing step unnecessary, as the host version of the state directory could simply be mounted into the service's mount name-space, with a shift applied that maps the directory's owner to the service's UID/GID. But I don't have high hopes in this regard, as all work being done in this area appears to be bound to user name-spacing, which is a concept not used here (and I guess one could say user name-spacing is probably more a source of problems than a solution to one, but you are welcome to disagree on that).
And that's all for now. Enjoy your dynamic users!