Index | Archives | Atom Feed | RSS Feed

Dynamic Users with systemd

TL;DR: you may now configure systemd to dynamically allocate a UNIX user ID for service processes when it starts them and release it when it stops them. It's pretty secure, mixes well with transient services, socket activated services and service templating.

Today we released systemd 235. Among other improvements this greatly extends the dynamic user logic of systemd. Dynamic users are a powerful but little known concept, supported in its basic form since systemd 232. With this blog story I hope to make it a bit better known.

The UNIX user concept is the most basic and well-understood security concept in POSIX operating systems. It is UNIX/POSIX' primary security concept, the one everybody can agree on, and most security concepts that came after it (such as process capabilities, SELinux and other MACs, user name-spaces, …) in some form or another build on it, extend it or at least interface with it. If you build a Linux kernel with all security features turned off, the user concept is pretty much the one you'll still retain.

Originally, the user concept was introduced to make multi-user systems a reality, i.e. systems enabling multiple human users to share the same system at the same time, cleanly separating their resources and protecting them from each other. The majority of today's UNIX systems don't really use the user concept like that anymore though. Most of today's systems probably have only one actual human user (or even less!), but their user databases (/etc/passwd) list a good number more entries than that. Today, the majority of UNIX users in most environments are system users, i.e. users that are not the technical representation of a human sitting in front of a PC anymore, but the security identity a system service — an executable program — runs as. Even though traditional, simultaneous multi-user systems slowly became less relevant, their ground-breaking basic concept became the cornerstone of UNIX security. The OS is nowadays partitioned into isolated services — and each service runs as its own system user, and thus within its own, minimal security context.

The people behind the Android OS realized the relevance of the UNIX user concept as the primary security concept on UNIX, and took its use even further: on Android not only system services take benefit of the UNIX user concept, but each UI app gets its own, individual user identity too — thus neatly separating app resources from each other, and protecting app processes from each other, too.

Back in the more traditional Linux world things are a bit less advanced in this area. Even though users are the quintessential UNIX security concept, allocation and management of system users is still a pretty limited, raw and static affair. In most cases, RPM or DEB package installation scripts allocate a fixed number of (usually one) system users when you install the package of a service that wants to take benefit of the user concept, and from that point on the system user remains allocated on the system and is never deallocated again, even if the package is later removed again. Most Linux distributions limit the number of system users to 1000 (which isn't particularly a lot). Allocating a system user is hence expensive: the number of available users is limited, and there's no defined way to dispose of them after use. If you make use of system users too liberally, you are very likely to run out of them sooner rather than later.

You may wonder why system users are generally not deallocated when the package that registered them is uninstalled from a system (at least on most distributions). The reason for that is one relevant property of the user concept (you might even want to call this a design flaw): user IDs are sticky to files (and other objects such as IPC objects). If a service running as a specific system user creates a file at some location, and is then terminated and its package and user removed, then the created file still belongs to the numeric ID ("UID") the system user originally got assigned. When the next system user is allocated and — due to ID recycling — happens to get assigned the same numeric ID, then it will also gain access to the file, and that's generally considered a problem, given that the file belonged to a potentially very different service once upon a time, and likely should not be readable or changeable by anything coming after it. Distributions hence tend to avoid UID recycling which means system users remain registered forever on a system after they have been allocated once.

The above is a description of the status quo ante. Let's now focus on what systemd's dynamic user concept brings to the table, to improve the situation.

Introducing Dynamic Users

With systemd dynamic users we hope to make make it easier and cheaper to allocate system users on-the-fly, thus substantially increasing the possible uses of this core UNIX security concept.

If you write a systemd service unit file, you may enable the dynamic user logic for it by setting the DynamicUser= option in its [Service] section to yes. If you do a system user is dynamically allocated the instant the service binary is invoked, and released again when the service terminates. The user is automatically allocated from the UID range 61184–65519, by looking for a so far unused UID.

Now you may wonder, how does this concept deal with the sticky user issue discussed above? In order to counter the problem, two strategies easily come to mind:

  1. Prohibit the service from creating any files/directories or IPC objects

  2. Automatically removing the files/directories or IPC objects the service created when it shuts down.

In systemd we implemented both strategies, but for different parts of the execution environment. Specifically:

  1. Setting DynamicUser=yes implies ProtectSystem=strict and ProtectHome=read-only. These sand-boxing options turn off write access to pretty much the whole OS directory tree, with a few relevant exceptions, such as the API file systems /proc, /sys and so on, as well as /tmp and /var/tmp. (BTW: setting these two options on your regular services that do not use DynamicUser= is a good idea too, as it drastically reduces the exposure of the system to exploited services.)

  2. Setting DynamicUser=yes implies PrivateTmp=yes. This option sets up /tmp and /var/tmp for the service in a way that it gets its own, disconnected version of these directories, that are not shared by other services, and whose life-cycle is bound to the service's own life-cycle. Thus if the service goes down, the user is removed and all its temporary files and directories with it. (BTW: as above, consider setting this option for your regular services that do not use DynamicUser= too, it's a great way to lock things down security-wise.)

  3. Setting DynamicUser=yes implies RemoveIPC=yes. This option ensures that when the service goes down all SysV and POSIX IPC objects (shared memory, message queues, semaphores) owned by the service's user are removed. Thus, the life-cycle of the IPC objects is bound to the life-cycle of the dynamic user and service, too. (BTW: yes, here too, consider using this in your regular services, too!)

With these four settings in effect, services with dynamic users are nicely sand-boxed. They cannot create files or directories, except in /tmp and /var/tmp, where they will be removed automatically when the service shuts down, as will any IPC objects created. Sticky ownership of files/directories and IPC objects is hence dealt with effectively.

The RuntimeDirectory= option may be used to open up a bit the sandbox to external programs. If you set it to a directory name of your choice, it will be created below /run when the service is started, and removed in its entirety when it is terminated. The ownership of the directory is assigned to the service's dynamic user. This way, a dynamic user service can expose API interfaces (AF_UNIX sockets, …) to other services at a well-defined place and again bind the life-cycle of it to the service's own run-time. Example: set RuntimeDirectory=foobar in your service, and watch how a directory /run/foobar appears at the moment you start the service, and disappears the moment you stop it again. (BTW: Much like the other settings discussed above, RuntimeDirectory= may be used outside of the DynamicUser= context too, and is a nice way to run any service with a properly owned, life-cycle-managed run-time directory.)

Persistent Data

Of course, a service running in such an environment (although already very useful for many cases!), has a major limitation: it cannot leave persistent data around it can reuse on a later run. As pretty much the whole OS directory tree is read-only to it, there's simply no place it could put the data that survives from one service invocation to the next.

With systemd 235 this limitation is removed: there are now three new settings: StateDirectory=, LogsDirectory= and CacheDirectory=. In many ways they operate like RuntimeDirectory=, but create sub-directories below /var/lib, /var/log and /var/cache, respectively. There's one major difference beyond that however: directories created that way are persistent, they will survive the run-time cycle of a service, and thus may be used to store data that is supposed to stay around between invocations of the service.

Of course, the obvious question to ask now is: how do these three settings deal with the sticky file ownership problem?

For that we lifted a concept from container managers. Container managers have a very similar problem: each container and the host typically end up using a very similar set of numeric UIDs, and unless user name-spacing is deployed this means that host users might be able to access the data of specific containers that also have a user by the same numeric UID assigned, even though it actually refers to a very different identity in a different context. (Actually, it's even worse than just getting access, due to the existence of setuid file bits, access might translate to privilege elevation.) The way container managers protect the container images from the host (and from each other to some level) is by placing the container trees below a boundary directory, with very restrictive access modes and ownership (0700 and root:root or so). A host user hence cannot take advantage of the files/directories of a container user of the same UID inside of a local container tree, simply because the boundary directory makes it impossible to even reference files in it. After all on UNIX, in order to get access to a specific path you need access to every single component of it.

How is that applied to dynamic user services? Let's say StateDirectory=foobar is set for a service that has DynamicUser= turned off. The instant the service is started, /var/lib/foobar is created as state directory, owned by the service's user and remains in existence when the service is stopped. If the same service now is run with DynamicUser= turned on, the implementation is slightly altered. Instead of a directory /var/lib/foobar a symbolic link by the same path is created (owned by root), pointing to /var/lib/private/foobar (the latter being owned by the service's dynamic user). The /var/lib/private directory is created as boundary directory: it's owned by root:root, and has a restrictive access mode of 0700. Both the symlink and the service's state directory will survive the service's life-cycle, but the state directory will remain, and continues to be owned by the now disposed dynamic UID — however it is protected from other host users (and other services which might get the same dynamic UID assigned due to UID recycling) by the boundary directory.

The obvious question to ask now is: but if the boundary directory prohibits access to the directory from unprivileged processes, how can the service itself which runs under its own dynamic UID access it anyway? This is achieved by invoking the service process in a slightly modified mount name-space: it will see most of the file hierarchy the same way as everything else on the system (modulo /tmp and /var/tmp as mentioned above), except for /var/lib/private, which is over-mounted with a read-only tmpfs file system instance, with a slightly more liberal access mode permitting the service read access. Inside of this tmpfs file system instance another mount is placed: a bind mount to the host's real /var/lib/private/foobar directory, onto the same name. Putting this together these means that superficially everything looks the same and is available at the same place on the host and from inside the service, but two important changes have been made: the /var/lib/private boundary directory lost its restrictive character inside the service, and has been emptied of the state directories of any other service, thus making the protection complete. Note that the symlink /var/lib/foobar hides the fact that the boundary directory is used (making it little more than an implementation detail), as the directory is available this way under the same name as it would be if DynamicUser= was not used. Long story short: for the daemon and from the view from the host the indirection through /var/lib/private is mostly transparent.

This logic of course raises another question: what happens to the state directory if a dynamic user service is started with a state directory configured, gets UID X assigned on this first invocation, then terminates and is restarted and now gets UID Y assigned on the second invocation, with X ≠ Y? On the second invocation the directory — and all the files and directories below it — will still be owned by the original UID X so how could the second instance running as Y access it? Our way out is simple: systemd will recursively change the ownership of the directory and everything contained within it to UID Y before invoking the service's executable.

Of course, such recursive ownership changing (chown()ing) of whole directory trees can become expensive (though according to my experiences, IRL and for most services it's much cheaper than you might think), hence in order to optimize behavior in this regard, the allocation of dynamic UIDs has been tweaked in two ways to avoid the necessity to do this expensive operation in most cases: firstly, when a dynamic UID is allocated for a service an allocation loop is employed that starts out with a UID hashed from the service's name. This means a service by the same name is likely to always use the same numeric UID. That means that a stable service name translates into a stable dynamic UID, and that means recursive file ownership adjustments can be skipped (of course, after validation). Secondly, if the configured state directory already exists, and is owned by a suitable currently unused dynamic UID, it's preferably used above everything else, thus maximizing the chance we can avoid the chown()ing. (That all said, ultimately we have to face it, the currently available UID space of 4K+ is very small still, and conflicts are pretty likely sooner or later, thus a chown()ing has to be expected every now and then when this feature is used extensively).

Note that CacheDirectory= and LogsDirectory= work very similar to StateDirectory=. The only difference is that they manage directories below the /var/cache and /var/logs directories, and their boundary directory hence is /var/cache/private and /var/log/private, respectively.

Examples

So, after all this introduction, let's have a look how this all can be put together. Here's a trivial example:

# cat > /etc/systemd/system/dynamic-user-test.service <<EOF
[Service]
ExecStart=/usr/bin/sleep 4711
DynamicUser=yes
EOF
# systemctl daemon-reload
# systemctl start dynamic-user-test
# systemctl status dynamic-user-test
● dynamic-user-test.service
   Loaded: loaded (/etc/systemd/system/dynamic-user-test.service; static; vendor preset: disabled)
   Active: active (running) since Fri 2017-10-06 13:12:25 CEST; 3s ago
 Main PID: 2967 (sleep)
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/dynamic-user-test.service
           └─2967 /usr/bin/sleep 4711

Okt 06 13:12:25 sigma systemd[1]: Started dynamic-user-test.service.
# ps -e -o pid,comm,user | grep 2967
 2967 sleep           dynamic-user-test
# id dynamic-user-test
uid=64642(dynamic-user-test) gid=64642(dynamic-user-test) groups=64642(dynamic-user-test)
# systemctl stop dynamic-user-test
# id dynamic-user-test
id: ‘dynamic-user-test’: no such user

In this example, we create a unit file with DynamicUser= turned on, start it, check if it's running correctly, have a look at the service process' user (which is named like the service; systemd does this automatically if the service name is suitable as user name, and you didn't configure any user name to use explicitly), stop the service and verify that the user ceased to exist too.

That's already pretty cool. Let's step it up a notch, by doing the same in an interactive transient service (for those who don't know systemd well: a transient service is a service that is defined and started dynamically at run-time, for example via the systemd-run command from the shell. Think: run a service without having to write a unit file first):

# systemd-run --pty --property=DynamicUser=yes --property=StateDirectory=wuff /bin/sh
Running as unit: run-u15750.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4$ id
uid=63122(run-u15750) gid=63122(run-u15750) groups=63122(run-u15750) context=system_u:system_r:initrc_t:s0
sh-4.4$ ls -al /var/lib/private/
total 0
drwxr-xr-x. 3 root       root        60  6. Okt 13:21 .
drwxr-xr-x. 1 root       root       852  6. Okt 13:21 ..
drwxr-xr-x. 1 run-u15750 run-u15750   8  6. Okt 13:22 wuff
sh-4.4$ ls -ld /var/lib/wuff
lrwxrwxrwx. 1 root root 12  6. Okt 13:21 /var/lib/wuff -> private/wuff
sh-4.4$ ls -ld /var/lib/wuff/
drwxr-xr-x. 1 run-u15750 run-u15750 0  6. Okt 13:21 /var/lib/wuff/
sh-4.4$ echo hello > /var/lib/wuff/test
sh-4.4$ exit
exit
# id run-u15750
id: ‘run-u15750’: no such user
# ls -al /var/lib/private
total 0
drwx------. 1 root  root   66  6. Okt 13:21 .
drwxr-xr-x. 1 root  root  852  6. Okt 13:21 ..
drwxr-xr-x. 1 63122 63122   8  6. Okt 13:22 wuff
# ls -ld /var/lib/wuff
lrwxrwxrwx. 1 root root 12  6. Okt 13:21 /var/lib/wuff -> private/wuff
# ls -ld /var/lib/wuff/
drwxr-xr-x. 1 63122 63122 8  6. Okt 13:22 /var/lib/wuff/
# cat /var/lib/wuff/test
hello

The above invokes an interactive shell as transient service run-u15750.service (systemd-run picked that name automatically, since we didn't specify anything explicitly) with a dynamic user whose name is derived automatically from the service name. Because StateDirectory=wuff is used, a persistent state directory for the service is made available as /var/lib/wuff. In the interactive shell running inside the service, the ls commands show the /var/lib/private boundary directory and its contents, as well as the symlink that is placed for the service. Finally, before exiting the shell, a file is created in the state directory. Back in the original command shell we check if the user is still allocated: it is not, of course, since the service ceased to exist when we exited the shell and with it the dynamic user associated with it. From the host we check the state directory of the service, with similar commands as we did from inside of it. We see that things are set up pretty much the same way in both cases, except for two things: first of all the user/group of the files is now shown as raw numeric UIDs instead of the user/group names derived from the unit name. That's because the user ceased to exist at this point, and "ls" shows the raw UID for files owned by users that don't exist. Secondly, the access mode of the boundary directory is different: when we look at it from outside of the service it is not readable by anyone but root, when we looked from inside we saw it it being world readable.

Now, let's see how things look if we start another transient service, reusing the state directory from the first invocation:

# systemd-run --pty --property=DynamicUser=yes --property=StateDirectory=wuff /bin/sh
Running as unit: run-u16087.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4$ cat /var/lib/wuff/test
hello
sh-4.4$ ls -al /var/lib/wuff/
total 4
drwxr-xr-x. 1 run-u16087 run-u16087  8  6. Okt 13:22 .
drwxr-xr-x. 3 root       root       60  6. Okt 15:42 ..
-rw-r--r--. 1 run-u16087 run-u16087  6  6. Okt 13:22 test
sh-4.4$ id
uid=63122(run-u16087) gid=63122(run-u16087) groups=63122(run-u16087) context=system_u:system_r:initrc_t:s0
sh-4.4$ exit
exit

Here, systemd-run picked a different auto-generated unit name, but the used dynamic UID is still the same, as it was read from the pre-existing state directory, and was otherwise unused. As we can see the test file we generated earlier is accessible and still contains the data we left in there. Do note that the user name is different this time (as it is derived from the unit name, which is different), but the UID it is assigned to is the same one as on the first invocation. We can thus see that the mentioned optimization of the UID allocation logic (i.e. that we start the allocation loop from the UID owner of any existing state directory) took effect, so that no recursive chown()ing was required.

And that's the end of our example, which hopefully illustrated a bit how this concept and implementation works.

Use-cases

Now that we had a look at how to enable this logic for a unit and how it is implemented, let's discuss where this actually could be useful in real life.

  • One major benefit of dynamic user IDs is that running a privilege-separated service leaves no artifacts in the system. A system user is allocated and made use of, but it is discarded automatically in a safe and secure way after use, in a fashion that is safe for later recycling. Thus, quickly invoking a short-lived service for processing some job can be protected properly through a user ID without having to pre-allocate it and without this draining the available UID pool any longer than necessary.

  • In many cases, starting a service no longer requires package-specific preparation. Or in other words, quite often useradd/mkdir/chown/chmod invocations in "post-inst" package scripts, as well as sysusers.d and tmpfiles.d drop-ins become unnecessary, as the DynamicUser= and StateDirectory=/CacheDirectory=/LogsDirectory= logic can do the necessary work automatically, on-demand and with a well-defined life-cycle.

  • By combining dynamic user IDs with the transient unit concept, new creative ways of sand-boxing are made available. For example, let's say you don't trust the correct implementation of the sort command. You can now lock it into a simple, robust, dynamic UID sandbox with a simple systemd-run and still integrate it into a shell pipeline like any other command. Here's an example, showcasing a shell pipeline whose middle element runs as a dynamically on-the-fly allocated UID, that is released when the pipelines ends.

    # cat some-file.txt | systemd-run ---pipe --property=DynamicUser=1 sort -u | grep -i foobar > some-other-file.txt
    
  • By combining dynamic user IDs with the systemd templating logic it is now possible to do much more fine-grained and fully automatic UID management. For example, let's say you have a template unit file /etc/systemd/system/foobard@.service:

    [Service]
    ExecStart=/usr/bin/myfoobarserviced
    DynamicUser=1
    StateDirectory=foobar/%i
    

    Now, let's say you want to start one instance of this service for each of your customers. All you need to do now for that is:

    # systemctl enable foobard@customerxyz.service --now
    

    And you are done. (Invoke this as many times as you like, each time replacing customerxyz by some customer identifier, you get the idea.)

  • By combining dynamic user IDs with socket activation you may easily implement a system where each incoming connection is served by a process instance running as a different, fresh, newly allocated UID within its own sandbox. Here's an example waldo.socket:

    [Socket]
    ListenStream=2048
    Accept=yes
    

    With a matching waldo@.service:

    [Service]
    ExecStart=-/usr/bin/myservicebinary
    DynamicUser=yes
    

    With the two unit files above, systemd will listen on TCP/IP port 2048, and for each incoming connection invoke a fresh instance of waldo@.service, each time utilizing a different, new, dynamically allocated UID, neatly isolated from any other instance.

  • Dynamic user IDs combine very well with state-less systems, i.e. systems that come up with an unpopulated /etc and /var. A service using dynamic user IDs and the StateDirectory=, CacheDirectory=, LogsDirectory= and RuntimeDirectory= concepts will implicitly allocate the users and directories it needs for running, right at the moment where it needs it.

Dynamic users are a very generic concept, hence a multitude of other uses are thinkable; the list above is just supposed to trigger your imagination.

What does this mean for you as a packager?

I am pretty sure that a large number of services shipped with today's distributions could benefit from using DynamicUser= and StateDirectory= (and related settings). It often allows removal of post-inst packaging scripts altogether, as well as any sysusers.d and tmpfiles.d drop-ins by unifying the needed declarations in the unit file itself. Hence, as a packager please consider switching your unit files over. That said, there are a number of conditions where DynamicUser= and StateDirectory= (and friends) cannot or should not be used. To name a few:

  1. Service that need to write to files outside of /run/<package>, /var/lib/<package>, /var/cache/<package>, /var/log/<package>, /var/tmp, /tmp, /dev/shm are generally incompatible with this scheme. This rules out daemons that upgrade the system as one example, as that involves writing to /usr.

  2. Services that maintain a herd of processes with different user IDs. Some SMTP services are like this. If your service has such a super-server design, UID management needs to be done by the super-server itself, which rules out systemd doing its dynamic UID magic for it.

  3. Services which run as root (obviously…) or are otherwise privileged.

  4. Services that need to live in the same mount name-space as the host system (for example, because they want to establish mount points visible system-wide). As mentioned DynamicUser= implies ProtectSystem=, PrivateTmp= and related options, which all require the service to run in its own mount name-space.

  5. Your focus is older distributions, i.e. distributions that do not have systemd 232 (for DynamicUser=) or systemd 235 (for StateDirectory= and friends) yet.

  6. If your distribution's packaging guides don't allow it. Consult your packaging guides, and possibly start a discussion on your distribution's mailing list about this.

Notes

A couple of additional, random notes about the implementation and use of these features:

  1. Do note that allocating or deallocating a dynamic user leaves /etc/passwd untouched. A dynamic user is added into the user database through the glibc NSS module nss-systemd, and this information never hits the disk.

  2. On traditional UNIX systems it was the job of the daemon process itself to drop privileges, while the DynamicUser= concept is designed around the service manager (i.e. systemd) being responsible for that. That said, since v235 there's a way to marry DynamicUser= and such services which want to drop privileges on their own. For that, turn on DynamicUser= and set User= to the user name the service wants to setuid() to. This has the effect that systemd will allocate the dynamic user under the specified name when the service is started. Then, prefix the command line you specify in ExecStart= with a single ! character. If you do, the user is allocated for the service, but the daemon binary is invoked as root instead of the allocated user, under the assumption that the daemon changes its UID on its own the right way. Note that after registration the user will show up instantly in the user database, and is hence resolvable like any other by the daemon process. Example: ExecStart=!/usr/bin/mydaemond

  3. You may wonder why systemd uses the UID range 61184–65519 for its dynamic user allocations (side note: in hexadecimal this reads as 0xEF00–0xFFEF). That's because distributions (specifically Fedora) tend to allocate regular users from below the 60000 range, and we don't want to step into that. We also want to stay away from 65535 and a bit around it, as some of these UIDs have special meanings (65535 is often used as special value for "invalid" or "no" UID, as it is identical to the 16bit value -1; 65534 is generally mapped to the "nobody" user, and is where some kernel subsystems map unmappable UIDs). Finally, we want to stay within the 16bit range. In a user name-spacing world each container tends to have much less than the full 32bit UID range available that Linux kernels theoretically provide. Everybody apparently can agree that a container should at least cover the 16bit range though — already to include a nobody user. (And quite frankly, I am pretty sure assigning 64K UIDs per container is nicely systematic, as the the higher 16bit of the 32bit UID values this way become a container ID, while the lower 16bit become the logical UID within each container, if you still follow what I am babbling here…). And before you ask: no this range cannot be changed right now, it's compiled in. We might change that eventually however.

  4. You might wonder what happens if you already used UIDs from the 61184–65519 range on your system for other purposes. systemd should handle that mostly fine, as long as that usage is properly registered in the user database: when allocating a dynamic user we pick a UID, see if it is currently used somehow, and if yes pick a different one, until we find a free one. Whether a UID is used right now or not is checked through NSS calls. Moreover the IPC object lists are checked to see if there are any objects owned by the UID we are about to pick. This means systemd will avoid using UIDs you have assigned otherwise. Note however that this of course makes the pool of available UIDs smaller, and in the worst cases this means that allocating a dynamic user might fail because there simply are no unused UIDs in the range.

  5. If not specified otherwise the name for a dynamically allocated user is derived from the service name. Not everything that's valid in a service name is valid in a user-name however, and in some cases a randomized name is used instead to deal with this. Often it makes sense to pick the user names to register explicitly. For that use User= and choose whatever you like.

  6. If you pick a user name with User= and combine it with DynamicUser= and the user already exists statically it will be used for the service and the dynamic user logic is automatically disabled. This permits automatic up- and downgrades between static and dynamic UIDs. For example, it provides a nice way to move a system from static to dynamic UIDs in a compatible way: as long as you select the same User= value before and after switching DynamicUser= on, the service will continue to use the statically allocated user if it exists, and only operates in the dynamic mode if it does not. This is useful for other cases as well, for example to adapt a service that normally would use a dynamic user to concepts that require statically assigned UIDs, for example to marry classic UID-based file system quota with such services.

  7. systemd always allocates a pair of dynamic UID and GID at the same time, with the same numeric ID.

  8. If the Linux kernel had a "shiftfs" or similar functionality, i.e. a way to mount an existing directory to a second place, but map the exposed UIDs/GIDs in some way configurable at mount time, this would be excellent for the implementation of StateDirectory= in conjunction with DynamicUser=. It would make the recursive chown()ing step unnecessary, as the host version of the state directory could simply be mounted into a the service's mount name-space, with a shift applied that maps the directory's owner to the services' UID/GID. But I don't have high hopes in this regard, as all work being done in this area appears to be bound to user name-spacing — which is a concept not used here (and I guess one could say user name-spacing is probably more a source of problems than a solution to one, but you are welcome to disagree on that).

And that's all for now. Enjoy your dynamic users!


All Systems Go! 2017 Schedule Published

The All Systems Go! 2017 schedule has been published!

I am happy to announce that we have published the All Systems Go! 2017 schedule! We are very happy with the large number and the quality of the submissions we got, and the resulting schedule is exceptionally strong.

Without further ado:

Here's the schedule for the first day (Saturday, 21st of October).

And here's the schedule for the second day (Sunday, 22nd of October).

Here are a couple of keywords from the topics of the talks: 1password, azure, bluetooth, build systems, casync, cgroups, cilium, cockpit, containers, ebpf, flatpak, habitat, IoT, kubernetes, landlock, meson, OCI, rkt, rust, secureboot, skydive, systemd, testing, tor, varlink, virtualization, wifi, and more.

Our speakers are from all across the industry: Chef CoreOS, Covalent, Facebook, Google, Intel, Kinvolk, Microsoft, Mozilla, Pantheon, Pengutronix, Red Hat, SUSE and more.

For further information about All Systems Go! visit our conference web site.

Make sure to buy your ticket for All Systems Go! 2017 now! A limited number of tickets are left at this point, so make sure you get yours before we are all sold out! Find all details here.

See you in Berlin!


All Systems Go! 2017 CfP Closes Soon!

The All Systems Go! 2017 Call for Participation is Closing on September 3rd!

Please make sure to get your presentation proprosals forAll Systems Go! 2017 in now! The CfP closes on sunday!

In case you haven't heard about All Systems Go! yet, here's a quick reminder what kind of conference it is, and why you should attend and speak there:

All Systems Go! is an Open Source community conference focused on the projects and technologies at the foundation of modern Linux systems — specifically low-level user-space technologies. Its goal is to provide a friendly and collaborative gathering place for individuals and communities working to push these technologies forward. All Systems Go! 2017 takes place in Berlin, Germany on October 21st+22nd. All Systems Go! is a 2-day event with 2-3 talks happening in parallel. Full presentation slots are 30-45 minutes in length and lightning talk slots are 5-10 minutes.

In particular, we are looking for sessions including, but not limited to, the following topics:

  • Low-level container executors and infrastructure
  • IoT and embedded OS infrastructure
  • OS, container, IoT image delivery and updating
  • Building Linux devices and applications
  • Low-level desktop technologies
  • Networking
  • System and service management
  • Tracing and performance measuring
  • IPC and RPC systems
  • Security and Sandboxing

While our focus is definitely more on the user-space side of things, talks about kernel projects are welcome too, as long as they have a clear and direct relevance for user-space.

To submit your proposal now please visit our CFP submission web site.

For further information about All Systems Go! visit our conference web site.

systemd.conf will not take place this year in lieu of All Systems Go!. All Systems Go! welcomes all projects that contribute to Linux user space, which, of course, includes systemd. Thus, anything you think was appropriate for submission to systemd.conf is also fitting for All Systems Go!


All Systems Go! 2017 Speakers

The All Systems Go! 2017 Headline Speakers Announced!

Don't forget to send in your submissions to the All Systems Go! 2017 CfP! Proposals are accepted until September 3rd!

A couple of headline speakers have been announced now:

  • Alban Crequy (Kinvolk)
  • Brian "Redbeard" Harrington (CoreOS)
  • Gianluca Borello (Sysdig)
  • Jon Boulle (NStack/CoreOS)
  • Martin Pitt (Debian)
  • Thomas Graf (covalent.io/Cilium)
  • Vincent Batts (Red Hat/OCI)
  • (and yours truly)

These folks will also review your submissions as part of the papers committee!

All Systems Go! is an Open Source community conference focused on the projects and technologies at the foundation of modern Linux systems — specifically low-level user-space technologies. Its goal is to provide a friendly and collaborative gathering place for individuals and communities working to push these technologies forward.

All Systems Go! 2017 takes place in Berlin, Germany on October 21st+22nd.

To submit your proposal now please visit our CFP submission web site.

For further information about All Systems Go! visit our conference web site.


casync Video

Video of my casync Presentation @ kinvolk

The great folks at kinvolk have uploaded a video of my casync presentation at their offices last week.

The slides are available as well.

Enjoy!


mkosi — A Tool for Generating OS Images

Introducing mkosi

After blogging about casync I realized I never blogged about the mkosi tool that combines nicely with it. mkosi has been around for a while already, and its time to make it a bit better known. mkosi stands for Make Operating System Image, and is a tool for precisely that: generating an OS tree or image that can be booted.

Yes, there are many tools like mkosi, and a number of them are quite well known and popular. But mkosi has a number of features that I think make it interesting for a variety of use-cases that other tools don't cover that well.

What is mkosi?

What are those use-cases, and what does mkosi precisely set apart? mkosi is definitely a tool with a focus on developer's needs for building OS images, for testing and debugging, but also for generating production images with cryptographic protection. A typical use-case would be to add a mkosi.default file to an existing project (for example, one written in C or Python), and thus making it easy to generate an OS image for it. mkosi will put together the image with development headers and tools, compile your code in it, run your test suite, then throw away the image again, and build a new one, this time without development headers and tools, and install your build artifacts in it. This final image is then "production-ready", and only contains your built program and the minimal set of packages you configured otherwise. Such an image could then be deployed with casync (or any other tool of course) to be delivered to your set of servers, or IoT devices or whatever you are building.

mkosi is supposed to be legacy-free: the focus is clearly on today's technology, not yesteryear's. Specifically this means that we'll generate GPT partition tables, not MBR/DOS ones. When you tell mkosi to generate a bootable image for you, it will make it bootable on EFI, not on legacy BIOS. The GPT images generated follow specifications such as the Discoverable Partitions Specification, so that /etc/fstab can remain unpopulated and tools such as systemd-nspawn can automatically dissect the image and boot from them.

So, let's have a look on the specific images it can generate:

  1. Raw GPT disk image, with ext4 as root
  2. Raw GPT disk image, with btrfs as root
  3. Raw GPT disk image, with a read-only squashfs as root
  4. A plain directory on disk containing the OS tree directly (this is useful for creating generic container images)
  5. A btrfs subvolume on disk, similar to the plain directory
  6. A tarball of a plain directory

When any of the GPT choices above are selected, a couple of additional options are available:

  1. A swap partition may be added in
  2. The system may be made bootable on EFI systems
  3. Separate partitions for /home and /srv may be added in
  4. The root, /home and /srv partitions may be optionally encrypted with LUKS
  5. The root partition may be protected using dm-verity, thus making offline attacks on the generated system hard
  6. If the image is made bootable, the dm-verity root hash is automatically added to the kernel command line, and the kernel together with its initial RAM disk and the kernel command line is optionally cryptographically signed for UEFI SecureBoot

Note that mkosi is distribution-agnostic. It currently can build images based on the following Linux distributions:

  1. Fedora
  2. Debian
  3. Ubuntu
  4. ArchLinux
  5. openSUSE

Note though that not all distributions are supported at the same feature level currently. Also, as mkosi is based on dnf --installroot, debootstrap, pacstrap and zypper, and those packages are not packaged universally on all distributions, you might not be able to build images for all those distributions on arbitrary host distributions.

The GPT images are put together in a way that they aren't just compatible with UEFI systems, but also with VM and container managers (that is, at least the smart ones, i.e. VM managers that know UEFI, and container managers that grok GPT disk images) to a large degree. In fact, the idea is that you can use mkosi to build a single GPT image that may be used to:

  1. Boot on bare-metal boxes
  2. Boot in a VM
  3. Boot in a systemd-nspawn container
  4. Directly run a systemd service off, using systemd's RootImage= unit file setting

Note that in all four cases the dm-verity data is automatically used if available to ensure the image is not tampered with (yes, you read that right, systemd-nspawn and systemd's RootImage= setting automatically do dm-verity these days if the image has it.)

Mode of Operation

The simplest usage of mkosi is by simply invoking it without parameters (as root):

# mkosi

Without any configuration this will create a GPT disk image for you, will call it image.raw and drop it in the current directory. The distribution used will be the same one as your host runs.

Of course in most cases you want more control about how the image is put together, i.e. select package sets, select the distribution, size partitions and so on. Most of that you can actually specify on the command line, but it is recommended to instead create a couple of mkosi.$SOMETHING files and directories in some directory. Then, simply change to that directory and run mkosi without any further arguments. The tool will then look in the current working directory for these files and directories and make use of them (similar to how make looks for a Makefile…). Every single file/directory is optional, but if they exist they are honored. Here's a list of the files/directories mkosi currently looks for:

  1. mkosi.default — This is the main configuration file, here you can configure what kind of image you want, which distribution, which packages and so on.

  2. mkosi.extra/ — If this directory exists, then mkosi will copy everything inside it into the images built. You can place arbitrary directory hierarchies in here, and they'll be copied over whatever is already in the image, after it was put together by the distribution's package manager. This is the best way to drop additional static files into the image, or override distribution-supplied ones.

  3. mkosi.build — This executable file is supposed to be a build script. When it exists, mkosi will build two images, one after the other in the mode already mentioned above: the first version is the build image, and may include various build-time dependencies such as a compiler or development headers. The build script is also copied into it, and then run inside it. The script should then build whatever shall be built and place the result in $DESTDIR (don't worry, popular build tools such as Automake or Meson all honor $DESTDIR anyway, so there's not much to do here explicitly). It may also run a test suite, or anything else you like. After the script finished, the build image is removed again, and a second image (the final image) is built. This time, no development packages are included, and the build script is not copied into the image again — however, the build artifacts from the first run (i.e. those placed in $DESTDIR) are copied into the image.

  4. mkosi.postinst — If this executable script exists, it is invoked inside the image (inside a systemd-nspawn invocation) and can adjust the image as it likes at a very late point in the image preparation. If mkosi.build exists, i.e. the dual-phased development build process used, then this script will be invoked twice: once inside the build image and once inside the final image. The first parameter passed to the script clarifies which phase it is run in.

  5. mkosi.nspawn — If this file exists, it should contain a container configuration file for systemd-nspawn (see systemd.nspawn(5) for details), which shall be shipped along with the final image and shall be included in the check-sum calculations (see below).

  6. mkosi.cache/ — If this directory exists, it is used as package cache directory for the builds. This directory is effectively bind mounted into the image at build time, in order to speed up building images. The package installers of the various distributions will place their package files here, so that subsequent runs can reuse them.

  7. mkosi.passphrase — If this file exists, it should contain a pass-phrase to use for the LUKS encryption (if that's enabled for the image built). This file should not be readable to other users.

  8. mkosi.secure-boot.crt and mkosi.secure-boot.key should be an X.509 key pair to use for signing the kernel and initrd for UEFI SecureBoot, if that's enabled.

How to use it

So, let's come back to our most trivial example, without any of the mkosi.$SOMETHING files around:

# mkosi

As mentioned, this will create a build file image.raw in the current directory. How do we use it? Of course, we could dd it onto some USB stick and boot it on a bare-metal device. However, it's much simpler to first run it in a container for testing:

# systemd-nspawn -bi image.raw

And there you go: the image should boot up, and just work for you.

Now, let's make things more interesting. Let's still not use any of the mkosi.$SOMETHING files around:

# mkosi -t raw_btrfs --bootable -o foobar.raw
# systemd-nspawn -bi foobar.raw

This is similar as the above, but we made three changes: it's no longer GPT + ext4, but GPT + btrfs. Moreover, the system is made bootable on UEFI systems, and finally, the output is now called foobar.raw.

Because this system is bootable on UEFI systems, we can run it in KVM:

qemu-kvm -m 512 -smp 2 -bios /usr/share/edk2/ovmf/OVMF_CODE.fd -drive format=raw,file=foobar.raw

This will look very similar to the systemd-nspawn invocation, except that this uses full VM virtualization rather than container virtualization. (Note that the way to run a UEFI qemu/kvm instance appears to change all the time and is different on the various distributions. It's quite annoying, and I can't really tell you what the right qemu command line is to make this work on your system.)

Of course, it's not all raw GPT disk images with mkosi. Let's try a plain directory image:

# mkosi -d fedora -t directory -o quux
# systemd-nspawn -bD quux

Of course, if you generate the image as plain directory you can't boot it on bare-metal just like that, nor run it in a VM.

A more complex command line is the following:

# mkosi -d fedora -t raw_squashfs --checksum --xz --package=openssh-clients --package=emacs

In this mode we explicitly pick Fedora as the distribution to use, ask mkosi to generate a compressed GPT image with a root squashfs, compress the result with xz, and generate a SHA256SUMS file with the hashes of the generated artifacts. The package will contain the SSH client as well as everybody's favorite editor.

Now, let's make use of the various mkosi.$SOMETHING files. Let's say we are working on some Automake-based project and want to make it easy to generate a disk image off the development tree with the version you are hacking on. Create a configuration file:

# cat > mkosi.default <<EOF
[Distribution]
Distribution=fedora
Release=24

[Output]
Format=raw_btrfs
Bootable=yes

[Packages]
# The packages to appear in both the build and the final image
Packages=openssh-clients httpd
# The packages to appear in the build image, but absent from the final image
BuildPackages=make gcc libcurl-devel
EOF

And let's add a build script:

# cat > mkosi.build <<EOF
#!/bin/sh
./autogen.sh
./configure --prefix=/usr
make -j `nproc`
make install
EOF
# chmod +x mkosi.build

And with all that in place we can now build our project into a disk image, simply by typing:

# mkosi

Let's try it out:

# systemd-nspawn -bi image.raw

Of course, if you do this you'll notice that building an image like this can be quite slow. And slow build times are actively hurtful to your productivity as a developer. Hence let's make things a bit faster. First, let's make use of a package cache shared between runs:

# mkdir mkosi.cache

Building images now should already be substantially faster (and generate less network traffic) as the packages will now be downloaded only once and reused. However, you'll notice that unpacking all those packages and the rest of the work is still quite slow. But mkosi can help you with that. Simply use mkosi's incremental build feature. In this mode mkosi will make a copy of the build and final images immediately before dropping in your build sources or artifacts, so that building an image becomes a lot quicker: instead of always starting totally from scratch a build will now reuse everything it can reuse from a previous run, and immediately begin with building your sources rather than the build image to build your sources in. To enable the incremental build feature use -i:

# mkosi -i

Note that if you use this option, the package list is not updated anymore from your distribution's servers, as the cached copy is made after all packages are installed, and hence until you actually delete the cached copy the distribution's network servers aren't contacted again and no RPMs or DEBs are downloaded. This means the distribution you use becomes "frozen in time" this way. (Which might be a bad thing, but also a good thing, as it makes things kinda reproducible.)

Of course, if you run mkosi a couple of times you'll notice that it won't overwrite the generated image when it already exists. You can either delete the file yourself first (rm image.raw) or let mkosi do it for you right before building a new image, with mkosi -f. You can also tell mkosi to not only remove any such pre-existing images, but also remove any cached copies of the incremental feature, by using -f twice.

I wrote mkosi originally in order to test systemd, and quickly generate a disk image of various distributions with the most current systemd version from git, without all that affecting my host system. I regularly use mkosi for that today, in incremental mode. The two commands I use most in that context are:

# mkosi -if && systemd-nspawn -bi image.raw

And sometimes:

# mkosi -iff && systemd-nspawn -bi image.raw

The latter I use only if I want to regenerate everything based on the very newest set of RPMs provided by Fedora, instead of a cached snapshot of it.

BTW, the mkosi files for systemd are included in the systemd git tree: mkosi.default and mkosi.build. This way, any developer who wants to quickly test something with current systemd git, or wants to prepare a patch based on it and test it can check out the systemd repository and simply run mkosi in it and a few minutes later he has a bootable image he can test in systemd-nspawn or KVM. casync has similar files: mkosi.default, mkosi.build.

Random Interesting Features

  1. As mentioned already, mkosi will generate dm-verity enabled disk images if you ask for it. For that use the --verity switch on the command line or Verity= setting in mkosi.default. Of course, dm-verity implies that the root volume is read-only. In this mode the top-level dm-verity hash will be placed along-side the output disk image in a file named the same way, but with the .roothash suffix. If the image is to be created bootable, the root hash is also included on the kernel command line in the roothash= parameter, which current systemd versions can use to both find and activate the root partition in a dm-verity protected way. BTW: it's a good idea to combine this dm-verity mode with the raw_squashfs image mode, to generate a genuinely protected, compressed image suitable for running in your IoT device.

  2. As indicated above, mkosi can automatically create a check-sum file SHA256SUMS for you (--checksum) covering all the files it outputs (which could be the image file itself, a matching .nspawn file using the mkosi.nspawn file mentioned above, as well as the .roothash file for the dm-verity root hash.) It can then optionally sign this with gpg (--sign). Note that systemd's machinectl pull-tar and machinectl pull-raw command can download these files and the SHA256SUMS file automatically and verify things on download. With other words: what mkosi outputs is perfectly ready for downloads using these two systemd commands.

  3. As mentioned, mkosi is big on supporting UEFI SecureBoot. To make use of that, place your X.509 key pair in two files mkosi.secureboot.crt and mkosi.secureboot.key, and set SecureBoot= or --secure-boot. If so, mkosi will sign the kernel/initrd/kernel command line combination during the build. Of course, if you use this mode, you should also use Verity=/--verity=, otherwise the setup makes only partial sense. Note that mkosi will not help you with actually enrolling the keys you use in your UEFI BIOS.

  4. mkosi has minimal support for GIT checkouts: when it recognizes it is run in a git checkout and you use the mkosi.build script stuff, the source tree will be copied into the build image, but will all files excluded by .gitignore removed.

  5. There's support for encryption in place. Use --encrypt= or Encrypt=. Note that the UEFI ESP is never encrypted though, and the root partition only if explicitly requested. The /home and /srv partitions are unconditionally encrypted if that's enabled.

  6. Images may be built with all documentation removed.

  7. The password for the root user and additional kernel command line arguments may be configured for the image to generate.

Minimum Requirements

Current mkosi requires Python 3.5, and has a number of dependencies, listed in the README. Most notably you need a somewhat recent systemd version to make use of its full feature set: systemd 233. Older versions are already packaged for various distributions, but much of what I describe above is only available in the most recent release mkosi 3.

The UEFI SecureBoot support requires sbsign which currently isn't available in Fedora, but there's a COPR.

Future

It is my intention to continue turning mkosi into a tool suitable for:

  1. Testing and debugging projects
  2. Building images for secure devices
  3. Building portable service images
  4. Building images for secure VMs and containers

One of the biggest goals I have for the future is to teach mkosi and systemd/sd-boot native support for A/B IoT style partition setups. The idea is that the combination of systemd, casync and mkosi provides generic building blocks for building secure, auto-updating devices in a generic way from, even though all pieces may be used individually, too.

FAQ

  1. Why are you reinventing the wheel again? This is exactly like $SOMEOTHERPROJECT! — Well, to my knowledge there's no tool that integrates this nicely with your project's development tree, and can do dm-verity and UEFI SecureBoot and all that stuff for you. So nope, I don't think this exactly like $SOMEOTHERPROJECT, thank you very much.

  2. What about creating MBR/DOS partition images? — That's really out of focus to me. This is an exercise in figuring out how generic OSes and devices in the future should be built and an attempt to commoditize OS image building. And no, the future doesn't speak MBR, sorry. That said, I'd be quite interested in adding support for booting on Raspberry Pi, possibly using a hybrid approach, i.e. using a GPT disk label, but arranging things in a way that the Raspberry Pi boot protocol (which is built around DOS partition tables), can still work.

  3. Is this portable? — Well, depends what you mean by portable. No, this tool runs on Linux only, and as it uses systemd-nspawn during the build process it doesn't run on non-systemd systems either. But then again, you should be able to create images for any architecture you like with it, but of course if you want the image bootable on bare-metal systems only systems doing UEFI are supported (but systemd-nspawn should still work fine on them).

  4. Where can I get this stuff? — Try GitHub. And some distributions carry packaged versions, but I think none of them the current v3 yet.

  5. Is this a systemd project? — Yes, it's hosted under the systemd GitHub umbrella. And yes, during run-time systemd-nspawn in a current version is required. But no, the code-bases are separate otherwise, already because systemd is a C project, and mkosi Python.

  6. Requiring systemd 233 is a pretty steep requirement, no? — Yes, but the feature we need kind of matters (systemd-nspawn's --overlay= switch), and again, this isn't supposed to be a tool for legacy systems.

  7. Can I run the resulting images in LXC or Docker? — Humm, I am not an LXC nor Docker guy. If you select directory or subvolume as image type, LXC should be able to boot the generated images just fine, but I didn't try. Last time I looked, Docker doesn't permit running proper init systems as PID 1 inside the container, as they define their own run-time without intention to emulate a proper system. Hence, no I don't think it will work, at least not with an unpatched Docker version. That said, again, don't ask me questions about Docker, it's not precisely my area of expertise, and quite frankly I am not a fan. To my knowledge neither LXC nor Docker are able to run containers directly off GPT disk images, hence the various raw_xyz image types are definitely not compatible with either. That means if you want to generate a single raw disk image that can be booted unmodified both in a container and on bare-metal, then systemd-nspawn is the container manager to go for (specifically, its -i/--image= switch).

Should you care? Is this a tool for you?

Well, that's up to you really.

If you hack on some complex project and need a quick way to compile and run your project on a specific current Linux distribution, then mkosi is an excellent way to do that. Simply drop the mkosi.default and mkosi.build files in your git tree and everything will be easy. (And of course, as indicated above: if the project you are hacking on happens to be called systemd or casync be aware that those files are already part of the git tree — you can just use them.)

If you hack on some embedded or IoT device, then mkosi is a great choice too, as it will make it reasonably easy to generate secure images that are protected against offline modification, by using dm-verity and UEFI SecureBoot.

If you are an administrator and need a nice way to build images for a VM or systemd-nspawn container, or a portable service then mkosi is an excellent choice too.

If you care about legacy computers, old distributions, non-systemd init systems, old VM managers, Docker, … then no, mkosi is not for you, but there are plenty of well-established alternatives around that cover that nicely.

And never forget: mkosi is an Open Source project. We are happy to accept your patches and other contributions.

Oh, and one unrelated last thing: don't forget to submit your talk proposal and/or buy a ticket for All Systems Go! 2017 in Berlin — the conference where things like systemd, casync and mkosi are discussed, along with a variety of other Linux userspace projects used for building systems.


All Systems Go! 2017 CfP Open

The All Systems Go! 2017 Call for Participation is Now Open!

We’d like to invite presentation proposals for All Systems Go! 2017!

All Systems Go! is an Open Source community conference focused on the projects and technologies at the foundation of modern Linux systems — specifically low-level user-space technologies. Its goal is to provide a friendly and collaborative gathering place for individuals and communities working to push these technologies forward.

All Systems Go! 2017 takes place in Berlin, Germany on October 21st+22nd.

All Systems Go! is a 2-day event with 2-3 talks happening in parallel. Full presentation slots are 30-45 minutes in length and lightning talk slots are 5-10 minutes.

We are now accepting submissions for presentation proposals. In particular, we are looking for sessions including, but not limited to, the following topics:

  • Low-level container executors and infrastructure
  • IoT and embedded OS infrastructure
  • OS, container, IoT image delivery and updating
  • Building Linux devices and applications
  • Low-level desktop technologies
  • Networking
  • System and service management
  • Tracing and performance measuring
  • IPC and RPC systems
  • Security and Sandboxing

While our focus is definitely more on the user-space side of things, talks about kernel projects are welcome too, as long as they have a clear and direct relevance for user-space.

Please submit your proposals by September 3rd. Notification of acceptance will be sent out 1-2 weeks later.

To submit your proposal now please visit our CFP submission web site.

For further information about All Systems Go! visit our conference web site.

systemd.conf will not take place this year in lieu of All Systems Go!. All Systems Go! welcomes all projects that contribute to Linux user space, which, of course, includes systemd. Thus, anything you think was appropriate for submission to systemd.conf is also fitting for All Systems Go!


casync — A tool for distributing file system images

Introducing casync

In the past months I have been working on a new project: casync. casync takes inspiration from the popular rsync file synchronization tool as well as the probably even more popular git revision control system. It combines the idea of the rsync algorithm with the idea of git-style content-addressable file systems, and creates a new system for efficiently storing and delivering file system images, optimized for high-frequency update cycles over the Internet. Its current focus is on delivering IoT, container, VM, application, portable service or OS images, but I hope to extend it later in a generic fashion to become useful for backups and home directory synchronization as well (but more about that later).

The basic technological building blocks casync is built from are neither new nor particularly innovative (at least not anymore), however the way casync combines them is different from existing tools, and that's what makes it useful for a variety of use-cases that other tools can't cover that well.

Why?

I created casync after studying how today's popular tools store and deliver file system images. To briefly name a few: Docker has a layered tarball approach, OSTree serves the individual files directly via HTTP and maintains packed deltas to speed up updates, while other systems operate on the block layer and place raw squashfs images (or other archival file systems, such as IS09660) for download on HTTP shares (in the better cases combined with zsync data).

Neither of these approaches appeared fully convincing to me when used in high-frequency update cycle systems. In such systems, it is important to optimize towards a couple of goals:

  1. Most importantly, make updates cheap traffic-wise (for this most tools use image deltas of some form)
  2. Put boundaries on disk space usage on servers (keeping deltas between all version combinations clients might want to run updates between, would suggest keeping an exponentially growing amount of deltas on servers)
  3. Put boundaries on disk space usage on clients
  4. Be friendly to Content Delivery Networks (CDNs), i.e. serve neither too many small nor too many overly large files, and only require the most basic form of HTTP. Provide the repository administrator with high-level knobs to tune the average file size delivered.
  5. Simplicity to use for users, repository administrators and developers

I don't think any of the tools mentioned above are really good on more than a small subset of these points.

Specifically: Docker's layered tarball approach dumps the "delta" question onto the feet of the image creators: the best way to make your image downloads minimal is basing your work on an existing image clients might already have, and inherit its resources, maintaining full history. Here, revision control (a tool for the developer) is intermingled with update management (a concept for optimizing production delivery). As container histories grow individual deltas are likely to stay small, but on the other hand a brand-new deployment usually requires downloading the full history onto the deployment system, even though there's no use for it there, and likely requires substantially more disk space and download sizes.

OSTree's serving of individual files is unfriendly to CDNs (as many small files in file trees cause an explosion of HTTP GET requests). To counter that OSTree supports placing pre-calculated delta images between selected revisions on the delivery servers, which means a certain amount of revision management, that leaks into the clients.

Delivering direct squashfs (or other file system) images is almost beautifully simple, but of course means every update requires a full download of the newest image, which is both bad for disk usage and generated traffic. Enhancing it with zsync makes this a much better option, as it can reduce generated traffic substantially at very little cost of history/meta-data (no explicit deltas between a large number of versions need to be prepared server side). On the other hand server requirements in disk space and functionality (HTTP Range requests) are minus points for the use-case I am interested in.

(Note: all the mentioned systems have great properties, and it's not my intention to badmouth them. They only point I am trying to make is that for the use case I care about — file system image delivery with high high frequency update-cycles — each system comes with certain drawbacks.)

Security & Reproducibility

Besides the issues pointed out above I wasn't happy with the security and reproducibility properties of these systems. In today's world where security breaches involving hacking and breaking into connected systems happen every day, an image delivery system that cannot make strong guarantees regarding data integrity is out of date. Specifically, the tarball format is famously nondeterministic: the very same file tree can result in any number of different valid serializations depending on the tool used, its version and the underlying OS and file system. Some tar implementations attempt to correct that by guaranteeing that each file tree maps to exactly one valid serialization, but such a property is always only specific to the tool used. I strongly believe that any good update system must guarantee on every single link of the chain that there's only one valid representation of the data to deliver, that can easily be verified.

What casync Is

So much about the background why I created casync. Now, let's have a look what casync actually is like, and what it does. Here's the brief technical overview:

Encoding: Let's take a large linear data stream, split it into variable-sized chunks (the size of each being a function of the chunk's contents), and store these chunks in individual, compressed files in some directory, each file named after a strong hash value of its contents, so that the hash value may be used to as key for retrieving the full chunk data. Let's call this directory a "chunk store". At the same time, generate a "chunk index" file that lists these chunk hash values plus their respective chunk sizes in a simple linear array. The chunking algorithm is supposed to create variable, but similarly sized chunks from the data stream, and do so in a way that the same data results in the same chunks even if placed at varying offsets. For more information see this blog story.

Decoding: Let's take the chunk index file, and reassemble the large linear data stream by concatenating the uncompressed chunks retrieved from the chunk store, keyed by the listed chunk hash values.

As an extra twist, we introduce a well-defined, reproducible, random-access serialization format for file trees (think: a more modern tar), to permit efficient, stable storage of complete file trees in the system, simply by serializing them and then passing them into the encoding step explained above.

Finally, let's put all this on the network: for each image you want to deliver, generate a chunk index file and place it on an HTTP server. Do the same with the chunk store, and share it between the various index files you intend to deliver.

Why bother with all of this? Streams with similar contents will result in mostly the same chunk files in the chunk store. This means it is very efficient to store many related versions of a data stream in the same chunk store, thus minimizing disk usage. Moreover, when transferring linear data streams chunks already known on the receiving side can be made use of, thus minimizing network traffic.

Why is this different from rsync or OSTree, or similar tools? Well, one major difference between casync and those tools is that we remove file boundaries before chunking things up. This means that small files are lumped together with their siblings and large files are chopped into pieces, which permits us to recognize similarities in files and directories beyond file boundaries, and makes sure our chunk sizes are pretty evenly distributed, without the file boundaries affecting them.

The "chunking" algorithm is based on a the buzhash rolling hash function. SHA256 is used as strong hash function to generate digests of the chunks. xz is used to compress the individual chunks.

Here's a diagram, hopefully explaining a bit how the encoding process works, wasn't it for my crappy drawing skills:

Diagram

The diagram shows the encoding process from top to bottom. It starts with a block device or a file tree, which is then serialized and chunked up into variable sized blocks. The compressed chunks are then placed in the chunk store, while a chunk index file is written listing the chunk hashes in order. (The original SVG of this graphic may be found here.)

Details

Note that casync operates on two different layers, depending on the use-case of the user:

  1. You may use it on the block layer. In this case the raw block data on disk is taken as-is, read directly from the block device, split into chunks as described above, compressed, stored and delivered.

  2. You may use it on the file system layer. In this case, the file tree serialization format mentioned above comes into play: the file tree is serialized depth-first (much like tar would do it) and then split into chunks, compressed, stored and delivered.

The fact that it may be used on both the block and file system layer opens it up for a variety of different use-cases. In the VM and IoT ecosystems shipping images as block-level serializations is more common, while in the container and application world file-system-level serializations are more typically used.

Chunk index files referring to block-layer serializations carry the .caibx suffix, while chunk index files referring to file system serializations carry the .caidx suffix. Note that you may also use casync as direct tar replacement, i.e. without the chunking, just generating the plain linear file tree serialization. Such files carry the .catar suffix. Internally .caibx are identical to .caidx files, the only difference is semantical: .caidx files describe a .catar file, while .caibx files may describe any other blob. Finally, chunk stores are directories carrying the .castr suffix.

Features

Here are a couple of other features casync has:

  1. When downloading a new image you may use casync's --seed= feature: each block device, file, or directory specified is processed using the same chunking logic described above, and is used as preferred source when putting together the downloaded image locally, avoiding network transfer of it. This of course is useful whenever updating an image: simply specify one or more old versions as seed and only download the chunks that truly changed since then. Note that using seeds requires no history relationship between seed and the new image to download. This has major benefits: you can even use it to speed up downloads of relatively foreign and unrelated data. For example, when downloading a container image built using Ubuntu you can use your Fedora host OS tree in /usr as seed, and casync will automatically use whatever it can from that tree, for example timezone and locale data that tends to be identical between distributions. Example: casync extract http://example.com/myimage.caibx --seed=/dev/sda1 /dev/sda2. This will place the block-layer image described by the indicated URL in the /dev/sda2 partition, using the existing /dev/sda1 data as seeding source. An invocation like this could be typically used by IoT systems with an A/B partition setup. Example 2: casync extract http://example.com/mycontainer-v3.caidx --seed=/srv/container-v1 --seed=/srv/container-v2 /src/container-v3, is very similar but operates on the file system layer, and uses two old container versions to seed the new version.

  2. When operating on the file system level, the user has fine-grained control on the meta-data included in the serialization. This is relevant since different use-cases tend to require a different set of saved/restored meta-data. For example, when shipping OS images, file access bits/ACLs and ownership matter, while file modification times hurt. When doing personal backups OTOH file ownership matters little but file modification times are important. Moreover different backing file systems support different feature sets, and storing more information than necessary might make it impossible to validate a tree against an image if the meta-data cannot be replayed in full. Due to this, casync provides a set of --with= and --without= parameters that allow fine-grained control of the data stored in the file tree serialization, including the granularity of modification times and more. The precise set of selected meta-data features is also always part of the serialization, so that seeding can work correctly and automatically.

  3. casync tries to be as accurate as possible when storing file system meta-data. This means that besides the usual baseline of file meta-data (file ownership and access bits), and more advanced features (extended attributes, ACLs, file capabilities) a number of more exotic data is stored as well, including Linux chattr(1) file attributes, as well as FAT file attributes (you may wonder why the latter? — EFI is FAT, and /efi is part of the comprehensive serialization of any host). In the future I intend to extend this further, for example storing btrfs sub-volume information where available. Note that as described above every single type of meta-data may be turned off and on individually, hence if you don't need FAT file bits (and I figure it's pretty likely you don't), then they won't be stored.

  4. The user creating .caidx or .caibx files may control the desired average chunk length (before compression) freely, using the --chunk-size= parameter. Smaller chunks increase the number of generated files in the chunk store and increase HTTP GET load on the server, but also ensure that sharing between similar images is improved, as identical patterns in the images stored are more likely to be recognized. By default casync will use a 64K average chunk size. Tweaking this can be particularly useful when adapting the system to specific CDNs, or when delivering compressed disk images such as squashfs (see below).

  5. Emphasis is placed on making all invocations reproducible, well-defined and strictly deterministic. As mentioned above this is a requirement to reach the intended security guarantees, but is also useful for many other use-cases. For example, the casync digest command may be used to calculate a hash value identifying a specific directory in all desired detail (use --with= and --without to pick the desired detail). Moreover the casync mtree command may be used to generate a BSD mtree(5) compatible manifest of a directory tree, .caidx or .catar file.

  6. The file system serialization format is nicely composable. By this I mean that the serialization of a file tree is the concatenation of the serializations of all files and file sub-trees located at the top of the tree, with zero meta-data references from any of these serializations into the others. This property is essential to ensure maximum reuse of chunks when similar trees are serialized.

  7. When extracting file trees or disk image files, casync will automatically create reflinks from any specified seeds if the underlying file system supports it (such as btrfs, ocfs, and future xfs). After all, instead of copying the desired data from the seed, we can just tell the file system to link up the relevant blocks. This works both when extracting .caidx and .caibx files — the latter of course only when the extracted disk image is placed in a regular raw image file on disk, rather than directly on a plain block device, as plain block devices do not know the concept of reflinks.

  8. Optionally, when extracting file trees, casync can create traditional UNIX hard-links for identical files in specified seeds (--hardlink=yes). This works on all UNIX file systems, and can save substantial amounts of disk space. However, this only works for very specific use-cases where disk images are considered read-only after extraction, as any changes made to one tree will propagate to all other trees sharing the same hard-linked files, as that's the nature of hard-links. In this mode, casync exposes OSTree-like behavior, which is built heavily around read-only hard-link trees.

  9. casync tries to be smart when choosing what to include in file system images. Implicitly, file systems such as procfs and sysfs are excluded from serialization, as they expose API objects, not real files. Moreover, the "nodump" (+d) chattr(1) flag is honored by default, permitting users to mark files to exclude from serialization.

  10. When creating and extracting file trees casync may apply an automatic or explicit UID/GID shift. This is particularly useful when transferring container image for use with Linux user name-spacing.

  11. In addition to local operation, casync currently supports HTTP, HTTPS, FTP and ssh natively for downloading chunk index files and chunks (the ssh mode requires installing casync on the remote host, though, but an sftp mode not requiring that should be easy to add). When creating index files or chunks, only ssh is supported as remote back-end.

  12. When operating on block-layer images, you may expose locally or remotely stored images as local block devices. Example: casync mkdev http://example.com/myimage.caibx exposes the disk image described by the indicated URL as local block device in /dev, which you then may use the usual block device tools on, such as mount or fdisk (only read-only though). Chunks are downloaded on access with high priority, and at low priority when idle in the background. Note that in this mode, casync also plays a role similar to "dm-verity", as all blocks are validated against the strong digests in the chunk index file before passing them on to the kernel's block layer. This feature is implemented though Linux' NBD kernel facility.

  13. Similar, when operating on file-system-layer images, you may mount locally or remotely stored images as regular file systems. Example: casync mount http://example.com/mytree.caidx /srv/mytree mounts the file tree image described by the indicated URL as a local directory /srv/mytree. This feature is implemented though Linux' FUSE kernel facility. Note that special care is taken that the images exposed this way can be packed up again with casync make and are guaranteed to return the bit-by-bit exact same serialization again that it was mounted from. No data is lost or changed while passing things through FUSE (OK, strictly speaking this is a lie, we do lose ACLs, but that's hopefully just a temporary gap to be fixed soon).

  14. In IoT A/B fixed size partition setups the file systems placed in the two partitions are usually much shorter than the partition size, in order to keep some room for later, larger updates. casync is able to analyze the super-block of a number of common file systems in order to determine the actual size of a file system stored on a block device, so that writing a file system to such a partition and reading it back again will result in reproducible data. Moreover this speeds up the seeding process, as there's little point in seeding the white-space after the file system within the partition.

Example Command Lines

Here's how to use casync, explained with a few examples:

$ casync make foobar.caidx /some/directory

This will create a chunk index file foobar.caidx in the local directory, and populate the chunk store directory default.castr located next to it with the chunks of the serialization (you can change the name for the store directory with --store= if you like). This command operates on the file-system level. A similar command operating on the block level:

$ casync make foobar.caibx /dev/sda1

This command creates a chunk index file foobar.caibx in the local directory describing the current contents of the /dev/sda1 block device, and populates default.castr in the same way as above. Note that you may as well read a raw disk image from a file instead of a block device:

$ casync make foobar.caibx myimage.raw

To reconstruct the original file tree from the .caidx file and the chunk store of the first command, use:

$ casync extract foobar.caidx /some/other/directory

And similar for the block-layer version:

$ casync extract foobar.caibx /dev/sdb1

or, to extract the block-layer version into a raw disk image:

$ casync extract foobar.caibx myotherimage.raw

The above are the most basic commands, operating on local data only. Now let's make this more interesting, and reference remote resources:

$ casync extract http://example.com/images/foobar.caidx /some/other/directory

This extracts the specified .caidx onto a local directory. This of course assumes that foobar.caidx was uploaded to the HTTP server in the first place, along with the chunk store. You can use any command you like to accomplish that, for example scp or rsync. Alternatively, you can let casync do this directly when generating the chunk index:

$ casync make ssh.example.com:images/foobar.caidx /some/directory

This will use ssh to connect to the ssh.example.com server, and then places the .caidx file and the chunks on it. Note that this mode of operation is "smart": this scheme will only upload chunks currently missing on the server side, and not re-transmit what already is available.

Note that you can always configure the precise path or URL of the chunk store via the --store= option. If you do not do that, then the store path is automatically derived from the path or URL: the last component of the path or URL is replaced by default.castr.

Of course, when extracting .caidx or .caibx files from remote sources, using a local seed is advisable:

$ casync extract http://example.com/images/foobar.caidx --seed=/some/exising/directory /some/other/directory

Or on the block layer:

$ casync extract http://example.com/images/foobar.caibx --seed=/dev/sda1 /dev/sdb2

When creating chunk indexes on the file system layer casync will by default store meta-data as accurately as possible. Let's create a chunk index with reduced meta-data:

$ casync make foobar.caidx --with=sec-time --with=symlinks --with=read-only /some/dir

This command will create a chunk index for a file tree serialization that has three features above the absolute baseline supported: 1s granularity time-stamps, symbolic links and a single read-only bit. In this mode, all the other meta-data bits are not stored, including nanosecond time-stamps, full UNIX permission bits, file ownership or even ACLs or extended attributes.

Now let's make a .caidx file available locally as a mounted file system, without extracting it:

$ casync mount http://example.comf/images/foobar.caidx /mnt/foobar

And similar, let's make a .caibx file available locally as a block device:

$ casync mkdev http://example.comf/images/foobar.caibx

This will create a block device in /dev and print the used device node path to STDOUT.

As mentioned, casync is big about reproducibility. Let's make use of that to calculate the a digest identifying a very specific version of a file tree:

$ casync digest .

This digest will include all meta-data bits casync and the underlying file system know about. Usually, to make this useful you want to configure exactly what meta-data to include:

$ casync digest --with=unix .

This makes use of the --with=unix shortcut for selecting meta-data fields. Specifying --with-unix= selects all meta-data that traditional UNIX file systems support. It is a shortcut for writing out: --with=16bit-uids --with=permissions --with=sec-time --with=symlinks --with=device-nodes --with=fifos --with=sockets.

Note that when calculating digests or creating chunk indexes you may also use the negative --without= option to remove specific features but start from the most precise:

$ casync digest --without=flag-immutable

This generates a digest with the most accurate meta-data, but leaves one feature out: chattr(1)'s immutable (+i) file flag.

To list the contents of a .caidx file use a command like the following:

$ casync list http://example.com/images/foobar.caidx

or

$ casync mtree http://example.com/images/foobar.caidx

The former command will generate a brief list of files and directories, not too different from tar t or ls -al in its output. The latter command will generate a BSD mtree(5) compatible manifest. Note that casync actually stores substantially more file meta-data than mtree files can express, though.

What casync isn't

  1. casync is not an attempt to minimize serialization and downloaded deltas to the extreme. Instead, the tool is supposed to find a good middle ground, that is good on traffic and disk space, but not at the price of convenience or requiring explicit revision control. If you care about updates that are absolutely minimal, there are binary delta systems around that might be an option for you, such as Google's Courgette.

  2. casync is not a replacement for rsync, or git or zsync or anything like that. They have very different use-cases and semantics. For example, rsync permits you to directly synchronize two file trees remotely. casync just cannot do that, and it is unlikely it every will.

Where next?

casync is supposed to be a generic synchronization tool. Its primary focus for now is delivery of OS images, but I'd like to make it useful for a couple other use-cases, too. Specifically:

  1. To make the tool useful for backups, encryption is missing. I have pretty concrete plans how to add that. When implemented, the tool might become an alternative to restic, BorgBackup or tarsnap.

  2. Right now, if you want to deploy casync in real-life, you still need to validate the downloaded .caidx or .caibx file yourself, for example with some gpg signature. It is my intention to integrate with gpg in a minimal way so that signing and verifying chunk index files is done automatically.

  3. In the longer run, I'd like to build an automatic synchronizer for $HOME between systems from this. Each $HOME instance would be stored automatically in regular intervals in the cloud using casync, and conflicts would be resolved locally.

  4. casync is written in a shared library style, but it is not yet built as one. Specifically this means that almost all of casync's functionality is supposed to be available as C API soon, and applications can process casync files on every level. It is my intention to make this library useful enough so that it will be easy to write a module for GNOME's gvfs subsystem in order to make remote or local .caidx files directly available to applications (as an alternative to casync mount). In fact the idea is to make this all flexible enough that even the remoting back-ends can be replaced easily, for example to replace casync's default HTTP/HTTPS back-ends built on CURL with GNOME's own HTTP implementation, in order to share cookies, certificates, … There's also an alternative method to integrate with casync in place already: simply invoke casync as a sub-process. casync will inform you about a certain set of state changes using a mechanism compatible with sd_notify(3). In future it will also propagate progress data this way and more.

  5. I intend to a add a new seeding back-end that sources chunks from the local network. After downloading the new .caidx file off the Internet casync would then search for the listed chunks on the local network first before retrieving them from the Internet. This should speed things up on all installations that have multiple similar systems deployed in the same network.

Further plans are listed tersely in the TODO file.

FAQ:

  1. Is this a systemd project?casync is hosted under the github systemd umbrella, and the projects share the same coding style. However, the code-bases are distinct and without interdependencies, and casync works fine both on systemd systems and systems without it.

  2. Is casync portable? — At the moment: no. I only run Linux and that's what I code for. That said, I am open to accepting portability patches (unlike for systemd, which doesn't really make sense on non-Linux systems), as long as they don't interfere too much with the way casync works. Specifically this means that I am not too enthusiastic about merging portability patches for OSes lacking the openat(2) family of APIs.

  3. Does casync require reflink-capable file systems to work, such as btrfs? — No it doesn't. The reflink magic in casync is employed when the file system permits it, and it's good to have it, but it's not a requirement, and casync will implicitly fall back to copying when it isn't available. Note that casync supports a number of file system features on a variety of file systems that aren't available everywhere, for example FAT's system/hidden file flags or xfs's projinherit file flag.

  4. Is casync stable? — I just tagged the first, initial release. While I have been working on it since quite some time and it is quite featureful, this is the first time I advertise it publicly, and it hence received very little testing outside of its own test suite. I am also not fully ready to commit to the stability of the current serialization or chunk index format. I don't see any breakages coming for it though. casync is pretty light on documentation right now, and does not even have a man page. I also intend to correct that soon.

  5. Are the .caidx/.caibx and .catar file formats open and documented?casync is Open Source, so if you want to know the precise format, have a look at the sources for now. It's definitely my intention to add comprehensive docs for both formats however. Don't forget this is just the initial version right now.

  6. casync is just like $SOMEOTHERTOOL! Why are you reinventing the wheel (again)? — Well, because casync isn't "just like" some other tool. I am pretty sure I did my homework, and that there is no tool just like casync right now. The tools coming closest are probably rsync, zsync, tarsnap, restic, but they are quite different beasts each.

  7. Why did you invent your own serialization format for file trees? Why don't you just use tar? — That's a good question, and other systems — most prominently tarsnap — do that. However, as mentioned above tar doesn't enforce reproducibility. It also doesn't really do random access: if you want to access some specific file you need to read every single byte stored before it in the tar archive to find it, which is of course very expensive. The serialization casync implements places a focus on reproducibility, random access, and meta-data control. Much like traditional tar it can still be generated and extracted in a stream fashion though.

  8. Does casync save/restore SELinux/SMACK file labels? — At the moment not. That's not because I wouldn't want it to, but simply because I am not a guru of either of these systems, and didn't want to implement something I do not fully grok nor can test. If you look at the sources you'll find that there's already some definitions in place that keep room for them though. I'd be delighted to accept a patch implementing this fully.

  9. What about delivering squashfs images? How well does chunking work on compressed serializations? – That's a very good point! Usually, if you apply the a chunking algorithm to a compressed data stream (let's say a tar.gz file), then changing a single bit at the front will propagate into the entire remainder of the file, so that minimal changes will explode into major changes. Thankfully this doesn't apply that strictly to squashfs images, as it provides random access to files and directories and thus breaks up the compression streams in regular intervals to make seeking easy. This fact is beneficial for systems employing chunking, such as casync as this means single bit changes might affect their vicinity but will not explode in an unbounded fashion. In order achieve best results when delivering squashfs images through casync the block sizes of squashfs and the chunks sizes of casync should be matched up (using casync's --chunk-size= option). How precisely to choose both values is left a research subject for the user, for now.

  10. What does the name casync mean? – It's a synchronizing tool, hence the -sync suffix, following rsync's naming. It makes use of the content-addressable concept of git hence the ca- prefix.

  11. Where can I get this stuff? Is it already packaged? – Check out the sources on GitHub. I just tagged the first version. Martin Pitt has packaged casync for Ubuntu. There is also an ArchLinux package. Zbigniew Jędrzejewski-Szmek has prepared a Fedora RPM that hopefully will soon be included in the distribution.

Should you care? Is this a tool for you?

Well, that's up to you really. If you are involved with projects that need to deliver IoT, VM, container, application or OS images, then maybe this is a great tool for you — but other options exist, some of which are linked above.

Note that casync is an Open Source project: if it doesn't do exactly what you need, prepare a patch that adds what you need, and we'll consider it.

If you are interested in the project and would like to talk about this in person, I'll be presenting casync soon at Kinvolk's Linux Technologies Meetup in Berlin, Germany. You are invited. I also intend to talk about it at All Systems Go!, also in Berlin.


Avoiding CVE-2016-8655 with systemd

Avoiding CVE-2016-8655 with systemd

Just a quick note: on recent versions of systemd it is relatively easy to block the vulnerability described in CVE-2016-8655 for individual services.

Since systemd release v211 there's an option RestrictAddressFamilies= for service unit files which takes away the right to create sockets of specific address families for processes of the service. In your unit file, add RestrictAddressFamilies=~AF_PACKET to the [Service] section to make AF_PACKET unavailable to it (i.e. a blacklist), which is sufficient to close the attack path. Safer of course is a whitelist of address families whch you can define by dropping the ~ character from the assignment. Here's a trivial example:

…
[Service]
ExecStart=/usr/bin/mydaemon
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
…

This restricts access to socket families, so that the service may access only AF_INET, AF_INET6 or AF_UNIX sockets, which is usually the right, minimal set for most system daemons. (AF_INET is the low-level name for the IPv4 address family, AF_INET6 for the IPv6 address family, and AF_UNIX for local UNIX socket IPC).

Starting with systemd v232 we added RestrictAddressFamilies= to all of systemd's own unit files, always with the minimal set of socket address families appropriate.

With the upcoming v233 release we'll provide a second method for blocking this vulnerability. Using RestrictNamespaces= it is possible to limit which types of Linux namespaces a service may get access to. Use RestrictNamespaces=yes to prohibit access to any kind of namespace, or set RestrictNamespaces=net ipc (or similar) to restrict access to a specific set (in this case: network and IPC namespaces). Given that user namespaces have been a major source of security vulnerabilities in the past months it's probably a good idea to block namespaces on all services which don't need them (which is probably most of them).

Of course, ideally, distributions such as Fedora, as well as upstream developers would turn on the various sandboxing settings systemd provides like these ones by default, since they know best which kind of address families or namespaces a specific daemon needs.


systemd.conf 2016 Over Now

systemd.conf 2016 is Over Now!

A few days ago systemd.conf 2016 ended, our second conference of this kind. I personally enjoyed this conference a lot: the talks, the atmosphere, the audience, the organization, the location, they all were excellent!

I'd like to take the opportunity to thanks everybody involved. In particular I'd like to thank Chris, Daniel, Sandra and Henrike for organizing the conference, your work was stellar!

I'd also like to thank our sponsors, without which the conference couldn't take place like this, of course. In particular I'd like to thank our gold sponsor, Red Hat, our organizing sponsor Kinvolk, as well as our silver sponsors CoreOS and Facebook. I'd also like to thank our bronze sponsors Collabora, OpenSUSE, Pantheon, Pengutronix, our supporting sponsor Codethink and last but not least our media sponsor Linux Magazin. Thank you all!

I'd also like to thank the Video Operation Center ("VOC") for their amazing work on live-streaming the conference and making all talks available on YouTube. It's amazing how efficient the VOC is, it's simply stunning! Thank you guys!

In case you missed this year's iteration of the conference, please have a look at our YouTube Channel. You'll find all of this year's talks there, as well the ones from last year. (For example, my welcome talk is available here). Enjoy!

We hope to see you again next year, for systemd.conf 2017 in Berlin!

© Lennart Poettering. Built using Pelican. Theme by Giulio Fidente on github. .