Category: projects

Boot & Base OS Miniconf at Linux Plumbers Conference 2012, San Diego

Linux Plumbers Conference Logo

We are working on putting together a miniconf on the topic of Boot & Base OS for the Linux Plumbers Conference 2012 in San Diego (Aug 29-31). And we need your submission!

Are you working on some exciting project related to Boot and Base OS and would like to present your work? Then please submit something following these guidelines, but please CC Kay Sievers and Lennart Poettering.

I hope that at this point the Linux Plumbers Conference needs little introduction, so I will spare any further prose on how great and useful and the best conference ever it is for everybody who works on the plumbing layer of Linux. However, there's one conference that will be co-located with LPC that is still little known, because it happens for the first time: The C Conference, organized by Brandon Philips and friends. It covers all things C, and they are still looking for more topics, in a reverse CFP. Please consider submitting a proposal and registering to the conference!

C
Conference Logo


The Most Awesome, Least-Advertised Fedora 17 Feature

There's one feature In the upcoming Fedora 17 release that is immensly useful but very little known, since its feature page 'ckremoval' does not explicitly refer to it in its name: true automatic multi-seat support for Linux.

A multi-seat computer is a system that offers not only one local seat for a user, but multiple, at the same time. A seat refers to a combination of a screen, a set of input devices (such as mice and keyboards), and maybe an audio card or webcam, as individual local workplace for a user. A multi-seat computer can drive an entire class room of seats with only a fraction of the cost in hardware, energy, administration and space: you only have one PC, which usually has way enough CPU power to drive 10 or more workplaces. (In fact, even a Netbook has fast enough to drive a couple of seats!) Automatic multi-seat refers to an entirely automatically managed seat setup: whenever a new seat is plugged in a new login screen immediately appears -- without any manual configuration --, and when the seat is unplugged all user sessions on it are removed without delay.

In Fedora 17 we added this functionality to the low-level user and device tracking of systemd, replacing the previous ConsoleKit logic that lacked support for automatic multi-seat. With all the ground work done in systemd, udev and the other components of our plumbing layer the last remaining bits were surprisingly easy to add.

Currently, the automatic multi-seat logic works best with the USB multi-seat hardware from Plugable you can buy cheaply on Amazon (US). These devices require exactly zero configuration with the new scheme implemented in Fedora 17: just plug them in at any time, login screens pop up on them, and you have your additional seats. Alternatively you can also assemble your seat manually with a few easy loginctl attach commands, from any kind of hardware you might have lying around. To get a full seat you need multiple graphics cards, keyboards and mice: one set for each seat. (Later on we'll probably have a graphical setup utility for additional seats, but that's not a pressing issue we believe, as the plug-n-play multi-seat support with the Plugable devices is so awesomely nice.)

Plugable provided us for free with hardware for testing multi-seat. They are also involved with the upstream development of the USB DisplayLink driver for Linux. Due to their positive involvement with Linux we can only recommend to buy their hardware. They are good guys, and support Free Software the way all hardware vendors should! (And besides that, their hardware is also nicely put together. For example, in contrast to most similar vendors they actually assign proper vendor/product IDs to their USB hardware so that we can easily recognize their hardware when plugged in to set up automatic seats.)

Currently, all this magic is only implemented in the GNOME stack with the biggest component getting updated being the GNOME Display Manager. On the Plugable USB hardware you get a full GNOME Shell session with all the usual graphical gimmicks, the same way as on any other hardware. (Yes, GNOME 3 works perfectly fine on simpler graphics cards such as these USB devices!) If you are hacking on a different desktop environment, or on a different display manager, please have a look at the multi-seat documentation we put together, and particularly at our short piece about writing display managers which are multi-seat capable.

If you work on a major desktop environment or display manager and would like to implement multi-seat support for it, but lack the aforementioned Plugable hardware, we might be able to provide you with the hardware for free. Please contact us directly, and we might be able to send you a device. Note that we don't have unlimited devices available, hence we'll probably not be able to pass hardware to everybody who asks, and we will pass the hardware preferably to people who work on well-known software or otherwise have contributed good code to the community already. Anyway, if in doubt, ping us, and explain to us why you should get the hardware, and we'll consider you! (Oh, and this not only applies to display managers, if you hack on some other software where multi-seat awareness would be truly useful, then don't hesitate and ping us!)

Phoronix has this story about this new multi-seat support which is quite interesting and full of pictures. Please have a look.

Plugable started a Pledge drive to lower the price of the Plugable USB multi-seat terminals further. It's full of pictures (and a video showing all this in action!), and uses the code we now make available in Fedora 17 as base. Please consider pledging a few bucks.

Recently David Zeuthen added multi-seat support to udisks as well. With this in place, a user logged in on a specific seat can only see the USB storage plugged into his individual seat, but does not see any USB storage plugged into any other local seat. With this in place we closed the last missing bit of multi-seat support in our desktop stack.

With this code in Fedora 17 we cover the big use cases of multi-seat already: internet cafes, class rooms and similar installations can provide PC workplaces cheaply and easily without any manual configuration. Later on we want to build on this and make this useful for different uses too: for example, the ability to get a login screen as easily as plugging in a USB connector makes this not useful only for saving money in setups for many people, but also in embedded environments (consider monitoring/debugging screens made available via this hotplug logic) or servers (get trivially quick local access to your otherwise head-less server). To be truly useful in these areas we need one more thing though: the ability to run a simply getty (i.e. text login) on the seat, without necessarily involving a graphical UI.

The well-known X successor Wayland already comes out of the box with multi-seat support based on this logic.

Oh, and BTW, as Ubuntu appears to be "focussing" on "clarity" in the "cloud" now ;-), and chose Upstart instead of systemd, this feature won't be available in Ubuntu any time soon. That's (one detail of) the price Ubuntu has to pay for choosing to maintain it's own (largely legacy, such as ConsoleKit) plumbing stack.

Multi-seat has a long history on Unix. Since the earliest days Unix systems could be accessed by multiple local terminals at the same time. Since then local terminal support (and hence multi-seat) gradually moved out of view in computing. The fewest machines these days have more than one seat, the concept of terminals survived almost exclusively in the context of PTYs (i.e. fully virtualized API objects, disconnected from any real hardware seat) and VCs (i.e. a single virtualized local seat), but almost not in any other way (well, server setups still use serial terminals for emergency remote access, but they almost never have more than one serial terminal). All what we do in systemd is based on the ideas originally brought forward in Unix; with systemd we now try to bring back a number of the good ideas of Unix that since the old times were lost on the roadside. For example, in true Unix style we already started to expose the concept of a service in the file system (in /sys/fs/cgroup/systemd/system/), something where on Linux the (often misunderstood) "everything is a file" mantra previously fell short. With automatic multi-seat support we bring back support for terminals, but updated with all the features of today's desktops: plug and play, zero configuration, full graphics, and not limited to input devices and screens, but extending to all kinds of devices, such as audio, webcams or USB memory sticks.

Anyway, this is all for now; I'd like to thank everybody who was involved with making multi-seat work so nicely and natively on the Linux platform. You know who you are! Thanks a ton!


systemd Status Update

It has been way too long since my last status update on systemd. Here's another short, incomprehensive status update on what we worked on for systemd since then.

We have been working hard to turn systemd into the most viable set of components to build operating systems, appliances and devices from, and make it the best choice for servers, for desktops and for embedded environments alike. I think we have a really convincing set of features now, but we are actively working on making it even better.

Here's a list of some more and some less interesting features, in no particular order:

  1. We added an automatic pager to systemctl (and related tools), similar to how git has it.
  2. systemctl learnt a new switch --failed, to show only failed services.
  3. You may now start services immediately, overrding all dependency logic by passing --ignore-dependencies to systemctl. This is mostly a debugging tool and nothing people should use in real life.
  4. Sending SIGKILL as final part of the implicit shutdown logic of services is now optional and may be configured with the SendSIGKILL= option individually for each service.
  5. We split off the Vala/Gtk tools into its own project systemd-ui.
  6. systemd-tmpfiles learnt file globbing and creating FIFO special files as well as character and block device nodes, and symlinks. It also is capable of relabelling certain directories at boot now (in the SELinux sense).
  7. Immediately before shuttding dow we will now invoke all binaries found in /lib/systemd/system-shutdown/, which is useful for debugging late shutdown.
  8. You may now globally control where STDOUT/STDERR of services goes (unless individual service configuration overrides it).
  9. There's a new ConditionVirtualization= option, that makes systemd skip a specific service if a certain virtualization technology is found or not found. Similar, we now have a new option to detect whether a certain security technology (such as SELinux) is available, called ConditionSecurity=. There's also ConditionCapability= to check whether a certain process capability is in the capability bounding set of the system. There's also a new ConditionFileIsExecutable=, ConditionPathIsMountPoint=, ConditionPathIsReadWrite=, ConditionPathIsSymbolicLink=.
  10. The file system condition directives now support globbing.
  11. Service conditions may now be "triggering" and "mandatory", meaning that they can be a necessary requirement to hold for a service to start, or simply one trigger among many.
  12. At boot time we now print warnings if: /usr is on a split-off partition but not already mounted by an initrd; if /etc/mtab is not a symlink to /proc/mounts; CONFIG_CGROUPS is not enabled in the kernel. We'll also expose this as tainted flag on the bus.
  13. You may now boot the same OS image on a bare metal machine and in Linux namespace containers and will get a clean boot in both cases. This is more complicated than it sounds since device management with udev or write access to /sys, /proc/sys or things like /dev/kmsg is not available in a container. This makes systemd a first-class choice for managing thin container setups. This is all tested with systemd's own systemd-nspawn tool but should work fine in LXC setups, too. Basically this means that you do not have to adjust your OS manually to make it work in a container environment, but will just work out of the box. It also makes it easier to convert real systems into containers.
  14. We now automatically spawn gettys on HVC ttys when booting in VMs.
  15. We introduced /etc/machine-id as a generalization of D-Bus machine ID logic. See this blog story for more information. On stateless/read-only systems the machine ID is initialized randomly at boot. In virtualized environments it may be passed in from the machine manager (with qemu's -uuid switch, or via the container interface).
  16. All of the systemd-specific /etc/fstab mount options are now in the x-systemd-xyz format.
  17. To make it easy to find non-converted services we will now implicitly prefix all LSB and SysV init script descriptions with the strings "LSB:" resp. "SYSV:".
  18. We introduced /run and made it a hard dependency of systemd. This directory is now widely accepted and implemented on all relevant Linux distributions.
  19. systemctl can now execute all its operations remotely too (-H switch).
  20. We now ship systemd-nspawn, a really powerful tool that can be used to start containers for debugging, building and testing, much like chroot(1). It is useful to just get a shell inside a build tree, but is good enough to boot up a full system in it, too.
  21. If we query the user for a hard disk password at boot he may hit TAB to hide the asterisks we normally show for each key that is entered, for extra paranoia.
  22. We don't enable udev-settle.service anymore, which is only required for certain legacy software that still hasn't been updated to follow devices coming and going cleanly.
  23. We now include a tool that can plot boot speed graphs, similar to bootchartd, called systemd-analyze.
  24. At boot, we now initialize the kernel's binfmt_misc logic with the data from /etc/binfmt.d.
  25. systemctl now recognizes if it is run in a chroot() environment and will work accordingly (i.e. apply changes to the tree it is run in, instead of talking to the actual PID 1 for this). It also has a new --root= switch to work on an OS tree from outside of it.
  26. There's a new unit dependency type OnFailureIsolate= that allows entering a different target whenever a certain unit fails. For example, this is interesting to enter emergency mode if file system checks of crucial file systems failed.
  27. Socket units may now listen on Netlink sockets, special files from /proc and POSIX message queues, too.
  28. There's a new IgnoreOnIsolate= flag which may be used to ensure certain units are left untouched by isolation requests. There's a new IgnoreOnSnapshot= flag which may be used to exclude certain units from snapshot units when they are created.
  29. There's now small mechanism services for changing the local hostname and other host meta data, changing the system locale and console settings and the system clock.
  30. We now limit the capability bounding set for a number of our internal services by default.
  31. Plymouth may now be disabled globally with plymouth.enable=0 on the kernel command line.
  32. We now disallocate VTs when a getty finished running (and optionally other tools run on VTs). This adds extra security since it clears up the scrollback buffer so that subsequent users cannot get access to a user's session output.
  33. In socket units there are now options to control the IP_TRANSPARENT, SO_BROADCAST, SO_PASSCRED, SO_PASSSEC socket options.
  34. The receive and send buffers of socket units may now be set larger than the default system settings if needed by using SO_{RCV,SND}BUFFORCE.
  35. We now set the hardware timezone as one of the first things in PID 1, in order to avoid time jumps during normal userspace operation, and to guarantee sensible times on all generated logs. We also no longer save the system clock to the RTC on shutdown, assuming that this is done by the clock control tool when the user modifies the time, or automatically by the kernel if NTP is enabled.
  36. The SELinux directory got moved from /selinux to /sys/fs/selinux.
  37. We added a small service systemd-logind that keeps tracks of logged in users and their sessions. It creates control groups for them, implements the XDG_RUNTIME_DIR specification for them, maintains seats and device node ACLs and implements shutdown/idle inhibiting for clients. It auto-spawns gettys on all local VTs when the user switches to them (instead of starting six of them unconditionally), thus reducing the resource foot print by default. It has a D-Bus interface as well as a simple synchronous library interface. This mechanism obsoletes ConsoleKit which is now deprecated and should no longer be used.
  38. There's now full, automatic multi-seat support, and this is enabled in GNOME 3.4. Just by pluging in new seat hardware you get a new login screen on your seat's screen.
  39. There is now an option ControlGroupModify= to allow services to change the properties of their control groups dynamically, and one to make control groups persistent in the tree (ControlGroupPersistent=) so that they can be created and maintained by external tools.
  40. We now jump back into the initrd in shutdown, so that it can detach the root file system and the storage devices backing it. This allows (for the first time!) to reliably undo complex storage setups on shutdown and leave them in a clean state.
  41. systemctl now supports presets, a way for distributions and administrators to define their own policies on whether services should be enabled or disabled by default on package installation.
  42. systemctl now has high-level verbs for masking/unmasking units. There's also a new command (systemctl list-unit-files) for determining the list of all installed unit file files and whether they are enabled or not.
  43. We now apply sysctl variables to each new network device, as it appears. This makes /etc/sysctl.d compatible with hot-plug network devices.
  44. There's limited profiling for SELinux start-up perfomance built into PID 1.
  45. There's a new switch PrivateNetwork= to turn of any network access for a specific service.
  46. Service units may now include configuration for control group parameters. A few (such as MemoryLimit=) are exposed with high-level options, and all others are available via the generic ControlGroupAttribute= setting.
  47. There's now the option to mount certain cgroup controllers jointly at boot. We do this now for cpu and cpuacct by default.
  48. We added the journal and turned it on by default.
  49. All service output is now written to the Journal by default, regardless whether it is sent via syslog or simply written to stdout/stderr. Both message streams end up in the same location and are interleaved the way they should. All log messages even from the kernel and from early boot end up in the journal. Now, no service output gets unnoticed and is saved and indexed at the same location.
  50. systemctl status will now show the last 10 log lines for each service, directly from the journal.
  51. We now show the progress of fsck at boot on the console, again. We also show the much loved colorful [ OK ] status messages at boot again, as known from most SysV implementations.
  52. We merged udev into systemd.
  53. We implemented and documented interfaces to container managers and initrds for passing execution data to systemd. We also implemented and documented an interface for storage daemons that are required to back the root file system.
  54. There are two new options in service files to propagate reload requests between several units.
  55. systemd-cgls won't show kernel threads by default anymore, or show empty control groups.
  56. We added a new tool systemd-cgtop that shows resource usage of whole services in a top(1) like fasion.
  57. systemd may now supervise services in watchdog style. If enabled for a service the daemon daemon has to ping PID 1 in regular intervals or is otherwise considered failed (which might then result in restarting it, or even rebooting the machine, as configured). Also, PID 1 is capable of pinging a hardware watchdog. Putting this together, the hardware watchdogs PID 1 and PID 1 then watchdogs specific services. This is highly useful for high-availability servers as well as embedded machines. Since watchdog hardware is noawadays built into all modern chipsets (including desktop chipsets), this should hopefully help to make this a more widely used functionality.
  58. We added support for a new kernel command line option systemd.setenv= to set an environment variable system-wide.
  59. By default services which are started by systemd will have SIGPIPE set to ignored. The Unix SIGPIPE logic is used to reliably implement shell pipelines and when left enabled in services is usually just a source of bugs and problems.
  60. You may now configure the rate limiting that is applied to restarts of specific services. Previously the rate limiting parameters were hard-coded (similar to SysV).
  61. There's now support for loading the IMA integrity policy into the kernel early in PID 1, similar to how we already did it with the SELinux policy.
  62. There's now an official API to schedule and query scheduled shutdowns.
  63. We changed the license from GPL2+ to LGPL2.1+.
  64. We made systemd-detect-virt an official tool in the tool set. Since we already had code to detect certain VM and container environments we now added an official tool for administrators to make use of in shell scripts and suchlike.
  65. We documented numerous interfaces systemd introduced.

Much of the stuff above is already available in Fedora 15 and 16, or will be made available in the upcoming Fedora 17.

And that's it for now. There's a lot of other stuff in the git commits, but most of it is smaller and I will it thus spare you.

I'd like to thank everybody who contributed to systemd over the past years.

Thanks for your interest!


Control Groups vs. Control Groups

TL;DR: systemd does not require the performance-sensitive bits of Linux control groups enabled in the kernel. However, it does require some non-performance-sensitive bits of the control group logic.

In some areas of the community there's still some confusion about Linux control groups and their performance impact, and what precisely it is that systemd requires of them. In the hope to clear this up a bit, I'd like to point out a few things:

Control Groups are two things: (A) a way to hierarchally group and label processes, and (B) a way to then apply resource limits to these groups. systemd only requires the former (A), and not the latter (B). That means you can compile your kernel without any control group resource controllers (B) and systemd will work perfectly on it. However, if you in addition disable the grouping feature entirely (A) then systemd will loudly complain at boot and proceed only reluctantly with a big warning and in a limited functionality mode.

At compile time, the grouping/labelling feature in the kernel is enabled by CONFIG_CGROUPS=y, the individual controllers by CONFIG_CGROUP_FREEZER=y, CONFIG_CGROUP_DEVICE=y, CONFIG_CGROUP_CPUACCT=y, CONFIG_CGROUP_MEM_RES_CTLR=y, CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y, CONFIG_CGROUP_MEM_RES_CTLR_KMEM=y, CONFIG_CGROUP_PERF=y, CONFIG_CGROUP_SCHED=y, CONFIG_BLK_CGROUP=y, CONFIG_NET_CLS_CGROUP=y, CONFIG_NETPRIO_CGROUP=y. And since (as mentioned) we only need the former (A), not the latter (B) you may disable all of the latter options while enabling CONFIG_CGROUPS=y, if you want to run systemd on your system.

What about the performance impact of these options? Well, every bit of code comes at some price, so none of these options come entirely for free. However, the grouping feature (A) alters the general logic very little, it just sticks hierarchial labels on processes, and its impact is minimal since that is usually not in any hot path of the OS. This is different for the various controllers (B) which have a much bigger impact since they influence the resource management of the OS and are full of hot paths. This means that the kernel feature that systemd mandatorily requires (A) has a minimal effect on system performance, but the actually performance-sensitive features of control groups (B) are entirely optional.

On boot, systemd will mount all controller hierarchies it finds enabled in the kernel to individual directories below /sys/fs/cgroup/. This is the official place where kernel controllers are mounted to these days. The /sys/fs/cgroup/ mount point in the kernel was created precisely for this purpose. Since the control group controllers are a shared facility that might be used by a number of different subsystems a few projects have agreed on a set of rules in order to avoid that the various bits of code step on each other's toes when using these directories.

systemd will also maintain its own, private, controller-less, named control group hierarchy which is mounted to /sys/fs/cgroup/systemd/. This hierarchy is private property of systemd, and other software should not try to interfere with it. This hierarchy is how systemd makes use of the naming and grouping feature of control groups (A) without actually requiring any kernel controller enabled for that.

Now, you might notice that by default systemd does create per-service cgroups in the "cpu" controller if it finds it enabled in the kernel. This is entirely optional, however. We chose to make use of it by default to even out CPU usage between system services. Example: On a traditional web server machine Apache might end up having 100 CGI worker processes around, while MySQL only has 5 processes running. Without the use of the "cpu" controller this means that Apache all together ends up having 20x more CPU available than MySQL since the kernel tries to provide every process with the same amount of CPU time. On the other hand, if we add these two services to the "cpu" controller in individual groups by default, Apache and MySQL get the same amount of CPU, which we think is a good default.

Note that if the CPU controller is not enabled in the kernel systemd will not attempt to make use of the "cpu" hierarchy as described above. Also, even if it is enabled in the kernel it is trivial to tell systemd not to make use of it: Simply edit /etc/systemd/system.conf and set DefaultControllers= to the empty string.

Let's discuss a few frequently heard complaints regarding systemd's use of control groups:

  • systemd mounts all controllers to /sys/fs/cgroup/ even though my software requires it at /dev/cgroup/ (or some other place)! The standardization of /sys/fs/cgroup/ as mount point of the hierarchies is a relatively recent change in the kernel. Some software has not been updated yet for it. If you cannot change the software in question you are welcome to unmount the hierarchies from /sys/fs/cgroup/ and mount them wherever you need them instead. However, make sure to leave /sys/fs/cgroup/systemd/ untouched.
  • systemd makes use of the "cpu" hierarchy, but it should leave its dirty fingers from it! As mentioned above, just set the DefaultControllers= option of systemd to the empty string.
  • I need my two controllers "foo" and "bar" mounted into one hierarchy, but systemd mounts them in two! Use the JoinControllers= setting in /etc/systemd/system.conf to mount several controllers into a single hierarchy.
  • Control groups are evil and they make everything slower! Well, please read the text above and understand the difference between "control-groups-as-in-naming-and-grouping" (A) and "cgroups-as-in-controllers" (B). Then, please turn off all controllers in you kernel build (B) but leave CONFIG_CGROUPS=y (A) enabled.
  • I have heard some kernel developers really hate control groups and think systemd is evil because it requires them! Well, there are a couple of things behind the dislike of control groups by some folks. Primarily, this is probably caused because the hackers in question do not distuingish the naming-and-grouping bits of the control group logic (A) and the controllers that are based on it (B). Mainly, their beef is with the latter (which systemd does not require, which is the key point I am trying to make in the text above), but there are other issues as well: for example, the code of the grouping logic is not the most beautiful bit of code ever written by man (which is thankfully likely to get better now, since the control groups subsystem now has an active maintainer again). And then for some developers it is important that they can compare the runtime behaviour of many historic kernel versions in order to find bugs (git bisect). Since systemd requires kernels with basic control group support enabled, and this is a relatively recent feature addition to the kernel, this makes it difficult for them to use a newer distribution with all these old kernels that predate cgroups. Anyway, the summary is probably that what matters to developers is different from what matters to users and administrators.

I hope this explanation was useful for a reader or two! Thank you for your time!


GUADEC 2012 CFP Ending Soon!

In case you haven't submitted your talk proposal for GUADEC 2012 in A Coruña, Spain yet, hurry: the deadline is on April 14th, i.e. this saturday! Read der Call for Participation! Submit a proposal!


/tmp or not /tmp?

A number of Linux distributions have recently switched (or started switching) to /tmp on tmpfs by default (ArchLinux, Debian among others). Other distributions have plans/are discussing doing the same (Ubuntu, OpenSUSE). Since we believe this is a good idea and it's good to keep the delta between the distributions minimal we are proposing the same for Fedora 18, too. On Solaris a similar change has already been implemented in 1994 (and other Unixes have made a similar change long ago, too). Yet, not all of our software is written in a way that it works nicely together with /tmp on tmpfs.

Another Fedora feature (for Fedora 17) changed the semantics of /tmp for many system services to make them more secure, by isolating the /tmp namespaces of the various services. Handling of temporary files in /tmp has been security sensitive since it has been introduced since it traditionally has been a world writable, shared namespace and unless all user code safely uses randomized file names it is vulnerable to DoS attacks and worse.

In this blog story I'd like to shed some light on proper usage of /tmp and what your Linux application should use for what purpose. We'll not discuss why /tmp on tmpfs is a good idea, for that refer to the Fedora feature page. Here we'll just discuss what /tmp should be used for and for what it shouldn't be, as well as what should be used instead. All that in order to make sure your application remains compatible with these new features introduced to many newer Linux distributions.

/tmp is (as the name suggests) an area where temporary files applications require during operation may be placed. Of course, temporary files differ very much in their properties:

  • They can be large, or very small
  • They might be used for sharing between users, or be private to users
  • They might need to be persistent across boots, or very volatile
  • They might need to be machine-local or shared on the network

Traditionally, /tmp has not only been the place where actual temporary files are stored, but some software used to place (and often still continues to place) communication primitives such as sockets, FIFOs, shared memory there as well. Notably X11, but many others too. Usage of world-writable shared namespaces for communication purposes has always been problematic, since to establish communication you need stable names, but stable names open the doors for DoS attacks. This can be corrected partially, by establishing protected per-app directories for certain services during early boot (like we do for X11), but this only fixes the problem partially, since this only works correctly if every package installation is followed by a reboot.

Besides /tmp there are various other places where temporary files (or other files that traditionally have been stored in /tmp) can be stored. Here's a quick overview of the candidates:

  • /tmp, POSIX suggests this is flushed as boot, FHS says that files do not need to be persistent between two runs of the application. Old files are often cleaned up automatically after a time ("aging"). Usually it is recommended to use $TMPDIR if it is set before falling back to /tmp directly. As mentioned, this is a tmpfs on many Linuxes/Unixes (and most likely will be for most soon), and hence should be used only for small files. It's generally a shared namespace, hence the only APIs for using it should be mkstemp(), mkdtemp() (and friends) to be entirely safe.[1] Recently, improvements have been made to turn this shared namespace into a private namespace (see above), but that doesn't relieve developers from writing secure code that is also safe if /tmp is a shared namespace. Because /tmp is no longer necessarily a shared namespace it is generally unsuitable as a location for communication primitives. It is machine-private and local. It's usually fully featured (locking, ...). This directory is world writable and thus available for both privileged and unprivileged code.
  • /var/tmp, according to FHS "more persistent" than /tmp, and is less often cleaned up (it's persistent across reboots, for example). It's not on a tmpfs, but on a real disk, and hence can be used to store much larger files. The same namespace problems apply as with /tmp, hence also exclusively use mkstemp()/mkdtemp() for this directory. It is also automatically cleaned up by time. It is machine-private. It's not necessarily fully featured (no locking, ...). This directory is world writable and thus available for both privileged and unprivileged code. We suggest to also check $TMPDIR before falling back to /var/tmp. That way if $TMPDIR is set this overrides usage of both /tmp and /var/tmp.
  • /run (traditionally /var/run) where privileged daemons can store runtime data, such as communication primitives. This is where your daemon should place its sockets. It's guaranteed to be a shared namespace, but is only writable by privileged code and hence very safe to use. This file system is guaranteed to be a tmpfs and is hence automatically flushed at boots. No automatic clean-up is done beyond that. It is machine-private and local. It is fully-featured, and provides all functionality the local OS can provide (locking, sockets, ...).
  • $XDG_RUNTIME_DIR where unprivileged user software can store runtime data, such as communication primitives. This is similar to /run but for user applications. It's a user private namespace, and hence very safe to use. It's cleaned up automatically at logout and also is cleaned up by time via "aging". It is machine-private and fully featured. In GLib applications use g_get_user_runtime_dir() to query the path of this directory.
  • $XDG_CACHE_HOME where unprivileged user software can store non-essential data. It's a private namespace of the user. It might be shared between machines. It is not automatically cleaned up, and not fully featured (no locking, and so on, due to NFS). In GLib applications use g_get_user_cache_dir() to query this directory.
  • $XDG_DOWNLOAD_DIR where unprivileged user software can store downloads and downloads in progress. It should only be used for downloads, and is a private namespace fo the user, but might be shared between machines. It is not automatically cleaned up and not fully featured. In GLib applications use g_get_user_special_dir() to query the path of this directory.

Now that we have introduced the contestants, here's a rough guide how we suggest you (a Linux application developer) pick the right directory to use:

  1. You need a place to put your socket (or other communication primitive) and your code runs privileged: use a subdirectory beneath /run. (Or beneath /var/run for extra compatibility.)
  2. You need a place to put your socket (or other communication primitive) and your code runs unprivileged: use a subdirectory beneath $XDG_RUNTIME_DIR.
  3. You need a place to put your larger downloads and downloads in progress and run unprivileged: use $XDG_DOWNLOAD_DIR.
  4. You need a place to put cache files which should be persistent and run unprivileged: use $XDG_CACHE_HOME.
  5. Nothing of the above applies and you need to place a small file that needs no persistency: use $TMPDIR with a fallback on /tmp. And use mkstemp(), and mkdtemp() and nothing homegrown.
  6. Otherwise use $TMPDIR with a fallback on /var/tmp. Also use mkstemp()/mkdtemp().

Note that these rules above are only suggested by us. These rules take into account everything we know about this topic and avoid problems with current and future distributions, as far as we can see them. Please consider updating your projects to follow these rules, and keep them in mind if you write new code.

One thing we'd like to stress is that /tmp and /var/tmp more often than not are actually not the right choice for your usecase. There are valid uses of these directories, but quite often another directory might actually be the better place. So, be careful, consider the other options, but if you do go for /tmp or /var/tmp then at least make sure to use mkstemp()/mkdtemp().

Thank you for your interest!

Oh, and if you now complain that we don't understand Unix, and that we are morons and worse, then please read this again, and you might notice that this is just a best practice guide, not a specification we have written. Nothing that introduces anything new, just something that explains how things are.

If you want to complain about the tmp-on-tmpfs or ServicesPrivateTmp feature, then this is not the right place either, because this blog post is not really about that. Please direct this to fedora-devel instead. Thank you very much.

Footnotes

[1] Well, or to turn this around: unless you have a PhD in advanced Unixology and are not using mkstemp()/mkdtemp() but use /tmp nonetheless it's very likely you are writing vulnerable code.


/etc/os-release

One of the new configuration files systemd introduced is /etc/os-release. It replaces the multitude of per-distribution release files[1] with a single one. Yesterday we decided to drop support for systems lacking /etc/os-release in systemd since recently the majority of the big distributions adopted /etc/os-release and many small ones did, too[2]. It's our hope that by dropping support for non-compliant distributions we gently put some pressure on the remaining hold-outs to adopt this scheme as well.

I'd like to take the opportunity to explain a bit what the new file offers, why application developers should care, and why the distributions should adopt it. Of course, this file is pretty much a triviality in many ways, but I guess it's still one that deserves explanation.

So, you ask why this all?

  • It relieves application developers who just want to know the distribution they are running on to check for a multitude of individual release files.
  • It provides both a "pretty" name (i.e. one to show to the user), and machine parsable version/OS identifiers (i.e. for use in build systems).
  • It is extensible, can easily learn new fields if needed. For example, since we want to print a welcome message in the color of your distribution at boot we make it possible to configure the ANSI color for that in the file.

FAQs

There's already the lsb_release tool for this, why don't you just use that? Well, it's a very strange interface: a shell script you have to invoke (and hence spawn asynchronously from your C code), and it's not written to be extensible. It's an optional package in many distributions, and nothing we'd be happy to invoke as part of early boot in order to show a welcome message. (In times with sub-second userspace boot times we really don't want to invoke a huge shell script for a triviality like showing the welcome message). The lsb_release tool to us appears to be an attempt of abstracting distribution checks, where standardization of distribution checks is needed. It's simply a badly designed interface. In our opinion, it has its use as an interface to determine the LSB version itself, but not for checking the distribution or version.

Why haven't you adopted one of the generic release files, such as Fedora's /etc/system-release? Well, they are much nicer than lsb_release, so much is true. However, they are not extensible and are not really parsable, if the distribution needs to be identified programmatically or a specific version needs to be verified.

Why didn't you call this file /etc/bikeshed instead? The name /etc/os-release sucks! In a way, I think you kind of answered your own question there already.

Does this mean my distribution can now drop our equivalent of /etc/fedora-release? Unlikely, too much code exists that still checks for the individual release files, and you probably shouldn't break that. This new file makes things easy for applications, not for distributions: applications can now rely on a single file only, and use it in a nice way. Distributions will have to continue to ship the old files unless they are willing to break compatibility here.

This is so useless! My application needs to be compatible with distros from 1998, so how could I ever make use of the new file? I will have to continue using the old ones! True, if you need compatibility with really old distributions you do. But for new code this might not be an issue, and in general new APIs are new APIs. So if you decide to depend on it, you add a dependency on it. However, even if you need to stay compatible it might make sense to check /etc/os-release first and just fall back to the old files if it doesn't exist. The least it does for you is that you don't need 25+ open() attempts on modern distributions, but just one.

You evil people are forcing my beloved distro $XYZ to adopt your awful systemd schemes. I hate you! You hate too much, my friend. Also, I am pretty sure it's not difficult to see the benefit of this new file independently of systemd, and it's truly useful on systems without systemd, too.

I hate what you people do, can I just ignore this? Well, you really need to work on your constant feelings of hate, my friend. But, to a certain degree yes, you can ignore this for a while longer. But already, there are a number of applications making use of this file. You lose compatibility with those. Also, you are kinda working towards the further balkanization of the Linux landscape, but maybe that's your intention?

You guys add a new file because you think there are already too many? You guys are so confused! None of the existing files is generic and extensible enough to do what we want it to do. Hence we had to introduce a new one. We acknowledge the irony, however.

The file is extensible? Awesome! I want a new field XYZ= in it! Sure, it's extensible, and we are happy if distributions extend it. Please prefix your keys with your distribution's name however. Or even better: talk to us and we might be able update the documentation and make your field standard, if you convince us that it makes sense.

Anyway, to summarize all this: if you work on an application that needs to identify the OS it is being built on or is being run on, please consider making use of this new file, we created it for you. If you work on a distribution, and your distribution doesn't support this file yet, please consider adopting this file, too.

If you are working on a small/embedded distribution, or a legacy-free distribution we encourage you to adopt only this file and not establish any other per-distro release file.

Read the documentation for /etc/os-release.

Footnotes

[1] Yes, multitude, there's at least: /etc/redhat-release, /etc/SuSE-release, /etc/debian_version, /etc/arch-release, /etc/gentoo-release, /etc/slackware-version, /etc/frugalware-release, /etc/altlinux-release, /etc/mandriva-release, /etc/meego-release, /etc/angstrom-version, /etc/mageia-release. And some distributions even have multiple, for example Fedora has already four different files.

[2] To our knowledge at least OpenSUSE, Fedora, ArchLinux, Angstrom, Frugalware have adopted this. (This list is not comprehensive, there are probably more.)


The Case for the /usr Merge

One of the features of Fedora 17 is the /usr merge, put forward by Harald Hoyer and Kay Sievers[1]. In the time since this feature has been proposed repetitive discussions took place all over the various Free Software communities, and usually the same questions were asked: what the reasons behind this feature were, and whether it makes sense to adopt the same scheme for distribution XYZ, too.

Especially in the Non-Fedora world it appears to be socially unacceptable to actually have a look at the Fedora feature page (where many of the questions are already brought up and answered) which is very unfortunate. To improve the situation I spent some time today to summarize the reasons for the /usr merge independently. I'd hence like to direct you to this new page I put up which tries to summarize the reasons for this, with an emphasis on the compatibility point of view:

The Case for the /usr Merge

Note that even though this page is in the systemd wiki, what it covers is mostly orthogonal to systemd. systemd supports both systems with a merged /usr and with a split /usr, and the /usr merge should be interesting for non-systemd distributions as well.

Primarily I put this together to have a nice place to point all those folks who continue to write me annoyed emails, even though I am actually not even working on all of this...

Enjoy the read!

Footnotes:

[1] And not actually by me, I am just a supportive spectator and am not doing any work on it. Unfortunately some tech press folks created the false impression I was behind this. But credit where credit is due, this is all Harald's and Kay's work.


Plumbers Wishlist, The Third Edition, a.k.a. "The Thank You Edition"

Last October we published a wishlist for plumbing related features we'd like to see added to the Linux kernel. Three months later it's time to publish a short update, and explain what has been implemented in the kernel, what people have started working on, and what's still missing.

The full, updated list is available on Google Docs.

In general, I must say that the list turned out to be a great success. It shows how awesome the Open Source community is: Just ask nicely and there's a good chance they'll fulfill your wishes! Thank you very much, Linux community!

We'd like to thank everybody who worked on any of the features on that list: Lucas De Marchi, Andi Kleen, Dan Ballard, Li Zefan, Kirill A. Shutemov, Davidlohr Bueso, Cong Wang, Lennart Poettering, Kay Sievers.

Of the items on the list 5 have been fully implemented and are already part of a released kernel, or already merged for inclusion for the next kernels being released.

For 4 further items patches have been posted, and I am hoping they'll get merged eventually. Davidlohr, Wang, Zefan, Kirill, it would be great if you'd continue working on your patches, as we think they are following the right approach[1] even if there was some opposition to them on LKML. So, please keep pushing to solve the outstanding issues and thanks for your work so far!

Footnotes

[1] Yes, I still believe that tmpfs quota should be implemented via resource limits, as everything else wouldn't work, as we don't want to implement complex and fragile userspace infrastructure to racily upload complex quota data for all current and future UIDs ever used on the system into each tmpfs mount point at mount time.


systemd for Administrators, Part XII

Here's the twelfth installment of my ongoing series on systemd for Administrators:

Securing Your Services

One of the core features of Unix systems is the idea of privilege separation between the different components of the OS. Many system services run under their own user IDs thus limiting what they can do, and hence the impact they may have on the OS in case they get exploited.

This kind of privilege separation only provides very basic protection however, since in general system services run this way can still do at least as much as a normal local users, though not as much as root. For security purposes it is however very interesting to limit even further what services can do, and shut them off a couple of things that normal users are allowed to do.

A great way to limit the impact of services is by employing MAC technologies such as SELinux. If you are interested to secure down your server, running SELinux is a very good idea. systemd enables developers and administrators to apply additional restrictions to local services independently of a MAC. Thus, regardless whether you are able to make use of SELinux you may still enforce certain security limits on your services.

In this iteration of the series we want to focus on a couple of these security features of systemd and how to make use of them in your services. These features take advantage of a couple of Linux-specific technologies that have been available in the kernel for a long time, but never have been exposed in a widely usable fashion. These systemd features have been designed to be as easy to use as possible, in order to make them attractive to administrators and upstream developers:

  • Isolating services from the network
  • Service-private /tmp
  • Making directories appear read-only or inaccessible to services
  • Taking away capabilities from services
  • Disallowing forking, limiting file creation for services
  • Controlling device node access of services

All options described here are documented in systemd's man pages, notably systemd.exec(5). Please consult these man pages for further details.

All these options are available on all systemd systems, regardless if SELinux or any other MAC is enabled, or not.

All these options are relatively cheap, so if in doubt use them. Even if you might think that your service doesn't write to /tmp and hence enabling PrivateTmp=yes (as described below) might not be necessary, due to today's complex software it's still beneficial to enable this feature, simply because libraries you link to (and plug-ins to those libraries) which you do not control might need temporary files after all. Example: you never know what kind of NSS module your local installation has enabled, and what that NSS module does with /tmp.

These options are hopefully interesting both for administrators to secure their local systems, and for upstream developers to ship their services secure by default. We strongly encourage upstream developers to consider using these options by default in their upstream service units. They are very easy to make use of and have major benefits for security.

Isolating Services from the Network

A very simple but powerful configuration option you may use in systemd service definitions is PrivateNetwork=:

...
[Service]
ExecStart=...
PrivateNetwork=yes
...

With this simple switch a service and all the processes it consists of are entirely disconnected from any kind of networking. Network interfaces became unavailable to the processes, the only one they'll see is the loopback device "lo", but it is isolated from the real host loopback. This is a very powerful protection from network attacks.

Caveat: Some services require the network to be operational. Of course, nobody would consider using PrivateNetwork=yes on a network-facing service such as Apache. However even for non-network-facing services network support might be necessary and not always obvious. Example: if the local system is configured for an LDAP-based user database doing glibc name lookups with calls such as getpwnam() might end up resulting in network access. That said, even in those cases it is more often than not OK to use PrivateNetwork=yes since user IDs of system service users are required to be resolvable even without any network around. That means as long as the only user IDs your service needs to resolve are below the magic 1000 boundary using PrivateNetwork=yes should be OK.

Internally, this feature makes use of network namespaces of the kernel. If enabled a new network namespace is opened and only the loopback device configured in it.

Service-Private /tmp

Another very simple but powerful configuration switch is PrivateTmp=:

...
[Service]
ExecStart=...
PrivateTmp=yes
...

If enabled this option will ensure that the /tmp directory the service will see is private and isolated from the host system's /tmp. /tmp traditionally has been a shared space for all local services and users. Over the years it has been a major source of security problems for a multitude of services. Symlink attacks and DoS vulnerabilities due to guessable /tmp temporary files are common. By isolating the service's /tmp from the rest of the host, such vulnerabilities become moot.

For Fedora 17 a feature has been accepted in order to enable this option across a large number of services.

Caveat: Some services actually misuse /tmp as a location for IPC sockets and other communication primitives, even though this is almost always a vulnerability (simply because if you use it for communication you need guessable names, and guessable names make your code vulnerable to DoS and symlink attacks) and /run is the much safer replacement for this, simply because it is not a location writable to unprivileged processes. For example, X11 places it's communication sockets below /tmp (which is actually secure -- though still not ideal -- in this exception since it does so in a safe subdirectory which is created at early boot.) Services which need to communicate via such communication primitives in /tmp are no candidates for PrivateTmp=. Thankfully these days only very few services misusing /tmp like this remain.

Internally, this feature makes use of file system namespaces of the kernel. If enabled a new file system namespace is opened inheritng most of the host hierarchy with the exception of /tmp.

Making Directories Appear Read-Only or Inaccessible to Services

With the ReadOnlyDirectories= and InaccessibleDirectories= options it is possible to make the specified directories inaccessible for writing resp. both reading and writing to the service:

...
[Service]
ExecStart=...
InaccessibleDirectories=/home
ReadOnlyDirectories=/var
...

With these two configuration lines the whole tree below /home becomes inaccessible to the service (i.e. the directory will appear empty and with 000 access mode), and the tree below /var becomes read-only.

Caveat: Note that ReadOnlyDirectories= currently is not recursively applied to submounts of the specified directories (i.e. mounts below /var in the example above stay writable). This is likely to get fixed soon.

Internally, this is also implemented based on file system namspaces.

Taking Away Capabilities From Services

Another very powerful security option in systemd is CapabilityBoundingSet= which allows to limit in a relatively fine grained fashion which kernel capabilities a service started retains:

...
[Service]
ExecStart=...
CapabilityBoundingSet=CAP_CHOWN CAP_KILL
...

In the example above only the CAP_CHOWN and CAP_KILL capabilities are retained by the service, and the service and any processes it might create have no chance to ever acquire any other capabilities again, not even via setuid binaries. The list of currently defined capabilities is available in capabilities(7). Unfortunately some of the defined capabilities are overly generic (such as CAP_SYS_ADMIN), however they are still a very useful tool, in particular for services that otherwise run with full root privileges.

To identify precisely which capabilities are necessary for a service to run cleanly is not always easy and requires a bit of testing. To simplify this process a bit, it is possible to blacklist certain capabilities that are definitely not needed instead of whitelisting all that might be needed. Example: the CAP_SYS_PTRACE is a particularly powerful and security relevant capability needed for the implementation of debuggers, since it allows introspecting and manipulating any local process on the system. A service like Apache obviously has no business in being a debugger for other processes, hence it is safe to remove the capability from it:

...
[Service]
ExecStart=...
CapabilityBoundingSet=~CAP_SYS_PTRACE
...

The ~ character the value assignment here is prefixed with inverts the meaning of the option: instead of listing all capabalities the service will retain you may list the ones it will not retain.

Caveat: Some services might react confused if certain capabilities are made unavailable to them. Thus when determining the right set of capabilities to keep around you need to do this carefully, and it might be a good idea to talk to the upstream maintainers since they should know best which operations a service might need to run successfully.

Caveat 2: Capabilities are not a magic wand. You probably want to combine them and use them in conjunction with other security options in order to make them truly useful.

To easily check which processes on your system retain which capabilities use the pscap tool from the libcap-ng-utils package.

Making use of systemd's CapabilityBoundingSet= option is often a simple, discoverable and cheap replacement for patching all system daemons individually to control the capability bounding set on their own.

Disallowing Forking, Limiting File Creation for Services

Resource Limits may be used to apply certain security limits on services being run. Primarily, resource limits are useful for resource control (as the name suggests...) not so much access control. However, two of them can be useful to disable certain OS features: RLIMIT_NPROC and RLIMIT_FSIZE may be used to disable forking and disable writing of any files with a size > 0:

...
[Service]
ExecStart=...
LimitNPROC=1
LimitFSIZE=0
...

Note that this will work only if the service in question drops privileges and runs under a (non-root) user ID of its own or drops the CAP_SYS_RESOURCE capability, for example via CapabilityBoundingSet= as discussed above. Without that a process could simply increase the resource limit again thus voiding any effect.

Caveat: LimitFSIZE= is pretty brutal. If the service attempts to write a file with a size > 0, it will immeidately be killed with the SIGXFSZ which unless caught terminates the process. Also, creating files with size 0 is still allowed, even if this option is used.

For more information on these and other resource limits, see setrlimit(2).

Controlling Device Node Access of Services

Devices nodes are an important interface to the kernel and its drivers. Since drivers tend to get much less testing and security checking than the core kernel they often are a major entry point for security hacks. systemd allows you to control access to devices individually for each service:

...
[Service]
ExecStart=...
DeviceAllow=/dev/null rw
...

This will limit access to /dev/null and only this device node, disallowing access to any other device nodes.

The feature is implemented on top of the devices cgroup controller.

Other Options

Besides the easy to use options above there are a number of other security relevant options available. However they usually require a bit of preparation in the service itself and hence are probably primarily useful for upstream developers. These options are RootDirectory= (to set up chroot() environments for a service) as well as User= and Group= to drop privileges to the specified user and group. These options are particularly useful to greatly simplify writing daemons, where all the complexities of securely dropping privileges can be left to systemd, and kept out of the daemons themselves.

If you are wondering why these options are not enabled by default: some of them simply break seamntics of traditional Unix, and to maintain compatibility we cannot enable them by default. e.g. since traditional Unix enforced that /tmp was a shared namespace, and processes could use it for IPC we cannot just go and turn that off globally, just because /tmp's role in IPC is now replaced by /run.

And that's it for now. If you are working on unit files for upstream or in your distribution, please consider using one or more of the options listed above. If you service is secure by default by taking advantage of these options this will help not only your users but also make the Internet a safer place.

© Lennart Poettering. Built using Pelican. Theme by Giulio Fidente on github. .