The Paper Committee of the Desktop Summit 2011, in Berlin, Germany is happy to announce that the conference programme is now published.
And yes, it is an absolutely rocking programme.
See you in Berlin!
The Linux Plumbers Conference 2011 in Santa Rosa, CA, USA is coming nearer (Sep. 7-9). Together with Kay Sievers I am running the Boot&Init track, and together with Mark Brown the Audio track.
For both tracks we still need proposals. So if you haven't submitted anything yet, please consider doing so -- and quickly. If you can arrange for it, last Sunday would be best, since that was actually the final deadline. However, the submission form is still open, so if you submit something really, really quickly we'll ignore the absence of time travel and the calendar for a bit. So, go, submit something. Now.
What are we looking for? Well, here's what I just posted on the audio related mailing lists:
So, please consider submitting something if you haven't done so yet. We are looking for all kinds of technical talks covering everything related to audio plumbing: audio drivers, audio APIs, sound servers, pro audio, consumer audio. If you can propose something audio related -- like talks on media controller routing, or on audio for ASoC/embedded -- submit something! If you care about low-latency audio, submit something. If you care about the Linux audio stack in general, submit something. LPC is probably the most relevant technical conference on the general Linux platform, so be sure that if you want your project, your work, your ideas to be heard then this is the right forum for everything related to the Linux stack. And the Audio track covers everything in our audio stack, regardless of whether it is pro or consumer audio.
And here's what I posted to the init related lists:
So, please consider submitting something if you haven't done so yet. We are looking for all kinds of technical talks covering everything from the BIOS (e.g. CoreBoot and friends), through boot loaders (e.g. GRUB and friends), to initramfs (e.g. Dracut and friends) and init systems (e.g. systemd and friends). If you have something smart to say about any of these areas, or maybe about related tools (e.g. you wrote a fancy new tool to measure boot performance), or fancy boot schemes in your favourite Linux-based OS (e.g. the new MeeGo zero second boot ;-)), then don't hesitate to submit something on the LPC web site, in the Boot&Init track!
And now, quickly, go to the LPC website and post your session proposal in the Audio or Boot&Init track, respectively! Thank you!
As some of you might know, Fedora 15 went Gold a couple of days ago. The first big distribution based on systemd will be released on 2011-05-24. Mark the date!
In little over a year systemd went from nowhere to becoming a core piece of Fedora. This would not have been possible without the numerous folks who worked with us on getting systemd right, supplied patches, chased bugs, tested releases, posted comments and generally made sure everything was in shape for the big release.
At this point we'd like to thank everybody who contributed and a few folks in particular:
A. Costa Adrian Spinu Alexey Shabalin Andreas Jaeger Andrew Edmunds Andrey Borzenkov Bill Nottingham Brandon Philips Brendan Jones Brett Witherspoon Chris E Ferron Christian Ruppert Conrad Meyer Daniel J Walsh Dave Reisner Eric Paris Fabian Henze Fabiano Fidêncio Florian Kriener Franz Dietrich Greg Kroah-Hartman Gustavo Sverzut Barbieri Harald Hoyer James Laska Jan Engelhardt Jeff Mahoney Jesse Zhang Jóhann B. Guðmundsson Karel Zak Koen Kooi Lucas De Marchi Ludwig Nussel Luis Felipe Strano Moraes Maarten Lankhorst Malcolm Studd Marc-Antoine Perennou Martin Mikkelsen Matthew Miller Matthias Clasen Matthias Schiffer Michael Biebl Michael Olbrich Michael Tremer Michał Piotrowski Michal Schmidt Mike Kazantsev Mike Kelly Miklos Vajna Milan Broz Ozan Çağlayan Paul Menzel Pavol Rusnak Rahul Sundaram Rainer Gerhards Ran Benita Ray Strode Robert Gerus Sedat Dilek Tero Roponen Thierry Reding Tollef Fog Heen Tomasz Torcz Tom Callaway Tom Gundersen Toshio Kuratomi William Jon McCann Wulf C. Krueger Zbigniew Jędrzejewski-Szmek
And everybody else who I (or git shortlog) forgot.
Thank you!
Lennart and Kay
BTW, the interface stability promise is valid now.
systemd not only brings improvements for administrators and users, it also brings a (small) number of new APIs with it. In this blog story (which might become the first of a series) I hope to shed some light on one of the most important new APIs in systemd:
In the original blog story about systemd I tried to explain why socket activation is a wonderful technology to spawn services. Let's reiterate the background here a bit.
The basic idea of socket activation is not new. The inetd superserver has been a standard component of most Linux and Unix systems since time began: instead of spawning all local Internet services at boot, the superserver would listen on behalf of the services, and whenever a connection came in an instance of the respective service would be spawned. This allowed relatively weak machines with few resources to offer a big variety of services at the same time. However, it quickly got a reputation for being somewhat slow: since daemons were spawned for each incoming connection, a lot of time was spent on forking and initializing the services -- once for each connection, instead of once for all of them.
Spawning one instance per connection was how inetd was primarily used, even though inetd actually understood another mode: on the first incoming connection it would notice this via poll() (or select()) and spawn a single instance for all future connections. (This was controllable with the wait/nowait options.) That way the first connection would be slow to set up, but subsequent ones would be as fast as with a standalone service. In this mode inetd would work in a true on-demand mode: a service would be made available lazily when it was required.
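For reference, the mode was chosen per service in inetd.conf. Here is an illustrative sketch of such a fragment (the service choices and binary paths are just examples, they vary per system):

```
# one short-lived instance per incoming connection ("nowait"):
ftp   stream  tcp  nowait  root  /usr/sbin/in.ftpd   in.ftpd
# a single instance inherits the listening socket and serves
# all future traffic ("wait"):
talk  dgram   udp  wait    root  /usr/sbin/in.talkd  in.talkd
```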
inetd's focus was clearly on AF_INET (i.e. Internet) sockets. As time progressed and Linux/Unix left the server niche and became increasingly relevant on desktops, mobile and embedded environments, inetd was somewhat lost in the sands of time. Its reputation for being slow, and the fact that Linux's focus shifted away from pure Internet servers, made a Linux machine running inetd (or one of its newer implementations, like xinetd) the exception, not the rule.
When Apple engineers worked on optimizing the MacOS boot time they found a new way to make use of the idea of socket activation: they shifted the focus away from AF_INET sockets towards AF_UNIX sockets. And they noticed that on-demand socket activation was only part of the story: socket activation is much more powerful when used for all local services, including those which need to be started at boot anyway. They implemented these ideas in launchd, a central building block of modern MacOS X systems, and probably the main reason why MacOS boots so fast.
But, before we continue, let's have a closer look at what the benefits of socket activation are in detail for non-on-demand, non-Internet services. Consider the four services Syslog, D-Bus, Avahi and the Bluetooth daemon. D-Bus logs to Syslog, hence on traditional Linux systems it would get started after Syslog. Similarly, Avahi requires Syslog and D-Bus, hence would get started after both. Finally, Bluetooth is similar to Avahi and also requires Syslog and D-Bus, but does not interface at all with Avahi. Since on a traditional SysV-based system only one service can be in the process of getting started at a time, the following serialization of startup would take place: Syslog → D-Bus → Avahi → Bluetooth. (Of course, Avahi and Bluetooth could be started in the opposite order too, but we have to pick one here, so let's simply go alphabetically.) To illustrate this, here's a plot showing the order of startup beginning with system startup (at the top).
Certain distributions tried to improve on this strictly serialized start-up: since Avahi and Bluetooth are independent of each other, they can be started simultaneously. The parallelization is increased and the overall startup time slightly reduced. (This is visualized in the middle part of the plot.)
Socket activation makes it possible to start all four services completely simultaneously, without any kind of ordering. Since the creation of the listening sockets is moved outside of the daemons themselves we can start them all at the same time, and they are able to connect to each other's sockets right away. I.e. in a single step the /dev/log and /run/dbus/system_bus_socket sockets are created, and in the next step all four services are spawned simultaneously. When D-Bus then wants to log to syslog, it just writes its messages to /dev/log. As long as the socket buffer does not fill up it can immediately continue with whatever else it needs to do for initialization. As soon as the syslog service catches up it will process the queued messages. And if the socket buffer does fill up, the logging client will block temporarily until the socket is writable again, and continue the moment it can write its log messages. That means the scheduling of our services is entirely done by the kernel: from the userspace perspective all services run at the same time, and when one service cannot keep up, the others needing it will temporarily block on their requests but continue as soon as those requests are dispatched. All of this is completely automatic and invisible to userspace.

Socket activation hence allows us to drastically parallelize start-up, enabling simultaneous start-up of services which previously were thought to strictly require serialization. Most Linux services use sockets as their communication channel; socket activation allows starting the clients and servers of these channels at the same time.
But it's not just about parallelization. It offers a number of other benefits:
For another explanation of this idea consult the original blog story about systemd.
Socket activation has been available in systemd since its inception. On Fedora 15 a number of services have been modified to implement socket activation, including Avahi, D-Bus and rsyslog (to continue with the example above).
systemd's socket activation is quite comprehensive. Not only are classic sockets supported, but related technologies as well:
A service capable of socket activation must be able to receive its preinitialized sockets from systemd, instead of creating them internally. For most services this requires (minimal) patching. However, since systemd actually provides inetd compatibility a service working with inetd will also work with systemd -- which is quite useful for services like sshd for example.
So much for the background of socket activation; let's now have a look at how to patch a service to make it socket-activatable. Let's start with a theoretical service, foobard. (In a later blog post we'll focus on a real-life example.)
Our little (theoretical) service includes code like the following for creating its sockets (most services include code like this in one way or another):
```c
/* Source Code Example #1: ORIGINAL, NOT SOCKET-ACTIVATABLE SERVICE */

...
union {
        struct sockaddr sa;
        struct sockaddr_un un;
} sa;
int fd;

fd = socket(AF_UNIX, SOCK_STREAM, 0);
if (fd < 0) {
        fprintf(stderr, "socket(): %m\n");
        exit(1);
}

memset(&sa, 0, sizeof(sa));
sa.un.sun_family = AF_UNIX;
strncpy(sa.un.sun_path, "/run/foobar.sk", sizeof(sa.un.sun_path));

if (bind(fd, &sa.sa, sizeof(sa)) < 0) {
        fprintf(stderr, "bind(): %m\n");
        exit(1);
}

if (listen(fd, SOMAXCONN) < 0) {
        fprintf(stderr, "listen(): %m\n");
        exit(1);
}
...
```
A socket activatable service may use the following code instead:
```c
/* Source Code Example #2: UPDATED, SOCKET-ACTIVATABLE SERVICE */

...
#include "sd-daemon.h"
...
int fd;

if (sd_listen_fds(0) != 1) {
        fprintf(stderr, "No or too many file descriptors received.\n");
        exit(1);
}

fd = SD_LISTEN_FDS_START + 0;
...
```
systemd might pass you more than one socket (based on configuration, see below). In this example we are interested in one only. sd_listen_fds() returns how many file descriptors are passed. We simply compare that with 1, and fail if we got more or less. The file descriptors systemd passes to us are inherited one after the other beginning with fd #3. (SD_LISTEN_FDS_START is a macro defined to 3). Our code hence just takes possession of fd #3.
As you can see this code is actually much shorter than the original. This of course comes at the price that our little service with this change will no longer work in a non-socket-activation environment. With minimal changes we can adapt our example to work nicely both with and without socket activation:
```c
/* Source Code Example #3: UPDATED, SOCKET-ACTIVATABLE SERVICE WITH COMPATIBILITY */

...
#include "sd-daemon.h"
...
int fd, n;

n = sd_listen_fds(0);
if (n > 1) {
        fprintf(stderr, "Too many file descriptors received.\n");
        exit(1);
} else if (n == 1)
        fd = SD_LISTEN_FDS_START + 0;
else {
        union {
                struct sockaddr sa;
                struct sockaddr_un un;
        } sa;

        fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0) {
                fprintf(stderr, "socket(): %m\n");
                exit(1);
        }

        memset(&sa, 0, sizeof(sa));
        sa.un.sun_family = AF_UNIX;
        strncpy(sa.un.sun_path, "/run/foobar.sk", sizeof(sa.un.sun_path));

        if (bind(fd, &sa.sa, sizeof(sa)) < 0) {
                fprintf(stderr, "bind(): %m\n");
                exit(1);
        }

        if (listen(fd, SOMAXCONN) < 0) {
                fprintf(stderr, "listen(): %m\n");
                exit(1);
        }
}
...
```
With this simple change our service can now make use of socket activation but still works unmodified in classic environments. Now, let's see how we can enable this service in systemd. For this we have to write two systemd unit files: one describing the socket, the other describing the service. First, here's foobar.socket:
```
[Socket]
ListenStream=/run/foobar.sk

[Install]
WantedBy=sockets.target
```
And here's the matching service file foobar.service:
```
[Service]
ExecStart=/usr/bin/foobard
```
If we place these two files in /etc/systemd/system we can enable and start them:
```
# systemctl enable foobar.socket
# systemctl start foobar.socket
```
Now our little socket is listening, but our service is not running yet. If we connect to /run/foobar.sk the service will be spawned automatically: on-demand service start-up. With a small modification to foobar.service we can instead start the service at boot as well, using socket activation purely for parallelization rather than for on-demand auto-spawning:
```
[Service]
ExecStart=/usr/bin/foobard

[Install]
WantedBy=multi-user.target
```
And now let's enable this too:
```
# systemctl enable foobar.service
# systemctl start foobar.service
```
Now our little daemon will be started at boot or on demand, whichever comes first. It can be started fully in parallel with its clients, and when it dies it will be automatically restarted the next time it is used.
A single .socket file can include multiple ListenXXX stanzas, which is useful for services that listen on more than one socket. In this case all configured sockets will be passed to the service in the exact order they are configured in the socket unit file. Also, you may configure various socket settings in the .socket files.
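For example, a hypothetical socket unit that listens both on a local socket and a TCP port (the port number here is made up for the example) would pass two descriptors, in this order, to the service:

```
[Socket]
# becomes fd #3 (SD_LISTEN_FDS_START) in the service
ListenStream=/run/foobar.sk
# becomes fd #4
ListenStream=6666

[Install]
WantedBy=sockets.target
```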
In real life it's a good idea to include description strings in these unit files; to keep things simple we leave them out of our example. Speaking of real life: our next installment will cover an actual real-life example. We'll add socket activation to the CUPS printing server.
The sd_listen_fds() function call is defined in sd-daemon.h and sd-daemon.c. These two files are currently drop-in .c sources which projects should simply copy into their source tree. Eventually we plan to turn this into a proper shared library; however, using the drop-in files allows you to compile your project in a way that is compatible with socket activation even without any compile-time dependencies on systemd. sd-daemon.c is liberally licensed, should compile fine on the most exotic Unixes, and the algorithms are trivial enough to be reimplemented with very little code if the license should nonetheless be a problem for your project. Besides sd_listen_fds(), sd-daemon.c contains a couple of other API functions that are useful when implementing socket activation in a project. For example, there's sd_is_socket(), which can be used to distinguish and identify particular sockets when a service gets passed more than one.
Let me point out that the interfaces used here are in no way bound directly to systemd. They are generic enough to be implemented in other systems as well. We deliberately designed them to be as simple and minimal as possible to make it easy for others to adopt similar schemes.
Stay tuned for the next installment. As mentioned, it will cover a real-life example of turning an existing daemon into a socket-activatable one: the CUPS printing service. However, I hope this blog story might already be enough to get you started if you plan to convert an existing service into a socket activatable one. We invite everybody to convert upstream projects to this scheme. If you have any questions join us on #systemd on freenode.
Pablo Hess has been posting a series of articles on systemd on IBM DeveloperWorks Brasil. So, if you speak Portuguese, head over there and have a look!
D. Jansen has put up a blog story including some power saving results when running PulseAudio on modern HDA drivers. This shows off some work Pierre-Louis Bossart from Intel did on the HDA drivers which now enables the timer-based scheduling code in PulseAudio I added quite some time ago to come to its full potential. You can save half a Watt and reduce wakeups while playing audio to 1 wakeup/s.
Previously there was little public profiling data available about the benefits PA brings to low-power devices. Thanks to Dennis' data there's now public data available that hopefully explains why PA is the best choice for low-power devices as well as desktops. Hopefully this clears up some misconceptions.
Pierre-Louis, thanks for your work!
Sankarasivasubramanian Pasupathilingam has put together a PDF of my ongoing systemd for Administrators series. This might be handy for reading on an ebook reader or similar.
Enjoy!
systemd is still a young project, but it is not a baby anymore. The initial announcement was posted precisely a year ago. Since then most of the big distributions have decided to adopt it in one way or another, and many smaller distributions have already switched. The first big distribution with systemd by default will be Fedora 15, due at the end of May. It is expected that the others will follow the lead a bit later (with one exception). Many embedded developers have already adopted it too, and there's even a company specializing in engineering and consulting services for systemd. In short: within one year systemd became a really successful project.
However, there are still folks we haven't won over yet. If you fall into one of the following categories, please have a look at the comparison of init systems below:
And even if you don't fall into any of these categories, you might still find the comparison interesting.
We'll be comparing the three most relevant init systems for Linux: sysvinit, Upstart and systemd. Of course there are other init systems in existence, but they play virtually no role in the big picture. Unless you run Android (which is a completely different beast anyway), you'll almost definitely run one of these three init systems on your Linux kernel. (OK, or busybox, but then you are basically not running any init system at all.) Unless you have a soft spot for exotic init systems there's little need to look further. Also, I am kinda lazy, and don't want to spend the time on analyzing those other systems in enough detail to be completely fair to them.
Speaking of fairness: I am of course one of the creators of systemd. I will try my best to be fair to the other two contenders, but in the end, take it with a grain of salt. I am sure though that should I be grossly unfair or otherwise incorrect somebody will point it out in the comments of this story, so consider having a look at those before you put too much trust in what I say.
We'll look at the currently implemented features in a released version. Grand plans don't count.
| | sysvinit | Upstart | systemd |
|---|---|---|---|
| Interfacing via D-Bus | no | yes | yes |
| Shell-free bootup | no | no | yes |
| Modular C coded early boot services included | no | no | yes |
| Read-Ahead | no | no[1] | yes |
| Socket-based Activation | no | no[2] | yes |
| Socket-based Activation: inetd compatibility | no | no[2] | yes |
| Bus-based Activation | no | no[3] | yes |
| Device-based Activation | no | no[4] | yes |
| Configuration of device dependencies with udev rules | no | no | yes |
| Path-based Activation (inotify) | no | no | yes |
| Timer-based Activation | no | no | yes |
| Mount handling | no | no[5] | yes |
| fsck handling | no | no[5] | yes |
| Quota handling | no | no | yes |
| Automount handling | no | no | yes |
| Swap handling | no | no | yes |
| Snapshotting of system state | no | no | yes |
| XDG_RUNTIME_DIR Support | no | no | yes |
| Optionally kills remaining processes of users logging out | no | no | yes |
| Linux Control Groups Integration | no | no | yes |
| Audit record generation for started services | no | no | yes |
| SELinux integration | no | no | yes |
| PAM integration | no | no | yes |
| Encrypted hard disk handling (LUKS) | no | no | yes |
| SSL Certificate/LUKS Password handling, including Plymouth, Console, wall(1), TTY and GNOME agents | no | no | yes |
| Network Loopback device handling | no | no | yes |
| binfmt_misc handling | no | no | yes |
| System-wide locale handling | no | no | yes |
| Console and keyboard setup | no | no | yes |
| Infrastructure for creating, removing, cleaning up of temporary and volatile files | no | no | yes |
| Handling for /proc/sys sysctl | no | no | yes |
| Plymouth integration | no | yes | yes |
| Save/restore random seed | no | no | yes |
| Static loading of kernel modules | no | no | yes |
| Automatic serial console handling | no | no | yes |
| Unique Machine ID handling | no | no | yes |
| Dynamic host name and machine meta data handling | no | no | yes |
| Reliable termination of services | no | no | yes |
| Early boot /dev/log logging | no | no | yes |
| Minimal kmsg-based syslog daemon for embedded use | no | no | yes |
| Respawning on service crash without losing connectivity | no | no | yes |
| Gapless service upgrades | no | no | yes |
| Graphical UI | no | no | yes |
| Built-In Profiling and Tools | no | no | yes |
| Instantiated services | no | yes | yes |
| PolicyKit integration | no | no | yes |
| Remote access/Cluster support built into client tools | no | no | yes |
| Can list all processes of a service | no | no | yes |
| Can identify service of a process | no | no | yes |
| Automatic per-service CPU cgroups to even out CPU usage between them | no | no | yes |
| Automatic per-user cgroups | no | no | yes |
| SysV compatibility | yes | yes | yes |
| SysV services controllable like native services | yes | no | yes |
| SysV-compatible /dev/initctl | yes | no | yes |
| Reexecution with full serialization of state | yes | no | yes |
| Interactive boot-up | no[6] | no[6] | yes |
| Container support (as advanced chroot() replacement) | no | no | yes |
| Dependency-based bootup | no[7] | no | yes |
| Disabling of services without editing files | yes | no | yes |
| Masking of services without editing files | no | no | yes |
| Robust system shutdown within PID 1 | no | no | yes |
| Built-in kexec support | no | no | yes |
| Dynamic service generation | no | no | yes |
| Upstream support in various other OS components | yes | no | yes |
| Service files compatible between distributions | no | no | yes |
| Signal delivery to services | no | no | yes |
| Reliable termination of user sessions before shutdown | no | no | yes |
| utmp/wtmp support | yes | yes | yes |
| Easily writable, extensible and parseable service files, suitable for manipulation with enterprise management tools | no | no | yes |
[1] Read-Ahead implementation for Upstart available in separate package ureadahead, requires non-standard kernel patch.
[2] Socket activation implementation for Upstart available as preview, lacks parallelization support hence entirely misses the point of socket activation.
[3] Bus activation implementation for Upstart posted as patch, not merged.
[4] udev device event bridge implementation for Upstart available as preview, forwards entire udev database into Upstart, not practical.
[5] Mount handling utility mountall for Upstart available in separate package, covers only boot-time mounts, very limited dependency system.
[6] Some distributions offer this implemented in shell.
[7] LSB init scripts support this, if they are used.
| | sysvinit | Upstart | systemd |
|---|---|---|---|
| OOM Adjustment | no | yes[1] | yes |
| Working Directory | no | yes | yes |
| Root Directory (chroot()) | no | yes | yes |
| Environment Variables | no | yes | yes |
| Environment Variables from external file | no | no | yes |
| Resource Limits | no | some[2] | yes |
| umask | no | yes | yes |
| User/Group/Supplementary Groups | no | no | yes |
| IO Scheduling Class/Priority | no | no | yes |
| CPU Scheduling Nice Value | no | yes | yes |
| CPU Scheduling Policy/Priority | no | no | yes |
| CPU Scheduling Reset on fork() control | no | no | yes |
| CPU affinity | no | no | yes |
| Timer Slack | no | no | yes |
| Capabilities Control | no | no | yes |
| Secure Bits Control | no | no | yes |
| Control Group Control | no | no | yes |
| High-level file system namespace control: making directories inaccessible | no | no | yes |
| High-level file system namespace control: making directories read-only | no | no | yes |
| High-level file system namespace control: private /tmp | no | no | yes |
| High-level file system namespace control: mount inheritance | no | no | yes |
| Input on Console | yes | yes | yes |
| Output on Syslog | no | no | yes |
| Output on kmsg/dmesg | no | no | yes |
| Output on arbitrary TTY | no | no | yes |
| Kill signal control | no | no | yes |
| Conditional execution: by identified CPU virtualization/container | no | no | yes |
| Conditional execution: by file existence | no | no | yes |
| Conditional execution: by security framework | no | no | yes |
| Conditional execution: by kernel command line | no | no | yes |
[1] Upstart supports only the deprecated oom_adj mechanism, not the current oom_score_adj logic.
[2] Upstart lacks support for RLIMIT_RTTIME and RLIMIT_RTPRIO.
Note that some of these options are relatively easy to add to SysV init scripts by editing the shell sources. The table above focuses on easily accessible options that do not require source code editing.
| | sysvinit | Upstart | systemd |
|---|---|---|---|
| Maturity | > 15 years | 6 years | 1 year |
| Specialized professional consulting and engineering services available | no | no | yes |
| SCM | Subversion | Bazaar | git |
| Copyright-assignment-free contributing | yes | no | yes |
As the tables above hopefully make clear, systemd has left both sysvinit and Upstart behind in almost every respect. With the exception of the project's age/maturity, systemd wins in every category. At this point in time it will be very hard for sysvinit and Upstart to catch up with the features systemd provides today. In one year we managed to push systemd much further than Upstart has been pushed in six.
It is our intention to drive forward the development of the Linux platform with systemd. In the next release cycle we will focus more strongly on providing the same features and speed improvements we already offer for the system to the user login session. This will bring much closer integration with the other parts of the OS and applications, making the most of the features the service manager provides, and making them available to login sessions. Certain components such as ConsoleKit will be made redundant by these upgrades, and services relying on them will be updated. The burden of maintaining these then-obsolete components will be passed on to the vendors who plan to continue relying on them.
If you are wondering whether or not to adopt systemd: when it comes to mere features, systemd obviously wins. Of course that should not be the only aspect to keep in mind. In the long run, sticking with the existing infrastructure (such as ConsoleKit) comes at a price: porting work needs to take place, and additional maintenance work for bitrotting code needs to be done. Going it alone means increased workload.
That said, adopting systemd is also not free. Especially if you made investments in the other two solutions adopting systemd means work. The basic work to adopt systemd is relatively minimal for porting over SysV systems (since compatibility is provided), but can mean substantial work when coming from Upstart. If you plan to go for a 100% systemd system without any SysV compatibility (recommended for embedded, long run goal for the big distributions) you need to be willing to invest some work to rewrite init scripts as simple systemd unit files.
systemd is in the process of becoming a comprehensive, integrated and modular platform providing everything needed to bootstrap and maintain an operating system's userspace. It includes C rewrites of all basic early boot init scripts that are shipped with the various distributions. Especially for the embedded case adopting systemd provides you in one step with almost everything you need, and you can pick the modules you want. The other two init systems are singular individual components, which to be useful need a great number of additional components with differing interfaces. The emphasis of systemd to provide a platform instead of just a component allows for closer integration, and cleaner APIs. Sooner or later this will trickle up to the applications. Already, there are accepted XDG specifications (e.g. XDG basedir spec, more specifically XDG_RUNTIME_DIR) that are not supported on the other init systems.
systemd is also a big opportunity for Linux standardization. Since it standardizes many interfaces of the system that previously differed on every distribution and every implementation, adopting it helps to work against the balkanization of the Linux interfaces. Choosing systemd means redefining more closely what the Linux platform is about. This improves the lives of programmers, users and administrators alike.
I believe that momentum is clearly with systemd. We invite you to join our community and be part of that momentum.
Another episode of my ongoing series on systemd for Administrators:
One of the formidable new features of systemd is that it comes with a complete set of modular early-boot services that are written in simple, fast, parallelizable and robust C, replacing the shell "novels" the various distributions featured before. Our little Project Zero Shell[1] has been a full success. We currently cover pretty much everything most desktop and embedded distributions should need, plus a big part of the server needs:
On a standard Fedora 15 install, only a few legacy and storage services still require shell scripts during early boot. If you don't need those, you can easily disable them and enjoy your shell-free boot (like I do every day). The shell-less boot systemd offers is a unique feature on Linux.
Many of these small components are configured via configuration files in /etc. Some of these are fairly standardized among distributions and hence supporting them in the C implementations was easy and obvious. Examples include: /etc/fstab, /etc/crypttab or /etc/sysctl.conf. However, for others no standardized file or directory existed, which forced us to add #ifdef orgies to our sources to deal with the different places the distributions we want to support store these things. All these configuration files have in common that they are dead simple and there is simply no good reason for distributions to distinguish themselves with them: they all do the very same thing, just a bit differently.
To improve the situation and benefit from the unifying force that systemd is, we thus decided to read the per-distribution configuration files only as fallbacks -- and to introduce new configuration files as the primary source of configuration wherever applicable. Of course, where possible these standardized configuration files should not be new inventions but rather just standardizations of the best distribution-specific configuration files previously used. Here's a little overview of these new common configuration files systemd supports on all distributions:
It is our definite intention to convince you to use these new configuration files in your configuration tools: if your configuration frontend writes these files instead of the old ones, it automatically becomes more portable between Linux distributions, and you are helping to standardize Linux. This makes things simpler to understand and more obvious for users and administrators. Of course, right now, only systemd-based distributions read these files, but that already covers all important distributions in one way or another, except for one. And it's a bit of a chicken-and-egg problem: a standard becomes a standard by being used. In order to gently push everybody to standardize on these files we also want to make clear that sooner or later we plan to drop the fallback support for the old configuration files from systemd. That means adoption of this new scheme can happen slowly and piece by piece. But the final goal of having only one set of configuration files must be clear.
Many of these configuration files are relevant not only for configuration tools but also (and sometimes even primarily) for upstream projects. For example, we invite projects like Mono, Java, or WINE to install a drop-in file in /etc/binfmt.d/ from their upstream build systems. Per-distribution downstream support for binary formats would then no longer be necessary and your platform would work the same on all distributions. Something similar applies to all software which needs creation/cleaning of certain runtime files and directories at boot, for example beneath the /run hierarchy (i.e. /var/run as it used to be known). These projects should just drop configuration files into /etc/tmpfiles.d, also from their upstream build systems. This also helps speed up the boot process, as separate per-project SysV shell scripts which implement trivial things like registering a binary format or removing/creating temporary/volatile files at boot are no longer necessary. Or another example, where upstream support would be fantastic: projects like X11 could probably benefit from reading the default keyboard mapping for their displays from /etc/vconsole.conf.
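As a rough sketch of what such drop-ins could look like: the file names and the exact registration line below are illustrative, and the files are written to a scratch directory so the example is safe to run (the real files would live in /etc/binfmt.d/ and /etc/tmpfiles.d/):

```shell
# Illustrative drop-in files; the real ones would go to /etc/binfmt.d/
# and /etc/tmpfiles.d/. We use a scratch directory to stay safe.
confdir=$(mktemp -d)

# binfmt.d files use the kernel's binfmt_misc registration syntax:
# :name:type:offset:magic:mask:interpreter:flags
cat > "$confdir/mono.conf" <<'EOF'
:CLI:M::MZ::/usr/bin/mono:
EOF

# tmpfiles.d lines are: type path mode user group age
# "d" asks for a directory to be created at boot:
cat > "$confdir/myproject.conf" <<'EOF'
d /run/myproject 0755 root root -
EOF

cat "$confdir"/*.conf
```

With files like these shipped upstream, no boot-time shell script is needed to register the binary format or create the runtime directory.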
Of course, I have no doubt that not everybody is happy with our choice of names (and formats) for these configuration files. In the end we had to pick something, and from all the choices these appeared to be the most convincing. The file formats are as simple as they can be, and usually easily written and read even from shell scripts. That said, /etc/bikeshed.conf could of course also have been a fantastic configuration file name!
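To illustrate that shell-friendliness: many of these files are plain KEY=value assignments, so a shell script can read them by simply sourcing them. A little sketch using the keys of /etc/vconsole.conf, run against a scratch copy rather than the real file:

```shell
# /etc/vconsole.conf-style file: plain KEY=value lines, so sourcing
# it from shell is all it takes to read it. Scratch copy for safety:
conf=$(mktemp)
cat > "$conf" <<'EOF'
KEYMAP=de-latin1
FONT=latarcyrheb-sun16
EOF

# Source the file; the assignments become shell variables:
. "$conf"
echo "keymap=$KEYMAP font=$FONT"
```

This prints `keymap=de-latin1 font=latarcyrheb-sun16` -- no parser beyond the shell itself is required.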
So, help us standardize Linux! Use the new configuration files! Adopt them upstream, adopt them downstream, adopt them all across the distributions!
Oh, and in case you are wondering: yes, all of these files were discussed in one way or another with various folks from the various distributions. And there has even been some push towards supporting some of these files even outside of systemd systems.
Footnotes
[1] Our slogan: "The only shell that should get started during boot is gnome-shell!" -- Yes, the slogan needs a bit of work, but you get the idea.
Here's yet another installment of my ongoing series on systemd for Administrators:
Fedora 15[1] is the first Fedora release to sport systemd. Our primary goal for F15 was to get everything integrated and working well. One focus for Fedora 16 will be to further polish and speed up what we have in the distribution now. To prepare for this cycle we have implemented a few tools (which are already available in F15), which can help us pinpoint where exactly the biggest problems in our boot-up remain. With this blog story I hope to shed some light on how to figure out what to blame for your slow boot-up, and what to do about it. We want to allow you to put the blame where the blame belongs: on the system component responsible.
The first utility is a very simple one: when it has finished booting up, systemd automatically writes a log message with the startup time to syslog/kmsg:
systemd[1]: Startup finished in 2s 65ms 924us (kernel) + 2s 828ms 195us (initrd) + 11s 900ms 471us (userspace) = 16s 794ms 590us.
And here's how you read this: 2s have been spent on kernel initialization, up to the point where the initial RAM disk (initrd, i.e. dracut) was started. A bit less than 3s have then been spent in the initrd. Finally, a bit less than 12s have been spent after the actual system init daemon (systemd) was invoked by the initrd to bring up userspace. Summing this up: the time that passed from the boot loader jumping into the kernel code until systemd had finished doing everything it needed to do at boot was a bit less than 17s. This number is nice and simple to understand -- and also easy to misunderstand: it does not include the time that is spent initializing your GNOME session, as that is outside the scope of the init system. Also, in many cases this is just when systemd finished doing everything it needed to do. Very likely some daemons are still busy doing whatever they need to do to finish startup when this time has elapsed. Hence: while the time logged here is a good indication of the general boot speed, it is not the time the user might feel the boot actually takes.
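Since the message goes to syslog/kmsg, it is easy to fish out and dissect with standard shell tools. A small sketch, using the sample line quoted above as input (on a live system something like `dmesg | grep "Startup finished"` would find it):

```shell
# The startup summary as captured from syslog/kmsg (sample from above):
line='systemd[1]: Startup finished in 2s 65ms 924us (kernel) + 2s 828ms 195us (initrd) + 11s 900ms 471us (userspace) = 16s 794ms 590us.'

# Everything after the "=" is the total wall-clock boot time:
total=${line##*= }
echo "total boot time: ${total%.}"
```

This prints `total boot time: 16s 794ms 590us`, the same total figure discussed above.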
Also, it is a pretty superficial value: it gives no insight into which system component systemd was waiting for all that time. To break this up, we introduced the tool systemd-analyze blame:
$ systemd-analyze blame
  6207ms udev-settle.service
  5228ms cryptsetup@luks\x2d9899b85d\x2df790\x2d4d2a\x2da650\x2d8b7d2fb92cc3.service
   735ms NetworkManager.service
   642ms avahi-daemon.service
   600ms abrtd.service
   517ms rtkit-daemon.service
   478ms fedora-storage-init.service
   396ms dbus.service
   390ms rpcidmapd.service
   346ms systemd-tmpfiles-setup.service
   322ms fedora-sysinit-unhack.service
   316ms cups.service
   310ms console-kit-log-system-start.service
   309ms libvirtd.service
   303ms rpcbind.service
   298ms ksmtuned.service
   288ms lvm2-monitor.service
   281ms rpcgssd.service
   277ms sshd.service
   276ms livesys.service
   267ms iscsid.service
   236ms mdmonitor.service
   234ms nfslock.service
   223ms ksm.service
   218ms mcelog.service
...
This tool lists which systemd unit needed how much time to finish initialization at boot, the worst offenders listed first. What we can see here is that on this boot two services required more than 1s of boot time: udev-settle.service and cryptsetup@luks\x2d9899b85d\x2df790\x2d4d2a\x2da650\x2d8b7d2fb92cc3.service. This tool's output is easily misunderstood as well: it does not shed any light on why the services in question actually needed this much time, it just determines that they did. Also note that the times listed here may have been spent "in parallel", i.e. two services might have been initializing at the same time, so that the time spent initializing them both is much less than the sum of the two individual times.
Let's have a closer look at the worst offender on this boot: a service by the name of udev-settle.service. So why does it take that much time to initialize, and what can we do about it? This service actually does very little: it just waits for the device probing done by udev to finish and then exits. Device probing can be slow. In this instance, for example, the reason the device probing takes more than 6s is the 3G modem built into the machine, which, when no SIM card is inserted, takes this long to respond to software probe requests. The software probing is part of the logic that makes ModemManager work and enables NetworkManager to offer easy 3G setup. An obvious reflex might now be to blame ModemManager for having such a slow prober. But that's ill-directed: hardware probing quite frequently is this slow, and in the case of ModemManager it's a simple fact that the 3G hardware takes this long. It is an essential requirement for a proper hardware probing solution that individual probers may take this much time to finish probing. The actual culprit is something else: the fact that we actually wait for the probing, in other words: that udev-settle.service is part of our boot process at all.
So, why is udev-settle.service part of our boot process? Well, it actually doesn't need to be. It is pulled in by the storage setup logic of Fedora: to be precise, by the LVM, RAID and Multipath setup scripts. These storage services have not been implemented for the way hardware detection and probing work today: they expect to be initialized at a point in time when "all devices have been probed", so that they can simply iterate through the list of available disks and do their work on it. However, on modern machinery this is not how things actually work: hardware can come and hardware can go all the time, during boot and during runtime. For some technologies it is not even possible to know when device enumeration is complete (examples: USB, or iSCSI), thus waiting for all storage devices to show up and be probed must necessarily include a fixed delay, after which it is assumed that all devices that can show up have shown up and been probed. In this case all this shows very negatively in the boot time: the storage scripts force us to delay bootup until all potential devices have shown up and all the devices that did have been probed -- and all that even though we don't actually need most devices for anything. In particular since this machine actually does not make use of LVM, RAID or Multipath![2]
Knowing what we know now we can go and disable udev-settle.service for the next boots: since neither LVM, RAID nor Multipath is used we can mask the services in question and thus speed up our boot a little:
# ln -s /dev/null /etc/systemd/system/udev-settle.service
# ln -s /dev/null /etc/systemd/system/fedora-wait-storage.service
# ln -s /dev/null /etc/systemd/system/fedora-storage-init.service
# systemctl daemon-reload
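To double-check that a unit is properly masked this way, verify that its symlink resolves to /dev/null. A sketch against a scratch directory (on the real system the links live in /etc/systemd/system/):

```shell
# Masking = symlinking the unit file to /dev/null. Demonstrated in a
# scratch directory instead of the real /etc/systemd/system/:
dir=$(mktemp -d)
ln -s /dev/null "$dir/udev-settle.service"

# A masked unit resolves to /dev/null:
readlink "$dir/udev-settle.service"
```

This prints `/dev/null`; systemd treats such a unit file as empty and will not start it.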
After restarting we can measure that the boot is now about 1s faster. Why just 1s? Well, the second worst offender is cryptsetup: the machine in question has an encrypted /home directory. For testing purposes I have stored the passphrase in a file on disk, so that the boot-up is not delayed because I as the user am a slow typer. The cryptsetup tool unfortunately still takes more than 5s to set up the encrypted partition. Being lazy, instead of trying to fix cryptsetup[3] we'll just tape over it here[4]: systemd will normally wait for all file systems not marked with the noauto option in /etc/fstab to show up, to be fscked and to be mounted before proceeding with bootup and starting the usual system services. In the case of /home (unlike, for example, /var) we know that it is needed only very late (i.e. when the user actually logs in). An easy fix is hence to make the mount point available already during boot, but not actually wait until cryptsetup, fsck and mount have finished running for it. You ask how we can make a mount point available before actually mounting the file system behind it? Well, systemd possesses magic powers, in the form of the comment=systemd.automount mount option in /etc/fstab. If you specify it, systemd will create an automount point at /home, and if at the time of the first access the file system still isn't mounted there, systemd will wait for the device, fsck and mount it.
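Here is a sketch of what such an /etc/fstab line might look like; the device name and file system type are placeholders standing in for this machine's encrypted /home:

```
# /etc/fstab -- device and fs type are illustrative placeholders:
/dev/mapper/luks-home  /home  ext4  defaults,comment=systemd.automount  0  2
```

With this entry, boot proceeds immediately; cryptsetup, fsck and mount only run (and only block) on the first access to /home.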
And here's the result with this change to /etc/fstab made:
systemd[1]: Startup finished in 2s 47ms 112us (kernel) + 2s 663ms 942us (initrd) + 5s 540ms 522us (userspace) = 10s 251ms 576us.
Nice! With a few fixes we took almost 7s off our boot time. And these two changes are only fixes for the two most superficial problems. With a bit of love and detail work there's a lot of additional room for improvement. In fact, on a different machine, an X300 laptop that is more than two years old (and even back then wasn't the fastest machine on earth), with a bit of decrufting we now have boot times of around 4s (total), with a reasonably complete GNOME system. And there's still a lot of room left.
systemd-analyze blame is a nice and simple tool for tracking down slow services. However, it suffers from one big problem: it does not visualize how the parallel execution of the services actually diminishes the price one pays for slow-starting services. For that we have prepared systemd-analyze plot for you. Use it like this:
$ systemd-analyze plot > plot.svg
$ eog plot.svg
It creates pretty graphs, showing the time services spent to start up in relation to the other services. It currently doesn't visualize explicitly which services wait for which ones, but with a bit of guess work this is easily seen nonetheless.
To see the effect of our two little optimizations here are two graphs generated with systemd-analyze plot, the first before and the other after our change:
(For the sake of completeness, here are the two complete outputs of systemd-analyze blame for these two boots: before and after.)
The well-informed reader probably wonders how this relates to Michael Meeks' bootchart. It is true that this plot and bootchart show similar graphs. Bootchart is by far the more powerful tool: it plots in all detail what is happening during the boot, and how much CPU and IO is used. systemd-analyze plot shows more high-level data: which service took how much time to initialize, and what needed to wait for it. If you use them both together you'll have a wonderful toolset for figuring out why your boot is not as fast as it could be.
Now, before you take these tools and start filing bugs against the worst boot-up time offenders on your system: think twice. These tools give you raw data; don't misread it. As my optimization example above hopefully shows, the blame for the slow bootup was not actually with udev-settle.service, and not with the ModemManager prober run by it either. It is with the subsystem that pulled this service in in the first place. And that's where the problem needs to be fixed. So, file the bugs at the right places. Put the blame where the blame belongs.
As mentioned, these three utilities are available on your Fedora 15 system out-of-the-box.
And here's what to take home from this little blog story:
And that's all for now. Thank you for your interest.
Footnotes
[1] Also known as the greatest Free Software OS release ever.
[2] The right fix here is to improve the services in question to actively listen to hotplug events via libudev or similar and act on devices as they show up, so that we can continue with the bootup the instant everything we really need to go on has shown up. To get a quick bootup we should wait for what we actually need to proceed, not for everything. Also note that the storage services are not the only services which do not cope well with modern dynamic hardware and assume that the device list is static and stays unchanged. For example, in this example the reason the initrd is as slow as it is comes down mostly to the fact that Plymouth expects to be executed only after all video devices have shown up and have been probed. For an unknown reason (at least unknown to me) loading the video kernel modules for my Intel graphics card takes multiple seconds, and hence the entire boot is delayed unnecessarily. (Here too I'd not put the blame on the probing but on the fact that we wait for it to complete before going on.)
[3] Well, to be precise, I actually did try to get this fixed. Most of the delay of cryptsetup stems from the -- in my eyes -- unnecessarily high default value for --iter-time in cryptsetup. I tried to convince our cryptsetup maintainers that 100ms as a default here is not really less secure than 1s, but well, I failed.
[4] Of course, it's usually not our style to just tape over problems instead of fixing them, but this is such a nice occasion to show off yet another cool systemd feature...