What Are We Breaking Now?

At the end of February, devconf.cz took place in Brno, Czech Republic. At the conference Kay Sievers, Harald Hoyer and I gave two presentations about our work on systemd and about the systemd Journal. These talks were taped and the recordings are now available online.

First, here's our talk about What Are We Breaking Now?, in which we try to give an overview of what we are currently working on in the systemd context, and what we expect to do in the next few months. We cover Predictable Network Interface Names, the Boot Loader Spec, kdbus, the Apps framework, and more.

And then there's my second talk about The systemd Journal, with a focus on how to make practical use of journalctl as a day-to-day tool for administrators (these practical bits start around 28:40). The commands demoed there are all explained in an earlier blog story of mine.

Unfortunately, the audience questions are sometimes hard or impossible to understand from the videos, and sometimes the text on the slides is hard to read, but I still believe that the two talks are quite interesting.


systemd Hackfest!

Hey, you, systemd hacker, Fedora hacker! Listen up! This Thu/Fri is the systemd Hackfest in Brno/Czech Rep, right before devconf.cz! On Thursday we'll talk about (and hack on) all things systemd. And the hackfest Friday is going to be a Fedora Activity Day, so we'll have a focus on systemd integration into Fedora.

You are invited!

See you in Brno!


The Biggest Myths

Since we first proposed systemd for inclusion in the distributions it has been frequently discussed in many forums, mailing lists and conferences. In these discussions one can often hear certain myths about systemd that are repeated over and over again, but certainly don't gain any truth by constant repetition. Let's take the time to debunk a few of them:

  1. Myth: systemd is monolithic.

    If you build systemd with all configuration options enabled you will build 69 individual binaries. These binaries all serve different tasks, and are neatly separated for a number of reasons. For example, we designed systemd with security in mind, hence most daemons run at minimal privileges (using kernel capabilities, for example) and are responsible for very specific tasks only, to minimize their security surface and impact. Also, systemd parallelizes the boot more than any prior solution. This parallelization happens by running more processes in parallel. Thus it is essential that systemd is nicely split up into many binaries and thus processes. In fact, many of these binaries[1] are separated out so nicely, that they are very useful outside of systemd, too.

    A package involving 69 individual binaries can hardly be called monolithic. What is different from prior solutions however, is that we ship more components in a single tarball, and maintain them upstream in a single repository with a unified release cycle.

  2. Myth: systemd is about speed.

    Yes, systemd is fast (a pretty complete userspace boot-up in ~900ms, anyone?), but that's primarily just a side-effect of doing things right. In fact, we never really sat down and optimized the last tiny bit of performance out of systemd. Instead, we frequently and knowingly picked the slightly slower code paths in order to keep the code more readable. This doesn't mean being fast was irrelevant for us, but reducing systemd to its speed is certainly quite a misconception, since that is certainly not anywhere near the top of our list of goals.

  3. Myth: systemd's fast boot-up is irrelevant for servers.

    That is just completely not true. Many administrators actually are keen on reduced downtimes during maintenance windows. In High Availability setups it's kinda nice if the failed machine comes back up really fast. In cloud setups with a large number of VMs or containers the price of slow boots multiplies with the number of instances. Spending minutes of CPU and IO on really slow boots of hundreds of VMs or containers reduces your system's density drastically, heck, it even costs you more energy. Slow boots can be quite financially expensive. Also, fast booting of containers allows you to implement logic such as socket activated containers, allowing you to drastically increase the density of your cloud system.

    Of course, in many server setups boot-up is indeed irrelevant, but systemd is supposed to cover the whole range. And yes, I am aware that often it is the server firmware that costs the most time at boot-up, and the OS is fast anyway compared to that, but well, systemd is still supposed to cover the whole range (see above...), and no, not all servers have such bad firmware, and certainly not VMs and containers, which are servers of a kind, too.[2]

  4. Myth: systemd is incompatible with shell scripts.

    This is entirely bogus. We just don't use them for the boot process, because we believe they aren't the best tool for that specific purpose, but that doesn't mean systemd is incompatible with them. You can easily run shell scripts as systemd services; heck, you can run scripts written in any language as systemd services, systemd doesn't care the slightest bit what's inside your executable. Moreover, we heavily use shell scripts for our own purposes, for installing, building and testing systemd. And you can stick your scripts in the early boot process, use them for normal services, and even run them during late shutdown; there are practically no limits.

  5. Myth: systemd is difficult.

    This also is entirely nonsense. A systemd platform is actually much simpler than traditional Linuxes because it unifies system objects and their dependencies as systemd units. The configuration file language is very simple, and we got rid of redundant configuration files. We provide uniform tools for much of the configuration of the system. The system is much less of a conglomerate than traditional Linuxes are. We also have pretty comprehensive documentation (all linked from the homepage) about pretty much every detail of systemd, and this not only covers admin/user-facing interfaces, but also developer APIs.

    systemd certainly comes with a learning curve. Everything does. However, we like to believe that it is actually simpler to understand systemd than a Shell-based boot for most people. Surprised we say that? Well, as it turns out, Shell is not a pretty language to learn; its syntax is arcane and complex. systemd unit files are substantially easier to understand: they do not expose a programming language, but are simple and declarative by nature. That all said, if you are experienced in shell, then yes, adopting systemd will take a bit of learning.

    To make learning easy we tried hard to provide the maximum compatibility with previous solutions. But not only that, on many distributions you'll find that some of the traditional tools will now even tell you -- while executing what you are asking for -- how you could do it with the newer tools instead, in a possibly nicer way.

    Anyway, the take-away is that systemd is probably as simple as such a system can be, and that we try hard to make it easy to learn. But yes, if you know sysvinit then adopting systemd will require a bit of learning, though quite frankly, if you mastered sysvinit, then systemd should be easy for you.

  6. Myth: systemd is not modular.

    Not true at all. At compile time you have a number of configure switches to select what you want to build, and what not. And we document how you can select in even more detail what you need, going beyond our configure switches.

    This modularity is not totally unlike the one of the Linux kernel, where you can select many features individually at compile time. If the kernel is modular enough for you then systemd should be pretty close, too.

  7. Myth: systemd is only for desktops.

    That is certainly not true. With systemd we try to cover pretty much the same range as Linux itself does. While we care for desktop uses, we also care pretty much the same way for server uses, and embedded uses as well. You can bet that Red Hat wouldn't make it a core piece of RHEL7 if it wasn't the best option for managing services on servers.

    People from numerous companies work on systemd. Car manufacturers build it into cars, Red Hat uses it for a server operating system, and GNOME uses many of its interfaces for improving the desktop. You find it in toys, in space telescopes, and in wind turbines.

    Most of the features I worked on most recently are probably relevant primarily on servers, such as container support, resource management or the security features. We cover desktop systems pretty well already, and there are a number of companies doing systemd development for embedded use; some even offer consulting services for it.

  8. Myth: systemd was created as a result of the NIH syndrome.

    This is not true. Before we began working on systemd we were pushing for Canonical's Upstart to be widely adopted (and Fedora/RHEL used it too for a while). However, we eventually came to the conclusion that its design was inherently flawed at its core (at least in our eyes: most fundamentally, it leaves dependency management to the admin/developer, instead of solving this hard problem in code), and if something's wrong in the core you better replace it, rather than fix it. This was hardly the only reason though; other things came into play as well, such as the licensing/contribution agreement mess around it. NIH wasn't one of the reasons, though...[3]

  9. Myth: systemd is a freedesktop.org project.

    Well, systemd is certainly hosted at fdo, but freedesktop.org is little else but a repository for code and documentation. Pretty much any coder can request a repository there and dump their stuff in it (as long as it's somewhat relevant for the infrastructure of free systems). There's no cabal involved, no "standardization" scheme, no project vetting, nothing. It's just a nice, free, reliable place to have your repository. In that regard it's a bit like SourceForge, github, kernel.org, just not commercial and without over-the-top requirements, and hence a good place to keep our stuff.

    So yes, we host our stuff at fdo, but the implied assumption of this myth, namely that there is a group of people who meet and then agree on what future free systems will look like, is entirely bogus.

  10. Myth: systemd is not UNIX.

    There's certainly some truth in that. systemd's sources do not contain a single line of code originating from original UNIX. However, we derive inspiration from UNIX, and thus there's a ton of UNIX in systemd. For example, the UNIX idea of "everything is a file" finds reflection in that in systemd all services are exposed at runtime in a kernel file system, the cgroupfs. Then, one of the original features of UNIX was multi-seat support, based on built-in terminal support. Text terminals are hardly the state of the art in how you interface with your computer these days, however. With systemd we brought native multi-seat support back, but this time with full support for today's hardware, covering graphics, mice, audio, webcams and more, and all that fully automatic, hotplug-capable and without configuration. In fact, the design of systemd as a suite of integrated tools that each have their individual purposes but, when used together, are more than just the sum of the parts is pretty much at the core of the UNIX philosophy. Then, the way our project is handled (i.e. maintaining much of the core OS in a single git repository) is much closer to the BSD model (which is a true UNIX, unlike Linux) of doing things (where most of the core OS is kept in a single CVS/SVN repository) than things on Linux ever were.

    Ultimately, UNIX is something different for everybody. For us systemd maintainers it is something we derive inspiration from. For others it is a religion, and much like the other world religions there are different readings and understandings of it. Some define UNIX based on specific pieces of code heritage, others see it just as a set of ideas, others as a set of commands or APIs, and even others as a definition of behaviours. Of course, it is impossible to ever make all these people happy.

    Ultimately the question whether something is UNIX or not matters very little. Being technically excellent is hardly exclusive to UNIX. For us, UNIX is a major influence (heck, the biggest one), but we also have other influences. Hence in some areas systemd will be very UNIXy, and in others a little bit less.

  11. Myth: systemd is complex.

    There's certainly some truth in that. Modern computers are complex beasts, and the OS running on them will hence have to be complex too. However, systemd is certainly not more complex than prior implementations of the same components. If anything, it's simpler, and has less redundancy (see above). Moreover, building a simple OS based on systemd will involve much fewer packages than a traditional Linux did. Fewer packages make it easier to build your system, and get rid of interdependencies and of much of the differing behaviour of the components involved.

  12. Myth: systemd is bloated.

    Well, bloated certainly has many different definitions. But in most definitions systemd is probably the opposite of bloat. Since systemd components share a common code base, they tend to share much more code for common code paths. Here's an example: in a traditional Linux setup, sysvinit, start-stop-daemon, inetd, cron, dbus all implemented a scheme to execute processes with various configuration options in a certain, hopefully clean environment. On systemd the code paths for all of this, for the configuration parsing as well as the actual execution, are shared. This means less code, fewer places for mistakes, less memory and cache pressure, and is thus a very good thing. And as a side-effect you actually get a ton more functionality for it...

    As mentioned above, systemd is also pretty modular. You can choose at build time which components you need, and which you don't need. People can hence specifically choose the level of "bloat" they want.

    When you build systemd, it only requires three dependencies: glibc, libcap and dbus. That's it. It can make use of more dependencies, but these are entirely optional.

    So, yeah, whichever way you look at it, it's really not bloated.

  13. Myth: systemd being Linux-only is not nice to the BSDs.

    Completely wrong. The BSD folks are pretty much uninterested in systemd. If systemd was portable, this would change nothing, they still wouldn't adopt it. And the same is true for the other Unixes in the world. Solaris has SMF, BSD has their own "rc" system, and they always maintained it separately from Linux. The init system is very close to the core of the entire OS. And these other operating systems hence define themselves among other things by their core userspace. The assumption that they'd adopt our core userspace if we just made it portable, is completely without any foundation.

  14. Myth: systemd being Linux-only makes it impossible for Debian to adopt it as default.

    Debian supports non-Linux kernels in their distribution. systemd won't run on those. Is that a problem though, and should that hinder them from adopting systemd as the default? Not really. The folks who ported Debian to these other kernels were willing to invest time in a massive porting effort, they set up test and build systems, and patched and built numerous packages for their goal. The maintenance of both a systemd unit file and a classic init script for the packaged services is a negligible amount of work compared to that, especially since those scripts more often than not exist already.

  15. Myth: systemd could be ported to other kernels if its maintainers just wanted to.

    That is simply not true. Porting systemd to other kernels is not feasible. We just use too many Linux-specific interfaces. For a few one might find replacements on other kernels, some features one might want to turn off, but for most this is not really possible. Here's a small, far from comprehensive list: cgroups, fanotify, umount2(), /proc/self/mountinfo (including notification), /dev/swaps (same), udev, netlink, the structure of /sys, /proc/$PID/comm, /proc/$PID/cmdline, /proc/$PID/loginuid, /proc/$PID/stat, /proc/$PID/session, /proc/$PID/exe, /proc/$PID/fd, tmpfs, devtmpfs, capabilities, namespaces of all kinds, various prctl()s, numerous ioctls, the mount() system call and its semantics, selinux, audit, inotify, statfs, O_DIRECTORY, O_NOATIME, /proc/$PID/root, waitid(), SCM_CREDENTIALS, SCM_RIGHTS, mkostemp(), /dev/input, ...

    And no, if you look at this list and pick out the few where you can think of obvious counterparts on other kernels, then think again, and look at the others you didn't pick, and the complexity of replacing them.

  16. Myth: systemd is not portable for no reason.

    Nonsense! We use the Linux-specific functionality because we need it to implement what we want. Linux has so many features that UNIX/POSIX didn't have, and we want to empower the user with them. These features are incredibly useful, but only if they are actually exposed in a friendly way to the user, and that's what we do with systemd.

  17. Myth: systemd uses binary configuration files.

    No idea who came up with this crazy myth, but it's absolutely not true. systemd is configured pretty much exclusively via simple text files. A few settings you can also alter with the kernel command line and via environment variables. There's nothing binary in its configuration (not even XML). Just plain, simple, easy-to-read text files.

  18. Myth: systemd is a feature creep.

    Well, systemd certainly covers more ground than it used to. It's not just an init system anymore, but the basic userspace building block to build an OS from, but we carefully make sure to keep most of the features optional. You can turn a lot off at compile time, and even more at runtime. Thus you can choose freely how much feature creep you want.

  19. Myth: systemd forces you to do something.

    systemd is not the mafia. It's Free Software, you can do with it whatever you want, and that includes not using it. That's pretty much the opposite of "forcing".

  20. Myth: systemd makes it impossible to run syslog.

    Not true, we carefully made sure when we introduced the journal that all data is also passed on to any syslog daemon running. In fact, if anything changed, it's that syslog now gets more complete data than it did before, since we now cover early boot messages as well as the STDOUT/STDERR of any system service.

  21. Myth: systemd is incompatible.

    We try very hard to provide the best possible compatibility with sysvinit. In fact, the vast majority of init scripts should work just fine on systemd, unmodified. However, there actually are indeed a few incompatibilities, but we try to document these and explain what to do about them. Ultimately every system that is not actually sysvinit itself will have a certain amount of incompatibilities with it since it will not share the exact same code paths.

    It is our goal to ensure that differences between the various distributions are kept at a minimum. That means unit files usually work just fine on a different distribution than you wrote it on, which is a big improvement over classic init scripts which are very hard to write in a way that they run on multiple Linux distributions, due to numerous incompatibilities between them.

  22. Myth: systemd is not scriptable, because of its D-Bus use.

    Not true. Pretty much every single D-Bus interface systemd provides is also available in a command line tool, for example in systemctl, loginctl, timedatectl, hostnamectl, localectl and suchlike. You can easily call these tools from shell scripts, they open up pretty much the entire API from the command line with easy-to-use commands.

    That said, D-Bus actually has bindings for almost any scripting language this world knows. Even from the shell you can invoke arbitrary D-Bus methods with dbus-send or gdbus. If anything, this improves scriptability due to the good support of D-Bus in the various scripting languages.

  23. Myth: systemd requires you to use some arcane configuration tools instead of allowing you to edit your configuration files directly.

    Not true at all. We offer some configuration tools, and using them gets you a bit of additional functionality (for example, command line completion for all settings!), but there's no need at all to use them. You can always edit the files in question directly if you wish, and that's fully supported. Of course sometimes you need to explicitly reload configuration of some daemon after editing the configuration, but that's pretty much true for most UNIX services.

  24. Myth: systemd is unstable and buggy.

    Certainly not according to our data. We have been monitoring the Fedora bug tracker (and some others) closely for a long long time. The number of bugs is very low for such a central component of the OS, especially if you discount the numerous RFE bugs we track for the project. We are pretty good at keeping systemd out of the list of blocker bugs of the distribution. We have a relatively fast development cycle with mostly incremental changes to keep quality and stability high.

  25. Myth: systemd is not debuggable.

    False. Some people try to imply that the shell was a good debugger. Well, it isn't really. In systemd we provide you with actual debugging features instead. For example: interactive debugging, verbose tracing, the ability to mask any component during boot, and more. Also, we provide documentation for it.

    It's certainly well debuggable; we needed that for our own development work, after all. But we'll grant you one thing: it uses different debugging tools, though we believe they are more appropriate for the purpose.

  26. Myth: systemd makes changes for the changes' sake.

    Very much untrue. We pretty much exclusively have technical reasons for the changes we make, and we explain them in the various pieces of documentation, wiki pages, blog articles, mailing list announcements. We try hard to avoid making incompatible changes, and if we do we try to document the why and how in detail. And if you wonder about something, just ask us!

  27. Myth: systemd is a Red-Hat-only project, is private property of some smart-ass developers, who use it to push their views to the world.

    Not true. Currently, there are 16 hackers with commit powers to the systemd git tree. Of these 16 only six are employed by Red Hat. The ten others are folks from ArchLinux, from Debian, from Intel, even from Canonical, Mandriva, Pantheon and a number of community folks with full commit rights. And they frequently commit big stuff, major changes. Then, there are 374 individuals with patches in our tree, and they too came from a number of different companies and backgrounds, and many of those have way more than one patch in the tree. The discussions about where we want to take systemd are done in the open, on our IRC channel (#systemd on freenode, you are always welcome), on our mailing list, and on public hackfests (such as our next one in Brno, you are invited). We regularly attend various conferences, to collect feedback, to explain what we are doing and why, like few others do. We maintain blogs, engage in social networks (we actually have some pretty interesting content on Google+, and our Google+ Community is pretty alive, too), and try really hard to explain the why and the how of what we do, and to listen to feedback and figure out where the current issues are (for example, from that feedback we compiled this list of often-heard myths about systemd...).

    What most systemd contributors probably share is a rough idea of what a good OS should look like, and the desire to make it happen. However, by the very nature of the project being Open Source and rooted in the community, systemd is just what people want it to be, and if it's not what they want then they can drive the direction with patches and code, and if that's not feasible, then there are numerous other options to use, too; systemd is never exclusive.

    One goal of systemd is to unify the dispersed Linux landscape a bit. We try to get rid of many of the more pointless differences of the various distributions in various areas of the core OS. As part of that we sometimes adopt schemes that were previously used by only one of the distributions and push it to a level where it's the default of systemd, trying to gently push everybody towards the same set of basic configuration. This is never exclusive though, distributions can continue to deviate from that if they wish, however, if they end up using the well-supported default their work becomes much easier and they might gain a feature or two. Now, as it turns out, more frequently than not we actually adopted schemes that were Debianisms rather than Fedoraisms/Redhatisms as the best-supported scheme in systemd. For example, systems running systemd now generally store their hostname in /etc/hostname, something that used to be specific to Debian and now is used across distributions.

    One thing we'll grant you though, we sometimes can be smart-asses. We try to be prepared whenever we open our mouth, in order to be able to back up what we claim with facts. That might make us appear as smart-asses.

    But in general, yes, some of the more influential contributors of systemd work for Red Hat, but they are in the minority, and systemd is a healthy, open community with different interests, different backgrounds, just unified by a few rough ideas where the trip should go, a community where code and its design counts, and certainly not company affiliation.

  28. Myth: systemd doesn't support /usr split from the root directory.

    Nonsense. Since its beginnings systemd has supported the --with-rootprefix= option to its configure script, which allows you to tell systemd to neatly split up the stuff needed for early boot and the stuff needed for later on. All this logic is fully present and we keep it up-to-date right there in systemd's build system.

    Of course, we still don't think that actually booting with /usr unavailable is a good idea, but we support this just fine in our build system. This won't fix the inherent problems of the scheme that you'll encounter all across the board, but you can't blame that on systemd, because in systemd we support this just fine.

  29. Myth: systemd doesn't allow you to replace its components.

    Not true, you can turn off and replace pretty much any part of systemd, with very few exceptions. And those exceptions (such as journald) generally allow you to run an alternative side by side to it, while cooperating nicely with it.

  30. Myth: systemd's use of D-Bus instead of sockets makes it intransparent.

    This claim is already contradictory in itself: D-Bus uses sockets as transport, too. Hence whenever D-Bus is used to send something around, a socket is used for that too. D-Bus is mostly a standardized serialization of messages to send over these sockets. If anything this makes it more transparent, since this serialization is well documented, understood and there are numerous tracing tools and language bindings for it. This is very much unlike the usual homegrown protocols the various classic UNIX daemons use to communicate locally.

Hmm, did I write I just wanted to debunk a "few" myths? Maybe these were more than just a few... Anyway, I hope I managed to clear up a couple of misconceptions. Thanks for your time.

Footnotes

[1] For example, systemd-detect-virt, systemd-tmpfiles and systemd-udevd are.

[2] Also, we are trying to do our little part on maybe making this better. By exposing boot-time performance of the firmware more prominently in systemd's boot output we hope to shame the firmware writers into cleaning up their stuff.

[3] And anyways, guess which project includes a library "libnih" -- Upstart or systemd?[4]

[4] Hint: it's not systemd!


systemd for Administrators, Part XX

This is no time for procrastination, here is already the twentieth installment of my ongoing series on systemd for Administrators:

Socket Activated Internet Services and OS Containers

Socket Activation is an important feature of systemd. When we first announced systemd we already tried to make the point how great socket activation is for increasing parallelization and robustness of socket services, but also for simplifying the dependency logic of the boot. In this episode I'd like to explain why socket activation is an important tool for drastically improving how many services and even containers you can run on a single system with the same resource usage. Or in other words, how you can drive up the density of customer sites on a system while spending less on new hardware.

Socket Activated Internet Services

First, let's take a step back. What was socket activation again? -- Basically, socket activation simply means that systemd sets up listening sockets (IP or otherwise) on behalf of your services (without these running yet), and then starts (activates) the services as soon as the first connection comes in. Depending on the technology the services might idle for a while after having processed the connection and possible follow-up connections before they exit on their own, so that systemd will again listen on the sockets and activate the services again the next time they are connected to. For the client it is not visible whether the service it is interested in is currently running or not. The service's IP socket stays continuously connectable, no connection attempt ever fails, and all connects will be processed promptly.

A setup like this lowers resource usage: as services are only running when needed they only consume resources when required. Many internet sites and services can benefit from that. For example, web site hosters will have noticed that of the multitude of web sites that are on the Internet only a tiny fraction gets a continuous stream of requests: the huge majority of web sites still needs to be available all the time but gets requests only very infrequently. With a scheme like socket activation you can take advantage of this. Hosting many of these sites on a single system and only activating their services as necessary allows a large degree of over-commit: you can run more sites on your system than the available resources actually allow. Of course, one shouldn't over-commit too much to avoid contention during peak times.

Socket activation like this is easy to use in systemd. Many modern Internet daemons already support socket activation out of the box (and for those which don't yet it's not hard to add). Together with systemd's instantiated units support it is easy to write a pair of service and socket templates that then may be instantiated multiple times, once for each site. Then, (optionally) make use of some of the security features of systemd to nicely isolate the customer's site's services from each other (think: each customer's service should only see the home directory of the customer, everybody else's directories should be invisible), and there you go: you now have a highly scalable and reliable server system, that serves a maximum of securely sandboxed services at a minimum of resources, and all nicely done with built-in technology of your OS.
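
To make this a bit more concrete, here is a rough sketch of what such a template pair might look like, say /etc/systemd/system/site@.socket plus /etc/systemd/system/site@.service. The unit names, the socket path and the site-daemon binary are made up for illustration; a real setup would use whatever per-site application server you deploy:

[Unit]
Description=Socket for customer site %i

[Socket]
ListenStream=/run/sites/%i.socket

And the matching service template:

[Unit]
Description=Service for customer site %i

[Service]
ExecStart=/usr/bin/site-daemon --root=/srv/sites/%i
User=%i
PrivateTmp=true

With that in place, starting site@alice.socket sets up the listening socket for that one customer, and site@alice.service is only activated when the first request for the site actually comes in.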

This kind of setup is already in production use in a number of companies. For example, the great folks at Pantheon are running their scalable instant Drupal system on a setup that is similar to this. (In fact, Pantheon's David Strauss pioneered this scheme. David, you rock!)

Socket Activated OS Containers

All of the above can already be done with older versions of systemd. If you use a distribution that is based on systemd, you can right-away set up a system like the one explained above. But let's take this one step further. With systemd 197 (to be included in Fedora 19), we added support for socket activating not only individual services, but entire OS containers. And I really have to say it at this point: this is stuff I am really excited about. ;-)

Basically, with socket activated OS containers, the host's systemd instance will listen on a number of ports on behalf of a container, for example one for SSH, one for web and one for the database, and as soon as the first connection comes in, it will spawn the container this is intended for, and pass to it all three sockets. Inside of the container, another systemd is running and will accept the sockets and then distribute them further, to the services running inside the container using normal socket activation. The SSH, web and database services will only see the inside of the container, even though they have been activated by sockets that were originally created on the host! Again, to the clients this all is not visible. That an entire OS container is spawned, triggered by a simple network connection, is entirely transparent to the client side.[1]

The OS containers may contain (as the name suggests) a full operating system, that might even be a different distribution than is running on the host. For example, you could run your host on Fedora, but run a number of Debian containers inside of it. The OS containers will have their own systemd init system, their own SSH instances, their own process tree, and so on, but will share a number of other facilities (such as memory management) with the host.

For now, only systemd's own trivial container manager, systemd-nspawn, has been updated to support this kind of socket activation. We hope that libvirt-lxc will soon gain similar functionality. At this point, let's see in more detail how such a setup is configured in systemd using nspawn:

First, please use a tool such as debootstrap or yum's --installroot to set up a container OS tree[2]. The details of that are a bit out of focus for this story; there's plenty of documentation around on how to do this. Of course, make sure you have systemd v197 installed inside the container. For accessing the container from the command line, consider using systemd-nspawn itself. After you configured everything properly, try to boot it up from the command line with systemd-nspawn's -b switch.
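
For example, assuming the container tree lives in /srv/mycontainer (as in the rest of this story), such a first interactive test boot might look like this:

# Boot the container OS tree interactively, for a first test
systemd-nspawn -bD /srv/mycontainer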

Assuming you now have a working container that boots up fine, let's write a service file for it, to turn the container into a systemd service on the host you can start and stop. Let's create /etc/systemd/system/mycontainer.service on the host:

[Unit]
Description=My little container

[Service]
ExecStart=/usr/bin/systemd-nspawn -jbD /srv/mycontainer 3
KillMode=process

This service can already be started and stopped via systemctl start and systemctl stop. However, there's no nice way to actually get a shell prompt inside the container. So let's add SSH to it, and even more: let's configure SSH so that a connection to the container's SSH port will socket-activate the entire container. First, let's begin with telling the host that it shall now listen on the SSH port of the container. Let's create /etc/systemd/system/mycontainer.socket on the host:

[Unit]
Description=The SSH socket of my little container

[Socket]
ListenStream=23

If we start this unit with systemctl start on the host then it will listen on port 23, and as soon as a connection comes in it will activate our container service we defined above. We pick port 23 here, instead of the usual 22, as our host's SSH is already listening on that. nspawn virtualizes the process list and the file system tree, but does not actually virtualize the network stack, hence we just pick different ports for the host and the various containers here.

Of course, the system inside the container doesn't yet know what to do with the socket it gets passed due to socket activation. If you'd now try to connect to the port, the container would start-up but the incoming connection would be immediately closed since the container can't handle it yet. Let's fix that!

All that's necessary for that is to teach SSH inside the container about socket activation. For that let's simply write a pair of socket and service units for SSH. Let's create /etc/systemd/system/sshd.socket in the container:

[Unit]
Description=SSH Socket for Per-Connection Servers

[Socket]
ListenStream=23
Accept=yes

Then, let's add the matching SSH service file /etc/systemd/system/sshd@.service in the container:

[Unit]
Description=SSH Per-Connection Server for %I

[Service]
ExecStart=-/usr/sbin/sshd -i
StandardInput=socket

Then, make sure to hook sshd.socket into the sockets.target so that unit is started automatically when the container boots up:

ln -s /etc/systemd/system/sshd.socket /etc/systemd/system/sockets.target.wants/

And that's it. If we now activate mycontainer.socket on the host, the host's systemd will bind the socket and we can connect to it. If we do this, the host's systemd will activate the container, and pass the socket in to it. The container's systemd will then take the socket, match it up with sshd.socket inside the container. As there's still our incoming connection queued on it, it will then immediately trigger an instance of sshd@.service, and we'll have our login.
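
In other words, assuming the units shown above are all in place, the whole flow on the host boils down to something like this:

# Bind the container's SSH port on the host
systemctl start mycontainer.socket

# The first connection socket-activates the container and its sshd
ssh -p 23 root@localhost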

And that's already everything there is to it. You can easily add additional sockets to listen on to mycontainer.socket. Everything listed therein will be passed to the container on activation, and will be matched up as well as possible with all socket units configured inside the container. Sockets that cannot be matched up will be closed, and sockets that aren't passed in but are configured for listening will be bound by the container's systemd instance.
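
For example, to additionally pass a web port into the container you could extend mycontainer.socket like this (port 8080 is an arbitrary choice here, and of course the container needs a matching socket unit for it inside):

[Unit]
Description=The SSH and web sockets of my little container

[Socket]
ListenStream=23
ListenStream=8080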

So, let's take a step back again. What did we gain through all of this? Well, basically, we can now offer a number of full OS containers on a single host, and the containers can offer their services without running continuously. The density of OS containers on the host can hence be increased drastically.

Of course, this only works for kernel-based virtualization, not for hardware virtualization. I.e. something like this can only be implemented on systems such as libvirt-lxc or nspawn, but not in qemu/kvm.

If you have a number of containers set up like this, here's one cool thing the journal allows you to do. If you pass -m to journalctl on the host, it will automatically discover the journals of all local containers and interleave them on display. Nifty, eh?
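
For example:

# Show the host's journal interleaved with the journals of all local containers
journalctl -m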

With systemd 197 you have everything to set up your own socket activated OS containers on-board. However, there are a couple of improvements we're likely to add soon: for example, right now even if all services inside the container exit on idle, the container still will stay around, and we really should make it exit on idle too, if all its services exited and no logins are around. As it turns out we already have much of the infrastructure for this around: we can reuse the auto-suspend functionality we added for laptops: detecting when a laptop is idle and suspending it then is a very similar problem to detecting when a container is idle and shutting it down then.

Anyway, this blog story is already way too long. I hope I haven't lost you half-way already with all this talk of virtualization, sockets, services, different OSes and stuff. I hope this blog story is a good starting point for setting up powerful highly scalable server systems. If you want to know more, consult the documentation and drop by our IRC channel. Thank you!

Footnotes

[1] And BTW, this is another reason why fast boot times the way systemd offers them are actually a really good thing on servers, too.

[2] To make it easy: you need a command line such as yum --releasever=19 --nogpg --installroot=/srv/mycontainer/ --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim-minimal to install Fedora, and debootstrap --arch=amd64 unstable /srv/mycontainer/ to install Debian. Also see the bottom of systemd-nspawn(1). Also note that auditing is currently broken for containers, and if enabled in the kernel will cause all kinds of errors in the container. Use audit=0 on the host's kernel command line to turn it off.


systemd for Administrators, Part XIX

Happy new year 2013! Here is now the nineteenth installment of my ongoing series on systemd for Administrators:

Detecting Virtualization

When we started working on systemd we had a closer look at what the various existing init scripts used on Linux were actually doing. Among other things we noticed that a number of them were checking explicitly whether they were running in a virtualized environment (i.e. in a kvm, VMWare, LXC guest or suchlike) or not. Some init scripts disabled themselves in such cases[1], others enabled themselves only in such cases[2]. Frequently, it would probably have been a better idea to check for other conditions rather than explicitly checking for virtualization, but after looking at this from all sides we came to the conclusion that in many cases explicitly conditionalizing services based on detected virtualization is a valid thing to do. As a result we added a new configuration option to systemd that can be used to conditionalize services this way: ConditionVirtualization; we also added a small tool that can be used in shell scripts to detect virtualization: systemd-detect-virt(1); and finally, we added a minimal bus interface to query this from other applications.

Detecting whether your code is run inside a virtualized environment is actually not that hard. Depending on what precisely you want to detect it's little more than running the CPUID instruction and maybe checking a few files in /sys and /proc. The complexity is mostly about knowing the strings to look for, and keeping this list up-to-date. Currently, the virtualization detection code in systemd can detect the following virtualization systems:

  • Hardware virtualization (i.e. VMs):

    • qemu
    • kvm
    • vmware
    • microsoft
    • oracle
    • xen
    • bochs
  • Same-kernel virtualization (i.e. containers):

    • openvz
    • lxc
    • lxc-libvirt
    • systemd-nspawn

Let's have a look at how one may make use of this functionality.

Conditionalizing Units

Adding ConditionVirtualization to the [Unit] section of a unit file is enough to conditionalize it depending on which virtualization is used or whether one is used at all. Here's an example:

[Unit]
Description=My Foobar Service (runs only on guests)
ConditionVirtualization=yes

[Service]
ExecStart=/usr/bin/foobard

Instead of specifying "yes" or "no" it is possible to specify the ID of a specific virtualization solution (Example: "kvm", "vmware", ...), or either "container" or "vm" to check whether the kernel or the hardware is virtualized. Also, checks can be prefixed with an exclamation mark ("!") to invert a check. For further details see the manual page.
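
To give a rough idea, here are a few example values for such a [Unit] section line, shown as alternatives (in a real unit file you would typically pick just one):

# Run only in containers, of any kind
ConditionVirtualization=container

# Run only in KVM guests
ConditionVirtualization=kvm

# Skip the unit in VMs, i.e. run on bare metal and in containers
ConditionVirtualization=!vm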

In Shell Scripts

In shell scripts it is easy to check for virtualized systems with the systemd-detect-virt(1) tool. Here's an example:

if systemd-detect-virt -q ; then
        echo "Virtualization is used:" `systemd-detect-virt`
else
        echo "No virtualization is used."
fi

If this tool is run it will return with an exit code of zero (success) if a virtualization solution has been found, non-zero otherwise. It will also print a short identifier of the used virtualization solution, which can be suppressed with -q. Also, with the -c and -v parameters it is possible to detect only kernel or only hardware virtualization environments. For further details see the manual page.
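
For example, a script that cares about the distinction between containers, VMs and bare metal might do something like this:

if systemd-detect-virt -q -c ; then
        echo "Container virtualization:" `systemd-detect-virt -c`
elif systemd-detect-virt -q -v ; then
        echo "Hardware virtualization:" `systemd-detect-virt -v`
else
        echo "No virtualization is used."
fi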

In Programs

Whether virtualization is available is also exported on the system bus:

$ gdbus call --system --dest org.freedesktop.systemd1 --object-path /org/freedesktop/systemd1 --method org.freedesktop.DBus.Properties.Get org.freedesktop.systemd1.Manager Virtualization
(<'systemd-nspawn'>,)

This property contains the empty string if no virtualization is detected. Note that some container environments cannot be detected directly from unprivileged code. That's why we expose this property on the bus rather than providing a library -- the bus implicitly solves the privilege problem quite nicely.

Note that all of this will only ever detect and return information about the "inner-most" virtualization solution. If you stack virtualization ("We must go deeper!") then these interfaces will expose the one the code is most directly interfacing with. Specifically that means that if a container solution is used inside of a VM, then only the container is generally detected and returned.

Footnotes

[1] For example: running certain device management services in a container environment that has no access to any physical hardware makes little sense.

[2] For example: some VM solutions work best if certain vendor-specific userspace components are running that connect the guest with the host in some way.


Third Berlin Open Source Meetup

The Third Berlin Open Source Meetup is going to take place on Sunday, January 20th. You are invited!

It's a public event, so everybody is welcome, and please feel free to invite others!


foss.in Needs Your Funding!

One of the most exciting conferences in the Free Software world, foss.in in Bangalore, India, has trouble finding enough sponsorship for this year's edition. Many speakers from all around the Free Software world (including yours truly) have signed up to present at the event, and the conference would appreciate any corporate funding they can get!

Please check if your company can help and contact the organizers for details!

See you in Bangalore!



systemd for Developers III

Here's the third episode of my systemd for Developers series.

Logging to the Journal

In a recent blog story intended for administrators I shed some light on how to use the journalctl(1) tool to browse and search the systemd journal. In this blog story for developers I want to explain a little how to get log data into the systemd Journal in the first place.

The good thing is that getting log data into the Journal is not particularly hard, since there's a good chance the Journal already collects it anyway and writes it to disk. The journal collects:

  1. All data logged via libc syslog()
  2. The data from the kernel logged with printk()
  3. Everything written to STDOUT/STDERR of any system service

This covers pretty much all of the traditional log output of a Linux system, including messages from the kernel initialization phase, the initial RAM disk, the early boot logic, and the main system runtime.

syslog()

Let's have a quick look how syslog() is used again. Let's write a journal message using this call:

#include <syslog.h>

int main(int argc, char *argv[]) {
        syslog(LOG_NOTICE, "Hello World!");
        return 0;
}

This is C code, of course. Many higher level languages provide APIs that allow writing local syslog messages. Regardless which language you choose, all data written like this ends up in the Journal.
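
For example, from a shell script the classic logger(1) tool does the job, and its output ends up in the Journal just the same:

logger -p user.notice "Hello World"

Either way, the message ends up as a regular journal entry.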

Let's have a look how this looks after it has been written into the journal (this is the JSON output journalctl -o json-pretty generates):

{
        "_BOOT_ID" : "5335e9cf5d954633bb99aefc0ec38c25",
        "_TRANSPORT" : "syslog",
        "PRIORITY" : "5",
        "_UID" : "500",
        "_GID" : "500",
        "_AUDIT_SESSION" : "2",
        "_AUDIT_LOGINUID" : "500",
        "_SYSTEMD_CGROUP" : "/user/lennart/2",
        "_SYSTEMD_SESSION" : "2",
        "_SELINUX_CONTEXT" : "unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023",
        "_MACHINE_ID" : "a91663387a90b89f185d4e860000001a",
        "_HOSTNAME" : "epsilon",
        "_COMM" : "test-journal-su",
        "_CMDLINE" : "./test-journal-submit",
        "SYSLOG_FACILITY" : "1",
        "_EXE" : "/home/lennart/projects/systemd/test-journal-submit",
        "_PID" : "3068",
        "SYSLOG_IDENTIFIER" : "test-journal-submit",
        "MESSAGE" : "Hello World!",
        "_SOURCE_REALTIME_TIMESTAMP" : "1351126905014938"
}

This nicely shows how the Journal implicitly augmented our little log message with various meta data fields which describe in more detail the context our message was generated from. For an explanation of the various fields, please refer to systemd.journal-fields(7).
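
These implicit fields are immediately useful for filtering, by the way. For example, to see everything the journal collected from this little test tool you can match on the stored _COMM= value shown above:

journalctl _COMM=test-journal-su -o json-pretty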

printf()

If you are writing code that is run as a systemd service, generating journal messages is even easier:

#include <stdio.h>

int main(int argc, char *argv[]) {
        printf("Hello World\n");
        return 0;
}

Yupp, that's easy, indeed.

The printed string in this example is logged at a default log priority of LOG_INFO[1]. Sometimes it is useful to change the log priority for such a printed string. When systemd parses STDOUT/STDERR of a service it will look for priority values enclosed in < > at the beginning of each line[2], following the scheme used by the kernel's printk() which in turn took inspiration from the BSD syslog network serialization of messages. We can make use of this systemd feature like this:

#include <stdio.h>

#define PREFIX_NOTICE "<5>"

int main(int argc, char *argv[]) {
        printf(PREFIX_NOTICE "Hello World\n");
        return 0;
}

Nice! Logging with nothing but printf() but we still get log priorities!

This scheme works with any programming language, including, of course, shell:

#!/bin/bash

echo "<5>Hellow world"

Native Messages

Now, what I explained above is not particularly exciting: the take-away is pretty much only that things end up in the journal if they are output using the traditional message printing APIs. Yaaawn!

Let's make this more interesting, let's look at what the Journal provides as native APIs for logging, and let's see what its benefits are. Let's translate our little example into the 1:1 counterpart using the Journal's logging API sd_journal_print(3):

#include <systemd/sd-journal.h>

int main(int argc, char *argv[]) {
        sd_journal_print(LOG_NOTICE, "Hello World");
        return 0;
}

This doesn't look much more interesting than the two examples above, right? After compiling this with `pkg-config --cflags --libs libsystemd-journal` appended to the compiler parameters, let's have a closer look at the JSON representation of the journal entry this generates:

 {
        "_BOOT_ID" : "5335e9cf5d954633bb99aefc0ec38c25",
        "PRIORITY" : "5",
        "_UID" : "500",
        "_GID" : "500",
        "_AUDIT_SESSION" : "2",
        "_AUDIT_LOGINUID" : "500",
        "_SYSTEMD_CGROUP" : "/user/lennart/2",
        "_SYSTEMD_SESSION" : "2",
        "_SELINUX_CONTEXT" : "unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023",
        "_MACHINE_ID" : "a91663387a90b89f185d4e860000001a",
        "_HOSTNAME" : "epsilon",
        "CODE_FUNC" : "main",
        "_TRANSPORT" : "journal",
        "_COMM" : "test-journal-su",
        "_CMDLINE" : "./test-journal-submit",
        "CODE_FILE" : "src/journal/test-journal-submit.c",
        "_EXE" : "/home/lennart/projects/systemd/test-journal-submit",
        "MESSAGE" : "Hello World",
        "CODE_LINE" : "4",
        "_PID" : "3516",
        "_SOURCE_REALTIME_TIMESTAMP" : "1351128226954170"
}

This looks pretty much the same, right? Almost! Compare it carefully with the earlier output and you'll spot three new fields: CODE_FILE=, CODE_LINE= and CODE_FUNC=. Yes, you guessed it, by using sd_journal_print() meta information about the generating source code location is implicitly appended to each message[3], which is helpful for a developer to identify the source of a problem.

The primary reason for using the Journal's native logging APIs is not just the source code location however: it is to allow passing additional structured log data from the program into the journal. This additional log data may then be used to search the journal, is available for consumption by other programs, and might help the administrator to track down issues beyond what is expressed in the human readable message text. Here's an example how to do that with sd_journal_send():

#include <systemd/sd-journal.h>
#include <unistd.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
        sd_journal_send("MESSAGE=Hello World!",
                        "MESSAGE_ID=52fb62f99e2c49d89cfbf9d6de5e3555",
                        "PRIORITY=5",
                        "HOME=%s", getenv("HOME"),
                        "TERM=%s", getenv("TERM"),
                        "PAGE_SIZE=%li", sysconf(_SC_PAGESIZE),
                        "N_CPUS=%li", sysconf(_SC_NPROCESSORS_ONLN),
                        NULL);
        return 0;
}

This will write a log message to the journal much like the earlier examples. However, this time a few additional, structured fields are attached:

{
        "__CURSOR" : "s=ac9e9c423355411d87bf0ba1a9b424e8;i=5930;b=5335e9cf5d954633bb99aefc0ec38c25;m=16544f875b;t=4ccd863cdc4f0;x=896defe53cc1a96a",
        "__REALTIME_TIMESTAMP" : "1351129666274544",
        "__MONOTONIC_TIMESTAMP" : "95903778651",
        "_BOOT_ID" : "5335e9cf5d954633bb99aefc0ec38c25",
        "PRIORITY" : "5",
        "_UID" : "500",
        "_GID" : "500",
        "_AUDIT_SESSION" : "2",
        "_AUDIT_LOGINUID" : "500",
        "_SYSTEMD_CGROUP" : "/user/lennart/2",
        "_SYSTEMD_SESSION" : "2",
        "_SELINUX_CONTEXT" : "unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023",
        "_MACHINE_ID" : "a91663387a90b89f185d4e860000001a",
        "_HOSTNAME" : "epsilon",
        "CODE_FUNC" : "main",
        "_TRANSPORT" : "journal",
        "_COMM" : "test-journal-su",
        "_CMDLINE" : "./test-journal-submit",
        "CODE_FILE" : "src/journal/test-journal-submit.c",
        "_EXE" : "/home/lennart/projects/systemd/test-journal-submit",
        "MESSAGE" : "Hello World!",
        "_PID" : "4049",
        "CODE_LINE" : "6",
        "MESSAGE_ID" : "52fb62f99e2c49d89cfbf9d6de5e3555",
        "HOME" : "/home/lennart",
        "TERM" : "xterm-256color",
        "PAGE_SIZE" : "4096",
        "N_CPUS" : "4",
        "_SOURCE_REALTIME_TIMESTAMP" : "1351129666241467"
}

Awesome! Our simple example worked! The seven fields we attached to our message appeared in the journal. We used sd_journal_send() for this which works much like sd_journal_print() but takes a NULL terminated list of format strings each followed by its arguments. The format strings must include the field name and a '=' before the values.

Our little structured message included seven fields. The first three we passed are well-known fields:

  1. MESSAGE= is the actual human readable message part of the structured message.
  2. PRIORITY= is the numeric message priority value as known from BSD syslog formatted as an integer string.
  3. MESSAGE_ID= is a 128bit ID that identifies our specific message call, formatted as a hexadecimal string. We randomly generated this string with journalctl --new-id128. This can be used by applications to track down all occasions of this specific message. The 128bit ID can be a UUID, but this is not a requirement or enforced.

Applications may relatively freely define additional fields as they see fit (we defined four pretty arbitrary ones in our example). A complete list of the currently well-known fields is available in systemd.journal-fields(7).

Let's see how the message ID helps us finding this message and all its occasions in the journal:

$ journalctl MESSAGE_ID=52fb62f99e2c49d89cfbf9d6de5e3555
-- Logs begin at Thu, 2012-10-18 04:07:03 CEST, end at Thu, 2012-10-25 04:48:21 CEST. --
Oct 25 03:47:46 epsilon test-journal-se[4049]: Hello World!
Oct 25 04:40:36 epsilon test-journal-se[4480]: Hello World!

Seems I already invoked this example tool twice!

Many messages systemd itself generates have message IDs. This is useful for example, to find all occasions where a program dumped core (journalctl MESSAGE_ID=fc2e22bc6ee647b6b90729ab34a250b1), or when a user logged in (journalctl MESSAGE_ID=8d45620c1a4348dbb17410da57c60c66). If your application generates a message that might be interesting to recognize in the journal stream later on, we recommend attaching such a message ID to it. You can easily allocate a new one for your message with journalctl --new-id128.

This example shows how we can use the Journal's native APIs to generate structured, recognizable messages. You can do much more than this with the C API. For example, you may store binary data in journal fields as well, which is useful to attach coredumps or hard disk SMART states to events where this applies. In order to keep this blog story from getting even longer than it already is we'll not go into detail about how to do this, and I ask you to check out sd_journal_send(3) for further information on this.

Python

The examples above focus on C. Structured logging to the Journal is also available from other languages. Along with systemd itself we ship bindings for Python. Here's an example how to use this:

from systemd import journal
journal.send('Hello world')
journal.send('Hello, again, world', FIELD2='Greetings!', FIELD3='Guten tag')

Other bindings exist for Node.js, PHP and Lua.

Portability

Generating structured data is a very useful feature for services to make their logs more accessible both for administrators and other programs. In addition to the implicit structure the Journal adds to all logged messages it is highly beneficial if the various components of our stack also provide explicit structure in their messages, coming from within the processes themselves.

Porting an existing program to the Journal's logging APIs comes with one pitfall though: the Journal is Linux-only. If non-Linux portability matters for your project, it's a good idea to provide an alternative log output and make it selectable at compile time.
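As a rough illustration, here's a minimal sketch of such a compile-time switch. The HAVE_SYSTEMD macro and the log_info() wrapper are assumptions made up for this example (your build system would define something similar), not part of any API:

#include <syslog.h>

#ifdef HAVE_SYSTEMD
#include <systemd/sd-journal.h>
#endif

/* Hypothetical wrapper: emit one informational message via whichever
 * backend was selected at compile time. */
static void log_info(const char *message) {
#ifdef HAVE_SYSTEMD
        /* Structured logging via the Journal's native API */
        sd_journal_send("MESSAGE=%s", message,
                        "PRIORITY=%i", LOG_INFO,
                        NULL);
#else
        /* Portable fallback: plain old BSD syslog */
        syslog(LOG_INFO, "%s", message);
#endif
}

int main(void) {
        log_info("Hello World!");
        return 0;
}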

Regardless of which way of logging you choose, in all cases we'll forward the message to a classic syslog daemon running side-by-side with the Journal, if there is one. However, much of the structured meta data of the message is not forwarded, since the classic syslog protocol simply has no generally accepted way to encode it, and we shouldn't attempt to serialize meta data into classic syslog messages, which might turn /var/log/messages into an unreadable dump of machine data. To summarize: regardless of whether you log with syslog(), printf(), sd_journal_print() or sd_journal_send(), the message will be stored and indexed by the journal, and it will also be forwarded to classic syslog.

And that's it for today. In a follow-up episode we'll focus on retrieving messages from the Journal using the C API, possibly filtering for a specific subset of messages. Later on, I hope to give a real-life example of how to port an existing service to the Journal's logging APIs. Stay tuned!

Footnotes

[1] This can be changed with the SyslogLevel= service setting. See systemd.exec(5) for details.

[2] Interpretation of the < > prefixes of logged lines may be disabled with the SyslogLevelPrefix= service setting. See systemd.exec(5) for details.

[3] Appending the code location to the log messages can be turned off at compile time by defining -DSD_JOURNAL_SUPPRESS_LOCATION.


systemd for Administrators, Part XVIII

Hot on the heels of the previous story, here's now the eighteenth installment of my ongoing series on systemd for Administrators:

Managing Resources

An important facet of modern computing is resource management: if you run more than one program on a single machine you want to assign the available resources to them according to particular policies. This is particularly crucial on smaller, embedded or mobile systems where scarce resources are the main constraint, but it matters equally for large installations such as cloud setups, where resources are plentiful but the number of programs/services/containers on a single node is drastically higher.

Traditionally, on Linux only one policy was really available: all processes got about the same CPU time or IO bandwidth, modulated a bit via the process nice value. This approach is very simple and covered the various uses of Linux quite well for a long time. However, it has drawbacks: not all processes deserve an even share, and services involving lots of processes (think: Apache with a lot of CGI workers) would this way get more resources than services with very few (think: syslog).

When thinking about service management for systemd, we quickly realized that resource management must be core functionality of it. In a modern world -- regardless of whether server or embedded -- controlling CPU, memory, and IO resources of the various services cannot be an afterthought, but must be built-in as first-class service settings. And it must be per-service and not per-process as the traditional nice values or POSIX Resource Limits were.

In this story I want to shed some light on what you can do to enforce resource policies on systemd services. Resource Management in one way or another has been available in systemd for a while already, so it's really time we introduce this to the broader audience.

In an earlier blog post I highlighted the difference between Linux Control Groups (cgroups) as a labelled, hierarchical grouping mechanism, and Linux cgroups as a resource-controlling subsystem. While systemd requires the former, the latter is optional. And this optional latter part is what we can now make use of to manage per-service resources. (At this point, it's probably a good idea to read up on cgroups before reading on, to get at least a basic idea of what they are and what they accomplish. Even though the explanations below will be pretty high-level, it all makes a lot more sense if you grok the background a bit.)

The main Linux cgroup controllers for resource management are cpu, memory and blkio. To make use of these, they need to be enabled in the kernel, which many distributions (including Fedora) do. systemd exposes a couple of high-level service settings to make use of these controllers without requiring too much knowledge of the gory kernel details.

Managing CPU

As a nice default, if the cpu controller is enabled in the kernel, systemd will create a cgroup for each service when starting it. Without any further configuration this already has one nice effect: on a systemd system every system service will get an even amount of CPU, regardless of how many processes it consists of. Or in other words: on your web server MySQL will get roughly the same amount of CPU as Apache, even if the latter consists of 1000 CGI script processes, but the former only of a few worker tasks. (This behavior can be turned off, see DefaultControllers= in /etc/systemd/system.conf.)

On top of this default, it is possible to explicitly configure the CPU shares a service gets with the CPUShares= setting. The default value is 1024; if you increase this number you'll assign more CPU to a service than to an unaltered one at 1024, and if you decrease it, less.

Let's see in more detail how we can make use of this. Let's say we want to assign Apache 1500 CPU shares instead of the default of 1024. For that, let's create a new administrator service file for Apache in /etc/systemd/system/httpd.service, overriding the vendor-supplied one in /usr/lib/systemd/system/httpd.service, and change the CPUShares= parameter:

.include /usr/lib/systemd/system/httpd.service

[Service]
CPUShares=1500

The first line will pull in the vendor service file. Now, let's reload systemd's configuration and restart Apache so that the new service file is taken into account:

systemctl daemon-reload
systemctl restart httpd.service

And yeah, that's already it, you are done!

(Note that setting CPUShares= in a unit file will cause the specific service to get its own cgroup in the cpu hierarchy, even if cpu is not included in DefaultControllers=.)

Analyzing Resource usage

Of course, changing resource assignments without actually understanding the resource usage of the services in question is like flying blind. To help you understand the resource usage of all services, we created the tool systemd-cgtop, which enumerates all cgroups of the system, determines their resource usage (CPU, memory, and IO) and presents them in a top-like fashion. Building on the fact that systemd services are managed in cgroups, this tool can hence present to you for services what top shows you for processes.

Unfortunately, by default systemd-cgtop will only be able to chart CPU usage per service; IO and memory are only tracked as totals for the entire machine. The reason for this is simply that by default there are no per-service cgroups in the blkio and memory controller hierarchies, but those are what we need to determine the resource usage. The best way to get this data for all services is to simply add the memory and blkio controllers to the aforementioned DefaultControllers= setting in system.conf.
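For instance, the relevant line in /etc/systemd/system.conf might then look roughly like this (a sketch only; adjust the controller list to your needs):

[Manager]
DefaultControllers=cpu memory blkio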

Managing Memory

To enforce limits on memory systemd provides the MemoryLimit= and MemorySoftLimit= settings for services, which limit the total memory used by all of a service's processes. These settings take memory sizes in bytes and understand the usual K, M, G, T suffixes for Kilobyte, Megabyte, Gigabyte, Terabyte (to the base of 1024):

.include /usr/lib/systemd/system/httpd.service

[Service]
MemoryLimit=1G

(Analogous to CPUShares= above, setting this option will cause the service to get its own cgroup in the memory cgroup hierarchy.)

Managing Block IO

To control block IO, multiple settings are available. First of all, BlockIOWeight= may be used to assign an IO weight to a specific service. In behaviour the weight concept is not unlike the shares concept of CPU resource control (see above); however, the default weight is 1000, and the valid range is from 10 to 1000:

.include /usr/lib/systemd/system/httpd.service

[Service]
BlockIOWeight=500

Optionally, per-device weights can be specified:

.include /usr/lib/systemd/system/httpd.service

[Service]
BlockIOWeight=/dev/disk/by-id/ata-SAMSUNG_MMCRE28G8MXP-0VBL1_DC06K01009SE009B5252 750

Instead of specifying an actual device node you may also specify any path in the file system:

.include /usr/lib/systemd/system/httpd.service

[Service]
BlockIOWeight=/home/lennart 750

If the specified path does not refer to a device node, systemd will determine the block device /home/lennart is on, and assign the bandwidth weight to it.

You can even add per-device and normal lines at the same time, which will set the per-device weight for the device, and the other value as the default for everything else.
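For example, an administrator service file combining both might look like this (a sketch only, with arbitrary values):

.include /usr/lib/systemd/system/httpd.service

[Service]
BlockIOWeight=500
BlockIOWeight=/home/lennart 750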

Alternatively, one may set explicit bandwidth limits with the BlockIOReadBandwidth= and BlockIOWriteBandwidth= settings. These settings take a pair of device node and bandwidth rate (in bytes per second), or of a file path and bandwidth rate:

.include /usr/lib/systemd/system/httpd.service

[Service]
BlockIOReadBandwidth=/var/log 5M

This sets the maximum read bandwidth on the block device backing /var/log to 5 MB/s.

(Analogous to CPUShares= and MemoryLimit=, using any of these three settings will result in the service getting its own cgroup in the blkio hierarchy.)

Managing Other Resource Parameters

The options described above cover only a small subset of the available controls the various Linux control group controllers expose. We picked these and added high-level options for them since we assumed that these are the most relevant for most folks, and that they really needed a nice interface that can handle units properly and resolve block device names.

In many cases the options explained above might not be sufficient for your use case, but a low-level kernel cgroup setting might help. It is easy to make use of these options from systemd unit files, even when they are not covered by a high-level setting. For example, sometimes it might be useful to set the swappiness of a service. The kernel makes this controllable via the memory.swappiness cgroup attribute, but systemd does not expose it as a high-level option. Here's how you can use it nonetheless, via the low-level ControlGroupAttribute= setting:

.include /usr/lib/systemd/system/httpd.service

[Service]
ControlGroupAttribute=memory.swappiness 70

(Analogous to the other cases, this too causes the service to be added to the memory hierarchy.)

Later on we might add more high-level controls for the various cgroup attributes. In fact, please ping us if you frequently use one and believe it deserves more focus. We'll consider adding a high-level option for it then. (Even better: send us a patch!)

Disclaimer: note that making use of the various resource controllers does have a runtime impact on the system. Enforcing resource limits comes at a price: if you use them, certain operations do get slower. Especially the memory controller has (or used to have?) a reputation for coming at a noticeable performance cost.

For more details on all of this, please have a look at the documentation of the mentioned unit settings, and of the cpu, memory and blkio controllers.

And that's it for now. Of course, this blog story only focused on the per-service resource settings. On top of this, you can also set the more traditional, well-known per-process resource settings, which will then be inherited by the various subprocesses, but always only be enforced per process. More specifically these are IOSchedulingClass=, IOSchedulingPriority=, CPUSchedulingPolicy=, CPUSchedulingPriority=, CPUAffinity=, LimitCPU= and related settings. These do not make use of cgroup controllers and have a much lower performance cost. We might cover those in a later article in more detail.
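To give a rough idea in the meantime, here's an illustrative sketch (values picked arbitrarily) of how a few of these per-process settings look in a unit file:

.include /usr/lib/systemd/system/httpd.service

[Service]
CPUSchedulingPolicy=batch
CPUAffinity=0 1
IOSchedulingClass=best-effort
IOSchedulingPriority=5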


systemd for Administrators, Part XVII

It's that time again, here's now the seventeenth installment of my ongoing series on systemd for Administrators:

Using the Journal

A while back I already posted a blog story introducing some functionality of the journal, and how it is exposed in systemctl. In this episode I want to explain a few more uses of the journal, and how you can make it work for you.

If you are wondering what the journal is, here's an explanation in a few words to get you up to speed: the journal is a component of systemd, that captures Syslog messages, Kernel log messages, initial RAM disk and early boot messages as well as messages written to STDOUT/STDERR of all services, indexes them and makes this available to the user. It can be used in parallel, or in place of a traditional syslog daemon, such as rsyslog or syslog-ng. For more information, see the initial announcement.

The journal has been part of Fedora since F17. With Fedora 18 it now has grown into a reliable, powerful tool to handle your logs. Note however, that on F17 and F18 the journal is configured by default to store logs only in a small ring-buffer in /run/log/journal, i.e. not persistent. This of course limits its usefulness quite drastically but is sufficient to show a bit of recent log history in systemctl status. For Fedora 19, we plan to change this, and enable persistent logging by default. Then, journal files will be stored in /var/log/journal and can grow much larger, thus making the journal a lot more useful.

Enabling Persistency

In the meantime, on F17 or F18, you can enable journald's persistent storage manually:

# mkdir -p /var/log/journal

After that, it's a good idea to reboot, to get some useful structured data into your journal to play with. Oh, and since you have the journal now, you don't need syslog anymore (unless having /var/log/messages as a text file is a necessity for you), so you can choose to deinstall rsyslog:

# yum remove rsyslog

Basics

Now we are ready to go. The following text shows a lot of features of systemd 195 as it will be included in Fedora 18[1], so if your F17 can't do the tricks you see, please wait for F18. First, let's start with some basics. To access the logs of the journal use the journalctl(1) tool. To have a first look at the logs, just type in:

# journalctl

If you run this as root you will see all logs generated on the system, from system components as well as from logged-in users. The output you get looks like a pixel-perfect copy of the traditional /var/log/messages format, but actually has a couple of improvements over it:

  • Lines of error priority (and higher) will be highlighted in red.
  • Lines of notice/warning priority will be highlighted in bold.
  • The timestamps are converted into your local time-zone.
  • The output is auto-paged with your pager of choice (defaults to less).
  • This will show all available data, including rotated logs.
  • Between the output of each boot we'll add a line clarifying that a new boot begins now.

Note that in this blog story I will not actually show you any of the output this generates, I cut that out for brevity -- and to give you a reason to try it out yourself with a current image for F18's development version with systemd 195. But I do hope you get the idea anyway.

Access Control

Browsing logs this way is already pretty nice. But requiring to be root sucks of course; even administrators tend to do most of their work as unprivileged users these days. By default, Journal users can only watch their own logs, unless they are root or in the adm group. To make watching system logs more fun, let's add ourselves to adm:

# usermod -a -G adm lennart

After logging out and back in as lennart I now have access to the full journal of the system and of all users:

$ journalctl

Live View

If invoked without parameters journalctl will show me the current log database. Sometimes one needs to watch logs as they grow, where one previously used tail -f /var/log/messages:

$ journalctl -f

Yes, this does exactly what you expect it to do: it will show you the last ten log lines and then wait for changes and show them as they take place.

Basic Filtering

When invoking journalctl without parameters you'll see the whole set of logs, beginning with the oldest message stored. That, of course, can be a lot of data. Much more useful is just viewing the logs of the current boot:

$ journalctl -b

This will show you only the logs of the current boot, with all the aforementioned gimmicks. But sometimes even this is way too much data to process. So what about just listing all the real issues to care about: all messages of priority level ERROR and worse, from the current boot:

$ journalctl -b -p err

If you reboot only seldom, the -b makes little sense; filtering based on time is much more useful:

$ journalctl --since=yesterday

And there you go, all log messages from the day before at 00:00 in the morning until right now. Awesome! Of course, we can combine this with -p err or a similar match. But humm, we are looking for something that happened on the 15th of October, or was it the 16th?

$ journalctl --since=2012-10-15 --until="2012-10-16 23:59:59"

Yupp, there we go, we found what we were looking for. But humm, I noticed that some CGI script in Apache was acting up earlier today, let's see what Apache logged at that time:

$ journalctl -u httpd --since=00:00 --until=9:30

Oh, yeah, there we found it. But hey, wasn't there an issue with that disk /dev/sdc? Let's figure out what was going on there:

$ journalctl /dev/sdc

OMG, a disk error![2] Hmm, let's quickly replace the disk before we lose data. Done! Next! -- Hmm, didn't I see that the vpnc binary made a booboo? Let's check for that:

$ journalctl /usr/sbin/vpnc

Hmm, I don't get this, this seems to be some weird interaction with dhclient, let's see both outputs, interleaved:

$ journalctl /usr/sbin/vpnc /usr/sbin/dhclient

That did it! Found it!

Advanced Filtering

Whew! That was awesome already, but let's turn this up a notch. Internally systemd stores each log entry with a set of implicit meta data. This meta data looks a lot like an environment block, but is actually a bit more powerful: fields can take large, binary values (though this is the exception, and usually they just contain UTF-8), and fields can have multiple values assigned (an exception too, usually they only have one value). This implicit meta data is collected for each and every log message, without user intervention. The data will be there, waiting to be used by you. Let's see how this looks:

$ journalctl -o verbose -n
[...]
Tue, 2012-10-23 23:51:38 CEST [s=ac9e9c423355411d87bf0ba1a9b424e8;i=4301;b=5335e9cf5d954633bb99aefc0ec38c25;m=882ee28d2;t=4ccc0f98326e6;x=f21e8b1b0994d7ee]
        PRIORITY=6
        SYSLOG_FACILITY=3
        _MACHINE_ID=a91663387a90b89f185d4e860000001a
        _HOSTNAME=epsilon
        _TRANSPORT=syslog
        SYSLOG_IDENTIFIER=avahi-daemon
        _COMM=avahi-daemon
        _EXE=/usr/sbin/avahi-daemon
        _SYSTEMD_CGROUP=/system/avahi-daemon.service
        _SYSTEMD_UNIT=avahi-daemon.service
        _SELINUX_CONTEXT=system_u:system_r:avahi_t:s0
        _UID=70
        _GID=70
        _CMDLINE=avahi-daemon: registering [epsilon.local]
        MESSAGE=Joining mDNS multicast group on interface wlan0.IPv4 with address 172.31.0.53.
        _BOOT_ID=5335e9cf5d954633bb99aefc0ec38c25
        _PID=27937
        SYSLOG_PID=27937
        _SOURCE_REALTIME_TIMESTAMP=1351029098747042

(I cut out a lot of noise here, I don't want to make this story overly long. -n without parameter shows you the last 10 log entries, but I cut out all but the last.)

With the -o verbose switch we enabled verbose output. Instead of showing a pixel-perfect copy of classic /var/log/messages that only includes a minimal subset of what is available, we now see all the gory details the journal has about each entry. And it's highly interesting: there is user credential information, SELinux bits, machine information and more. For a full list of common, well-known fields, see the man page.

Now, as it turns out the journal database is indexed by all of these fields, out-of-the-box! Let's try this out:

$ journalctl _UID=70

And there you go, this will show all log messages logged from Linux user ID 70. As it turns out one can easily combine these matches:

$ journalctl _UID=70 _UID=71

Specifying two matches for the same field will result in a logical OR combination of the matches. All entries matching either will be shown, i.e. all messages from either UID 70 or 71.

$ journalctl _HOSTNAME=epsilon _COMM=avahi-daemon

You guessed it: if you specify two matches for different field names, they will be combined with a logical AND. All entries matching both will be shown now, meaning all messages from processes named avahi-daemon on the host epsilon.

But of course, that's not fancy enough for us. We are computer nerds after all, we live off logical expressions. We must go deeper!

$ journalctl _HOSTNAME=theta _UID=70 + _HOSTNAME=epsilon _COMM=avahi-daemon

The + is an explicit OR you can use in addition to the implied OR when you match the same field twice. The line above hence means: show me everything from host theta with UID 70, or from host epsilon with a process name of avahi-daemon.

And now, it becomes magic!

That was already pretty cool, right? Right! But heck, who can remember all those values a field can take in the journal, I mean, seriously, who has thaaaat kind of photographic memory? Well, the journal has:

$ journalctl -F _SYSTEMD_UNIT

This will show us all values the field _SYSTEMD_UNIT takes in the database, or in other words: the names of all systemd services which ever logged into the journal. This makes it super-easy to build nice matches. But wait, turns out this all is actually hooked up with shell completion on bash! This gets even more awesome: as you type your match expression you will get a list of well-known field names, and of the values they can take! Let's figure out how to filter for SELinux labels again. We remember the field name was something with SELINUX in it, let's try that:

$ journalctl _SE<TAB>

And yupp, it's immediately completed:

$ journalctl _SELINUX_CONTEXT=

Cool, but what's the label again we wanted to match for?

$ journalctl _SELINUX_CONTEXT=<TAB><TAB>
kernel                                                       system_u:system_r:local_login_t:s0-s0:c0.c1023               system_u:system_r:udev_t:s0-s0:c0.c1023
system_u:system_r:accountsd_t:s0                             system_u:system_r:lvm_t:s0                                   system_u:system_r:virtd_t:s0-s0:c0.c1023
system_u:system_r:avahi_t:s0                                 system_u:system_r:modemmanager_t:s0-s0:c0.c1023              system_u:system_r:vpnc_t:s0
system_u:system_r:bluetooth_t:s0                             system_u:system_r:NetworkManager_t:s0                        system_u:system_r:xdm_t:s0-s0:c0.c1023
system_u:system_r:chkpwd_t:s0-s0:c0.c1023                    system_u:system_r:policykit_t:s0                             unconfined_u:system_r:rpm_t:s0-s0:c0.c1023
system_u:system_r:chronyd_t:s0                               system_u:system_r:rtkit_daemon_t:s0                          unconfined_u:system_r:unconfined_t:s0-s0:c0.c1023
system_u:system_r:crond_t:s0-s0:c0.c1023                     system_u:system_r:syslogd_t:s0                               unconfined_u:system_r:useradd_t:s0-s0:c0.c1023
system_u:system_r:devicekit_disk_t:s0                        system_u:system_r:system_cronjob_t:s0-s0:c0.c1023            unconfined_u:unconfined_r:unconfined_dbusd_t:s0-s0:c0.c1023
system_u:system_r:dhcpc_t:s0                                 system_u:system_r:system_dbusd_t:s0-s0:c0.c1023              unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
system_u:system_r:dnsmasq_t:s0-s0:c0.c1023                   system_u:system_r:systemd_logind_t:s0
system_u:system_r:init_t:s0                                  system_u:system_r:systemd_tmpfiles_t:s0

Ah! Right! We wanted to see everything logged under PolicyKit's security label:

$ journalctl _SELINUX_CONTEXT=system_u:system_r:policykit_t:s0

Wow! That was easy! I didn't know anything related to SELinux could be thaaat easy! ;-) Of course this kind of completion works with any field, not just SELinux labels.

So much for now. There's a lot more cool stuff in journalctl(1) than this. For example, it generates JSON output for you! You can match against kernel fields! You can get simple /var/log/messages-like output but with relative timestamps! And so much more!
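For instance (a few illustrative invocations; see journalctl(1) on your version for the full story):

$ journalctl -o json
$ journalctl _TRANSPORT=kernel
$ journalctl -o short-monotonic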

Anyway, in the next weeks I hope to post more stories about all the cool things the journal can do for you. This is just the beginning, stay tuned.

Footnotes

[1] systemd 195 is currently still in Bodhi but hopefully will get into F18 proper soon, and definitely before the release of Fedora 18.

[2] OK, I cheated here, indexing by block device is not in the kernel yet, but it is on its way due to Hannes' fantastic work, and I hope it will make an appearance in F18.
