
Third Berlin Open Source Meetup

The Third Berlin Open Source Meetup is going to take place on Sunday, January 20th. You are invited!

It's a public event, so everybody is welcome, and please feel free to invite others!


foss.in Needs Your Funding!

One of the most exciting conferences in the Free Software world, foss.in in Bangalore, India, is having trouble finding enough sponsorship for this year's edition. Many speakers from all around the Free Software world (including yours truly) have signed up to present at the event, and the conference would appreciate any corporate funding it can get!

Please check if your company can help and contact the organizers for details!

See you in Bangalore!

FOSS.IN


systemd for Developers III

Here's the third episode of my systemd for Developers series.

Logging to the Journal

In a recent blog story intended for administrators I shed some light on how to use the journalctl(1) tool to browse and search the systemd journal. In this blog story for developers I want to explain a little how to get log data into the systemd Journal in the first place.

The good thing is that getting log data into the Journal is not particularly hard, since there's a good chance the Journal already collects it anyway and writes it to disk. The journal collects:

  1. All data logged via libc syslog()
  2. The data from the kernel logged with printk()
  3. Everything written to STDOUT/STDERR of any system service

This covers pretty much all of the traditional log output of a Linux system, including messages from the kernel initialization phase, the initial RAM disk, the early boot logic, and the main system runtime.

syslog()

Let's have a quick look at how syslog() is used, as a reminder. Let's write a journal message using this call:

#include <syslog.h>

int main(int argc, char *argv[]) {
        syslog(LOG_NOTICE, "Hello World");
        return 0;
}

This is C code, of course. Many higher-level languages provide APIs that allow writing local syslog messages. Regardless of which language you choose, all data written like this ends up in the Journal.
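For example, in Python the standard library's syslog module produces the same kind of message:

import syslog

syslog.syslog(syslog.LOG_NOTICE, "Hello World")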

Let's have a look at how this appears after it has been written into the journal (this is the JSON output journalctl -o json-pretty generates):

{
        "_BOOT_ID" : "5335e9cf5d954633bb99aefc0ec38c25",
        "_TRANSPORT" : "syslog",
        "PRIORITY" : "5",
        "_UID" : "500",
        "_GID" : "500",
        "_AUDIT_SESSION" : "2",
        "_AUDIT_LOGINUID" : "500",
        "_SYSTEMD_CGROUP" : "/user/lennart/2",
        "_SYSTEMD_SESSION" : "2",
        "_SELINUX_CONTEXT" : "unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023",
        "_MACHINE_ID" : "a91663387a90b89f185d4e860000001a",
        "_HOSTNAME" : "epsilon",
        "_COMM" : "test-journal-su",
        "_CMDLINE" : "./test-journal-submit",
        "SYSLOG_FACILITY" : "1",
        "_EXE" : "/home/lennart/projects/systemd/test-journal-submit",
        "_PID" : "3068",
        "SYSLOG_IDENTIFIER" : "test-journal-submit",
        "MESSAGE" : "Hello World!",
        "_SOURCE_REALTIME_TIMESTAMP" : "1351126905014938"
}

This nicely shows how the Journal implicitly augmented our little log message with various meta data fields which describe in more detail the context our message was generated from. For an explanation of the various fields, please refer to systemd.journal-fields(7).

printf()

If you are writing code that is run as a systemd service, generating journal messages is even easier:

#include <stdio.h>

int main(int argc, char *argv[]) {
        printf("Hello World\n");
        return 0;
}

Yupp, that's easy, indeed.

The printed string in this example is logged at a default log priority of LOG_INFO[1]. Sometimes it is useful to change the log priority for such a printed string. When systemd parses STDOUT/STDERR of a service it will look for priority values enclosed in < > at the beginning of each line[2], following the scheme used by the kernel's printk() which in turn took inspiration from the BSD syslog network serialization of messages. We can make use of this systemd feature like this:

#include <stdio.h>

#define PREFIX_NOTICE "<5>"

int main(int argc, char *argv[]) {
        printf(PREFIX_NOTICE "Hello World\n");
        return 0;
}

Nice! Logging with nothing but printf() but we still get log priorities!

This scheme works with any programming language, including, of course, shell:

#!/bin/bash

echo "<5>Hellow world"

Native Messages

Now, what I explained above is not particularly exciting: the take-away is pretty much only that things end up in the journal if they are output using the traditional message printing APIs. Yaaawn!

Let's make this more interesting, let's look at what the Journal provides as native APIs for logging, and let's see what its benefits are. Let's translate our little example into the 1:1 counterpart using the Journal's logging API sd_journal_print(3):

#include <systemd/sd-journal.h>

int main(int argc, char *argv[]) {
        sd_journal_print(LOG_NOTICE, "Hello World");
        return 0;
}

This doesn't look much more interesting than the two examples above, right? After compiling this with `pkg-config --cflags --libs libsystemd-journal` appended to the compiler parameters, let's have a closer look at the JSON representation of the journal entry this generates:

{
        "_BOOT_ID" : "5335e9cf5d954633bb99aefc0ec38c25",
        "PRIORITY" : "5",
        "_UID" : "500",
        "_GID" : "500",
        "_AUDIT_SESSION" : "2",
        "_AUDIT_LOGINUID" : "500",
        "_SYSTEMD_CGROUP" : "/user/lennart/2",
        "_SYSTEMD_SESSION" : "2",
        "_SELINUX_CONTEXT" : "unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023",
        "_MACHINE_ID" : "a91663387a90b89f185d4e860000001a",
        "_HOSTNAME" : "epsilon",
        "CODE_FUNC" : "main",
        "_TRANSPORT" : "journal",
        "_COMM" : "test-journal-su",
        "_CMDLINE" : "./test-journal-submit",
        "CODE_FILE" : "src/journal/test-journal-submit.c",
        "_EXE" : "/home/lennart/projects/systemd/test-journal-submit",
        "MESSAGE" : "Hello World",
        "CODE_LINE" : "4",
        "_PID" : "3516",
        "_SOURCE_REALTIME_TIMESTAMP" : "1351128226954170"
}

This looks pretty much the same, right? Almost! There are three new fields compared to the earlier output: CODE_FILE=, CODE_LINE= and CODE_FUNC=. Yes, you guessed it, by using sd_journal_print() meta information about the generating source code location is implicitly appended to each message[3], which helps a developer identify the source of a problem.

The primary reason for using the Journal's native logging APIs is not just the source code location, however: it is to allow passing additional structured data from the program into the journal. This additional log data may be used to search the journal, is available for consumption by other programs, and might help the administrator track down issues beyond what is expressed in the human-readable message text. Here's an example of how to do that with sd_journal_send():

#include <systemd/sd-journal.h>
#include <unistd.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
        sd_journal_send("MESSAGE=Hello World!",
                        "MESSAGE_ID=52fb62f99e2c49d89cfbf9d6de5e3555",
                        "PRIORITY=5",
                        "HOME=%s", getenv("HOME"),
                        "TERM=%s", getenv("TERM"),
                        "PAGE_SIZE=%li", sysconf(_SC_PAGESIZE),
                        "N_CPUS=%li", sysconf(_SC_NPROCESSORS_ONLN),
                        NULL);
        return 0;
}

This will write a log message to the journal much like the earlier examples. However, this time a few additional, structured fields are attached:

{
        "__CURSOR" : "s=ac9e9c423355411d87bf0ba1a9b424e8;i=5930;b=5335e9cf5d954633bb99aefc0ec38c25;m=16544f875b;t=4ccd863cdc4f0;x=896defe53cc1a96a",
        "__REALTIME_TIMESTAMP" : "1351129666274544",
        "__MONOTONIC_TIMESTAMP" : "95903778651",
        "_BOOT_ID" : "5335e9cf5d954633bb99aefc0ec38c25",
        "PRIORITY" : "5",
        "_UID" : "500",
        "_GID" : "500",
        "_AUDIT_SESSION" : "2",
        "_AUDIT_LOGINUID" : "500",
        "_SYSTEMD_CGROUP" : "/user/lennart/2",
        "_SYSTEMD_SESSION" : "2",
        "_SELINUX_CONTEXT" : "unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023",
        "_MACHINE_ID" : "a91663387a90b89f185d4e860000001a",
        "_HOSTNAME" : "epsilon",
        "CODE_FUNC" : "main",
        "_TRANSPORT" : "journal",
        "_COMM" : "test-journal-su",
        "_CMDLINE" : "./test-journal-submit",
        "CODE_FILE" : "src/journal/test-journal-submit.c",
        "_EXE" : "/home/lennart/projects/systemd/test-journal-submit",
        "MESSAGE" : "Hello World!",
        "_PID" : "4049",
        "CODE_LINE" : "6",
        "MESSAGE_ID" : "52fb62f99e2c49d89cfbf9d6de5e3555",
        "HOME" : "/home/lennart",
        "TERM" : "xterm-256color",
        "PAGE_SIZE" : "4096",
        "N_CPUS" : "4",
        "_SOURCE_REALTIME_TIMESTAMP" : "1351129666241467"
}

Awesome! Our simple example worked! The five meta data fields we attached to our message beyond MESSAGE= and PRIORITY= all appeared in the journal. We used sd_journal_send() for this, which works much like sd_journal_print() but takes a NULL-terminated list of format strings, each followed by its arguments. The format strings must include the field name and a '=' before the values.

Our little structured message included seven fields. The first three we passed are well-known fields:

  1. MESSAGE= is the actual human readable message part of the structured message.
  2. PRIORITY= is the numeric message priority value as known from BSD syslog formatted as an integer string.
  3. MESSAGE_ID= is a 128-bit ID that identifies our specific message call, formatted as a hexadecimal string. We randomly generated this string with journalctl --new-id128. This can be used by applications to track down all occurrences of this specific message. The 128-bit ID can be a UUID, but this is not a requirement and is not enforced.

Applications may relatively freely define additional fields as they see fit (we defined four pretty arbitrary ones in our example). A complete list of the currently well-known fields is available in systemd.journal-fields(7).

Let's see how the message ID helps us find this message and all its occurrences in the journal:

$ journalctl MESSAGE_ID=52fb62f99e2c49d89cfbf9d6de5e3555
-- Logs begin at Thu, 2012-10-18 04:07:03 CEST, end at Thu, 2012-10-25 04:48:21 CEST. --
Oct 25 03:47:46 epsilon test-journal-se[4049]: Hello World!
Oct 25 04:40:36 epsilon test-journal-se[4480]: Hello World!

Seems I already invoked this example tool twice!

Many messages systemd itself generates have message IDs. This is useful, for example, to find all occasions on which a program dumped core (journalctl MESSAGE_ID=fc2e22bc6ee647b6b90729ab34a250b1), or when a user logged in (journalctl MESSAGE_ID=8d45620c1a4348dbb17410da57c60c66). If your application generates a message that might be interesting to recognize in the journal stream later on, we recommend attaching such a message ID to it. You can easily allocate a new one for your message with journalctl --new-id128.

This example shows how we can use the Journal's native APIs to generate structured, recognizable messages. You can do much more than this with the C API. For example, you may store binary data in journal fields as well, which is useful to attach coredumps or hard disk SMART states to events where this applies. In order to keep this blog story from getting any longer than it already is, we'll not go into detail about how to do this, and I ask you to check out sd_journal_send(3) for further information.
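To give you a rough idea (this is just a sketch of mine, not from the man page; the BLOB= field name and the payload are purely illustrative), binary fields can be submitted with sd_journal_sendv(3), where each field is passed as a struct iovec consisting of the field name, a '=', and the raw bytes:

#include <systemd/sd-journal.h>
#include <string.h>
#include <sys/uio.h>

int main(int argc, char *argv[]) {
        /* Illustrative binary payload, including an embedded NUL byte */
        static const unsigned char payload[] = { 0xde, 0xad, 0x00, 0xbe, 0xef };
        unsigned char field[5 + sizeof(payload)];
        struct iovec iov[2];

        /* A binary field is just "NAME=" followed by the raw bytes */
        memcpy(field, "BLOB=", 5);
        memcpy(field + 5, payload, sizeof(payload));

        iov[0].iov_base = (char*) "MESSAGE=Hello World with a binary field";
        iov[0].iov_len = strlen("MESSAGE=Hello World with a binary field");
        iov[1].iov_base = field;
        iov[1].iov_len = sizeof(field);

        return sd_journal_sendv(iov, 2) < 0 ? 1 : 0;
}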

Python

The examples above focus on C. Structured logging to the Journal is also available from other languages. Along with systemd itself we ship bindings for Python. Here's an example of how to use them:

from systemd import journal
journal.send('Hello world')
journal.send('Hello, again, world', FIELD2='Greetings!', FIELD3='Guten tag')
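The well-known fields from the C examples map to keyword arguments here as well. For instance (a sketch of mine, reusing the message ID we generated earlier):

journal.send('Hello world', PRIORITY='5', MESSAGE_ID='52fb62f99e2c49d89cfbf9d6de5e3555')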

Other bindings exist for Node.js, PHP and Lua.

Portability

Generating structured data is a very useful feature for services to make their logs more accessible both for administrators and other programs. In addition to the implicit structure the Journal adds to all logged messages it is highly beneficial if the various components of our stack also provide explicit structure in their messages, coming from within the processes themselves.

Porting an existing program to the Journal's logging APIs comes with one pitfall though: the Journal is Linux-only. If non-Linux portability matters for your project it's a good idea to provide an alternative log output, and make it selectable at compile-time.

Regardless of which way to log you choose, in all cases we'll forward the message to a classic syslog daemon running side-by-side with the Journal, if there is one. However, much of the structured meta data of the message is not forwarded, since the classic syslog protocol simply has no generally accepted way to encode this, and we shouldn't attempt to serialize meta data into classic syslog messages which might turn /var/log/messages into an unreadable dump of machine data. Anyway, to summarize: regardless of whether you log with syslog(), printf(), sd_journal_print() or sd_journal_send(), the message will be stored and indexed by the journal and it will also be forwarded to classic syslog.

And that's it for today. In a follow-up episode we'll focus on retrieving messages from the Journal using the C API, possibly filtering for a specific subset of messages. Later on, I hope to give a real-life example how to port an existing service to the Journal's logging APIs. Stay tuned!

Footnotes

[1] This can be changed with the SyslogLevel= service setting. See systemd.exec(5) for details.

[2] Interpretation of the < > prefixes of logged lines may be disabled with the SyslogLevelPrefix= service setting. See systemd.exec(5) for details.

[3] Appending the code location to the log messages can be turned off at compile time by defining -DSD_JOURNAL_SUPPRESS_LOCATION.


systemd for Administrators, Part XVIII

Hot on the heels of the previous story, here's now the eighteenth installment of my ongoing series on systemd for Administrators:

Managing Resources

An important facet of modern computing is resource management: if you run more than one program on a single machine you want to assign the available resources to them enforcing particular policies. This is particularly crucial on smaller, embedded or mobile systems where the scarce resources are the main constraint, but equally for large installations such as cloud setups, where resources are plenty, but the number of programs/services/containers on a single node is drastically higher.

Traditionally, on Linux only one policy was really available: all processes got about the same CPU time, or IO bandwidth, modulated a bit via the process nice value. This approach is very simple and covered the various uses for Linux quite well for a long time. However, it has drawbacks: not all processes deserve to be treated equally, and services involving lots of processes (think: Apache with a lot of CGI workers) would this way get more resources than services with very few (think: syslog).

When thinking about service management for systemd, we quickly realized that resource management must be core functionality of it. In a modern world -- regardless if server or embedded -- controlling CPU, Memory, and IO resources of the various services cannot be an afterthought, but must be built-in as first-class service settings. And it must be per-service and not per-process as the traditional nice values or POSIX Resource Limits were.

In this story I want to shed some light on what you can do to enforce resource policies on systemd services. Resource Management in one way or another has been available in systemd for a while already, so it's really time we introduce this to the broader audience.

In an earlier blog post I highlighted the difference between Linux Control Groups (cgroups) as a labelled, hierarchical grouping mechanism, and Linux cgroups as a resource controlling subsystem. While systemd requires the former, the latter is optional. And this optional latter part is now what we can make use of to manage per-service resources. (At this point, it's probably a good idea to read up on cgroups before reading on, to get at least a basic idea what they are and what they accomplish. Even though the explanations below will be pretty high-level, it all makes a lot more sense if you grok the background a bit.)

The main Linux cgroup controllers for resource management are cpu, memory and blkio. To make use of these, they need to be enabled in the kernel, which many distributions (including Fedora) do. systemd exposes a couple of high-level service settings to make use of these controllers without requiring too much knowledge of the gory kernel details.

Managing CPU

As a nice default, if the cpu controller is enabled in the kernel, systemd will create a cgroup for each service when starting it. Without any further configuration this already has one nice effect: on a systemd system every system service will get an even amount of CPU, regardless of how many processes it consists of. Or in other words: on your web server MySQL will get roughly the same amount of CPU as Apache, even if the latter consists of 1000 CGI script processes, but the former only of a few worker tasks. (This behavior can be turned off, see DefaultControllers= in /etc/systemd/system.conf.)

On top of this default, it is possible to explicitly configure the CPU shares a service gets with the CPUShares= setting. The default value is 1024; if you increase this number you'll assign more CPU to a service than to an unaltered one at 1024, and if you decrease it, less.

Let's see in more detail how we can make use of this. Let's say we want to assign Apache 1500 CPU shares instead of the default of 1024. For that, let's create a new administrator service file for Apache in /etc/systemd/system/httpd.service, overriding the vendor-supplied one in /usr/lib/systemd/system/httpd.service, and change the CPUShares= parameter:

.include /usr/lib/systemd/system/httpd.service

[Service]
CPUShares=1500

The first line will pull in the vendor service file. Now, let's reload systemd's configuration and restart Apache so that the new service file is taken into account:

systemctl daemon-reload
systemctl restart httpd.service

And yeah, that's already it, you are done!

(Note that setting CPUShares= in a unit file will cause the specific service to get its own cgroup in the cpu hierarchy, even if cpu is not included in DefaultControllers=.)

Analyzing Resource usage

Of course, changing resource assignments without actually understanding the resource usage of the services in question is like flying blind. To help you understand the resource usage of all services, we created the tool systemd-cgtop, which enumerates all cgroups of the system, determines their resource usage (CPU, memory and IO) and presents them in a top-like fashion. Building on the fact that systemd services are managed in cgroups, this tool can hence present to you for services what top shows you for processes.

Unfortunately, by default systemd-cgtop will only be able to chart CPU usage per-service for you; IO and memory are only tracked as totals for the entire machine. The reason for this is simply that by default there are no per-service cgroups in the blkio and memory controller hierarchies, but that's what we need to determine the resource usage. The best way to get this data for all services is to simply add the memory and blkio controllers to the aforementioned DefaultControllers= setting in system.conf.
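For example, the relevant part of /etc/systemd/system.conf might then look like this (a sketch; a reboot afterwards is the simplest way to make sure all services end up in the additional hierarchies):

[Manager]
DefaultControllers=cpu memory blkio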

Managing Memory

To enforce limits on memory systemd provides the MemoryLimit= and MemorySoftLimit= settings for services, summing up the memory of all their processes. These settings take memory sizes in bytes that are the total memory limit for the service. They understand the usual K, M, G, T suffixes for Kilobyte, Megabyte, Gigabyte, Terabyte (to the base of 1024).

.include /usr/lib/systemd/system/httpd.service

[Service]
MemoryLimit=1G

(Analogous to CPUShares= above, setting this option will cause the service to get its own cgroup in the memory cgroup hierarchy.)

Managing Block IO

To control block IO multiple settings are available. First of all BlockIOWeight= may be used which assigns an IO weight to a specific service. In behaviour the weight concept is not unlike the shares concept of CPU resource control (see above). However, the default weight is 1000, and the valid range is from 10 to 1000:

.include /usr/lib/systemd/system/httpd.service

[Service]
BlockIOWeight=500

Optionally, per-device weights can be specified:

.include /usr/lib/systemd/system/httpd.service

[Service]
BlockIOWeight=/dev/disk/by-id/ata-SAMSUNG_MMCRE28G8MXP-0VBL1_DC06K01009SE009B5252 750

Instead of specifying an actual device node you may also specify any path in the file system:

.include /usr/lib/systemd/system/httpd.service

[Service]
BlockIOWeight=/home/lennart 750

If the specified path does not refer to a device node systemd will determine the block device /home/lennart is on, and assign the bandwidth weight to it.

You can even add per-device and normal lines at the same time, which will set the per-device weight for the device, and the other value as default for everything else.
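Such a combined configuration might look like this (a sketch following the same pattern as above; the path is just illustrative):

.include /usr/lib/systemd/system/httpd.service

[Service]
BlockIOWeight=500
BlockIOWeight=/home/lennart 750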

Alternatively one may control explicit bandwidth limits with the BlockIOReadBandwidth= and BlockIOWriteBandwidth= settings. These settings take a pair of device node and bandwidth rate (in bytes per second) or of a file path and bandwidth rate:

.include /usr/lib/systemd/system/httpd.service

[Service]
BlockIOReadBandwidth=/var/log 5M

This sets the maximum read bandwidth on the block device backing /var/log to 5 MB/s.

(Analogous to CPUShares= and MemoryLimit=, using any of these three settings will result in the service getting its own cgroup in the blkio hierarchy.)

Managing Other Resource Parameters

The options described above cover only a small subset of the available controls the various Linux control group controllers expose. We picked these and added high-level options for them since we assumed that these are the most relevant for most folks, and that they really needed a nice interface that can handle units properly and resolve block device names.

In many cases the options explained above might not be sufficient for your use case, but a low-level kernel cgroup setting might help. It is easy to make use of such settings from systemd unit files, without them being covered by a high-level option. For example, sometimes it might be useful to set the swappiness of a service. The kernel makes this controllable via the memory.swappiness cgroup attribute, but systemd does not expose it as a high-level option. Here's how you use it nonetheless, via the low-level ControlGroupAttribute= setting:

.include /usr/lib/systemd/system/httpd.service

[Service]
ControlGroupAttribute=memory.swappiness 70

(Analogous to the other cases, this too causes the service to be added to the memory hierarchy.)

Later on we might add more high-level controls for the various cgroup attributes. In fact, please ping us if you frequently use one and believe it deserves more focus. We'll consider adding a high-level option for it then. (Even better: send us a patch!)

Disclaimer: note that making use of the various resource controllers does have a runtime impact on the system. Enforcing resource limits comes at a price. If you do use them, certain operations do get slower. Especially the memory controller has (or used to have?) a bad reputation for coming at a performance cost.

For more details on all of this, please have a look at the documentation of the mentioned unit settings, and of the cpu, memory and blkio controllers.

And that's it for now. Of course, this blog story only focused on the per-service resource settings. On top of this, you can also set the more traditional, well-known per-process resource settings, which will then be inherited by the various subprocesses, but always only be enforced per-process. More specifically that's IOSchedulingClass=, IOSchedulingPriority=, CPUSchedulingPolicy=, CPUSchedulingPriority=, CPUAffinity=, LimitCPU= and related. These do not make use of cgroup controllers and have a much lower performance cost. We might cover those in a later article in more detail.
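For completeness, here's a sketch of how a few of these per-process settings might be combined in an administrator service file, following the same .include pattern as above (the values are purely illustrative):

.include /usr/lib/systemd/system/httpd.service

[Service]
CPUAffinity=0 1
IOSchedulingClass=best-effort
IOSchedulingPriority=6
CPUSchedulingPolicy=batch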


systemd for Administrators, Part XVII

It's that time again, here's now the seventeenth installment of my ongoing series on systemd for Administrators:

Using the Journal

A while back I already posted a blog story introducing some functionality of the journal, and how it is exposed in systemctl. In this episode I want to explain a few more uses of the journal, and how you can make it work for you.

If you are wondering what the journal is, here's an explanation in a few words to get you up to speed: the journal is a component of systemd that captures syslog messages, kernel log messages, initial RAM disk and early boot messages, as well as messages written to STDOUT/STDERR of all services, indexes them and makes all of this available to the user. It can be used in parallel with, or in place of, a traditional syslog daemon, such as rsyslog or syslog-ng. For more information, see the initial announcement.

The journal has been part of Fedora since F17. With Fedora 18 it now has grown into a reliable, powerful tool to handle your logs. Note however, that on F17 and F18 the journal is configured by default to store logs only in a small ring-buffer in /run/log/journal, i.e. not persistent. This of course limits its usefulness quite drastically but is sufficient to show a bit of recent log history in systemctl status. For Fedora 19, we plan to change this, and enable persistent logging by default. Then, journal files will be stored in /var/log/journal and can grow much larger, thus making the journal a lot more useful.

Enabling Persistency

In the meantime, on F17 or F18, you can enable journald's persistent storage manually:

# mkdir -p /var/log/journal

After that, it's a good idea to reboot, to get some useful structured data into your journal to play with. Oh, and since you have the journal now, you don't need syslog anymore (unless having /var/log/messages as a text file is a necessity for you), so you can choose to uninstall rsyslog:

# yum remove rsyslog

Basics

Now we are ready to go. The following text shows a lot of features of systemd 195 as it will be included in Fedora 18[1], so if your F17 can't do the tricks you see, please wait for F18. First, let's start with some basics. To access the logs of the journal use the journalctl(1) tool. To have a first look at the logs, just type in:

# journalctl

If you run this as root you will see all logs generated on the system, from system components as well as from logged-in users. The output you will get looks like a pixel-perfect copy of the traditional /var/log/messages format, but actually has a couple of improvements over it:

  • Lines of error priority (and higher) will be highlighted red.
  • Lines of notice/warning priority will be highlighted bold.
  • The timestamps are converted into your local time-zone.
  • The output is auto-paged with your pager of choice (defaults to less).
  • This will show all available data, including rotated logs.
  • Between the output of each boot we'll add a line clarifying that a new boot begins now.

Note that in this blog story I will not actually show you any of the output this generates, I cut that out for brevity -- and to give you a reason to try it out yourself with a current image for F18's development version with systemd 195. But I do hope you get the idea anyway.

Access Control

Browsing logs this way is already pretty nice. But having to be root sucks of course; even administrators tend to do most of their work as unprivileged users these days. By default, Journal users can only watch their own logs, unless they are root or in the adm group. To make watching system logs more fun, let's add ourselves to adm:

# usermod -a -G adm lennart

After logging out and back in as lennart I now have access to the full journal of the system and all users:

$ journalctl

Live View

If invoked without parameters journalctl will show me the current log database. Sometimes one needs to watch logs as they grow, where one previously used tail -f /var/log/messages:

$ journalctl -f

Yes, this does exactly what you expect it to do: it will show you the last ten log lines and then wait for changes and show them as they take place.

Basic Filtering

When invoking journalctl without parameters you'll see the whole set of logs, beginning with the oldest message stored. That of course, can be a lot of data. Much more useful is just viewing the logs of the current boot:

$ journalctl -b

This will show you only the logs of the current boot, with all the aforementioned gimmicks. But sometimes even this is way too much data to process. So what about just listing all the real issues to care about: all messages of priority levels ERROR and worse, from the current boot:

$ journalctl -b -p err

If you reboot only seldom, -b makes little sense; filtering based on time is much more useful:

$ journalctl --since=yesterday

And there you go, all log messages from the day before at 00:00 in the morning until right now. Awesome! Of course, we can combine this with -p err or a similar match. But humm, we are looking for something that happened on the 15th of October, or was it the 16th?

$ journalctl --since=2012-10-15 --until="2012-10-16 23:59:59"

Yupp, there we go, we found what we were looking for. But humm, I noticed that some CGI script in Apache was acting up earlier today, let's see what Apache logged at that time:

$ journalctl -u httpd --since=00:00 --until=9:30

Oh, yeah, there we found it. But hey, wasn't there an issue with that disk /dev/sdc? Let's figure out what was going on there:

$ journalctl /dev/sdc

OMG, a disk error![2] Hmm, let's quickly replace the disk before we lose data. Done! Next! -- Hmm, didn't I see that the vpnc binary made a booboo? Let's check for that:

$ journalctl /usr/sbin/vpnc

Hmm, I don't get this, this seems to be some weird interaction with dhclient, let's see both outputs, interleaved:

$ journalctl /usr/sbin/vpnc /usr/sbin/dhclient

That did it! Found it!

Advanced Filtering

Whew! That was awesome already, but let's turn this up a notch. Internally systemd stores each log entry with a set of implicit meta data. This meta data looks a lot like an environment block, but actually is a bit more powerful: fields can carry large, binary values (though this is the exception; usually they just contain UTF-8), and fields can have multiple values assigned (an exception too; usually they only have one value). This implicit meta data is collected for each and every log message, without user intervention. The data will be there, and wait to be used by you. Let's see how this looks:

$ journalctl -o verbose -n
[...]
Tue, 2012-10-23 23:51:38 CEST [s=ac9e9c423355411d87bf0ba1a9b424e8;i=4301;b=5335e9cf5d954633bb99aefc0ec38c25;m=882ee28d2;t=4ccc0f98326e6;x=f21e8b1b0994d7ee]
        PRIORITY=6
        SYSLOG_FACILITY=3
        _MACHINE_ID=a91663387a90b89f185d4e860000001a
        _HOSTNAME=epsilon
        _TRANSPORT=syslog
        SYSLOG_IDENTIFIER=avahi-daemon
        _COMM=avahi-daemon
        _EXE=/usr/sbin/avahi-daemon
        _SYSTEMD_CGROUP=/system/avahi-daemon.service
        _SYSTEMD_UNIT=avahi-daemon.service
        _SELINUX_CONTEXT=system_u:system_r:avahi_t:s0
        _UID=70
        _GID=70
        _CMDLINE=avahi-daemon: registering [epsilon.local]
        MESSAGE=Joining mDNS multicast group on interface wlan0.IPv4 with address 172.31.0.53.
        _BOOT_ID=5335e9cf5d954633bb99aefc0ec38c25
        _PID=27937
        SYSLOG_PID=27937
        _SOURCE_REALTIME_TIMESTAMP=1351029098747042

(I cut out a lot of noise here, I don't want to make this story overly long. -n without parameter shows you the last 10 log entries, but I cut out all but the last.)

With the -o verbose switch we enabled verbose output. Instead of showing a pixel-perfect copy of classic /var/log/messages that only includes a minimal subset of what is available, we now see all the gory details the journal has about each entry. But it's highly interesting: there is user credential information, SELinux bits, machine information and more. For a full list of common, well-known fields, see the man page.

Now, as it turns out the journal database is indexed by all of these fields, out-of-the-box! Let's try this out:

$ journalctl _UID=70

And there you go, this will show all log messages logged from Linux user ID 70. As it turns out one can easily combine these matches:

$ journalctl _UID=70 _UID=71

Specifying two matches for the same field will result in a logical OR combination of the matches. All entries matching either will be shown, i.e. all messages from either UID 70 or 71.

$ journalctl _HOSTNAME=epsilon _COMM=avahi-daemon

You guessed it, if you specify two matches for different field names, they will be combined with a logical AND. All entries matching both will be shown, meaning only messages from processes named avahi-daemon on host epsilon.

But of course, that's not fancy enough for us. We are computer nerds after all, we live off logical expressions. We must go deeper!

$ journalctl _HOSTNAME=theta _UID=70 + _HOSTNAME=epsilon _COMM=avahi-daemon

The + is an explicit OR you can use in addition to the implied OR when you match the same field twice. The line above hence means: show me everything from host theta with UID 70, or from host epsilon with a process name of avahi-daemon.

And now, it becomes magic!

That was already pretty cool, right? Right! But heck, who can remember all those values a field can take in the journal, I mean, seriously, who has thaaaat kind of photographic memory? Well, the journal has:

$ journalctl -F _SYSTEMD_UNIT

This will show us all values the field _SYSTEMD_UNIT takes in the database, or in other words: the names of all systemd services which ever logged into the journal. This makes it super-easy to build nice matches. But wait, turns out this all is actually hooked up with shell completion on bash! This gets even more awesome: as you type your match expression you will get a list of well-known field names, and of the values they can take! Let's figure out how to filter for SELinux labels again. We remember the field name was something with SELINUX in it, let's try that:

$ journalctl _SE<TAB>

And yupp, it's immediately completed:

$ journalctl _SELINUX_CONTEXT=

Cool, but what's the label again we wanted to match for?

$ journalctl _SELINUX_CONTEXT=<TAB><TAB>
kernel                                                       system_u:system_r:local_login_t:s0-s0:c0.c1023               system_u:system_r:udev_t:s0-s0:c0.c1023
system_u:system_r:accountsd_t:s0                             system_u:system_r:lvm_t:s0                                   system_u:system_r:virtd_t:s0-s0:c0.c1023
system_u:system_r:avahi_t:s0                                 system_u:system_r:modemmanager_t:s0-s0:c0.c1023              system_u:system_r:vpnc_t:s0
system_u:system_r:bluetooth_t:s0                             system_u:system_r:NetworkManager_t:s0                        system_u:system_r:xdm_t:s0-s0:c0.c1023
system_u:system_r:chkpwd_t:s0-s0:c0.c1023                    system_u:system_r:policykit_t:s0                             unconfined_u:system_r:rpm_t:s0-s0:c0.c1023
system_u:system_r:chronyd_t:s0                               system_u:system_r:rtkit_daemon_t:s0                          unconfined_u:system_r:unconfined_t:s0-s0:c0.c1023
system_u:system_r:crond_t:s0-s0:c0.c1023                     system_u:system_r:syslogd_t:s0                               unconfined_u:system_r:useradd_t:s0-s0:c0.c1023
system_u:system_r:devicekit_disk_t:s0                        system_u:system_r:system_cronjob_t:s0-s0:c0.c1023            unconfined_u:unconfined_r:unconfined_dbusd_t:s0-s0:c0.c1023
system_u:system_r:dhcpc_t:s0                                 system_u:system_r:system_dbusd_t:s0-s0:c0.c1023              unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
system_u:system_r:dnsmasq_t:s0-s0:c0.c1023                   system_u:system_r:systemd_logind_t:s0
system_u:system_r:init_t:s0                                  system_u:system_r:systemd_tmpfiles_t:s0

Ah! Right! We wanted to see everything logged under PolicyKit's security label:

$ journalctl _SELINUX_CONTEXT=system_u:system_r:policykit_t:s0

Wow! That was easy! I didn't know anything related to SELinux could be thaaat easy! ;-) Of course this kind of completion works with any field, not just SELinux labels.

So much for now. There's a lot more cool stuff in journalctl(1) than this. For example, it generates JSON output for you! You can match against kernel fields! You can get simple /var/log/messages-like output but with relative timestamps! And so much more!
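To give you a taste, here are a few invocations along those lines (a quick sketch; the exact set of output modes depends on your systemd version):

$ journalctl -o json
$ journalctl _TRANSPORT=kernel
$ journalctl -o short-monotonic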

Anyway, in the next weeks I hope to post more stories about all the cool things the journal can do for you. This is just the beginning, stay tuned.

Footnotes

[1] systemd 195 is currently still in Bodhi but hopefully will get into F18 proper soon, and definitely before the release of Fedora 18.

[2] OK, I cheated here, indexing by block device is not in the kernel yet, but on its way due to Hannes' fantastic work, and I hope it will make an appearance in F18.


systemd for Administrators, Part XVI

And, yes, here's now the sixteenth installment of my ongoing series on systemd for Administrators:

Gettys on Serial Consoles (and Elsewhere)

TL;DR: To make use of a serial console, just use console=ttyS0 on the kernel command line, and systemd will automatically start a getty on it for you.

While physical RS232 serial ports have become exotic in today's PCs they play an important role in modern servers and embedded hardware. They provide a relatively robust and minimalistic way to access the console of your device, that works even when the network is hosed, or the primary UI is unresponsive. VMs frequently emulate a serial port as well.

Of course, Linux has always had good support for serial consoles, but with systemd we tried to make serial console support even simpler to use. In the following text I'll try to give an overview how serial console gettys on systemd work, and how TTYs of any kind are handled.

Let's start with the key take-away: in most cases, to get a login prompt on your serial port you don't need to do anything. systemd checks the kernel configuration for the selected kernel console and will simply spawn a serial getty on it. That way it is entirely sufficient to configure your kernel console properly (for example, by adding console=ttyS0 to the kernel command line) and that's it. But let's have a look at the details:

In systemd, two template units are responsible for bringing up a login prompt on text consoles:

  1. getty@.service is responsible for virtual terminal (VT) login prompts, i.e. those on your VGA screen as exposed in /dev/tty1 and similar devices.
  2. serial-getty@.service is responsible for all other terminals, including serial ports such as /dev/ttyS0. It differs in a couple of ways from getty@.service: among other things the $TERM environment variable is set to vt102 (hopefully a good default for most serial terminals) rather than linux (which is the right choice for VTs only), and the special logic that clears the VT scrollback buffer (which only works on VTs) is skipped.

Virtual Terminals

Let's have a closer look at how getty@.service is started, i.e. how login prompts on the virtual terminals (i.e. non-serial TTYs) work. Traditionally, the init system on Linux machines was configured to spawn a fixed number of login prompts at boot. In most cases six instances of the getty program were spawned, on the first six VTs, tty1 to tty6.

In a systemd world we made this more dynamic: in order to make things more efficient login prompts are now started on demand only. As you switch to the VTs the getty service is instantiated to getty@tty2.service, getty@tty5.service and so on. Since we don't have to unconditionally start the getty processes anymore this allows us to save a bit of resources, and makes start-up a bit faster. This behaviour is mostly transparent to the user: if the user activates a VT the getty is started right away, so that the user will hardly notice that it wasn't running all the time. If he then logs in and types ps he'll notice however that getty instances are only running for the VTs he has switched to so far.
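A quick way to see this for yourself (pick any VT you have switched to; output omitted here):

$ systemctl status getty@tty3.service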

By default this automatic spawning is done for the VTs up to VT6 only (in order to be close to the traditional default configuration of Linux systems)[1]. Note that the auto-spawning of gettys is only attempted if no other subsystem took possession of the VTs yet. More specifically, if a user makes frequent use of fast user switching via GNOME he'll get his X sessions on the first six VTs, too, since the lowest available VT is allocated for each session.

Two VTs are handled specially by the auto-spawning logic: firstly tty1 gets special treatment: if we boot into graphical mode the display manager takes possession of this VT. If we boot into multi-user (text) mode a getty is started on it -- unconditionally, without any on-demand logic[2].

Secondly, tty6 is especially reserved for auto-spawned gettys and unavailable to other subsystems such as X[3]. This is done in order to ensure that there's always a way to get a text login, even if due to fast user switching X took possession of more than 5 VTs.

Serial Terminals

Handling of login prompts on serial terminals (and all other kind of non-VT terminals) is different from that of VTs. By default systemd will instantiate one serial-getty@.service on the main kernel[4] console, if it is not a virtual terminal. The kernel console is where the kernel outputs its own log messages and is usually configured on the kernel command line in the boot loader via an argument such as console=ttyS0[5]. This logic ensures that when the user asks the kernel to redirect its output onto a certain serial terminal, he will automatically also get a login prompt on it as the boot completes[6]. systemd will also spawn a login prompt on the first special VM console (that's /dev/hvc0, /dev/xvc0, /dev/hvsi0), if the system is run in a VM that provides these devices. This logic is implemented in a generator called systemd-getty-generator that is run early at boot and pulls in the necessary services depending on the execution environment.

In many cases, this automatic logic should already suffice to get you a login prompt when you need one, without any specific configuration of systemd. However, sometimes there's the need to manually configure a serial getty, for example, if more than one serial login prompt is needed or the kernel console should be redirected to a different terminal than the login prompt. To facilitate this it is sufficient to instantiate serial-getty@.service once for each serial port you want it to run on[7]:

# systemctl enable serial-getty@ttyS2.service
# systemctl start serial-getty@ttyS2.service

And that's it. This will make sure you get the login prompt on the chosen port on all subsequent boots, and starts it right-away too.

Sometimes, there's the need to configure the login prompt in even more detail. For example, if the default baud rate configured by the kernel is not correct or other agetty parameters need to be changed. In such a case simply copy the default unit template to /etc/systemd/system and edit it there:

# cp /usr/lib/systemd/system/serial-getty@.service /etc/systemd/system/serial-getty@ttyS2.service
# vi /etc/systemd/system/serial-getty@ttyS2.service
 .... now make your changes to the agetty command line ...
# ln -s /etc/systemd/system/serial-getty@ttyS2.service /etc/systemd/system/getty.target.wants/
# systemctl daemon-reload
# systemctl start serial-getty@ttyS2.service

This creates a unit file that is specific to serial port ttyS2, so that you can make specific changes to this port and this port only.

And this is pretty much all there is to say about serial ports, VTs and login prompts on them. I hope this was interesting, and please come back soon for the next installment of this series!

Footnotes

[1] You can easily modify this by changing NAutoVTs= in logind.conf.

[2] Note that whether the getty on VT1 is started on-demand or not hardly makes a difference, since VT1 is the default active VT anyway, so the demand is there anyway at boot.

[3] You can easily change this special reserved VT by modifying ReserveVT= in logind.conf.

[4] If multiple kernel consoles are used simultaneously, the main console is the one listed first in /sys/class/tty/console/active, which is the last one listed on the kernel command line.

[5] See kernel-parameters.txt for more information on this kernel command line option.

[6] Note that agetty -s is used here so that the baud rate configured at the kernel command line is not altered and continues to be used by the login prompt.

[7] Note that this systemctl enable syntax only works with systemd 188 and newer (i.e. F18). On older versions use ln -s /usr/lib/systemd/system/serial-getty@.service /etc/systemd/system/getty.target.wants/serial-getty@ttyS2.service ; systemctl daemon-reload instead.


Berlin Open Source Meetup

Prater

Chris Kühl and I are organizing a Berlin Open Source Meetup on Aug 19th at the Prater Biergarten in Prenzlauer Berg. If you live in Berlin (or are passing by) and are involved in or interested in Open Source then you are invited!

There's also a Google+ event for the meetup.

It's a public event, so everybody is welcome, and please feel free to invite others!

See you at the Prater!


Upcoming Hackfests/Sprints

The Linux Plumbers Conference 2012 will take place August 29th to 31st in San Diego, California. We, the systemd developers, would like to invite you to two hackfests/sprints that will happen around LPC:

San Diego: libvirt/LXC/systemd/SELinux Integration Hackfest

On 28th of August we'll have a hackfest on the topic of closer integration of libvirt, LXC, systemd and SELinux, colocated with LPC in San Diego, California. We'll have a number of key people from these projects participating, including Dan Walsh, Eric Paris, Daniel P. Berrange, Kay Sievers and myself.

Topics we'll cover: making Fedora/Linux boot entirely cleanly in normal containers, teaching systemd's control tools minimal container-awareness (such as being able to list all services of all containers in one go, in addition to those running on the host system), unified journal logging across multiple containers, the systemd container interface, auditing and containers, running multiple instances from the same /usr tree, and a lot more...

Who should attend? Everybody hacking on the mentioned projects who wants to help integrating them with the goal of turning them into a secure, reliable, powerful container solution for Linux.

Who should not attend? If you don't hack on any of these projects, or if you are not interested in closer integration of at least two of these projects.

How to register? Just show up. You get extra points however for letting us know in advance (just send us an email). Attendance is free.

➥ See also: Google+ Event

San Francisco: systemd Journal Sprint

On September 3-7 we'll have a sprint on the topic of the systemd Journal. It's going to take place at the Pantheon headquarters in San Francisco, California. Among others, Kay Sievers, David Strauss and I will participate.

Who should attend? Everybody who wants to help improve the systemd Journal, regardless of whether that's in its core itself, in client software for it, in hooking up other projects or in writing library bindings for it. Also, if you are using or planning to use the journal for a project, we'd be very interested in high-bandwidth face-to-face feedback regarding what you are missing, what you don't like so much, and what you find awesome in the Journal.

How to register? Please sign up at EventBrite. Attendance is free. For more information see the invitation mail.

➥ See also: Google+ Event

See you in California!


foss.in 2012 CFP Ends in a Few Hours

foss.in 2012 in Bangalore takes place again after a hiatus of some years. It has always been a fantastic conference, and a great opportunity to visit Bangalore and India. I just submitted my talk proposals, so, hurry up, and submit yours!


systemd for Administrators, Part XV

Quickly following the previous iteration, here's now the fifteenth installment of my ongoing series on systemd for Administrators:

Watchdogs

There are three big target audiences we try to cover with systemd: the embedded/mobile folks, the desktop people and the server folks. While the systems used by embedded/mobile tend to be underpowered and have few resources available, desktops tend to be much more powerful machines -- but still much less resourceful than servers. Nonetheless there are surprisingly many features that matter to both extremes of this axis (embedded and servers), but not the center (desktops). One of them is support for watchdogs in hardware and software.

Embedded devices frequently rely on watchdog hardware that resets the device automatically if software stops responding (more specifically, stops signalling the hardware in fixed intervals that it is still alive). This is required to increase reliability and make sure that, regardless of what happens, a best effort is made to get the system working again. Functionality like this makes little sense on the desktop[1]. However, on high-availability servers watchdogs are frequently used, again.

Starting with version 183 systemd provides full support for hardware watchdogs (as exposed in /dev/watchdog to userspace), as well as supervisor (software) watchdog support for individual system services. The basic idea is the following: if enabled, systemd will regularly ping the watchdog hardware. If systemd or the kernel hang this ping will not happen anymore and the hardware will automatically reset the system. This way systemd and the kernel are protected from boundless hangs -- by the hardware. To make the chain complete, systemd then exposes a software watchdog interface for individual services so that they can also be restarted (or some other action taken) if they begin to hang. This software watchdog logic can be configured individually for each service in the ping frequency and the action to take. Putting both parts together (i.e. hardware watchdogs supervising systemd and the kernel, as well as systemd supervising all other services) we have a reliable way to watchdog every single component of the system.

To make use of the hardware watchdog it is sufficient to set the RuntimeWatchdogSec= option in /etc/systemd/system.conf. It defaults to 0 (i.e. no hardware watchdog use). Set it to a value like 20s and the watchdog is enabled. After 20s of no keep-alive pings the hardware will reset itself. Note that systemd will send a ping to the hardware at half the specified interval, i.e. every 10s. And that's already all there is to it. By enabling this single, simple option you have turned on supervision by the hardware of systemd and the kernel beneath it.[2]

Note that the hardware watchdog device (/dev/watchdog) is single-user only. That means that you can either enable this functionality in systemd, or use a separate external watchdog daemon, such as the aptly named watchdog.

ShutdownWatchdogSec= is another option that can be configured in /etc/systemd/system.conf. It controls the watchdog interval to use during reboots. It defaults to 10min, and adds extra reliability to the system reboot logic: if a clean reboot is not possible and shutdown hangs, we rely on the watchdog hardware to reset the system abruptly, as extra safety net.
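Taken together, the relevant part of /etc/systemd/system.conf might look like this, using the example values from above:

[Manager]
RuntimeWatchdogSec=20s
ShutdownWatchdogSec=10min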

So much for the hardware watchdog logic. These two options are really everything that is necessary to make use of hardware watchdogs. Now, let's have a look at how to add watchdog logic to individual services.

First of all, to make a daemon watchdog-supervisable it needs to be patched to send out "I am alive" signals in regular intervals in its event loop. Patching this is relatively easy. First, a daemon needs to read the WATCHDOG_USEC= environment variable. If it is set, it will contain the watchdog interval in usec, formatted as an ASCII text string, as it is configured for the service. The daemon should then issue sd_notify("WATCHDOG=1") calls every half of that interval. A daemon patched this way should transparently support watchdog functionality by checking whether the environment variable is set and honouring the value it is set to.
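Here's a minimal sketch of mine of what such a patch boils down to, written as a stand-alone loop (a real daemon would integrate this into its existing event loop and link against libsystemd-daemon for sd_notify(3)):

#include <systemd/sd-daemon.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
        unsigned long long watchdog_usec = 0;
        const char *e;

        /* Set by systemd if WatchdogSec= is configured for this service */
        e = getenv("WATCHDOG_USEC");
        if (e)
                watchdog_usec = strtoull(e, NULL, 10);

        for (;;) {
                /* ... do one iteration of the daemon's real work here ... */

                /* Tell systemd we are still alive */
                if (watchdog_usec > 0)
                        sd_notify(0, "WATCHDOG=1");

                /* Ping at half the configured interval, as described above */
                usleep(watchdog_usec > 0 ? (useconds_t) (watchdog_usec / 2) : 1000000);
        }
}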

To enable the software watchdog logic for a service (which has been patched to support the logic pointed out above) it is sufficient to set WatchdogSec= to the desired failure latency. See systemd.service(5) for details on this setting. This causes WATCHDOG_USEC= to be set for the service's processes and will cause the service to enter a failure state as soon as no keep-alive ping is received within the configured interval.

A service entering a failure state as soon as the watchdog logic detects a hang is, on its own, hardly sufficient to build a reliable system. The next step is to configure whether the service shall be restarted and how often, and what to do if it then still fails. To enable automatic service restarts on failure set Restart=on-failure for the service. To configure how many times a service shall be attempted to be restarted use the combination of StartLimitBurst= and StartLimitInterval=, which allow you to configure how often a service may restart within a time interval. If that limit is reached, a special action can be taken. This action is configured with StartLimitAction=. The default is none, i.e. no further action is taken and the service simply remains in the failure state without any further attempted restarts. The other three possible values are reboot, reboot-force and reboot-immediate. reboot attempts a clean reboot, going through the usual, clean shutdown logic. reboot-force is more abrupt: it will not actually try to cleanly shut down any services, but immediately kills all remaining services and unmounts all file systems and then forcibly reboots (this way all file systems will be clean but reboot will still be very fast). Finally, reboot-immediate does not attempt to kill any process or unmount any file systems. Instead it just hard reboots the machine without delay. reboot-immediate hence comes closest to a reboot triggered by a hardware watchdog. All these settings are documented in systemd.service(5).

Putting this all together we now have pretty flexible options to watchdog-supervise a specific service and configure automatic restarts of the service if it hangs, plus take ultimate action if that doesn't help.

Here's an example unit file:

[Unit]
Description=My Little Daemon
Documentation=man:mylittled(8)

[Service]
ExecStart=/usr/bin/mylittled
WatchdogSec=30s
Restart=on-failure
StartLimitInterval=5min
StartLimitBurst=4
StartLimitAction=reboot-force

This service will automatically be restarted if it hasn't pinged the system manager for longer than 30s or if it fails otherwise. If it is restarted this way more often than 4 times in 5min action is taken and the system quickly rebooted, with all file systems being clean when it comes up again.

And that's already all I wanted to tell you about! With hardware watchdog support right in PID 1, as well as supervisor watchdog support for individual services, we should provide everything you need for most watchdog use cases. Regardless of whether you are building an embedded or mobile appliance, or you are working with high-availability servers, please give this a try!

(Oh, and if you wonder why in heaven PID 1 needs to deal with /dev/watchdog, and why this shouldn't be kept in a separate daemon, then please read this again and try to understand that this is all about the supervisor chain we are building here, where the hardware watchdog supervises systemd, and systemd supervises the individual services. Also, we believe that a service not responding should be treated in a similar way as any other service error. Finally, pinging /dev/watchdog is one of the most trivial operations in the OS (basically little more than an ioctl() call), so the support for this is not more than a handful of lines of code. Maintaining this externally with complex IPC between PID 1 (and the daemons) and a watchdog daemon would be drastically more complex, error-prone and resource intensive.)

Note that the built-in hardware watchdog support of systemd does not conflict with other watchdog software by default. systemd does not make use of /dev/watchdog by default, and you are welcome to use external watchdog daemons in conjunction with systemd, if this better suits your needs.

And one last thing: if you wonder whether your hardware has a watchdog, then the answer is: almost definitely yes -- if it is anything more recent than a few years. If you want to verify this, try the wdctl tool from recent util-linux, which shows you everything you need to know about your watchdog hardware.

I'd like to thank the great folks from Pengutronix for contributing most of the watchdog logic. Thank you!

Footnotes

[1] Though actually most desktops tend to include watchdog hardware these days too, as this is cheap to build and available in most modern PC chipsets.

[2] So, here's a free tip for you if you hack on the core OS: don't enable this feature while you hack. Otherwise your system might suddenly reboot if you are in the middle of tracing through PID 1 with gdb and cause it to be stopped for a moment, so that no hardware ping can be done...
