
File Descriptor Limits

TL;DR: don't use select() + bump the RLIMIT_NOFILE soft limit to the hard limit in your modern programs.

File descriptors ("fds") are the primary way to reference, allocate and pin runtime OS resources on Linux today. Originally they were used to reference open files and directories and maybe a bit more, but today they may be used to reference almost any kind of runtime resource in Linux userspace, including open devices, memory (memfd_create(2)), timers (timerfd_create(2)) and even processes (with the new pidfd_open(2) system call). In a way, the philosophically skewed UNIX concept of "everything is a file" actually acquires a bit of sensible meaning through this proliferation of fds: "everything has a file descriptor" is certainly a much better motto to adopt.

Because of this proliferation of fds, non-trivial modern programs tend to have to deal with substantially more fds at the same time than they traditionally did. Today, you'll often encounter real-life programs that have a few thousand fds open at the same time.

As with most runtime resources on Linux, limits are enforced on file descriptors: once you hit the limit configured via RLIMIT_NOFILE any attempt to allocate more is refused with the EMFILE error — until you close a couple of those you already have open.

Because fds weren't such a universal concept traditionally, the RLIMIT_NOFILE limit used to be quite low. Specifically, when the Linux kernel first invokes userspace it still sets RLIMIT_NOFILE to a low value of 1024 (soft) and 4096 (hard). (Quick explanation: the soft limit is what matters and causes the EMFILE issues; the hard limit is a secondary limit that processes may bump their soft limit to — if they like — without requiring further privileges. Raising the limit beyond the hard limit would require privileges, however.) A limit of 1024 fds made fds a scarce resource: APIs tried to be careful with using fds, since you simply couldn't have that many of them at the same time. This resulted in some questionable coding decisions and concepts at various places: often, secondary descriptors that are very similar to fds — but are not actually fds — were introduced (e.g. inotify watch descriptors), simply to exempt them from the low limits enforced on true fds. Or code tried to aggressively close fds when not absolutely needing them (e.g. ftw()/nftw()), losing the nice + stable "pinning" effect of open fds.
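
To make the soft/hard distinction tangible, here's a minimal C sketch (plain POSIX, nothing systemd-specific) that prints the two values for the calling process:

#include <stdio.h>
#include <sys/resource.h>

int main(void) {
        struct rlimit rl;

        if (getrlimit(RLIMIT_NOFILE, &rl) < 0) {
                perror("getrlimit");
                return 1;
        }

        /* rlim_cur is the soft limit (the one that triggers EMFILE),
         * rlim_max the hard limit (the ceiling unprivileged processes
         * may raise their soft limit to). */
        printf("soft=%llu hard=%llu\n",
               (unsigned long long) rl.rlim_cur,
               (unsigned long long) rl.rlim_max);
        return 0;
}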

Worse though is that certain OS-level APIs were designed with only the low limits in mind. The worst offender is the BSD/POSIX select(2) system call: it only works with fds in the numeric range of 0…1023 (aka FD_SETSIZE-1). If you have an fd outside of this range, tough luck: select() won't work, and only if you are lucky will you detect that and be able to handle it somehow.

Linux fds are exposed as simple integers, and for most calls it is guaranteed that the lowest unused integer is allocated for new fds. Thus, as long as the RLIMIT_NOFILE soft limit is set to 1024 everything remains compatible with select(): the resulting fds will also be below 1024. Yay. If we bump the soft limit above this threshold though, and at some point an fd higher than the threshold is allocated, that fd won't be compatible with select() anymore.

Because of that, indiscriminately increasing the soft RLIMIT_NOFILE resource limit today for every userspace process is problematic: as long as there's userspace code still using select() doing so will risk triggering hard-to-handle, hard-to-debug errors all over the place.

However, given the nowadays ubiquitous use of fds for all kinds of resources (did you know, an eBPF program is an fd? and a cgroup too? and attaching an eBPF program to a cgroup is another fd? …), we'd really like to raise the limit anyway. 🤔

So before we continue thinking about this problem, let's make the problem more complex (…uh, I mean… "more exciting") first. Having just one hard and one soft per-process limit on fds is boring. Let's add more limits on fds to the mix. Specifically, on Linux there are two system-wide sysctls: fs.nr_open and fs.file-max. (Don't ask me why one uses a dash and the other an underscore, or why there are two of them...) On today's kernels they have kinda lost their relevance. They had some relevance originally, because fds weren't accounted for by any other counter. But today, the kernel tracks fds mostly as small pieces of memory allocated on userspace requests — because that's ultimately what they are —, and thus charges them to the memory accounting done anyway.
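
If you are curious what these two knobs are set to on your system, you can query them directly:

# sysctl fs.nr_open fs.file-max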

So now, we have four limits (actually: five if you count the memory accounting) on the same kind of resource, and all of them make a resource artificially scarce that we don't want to be scarce. So what to do?

Back in systemd v240 (i.e. 2019) we decided to do something about it. Specifically:

  • Automatically at boot we'll now bump the two sysctls to their maximum, making them effectively ineffective. This one was easy. We got rid of two pretty much redundant knobs. Nice!

  • The RLIMIT_NOFILE hard limit is bumped substantially to 512K. Yay, cheap fds! You may have an fd, and you, and you as well, everyone may have an fd!

  • But … we left the soft RLIMIT_NOFILE limit at 1024. We weren't quite ready to break all programs still using select() in 2019. But I think it's not as bad as it might sound: given that the hard limit is bumped, every program can easily opt in to a larger number of fds, by setting the soft limit to the hard limit early on — without requiring privileges.

So effectively, with this approach fds should be much less scarce (at least for programs that opt into that), and the limits should be much easier to configure, since there are now only two knobs one really needs to care about:

  • Configure the RLIMIT_NOFILE hard limit to the maximum number of fds you actually want to allow a process.

  • In the program code then either bump the soft to the hard limit, or not. If you do, you basically declare "I understood the problem, I promise to not use select(), drown me in fds, please!". If you don't, then effectively everything remains as it always was.

Apparently this approach worked, since the negative feedback on the change was even scarcer than fds traditionally were (ha, fun!). We got reports from pretty much only two projects that were bitten by the change (one being a JVM implementation): they already bumped their soft limit automatically to their hard limit during program initialization, and then allocated an array with one entry per possible fd. With the new high limit this resulted in one massive allocation that traditionally was just a few K, and this caused their memory checks to be hit.

Anyway, here's the takeaway of this blog story:

  • Don't use select() anymore in 2021. Use poll(), epoll, io_uring, …, but for heaven's sake don't use select(). It might have been all the rage in the 1990s but it doesn't scale and is simply not designed for today's programs. I wish the man page of select() made clearer how icky it is and that there are plenty of preferable APIs.

  • If you hack on a program that potentially uses a lot of fds, add some simple code somewhere to its start-up that bumps the RLIMIT_NOFILE soft limit to the hard limit (a sketch of such code follows after this list). But if you do this, you have to make sure your code (and any code that you link to from it) refrains from using select(). (Note: there's at least one glibc NSS plugin using select() internally. Given that NSS modules can end up being loaded into pretty much any process such modules should probably be considered just buggy.)

  • If said program you hack on forks off foreign programs, make sure to reset the RLIMIT_NOFILE soft limit back to 1024 for them (see the second sketch below). Just because your program might be fine with fds >= 1024 doesn't mean those foreign programs will be. And unfortunately RLIMIT_NOFILE is inherited down the process tree unless explicitly set.
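
Here's a minimal C sketch of the start-up code mentioned above (the function name is made up for this example, and error handling is kept simple):

#include <sys/resource.h>

/* Bump the RLIMIT_NOFILE soft limit to the hard limit. This requires
 * no privileges, since we never exceed the configured hard limit.
 * Only call this if neither your code nor any code you link to uses
 * select(). */
static int bump_rlimit_nofile(void) {
        struct rlimit rl;

        if (getrlimit(RLIMIT_NOFILE, &rl) < 0)
                return -1;

        rl.rlim_cur = rl.rlim_max;
        return setrlimit(RLIMIT_NOFILE, &rl);
}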
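
And here's the counterpart, a sketch of reverting the soft limit before forking off a foreign program (again with a made-up name; on Linux/glibc FD_SETSIZE is 1024):

#include <sys/resource.h>
#include <sys/select.h>

/* Reset the RLIMIT_NOFILE soft limit to the traditional value before
 * fork()+exec() of a foreign program, since that program might still
 * use select(). Lowering the soft limit never requires privileges. */
static int revert_rlimit_nofile(void) {
        struct rlimit rl;

        if (getrlimit(RLIMIT_NOFILE, &rl) < 0)
                return -1;

        if (rl.rlim_cur <= FD_SETSIZE)
                return 0;

        rl.rlim_cur = FD_SETSIZE;
        return setrlimit(RLIMIT_NOFILE, &rl);
}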

And that's all I have for today. I hope this was enlightening.


Unlocking LUKS2 volumes with TPM2, FIDO2, PKCS#11 Security Hardware on systemd 248

TL;DR: It's now easy to unlock your LUKS2 volume with a FIDO2 security token (e.g. YubiKey, Nitrokey FIDO2, AuthenTrend ATKey.Pro). And TPM2 unlocking is easy now too.

Blogging is a lot of work, and a lot less fun than hacking. I mostly focus on the latter because of that, but from time to time I guess stuff is just too interesting to not be blogged about. Hence here, finally, another blog story about exciting new features in systemd.

With the upcoming systemd v248 the systemd-cryptsetup component of systemd (which is responsible for assembling encrypted volumes during boot) gained direct support for unlocking encrypted storage with three types of security hardware:

  1. Unlocking with FIDO2 security tokens (well, at least with those which implement the hmac-secret extension; most do). i.e. your YubiKeys (series 5 and above), Nitrokey FIDO2, AuthenTrend ATKey.Pro and such.

  2. Unlocking with TPM2 security chips (pretty ubiquitous on non-budget PCs/laptops/…)

  3. Unlocking with PKCS#11 security tokens, i.e. your smartcards and older YubiKeys (the ones that implement PIV). (Strictly speaking this was supported on older systemd already, but was a lot more "manual".)

For completeness' sake, let's keep in mind that the component also allows unlocking with these more traditional mechanisms:

  1. Unlocking interactively with a user-entered passphrase (i.e. the way most people probably already deploy it, supported since about forever)

  2. Unlocking via key file on disk (optionally on removable media plugged in at boot), supported since forever.

  3. Unlocking via a key acquired through trivial AF_UNIX/SOCK_STREAM socket IPC. (Also new in v248)

  4. Unlocking via recovery keys. These are pretty much the same thing as a regular passphrase (and in fact can be entered wherever a passphrase is requested) — the main difference being that they are always generated by the computer, and thus have guaranteed high entropy, typically higher than user-chosen passphrases. They are generated in a way that makes them easy to type, in many cases even if the local key map is misconfigured. (Also new in v248)

In this blog story, let's focus on the first three items, i.e. those that talk to specific types of hardware for implementing unlocking.

To make working with security tokens and TPM2 easy, a new, small tool was added to the systemd tool set: systemd-cryptenroll. Its only purpose is to make it easy to enroll your security token/chip of choice into an encrypted volume. It works with any LUKS2 volume, and embeds a tiny bit of meta-information into the LUKS2 header with parameters necessary for the unlock operation.

Unlocking with FIDO2

So, let's see how this fits together in the FIDO2 case. Most likely this is what you want to use if you have one of these fancy FIDO2 tokens (which need to implement the hmac-secret extension, as mentioned). Let's say you already have your LUKS2 volume set up, and previously unlocked it with a simple passphrase. Plug in your token, and run:

# systemd-cryptenroll --fido2-device=auto /dev/sda5

(Replace /dev/sda5 with the underlying block device of your volume).

This will enroll the key as an additional way to unlock the volume, and embeds all necessary information for it in the LUKS2 volume header. Before we can unlock the volume with this at boot, we need to allow FIDO2 unlocking via /etc/crypttab. For that, find the right entry for your volume in that file, and edit it like so:

myvolume /dev/sda5 - fido2-device=auto

Replace myvolume and /dev/sda5 with the right volume name, and underlying device of course. Key here is the fido2-device=auto option you need to add to the fourth column in the file. It tells systemd-cryptsetup to use the FIDO2 metadata now embedded in the LUKS2 header, wait for the FIDO2 token to be plugged in at boot (utilizing systemd-udevd, …) and unlock the volume with it.

And that's it already. Easy-peasy, no?

Note that all of this doesn't modify the FIDO2 token itself in any way. Moreover you can enroll the same token in as many volumes as you like. Since all enrollment information is stored in the LUKS2 header (and not on the token) there are no bounds on any of this. (OK, well, admittedly, there's a cap on LUKS2 key slots per volume, i.e. you can't enroll more than a bunch of keys per volume.)

Unlocking with PKCS#11

Let's now have a closer look how the same works with a PKCS#11 compatible security token or smartcard. For this to work, you need a device that can store an RSA key pair. I figure most security tokens/smartcards that implement PIV qualify. How you actually get the keys onto the device might differ though. Here's how you do this for any YubiKey that implements the PIV feature:

# ykman piv reset
# ykman piv generate-key -a RSA2048 9d pubkey.pem
# ykman piv generate-certificate --subject "Knobelei" 9d pubkey.pem
# rm pubkey.pem

(This chain of commands erases what was previously stored in the PIV feature of your token, be careful!)

For tokens/smartcards from other vendors a different series of commands might work. Once you have a key pair on it, you can enroll it with a LUKS2 volume like so:

# systemd-cryptenroll --pkcs11-token-uri=auto /dev/sda5

Just like the same command's invocation in the FIDO2 case, this enrolls the security token as an additional way to unlock the volume; any passphrases you already have enrolled remain enrolled.

For the PKCS#11 case you need to edit your /etc/crypttab entry like this:

myvolume /dev/sda5 - pkcs11-uri=auto

If you have a security token that implements both PKCS#11 PIV and FIDO2, I'd probably enroll it as a FIDO2 device, given that it's the more contemporary, future-proof standard. Moreover, it requires no special preparation in order to get an RSA key onto the device: FIDO2 keys typically just work.

Unlocking with TPM2

Most modern (non-budget) PC hardware (and other kinds of hardware too) nowadays comes with a TPM2 security chip. In many ways a TPM2 chip is a smartcard that is soldered onto the mainboard of your system. Unlike your usual USB-connected security tokens you thus cannot remove it from your PC, which means it addresses quite a different security scenario: it isn't immediately comparable to a physical key you can take with you to unlock some door; rather, it's a key you leave at the door, but one that refuses to be turned by anyone but you.

Even though this sounds a lot weaker than the FIDO2/PKCS#11 model, TPM2 still brings benefits for securing your systems: because the cryptographic key material stored in TPM2 devices cannot be extracted (at least that's the theory), if you bind your hard disk encryption to it, attackers cannot just copy your disk and analyze it offline — they always need access to the TPM2 chip too to have a chance to acquire the necessary cryptographic keys. Thus, they can still steal your whole PC and analyze it, but they cannot just copy the disk without you noticing and analyze the copy.

Moreover, you can bind the ability to unlock the harddisk to specific software versions: for example you could say that only your trusted Fedora Linux can unlock the device, but not any arbitrary OS some hacker might boot from a USB stick they plugged in. Thus, if you trust your OS vendor, you can entrust storage unlocking to the vendor's OS together with your TPM2 device, and thus can be reasonably sure intruders cannot decrypt your data unless they both hack your OS vendor and steal/break your TPM2 chip.

Here's how you enroll your LUKS2 volume with your TPM2 chip:

# systemd-cryptenroll --tpm2-device=auto --tpm2-pcrs=7 /dev/sda5

This looks almost as straightforward as the two earlier systemd-cryptenroll command lines — if it wasn't for the --tpm2-pcrs= part. With that option you can specify which TPM2 PCRs you want to bind the enrollment to. TPM2 PCRs are a set of (typically 24) hash values that every TPM2-equipped system calculates at boot from all the software that is invoked during the boot sequence, in a secure, unfakable way (this is called "measurement"). If you bind unlocking to a specific value of a specific PCR, you thus require that the system follows the same sequence of software at boot in order to re-acquire the disk encryption key. Sounds complex? Well, that's because it is.
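
As an illustration: PCR 7 (used above) covers the Secure Boot policy state of the firmware, while PCR 0 covers the core firmware code itself. To bind the enrollment to both, you could pass a '+'-separated list of PCR indexes instead (whether binding to additional PCRs is wise depends on how often the measured components change on your system):

# systemd-cryptenroll --tpm2-device=auto --tpm2-pcrs=0+7 /dev/sda5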

For now, let's see how we have to modify your /etc/crypttab to unlock via TPM2:

myvolume /dev/sda5 - tpm2-device=auto

This part is easy again: the tpm2-device= option is what tells systemd-cryptsetup to use the TPM2 metadata from the LUKS2 header and to wait for the TPM2 device to show up.

Bonus: Recovery Key Enrollment

FIDO2, PKCS#11 and TPM2 security tokens and chips pair well with recovery keys: since you don't need to type in your password every day anymore, it makes sense to get rid of it, and instead enroll a high-entropy recovery key you then print out or scan off screen and store in a safe, physical location. i.e. forget about good ol' passphrase-based unlocking, go for FIDO2 plus recovery key instead! Here's how you do it:

# systemd-cryptenroll --recovery-key /dev/sda5

This will generate a key, enroll it in the LUKS2 volume, show it to you on screen and generate a QR code you may scan off screen if you like. The key has very high entropy, and can be entered wherever you can enter a passphrase. Because of that you don't have to modify /etc/crypttab to make the recovery key work.

Future

There's still plenty of room for further improvement in all of this. In particular for the TPM2 case: what the text above doesn't really mention is that binding your encrypted volume unlocking to specific software versions (i.e. kernel + initrd + OS versions) actually sucks hard: if you naively update your system to newer versions you might lose access to your TPM2-enrolled keys (which isn't terrible, after all you did enroll a recovery key — right? — which you then can use to regain access). To solve this some more integration with distributions would be necessary: whenever they upgrade the system they'd have to make sure to update the TPM2 enrollment — with the PCR hashes matching the new version. And whenever they remove an old version of the system they need to remove the old TPM2 enrollment. Alternatively, TPM2 also knows a concept of signed PCR hash values. In this mode the distro could just ship a set of PCR signatures which would unlock the TPM2 keys. (But quite frankly I don't really see the point: whether you drop in a signature file on each system update, or enroll a new set of PCR hashes in the LUKS2 header doesn't make much of a difference.) Either way, to make TPM2 enrollment smooth some more integration work with your distribution's system update mechanisms needs to happen. And yes, because of this OS updating complexity the example above — where I referenced your trusty Fedora Linux — doesn't actually work IRL (yet? hopefully…). Nothing updates the enrollment automatically after you initially enrolled it, hence after the first kernel/initrd update you have to manually re-enroll things again, and again, and again … after every update.

The TPM2 could also be used for other kinds of key policies, which we might look into adding later, too. For example, Windows uses TPM2 to allow short (4 digits or so) "PINs" for unlocking the harddisk, i.e. a kind of low-entropy password you type in. The reason this is reasonably safe is that in this case the PIN is passed to the TPM2, which enforces that no more than a limited number of unlock attempts may be made within some time frame, and that after too many attempts the PIN is invalidated altogether. This makes dictionary attacks much harder (which would otherwise be easy given the short length of the PINs).

Postscript

(BTW: Yubico sent me two YubiKeys for testing, Nitrokey a Nitrokey FIDO2, and AuthenTrend three ATKey.Pro tokens, thank you! — That's why you see all those references to YubiKey/Nitrokey/AuthenTrend devices in the text above: it's the hardware I had to test this with. That said, I also tested the FIDO2 stuff with a SoloKey I bought, where it also worked fine. And yes, you!, other vendors!, who might be reading this, please send me your security tokens for free, too, and I might test things with them as well. No promises though. And I am not going to give them back, if you do, sorry. ;-))


ASG! 2019 CfP Re-Opened!

The All Systems Go! 2019 Call for Participation Re-Opened for ONE DAY!

Due to popular request we have re-opened the Call for Participation (CFP) for All Systems Go! 2019 for one day. It will close again TODAY, on the 15th of July 2019, midnight Central European Summer Time! If you missed the deadline so far, we’d like to invite you to submit your proposals for consideration to the CFP submission site quickly! (And yes, this is the last extension, there are not going to be any more extensions.)


All Systems Go! is everybody's favourite low-level Userspace Linux conference, taking place in Berlin, Germany on September 20-22, 2019.

For more information please visit our conference website!


Walkthrough for Portable Services in Go

Portable Services Walkthrough (Go Edition)

A few months ago I posted a blog story with a walkthrough of systemd Portable Services. The example service given was written in C, and the image was built with mkosi. In this blog story I'd like to revisit the exercise, but this time focus on a different aspect: modern programming languages like Go and Rust push users a lot more towards static linking of libraries than the usual dynamic linking preferred by C (at least in the way C is used by traditional Linux distributions).

Static linking means we can greatly simplify image building: if we don't have to link against shared libraries during runtime we don't have to include them in the portable service image. And that means pretty much all the need to build an image from a Linux distribution of some kind goes away, as we'll have next to no dependencies that would require us to rely on a distribution package manager or distribution packages. In fact, as it turns out, we only need as few as three files in the portable service image to be fully functional.

So, let's have a closer look how such an image can be put together. All of the following is available in this git repository.

A Simple Go Service

Let's start with a simple Go service, an HTTP service that simply counts how often a page from it is requested. Here are the sources: main.go — note that I am not a seasoned Go programmer, hence please be gracious.

The service implements systemd's socket activation protocol, and thus can receive bound TCP listener sockets from systemd, using the $LISTEN_PID and $LISTEN_FDS environment variables.
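
The Go sources implement that protocol by hand, but to illustrate what it boils down to, here's roughly the equivalent in C using systemd's sd-daemon API (a sketch, assuming you link against libsystemd; it's not the walkthrough's actual code):

#include <stdio.h>
#include <systemd/sd-daemon.h>

int main(void) {
        /* sd_listen_fds() verifies that $LISTEN_PID matches our PID and
         * returns the number of fds passed in by systemd; they start at
         * SD_LISTEN_FDS_START, i.e. fd #3. */
        int n = sd_listen_fds(0);

        if (n != 1) {
                fprintf(stderr, "Expected exactly one listening socket.\n");
                return 1;
        }

        int listen_fd = SD_LISTEN_FDS_START;
        /* … from here on, accept(2) connections from listen_fd … */
        (void) listen_fd;
        return 0;
}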

The service will store the counter data in the directory indicated in the $STATE_DIRECTORY environment variable, which happens to be an environment variable current systemd versions set based on the StateDirectory= setting in service files.

Two Simple Unit Files

When a service shall be managed by systemd a unit file is required. Since the service we are putting together shall be socket activatable, we even have two: portable-walkthrough-go.service (the description of the service binary itself) and portable-walkthrough-go.socket (the description of the sockets to listen on for the service).

These units are not particularly remarkable: the .service file primarily contains the command line to invoke and a StateDirectory= setting to make sure the service when invoked gets its own private state directory under /var/lib/ (and the $STATE_DIRECTORY environment variable is set to the resulting path). The .socket file simply lists 8088 as the TCP/IP port to listen on.
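
Condensed, the two units boil down to something like this (a sketch with illustrative paths, not a verbatim copy of the repository's files):

portable-walkthrough-go.service:

[Unit]
Description=A simple example Go HTTP service

[Service]
ExecStart=/usr/local/lib/portable-walkthrough-go/portable-walkthrough-go
StateDirectory=portable-walkthrough-go

portable-walkthrough-go.socket:

[Unit]
Description=Listening socket for the simple example Go HTTP service

[Socket]
ListenStream=8088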

An OS Description File

OS images (and that includes portable service images) generally should include an os-release file. Usually, that is provided by the distribution. Since we are building an image without any distribution let's write our own version of such a file. Later on we can use the portablectl inspect command to have a look at this metadata of our image.

Putting it All Together

The four files described above are already all the files we need to build our image. Let's now put the portable service image together. For that I've written a Makefile. It contains two relevant rules: the first one builds the static binary from the Go program sources. The second one then puts together a squashfs file system combining the following (a sketch of this step follows after the list):

  1. The compiled, statically linked service binary
  2. The two systemd unit files
  3. The os-release file
  4. A couple of empty directories such as /proc/, /sys/, /dev/ and so on that need to be over-mounted with the respective kernel API file systems. We need to create them as empty directories here since Linux insists that a directory exist in order to over-mount it, and since the image we are building is going to be an immutable read-only image (squashfs) these directories cannot be created dynamically when the portable image is mounted.
  5. Two empty files /etc/resolv.conf and /etc/machine-id that can be over-mounted with the same files from the host.
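
For illustration, the assembly step could look roughly like this, assuming mksquashfs from squashfs-tools (directory and file names are made up for this sketch, the real Makefile may differ):

$ mkdir -p rootfs/proc rootfs/sys rootfs/dev rootfs/run rootfs/tmp rootfs/var/tmp rootfs/etc
$ touch rootfs/etc/resolv.conf rootfs/etc/machine-id
(…copy the service binary, the two unit files and os-release into rootfs…)
$ mksquashfs rootfs portable-walkthrough-go.raw -noappend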

And that's already it. After a quick make we'll have our portable service image portable-walkthrough-go.raw and are ready to go.

Trying it out

Let's now attach the portable service image to our host system:

# portablectl attach ./portable-walkthrough-go.raw
(Matching unit files with prefix 'portable-walkthrough-go'.)
Created directory /etc/systemd/system.attached.
Created directory /etc/systemd/system.attached/portable-walkthrough-go.socket.d.
Written /etc/systemd/system.attached/portable-walkthrough-go.socket.d/20-portable.conf.
Copied /etc/systemd/system.attached/portable-walkthrough-go.socket.
Created directory /etc/systemd/system.attached/portable-walkthrough-go.service.d.
Written /etc/systemd/system.attached/portable-walkthrough-go.service.d/20-portable.conf.
Created symlink /etc/systemd/system.attached/portable-walkthrough-go.service.d/10-profile.conf → /usr/lib/systemd/portable/profile/default/service.conf.
Copied /etc/systemd/system.attached/portable-walkthrough-go.service.
Created symlink /etc/portables/portable-walkthrough-go.raw → /home/lennart/projects/portable-walkthrough-go/portable-walkthrough-go.raw.

The portable service image is now attached to the host, which means we can now go and start it (or even enable it):

# systemctl start portable-walkthrough-go.socket

Let's see if our little web service works, by doing an HTTP request on port 8088:

# curl localhost:8088
Hello! You are visitor #1!

Let's try this again, to check if it counts correctly:

# curl localhost:8088
Hello! You are visitor #2!

Nice! It worked. Let's now stop the service again, and detach the image again:

# systemctl stop portable-walkthrough-go.service portable-walkthrough-go.socket
# portablectl detach portable-walkthrough-go
Removed /etc/systemd/system.attached/portable-walkthrough-go.service.
Removed /etc/systemd/system.attached/portable-walkthrough-go.service.d/10-profile.conf.
Removed /etc/systemd/system.attached/portable-walkthrough-go.service.d/20-portable.conf.
Removed /etc/systemd/system.attached/portable-walkthrough-go.service.d.
Removed /etc/systemd/system.attached/portable-walkthrough-go.socket.
Removed /etc/systemd/system.attached/portable-walkthrough-go.socket.d/20-portable.conf.
Removed /etc/systemd/system.attached/portable-walkthrough-go.socket.d.
Removed /etc/portables/portable-walkthrough-go.raw.
Removed /etc/systemd/system.attached.

And there we go, the portable image file is detached from the host again.

A Couple of Notes

  1. Of course, this is a simplistic example: in real life services will consist of more than one compiled file, even when statically linked. But you get the idea, and it's very easy to extend the example above to include any additional, auxiliary files in the portable service image.

  2. The service is very nicely sandboxed during runtime: while it runs as regular service on the host (and you thus can watch its logs or do resource management on it like you would do for all other systemd services), it runs in a very restricted environment under a dynamically assigned UID that ceases to exist when the service is stopped again.

  3. Originally I wanted to make the service not only socket activatable but also implement exit-on-idle, i.e. add logic so that the service terminates on its own when there's no ongoing HTTP connection for a while. I couldn't figure out how to do this race-freely in Go though, but I am sure an interested reader might want to add that? By combining socket activation with exit-on-idle we can turn this project into an exercise of putting together an extremely resource-friendly and robust service architecture: the service is started only when needed and terminates when no longer needed. This would allow packing services at a much higher density even on systems with few resources.

  4. While the basic concepts of portable services have been around since systemd 239, it's best to try the above with systemd 241 or newer since the portable service logic received a number of fixes since then.

Further Reading

A low-level document introducing Portable Services is shipped along with systemd.

Please have a look at the blog story from a few months ago that did something very similar with a service written in C.

There are also relevant manual pages: portablectl(1) and systemd-portabled(8).


ASG! 2018 Tickets

All Systems Go! 2018 Tickets Selling Out Quickly!

Buy your tickets for All Systems Go! 2018 soon, they are quickly selling out! The conference takes place on September 28-30, in Berlin, Germany, in a bit over two weeks.

Why should you attend? If you are interested in low-level Linux userspace, then All Systems Go! is the right conference for you. It covers all topics relevant to foundational open-source Linux technologies. For details on the covered topics see our schedule for day #1 and for day #2.

For more information please visit our conference website!

See you in Berlin!


ASG! 2018 CfP Closes TODAY

The All Systems Go! 2018 Call for Participation Closes TODAY!

The Call for Participation (CFP) for All Systems Go! 2018 will close TODAY, on the 30th of July! We’d like to invite you to submit your proposals for consideration to the CFP submission site quickly!


All Systems Go! is everybody's favourite low-level Userspace Linux conference, taking place in Berlin, Germany on September 28-30, 2018.

For more information please visit our conference website!


ASG! 2018 CfP Closes Soon

The All Systems Go! 2018 Call for Participation Closes in One Week!

The Call for Participation (CFP) for All Systems Go! 2018 will close in one week, on the 30th of July! We’d like to invite you to submit your proposals for consideration to the CFP submission site quickly!


Notification of acceptance and non-acceptance will go out within 7 days of the closing of the CFP.

All topics relevant to foundational open-source Linux technologies are welcome. In particular, however, we are looking for proposals including, but not limited to, the following topics:

  • Low-level container executors and infrastructure
  • IoT and embedded OS infrastructure
  • BPF and eBPF filtering
  • OS, container, IoT image delivery and updating
  • Building Linux devices and applications
  • Low-level desktop technologies
  • Networking
  • System and service management
  • Tracing and performance measuring
  • IPC and RPC systems
  • Security and Sandboxing

While our focus is definitely more on the user-space side of things, talks about kernel projects are welcome, as long as they have a clear and direct relevance for user-space.

For more information please visit our conference website!


Walkthrough for Portable Services

Portable Services with systemd v239

systemd v239 contains a great number of new features. One of them is first class support for Portable Services. In this blog story I'd like to shed some light on what they are and why they might be interesting for your application.

What are "Portable Services"?

The "Portable Service" concept takes inspiration from classic chroot() environments as well as container management and brings a number of their features to more regular system service management.

While the definition of what a "container" really is is hotly debated, I figure people can generally agree that the "container" concept primarily provides two major features:

  1. Resource bundling: a container generally brings its own file system tree along, bundling any shared libraries and other resources it might need along with the main service executables.

  2. Isolation and sand-boxing: a container operates in a name-spaced environment that is relatively detached from the host. Besides living in its own file system namespace it usually also has its own user database, process tree and so on. Access from the container to the host is limited with various security technologies.

Of these two concepts the first one is also what traditional UNIX chroot() environments are about.

Both resource bundling and isolation/sand-boxing are concepts systemd has implemented to varying degrees for a longer time. Specifically, RootDirectory= and RootImage= have been around for a long time, and so have been the various sand-boxing features systemd provides. The Portable Services concept builds on that, putting these features together in a new, integrated way to make them more accessible and usable.

OK, so what precisely is a "Portable Service"?

Much like a container image, a portable service on disk can be just a directory tree that contains service executables and all their dependencies, in a hierarchy resembling the normal Linux directory hierarchy. A portable service can also be a raw disk image containing a file system with such a tree (which can be mounted via a loop-back block device), or multiple file systems (in which case they need to follow the Discoverable Partitions Specification and be located within a GPT partition table). Regardless of whether the portable service on disk is a simple directory tree or a raw disk image, let's call this concept the portable service image.

Such images can be generated with any tool typically used for the purpose of installing OSes inside some directory, for example dnf --installroot= or debootstrap. There are very few requirements made on these trees, except the following two:

  1. The tree should carry systemd unit files for relevant services in them.

  2. The tree should carry /usr/lib/os-release (or /etc/os-release) OS release information.

Of course, as you might notice, OS trees generated from any of today's big distributions generally qualify for these two requirements without any further modification, as pretty much all of them adopted /usr/lib/os-release and tend to ship their major services with systemd unit files.

A portable service image generated like this can be "attached" or "detached" from a host:

  1. "Attaching" an image to a host is done through the new portablectl attach command. This command dissects the image, reading the os-release information, and searching for unit files in them. It then copies relevant unit files out of the images and into /etc/systemd/system/. After that it augments any copied service unit files in two ways: a drop-in adding a RootDirectory= or RootImage= line is added in so that even though the unit files are now available on the host when started they run the referenced binaries from the image. It also symlinks in a second drop-in which is called a "profile", which is supposed to carry additional security settings to enforce on the attached services, to ensure the right amount of sand-boxing.

  2. "Detaching" an image from the host is done through portable detach. It reverses the steps above: the unit files copied out are removed again, and so are the two drop-in files generated for them.

While a portable service is attached, its relevant unit files are made available on the host like any others: they will appear in systemctl list-unit-files, you can enable and disable them, you can start them and stop them. You can extend them with systemctl edit. You can introspect them. You can apply resource management to them like to any other service, and you can process their logs like any other service and so on. That's because they really are native systemd services, except that they have a 'twist' if you so will: they have tougher security by default and store their resources in a root directory or image.

And that's already the essence of what Portable Services are.

A couple of interesting points:

  1. Even though the focus is on shipping service unit files in portable service images, you can actually ship timer units, socket units, target units, path units in portable services too. This means you can very naturally do time, socket and path based activation. It's also entirely fine to ship multiple service units in the same image, in case you have more complex applications.

  2. This concept introduces zero new metadata. Unit files are an existing concept, as are os-release files, and — in case you opt for raw disk images — GPT partition tables are already established too. This also means existing tools to generate images can be reused for building portable service images to a large degree as no completely new artifact types need to be generated.

  3. Because the Portable Services concept introduces zero new metadata and just builds on existing security and resource bundling features of systemd, it's implemented in a set of distinct tools, relatively disconnected from the rest of systemd. Specifically, the main user-facing command is portablectl, and the actual operations are implemented in systemd-portabled.service. If you so will, portable services are a true add-on to systemd, just making a specific work-flow nicer to use than with the basic operations systemd otherwise provides. Also note that systemd-portabled provides bus APIs accessible to any program that wants to interface with it; portablectl is just one tool that happens to be shipped along with systemd.

  4. Since Portable Services are a feature we only added very recently we wanted to keep some freedom to make changes still. Due to that we decided to install the portablectl command into /usr/lib/systemd/ for now, so that it does not appear in $PATH by default. This means, for now you have to invoke it with a full path: /usr/lib/systemd/portablectl. We expect to move it into /usr/bin/ very soon though, and make it a fully supported interface of systemd.

  5. You may wonder which unit files contained in a portable service image are the ones considered "relevant" and are actually copied out by the portablectl attach operation. Currently, this is derived from the image name. Let's say you have an image stored in a directory /var/lib/portables/foobar_4711/ (or alternatively in a raw image /var/lib/portables/foobar_4711.raw). In that case the unit files copied out match the pattern foobar*.service, foobar*.socket, foobar*.target, foobar*.path, foobar*.timer.

  6. The Portable Services concept does not define any specific method how images get on the deployment machines, that's entirely up to administrators. You can just scp them there, or wget them. You could even package them as RPMs and then deploy them with dnf if you feel adventurous.

  7. Portable service images can reside in any directory you like. However, if you place them in /var/lib/portables/ then portablectl will find them easily and can show you a list of images you can attach and suchlike.

  8. Attaching a portable service image can be done persistently, so that it remains attached on subsequent boots (which is the default), or it can be attached only until the next reboot, by passing --runtime to portablectl.

  9. Because portable service images are ultimately just regular OS images, it's natural and easy to build a single image that can be used in three different ways:

    1. It can be attached to any host as a portable service image.

    2. It can be booted as OS container, for example in a container manager like systemd-nspawn.

    3. It can be booted as host system, for example on bare metal or in a VM manager.

    Of course, to qualify for the latter two the image needs to contain more than just the service binaries, the os-release file and the unit files. To be bootable in an OS container manager such as systemd-nspawn the image needs to contain an init system of some form, for example systemd. To be bootable on bare metal or as a VM it also needs a boot loader of some form, for example systemd-boot.

Profiles

In the previous section the "profile" concept was briefly mentioned. Since they are a major feature of the Portable Services concept, they deserve some focus. A "profile" is ultimately just a pre-defined drop-in file for unit files that are attached to a host. They are supposed to mostly contain sand-boxing and security settings, but may actually contain any other settings, too. When a portable service is attached a suitable profile has to be selected. If none is selected explicitly, the default profile called default is used. systemd ships with four different profiles out of the box:

  1. The default profile provides a medium level of security. It contains settings to drop capabilities, enforce system call filters, restrict many kernel interfaces and mount various file systems read-only.

  2. The strict profile is similar to the default profile, but generally uses the most restrictive sand-boxing settings. For example networking is turned off and access to AF_NETLINK sockets is prohibited.

  3. The trusted profile is the least strict of them all. In fact it makes almost no restrictions at all. A service run with this profile has basically full access to the host system.

  4. The nonetwork profile is mostly identical to default, but also turns off network access.

Note that the profile is selected at the time the portable service image is attached, and it applies to all service files attached, in case multiple are shipped in the same image. Thus, the sand-boxing restrictions to enforce are selected by the administrator attaching the image, not by the image vendor.

Additional profiles can be defined easily by the administrator, if needed. We might also add additional profiles sooner or later to be shipped with systemd out of the box.

What's the use-case for this? If I have containers, why should I bother?

Portable Services are primarily intended to cover use-cases where code should more feel like "extensions" to the host system rather than live in disconnected, separate worlds. The profile concept is supposed to be tunable to the exact right amount of integration or isolation needed for an application.

In the container world the concept of "super-privileged containers" has been touted a lot, i.e. containers that run with full privileges. It's precisely that use-case that portable services are intended for: extensions to the host OS, that default to isolation, but can optionally get as much access to the host as needed, and can naturally benefit from the full functionality of the host. The concept should hence be useful for all kinds of low-level system software that isn't shipped with the OS itself but needs varying degrees of integration with it. Besides servers and appliances this should be particularly interesting for IoT and embedded devices.

Because portable services are just a relatively small extension to the way system services are otherwise managed, they can be treated like regular services for almost all use-cases: they will appear alongside regular services in all tools that can introspect systemd unit data, and can be managed the same way when it comes to logging, resource management, runtime life-cycles and so on.

Portable services are a very generic concept. While the original use-case is OS extensions, it's of course entirely up to you and other users to use them in a suitable way of your choice.

Walkthrough

Let's have a look how this all can be used. We'll start with building a portable service image from scratch, before we attach, enable and start it on a host.

Building a Portable Service image

As mentioned, you can use any tool you like that can create OS trees or raw images for building Portable Service images, for example debootstrap or dnf --installroot=. For this example walkthrough we'll use mkosi, which is ultimately just a fancy wrapper around dnf and debootstrap but makes a number of things particularly easy when repetitively building images from source trees.

I have pushed everything necessary to reproduce this walkthrough locally to a GitHub repository. Let's check it out:

$ git clone https://github.com/systemd/portable-walkthrough.git

Let's have a look in the repository:

  1. First of all, walkthroughd.c is the main source file of our little service. To keep things simple it's written in C, but it could be in any language of your choice. The daemon as implemented won't do much: it just starts up and waits for SIGTERM, at which point it will shut down. It's ultimately useless, but hopefully illustrates how this all fits together. (A minimal sketch of what such a daemon boils down to follows right after this list.) The C code has no dependencies besides libc.

  2. walkthroughd.service is a systemd unit file that starts our little daemon. It's a simple service, hence the unit file is trivial.

  3. Makefile is a short make build script to build the daemon binary. It's pretty trivial, too: it just takes the C file and builds a binary from it. It can also install the daemon. It places the binary in /usr/local/lib/walkthroughd/walkthroughd (why not in /usr/local/bin? because it's not a user-facing binary but a system service binary), and its unit file in /usr/local/lib/systemd/walkthroughd.service. If you want to test the daemon on the host you can simply run make and then ./walkthroughd in order to check everything works.

  4. mkosi.default is a file that tells mkosi how to build the image. We opt for a Fedora-based image here (but we might as well have used Debian, or any other supported distribution). We need no particular packages during runtime (after all we only depend on libc), but during the build phase we need gcc and make, hence these are the only packages we list in BuildPackages=.

  5. mkosi.build is a shell script that is invoked during mkosi's build logic. All it does is invoke make and make install to build and install our little daemon, and afterwards it extends the distribution-supplied /etc/os-release file with an additional field that describes our portable service a bit.
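
As promised, here's a minimal C sketch of what a daemon like walkthroughd boils down to (illustrative only, not the repository's exact code): it blocks SIGTERM, announces that it initialized, and then waits for the signal synchronously.

#include <signal.h>
#include <stdio.h>

int main(void) {
        sigset_t ss;
        int sig;

        /* Block SIGTERM so we can wait for it synchronously below */
        sigemptyset(&ss);
        sigaddset(&ss, SIGTERM);
        sigprocmask(SIG_BLOCK, &ss, NULL);

        printf("Initializing.\n");
        fflush(stdout); /* make sure the message reaches the journal */

        /* Wait until systemd sends us SIGTERM to shut down */
        sigwait(&ss, &sig);
        return 0;
}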

Let's now use this to build the portable service image. For that we use the mkosi tool. It's sufficient to invoke it without parameters to build the first image: it will automatically discover mkosi.default and mkosi.build, which tell it what to do. (Note that if you work on a project like this for a longer time, mkosi -if is probably the better command to use, as it speeds up building substantially by using an incremental build mode.) mkosi will download the necessary RPMs, and put them all together. It will build our little daemon inside the image and after all that's done it will output the resulting image: walkthroughd_1.raw.

Because we opted to build a GPT raw disk image in mkosi.default this file is actually a raw disk image containing a GPT partition table. You can use fdisk -l walkthroughd_1.raw to enumerate the partition table. You can also use systemd-nspawn -i walkthroughd_1.raw to explore the image quickly if you need.

Using the Portable Service Image

Now that we have a portable service image, let's see how we can attach, enable and start the service included within it.

First, let's attach the image:

# /usr/lib/systemd/portablectl attach ./walkthroughd_1.raw
(Matching unit files with prefix 'walkthroughd'.)
Created directory /etc/systemd/system/walkthroughd.service.d.
Written /etc/systemd/system/walkthroughd.service.d/20-portable.conf.
Created symlink /etc/systemd/system/walkthroughd.service.d/10-profile.conf → /usr/lib/systemd/portable/profile/default/service.conf.
Copied /etc/systemd/system/walkthroughd.service.
Created symlink /etc/portables/walkthroughd_1.raw → /home/lennart/projects/portable-walkthrough/walkthroughd_1.raw.

The command will show you exactly what it has been doing: it just copied the main service file out, and added the two drop-ins, as expected.

Let's see if the unit is now available on the host, just like a regular unit, as promised:

# systemctl status walkthroughd.service
● walkthroughd.service - A simple example service
   Loaded: loaded (/etc/systemd/system/walkthroughd.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/walkthroughd.service.d
           └─10-profile.conf, 20-portable.conf
   Active: inactive (dead)

Nice, it worked. We see that the unit file is available and that systemd correctly discovered the two drop-ins. The unit is neither enabled nor started however. Yes, attaching a portable service image doesn't imply enabling or starting the units. It just means the unit files contained in the image are made available to the host. It's up to the administrator to then enable them (so that they are automatically started when needed, for example at boot), and/or start them (in case they shall run right-away).

Let's now enable and start the service in one step:

# systemctl enable --now walkthroughd.service
Created symlink /etc/systemd/system/multi-user.target.wants/walkthroughd.service → /etc/systemd/system/walkthroughd.service.

Let's check if it's running:

# systemctl status walkthroughd.service
● walkthroughd.service - A simple example service
   Loaded: loaded (/etc/systemd/system/walkthroughd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/walkthroughd.service.d
           └─10-profile.conf, 20-portable.conf
   Active: active (running) since Wed 2018-06-27 17:55:30 CEST; 4s ago
 Main PID: 45003 (walkthroughd)
    Tasks: 1 (limit: 4915)
   Memory: 4.3M
   CGroup: /system.slice/walkthroughd.service
           └─45003 /usr/local/lib/walkthroughd/walkthroughd

Jun 27 17:55:30 sigma walkthroughd[45003]: Initializing.

Perfect! We can see that the service is now enabled and running. The daemon is running as PID 45003.

Now that we verified that all is good, let's stop, disable and detach the service again:

# systemctl disable --now walkthroughd.service
Removed /etc/systemd/system/multi-user.target.wants/walkthroughd.service.
# /usr/lib/systemd/portablectl detach ./walkthroughd_1.raw
Removed /etc/systemd/system/walkthroughd.service.
Removed /etc/systemd/system/walkthroughd.service.d/10-profile.conf.
Removed /etc/systemd/system/walkthroughd.service.d/20-portable.conf.
Removed /etc/systemd/system/walkthroughd.service.d.
Removed /etc/portables/walkthroughd_1.raw.

And finally, let's see that it's really gone:

# systemctl status walkthroughd
Unit walkthroughd.service could not be found.

Perfect! It worked!

I hope the above gets you started with Portable Services. If you have further questions, please contact our mailing list.

Further Reading

A more low-level document explaining details is shipped along with systemd.

There are also relevant manual pages: portablectl(1) and systemd-portabled(8).

For further information about mkosi see its homepage.


All Systems Go! 2018 CfP Open

The All Systems Go! 2018 Call for Participation is Now Open!

The Call for Participation (CFP) for All Systems Go! 2018 is now open. We’d like to invite you to submit your proposals for consideration to the CFP submission site.


The CFP will close on July 30th. Notification of acceptance and non-acceptance will go out within 7 days of the closing of the CFP.

All topics relevant to foundational open-source Linux technologies are welcome. In particular, however, we are looking for proposals including, but not limited to, the following topics:

  • Low-level container executors and infrastructure
  • IoT and embedded OS infrastructure
  • BPF and eBPF filtering
  • OS, container, IoT image delivery and updating
  • Building Linux devices and applications
  • Low-level desktop technologies
  • Networking
  • System and service management
  • Tracing and performance measuring
  • IPC and RPC systems
  • Security and Sandboxing

While our focus is definitely more on the user-space side of things, talks about kernel projects are welcome, as long as they have a clear and direct relevance for user-space.

For more information please visit our conference website!


All Systems Go! 2017 Videos Online!

For those living under a rock, the videos from everybody's favourite Userspace Linux Conference All Systems Go! 2017 are now available online.

All videos

The videos for my own two talks are available here:

Synchronizing Images with casync (Slides)

Containers without a Container Manager, with systemd (Slides)

Of course, this is the stellar work of the CCC VOC folks, who are hard to beat when it comes to videotaping of community conferences.
