Index | Archives | Atom Feed | RSS Feed

casync — A tool for distributing file system images

Introducing casync

In the past months I have been working on a new project: casync. casync takes inspiration from the popular rsync file synchronization tool as well as the probably even more popular git revision control system. It combines the idea of the rsync algorithm with the idea of git-style content-addressable file systems, and creates a new system for efficiently storing and delivering file system images, optimized for high-frequency update cycles over the Internet. Its current focus is on delivering IoT, container, VM, application, portable service or OS images, but I hope to extend it later in a generic fashion to become useful for backups and home directory synchronization as well (but more about that later).

The basic technological building blocks casync is built from are neither new nor particularly innovative (at least not anymore), however the way casync combines them is different from existing tools, and that's what makes it useful for a variety of use-cases that other tools can't cover that well.

Why?

I created casync after studying how today's popular tools store and deliver file system images. To briefly name a few: Docker has a layered tarball approach, OSTree serves the individual files directly via HTTP and maintains packed deltas to speed up updates, while other systems operate on the block layer and place raw squashfs images (or other archival file systems, such as IS09660) for download on HTTP shares (in the better cases combined with zsync data).

Neither of these approaches appeared fully convincing to me when used in high-frequency update cycle systems. In such systems, it is important to optimize towards a couple of goals:

  1. Most importantly, make updates cheap traffic-wise (for this most tools use image deltas of some form)
  2. Put boundaries on disk space usage on servers (keeping deltas between all version combinations clients might want to run updates between, would suggest keeping an exponentially growing amount of deltas on servers)
  3. Put boundaries on disk space usage on clients
  4. Be friendly to Content Delivery Networks (CDNs), i.e. serve neither too many small nor too many overly large files, and only require the most basic form of HTTP. Provide the repository administrator with high-level knobs to tune the average file size delivered.
  5. Simplicity to use for users, repository administrators and developers

I don't think any of the tools mentioned above are really good on more than a small subset of these points.

Specifically: Docker's layered tarball approach dumps the "delta" question onto the feet of the image creators: the best way to make your image downloads minimal is basing your work on an existing image clients might already have, and inherit its resources, maintaining full history. Here, revision control (a tool for the developer) is intermingled with update management (a concept for optimizing production delivery). As container histories grow individual deltas are likely to stay small, but on the other hand a brand-new deployment usually requires downloading the full history onto the deployment system, even though there's no use for it there, and likely requires substantially more disk space and download sizes.

OSTree's serving of individual files is unfriendly to CDNs (as many small files in file trees cause an explosion of HTTP GET requests). To counter that OSTree supports placing pre-calculated delta images between selected revisions on the delivery servers, which means a certain amount of revision management, that leaks into the clients.

Delivering direct squashfs (or other file system) images is almost beautifully simple, but of course means every update requires a full download of the newest image, which is both bad for disk usage and generated traffic. Enhancing it with zsync makes this a much better option, as it can reduce generated traffic substantially at very little cost of history/meta-data (no explicit deltas between a large number of versions need to be prepared server side). On the other hand server requirements in disk space and functionality (HTTP Range requests) are minus points for the use-case I am interested in.

(Note: all the mentioned systems have great properties, and it's not my intention to badmouth them. They only point I am trying to make is that for the use case I care about — file system image delivery with high high frequency update-cycles — each system comes with certain drawbacks.)

Security & Reproducibility

Besides the issues pointed out above I wasn't happy with the security and reproducibility properties of these systems. In today's world where security breaches involving hacking and breaking into connected systems happen every day, an image delivery system that cannot make strong guarantees regarding data integrity is out of date. Specifically, the tarball format is famously nondeterministic: the very same file tree can result in any number of different valid serializations depending on the tool used, its version and the underlying OS and file system. Some tar implementations attempt to correct that by guaranteeing that each file tree maps to exactly one valid serialization, but such a property is always only specific to the tool used. I strongly believe that any good update system must guarantee on every single link of the chain that there's only one valid representation of the data to deliver, that can easily be verified.

What casync Is

So much about the background why I created casync. Now, let's have a look what casync actually is like, and what it does. Here's the brief technical overview:

Encoding: Let's take a large linear data stream, split it into variable-sized chunks (the size of each being a function of the chunk's contents), and store these chunks in individual, compressed files in some directory, each file named after a strong hash value of its contents, so that the hash value may be used to as key for retrieving the full chunk data. Let's call this directory a "chunk store". At the same time, generate a "chunk index" file that lists these chunk hash values plus their respective chunk sizes in a simple linear array. The chunking algorithm is supposed to create variable, but similarly sized chunks from the data stream, and do so in a way that the same data results in the same chunks even if placed at varying offsets. For more information see this blog story.

Decoding: Let's take the chunk index file, and reassemble the large linear data stream by concatenating the uncompressed chunks retrieved from the chunk store, keyed by the listed chunk hash values.

As an extra twist, we introduce a well-defined, reproducible, random-access serialization format for file trees (think: a more modern tar), to permit efficient, stable storage of complete file trees in the system, simply by serializing them and then passing them into the encoding step explained above.

Finally, let's put all this on the network: for each image you want to deliver, generate a chunk index file and place it on an HTTP server. Do the same with the chunk store, and share it between the various index files you intend to deliver.

Why bother with all of this? Streams with similar contents will result in mostly the same chunk files in the chunk store. This means it is very efficient to store many related versions of a data stream in the same chunk store, thus minimizing disk usage. Moreover, when transferring linear data streams chunks already known on the receiving side can be made use of, thus minimizing network traffic.

Why is this different from rsync or OSTree, or similar tools? Well, one major difference between casync and those tools is that we remove file boundaries before chunking things up. This means that small files are lumped together with their siblings and large files are chopped into pieces, which permits us to recognize similarities in files and directories beyond file boundaries, and makes sure our chunk sizes are pretty evenly distributed, without the file boundaries affecting them.

The "chunking" algorithm is based on a the buzhash rolling hash function. SHA256 is used as strong hash function to generate digests of the chunks. xz is used to compress the individual chunks.

Here's a diagram, hopefully explaining a bit how the encoding process works, wasn't it for my crappy drawing skills:

Diagram

The diagram shows the encoding process from top to bottom. It starts with a block device or a file tree, which is then serialized and chunked up into variable sized blocks. The compressed chunks are then placed in the chunk store, while a chunk index file is written listing the chunk hashes in order. (The original SVG of this graphic may be found here.)

Details

Note that casync operates on two different layers, depending on the use-case of the user:

  1. You may use it on the block layer. In this case the raw block data on disk is taken as-is, read directly from the block device, split into chunks as described above, compressed, stored and delivered.

  2. You may use it on the file system layer. In this case, the file tree serialization format mentioned above comes into play: the file tree is serialized depth-first (much like tar would do it) and then split into chunks, compressed, stored and delivered.

The fact that it may be used on both the block and file system layer opens it up for a variety of different use-cases. In the VM and IoT ecosystems shipping images as block-level serializations is more common, while in the container and application world file-system-level serializations are more typically used.

Chunk index files referring to block-layer serializations carry the .caibx suffix, while chunk index files referring to file system serializations carry the .caidx suffix. Note that you may also use casync as direct tar replacement, i.e. without the chunking, just generating the plain linear file tree serialization. Such files carry the .catar suffix. Internally .caibx are identical to .caidx files, the only difference is semantical: .caidx files describe a .catar file, while .caibx files may describe any other blob. Finally, chunk stores are directories carrying the .castr suffix.

Features

Here are a couple of other features casync has:

  1. When downloading a new image you may use casync's --seed= feature: each block device, file, or directory specified is processed using the same chunking logic described above, and is used as preferred source when putting together the downloaded image locally, avoiding network transfer of it. This of course is useful whenever updating an image: simply specify one or more old versions as seed and only download the chunks that truly changed since then. Note that using seeds requires no history relationship between seed and the new image to download. This has major benefits: you can even use it to speed up downloads of relatively foreign and unrelated data. For example, when downloading a container image built using Ubuntu you can use your Fedora host OS tree in /usr as seed, and casync will automatically use whatever it can from that tree, for example timezone and locale data that tends to be identical between distributions. Example: casync extract http://example.com/myimage.caibx --seed=/dev/sda1 /dev/sda2. This will place the block-layer image described by the indicated URL in the /dev/sda2 partition, using the existing /dev/sda1 data as seeding source. An invocation like this could be typically used by IoT systems with an A/B partition setup. Example 2: casync extract http://example.com/mycontainer-v3.caidx --seed=/srv/container-v1 --seed=/srv/container-v2 /src/container-v3, is very similar but operates on the file system layer, and uses two old container versions to seed the new version.

  2. When operating on the file system level, the user has fine-grained control on the meta-data included in the serialization. This is relevant since different use-cases tend to require a different set of saved/restored meta-data. For example, when shipping OS images, file access bits/ACLs and ownership matter, while file modification times hurt. When doing personal backups OTOH file ownership matters little but file modification times are important. Moreover different backing file systems support different feature sets, and storing more information than necessary might make it impossible to validate a tree against an image if the meta-data cannot be replayed in full. Due to this, casync provides a set of --with= and --without= parameters that allow fine-grained control of the data stored in the file tree serialization, including the granularity of modification times and more. The precise set of selected meta-data features is also always part of the serialization, so that seeding can work correctly and automatically.

  3. casync tries to be as accurate as possible when storing file system meta-data. This means that besides the usual baseline of file meta-data (file ownership and access bits), and more advanced features (extended attributes, ACLs, file capabilities) a number of more exotic data is stored as well, including Linux chattr(1) file attributes, as well as FAT file attributes (you may wonder why the latter? — EFI is FAT, and /efi is part of the comprehensive serialization of any host). In the future I intend to extend this further, for example storing btrfs sub-volume information where available. Note that as described above every single type of meta-data may be turned off and on individually, hence if you don't need FAT file bits (and I figure it's pretty likely you don't), then they won't be stored.

  4. The user creating .caidx or .caibx files may control the desired average chunk length (before compression) freely, using the --chunk-size= parameter. Smaller chunks increase the number of generated files in the chunk store and increase HTTP GET load on the server, but also ensure that sharing between similar images is improved, as identical patterns in the images stored are more likely to be recognized. By default casync will use a 64K average chunk size. Tweaking this can be particularly useful when adapting the system to specific CDNs, or when delivering compressed disk images such as squashfs (see below).

  5. Emphasis is placed on making all invocations reproducible, well-defined and strictly deterministic. As mentioned above this is a requirement to reach the intended security guarantees, but is also useful for many other use-cases. For example, the casync digest command may be used to calculate a hash value identifying a specific directory in all desired detail (use --with= and --without to pick the desired detail). Moreover the casync mtree command may be used to generate a BSD mtree(5) compatible manifest of a directory tree, .caidx or .catar file.

  6. The file system serialization format is nicely composable. By this I mean that the serialization of a file tree is the concatenation of the serializations of all files and file sub-trees located at the top of the tree, with zero meta-data references from any of these serializations into the others. This property is essential to ensure maximum reuse of chunks when similar trees are serialized.

  7. When extracting file trees or disk image files, casync will automatically create reflinks from any specified seeds if the underlying file system supports it (such as btrfs, ocfs, and future xfs). After all, instead of copying the desired data from the seed, we can just tell the file system to link up the relevant blocks. This works both when extracting .caidx and .caibx files — the latter of course only when the extracted disk image is placed in a regular raw image file on disk, rather than directly on a plain block device, as plain block devices do not know the concept of reflinks.

  8. Optionally, when extracting file trees, casync can create traditional UNIX hard-links for identical files in specified seeds (--hardlink=yes). This works on all UNIX file systems, and can save substantial amounts of disk space. However, this only works for very specific use-cases where disk images are considered read-only after extraction, as any changes made to one tree will propagate to all other trees sharing the same hard-linked files, as that's the nature of hard-links. In this mode, casync exposes OSTree-like behavior, which is built heavily around read-only hard-link trees.

  9. casync tries to be smart when choosing what to include in file system images. Implicitly, file systems such as procfs and sysfs are excluded from serialization, as they expose API objects, not real files. Moreover, the "nodump" (+d) chattr(1) flag is honored by default, permitting users to mark files to exclude from serialization.

  10. When creating and extracting file trees casync may apply an automatic or explicit UID/GID shift. This is particularly useful when transferring container image for use with Linux user name-spacing.

  11. In addition to local operation, casync currently supports HTTP, HTTPS, FTP and ssh natively for downloading chunk index files and chunks (the ssh mode requires installing casync on the remote host, though, but an sftp mode not requiring that should be easy to add). When creating index files or chunks, only ssh is supported as remote back-end.

  12. When operating on block-layer images, you may expose locally or remotely stored images as local block devices. Example: casync mkdev http://example.com/myimage.caibx exposes the disk image described by the indicated URL as local block device in /dev, which you then may use the usual block device tools on, such as mount or fdisk (only read-only though). Chunks are downloaded on access with high priority, and at low priority when idle in the background. Note that in this mode, casync also plays a role similar to "dm-verity", as all blocks are validated against the strong digests in the chunk index file before passing them on to the kernel's block layer. This feature is implemented though Linux' NBD kernel facility.

  13. Similar, when operating on file-system-layer images, you may mount locally or remotely stored images as regular file systems. Example: casync mount http://example.com/mytree.caidx /srv/mytree mounts the file tree image described by the indicated URL as a local directory /srv/mytree. This feature is implemented though Linux' FUSE kernel facility. Note that special care is taken that the images exposed this way can be packed up again with casync make and are guaranteed to return the bit-by-bit exact same serialization again that it was mounted from. No data is lost or changed while passing things through FUSE (OK, strictly speaking this is a lie, we do lose ACLs, but that's hopefully just a temporary gap to be fixed soon).

  14. In IoT A/B fixed size partition setups the file systems placed in the two partitions are usually much shorter than the partition size, in order to keep some room for later, larger updates. casync is able to analyze the super-block of a number of common file systems in order to determine the actual size of a file system stored on a block device, so that writing a file system to such a partition and reading it back again will result in reproducible data. Moreover this speeds up the seeding process, as there's little point in seeding the white-space after the file system within the partition.

Example Command Lines

Here's how to use casync, explained with a few examples:

$ casync make foobar.caidx /some/directory

This will create a chunk index file foobar.caidx in the local directory, and populate the chunk store directory default.castr located next to it with the chunks of the serialization (you can change the name for the store directory with --store= if you like). This command operates on the file-system level. A similar command operating on the block level:

$ casync make foobar.caibx /dev/sda1

This command creates a chunk index file foobar.caibx in the local directory describing the current contents of the /dev/sda1 block device, and populates default.castr in the same way as above. Note that you may as well read a raw disk image from a file instead of a block device:

$ casync make foobar.caibx myimage.raw

To reconstruct the original file tree from the .caidx file and the chunk store of the first command, use:

$ casync extract foobar.caidx /some/other/directory

And similar for the block-layer version:

$ casync extract foobar.caibx /dev/sdb1

or, to extract the block-layer version into a raw disk image:

$ casync extract foobar.caibx myotherimage.raw

The above are the most basic commands, operating on local data only. Now let's make this more interesting, and reference remote resources:

$ casync extract http://example.com/images/foobar.caidx /some/other/directory

This extracts the specified .caidx onto a local directory. This of course assumes that foobar.caidx was uploaded to the HTTP server in the first place, along with the chunk store. You can use any command you like to accomplish that, for example scp or rsync. Alternatively, you can let casync do this directly when generating the chunk index:

$ casync make ssh.example.com:images/foobar.caidx /some/directory

This will use ssh to connect to the ssh.example.com server, and then places the .caidx file and the chunks on it. Note that this mode of operation is "smart": this scheme will only upload chunks currently missing on the server side, and not re-transmit what already is available.

Note that you can always configure the precise path or URL of the chunk store via the --store= option. If you do not do that, then the store path is automatically derived from the path or URL: the last component of the path or URL is replaced by default.castr.

Of course, when extracting .caidx or .caibx files from remote sources, using a local seed is advisable:

$ casync extract http://example.com/images/foobar.caidx --seed=/some/exising/directory /some/other/directory

Or on the block layer:

$ casync extract http://example.com/images/foobar.caibx --seed=/dev/sda1 /dev/sdb2

When creating chunk indexes on the file system layer casync will by default store meta-data as accurately as possible. Let's create a chunk index with reduced meta-data:

$ casync make foobar.caidx --with=sec-time --with=symlinks --with=read-only /some/dir

This command will create a chunk index for a file tree serialization that has three features above the absolute baseline supported: 1s granularity time-stamps, symbolic links and a single read-only bit. In this mode, all the other meta-data bits are not stored, including nanosecond time-stamps, full UNIX permission bits, file ownership or even ACLs or extended attributes.

Now let's make a .caidx file available locally as a mounted file system, without extracting it:

$ casync mount http://example.comf/images/foobar.caidx /mnt/foobar

And similar, let's make a .caibx file available locally as a block device:

$ casync mkdev http://example.comf/images/foobar.caibx

This will create a block device in /dev and print the used device node path to STDOUT.

As mentioned, casync is big about reproducibility. Let's make use of that to calculate the a digest identifying a very specific version of a file tree:

$ casync digest .

This digest will include all meta-data bits casync and the underlying file system know about. Usually, to make this useful you want to configure exactly what meta-data to include:

$ casync digest --with=unix .

This makes use of the --with=unix shortcut for selecting meta-data fields. Specifying --with-unix= selects all meta-data that traditional UNIX file systems support. It is a shortcut for writing out: --with=16bit-uids --with=permissions --with=sec-time --with=symlinks --with=device-nodes --with=fifos --with=sockets.

Note that when calculating digests or creating chunk indexes you may also use the negative --without= option to remove specific features but start from the most precise:

$ casync digest --without=flag-immutable

This generates a digest with the most accurate meta-data, but leaves one feature out: chattr(1)'s immutable (+i) file flag.

To list the contents of a .caidx file use a command like the following:

$ casync list http://example.com/images/foobar.caidx

or

$ casync mtree http://example.com/images/foobar.caidx

The former command will generate a brief list of files and directories, not too different from tar t or ls -al in its output. The latter command will generate a BSD mtree(5) compatible manifest. Note that casync actually stores substantially more file meta-data than mtree files can express, though.

What casync isn't

  1. casync is not an attempt to minimize serialization and downloaded deltas to the extreme. Instead, the tool is supposed to find a good middle ground, that is good on traffic and disk space, but not at the price of convenience or requiring explicit revision control. If you care about updates that are absolutely minimal, there are binary delta systems around that might be an option for you, such as Google's Courgette.

  2. casync is not a replacement for rsync, or git or zsync or anything like that. They have very different use-cases and semantics. For example, rsync permits you to directly synchronize two file trees remotely. casync just cannot do that, and it is unlikely it every will.

Where next?

casync is supposed to be a generic synchronization tool. Its primary focus for now is delivery of OS images, but I'd like to make it useful for a couple other use-cases, too. Specifically:

  1. To make the tool useful for backups, encryption is missing. I have pretty concrete plans how to add that. When implemented, the tool might become an alternative to restic, BorgBackup or tarsnap.

  2. Right now, if you want to deploy casync in real-life, you still need to validate the downloaded .caidx or .caibx file yourself, for example with some gpg signature. It is my intention to integrate with gpg in a minimal way so that signing and verifying chunk index files is done automatically.

  3. In the longer run, I'd like to build an automatic synchronizer for $HOME between systems from this. Each $HOME instance would be stored automatically in regular intervals in the cloud using casync, and conflicts would be resolved locally.

  4. casync is written in a shared library style, but it is not yet built as one. Specifically this means that almost all of casync's functionality is supposed to be available as C API soon, and applications can process casync files on every level. It is my intention to make this library useful enough so that it will be easy to write a module for GNOME's gvfs subsystem in order to make remote or local .caidx files directly available to applications (as an alternative to casync mount). In fact the idea is to make this all flexible enough that even the remoting back-ends can be replaced easily, for example to replace casync's default HTTP/HTTPS back-ends built on CURL with GNOME's own HTTP implementation, in order to share cookies, certificates, … There's also an alternative method to integrate with casync in place already: simply invoke casync as a sub-process. casync will inform you about a certain set of state changes using a mechanism compatible with sd_notify(3). In future it will also propagate progress data this way and more.

  5. I intend to a add a new seeding back-end that sources chunks from the local network. After downloading the new .caidx file off the Internet casync would then search for the listed chunks on the local network first before retrieving them from the Internet. This should speed things up on all installations that have multiple similar systems deployed in the same network.

Further plans are listed tersely in the TODO file.

FAQ:

  1. Is this a systemd project?casync is hosted under the github systemd umbrella, and the projects share the same coding style. However, the code-bases are distinct and without interdependencies, and casync works fine both on systemd systems and systems without it.

  2. Is casync portable? — At the moment: no. I only run Linux and that's what I code for. That said, I am open to accepting portability patches (unlike for systemd, which doesn't really make sense on non-Linux systems), as long as they don't interfere too much with the way casync works. Specifically this means that I am not too enthusiastic about merging portability patches for OSes lacking the openat(2) family of APIs.

  3. Does casync require reflink-capable file systems to work, such as btrfs? — No it doesn't. The reflink magic in casync is employed when the file system permits it, and it's good to have it, but it's not a requirement, and casync will implicitly fall back to copying when it isn't available. Note that casync supports a number of file system features on a variety of file systems that aren't available everywhere, for example FAT's system/hidden file flags or xfs's projinherit file flag.

  4. Is casync stable? — I just tagged the first, initial release. While I have been working on it since quite some time and it is quite featureful, this is the first time I advertise it publicly, and it hence received very little testing outside of its own test suite. I am also not fully ready to commit to the stability of the current serialization or chunk index format. I don't see any breakages coming for it though. casync is pretty light on documentation right now, and does not even have a man page. I also intend to correct that soon.

  5. Are the .caidx/.caibx and .catar file formats open and documented?casync is Open Source, so if you want to know the precise format, have a look at the sources for now. It's definitely my intention to add comprehensive docs for both formats however. Don't forget this is just the initial version right now.

  6. casync is just like $SOMEOTHERTOOL! Why are you reinventing the wheel (again)? — Well, because casync isn't "just like" some other tool. I am pretty sure I did my homework, and that there is no tool just like casync right now. The tools coming closest are probably rsync, zsync, tarsnap, restic, but they are quite different beasts each.

  7. Why did you invent your own serialization format for file trees? Why don't you just use tar? — That's a good question, and other systems — most prominently tarsnap — do that. However, as mentioned above tar doesn't enforce reproducibility. It also doesn't really do random access: if you want to access some specific file you need to read every single byte stored before it in the tar archive to find it, which is of course very expensive. The serialization casync implements places a focus on reproducibility, random access, and meta-data control. Much like traditional tar it can still be generated and extracted in a stream fashion though.

  8. Does casync save/restore SELinux/SMACK file labels? — At the moment not. That's not because I wouldn't want it to, but simply because I am not a guru of either of these systems, and didn't want to implement something I do not fully grok nor can test. If you look at the sources you'll find that there's already some definitions in place that keep room for them though. I'd be delighted to accept a patch implementing this fully.

  9. What about delivering squashfs images? How well does chunking work on compressed serializations? – That's a very good point! Usually, if you apply the a chunking algorithm to a compressed data stream (let's say a tar.gz file), then changing a single bit at the front will propagate into the entire remainder of the file, so that minimal changes will explode into major changes. Thankfully this doesn't apply that strictly to squashfs images, as it provides random access to files and directories and thus breaks up the compression streams in regular intervals to make seeking easy. This fact is beneficial for systems employing chunking, such as casync as this means single bit changes might affect their vicinity but will not explode in an unbounded fashion. In order achieve best results when delivering squashfs images through casync the block sizes of squashfs and the chunks sizes of casync should be matched up (using casync's --chunk-size= option). How precisely to choose both values is left a research subject for the user, for now.

  10. What does the name casync mean? – It's a synchronizing tool, hence the -sync suffix, following rsync's naming. It makes use of the content-addressable concept of git hence the ca- prefix.

  11. ***Where can I get this stuff? Is it already packaged? *** – Check out the sources on GitHub. I just tagged the first version. Martin Pitt has packaged casync for Ubuntu. There is also an ArchLinux package. Zbigniew Jędrzejewski-Szmek has prepared a Fedora RPM that hopefully will soon be included in the distribution.

Should you care? Is this a tool for you?

Well, that's up to you really. If you are involved with projects that need to deliver IoT, VM, container, application or OS images, then maybe this is a great tool for you — but other options exist, some of which are linked above.

Note that casync is an Open Source project: if it doesn't do exactly what you need, prepare a patch that adds what you need, and we'll consider it.

If you are interested in the project and would like to talk about this in person, I'll be presenting casync soon at Kinvolk's Linux Technologies Meetup in Berlin, Germany. You are invited. I also intend to talk about it at All Systems Go!, also in Berlin.


Avoiding CVE-2016-8655 with systemd

Avoiding CVE-2016-8655 with systemd

Just a quick note: on recent versions of systemd it is relatively easy to block the vulnerability described in CVE-2016-8655 for individual services.

Since systemd release v211 there's an option RestrictAddressFamilies= for service unit files which takes away the right to create sockets of specific address families for processes of the service. In your unit file, add RestrictAddressFamilies=~AF_PACKET to the [Service] section to make AF_PACKET unavailable to it (i.e. a blacklist), which is sufficient to close the attack path. Safer of course is a whitelist of address families whch you can define by dropping the ~ character from the assignment. Here's a trivial example:


[Service]
ExecStart=/usr/bin/mydaemon
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX

This restricts access to socket families, so that the service may access only AF_INET, AF_INET6 or AF_UNIX sockets, which is usually the right, minimal set for most system daemons. (AF_INET is the low-level name for the IPv4 address family, AF_INET6 for the IPv6 address family, and AF_UNIX for local UNIX socket IPC).

Starting with systemd v232 we added RestrictAddressFamilies= to all of systemd's own unit files, always with the minimal set of socket address families appropriate.

With the upcoming v233 release we'll provide a second method for blocking this vulnerability. Using RestrictNamespaces= it is possible to limit which types of Linux namespaces a service may get access to. Use RestrictNamespaces=yes to prohibit access to any kind of namespace, or set RestrictNamespaces=net ipc (or similar) to restrict access to a specific set (in this case: network and IPC namespaces). Given that user namespaces have been a major source of security vulnerabilities in the past months it's probably a good idea to block namespaces on all services which don't need them (which is probably most of them).

Of course, ideally, distributions such as Fedora, as well as upstream developers would turn on the various sandboxing settings systemd provides like these ones by default, since they know best which kind of address families or namespaces a specific daemon needs.


systemd.conf 2016 Over Now

systemd.conf 2016 is Over Now!

A few days ago systemd.conf 2016 ended, our second conference of this kind. I personally enjoyed this conference a lot: the talks, the atmosphere, the audience, the organization, the location, they all were excellent!

I'd like to take the opportunity to thanks everybody involved. In particular I'd like to thank Chris, Daniel, Sandra and Henrike for organizing the conference, your work was stellar!

I'd also like to thank our sponsors, without which the conference couldn't take place like this, of course. In particular I'd like to thank our gold sponsor, Red Hat, our organizing sponsor Kinvolk, as well as our silver sponsors CoreOS and Facebook. I'd also like to thank our bronze sponsors Collabora, OpenSUSE, Pantheon, Pengutronix, our supporting sponsor Codethink and last but not least our media sponsor Linux Magazin. Thank you all!

I'd also like to thank the Video Operation Center ("VOC") for their amazing work on live-streaming the conference and making all talks available on YouTube. It's amazing how efficient the VOC is, it's simply stunning! Thank you guys!

In case you missed this year's iteration of the conference, please have a look at our YouTube Channel. You'll find all of this year's talks there, as well the ones from last year. (For example, my welcome talk is available here). Enjoy!

We hope to see you again next year, for systemd.conf 2017 in Berlin!


systemd.conf 2016 Workshop Tickets Available

Tickets for systemd 2016 Workshop day still available!

We still have a number of ticket for the workshop day of systemd.conf 2016 available. If you are a newcomer to systemd, and would like to learn about various systemd facilities, or if you already know your way around, but would like to know more: this is the best chance to do so. The workshop day is the 28th of September, one day before the main conference, at the betahaus in Berlin, Germany. The schedule for the day is available here. There are five interesting, extensive sessions, run by the systemd hackers themselves. Who better to learn systemd from, than the folks who wrote it?

Note that the workshop day and the main conference days require different tickets. (Also note: there are still a few tickets available for the main conference!).

Buy a ticket here.

See you in Berlin!


Preliminary systemd.conf 2016 Schedule

A Preliminary systemd.conf 2016 Schedule is Now Available!

We have just published a first, preliminary version of the systemd.conf 2016 schedule. There is a small number of white slots in the schedule still, because we're missing confirmation from a small number of presenters. The missing talks will be added in as soon as they are confirmed.

The schedule consists of 5 workshops by high-profile speakers during the workshop day, 22 exciting talks during the main conference days, followed by one full day of hackfests.

Please sign up for the conference soon! Only a limited number of tickets are available, hence make sure to secure yours quickly before they run out! (Last year we sold out.) Please sign up here for the conference!


FINAL REMINDER! systemd.conf 2016 CfP Ends on Monday!

Please note that the systemd.conf 2016 Call for Participation ends on Monday, on Aug. 1st! Please send in your talk proposal by then! We’ve already got a good number of excellent submissions, but we are very interested in yours, too!

We are looking for talks on all facets of systemd: deployment, maintenance, administration, development. Regardless of whether you use it in the cloud, on embedded, on IoT, on the desktop, on mobile, in a container or on the server: we are interested in your submissions!

In addition to proposals for talks for the main conference, we are looking for proposals for workshop sessions held during our Workshop Day (the first day of the conference). The workshop format consists of a day of 2-3h training sessions, that may cover any systemd-related topic you'd like. We are both interested in submissions from the developer community as well as submissions from organizations making use of systemd! Introductory workshop sessions are particularly welcome, as the Workshop Day is intended to open up our conference to newcomers and people who aren't systemd gurus yet, but would like to become more fluent.

For further details on the submissions we are looking for and the CfP process, please consult the CfP page and submit your proposal using the provided form!

ALSO: Please sign up for the conference soon! Only a limited number of tickets are available, hence make sure to secure yours quickly before they run out! (Last year we sold out.) Please sign up here for the conference!

AND OF COURSE: We are also looking for more sponsors for systemd.conf! If you are working on systemd-related projects, or make use of it in your company, please consider becoming a sponsor of systemd.conf 2016! Without our sponsors we couldn't organize systemd.conf 2016!

Thank you very much, and see you in Berlin!


REMINDER! systemd.conf 2016 CfP Ends in Two Weeks!

Please note that the systemd.conf 2016 Call for Participation ends in less than two weeks, on Aug. 1st! Please send in your talk proposal by then! We’ve already got a good number of excellent submissions, but we are interested in yours even more!

We are looking for talks on all facets of systemd: deployment, maintenance, administration, development. Regardless of whether you use it in the cloud, on embedded, on IoT, on the desktop, on mobile, in a container or on the server: we are interested in your submissions!

In addition to proposals for talks for the main conference, we are looking for proposals for workshop sessions held during our Workshop Day (the first day of the conference). The workshop format consists of a day of 2-3h training sessions, that may cover any systemd-related topic you'd like. We are both interested in submissions from the developer community as well as submissions from organizations making use of systemd! Introductory workshop sessions are particularly welcome, as the Workshop Day is intended to open up our conference to newcomers and people who aren't systemd gurus yet, but would like to become more fluent.

For further details on the submissions we are looking for and the CfP process, please consult the CfP page and submit your proposal using the provided form!

And keep in mind:

REMINDER: Please sign up for the conference soon! Only a limited number of tickets are available, hence make sure to secure yours quickly before they run out! (Last year we sold out.) Please sign up here for the conference!

AND OF COURSE: We are also looking for more sponsors for systemd.conf! If you are working on systemd-related projects, or make use of it in your company, please consider becoming a sponsor of systemd.conf 2016! Without our sponsors we couldn't organize systemd.conf 2016!

Thank you very much, and see you in Berlin!


CfP is now open

The systemd.conf 2016 Call for Participation is Now Open!

We’d like to invite presentation and workshop proposals for systemd.conf 2016!

The conference will consist of three parts:

  • One day of workshops, consisting of in-depth (2-3hr) training and learning-by-doing sessions (Sept. 28th)
  • Two days of regular talks (Sept. 29th-30th)
  • One day of hackfest (Oct. 1st)

We are now accepting submissions for the first three days: proposals for workshops, training sessions and regular talks. In particular, we are looking for sessions including, but not limited to, the following topics:

  • Use Cases: systemd in today’s and tomorrow’s devices and applications
  • systemd and containers, in the cloud and on servers
  • systemd in distributions
  • Embedded systemd and in IoT
  • systemd on the desktop
  • Networking with systemd
  • … and everything else related to systemd

Please submit your proposals by August 1st, 2016. Notification of acceptance will be sent out 1-2 weeks later.

If submitting a workshop proposal please contact the organizers for more details.

To submit a talk, please visit our CfP submission page.

For further information on systemd.conf 2016, please visit our conference web site.


Announcing systemd.conf 2016

Announcing systemd.conf 2016

We are happy to announce the 2016 installment of systemd.conf, the conference of the systemd project!

After our successful first conference 2015 we’d like to repeat the event in 2016 for the second time. The conference will take place on September 28th until October 1st, 2016 at betahaus in Berlin, Germany. The event is a few days before LinuxCon Europe, which also is located in Berlin this year. This year, the conference will consist of two days of presentations, a one-day hackfest and one day of hands-on training sessions.

The website is online now, please visit https://conf.systemd.io/.

Tickets at early-bird prices are available already. Purchase them at https://ti.to/systemdconf/systemdconf-2016.

The Call for Presentations will open soon, we are looking forward to your submissions! A separate announcement will be published as soon as the CfP is open.

systemd.conf 2016 is a organized jointly by the systemd community and kinvolk.io.

We are looking for sponsors! We’ve got early commitments from some of last year’s sponsors: Collabora, Pengutronix & Red Hat. Please see the web site for details about how your company may become a sponsor, too.

If you have any questions, please contact us at info@systemd.io.


Introducing sd-event

The Event Loop API of libsystemd

When we began working on systemd we built it around a hand-written ad-hoc event loop, wrapping Linux epoll. The more our project grew the more we realized the limitations of using raw epoll:

  • As we used timerfd for our timer events, each event source cost one file descriptor and we had many of them! File descriptors are a scarce resource on UNIX, as RLIMIT_NOFILE is typically set to 1024 or similar, limiting the number of available file descriptors per process to 1021, which isn't particularly a lot.

  • Ordering of event dispatching became a nightmare. In many cases, we wanted to make sure that a certain kind of event would always be dispatched before another kind of event, if both happen at the same time. For example, when the last process of a service dies, we might be notified about that via a SIGCHLD signal, via an sd_notify() "STATUS=" message, and via a control group notification. We wanted to get these events in the right order, to know when it's safe to process and subsequently release the runtime data systemd keeps about the service or process: it shouldn't be done if there are still events about it pending.

  • For each program we added to the systemd project we noticed we were adding similar code, over and over again, to work with epoll's complex interfaces. For example, finding the right file descriptor and callback function to dispatch an epoll event to, without running into invalidated pointer issues is outright difficult and requires non-trivial code.

  • Integrating child process watching into our event loops was much more complex than one could hope, and even more so if child process events should be ordered against each other and unrelated kinds of events.

Eventually, we started working on sd-bus. At the same time we decided to seize the opportunity, put together a proper event loop API in C, and then not only port sd-bus on top of it, but also the rest of systemd. The result of this is sd-event. After almost two years of development we declared sd-event stable in systemd version 221, and published it as official API of libsystemd.

Why?

sd-event.h, of course, is not the first event loop API around, and it doesn't implement any really novel concepts. When we started working on it we tried to do our homework, and checked the various existing event loop APIs, maybe looking for candidates to adopt instead of doing our own, and to learn about the strengths and weaknesses of the various implementations existing. Ultimately, we found no implementation that could deliver what we needed, or where it would be easy to add the missing bits: as usual in the systemd project, we wanted something that allows us access to all the Linux-specific bits, instead of limiting itself to the least common denominator of UNIX. We weren't looking for an abstraction API, but simply one that makes epoll usable in system code.

With this blog story I'd like to take the opportunity to introduce you to sd-event, and explain why it might be a good candidate to adopt as event loop implementation in your project, too.

So, here are some features it provides:

  • I/O event sources, based on epoll's file descriptor watching, including edge triggered events (EPOLLET). See sd_event_add_io(3).

  • Timer event sources, based on timerfd_create(), supporting the CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTIME clocks, as well as the CLOCK_REALTIME_ALARM and CLOCK_BOOTTIME_ALARM clocks that can resume the system from suspend. When creating timer events a required accuracy parameter may be specified which allows coalescing of timer events to minimize power consumption. For each clock only a single timer file descriptor is kept, and all timer events are multiplexed with a priority queue. See sd_event_add_time(3).

  • UNIX process signal events, based on signalfd(2), including full support for real-time signals, and queued parameters. See sd_event_add_signal(3).

  • Child process state change events, based on waitid(2). See sd_event_add_child(3).

  • Static event sources, of three types: defer, post and exit, for invoking calls in each event loop, after other event sources or at event loop termination. See sd_event_add_defer(3).

  • Event sources may be assigned a 64bit priority value, that controls the order in which event sources are dispatched if multiple are pending simultanously. See sd_event_source_set_priority(3).

  • The event loop may automatically send watchdog notification messages to the service manager. See sd_event_set_watchdog(3).

  • The event loop may be integrated into foreign event loops, such as the GLib one. The event loop API is hence composable, the same way the underlying epoll logic is. See sd_event_get_fd(3) for an example.

  • The API is fully OOM safe.

  • A complete set of documentation in UNIX man page format is available, with sd-event(3) as the entry page.

  • It's pretty widely available, and requires no extra dependencies. Since systemd is built on it, most major distributions ship the library in their default install set.

  • After two years of development, and after being used in all of systemd's components, it has received a fair share of testing already, even though we only recently decided to declare it stable and turned it into a public API.

Note that sd-event has some potential drawbacks too:

  • If portability is essential to you, sd-event is not your best option. sd-event is a wrapper around Linux-specific APIs, and that's visible in the API. For example: our event callbacks receive structures defined by Linux-specific APIs such as signalfd.

  • It's a low-level C API, and it doesn't isolate you from the OS underpinnings. While I like to think that it is relatively nice and easy to use from C, it doesn't compromise on exposing the low-level functionality. It just fills the gaps in what's missing between epoll, timerfd, signalfd and related concepts, and it does not hide that away.

Either way, I believe that sd-event is a great choice when looking for an event loop API, in particular if you work on system-level software and embedded, where functionality like timer coalescing or watchdog support matter.

Getting Started

Here's a short example how to use sd-event in a simple daemon. In this example, we'll not just use sd-event.h, but also sd-daemon.h to implement a system service.

#include <alloca.h>
#include <endian.h>
#include <errno.h>
#include <netinet/in.h>
#include <signal.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

#include <systemd/sd-daemon.h>
#include <systemd/sd-event.h>

static int io_handler(sd_event_source *es, int fd, uint32_t revents, void *userdata) {
        void *buffer;
        ssize_t n;
        int sz;

        /* UDP enforces a somewhat reasonable maximum datagram size of 64K, we can just allocate the buffer on the stack */
        if (ioctl(fd, FIONREAD, &sz) < 0)
                return -errno;
        buffer = alloca(sz);

        n = recv(fd, buffer, sz, 0);
        if (n < 0) {
                if (errno == EAGAIN)
                        return 0;

                return -errno;
        }

        if (n == 5 && memcmp(buffer, "EXIT\n", 5) == 0) {
                /* Request a clean exit */
                sd_event_exit(sd_event_source_get_event(es), 0);
                return 0;
        }

        fwrite(buffer, 1, n, stdout);
        fflush(stdout);
        return 0;
}

int main(int argc, char *argv[]) {
        union {
                struct sockaddr_in in;
                struct sockaddr sa;
        } sa;
        sd_event_source *event_source = NULL;
        sd_event *event = NULL;
        int fd = -1, r;
        sigset_t ss;

        r = sd_event_default(&event);
        if (r < 0)
                goto finish;

        if (sigemptyset(&ss) < 0 ||
            sigaddset(&ss, SIGTERM) < 0 ||
            sigaddset(&ss, SIGINT) < 0) {
                r = -errno;
                goto finish;
        }

        /* Block SIGTERM first, so that the event loop can handle it */
        if (sigprocmask(SIG_BLOCK, &ss, NULL) < 0) {
                r = -errno;
                goto finish;
        }

        /* Let's make use of the default handler and "floating" reference features of sd_event_add_signal() */
        r = sd_event_add_signal(event, NULL, SIGTERM, NULL, NULL);
        if (r < 0)
                goto finish;
        r = sd_event_add_signal(event, NULL, SIGINT, NULL, NULL);
        if (r < 0)
                goto finish;

        /* Enable automatic service watchdog support */
        r = sd_event_set_watchdog(event, true);
        if (r < 0)
                goto finish;

        fd = socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0);
        if (fd < 0) {
                r = -errno;
                goto finish;
        }

        sa.in = (struct sockaddr_in) {
                .sin_family = AF_INET,
                .sin_port = htobe16(7777),
        };
        if (bind(fd, &sa.sa, sizeof(sa)) < 0) {
                r = -errno;
                goto finish;
        }

        r = sd_event_add_io(event, &event_source, fd, EPOLLIN, io_handler, NULL);
        if (r < 0)
                goto finish;

        (void) sd_notifyf(false,
                          "READY=1\n"
                          "STATUS=Daemon startup completed, processing events.");

        r = sd_event_loop(event);

finish:
        event_source = sd_event_source_unref(event_source);
        event = sd_event_unref(event);

        if (fd >= 0)
                (void) close(fd);

        if (r < 0)
                fprintf(stderr, "Failure: %s\n", strerror(-r));

        return r < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}

The example above shows how to write a minimal UDP/IP server, that listens on port 7777. Whenever a datagram is received it outputs its contents to STDOUT, unless it is precisely the string EXIT\n in which case the service exits. The service will react to SIGTERM and SIGINT and do a clean exit then. It also notifies the service manager about its completed startup, if it runs under a service manager. Finally, it sends watchdog keep-alive messages to the service manager if it asked for that, and if it runs under a service manager.

When run as systemd service this service's STDOUT will be connected to the logging framework of course, which means the service can act as a minimal UDP-based remote logging service.

To compile and link this example, save it as event-example.c, then run:

$ gcc event-example.c -o event-example `pkg-config --cflags --libs libsystemd`

For a first test, simply run the resulting binary from the command line, and test it against the following netcat command line:

$ nc -u localhost 7777

For the sake of brevity error checking is minimal, and in a real-world application should, of course, be more comprehensive. However, it hopefully gets the idea across how to write a daemon that reacts to external events with sd-event.

For further details on the functions used in the example above, please consult the manual pages: sd-event(3), sd_event_exit(3), sd_event_source_get_event(3), sd_event_default(3), sd_event_add_signal(3), sd_event_set_watchdog(3), sd_event_add_io(3), sd_notifyf(3), sd_event_loop(3), sd_event_source_unref(3), sd_event_unref(3).

Conclusion

So, is this the event loop to end all other event loops? Certainly not. I actually believe in "event loop plurality". There are many reasons for that, but most importantly: sd-event is supposed to be an event loop suitable for writing a wide range of applications, but it's definitely not going to solve all event loop problems. For example, while the priority logic is important for many usecase it comes with drawbacks for others: if not used carefully high-priority event sources can easily starve low-priority event sources. Also, in order to implement the priority logic, sd-event needs to linearly iterate through the event structures returned by epoll_wait(2) to sort the events by their priority, resulting in worst case O(n*log(n)) complexity on each event loop wakeup (for n = number of file descriptors). Then, to implement priorities fully, sd-event only dispatches a single event before going back to the kernel and asking for new events. sd-event will hence not provide the theoretically possible best scalability to huge numbers of file descriptors. Of course, this could be optimized, by improving epoll, and making it support how todays's event loops actually work (after, all, this is the problem set all event loops that implement priorities -- including GLib's -- have to deal with), but even then: the design of sd-event is focussed on running one event loop per thread, and it dispatches events strictly ordered. In many other important usecases a very different design is preferable: one where events are distributed to a set of worker threads and are dispatched out-of-order.

Hence, don't mistake sd-event for what it isn't. It's not supposed to unify everybody on a single event loop. It's just supposed to be a very good implementation of an event loop suitable for a large part of the typical usecases.

Note that our APIs, including sd-bus, integrate nicely into sd-event event loops, but do not require it, and may be integrated into other event loops too, as long as they support watching for time and I/O events.

And that's all for now. If you are considering using sd-event for your project and need help or have questions, please direct them to the systemd mailing list.

© Lennart Poettering. Built using Pelican. Theme by Giulio Fidente on github. .