Category: projects

Addendum on the Brokenness of File Locking

I forgot to mention another central problem in my blog story about file locking on Linux:

Different machines have access to different features of the same file system. Here's an example: let's say you have two machines in your home LAN. You want them to share their $HOME directory, so that you (or your family) can use either machine and have access to all your (or their) data. So you export /home on one machine via NFS and mount it from the other machine.

So far so good. But what happens to file locking now? Programs on the first machine see a fully-featured ext3 or ext4 file system, where all kinds of locking work (even though the API might suck, as mentioned in the earlier blog story). But what about the other machine? If you set up lockd properly then POSIX locking will work on both. If you didn't, one machine can use POSIX locking properly and the other cannot. And it gets even worse: as mentioned, recent NFS implementations on Linux transparently convert client-side BSD locking into POSIX locking on the server side. Now, if two instances of the same application use BSD locking, one on the client and one on the server, they will end up with two orthogonal locks, and although both sides think they have properly acquired a lock (and they actually did), they will overwrite each other's data, because those two locks are independent. (And one wonders why the NFS developers implemented this brokenness nonetheless...).

This basically means that locking cannot be used unless it is verified that everyone accessing a file system can make use of the same file system feature set. If you use file locking on a file system you should do so only if you are sufficiently sure that nobody using a broken or weird NFS implementation might want to access and lock those files as well. And practically that is impossible. Even if fpathconf() was improved so that it could inform the caller whether it can successfully apply a file lock to a file, this would still not give any hint if the same is true for everybody else accessing the file. But that is essential when speaking of advisory (i.e. cooperative) file locking.

And no, this isn't easy to fix. So again, the recommendation: forget about file locking on Linux, it's nothing more than a useless toy.

Also read Jeremy Allison's (Samba) take on POSIX file locking. It's an interesting read.


On the Brokenness of File Locking

It's amazing how far Linux has come without providing for proper file locking that works and is usable from userspace. A little overview why file locking is still in a very sad state:

To begin with, there's a plethora of APIs, and all of them are awful:

  • POSIX file locking as available with fcntl(F_SETLK): the POSIX locking API is the most portable one and in theory works across NFS. It can do byte-range locking. So much for the good side. On the bad side there's a lot more, however: locks are bound to processes, not file descriptors. That means that this logic cannot be used in threaded environments unless combined with a process-local mutex. This is hard to get right, especially in libraries that do not know the environment they are run in, i.e. whether they are used in threaded environments or not. The worst part, however, is that POSIX locks are automatically released if a process calls close() on any (!) of its open file descriptors for that file. That means that when one part of a program locks a file and another by coincidence accesses it too for a short time, the first part's lock will be broken and it won't be notified about that. Modern software tends to load big frameworks (such as Gtk+ or Qt) into memory as well as arbitrary modules via mechanisms such as NSS, PAM, gvfs, GTK_MODULES, Apache modules, GStreamer modules, where one module seldom can control what another module in the same process does or accesses. The effect of this is that POSIX locks are unusable in any non-trivial program where it cannot be ensured that a file that is locked is never accessed by any other part of the process at the same time. Example: a user managing daemon wants to write /etc/passwd and locks the file for that. At the same time in another thread (or from a stack frame further down) something calls getpwuid() which internally accesses /etc/passwd and causes the lock to be released, the first thread (or stack frame) not knowing that. Furthermore, should two threads use the locking fcntl()s on the same file they will interfere with each other's locks and reset the locking ranges and flags of each other. On top of that, locking cannot be used on any file that is publicly accessible (i.e. has the R bit set for groups/others, i.e. more access bits on than 0600), because that would otherwise effectively give arbitrary users a way to indefinitely block execution of any process (regardless of the UID it is running under) that wants to access and lock the file. This is generally not an acceptable security risk. Finally, while POSIX file locks are supposedly NFS-safe, they are not always so in practice, as there are still many NFS implementations around where locking is not properly implemented, and NFS tends to be used in heterogeneous networks. The biggest problem about this is that there is no way to properly detect whether file locking works on a specific NFS mount (or any mount) or not.
  • The other API for POSIX file locks: lockf() is another API for the same mechanism and suffers from the same problems. One wonders why there are two APIs for the same messed up interface.
  • BSD locking based on flock(). The semantics of this kind of locking are much nicer than for POSIX locking: locks are bound to file descriptors, not processes. This kind of locking can hence be used safely between threads and can even be inherited across fork() and exec(). Locks are only automatically broken on the close() call for the one file descriptor they were created with (or the last duplicate of it). On the other hand this kind of locking does not offer byte-range locking and suffers from the same security problems as POSIX locking, and works in even fewer cases on NFS than POSIX locking (i.e. on BSD and Linux < 2.6.12 they were NOPs returning success). And since BSD locking is not as portable as POSIX locking this is sometimes an unsafe choice. Some OSes even find it funny to make flock() and fcntl(F_SETLK) control the same locks. Linux treats them independently -- except for the cases where it doesn't: on Linux NFS they are now transparently converted to POSIX locks, too. What a chaos!
  • Mandatory locking is available too. It's based on the POSIX locking API but not portable in itself. It's dangerous business and should generally be avoided in cleanly written software.
  • Traditional lock file based file locking. This is how things were done traditionally, based around known atomicity guarantees of certain basic file system operations. It's a cumbersome thing, and requires polling of the file system to get notifications when a lock is released. Also, on Linux NFS < 2.6.5 it doesn't work properly, since O_EXCL isn't atomic there. And of course the client cannot really know what the server is running, so again this brokenness is not detectable.
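The close() pitfall of POSIX locks described in the list above can be reproduced with a short script. This is only a sketch, assuming Linux and a local file system; the helper names (another_process_can_lock, demo_lock_lost_on_close) are made up for illustration, and Python's fcntl.lockf() is used as a thin wrapper around the fcntl() locking calls.

```python
import fcntl
import os
import tempfile


def another_process_can_lock(path):
    """Fork a child that tries to take an exclusive POSIX lock on path.

    Returns True if the child got the lock, i.e. nobody else holds one.
    """
    pid = os.fork()
    if pid == 0:
        # Child: POSIX locks are not inherited across fork(), so this is
        # an independent contender for the lock.
        status = 1
        try:
            fd = os.open(path, os.O_RDWR)
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            status = 0
        except OSError:
            status = 1
        finally:
            os._exit(status)
    _, wait_status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wait_status) == 0


def demo_lock_lost_on_close():
    """Return (lock held before, lock held after) an unrelated open/close."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "data")
        with open(path, "w") as f:
            f.write("payload\n")

        locked_fd = os.open(path, os.O_RDWR)
        fcntl.lockf(locked_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        held_before = not another_process_can_lock(path)

        # Some unrelated part of the same process briefly looks at the
        # file, much like getpwuid() reading /etc/passwd.
        unrelated_fd = os.open(path, os.O_RDONLY)
        os.close(unrelated_fd)

        held_after = not another_process_can_lock(path)
        os.close(locked_fd)
        return held_before, held_after


if __name__ == "__main__":
    print(demo_lock_lost_on_close())
```

On a local Linux file system I would expect this to report that the lock was held before the unrelated open()/close() pair and silently gone afterwards, even though locked_fd was never closed and no error was ever reported to the lock holder.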

The Disappointing Summary

File locking on Linux is just broken. The broken semantics of POSIX locking show that the designers of this API apparently never have tried to actually use it in real software. It smells a lot like an interface that kernel people thought makes sense but in reality doesn't when you try to use it from userspace.

Here's a list of places where you shouldn't use file locking due to the problems shown above: If you want to lock a file in $HOME, forget about it as $HOME might be NFS and locks generally are not reliable there. The same applies to every other file system that might be shared across the network. If the file you want to lock is accessible to more than your own user (i.e. an access mode > 0700), forget about locking, it would allow others to block your application indefinitely. If your program is non-trivial or threaded or uses a framework such as Gtk+ or Qt or any of the module-based APIs such as NSS, PAM, ... forget about POSIX locking. If you care about portability, don't use file locking.

Or to turn this around, the only case where it is kind of safe to use file locking is in trivial applications where portability is not key and by using BSD locking on a file system where you can rely that it is local and on files inaccessible to others. Of course, that doesn't leave much, except for private files in /tmp for trivial user applications.
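For that one remaining case, a minimal sketch of BSD locking on a private file might look like the following. It assumes Linux and a local file system; the file name state.lock and the helper names are hypothetical. It also shows the per-descriptor semantics: a second open() of the same file contends for the lock even inside one process, and closing that second descriptor leaves the first lock intact.

```python
import fcntl
import os
import tempfile


def try_flock(fd):
    """Try a non-blocking exclusive BSD lock; return True on success."""
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except OSError:
        return False


def demo_flock_private_file():
    # TemporaryDirectory() creates a directory only its owner can enter,
    # and the file itself is created with mode 0600.
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "state.lock")
        holder = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        got_lock = try_flock(holder)

        # A second, independent open of the same file contends for the
        # lock, even within the same process.
        contender = os.open(path, os.O_RDWR)
        contender_blocked = not try_flock(contender)

        # Closing that other descriptor does not break the holder's lock.
        os.close(contender)
        contender = os.open(path, os.O_RDWR)
        still_blocked = not try_flock(contender)

        os.close(contender)
        os.close(holder)  # only this close releases the lock
        return got_lock, contender_blocked, still_blocked


if __name__ == "__main__":
    print(demo_flock_private_file())
```

Note that none of this helps on NFS or on files other users can open; the sketch only covers the narrow case the paragraph above leaves over.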

Or in one sentence: in its current state Linux file locking is unusable.

And that is a shame.

Update: Check out the follow-up story on this topic.


On IDs

When programming software that cooperates with software running on behalf of other users, other sessions or other computers it is often necessary to work with unique identifiers. These can be bound to various hardware and software objects as well as lifetimes. Often, when people look for such an ID to use they pick the wrong one because the semantics and lifetime of the IDs are not clear. Here's a little non-comprehensive list of IDs accessible on Linux and how you should or should not use them.

Hardware IDs

  1. /sys/class/dmi/id/product_uuid: The main board product UUID, as set by the board manufacturer and encoded in the BIOS DMI information. It may be used to identify a mainboard and only the mainboard. It changes when the user replaces the main board. Also, often enough BIOS manufacturers write bogus serials into it. In addition, it is x86-specific. Access for unprivileged users is forbidden. Hence it is of little general use.
  2. CPUID/EAX=3 CPU serial number: A CPU UUID, as set by the CPU manufacturer and encoded on the CPU chip. It may be used to identify a CPU and only a CPU. It changes when the user replaces the CPU. Also, most modern CPUs don't implement this feature anymore, and older computers tend to disable this option by default, controllable via a BIOS Setup option. In addition, it is x86-specific. Hence this too is of little general use.
  3. /sys/class/net/*/address: One or more network MAC addresses, as set by the network adapter manufacturer and encoded on some network card EEPROM. It changes when the user replaces the network card. Since network cards are optional and there may be more than one, the availability of this ID is not guaranteed and you might have more than one to choose from. On virtual machines the MAC addresses tend to be random. This too is hence of little general use.
  4. /sys/bus/usb/devices/*/serial: Serial numbers of various USB devices, as encoded in the USB device EEPROM. Most devices don't have a serial number set, and if they have it is often bogus. If the user replaces his USB hardware or plugs it into another machine these IDs may change or appear in other machines. This hence too is of little use.

There are various other hardware IDs available, many of which you may discover via the ID_SERIAL udev property of various devices, such as hard disks and similar. They all have in common that they are bound to specific (replaceable) hardware, not universally available, often filled with bogus data and random in virtualized environments. Or in other words: don't use them, don't rely on them for identification unless you really know what you are doing, and in general they do not guarantee what you might hope they guarantee.

Software IDs

  1. /proc/sys/kernel/random/boot_id: A random ID that is regenerated on each boot. As such it can be used to identify the local machine's current boot. It's universally available on any recent Linux kernel. It's a good and safe choice if you need to identify a specific boot on a specific booted kernel.
  2. gethostname(), /proc/sys/kernel/hostname: A non-random ID configured by the administrator to identify a machine in the network. Often this is not set at all or is set to some default value such as localhost and not even unique in the local network. In addition it might change during runtime, for example because it changes based on updated DHCP information. As such it is almost entirely useless for anything but presentation to the user. It has very weak semantics and relies on correct configuration by the administrator. Don't use this to identify machines in a distributed environment. It won't work unless centrally administered, which makes it useless in a globalized, mobile world. It has no place in automatically generated filenames that shall be bound to specific hosts. Just don't use it, please. It's really not what many people think it is. gethostname() is standardized in POSIX and hence portable to other Unixes.
  3. IP Addresses returned by SIOCGIFCONF or the respective Netlink APIs: These tend to be dynamically assigned and often enough only valid on local networks or even only the local links (i.e. 192.168.x.x style addresses, or even 169.254.x.x/IPv4LL). Unfortunately they hence have little use outside of networking.
  4. gethostid(): Returns a supposedly unique 32-bit identifier for the current machine. The semantics of this are not clear. On most machines this simply returns a value based on a local IPv4 address. On others it is administrator controlled via the /etc/hostid file. Since the semantics of this ID are not clear and it most often is just a value based on the IP address, it is almost always the wrong choice. On top of that, 32 bits are not particularly a lot. On the other hand this is standardized in POSIX and hence portable to other Unixes. It's probably best to ignore this value and if people don't want to ignore it they should probably symlink /etc/hostid to /var/lib/dbus/machine-id or something similar.
  5. /var/lib/dbus/machine-id: An ID identifying a specific Linux/Unix installation. It does not change if hardware is replaced. It is not unreliable in virtualized environments. This value has clear semantics and is considered part of the D-Bus API. It is supposedly globally unique and portable to all systems that have D-Bus. On Linux, it is universally available, given that almost all non-embedded and even a fair share of the embedded machines ship D-Bus now. This is the recommended way to identify a machine, possibly with a fallback to the host name to cover systems that still lack D-Bus. If your application links against libdbus, you may access this ID with dbus_get_local_machine_id(), if not you can read it directly from the file system.
  6. /proc/self/sessionid: An ID identifying a specific Linux login session. This ID is maintained by the kernel and part of the auditing logic. It is uniquely assigned to each login session during a specific system boot, shared by each process of a session, even across su/sudo and cannot be changed by userspace. Unfortunately some distributions have so far failed to set things up properly for this to work (Hey, you, Ubuntu!), and this ID is always (uint32_t) -1 for them. But there's hope they get this fixed eventually. Nonetheless it is a good choice for a unique session identifier on the local machine and for the current boot. To make this ID globally unique it is best combined with /proc/sys/kernel/random/boot_id.
  7. getuid(): An ID identifying a specific Unix/Linux user. This ID is usually automatically assigned when a user is created. It is not unique across machines and may be reassigned to a different user if the original user was deleted. As such it should be used only locally and with the limited validity in time in mind. To make this ID globally unique it is not sufficient to combine it with /var/lib/dbus/machine-id, because the same ID might be used for a different user that is created later with the same UID. Nonetheless this combination is often good enough. It is available on all POSIX systems.
  8. ID_FS_UUID: an ID that identifies a specific file system in the udev tree. It is not always clear how these serials are generated but this tends to be available on almost all modern disk file systems. It is not available for NFS mounts or virtual file systems. Nonetheless this is often a good way to identify a file system, and in the case of the root directory even an installation. However due to the weakly defined generation semantics the D-Bus machine ID is generally preferable.
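Reading the recommended software IDs from the list above takes only a few lines. The following is a sketch rather than a reference implementation: the function names and the "boot_id:session" string format are my own assumptions, and every reader returns None when the file is missing (for example inside a minimal container).

```python
import os
import socket


def read_first_line(path):
    """Return the stripped first line of path, or None if unreadable."""
    try:
        with open(path, "r") as f:
            return f.readline().strip() or None
    except OSError:
        return None


def boot_id():
    """Random ID of the current boot, regenerated by the kernel each boot."""
    return read_first_line("/proc/sys/kernel/random/boot_id")


def machine_id():
    """Return (source, value), falling back to the host name without D-Bus."""
    value = read_first_line("/var/lib/dbus/machine-id")
    if value is not None:
        return ("dbus", value)
    return ("hostname", socket.gethostname())


def session_id():
    """Audit session ID, or None if unset, i.e. (uint32_t) -1, or missing."""
    raw = read_first_line("/proc/self/sessionid")
    if raw is None or not raw.isdigit() or int(raw) == 0xFFFFFFFF:
        return None
    return int(raw)


def global_session_id():
    """Combine boot ID and session ID (the format is an assumption)."""
    b, s = boot_id(), session_id()
    if b is None or s is None:
        return None
    return f"{b}:{s}"


if __name__ == "__main__":
    print("uid:", os.getuid())
    print("boot:", boot_id())
    print("machine:", machine_id())
    print("session:", global_session_id())
```

The host name fallback is there only because the text suggests it for systems without D-Bus; it inherits all the weaknesses of item 2.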

Generating IDs

Linux offers a kernel interface to generate UUIDs on demand, by reading from /proc/sys/kernel/random/uuid. This is a very simple interface to generate UUIDs. That said, the logic behind UUIDs is unnecessarily complex and often it is a better choice to simply read 16 bytes or so from /dev/urandom.
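Both ways of generating an ID can be sketched as follows. The function names are hypothetical; os.urandom() is used here instead of opening /dev/urandom by hand, on the assumption that on Linux it draws from the same kernel random source.

```python
import os


def kernel_uuid():
    """Read a freshly generated UUID from the kernel, or None if unavailable."""
    try:
        with open("/proc/sys/kernel/random/uuid") as f:
            return f.read().strip()
    except OSError:
        return None


def random_id(num_bytes=16):
    """Return num_bytes of kernel randomness as a hex string."""
    return os.urandom(num_bytes).hex()


if __name__ == "__main__":
    print(kernel_uuid())
    print(random_id())
```

Every read of /proc/sys/kernel/random/uuid yields a new value, so the result has to be stored if it is meant to stay stable.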

Summary

And the gist of it all: Use /var/lib/dbus/machine-id! Use /proc/self/sessionid! Use /proc/sys/kernel/random/boot_id! Use getuid()! Use /dev/urandom! And forget about the rest, in particular the host name, or the hardware IDs such as DMI. And keep in mind that you may combine the aforementioned IDs in various ways to get different semantics and validity constraints.


Slides from LinuxTag 2010

On popular request, here are my (terse) slides from LinuxTag on systemd.


Change of Plans

The upcoming week I'll do two talks at LinuxTag 2010 at the Berlin Fair Grounds. One of them was only added to the schedule today, about systemd. Systemd has never been presented in a public talk before, so make sure to attend this historic moment... ;-). Read about what has been written about systemd so far, so that you can ask the sharpest questions during my presentation.

My second talk might be about stuff a little less reported in the press, but still very interesting, about Surround Sound in Gnome.

See you at LinuxTag!


Mango Lassi is Back

[Image: Mango Lassi's icon]

Sven Herzberg has recently been doing a lot of work on Mango Lassi, a project deserving love but which I as its original author haven't touched in 3 years.

His work is already bearing fruit:

[Image: Mango Lassi]

Distribution packagers, please go and package his version, Mango Lassi is an awesome, wonderful tool that needs distributor love.

If you want to use Mango Lassi without waiting for the distribution packagers to catch up, Sven has built some packages for you in the OpenSUSE Build Service.

Sven, KUTGW!


Name Your Threads

Stefan Kost recently pointed me to the fact that the Linux system call prctl(PR_SET_NAME) does not in fact change the process name, but the task name (comm field) -- in contrast to what the man page suggests.

That makes it very useful for naming threads, since you can read back the name you set with PR_SET_NAME earlier from the /proc file system (/proc/$PID/task/$TID/comm on newer kernels, /proc/$PID/task/$TID/stat's second field on older kernels), and hence distinguish which thread might be responsible for the high CPU load or similar problems.

So, now go, if you have a project which involves a lot of threads, name them all individually, and make it easier to debug them. What's missing now, of course, is that gdb learns this and shows the comm name when doing info threads.
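Here is a minimal sketch of naming a thread and reading the name back, assuming Linux, a kernel new enough to expose the comm file, and Python 3.8+ for threading.get_native_id(). The constant value 15 for PR_SET_NAME is taken from <linux/prctl.h>; the helper names and the thread name "audio-io" are made up for illustration.

```python
import ctypes
import os
import threading

PR_SET_NAME = 15  # value from <linux/prctl.h>

_libc = ctypes.CDLL(None, use_errno=True)


def set_current_thread_name(name):
    """Set the comm field of the calling thread (at most 15 bytes are kept)."""
    encoded = name.encode()[:15]
    if _libc.prctl(PR_SET_NAME, ctypes.c_char_p(encoded), 0, 0, 0) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def read_current_thread_name():
    """Read the calling thread's name back from /proc."""
    tid = threading.get_native_id()
    with open(f"/proc/self/task/{tid}/comm") as f:
        return f.read().rstrip("\n")


def run_named_worker(name):
    """Start a thread that names itself and reports what /proc shows."""
    result = {}

    def worker():
        set_current_thread_name(name)
        result["comm"] = read_current_thread_name()

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return result["comm"]


if __name__ == "__main__":
    print(run_named_worker("audio-io"))
```

The kernel truncates task names to 15 bytes plus the terminating NUL, so short, distinctive names work best.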

I have changed PulseAudio now to name all threads it creates.

Of course, what would be even better than this is full file system extended attribute support in procfs, so that we could attach arbitrary information to processes and threads, including references to .desktop files and such.


systemd Now Has a Web Site

We now have a web site, a mailing list, a bugzilla component and moved our git repositories to freedesktop.org. Make sure to update your check-outs.

For more details see our new web site.


LAC Video Streams Online

The great people from the Linux Audio Conference uploaded the video streams from the event. Among them you can find my own presentation. Enjoy!


PulseAudio and Jack


One thing became very clear to me during my trip to the Linux Audio Conference 2010 in Utrecht: even many pro audio folks are not sure what Jack does that PulseAudio doesn't do and what PulseAudio does that Jack doesn't do; why they are not competing, why you cannot replace one by the other, and why merging them (at least in the short term) might not make immediate sense. In other words, why millions of phones on this world run PulseAudio and not Jack, and why a music studio running PulseAudio is crack.

To shed some light on this and for future reference I'll try to explain in the following text why there is this separation between the two systems and why this isn't necessarily bad. This is mostly a written up version of (parts of) my slides from LAC, so if you attended that event you might find little new, but I hope it is interesting nonetheless.

This is mostly written from my perspective as a hacker working on consumer audio stuff (more specifically having written most of PulseAudio), but I am sure most pro audio folks would agree with the points I raise here, and have more things to add. What I explain below is in no way comprehensive, just a list of a couple of points I think are the most important, as they touch the very core of both systems (and we ignore all the toppings here, i.e. sound effects, yadda, yadda).

First of all let's clear up the background of the sound server use cases here:

Consumer Audio (i.e. PulseAudio) | Pro Audio (i.e. Jack)
Reducing power usage is a defining requirement, most systems are battery powered (Laptops, Cell Phones). | Power usage usually not an issue, power comes out of the wall.
Must support latencies low enough for telephony and games. Also covers high latency uses, such as movie and music playback (2s of latency is a good choice). | Minimal latencies are a defining requirement.
System is highly dynamic, with applications starting/stopping, hardware added and removed all the time. | System is usually static in its configuration during operation.
User is usually not proficient in the technologies used.[1] | User is usually a professional and knows audio technology and computers well.
User is not necessarily the administrator of his machine, might have limited access. | User usually administrates his own machines, has root privileges.
Audio is just one use of the system among many, and often just a background job. | Audio is the primary purpose of the system.
Hardware tends to have limited resources and be crappy and cheap. | Hardware is powerful, expensive and high quality.

Of course, things are often not as black and white as this, there are uses that fall in the middle of these two areas.

From the table above a few conclusions may be drawn:

  • A consumer sound system must support both low and high latency operation. Since low latencies mean high CPU load and hence high power consumption[2] (Heisenberg...), a system should always run with the highest latency possible, but the lowest latency necessary.
  • Since the consumer system is highly dynamic in its use latencies must be adjusted dynamically too. That makes a design such as PulseAudio's timer-based scheduling important.
  • A pro audio system's primary optimization target is low latency. Low power usage, dynamically changeable configuration (i.e. a short drop-out while you change your pipeline is acceptable) and user-friendliness may be sacrificed for that.
  • For large buffer sizes a zero-copy design suggests itself: since data blocks are large the cache pressure can be considerably reduced by zero-copy designs. Only for large buffers is the cost of passing pointers around considerably smaller than the cost of passing around the data itself (or the other way round: if your audio data has the same size as your pointers, then passing pointers around is useless extra work).
  • On a resource constrained system the ideal audio pipeline does not touch and convert the data passed along it unnecessarily. That makes it important to support natively the sample types and interleaving modes of the audio source or destination.
  • A consumer system needs to simplify the view on the hardware and hide its complexity: hide redundant mixer elements, or merge them while making use of the hardware capabilities, and extend it in software so that the same functionality is provided on all hardware. A production system should not hide or simplify the hardware functionality.
  • A consumer system should not drop-out when a client misbehaves or the configuration changes (OTOH if it happens in exceptions it is not disastrous either). A synchronous pipeline is hence not advisable, clients need to supply their data asynchronously.
  • In a pro audio system a drop-out during reconfiguration is acceptable, during operation unacceptable.
  • In consumer audio we need to make compromises on resource usage, which pro audio does not have to commit to. Example: a pro audio system can issue mlock()/mlockall() with few limitations since the hardware is powerful (i.e. a lot of RAM available) and audio is the primary purpose. A consumer audio system cannot do that because that call practically makes memory unavailable to other applications, increasing their swap pressure. And since audio is not the primary purpose of the system and resources are limited we hence need to find a different way.
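The link between latency, buffer size and wake-ups that the first conclusion relies on can be put into numbers. This is a deliberately simplified model, assuming exactly one CPU wake-up per buffer; the sample rate (48000 Hz), channel count (2) and sample width (2 bytes) are example values of my choosing, not figures from either sound server.

```python
def buffer_frames(latency_s, sample_rate_hz):
    """Number of audio frames that cover latency_s seconds of playback."""
    return round(latency_s * sample_rate_hz)


def buffer_bytes(latency_s, sample_rate_hz, channels, bytes_per_sample):
    """Size in bytes of a buffer holding latency_s seconds of audio."""
    return buffer_frames(latency_s, sample_rate_hz) * channels * bytes_per_sample


def wakeups_per_second(latency_s):
    """Simplified model: the CPU wakes up once per buffer to refill it."""
    return 1.0 / latency_s


if __name__ == "__main__":
    for latency in (2.0, 0.010):
        print(
            latency,
            buffer_frames(latency, 48000),
            buffer_bytes(latency, 48000, 2, 2),
            wakeups_per_second(latency),
        )
```

Under these assumptions 2 s of latency means a 96000-frame (384000-byte) buffer and one wake-up every two seconds, while 10 ms means a 480-frame (1920-byte) buffer and 100 wake-ups per second, i.e. 200 times as many chances to pull the CPU out of a sleep state.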

Jack has been designed for low latencies, where synchronous operation is advisable, meaning that a misbehaving client can stall the entire pipeline. Changes of the pipeline or latencies usually result in drop-outs in one way or the other, since the entire pipeline is reconfigured, from the hardware to the various clients. Jack only supports FLOAT32 samples and non-interleaved audio channels (and that is a good thing). Jack does not employ reference-counted zero-copy buffers. It does not try to simplify the hardware mixer in any way.

PulseAudio OTOH can deal with varying latencies, dynamically adjusting to the lowest latencies any of the connected clients needs. Client communication is fully asynchronous, a single client cannot stall the entire pipeline. PulseAudio supports a variety of PCM formats and channel setups. PulseAudio's design is heavily based on reference-counted zero-copy buffers that are passed around, even between processes, instead of the audio data itself. PulseAudio tries to simplify the hardware mixer as suggested above.
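To illustrate what passing reference-counted buffers around instead of the data means, here is a toy sketch. It is not PulseAudio's implementation and works only within a single process; the MemBlock class and its ref()/unref()/view() methods are hypothetical names chosen for this example.

```python
class MemBlock:
    """A toy reference-counted audio block (hypothetical, for illustration)."""

    def __init__(self, data):
        self._data = bytearray(data)
        self.refcount = 1  # the creator holds the first reference

    def ref(self):
        """Take another reference; no audio data is copied."""
        self.refcount += 1
        return self

    def unref(self):
        """Drop a reference; the last one lets go of the storage."""
        self.refcount -= 1
        if self.refcount == 0:
            self._data = None

    def view(self, offset=0, length=None):
        """Return a window onto the shared storage without copying it."""
        end = len(self._data) if length is None else offset + length
        return memoryview(self._data)[offset:end]


if __name__ == "__main__":
    block = MemBlock(bytes(1920))  # e.g. 10 ms of 16-bit stereo at 48 kHz
    stream_a = block.ref().view()
    stream_b = block.ref().view(0, 960)
    stream_a[0] = 0x7F
    # Both views see the same byte because they share one buffer.
    print(stream_b[0], block.refcount)
```

Each consumer holds only a small handle plus a reference; the 1920 bytes of audio exist exactly once, which is the property that pays off with large buffers.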

Now, the two paragraphs above hopefully show how Jack is more suitable for the pro audio use case and PulseAudio more for the consumer audio use case. One question asks itself though: can we marry the two approaches? Yes, we probably can, MacOS has a unified approach for both uses. However, it is not clear this would be a good idea. First of all, a system with the complexities introduced by sample format/channel mapping conversion, as well as dynamically changing latencies and pipelines, and asynchronous behaviour would certainly be much less attractive to pro audio developers. In fact, that Jack limits itself to synchronous, FLOAT32-only, non-interleaved-only audio streams is one of the big features of its design. Marrying the two approaches would corrupt that. A merged solution would probably not have a good stand in the community.

But it goes even further than this: what would the use case for this be? After all, most of the time, you don't want your event sounds, your Youtube, your VoIP and your Rhythmbox mixed into the new record you are producing. Hence a clear separation between the two worlds might even be handy?

Also, let's not forget that we lack the manpower to even create such an audio chimera.

So, where to from here? Well, I think we should put the focus on cooperation instead of amalgamation: teach PulseAudio to go out of the way as soon as Jack needs access to the device, and optionally make PulseAudio a normal JACK client while both are running. That way, the user has the option to use the PulseAudio supplied streams, but normally does not see them in his pipeline. The first part of this has already been implemented: Jack2 and PulseAudio do not fight for the audio device, a friendly handover takes place. Jack takes precedence, PulseAudio takes the back seat. The second part is still missing: you still have to manually hook up PulseAudio to Jack if you are interested in its streams. If both are implemented, starting Jack basically has the effect of replacing PulseAudio's core with the Jack core, while still providing full compatibility with PulseAudio clients.

And that I guess is all I have to say on the entire Jack and PulseAudio story.

Oh, one more thing, while we are clearing things up: some news sites claim that PulseAudio's not necessarily stellar reputation in some parts of the community comes from Ubuntu and other distributions having integrated it too early. Well, let me stress here explicitly that while they might have made a mistake or two in packaging PulseAudio and I publicly pointed that out (and probably not in a too friendly way), I do believe that the point in time they adopted it was right. Why? Basically, it's a chicken and egg problem. If it is not used in the distributions it is not tested, and there is no pressure to fix what then turns out to be broken: in PulseAudio itself, and in both the layers above and below it. Don't forget that pushing a new layer into an existing stack will break a lot of assumptions that the neighboring layers made. Doing this must break things. Most Free Software projects could probably use more developers, and that is particularly true for Audio on Linux. And given that that is how it is, pushing the feature in at that point in time was the right thing to do. Or in other words, if the features are right, and things do work correctly as far as the limited test base the developers control shows, then one day you need to push into the distributions, even if this might break setups and software that previously has not been tested, unless you want to stay stuck in your development indefinitely. So yes, Ubuntu, I think you did well with adopting PulseAudio when you did.

Footnotes

[1] Side note: yes, consumers tend not to know what dB is, and expect volume settings in "percentages", a mostly meaningless unit in audio. This even spills into projects like VLC or Amarok which expose linear volume controls (which is a really bad idea).

[2] In case you are wondering why that is the case: if the latency is low the buffers must be sized smaller. And if the buffers are sized smaller then the CPU will have to wake up more often to fill them up for the same playback time. This drives up the CPU load since less actual payload can be processed for the amount of housekeeping that the CPU has to do during each buffer iteration. Also, frequent wake-ups make it impossible for the CPU to go to deeper sleep states. Sleep states are the primary way for modern CPUs to save power.

© Lennart Poettering. Built using Pelican. Theme by Giulio Fidente on github.