
What's Cooking in PulseAudio's glitch-free Branch

A while ago I started development of a special branch of PulseAudio called glitch-free. In a few days I will merge it back to PulseAudio trunk, and eventually release it as 0.9.11. I think it's time to explain a little what all this "glitch-freeness" is about, what made it so tricky to implement, and why this is totally awesome technology. So, here we go:

Traditional Playback Model

Traditionally on most operating systems audio is scheduled via sound card interrupts (IRQs). When an application opens a sound card for playback it configures it with a fixed-size playback buffer. Then it fills this buffer with digital PCM sample data, and after that it tells the hardware to start playback. The hardware then reads the samples from the buffer, one at a time, and passes them on to the DAC so that eventually they reach the speakers.

After a certain number of samples have been played the sound hardware generates an interrupt. This interrupt is forwarded to the application. On Linux/Unix this is done via poll()/select(), which the application uses to sleep on the sound card file descriptor. When the application is notified via this interrupt it overwrites the samples that were just played by the hardware with new data and goes to sleep again. When the next interrupt arrives the next block of samples is overwritten, and so on and so on. When the hardware reaches the end of the hardware buffer it starts again from its beginning, in true ring buffer fashion. This goes on and on and on.

The number of samples after which an interrupt is generated is usually called a fragment (ALSA likes to call the same thing a period for some reason). The number of fragments the entire playback buffer is split into is usually integral and usually a power of two, 2 and 4 being the most frequently used values.

Image 1: Schematic overview of the playback buffer in the traditional playback model, in the best way the author can visualize it with his limited drawing abilities.

If the application is not quick enough to fill up the hardware buffer again after an interrupt, we get a buffer underrun ("drop-out"). An underrun is clearly audible to the user as a discontinuity in audio, which is something we clearly don't want. We thus have to carefully make sure that the buffer and fragment sizes are chosen such that the software has enough time to calculate the data that needs to be played, and the OS has enough time to forward the interrupt from the hardware to the userspace software and the write request back to the hardware.

The size of the playback buffer is chosen depending on the requirements of the application. It can be as small as 4ms for low-latency applications (such as music synthesizers), or as long as 2s for applications where latency doesn't matter (such as music players). The hardware buffer size directly translates to the latency the playback adds to the system. The smaller the fragment size the application configures, the more time the application has to fill up the playback buffer again.

Let's formalize this a bit: let BUF_SIZE be the size of the hardware playback buffer in samples, FRAG_SIZE the size of one fragment in samples, NFRAGS the number of fragments the buffer is split into (i.e. BUF_SIZE divided by FRAG_SIZE), and RATE the sampling rate in samples per second. The overall latency is then BUF_SIZE/RATE, and an interrupt is generated every FRAG_SIZE/RATE. Every time one of those interrupts is generated the application should fill up one fragment again; if it missed an interrupt, this might be more than one. If it doesn't miss any interrupts it has (NFRAGS-1)*FRAG_SIZE/RATE time to fulfill the request; if it needs more time than that we get an underrun. The fill level of the playback buffer should thus usually oscillate between BUF_SIZE and (NFRAGS-1)*FRAG_SIZE. In case of missed interrupts it might however fall considerably lower, in the worst case to 0, which is, again, an underrun.
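
To make this concrete, here's a tiny worked example in C, plugging in the PulseAudio <= 0.9.10 defaults mentioned further down (4 fragments of 25ms each), at an assumed sampling rate of 48000 Hz:

#include <stdio.h>

int main(void) {
    const unsigned rate      = 48000;             /* samples per second */
    const unsigned nfrags    = 4;
    const unsigned frag_size = rate * 25 / 1000;  /* 25ms = 1200 samples */
    const unsigned buf_size  = nfrags * frag_size;

    printf("latency:        %u ms\n", buf_size * 1000 / rate);          /* 100 ms */
    printf("interrupts:     %u per second\n", rate / frag_size);        /* 40     */
    printf("time to refill: %u ms\n",
           (nfrags - 1) * frag_size * 1000 / rate);                     /* 75 ms  */
    return 0;
}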

It is difficult to choose the buffer and fragment sizes in an optimal way for an application:

  • The buffer size should be as large as possible to minimize the risk of drop-outs.
  • The buffer size should be as small as possible to guarantee minimal latencies.
  • The fragment size should be as large as possible to minimize the number of interrupts, and thus the CPU time required, and to maximize the time the CPU can sleep between interrupts, and thus battery lifetime (i.e. the fewer interrupts are generated, the lower your audio app will show up in powertop, and that's what it's all about, right?)
  • The fragment size should be as small as possible to give the application as much time as possible to fill up the playback buffer, to minimize drop-outs.

As you can easily see, it is impossible to choose buffering metrics that are optimal with respect to all four requirements.

This traditional model has major drawbacks:

  • The buffering metrics are highly dependent on what the sound hardware can provide. Portable software needs to be able to deal with hardware that can only provide a very limited set of buffer and fragment sizes.
  • The buffer metrics are configured only once, when the device is opened; they usually cannot be reconfigured during playback without major discontinuities in audio. This is problematic if more than one application wants to output audio at the same time via a sound server (or dmix) and they have different latency requirements. For these sound servers/dmix the fragment metrics are configured statically in a configuration file and stay the same for the whole runtime. If a client connects that needs lower latencies, it basically loses out. If a client connects that doesn't need latencies that low, we will continuously burn more CPU/battery than necessary.
  • It is practically impossible to choose the buffer metrics optimally for your application -- there are too many variables in the equation: you can't know anything about the IRQ/scheduling latencies of the OS/machine your software will be running on; you cannot know how much time it will actually take to produce the audio data that shall be pushed to the audio device (unless you start counting cycles, which is a good way to make your code unportable); and the scheduling latencies are hugely dependent on the system load on most current OSes (unless you have an RT system, which we generally do not have). And as said, for sound servers/dmix it is impossible to know in advance what latency requirements the applications that might eventually connect will have.
  • Since the number of fragments is integral and at least 2 on almost all existing hardware, we will generate at least two interrupts on each buffer iteration. If we fix the buffer size to 2s then we will get an interrupt at least every 1s. We'd then have 1s to fill up the buffer again -- on all modern systems this is far more than we'd ever need. It would be much better if we could fix the fragment size to 1.9s, which would still give us 100ms to fill up the playback buffer, still more than necessary on most systems.

Due to the limitations of this model, most current (Linux/Unix) software uses buffer metrics that merely turned out to "work most of the time"; very often they are chosen without much thought, by copying other people's code, or totally at random.

PulseAudio <= 0.9.10 uses a fragment size of 25ms by default, with four fragments. That means that right now, unless you reconfigure your PulseAudio manually, clients will not get latencies lower than 100ms whatever you try, and as long as music is playing you will get 40 interrupts/s. (The relevant configuration options for PulseAudio are default-fragments= and default-fragment-size-msec= in daemon.conf.)
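
Spelled out, the corresponding daemon.conf excerpt would look like this (the values shown are just the built-in defaults made explicit):

; /etc/pulse/daemon.conf (excerpt) -- these are the 0.9.10 defaults
default-fragments = 4
default-fragment-size-msec = 25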

dmix uses 16 fragments by default with a size of 21ms each (on my system at least -- this varies depending on your hardware). You can't get fewer than 47 interrupts/s. (You can change the parameters in .asoundrc.)
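
If you want to tune this, a minimal .asoundrc sketch might look like the following. The numbers are illustrative, not recommendations; 1024 frames at 48000 Hz is roughly the 21ms mentioned above:

pcm.!default {
    type plug
    slave.pcm "dmixed"
}

pcm.dmixed {
    type dmix
    ipc_key 1024          # any unique integer
    slave {
        pcm "hw:0"
        period_size 1024  # fragment ("period") size in frames
        periods 16        # number of fragments
    }
}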

So much for the traditional model and its limitations. Now, let's have a peek at how the new glitch-free branch of PulseAudio does things. The technology is not really new. It's inspired by what Vista does these days and what Apple CoreAudio has been doing for quite a while. However, on Linux this technology is new; we have been lagging behind quite a bit. Also, I claim that what PA does now goes beyond what Vista/MacOS does in many ways, though of course they provide much more than we do in many other ways. The name glitch-free is inspired by the term Microsoft uses for this model; however, I must admit that I am not sure that my definition of this term and theirs are actually the same.

Glitch-Free Playback Model

The first basic idea of the glitch-free playback model (a better, less marketingy name is probably timer-based audio scheduling, which is the term I use internally in the PA codebase) is to no longer depend on sound card interrupts to schedule audio but to use system timers instead. System timers are far more flexible than the fragment-based sound card timers. They can be reconfigured at any time and have a granularity that is independent of any buffer metrics of the sound card. The second basic idea is to use playback buffers that are as large as possible, up to a limit of 2s or 5s. The third basic idea is to allow rewriting of the hardware buffer at any time. This allows instant reaction to user input (i.e. pause/seek requests in your music player, or instant event sounds) even though the huge latency imposed by the hardware playback buffer would suggest otherwise.

PA configures the audio hardware to the largest playback buffer size possible, up to 2s. The sound card interrupts are disabled as far as possible (most of the time this simply means lowering NFRAGS to the minimal value supported by the hardware; it would be great if ALSA allowed us to disable sound card interrupts entirely). Then, PA constantly determines the minimal latency requirement of all connected clients. If no client specified any requirements, we fill up the whole buffer all the time, i.e. have an actual latency of 2s. However, if some applications specified requirements, we take the lowest one and only use as much of the configured hardware buffer as this value allows us. In practice, this means we only partially fill the buffer each time we wake up. Then, we configure a system timer to wake us up 10ms before the buffer would run empty and fill it up again at that point. If the overall latency is configured to less than 10ms, we wake up after half the requested latency.

If the sleep time turns out to be too long (i.e. it took more than 10ms to fill up the hardware buffer), we get an underrun. If this happens we can double the time we wake up before the buffer would run empty, to 20ms, and so on. Conversely, if we notice that we used much less time than we estimated, we can halve this value again. This adaptive scheme makes sure that in the unlikely event of a buffer underrun, it will most likely happen only once and never again.
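
In heavily simplified C the scheduling loop looks roughly like this. All helper functions and names here are hypothetical stand-ins for the real ALSA/PA machinery -- this is a sketch of the idea, not the actual PA code:

#include <stdint.h>

#define HW_BUF_USEC 2000000ULL               /* hardware buffer, up to 2s */

/* Hypothetical helpers, standing in for ALSA/PA internals: */
extern int      had_underrun(void);
extern uint64_t usec_needed_for_last_fill(void);
extern void     fill_hw_buffer(uint64_t usec);
extern void     arm_system_timer(uint64_t usec);

static uint64_t margin = 10000;              /* wake up 10ms early */

void run_one_iteration(uint64_t req_latency_usec) {
    /* Use only as much of the buffer as the lowest client requirement
     * allows; with no requirements this is the full 2s. */
    uint64_t fill = req_latency_usec < HW_BUF_USEC
                  ? req_latency_usec : HW_BUF_USEC;

    if (had_underrun())
        margin *= 2;                         /* sleep time was too long */
    else if (usec_needed_for_last_fill() < margin / 2)
        margin /= 2;                         /* we were too pessimistic */

    fill_hw_buffer(fill);

    /* Wake up shortly before the buffer runs empty; for latencies below
     * the margin, wake up after half the requested latency instead. */
    arm_system_timer(fill > margin ? fill - margin : fill / 2);
}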

When a new client connects or an existing client disconnects, when a client wants to rewrite what it already wrote, or when the user wants to change the volume of one of the streams, PA resamples the data passed by the clients, converts it to the proper hardware sample type, and remixes it with the data of the other clients. This of course makes it necessary to keep a "history" of data of all clients around, so that if one client requests a rewrite we have the necessary data available to remix what was already mixed before.

The benefits of this model are manifold:

  • We minimize the overall number of interrupts, down to what the latency requirements of the connected clients allow. I.e. we save power and no longer show up in powertop for normal music playback.
  • We maximize drop-out safety, because we buffer up to 2s in the usual cases. Only on operating systems with scheduling latencies > 2s can we still get drop-outs. Thankfully no operating system is that bad.
  • In the event of an underrun we don't get stuck in it, but instead are able to recover quickly and can make sure it doesn't happen again.
  • We provide "zero latency". Each client can rewrite its playback buffer at any time, and this is forwarded to the hardware, even if it means that the sample currently being played needs to be rewritten. This means much quicker reaction to user input and a more responsive user experience.
  • We become much less dependent on what the sound hardware provides us with. We can configure wakeup times that are independent of the fragment settings the hardware actually supports.
  • We can provide almost any latency a client might request, dynamically, without reconfiguration and without discontinuities in audio.

Of course, this scheme also comes with major complications:

  • System timers and sound card timers deviate -- on many sound cards by quite a bit. Also, not all sound cards allow the user to query the playback frame index at any time, but only shortly after each IRQ. To compensate for this deviation, PA contains a non-trivial algorithm that tries to estimate and follow the deviation over time. If this doesn't work properly, an underrun might happen much earlier than we expected.
  • System timers on Unix are not very high precision. On traditional Linux with HZ=100, sleep times for timers are rounded up to multiples of 10ms. Only very recent Linux kernels with hrtimers can provide something better, and so far only on x86 and x86-64. This makes the whole scheme unusable for low-latency setups unless you run the very latest Linux. Also, hrtimers are not (yet) exposed in poll()/select(). It requires major jumping through hoops to work around this limitation.
  • We need to keep a history of sample data for each stream around, thus increasing the memory footprint and potentially the cache pressure. PA tries to counter the increased memory footprint and cache pressure this might cause by doing zero-copy memory management.
  • We're still dependent on the maximum playback buffer size the sound hardware supports. Many sound cards don't even support 2s, but only 300ms or so.
  • The fact that rewriting the client buffers causes rewriting of the hardware buffer complicates the resampling/converting step immensely. In general, the code to implement this model is more complex than for the traditional model. Also, ALSA has not really been designed with this model in mind, which makes some things very hard to get right and suboptimal.
  • Generally, this works reliably only with the newest ALSA, the newest kernel, the newest everything. It has pretty steep requirements on software and sometimes even on hardware. To stay compatible with systems that don't fulfill these requirements we need to carry around code for the traditional playback model as well, which increases the code base considerably.

The advantages of the scheme clearly outweigh the complexities it causes. Especially the power-saving features of glitch-free PA should be reason enough for the embedded Linux people to adopt it quickly. Make PA disappear from powertop even while you play music!

The code in the glitch-free branch is still rough and sometimes incomplete. I will merge it into trunk shortly and then upload a snapshot to Rawhide.

I hope this text also explains a little better to the few remaining PA haters why PA is a good thing, and why everyone should have it on their Linux desktop. Of course these changes are not visible on the surface; my hope with this blog story is to explain a bit better why infrastructure matters, and to counter misconceptions about what PA actually is and what it gives you on top of ALSA.


Updated PulseAudio Plugin for SDL

Quick update for all game kiddies: apply this patch to SDL and enjoy PulseAudio in your favourite SDL-based game without buffering issues. It basically just fixes the bogus buffer metrics of Stephan's original patch.


Updated PulseAudio Plugin for Xine

Quick update for all K-lovers: apply this patch to xine-lib and enjoy PulseAudio in Amarok and other KDE apps without stability issues. It's a race-free rework of Diego's original patch.


BOSSA 2008

Just three words: awesome awesome awesome.

And for those asking for it, here are my slides, in which I try to explain the new "glitch-free" audio scheduling core of PulseAudio that I recently committed to the glitch-free branch in PA SVN. I also try to make clear why this functionality is practically a *MUST* for all people who want low-latency audio, minimal power consumption and maximum drop-out safety for their audio playback -- and thus why all those fancy embedded Linux devices had better adopt it sooner rather than later. The slides might appear a bit terse if you don't have that awesome guy they usually come with presenting them to you.


Back from LCA

After coming back from my somewhat extended linux.conf.au trip I spent the whole day grepping through email. Only 263 unprocessed emails left in my inbox. Yay.

PRTPILU

Thanks to the LCA guys, video footage is now available of all talks, including my talk Practical Real-Time Programming in Linux Userspace (Theora, Slides). In my endless modesty I have to recommend: go watch it, it contains some really good stuff (including me not being able to divide 1 by 1000). Right now, the real-time features of the Linux kernel are seldom used on the desktop for a couple of reasons, among them the general difficulty and risk of using them, but predominantly it's probably just unawareness. There are a couple of situations, however, where scheduling desktop processes as RT makes a lot of sense (think of video playback, mouse cursor feedback, etc.), to decouple the execution (scheduling) latency from the system load. This talk focused mostly on non-trivial technical stuff and all the limitations RT on Linux still has. To fully grok what's going on you thus need some insight into concurrent programming and the like.

My plan is to submit a related talk to GUADEC which will focus more on actually building RT apps for the desktop, in the hope that we will eventually be able to ship a desktop with audio and video that never skips, and where user feedback stays snappy and quick even if we do the most complicated IO-intensive processing in lots of different processes in the background on slow hardware.

I didn't have time to go through all my slides (which was intentional and perfectly OK), so you might want to browse through the slides even if you saw the whole clip. The slides, however, are not particularly verbose.

Rumors

Regarding all those rumors that were spread while I -- the maintainer of PulseAudio -- was in the middle of the Australian outback, fist-fighting with kangaroos near Uluru: I am not really asking anyone to port their apps to the native PulseAudio API right now. While I do think the API is quite powerful and not redundant, I also acknowledge that it is very difficult to use properly (and very easy to misuse), (mostly) due to its fully asynchronous nature. The mysterious libsydney project is supposed to fix this and a lot more. libsydney is mostly the Duke Nukem Forever of audio APIs right now, but in contrast to DNF I didn't really announce it publicly yet, so it doesn't really count. ;-) Suffice to say, the current situation of audio APIs is a big, big mess. We are working on cleaning it up. For now: stick to the well-established and least-broken APIs, which boils down to ALSA. Stop using the OSS API now! Don't program against the ESD API (except for event sounds).

But, most importantly: please stop misusing the existing APIs. I am doing my best to allow all current APIs to run without hassle on top of PA, but due to the sometimes blatant misuse, or even brutal violations, of those APIs it is very hard to get that working for all applications (yes, that means you, Adobe, and you, Skype). Don't expect that mmap is available on all audio devices -- it's not, and especially not on PA. Don't use /proc/asound/pcm as an API for enumerating audio devices; it's totally unsuitable for that. Don't hard-code device strings. Use default as the device string. Don't make assumptions that are not and cannot be true for non-hardware devices. Don't fiddle around with period settings unless you fully grok them and know what you are doing. In short: be a better citizen, write code you don't need to be ashamed of. ALSA has its limitations and my compatibility code certainly does as well, but this is not an excuse for working around them by writing code that makes little children cry. If you have a good ALSA backend for your program then this will not only fix your issues with PA but also with Bluetooth, and you will have less code to maintain, and code that is much easier to maintain.
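
To illustrate the "use default" advice, here is a minimal sketch of what a well-behaved ALSA client's open call could look like (error handling shortened; the wrapper function is mine, not from any real project):

#include <stdio.h>
#include <alsa/asoundlib.h>

int open_default_playback(snd_pcm_t **pcm) {
    /* "default" routes through whatever the user configured:
     * dmix, PulseAudio, Bluetooth, ... -- no hard-coded "hw:0". */
    int err = snd_pcm_open(pcm, "default", SND_PCM_STREAM_PLAYBACK, 0);
    if (err < 0)
        fprintf(stderr, "snd_pcm_open: %s\n", snd_strerror(err));
    return err;
}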

Or even shorter: Fix. Your. Broken. ALSA. Client. Code. Thank you.

Oh, if you have questions regarding PA, just ping me on IRC (if I am around) or write me an email, like everyone else. Mysterious, blogged pseudo-invitations to rumored meetings are not the best way to contact me.


GIT Mirrors of my SVN Repositories

To whom it may concern: as a first step to move away from SVN and towards GIT for all my code, I have now configured live GIT mirrors for all my SVN repositories. The plan is to move fully to GIT, but not as long as the GIT integration into Trac is as painful as it is right now. The scripts I used to initialize and update the mirrors are svn-live-init and svn-live-update, for those interested. They are based on scripts CJ van den Berg supplied me with.

It would be great to have the mirrors work both ways. Lazyweb, do you know how to do that?


Avahi/Zeroconf patch for distcc updated

I finally found the time to sit down and update my venerable Avahi/Zeroconf patch for distcc. A patched distcc automatically discovers suitable compiler servers on the local network, without the need to configure them manually. (Announcement).

Here's a quick HOWTO for using a patched distcc like this:

  • Make sure to start distccd (the server) with the new --zeroconf switch. This will make it announce its services on the network.
  • Edit your $HOME/.distcc/hosts and add +zeroconf. This magic string enables Zeroconf support in the client, i.e. it will be expanded to the list of suitable distcc servers available on your LAN.
  • Now set $CC to distcc gcc globally for your login sessions. This will tell all well-behaved build systems to use distcc for compilation (with the kernel as one notable exception, where this doesn't work). Even better than setting $CC to distcc gcc is setting it to ccache distcc gcc, which enables ccache in addition to distcc. I.e. stick something like this in your ~/.bash_profile: export CC="ccache distcc gcc"
  • And finally, use make -j `distcc -j` instead of plain make to enable parallel building with the right number of concurrent processes. Setting $MAKEFLAGS properly is an alternative, but a suboptimal one if the evaluation is only done once at login time. (All four steps are combined in the example below.)
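
Put together, a session using all of the above might look like this (the --daemon switch is the usual way to run distccd standalone; adjust paths and flags to your setup):

# on every machine that should offer compile services:
distccd --daemon --zeroconf

# on the client, once:
echo +zeroconf >> ~/.distcc/hosts

# in ~/.bash_profile:
export CC="ccache distcc gcc"

# and to build:
make -j `distcc -j`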

If this doesn't work for you, then it is a good idea to run distcc --show-hosts to get a list of discovered distcc servers. If this list isn't complete, then this is most likely due to mismatching GCC versions or architectures. To check if that's the case, use avahi-browse -r _distcc._tcp and compare the values of the cc_machine and cc_version fields. Please note that different Linux distributions use different GCC machine strings, which is expected, since GCC is usually patched quite a bit on the different distributions. This means that a Fedora distcc (the client) will not find a Debian distccd (the server) and vice versa. But again: that's a feature, not a bug.

The new -j and --show-hosts options for distcc are useful for non-zeroconf setups, too.

The patch will automatically discover the number of CPUs on remote machines and make use of that information to better distribute jobs.

In short: Zeroconf support in distcc is totally hot, everyone should have it!

For more information have a look on the announcement of my original patch from 2004 (at that time for the historic HOWL Zeroconf daemon), or read the new announcement linked above.

Distribution packagers! Please merge this new patch into your packages! It would be a pity to withhold Zeroconf support in distcc from your users any longer!

Unfortunately, Fedora doesn't include any distcc packages. Someone should change that (someone who's not me ;-)).

You like this patch? Then give me a kudo on ohloh.net. Now that I earned a golden 10 (after kicking Larry Ewing from position 64. Ha, take that Mr. Ewing!), I need to make sure I don't fall into silver oblivion again. ;-)


Avahi 0.6.22

A couple of minutes ago I released Avahi 0.6.22 into the wild, the newest iteration of everyone's favourite zero configuration networking suite.


You ask why this is something to blog about?

Firstly, new in this version is Sjoerd Simons' avahi-gobject library, a GObject wrapper around the Avahi API. It allows full GObject-style object oriented programming of Zeroconf applications, with signals and everything. To all you GNOME/Gtk+ hackers out there: now it is even more fun to hack your own Zeroconf applications for GNOME/Gtk+!

Secondly, this is the first release to ship i18n support. For those who prefer to run their systems with non-English locales[1] this should be good news. I've always been a little afraid of adding i18n support, since it either meant that I would constantly have had to commit i18n patches, or that I would have needed to move my code to GNOME SVN. However, we now have Fedora's Transifex, which allows me to open up my SVN to translators without much organizational work on my side. Translations are handled centrally and committed back to my repository when needed. It's a bit like Canonical's Rosetta, but with a focus on committing i18n changes upstream, and without being closed-source crap.

You like this release? Then give me a kudo on ohloh.net. My ego still thirsts for gold, and I am still (or again) 25 positions away from that. ;-)

Footnotes

[1] Personally, I run my desktop with LC_MESSAGES=C but LANG=de_DE, which are the settings I can recommend to everyone who is from Germany and wants to stay sane. Unfortunately it is a PITA to configure this in GNOME, though.
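
For the shell-minded, that boils down to something like this in ~/.bash_profile:

export LANG=de_DE
export LC_MESSAGES=C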


Lazyweb: POSIX Process Groups and Sessions

Dear Lazyweb,

I have trouble understanding what exactly POSIX process groups and sessions are good for. The POSIX docs are very vague on this. What exactly is the effect of being in a process group with some other process, and what does being in the same session add on top? And what is the benefit of being a group/session leader in contrast to being just a normal random process in the group/session?

The only thing I understood is that kill(2) with a negative first parameter can be used to "multicast" signals to entire process groups, and that the SIGINT triggered by C-c is delivered that way. But is that all? The POSIX docs say "... for the purpose of signaling, placement in foreground or background, and other job control actions", which is very vague. What are those "other job control actions"? What does job control consist of besides multicasting signals? And what is "placement in foreground or background" other than delivering signals?
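
For what it's worth, here is the one bit I do understand, as a tiny self-contained toy example (hypothetical code, just for illustration):

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t child = fork();
    if (child == 0) {
        setpgid(0, 0);         /* child becomes leader of a new group */
        pause();               /* wait to be signalled */
        _exit(0);
    }
    setpgid(child, child);     /* do it in the parent too, avoiding a race */
    sleep(1);
    kill(-child, SIGTERM);     /* negative pid: signal the whole group */
    waitpid(child, NULL, 0);
    return 0;
}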

And I totally don't get POSIX sessions and how they differ from POSIX process groups. Please enlighten me!

Puzzled,
    Lennart


Emulated atomic operations and real-time scheduling

Unfortunately not all CPU architectures have native support for atomic operations, or they support only a very limited subset. Most prominently, ARMv5 (and older) has no support beyond the most basic atomic swap operation[1]. Now, more and more free code is starting to use atomic operations and lock-free algorithms, one example being my own project, PulseAudio. If you have ever done real-time programming you probably know that you cannot really do it without support for atomic operations. One question remains, however: what to do on CPUs which natively support only the most basic atomic operations?

On the kernel side atomic ops are very easy to emulate: just disable interrupts temporarily, do your operation non-atomically, and then enable them again. That's relatively cheap and works fine (unless you are on SMP -- which fortunately you usually are not for those CPUs). The Linux kernel does it this way and it is good. But what to do in user space, where you cannot just go and disable interrupts?
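
The pattern is roughly this, modeled on the pre-ARMv6 uniprocessor code in the Linux kernel (shown with a local typedef so it is readable out of context; this is kernel-side code, not something you can do in userspace):

typedef struct { volatile int counter; } atomic_t;

static inline int atomic_add_return(int i, atomic_t *v)
{
        unsigned long flags;
        int val;

        local_irq_save(flags);    /* interrupts off: on UP nothing
                                   * can preempt us now */
        val = v->counter;
        v->counter = val += i;
        local_irq_restore(flags); /* interrupts back on */

        return val;
}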

Let's see how the different userspace libraries/frameworks do it on ARMv5, a very relevant architecture that only knows an atomic swap (exchange) but no CAS or atomic arithmetic. Let's start with an excerpt from glibc's atomic operations implementation for ARM:

/* Atomic compare and exchange.  These sequences are not actually atomic;
   there is a race if *MEM != OLDVAL and we are preempted between the two
   swaps.  However, they are very close to atomic, and are the best that a
   pre-ARMv6 implementation can do without operating system support.
   LinuxThreads has been using these sequences for many years.  */

This comment says it all. Not good. The more you make use of atomic operations, the more likely you're going to hit this race. Let's hope glibc is not a heavy user of atomic operations. PulseAudio however is, and PulseAudio happens to be my focus.

Let's have a look at how Qt4 does it:

extern Q_CORE_EXPORT char q_atomic_lock;

inline char q_atomic_swp(volatile char *ptr, char newval)
{
    register int ret;
    asm volatile("swpb %0,%1,[%2]"
                 : "=&r"(ret)
                 : "r"(newval), "r"(ptr)
                 : "cc", "memory");
    return ret;
}

inline int q_atomic_test_and_set_int(volatile int *ptr, int expected, int newval)
{
    int ret = 0;
    while (q_atomic_swp(&q_atomic_lock, ~0) != 0);
    if (*ptr == expected) {
	*ptr = newval;
	ret = 1;
    }
    q_atomic_swp(&q_atomic_lock, 0);
    return ret;
}

So, what do we have here? A slightly better version. In standard situations it actually works. But it sucks big time, too. Why? It contains a spin lock: the variable q_atomic_lock is used for locking the atomic operation. The code tries to set it to non-zero, and if that fails it tries again, until it succeeds, in the hope that the other thread -- which currently holds the lock -- gives it up. The big problem here is: it might take a while until that happens, up to 1/HZ time on Linux. Usually you want to use atomic operations to minimize the need for mutexes and thus speed things up. Now, here you got a lock, and of the worst kind: the spinning lock. Not good. Also, if used from a real-time thread the machine simply locks up when we enter the loop in contended state, because preemption is disabled for RT threads and thus the loop will spin forever. Evil. And then, there's another problem: it's a big bottleneck, because all atomic operations are synchronized via a single variable, q_atomic_lock. Not good either. And let's not forget that only code that has access to q_atomic_lock can actually execute this code safely. If you want to use it for lock-free IPC via shared memory this is going to break. And let's not forget that it is unusable from signal handlers (which probably doesn't matter much, though). So, in summary: this code sucks, too.

Next try: let's have a look at how glib does it:

static volatile int atomic_spin = 0;

static int atomic_spin_trylock (void)
{
  int result;

  asm volatile (
    "swp %0, %1, [%2]\n"
    : "=&r,&r" (result)
    : "r,0" (1), "r,r" (&atomic_spin)
    : "memory");
  if (result == 0)
    return 0;
  else
    return -1;
}

static void atomic_spin_lock (void)
{
  while (atomic_spin_trylock())
    sched_yield();
}

static void atomic_spin_unlock (void)
{
  atomic_spin = 0;
}

gint
g_atomic_int_exchange_and_add (volatile gint *atomic,
			       gint           val)
{
  gint result;

  atomic_spin_lock();
  result = *atomic;
  *atomic += val;
  atomic_spin_unlock();

  return result;
}

Once again, a spin loop. However, this implementation makes use of sched_yield() to ask the OS to reschedule. It's a bit better than the Qt version, since it doesn't just spin burning CPU, but instead tells the kernel to execute something else, increasing the chance that the thread currently holding the lock is scheduled. It's a bit friendlier, but not great either, because this might still delay execution quite a bit. This is probably one of the very few legitimate occasions where using sched_yield() is OK. Still, it doesn't work for RT -- because sched_yield() is in most cases a NOP for RT threads, you still get a machine lockup. And it still has the one-lock-to-rule-them-all bottleneck. And it still is not compatible with shared memory.

Then, there's libatomic_ops. It's the most complex code, so I'll spare you the paste. Basically it uses the same spin loop, with three differences however:

  1. 16 lock variables are used instead of a single one. The variable to use is picked via simple hashing of the pointer to the atomic variable that shall be modified. This removes the one-lock-to-rule-them-all bottleneck.
  2. Instead of sched_yield() it uses select() with a small timeval parameter to give the current holder of the lock some time to give it up. To make sure that the select() is not optimized away by the kernel and the thread thus never preempted, the sleep time is increased on every loop iteration.
  3. It explicitly disables signals before doing the atomic operation.

It's certainly the best implementation of the ones discussed here: it doesn't suffer from the one-lock-to-rule-them-all bottleneck, and it's (supposedly) signal-handler-safe (which however comes at the price of two syscalls per atomic operation -- probably a very high price). It actually works for RT, due to sleeping for an explicit time. However, it still doesn't deal with priority inversion problems -- a big issue for real-time programming. Also, the time slept in the select() call might be relatively long, since at least on Linux the time passed to select() is rounded up to 1/HZ -- not good for RT either. And then, it still doesn't work for shared memory IPC.

So, what do we learn from this? At least one thing: better not to do real-time programming on ARMv5[2]. But more practically: what could a good emulation of atomic ops, based solely on atomic swap, look like? Here are a few ideas:

  • Use an implementation inspired by libatomic_ops. Right now it's the best available. It's probably a good idea, though, to replace select() with nanosleep(), since on recent kernels the latter no longer rounds up to 1/HZ, at least when you have high-resolution timers[3]. Then, if you can live without signal handler safety, drop the signal mask changing.
  • If you use something based on libatomic_ops and want to use it for shared memory IPC, then you have the option to move the lock variables into shared memory too. Note however that this allows evil applications to lock up your process by taking the locks and never giving them up (which, however, is always a problem if not all the atomic operations you need are available in hardware). So if you do this, make sure that only trusted processes can attach to your memory segment.
  • Alternatively, spend some time investigating whether it is possible to use futexes to sleep on the lock variables. This is not trivial, though, since futexes right now expect the availability of an atomic increment operation. But it might be possible to emulate this well enough with the swap operation. There's now even a FUTEX_LOCK_PI operation which would allow priority inheritance.
  • Alternatively, find a way to allow user space to disable interrupts cheaply (requires kernel patching). Since enabling RT scheduling is already a privileged operation (since you may easily lock up your machine with it), it might not be too problematic to extend the ability to disable interrupts to user space: it's just yet another way to lock up your machine.
  • For the libatomic_ops-based algorithm: if you're lucky and defined a struct type for your atomic integer types, like the kernel does, or like I do in PulseAudio with pa_atomic_t, then you can stick the lock variable directly into your structure. This makes shared memory support transparent and removes the one-lock-to-rule-them-all bottleneck completely. Of course, OTOH, it increases memory consumption a bit and increases cache pressure (though I'd assume this is negligible).
  • For the libatomic_ops-based algorithm: start by sleeping for the time returned by clock_getres() (and cache the result!). You cannot sleep shorter than that anyway. (A condensed sketch combining several of these ideas follows below.)
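
To sketch what that could look like, here is a condensed toy version combining the libatomic_ops approach with the nanosleep() and pointer-hashing ideas from the list above. All names are mine, the signal blocking is omitted, and the backoff constants are arbitrary:

#include <stdint.h>
#include <time.h>

#define N_LOCKS 16

static volatile int locks[N_LOCKS];

/* The one primitive ARMv5 gives us: atomic swap. */
static inline int atomic_swap(volatile int *p, int v) {
    int r;
    __asm__ __volatile__("swp %0, %1, [%2]"
                         : "=&r" (r)
                         : "r" (v), "r" (p)
                         : "memory");
    return r;
}

static void lock(volatile int *l) {
    struct timespec ts = { 0, 1000 };          /* start at 1us */
    while (atomic_swap(l, 1) != 0) {
        nanosleep(&ts, NULL);                  /* let the holder run */
        if (ts.tv_nsec < 1000000)
            ts.tv_nsec *= 2;                   /* back off, capped at 1ms */
    }
}

/* Emulated compare-and-swap, protected by a hashed spin lock. */
int emulated_cas(volatile int *atomic, int expected, int newval) {
    volatile int *l = &locks[((uintptr_t) atomic >> 4) % N_LOCKS];
    int ret = 0;

    lock(l);
    if (*atomic == expected) {
        *atomic = newval;
        ret = 1;
    }
    atomic_swap(l, 0);                         /* release */
    return ret;
}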

Yepp, that's as good as it gets. Unfortunately I cannot serve you the optimal solution on a silver platter. I never actually did development for ARMv5; this blog story just sums up my thoughts on all the code I saw that emulates atomic ops on ARMv5. But maybe someone who actually cares about atomic operations on ARM finds this interesting and maybe invests some time to prepare patches for Qt, glib, glibc -- and PulseAudio.

Update: I added two more ideas to the list above.

Update 2: Andrew Haley just posted something like the optimal solution for the problem. It would be great if people would start using this.

Footnotes

[1] The Nokia 770 has an ARMv5 chip, N800 has ARMv6. The OpenMoko phone apparently uses ARMv5.

[2] And let's not even think about CPUs which don't even have an atomic swap!

[3] Which however you probably won't, given that they're only available on x86 on stable Linux kernels for now -- but still, it's cleaner.
