Category: projects

Emulated atomic operations and real-time scheduling

Unfortunately not all CPU architectures have native support for atomic operations, or only support a very limited subset. Most prominently ARMv5 (and older) hasn't any support besides the most basic atomic swap operation[1]. Now, more and more free code is starting to use atomic operations and lock-free algorithms, one being my own project, PulseAudio. If you have ever done real-time programming you probably know that you cannot really do it without support for atomic operations. One question remains however: what to do on CPUs which support only the most basic atomic operations natively?

On the kernel side atomic ops are very easy to emulate: just disable interrupts temporarily, then do your operation non-atomically, and afterwards enable them again. That's relatively cheap and works fine (unless you are on SMP -- which fortunately you usually are not for those CPUs). The Linux kernel does it this way and it is good. But what to do in user-space, where you cannot just go and disable interrupts?

Let's see how the different userspace libraries/frameworks do it for ARMv5, a very relevant architecture that only knows an atomic swap (exchange) but no CAS or even atomic arithmetics. Let's start with an excerpt from glibc's atomic operations implementation for ARM:

/* Atomic compare and exchange.  These sequences are not actually atomic;
   there is a race if *MEM != OLDVAL and we are preempted between the two
   swaps.  However, they are very close to atomic, and are the best that a
   pre-ARMv6 implementation can do without operating system support.
   LinuxThreads has been using these sequences for many years.  */

This comment says it all. Not good. The more you make use of atomic operations the more likely you're going to to hit this race. Let's hope glibc is not a heavy user of atomic operations. PulseAudio however is, and PulseAudio happens to be my focus.

Let's have a look on how Qt4 does it:

extern Q_CORE_EXPORT char q_atomic_lock;

inline char q_atomic_swp(volatile char *ptr, char newval)
{
    register int ret;
    asm volatile("swpb %0,%1,[%2]"
                 : "=&r"(ret)
                 : "r"(newval), "r"(ptr)
                 : "cc", "memory");
    return ret;
}

inline int q_atomic_test_and_set_int(volatile int *ptr, int expected, int newval)
{
    int ret = 0;
    while (q_atomic_swp(&q_atomic_lock, ~0) != 0);
    if (*ptr == expected) {
	*ptr = newval;
	ret = 1;
    }
    q_atomic_swp(&q_atomic_lock, 0);
    return ret;
}

So, what do we have here? A slightly better version. In standard situations it actually works. But it sucks big time, too. Why? It contains a spin lock: the variable q_atomic_lock is used for locking the atomic operation. The code tries to set it to non-zero, and if that fails it tries again, until it succeeds, in the hope that the other thread -- which currently holds the lock -- gives it up. The big problem here is: it might take a while until that happens, up to 1/HZ time on Linux. Usually you want to use atomic operations to minimize the need for mutexes and thus speed things up. Now, here you got a lock, and it's the worst kind: the spinning lock. Not good. Also, if used from a real-time thread the machine simply locks up when we enter the loop in contended state, because preemption is disabled for RT threads and thus the loop will spin forever. Evil. And then, there's another problem: it's a big bottleneck, because all atomic operations are synchronized via a single variable which is q_atomic_lock. Not good either. And let's not forget that only code that has access to q_atomic_lock actually can execute this code safely. If you want to use it for lock-free IPC via shared memory this is going to break. And let's not forget that it is unusable from signal handlers (which probably doesn't matter much, though). So, in summary: this code sucks, too.

Next try, let's have a look on how glib does it:

static volatile int atomic_spin = 0;

static int atomic_spin_trylock (void)
{
  int result;

  asm volatile (
    "swp %0, %1, [%2]\n"
    : "=&r,&r" (result)
    : "r,0" (1), "r,r" (&atomic_spin)
    : "memory");
  if (result == 0)
    return 0;
  else
    return -1;
}

static void atomic_spin_lock (void)
{
  while (atomic_spin_trylock())
    sched_yield();
}

static void atomic_spin_unlock (void)
{
  atomic_spin = 0;
}

gint
g_atomic_int_exchange_and_add (volatile gint *atomic,
			       gint           val)
{
  gint result;

  atomic_spin_lock();
  result = *atomic;
  *atomic += val;
  atomic_spin_unlock();

  return result;
}

Once again, a spin loop. However, this implementation makes use of sched_yield() for asking the OS to reschedule. It's a bit better than the Qt version, since it doesn't spin just burning CPU, but instead tells the kernel to execute something else, increasing the chance that the thread currently holding the lock is scheduled. It's a bit friendlier, but it's not great either because this might still delay execution quite a bit. It's better then the Qt version. And probably one of the very few ligitimate occasions where using sched_yield() is OK. It still doesn't work for RT -- because sched_yield() in most cases is a NOP on for RT threads, so you still get a machine lockup. And it still has the one-lock-to-rule-them-all bottleneck. And it still is not compatible with shared memory.

Then, there's libatomic_ops. It's the most complex code, so I'll spare you to paste it here. Basically it uses the same spin loop. With three differences however:

  1. 16 lock variables instead of a single one are used. The variable that is used is picked via simple hashing of the pointer to the atomic variable that shall be modified. This removes the one-lock-to-rule-them-all bottleneck.
  2. Instead of pthread_yield() it uses select() with a small timeval parameter to give the current holder of the lock some time to give it up. To make sure that the select() is not optimized away by the kernel and the thread thus never is preempted the sleep time is increased on every loop iteration.
  3. It explicitly disables signals before doing the atomic operation.

It's certainly the best implementation of the ones discussed here: It doesn't suffer by the one-lock-to-rule-them-all bottleneck. It's (supposedly) signal handler safe (which however comes at the cost of doing two syscalls on every atomic operation -- probably a very high price). It actually works on RT, due to sleeping for an explicit time. However it still doesn't deal with priority inversion problems -- which is a big issue for real-time programming. Also, the time slept in the select() call might be relatively long, since at least on Linux the time passed to select() is rounded up to 1/HZ -- not good for RT either. And then, it still doesn't work for shared memory IPC.

So, what do we learn from this? At least one thing: better don't do real-time programming with ARMv5[2]. But more practically, how could a good emulation for atomic ops, solely based on atomic swap look like? Here are a few ideas:

  • Use an implementation inspired by libatomic_ops. Right now it's the best available. It's probably a good idea, though, to replace select() by a nanosleep(), since on recent kernels the latter doesn't round up to 1/HZ anymore, at least when you have high-resolution timers[3] Then, if you can live without signal handler safety, drop the signal mask changing.
  • If you use something based on libatomic_ops and want to use it for shared memory IPC, then you have the option to move the lock variables into shared memory too. Note however, that this allows evil applications to lock up your process by taking the locks and never giving them up. (Which however is always a problem if not all atomic operations you need are available in hardware) So if you do this, make sure that only trusted processes can attach to your memory segment.
  • Alternatively, spend some time and investigate if it is possible to use futexes to sleep on the lock variables. This is not trivial though, since futexes right now expect the availability of an atomic increment operation. But it might be possible to emulate this good enough with the swap operation. There's now even a FUTEX_LOCK_PI operation which would allow priority inheritance.
  • Alternatively, find a a way to allow user space disabling interrupts cheaply (requires kernel patching). Since enabling RT scheduling is a priviliged operation already (since you may easily lock up your machine with it), it might not be too problematic to extend the ability to disable interrupts to user space: it's just yet another way to lock up your machine.
  • For the libatomic_ops based algorithm: if you're lucky and defined a struct type for your atomic integer types, like the kernel does, or like I do in PulseAudio with pa_atomic_t, then you can stick the lock variable directly into your structure. This makes shared memory support transparent, and removes the one-lock-to-rule-them-all bottleneck completely. Of course, OTOH it increases the memory consumption a bit and increases cache pressure (though I'd assume that this is neglible).
  • For the libatomic_ops based algorithm: start sleeping for the time returned by clock_getres() (cache the result!). You cannot sleep shorter than that anyway.

Yepp, that's as good as it gets. Unfortunately I cannot serve you the optimal solution on a silver platter. I never actually did development for ARMv5, this blog story just sums up my thoughts on all the code I saw which emulates atomic ops on ARMv5. But maybe someone who actually cares about atomic operations on ARM finds this interesting and maybe invests some time to prepare patches for Qt, glib, glibc -- and PulseAudio.

Update: I added two more ideas to the list above.

Update 2: Andrew Haley just posted something like the optimal solution for the problem. It would be great if people would start using this.

Footnotes

[1] The Nokia 770 has an ARMv5 chip, N800 has ARMv6. The OpenMoko phone apparently uses ARMv5.

[2] And let's not even think about CPUs which don't even have an atomic swap!

[3] Which however you probably won't, given that they're only available on x86 on stable Linux kernels for now -- but still, it's cleaner.


Fedora Interview

Don't miss this interesting Fedora interview with yours truly, where I go a bit into detail what's coming next for PulseAudio.


The next step

A few minutes ago, I finally released PulseAudio 0.9.7. Changes are numerous, especially internally where the core is now threaded and mostly lock-free. Check the rough list on the milestone page, announcement email. As many of you know we are shipping a pre-release of 0.9.7 in Fedora 8, enabled by default. The final release offers quite a few additions over that prerelease. To show off a couple of nice features, here's a screencast, showing hotplug, simultaneous playback (what Apple calls aggregation) and zeroconfish network support:

screencast

Please excuse the typos. Yes, I still use XMMS, don't ask [1]. Yes, you need a bit of imagination to fully appreciate a screencast that lacks an audio track -- but demos audio software.

So, what's coming next? Earcandy, timer-based scheduling/"glitch-free" audio, scriptability through Lua, the todo list is huge. My unnoffical, scratchy, partly german TODO list for PulseAudio is available online.

As it appears all relevant distros will now move to PA by default. So, hopefully, PA is coming to a desktop near you pretty soon. -- Oh, you are one of those who still don't see the benefit of a desktop sound server? Then, please reread this too long email of mine, or maybe this ars.technica article.

OTOH, if you happen to like this release, then consider giving me a kudo on ohloh.net, my ego wants a golden 10. ;-)

logo

Footnotes:

[1] Those music players which categorize audio by ID3 tags just don't work for me, because most of my music files are very badly named. However, my directory structure is very well organized, but all those newer players don't care about directory structures as it seems. XMMS doesn't really either, but xmms . does the job from the terminal.

Flameeyes, thank's for hosting this clip.


Yummy Mango Yummy Lassi Yummy

Zeeshan, Mango Lassi tastes a lot different than a milk shake, believe me! Also, even if Mango Lassi was actually a western thing, do you know that just recently I was witness of Sjoerd[1] ordering a Vindaloo Pizza (or was it Korma?) at a Boston restaurant -- italian pizza with indian-style curry on top. Now, that's what some people might be calling "ignorant of indian cuisine". But actually I think that, like in music, mixing different styles, combining things from different origins is a good thing, and is what makes culture live.

Footnotes
[1] Who doesn't have a blog. Can you believe it?

An Icon for Mango Lassi

Thanks to Vinicius Depizzol's great work, Mango Lassi now has an icon:

Mango Lassi's new icon

Muito obrigado!

I'd also like to thank everyone else who sent in an icon suggestion. Thank you very much!


Fedora Planet Noise

Am I the only one who thinks that the usefulness of Fedora Planet as severely limited because of the low signal-to-noise ratio that is due to far too many non-english (i.e. german, french) language posts?


PulseAudio FUD

If you want to know more about PulseAudio's relation to GNOME (especially if you think PA is evil) then please read through this thread on desktop-devel, and especially this long email I posted as a reply a few minutes ago, where I try to debunk all the FUD that has been spread.


Mango Lassi

Yesterday, at the GNOME Summit in Boston I did a quick presentation of my new desktop input sharing hotness thingy, called "Mango Lassi" (Alternatively known as "GNOME Input Sharing"). Something like a Synergy done right, or an x2x that doesn't suck.

So, for those who couldn't attend, here's a screenshot, which doesn't really tell how great it is, and which might also be a bit confusing:

Mango Lassi Screenshot

And here's a list of random features already available:

  • Discover desktops to share mouse and keyboards with automatically via Avahi.
  • Fully peer-to-peer. All Mango Lassi instances are both client and server at the same time. Other hosts may enter or leave a running session at any time.
  • No need to open X11 up for the network
  • You have a 50% chance that for your setup you don't need any configuration at all. In the case of the other 50% you might need to swap the order of your screens manually in a simple dialog, because Mango Lassi didn't guess correctly which screen is left and which screen is right.
  • libnotify integration so that it tells you whenever a desktop joins or leaves your session.
  • Shows a nice OSD on your screen when your screen's input is currently being redirected to another screen.
  • Uses all those nifty GNOME APIs, like D-Bus-over-TCP, Avahi, libnotify, Gtk, ...
  • Supports both the X11 clipboard and the selection, supporting all content types, and not just simple text -- i.e. you can copy and paste image data between Gimp on your screens
  • Lot's of bugs and useless debug output, since this is basically the work of just three weekends.
  • Tray icon

And here's a list of missing features:

  • Drag'n'drop between screens. (I figured out how this could work, it's just a matter of actually implementing this, which is probably considerable work, because this would require some UI work, to show a download dialog and suchlike.)
  • Integration with Matthias' GTK+ window migration patches, which would allow dragging GTK+ windows between screens. The migration code for GTK+ basically works. It's just a matter of getting them merged in GTK+ proper, and hooking them up properly with Mango Lassi, which probably needs some kind of special support in Metacity so that we get notified when a window drag is happening and the pointer comes near the edges of the screens.
  • Encryption, authentication: Best solution would probably be that D-Bus would get native TLS support which we could then make use of.
  • Support for legacy operating systems like Windows/MacOS. I personally don't care much about this. However, Zeroconf implementations and D-Bus is available on Windows/MacOS too, and the exposed D-Bus interfaces are not too X11-centric, so this should be doable without too much work.
  • UI Love, actually hooking up the desktop order changing buttons, save and restore the order automatically.
  • MPX support (this would *rock*)

And finally, here's where you can get it:

git clone http://git.0pointer.de/repos/mango-lassi.git/

gitweb

Oh, and I don't take feature wishlist requests for this project. If you need a feature, implement it yourself. It's Free Software after all! I'd be happy if someone would be willing to work on Mango Lassi in a way that it can become a really good GNOME citizen and maybe even a proper part of it. But personally I'll probably only work on it to a level where it does all I need to work with my Laptop and my Desktop PC on my desk in a sane way. I am almost 100% busy with PulseAudio these days, and thus unable to give Mango Lassi the love it could use. So, stand up now, if you want to take over maintainership!

Hmm, Mango Lassi could use some good artwork, starting with an icon. I am quite sure that someone with better graphic skills then me could easily create a delicious icon perhaps featuring a glass of fresh, juicy Mango Lassi. I'd be very thankful for every icon submission!


Twenty-First

Hahaha.

Yours truely, Lennart (C list blogger).


This is a good move

I hope other distributions will follow.

© Lennart Poettering. Built using Pelican. Theme by Giulio Fidente on github. .