Pid Eins

Measuring Lock Contention

When naively profiling multi-threaded applications the time spent waiting for mutexes is not necessarily visible in the generated output. However lock contention can have a big impact on the runtime behaviour of applications. On Linux valgrind's drd can be used to track down mutex contention. Unfortunately running applications under valgrind/drd slows them down massively, often having the effect of itself generating many of the contentions one is trying to track down. Also due to its slowness it is very time consuming work.

To improve the situation if have now written a mutex profiler called mutrace. In contrast to valgrind/drd it does not virtualize the CPU instruction set, making it a lot faster. In fact, the hooks mutrace relies on to profile mutex operations should only minimally influence application runtime. mutrace is not useful for finding synchronizations bugs, it is solely useful for profiling locks.

Now, enough of this introductionary blabla. Let's have a look on the data mutrace can generate for you. As an example we'll look at gedit as a bit of a prototypical Gnome application. Gtk+ and the other Gnome libraries are not really known for their heavy use of multi-threading, and the APIs are generally not thread-safe (for a good reason). However, internally subsytems such as gio do use threading quite extensibly. And as it turns out there are a few hotspots that can be discovered with mutrace:

$ LD_PRELOAD=/home/lennart/projects/mutrace/libmutrace.so gedit
mutrace: 0.1 sucessfully initialized.

gedit is now running and its mutex use is being profiled. For this example I have now opened a file with it, typed a few letters and then quit the program again without saving. As soon as gedit exits mutrace will print the profiling data it gathered to stderr. The full output you can see here. The most interesting part is at the end of the generated output, a breakdown of the most contended mutexes:

mutrace: 10 most contended mutexes:

 Mutex #   Locked  Changed    Cont. tot.Time[ms] avg.Time[ms] max.Time[ms]       Type
      35   368268      407      275      120,822        0,000        0,894     normal
       5   234645      100       21       86,855        0,000        0,494     normal
      26   177324       47        4       98,610        0,001        0,150     normal
      19    55758       53        2       23,931        0,000        0,092     normal
      53      106       73        1        0,769        0,007        0,160     normal
      25    15156       70        1        6,633        0,000        0,019     normal
       4      973       10        1        4,376        0,004        0,174     normal
      75       68       62        0        0,038        0,001        0,004     normal
       9     1663       52        0        1,068        0,001        0,412     normal
       3   136553       41        0       61,408        0,000        0,281     normal
     ...      ...      ...      ...          ...          ...          ...        ...

mutrace: Total runtime 9678,142 ms.

(Sorry, LC_NUMERIC was set to de_DE.UTF-8, so if you can't make sense of all the commas, think s/,/./g!)

For each mutex a line is printed. The 'Locked' column tells how often the mutex was locked during the entire runtime of about 10s. The 'Changed' column tells us how often the owning thread of the mutex changed. The 'Cont.' column tells us how often the lock was already taken when we tried to take it and we had to wait. The fifth column tell us for how long during the entire runtime the lock was locked, the sixth tells us the average lock time, and the seventh column tells us the longest time the lock was held. Finally, the last column tells us what kind of mutex this is (recursive, normal or otherwise).

The most contended lock in the example above is #35. 275 times during the runtime a thread had to wait until another thread released this mutex. All in all more then 120ms of the entire runtime (about 10s) were spent with this lock taken!

In the full output we can now look up which mutex #35 actually is:

Mutex #35 (0x0x7f48c7057d28) first referenced by:
	/home/lennart/projects/mutrace/libmutrace.so(pthread_mutex_lock+0x70) [0x7f48c97dc900]
	/lib64/libglib-2.0.so.0(g_static_rw_lock_writer_lock+0x6a) [0x7f48c674a03a]
	/lib64/libgobject-2.0.so.0(g_type_init_with_debug_flags+0x4b) [0x7f48c6e38ddb]
	/usr/lib64/libgdk-x11-2.0.so.0(gdk_pre_parse_libgtk_only+0x8c) [0x7f48c853171c]
	/usr/lib64/libgtk-x11-2.0.so.0(+0x14b31f) [0x7f48c891831f]
	/lib64/libglib-2.0.so.0(g_option_context_parse+0x90) [0x7f48c67308e0]
	/usr/lib64/libgtk-x11-2.0.so.0(gtk_parse_args+0xa1) [0x7f48c8918021]
	/usr/lib64/libgtk-x11-2.0.so.0(gtk_init_check+0x9) [0x7f48c8918079]
	/usr/lib64/libgtk-x11-2.0.so.0(gtk_init+0x9) [0x7f48c89180a9]
	/usr/bin/gedit(main+0x166) [0x427fc6]
	/lib64/libc.so.6(__libc_start_main+0xfd) [0x7f48c5b42b4d]
	/usr/bin/gedit() [0x4276c9]

As it appears in this Gtk+ program the rwlock type_rw_lock (defined in glib's gobject/gtype.c) is a hotspot. GLib's rwlocks are implemented on top of mutexes, so an obvious attempt in improving this could be to actually make them use the operating system's rwlock primitives.

If a mutex is used often but only ever by the same thread it cannot starve other threads. The 'Changed.' column lists how often a specific mutex changed the owning thread. If the number is high this means the risk of contention is also high. The 'Cont.' column tells you about contention that actually took place.

Due to the way mutrace works we cannot profile mutexes that are used internally in glibc, such as those used for synchronizing stdio and suchlike.

mutrace is implemented entirely in userspace. It uses all kinds of exotic GCC, glibc and kernel features, so you might have a hard time compiling and running it on anything but a very recent Linux distribution. I have tested it on Rawhide but it should work on slightly older distributions, too.

Make sure to build your application with -rdynamic to make the backtraces mutrace generates useful.

As of now, mutrace only profiles mutexes. Adding support for rwlocks should be easy to add though. Patches welcome.

The output mutrace generates can be influenced by various MUTRACE_xxx environment variables. See the sources for more information.

And now, please take mutrace and profile and speed up your application!

You may find the sources in my git repository.

pthread_key_create() is dangerous

If you use pthread_key_create() with a non-NULL destructor parameter (or an equivalent TLS construct) in a library/shared object then you MUST link your library wth -z nodelete (or an equivalent construct).

If you don't, then you'll have a lot of fun (like I just had) debugging segfaults in the TLS destruction logic where functions are called that might not even exist anymore in memory.

Now don't tell me I hadn't told you.

(Oh, and I hope I don't need to mention that all GObject-based libraries should link with -z nodelete anyway, for making sure the type system doesn't break.)

The Highest Man in Spain

Ever wanted to know what's the view like being the highest person in all of Spain? -- No? Hmm, can't help you then. -- Otherwise:

That's on the summit of Pico del Teide at 3718m, on Tenerife island. Unless you leave solid ground this is as high as you can get in Spain. 163m lower it's a bit more obvious that the Teide is a volcano:

And coming down to the surrounding caldera it's even more obvious:

On a ridge next to the caldera you find the Teide Observatory:

The caldera is covered in old lava flows:

Vulcanism has created various interesting rock formations in the caldera:

Tenerife is not just about the Teide and its dusty caldera. In the north of the island you find the Anaga mountain range:

Neighboring Gran Canaria was where our little trip started and ended, right after the Gran Canaria Desktop Summit. Gran Canaria has no Teide but a very impressive landscape nonetheless:

That's the view from the Roque Nublo, the island's most famous landmark. The rock itself is visible here (on the left):

Oh Nine Sixteen

#nocomments y

As a followup to Oh Nine Fifteen here's a little overview of the changes coming with PulseAudio 0.9.16 which will be part of Fedora 12 (already in Rawhide; I think Ubuntu Karmic (?) will have it too).

A New Mixer Logic

We now try to control more than just a single ALSA mixer element for volume control. This increases the hardware volume range and granularity exposed and should also help minimizing problems by incomplete or incorrect default mixer initialization on the lower levels.

This also adds support for allowing selection of input/output ports for sound cards. This is used to expose changing between Mic vs. Line-In for input source selection and Headphones vs. Speaker for output selection (of course the list of available port is strictly dependant on what you hardware supports). The list of available ports is deliberately kept minimal.

Thanks to Bastien the newest GNOME Volume Control now exposes profile/port switching quite nicely, which he blogged about. This screenshot shows how the port (here called 'Connector') can be selected in the new dialog.

The mixer rework also allows us to handle semi-pro/pro sound cards a bit more flexibly. For example, which profiles/ports are exposed in PulseAudio or how specific mixer elements are handled can now be controlled by editing .ini file like configuration files in /usr/share/pulseaudio/alsa-mixer/. Read this mail for more information about this.

UPnP MediaServer Support

PulseAudio now integrates with Zeeshan's fabulous Rygel UPnP/DLNA MediaServer. If enabled Rygel will automatically expose all local audio devices which are managed by PulseAudio as UPnP/DLNA MediaServer items which your UPnP/DLNA MediaRenderers can now tune into. (Meaning: you can now stream audio from your PC directly to your UPnP DMP (Digital Media Player) device, such as the PS3.) Communication between Rygel and PulseAudio follows our little Media Server Spec on the GNOME Wiki. This nicely complements the RAOP (Apple Airport) support we introduced in PulseAudio 0.9.15. In one of the next versions of PulseAudio/Rygel we hope to add support for PulseAudio becoming a MediaRenderer as well. This will then not only allow you to stream from your PC to your DMP device, but also allows PulseAudio to act as "networked speaker", which can be used by any UPnP/AV/DLNA control point, such as Windows' Media Player.

Hotplug Support Improved

If you select a particular device as the default for a specific application or class of streams, then when unplugging the device PulseAudio moves the stream automatically to another audio device if one exists. New in PulseAudio 0.9.16 is that if you replug the audio device the stream will instantly be moved back, requiring no further user intervention.

Also, PulseAudio now includes some implicit rules for doing the 'right thing' when finding an audio device for an application. For example, unless configured otherwise it will now route telephony applications automatically to Bluetooth headsets if one is connected, in favour of the internal sound card of the computer.

Surround Sound Support for Event Sounds

This is more a new feature of libcanberra than of PulseAudio, but nonetheless: we now support surround for events sounds. This allows us to play full 5.1 login sounds for example, in best THX cinema fashion. We'd love to ship a 5.1 sound for login by default in sound-theme-freedesktop. We'd be very thankful if you would be willing to contribute a sound here, or two! A sound a bit less bombastic than the famous cinema THX effect would probably be a good idea though.

And then there's of course the usual batch of fixes and small improvements. A substantial number of non-user visible changes have been made as well. For example, as HAL is now obsolete PulseAudio now moved to udev for its device discovery needs. We replaced our gdbm support by support for tdb. Also, we stripped all security senstive code from PulseAudio, and ported it to use RealtimeKit instead. For the upcoming distributions that means that PulseAudio will run as real-time process by default, improving drop-out safety.

And for some extra PA eye-candy, have a look on Impulse!

World Domination Accomplished

#nocomments y

I hereby officially declare that I have reached my goal of world domination. Emacs 23 (apparently due today) ships with Avahi support out of the box. Obviously, one of the most natural combinations of software thinkable.

After Emacs, there's not much else I could win, or is there?

Yet Another Kit

A while back I was celebrating that arrival of secure realtime scheduling for the desktop. As it appears this was a bit premature then, since (mis-)using cgroups for this turned out to be more problematic and messy than I anticipated.

As a followup I'd now like to point you to this announcement I posted to LAD yesterday, introducing RealtimeKit which should fix the problem for good. It has now entered Rawhide becoming part of the default install (by means of being a dependency of PulseAudio), and I assume the other distros are going to adopt it pretty soon, too.

Read the full announcement.

Linux Plumbers Conference 2009 CFP Ending Soon!

The Call for Papers for the Linux Plumbers Conference (LPC) in September in Portland, Oregon is ending soon, on June 15th 2009. It's a conference about the core infrastructure of Linux systems: the part of the system where userspace and the kernel interface. It's the first conference where the focus is specifically on getting together the kernel people who work on the userspace interfaces and the userspace people who have to deal with kernel interfaces. It's supposed to be a place where all the people doing infrastructure work sit down and talk, so that each other understands better what the requirements and needs of the other are, and where we can work towards fixing the major problems we currently have with our lower-level APIs.

Last year's conference was hugely successful. If you want to read up what happened then, LWN has good coverage.

Like last year, I will be running the Audio conference track of LPC. Audio infrastructure on Linux is still heavily fragmented. Pro, desktop and embedded worlds are very seperate. While we have quite good driver support the user experience is far from perfect, mostly because our infrastructure is so balkanized. Join us at the LPC and help to fix this! If you are doing audio infrastructure work on Linux, make sure to attend and submit a paper!

See you in Portland!

Living in Berlin? You are a GNOMEr?

If you live in Berlin and are a GNOMEr of some kind then please feel invited top drop by tomorrow (Fri 29) at 4 pm at the Prater Biergarten (Weather permitting outside, otherwise inside). We'll have a little GNOME get-together. For now, we know that at least the Openismus Berlin folks will be there, as will I and presumably one special guest from Finland, and whoever else wants to attend.

Hope to see you tomorrow!

The Sound of Fedora 11

I learned so much when I read this interview. And so will you!

All About Fragments

In my on-going series Writing Better Audio Applications for Linux, here's another installment: a little explanation how fragments/periods and buffer sizes should be chosen when doing audio playback with traditional audio APIs such as ALSA and OSS. This originates from some emails I exchanged with the Ekiga folks. In the last weeks I kept copying this explanation to various other folks. I guess it would make sense to post this on my blog here too to reach a wider audience. So here it is, mostly unedited:

Yes. You shouldn't misuse the fragments logic of sound devices. It's
like this:

   The latency is defined by the buffer size.
   The wakeup interval is defined by the fragment size.

The buffer fill level will oscillate between 'full buffer' and 'full
buffer minus 1x fragment size minus OS scheduling latency'. Setting
smaller fragment sizes will increase the CPU load and decrease battery
time since you force the CPU to wake up more often. OTOH it increases
drop out safety, since you fill up playback buffer earlier. Choosing
the fragment size is hence something which you should do balancing out
your needs between power consumption and drop-out safety. With modern
processors and a good OS scheduler like the Linux one setting the
fragment size to anything other than half the buffer size does not
make much sense.

Your [Ekiga's ptlib driver that is] ALSA output is configured
to set the the fragment size to the size of your codec audio
frames. And that's a bad idea. Because the codec frame size has not
been chosen based on power consumption or drop-out safety
reasoning. It has been chosen by the codec designers based on
different reasoning, such as latency.

You probably configured your backend this ways because the ALSA
library docs say that it is recommended to write to the sound card in
multiples of the fragment size. However deducing from this that you
hence should configure the fragment size to the codec frame size is
wrong!

The best way to implement playback these days for ALSA is to write as
much as snd_pcm_avail() tells you to each time you wake up due to
POLLOUT on the sound card. If that is not a multiple of your codec
frame size then you need to buffer the the remainder of the decoded
data yourself in system memory.

The ALSA fragment size you should normally set as large as possible
given your latency constraints but that you have at least two
fragments in your buffer size.

I hope this explains a bit how frag_size/buffer_size should be
chosen. If you have questions, just ask.

(Oh, ALSA uses the term 'period' for what I call 'fragment'
above. It's synonymous)