One thing became very clear to me during my trip to the Linux Audio Conference 2010 in Utrecht: even many pro audio folks are not sure what Jack does that PulseAudio doesn't do and what PulseAudio does that Jack doesn't do; why they are not competing, why you cannot replace one with the other, and why merging them (at least in the short term) might not make immediate sense. In other words, why millions of phones in this world run PulseAudio and not Jack, and why running a music studio on PulseAudio is crack.
To shed some light on this, and for future reference, I'll try to explain in the following text why there is this separation between the two systems and why this isn't necessarily bad. This is mostly a written-up version of (parts of) my slides from LAC, so if you attended that event you might find little new here, but I hope it is interesting nonetheless.
This is mostly written from my perspective as a hacker working on consumer audio stuff (more specifically, having written most of PulseAudio), but I am sure most pro audio folks would agree with the points I raise here and have more things to add. What I explain below is in no way comprehensive, just a list of a couple of points I think are the most important, as they touch the very core of both systems (and we ignore all the toppings here, e.g. sound effects, yadda, yadda).
First of all let's clear up the background of the sound server use cases here:
| Consumer Audio (i.e. PulseAudio) | Pro Audio (i.e. Jack) |
|---|---|
| Reducing power usage is a defining requirement; most systems are battery-powered (laptops, cell phones). | Power usage is usually not an issue; power comes out of the wall. |
| Must support latencies low enough for telephony and games. Also covers high-latency uses, such as movie and music playback (2s of latency is a good choice there). | Minimal latencies are a defining requirement. |
| System is highly dynamic, with applications starting/stopping and hardware added and removed all the time. | System is usually static in its configuration during operation. |
| User is usually not proficient in the technologies used.[1] | User is usually a professional and knows audio technology and computers well. |
| User is not necessarily the administrator of his machine and might have limited access. | User usually administers his own machines and has root privileges. |
| Audio is just one use of the system among many, and often just a background job. | Audio is the primary purpose of the system. |
| Hardware tends to have limited resources and be crappy and cheap. | Hardware is powerful, expensive and high quality. |
Of course, things are often not as black and white as this; there are uses that fall in the middle of these two areas.
From the table above a few conclusions may be drawn:
- A consumer sound system must support both low- and high-latency operation. Since low latencies mean high CPU load and hence high power consumption[2] (Heisenberg...), a system should always run with the highest latency possible, but the lowest latency necessary.
- Since the consumer system is highly dynamic in its use, latencies must be adjusted dynamically too. That makes a design such as PulseAudio's timer-based scheduling important.
- A pro audio system's primary optimization target is low latency. Low power usage, a dynamically changeable configuration (i.e. a short drop-out while you change your pipeline is acceptable) and user-friendliness may be sacrificed for that.
- For large buffer sizes a zero-copy design suggests itself: since data blocks are large, the cache pressure can be reduced considerably by zero-copy designs. Only for large buffers is the cost of passing pointers around considerably smaller than the cost of passing around the data itself (or, the other way round: if your audio data has the same size as your pointers, then passing pointers around is just useless extra work). A sketch of the idea follows after this list.
- On a resource-constrained system the ideal audio pipeline does not unnecessarily touch or convert the data passed along it. That makes it important to natively support the sample types and interleaving modes of the audio source or destination.
- A consumer system needs to simplify the view on the hardware and hide its complexity: hide redundant mixer elements, or merge them while making use of the hardware capabilities, and extend it in software so that the same functionality is provided on all hardware. A production system should not hide or simplify the hardware functionality.
- A consumer system should not drop out when a client misbehaves or the configuration changes (OTOH if it happens in exceptional cases, it is not disastrous either). A synchronous pipeline is hence not advisable; clients need to supply their data asynchronously.
- In a pro audio system a drop-out during reconfiguration is acceptable, but during operation it is not.
- In consumer audio we need to make compromises on resource usage that pro audio does not have to make. Example: a pro audio system can call mlockall() with few limitations, since the hardware is powerful (i.e. a lot of RAM is available) and audio is the primary purpose. A consumer audio system cannot do that, because that call practically makes memory unavailable to other applications, increasing their swap pressure. And since audio is not the primary purpose of the system and resources are limited, we hence need to find a different way.
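To illustrate the zero-copy point from the list, here is a minimal sketch of a reference-counted audio block in C. This is a deliberately simplified, hypothetical structure, not PulseAudio's actual memblock code: producers and consumers hand the small struct pointer around (possibly via shared memory), and the payload itself is never copied.

```c
#include <stdlib.h>

/* Hypothetical reference-counted audio block, in the spirit of (but much
   simpler than) PulseAudio's memblocks. */
typedef struct audio_block {
    unsigned refcount;
    size_t length;          /* payload size in bytes */
    void *data;             /* the actual samples */
} audio_block;

static audio_block *block_new(size_t length) {
    audio_block *b = malloc(sizeof(*b));
    b->refcount = 1;
    b->length = length;
    b->data = malloc(length);
    return b;
}

static audio_block *block_ref(audio_block *b) {
    b->refcount++;          /* a real implementation would use atomics */
    return b;
}

static void block_unref(audio_block *b) {
    if (--b->refcount == 0) {
        free(b->data);
        free(b);
    }
}
```

Each additional consumer calls block_ref() instead of copying the samples; the payload is freed exactly once, when the last user calls block_unref().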
Jack has been designed for low latencies, where synchronous operation is advisable, meaning that a misbehaving client can stall the entire pipeline. Changes to the pipeline or to latencies usually result in drop-outs in one way or another, since the entire pipeline is reconfigured, from the hardware to the various clients. Jack only supports FLOAT32 samples and non-interleaved audio channels (and that is a good thing). Jack does not employ reference-counted zero-copy buffers. It does not try to simplify the hardware mixer in any way.
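The synchronous model is visible directly in Jack's client API: every client exports a process callback that the server invokes once per cycle, and all port buffers are non-interleaved FLOAT32. Here is a minimal pass-through client (the client and port names are made up for the example):

```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <jack/jack.h>

static jack_port_t *in_port, *out_port;

/* Jack calls this synchronously from its realtime thread, once per cycle.
   If any client's callback is late, the whole graph is late; that is
   exactly the "a misbehaving client can stall the pipeline" property. */
static int process(jack_nframes_t nframes, void *arg) {
    (void) arg;
    /* Port buffers are always non-interleaved 32-bit float samples. */
    jack_default_audio_sample_t *in  = jack_port_get_buffer(in_port, nframes);
    jack_default_audio_sample_t *out = jack_port_get_buffer(out_port, nframes);
    memcpy(out, in, sizeof(jack_default_audio_sample_t) * nframes);
    return 0;
}

int main(void) {
    jack_client_t *client = jack_client_open("passthrough", JackNullOption, NULL);
    if (!client)
        return EXIT_FAILURE;
    jack_set_process_callback(client, process, NULL);
    in_port  = jack_port_register(client, "in",  JACK_DEFAULT_AUDIO_TYPE, JackPortIsInput,  0);
    out_port = jack_port_register(client, "out", JACK_DEFAULT_AUDIO_TYPE, JackPortIsOutput, 0);
    jack_activate(client);
    for (;;)
        sleep(1); /* all the audio work happens in process() */
    return 0;
}
```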
PulseAudio OTOH can deal with varying latencies, dynamically adjusting to the lowest latency any of the connected clients needs. Client communication is fully asynchronous; a single client cannot stall the entire pipeline. PulseAudio supports a variety of PCM formats and channel setups. PulseAudio's design is heavily based on reference-counted zero-copy buffers that are passed around, even between processes, instead of the audio data itself. PulseAudio tries to simplify the hardware mixer as suggested above.
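For contrast, here is a rough sketch of a minimal asynchronous PulseAudio playback client (error handling omitted, names made up). The server never blocks on the client; it just invokes the write callback whenever it wants more data:

```c
#include <string.h>
#include <pulse/pulseaudio.h>

/* Called whenever the server wants more data. A client that is slow here
   only underruns its own stream; it cannot stall the whole pipeline. */
static void stream_write_cb(pa_stream *s, size_t nbytes, void *userdata) {
    void *buf;
    (void) userdata;
    /* pa_stream_begin_write() can hand out a block backed by server-shared
       memory, i.e. the zero-copy path mentioned above. */
    if (pa_stream_begin_write(s, &buf, &nbytes) < 0)
        return;
    memset(buf, 0, nbytes); /* silence as placeholder payload */
    pa_stream_write(s, buf, nbytes, NULL, 0, PA_SEEK_RELATIVE);
}

static void context_state_cb(pa_context *c, void *userdata) {
    (void) userdata;
    if (pa_context_get_state(c) != PA_CONTEXT_READY)
        return;
    /* Unlike with Jack, arbitrary sample specs are fine; the server converts. */
    static const pa_sample_spec ss = {
        .format = PA_SAMPLE_S16LE, .rate = 44100, .channels = 2
    };
    pa_stream *stream = pa_stream_new(c, "playback", &ss, NULL);
    pa_stream_set_write_callback(stream, stream_write_cb, NULL);
    /* PA_STREAM_ADJUST_LATENCY lets the server pick and rescale the stream
       latency dynamically, as described above. */
    pa_stream_connect_playback(stream, NULL, NULL, PA_STREAM_ADJUST_LATENCY,
                               NULL, NULL);
}

int main(void) {
    pa_mainloop *m = pa_mainloop_new();
    pa_context *c = pa_context_new(pa_mainloop_get_api(m), "async-example");
    pa_context_set_state_callback(c, context_state_cb, NULL);
    pa_context_connect(c, NULL, PA_CONTEXT_NOFLAGS, NULL);
    pa_mainloop_run(m, NULL);
    return 0;
}
```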
Now, the two paragraphs above hopefully show how Jack is more suitable for the pro audio use case and PulseAudio more suitable for the consumer audio use case. One question suggests itself though: can we marry the two approaches? Yes, we probably can; MacOS X has a unified approach for both uses. However, it is not clear this would be a good idea. First of all, a system with the complexities introduced by sample format/channel mapping conversion, dynamically changing latencies and pipelines, and asynchronous behaviour would certainly be much less attractive to pro audio developers. In fact, that Jack limits itself to synchronous, FLOAT32-only, non-interleaved-only audio streams is one of the big features of its design. Marrying the two approaches would dilute that. A merged solution would probably not have a good standing in the community.
But it goes even further than this: what would the use case for this be? After all, most of the time you don't want your event sounds, your YouTube, your VoIP and your Rhythmbox mixed into the new record you are producing. A clear separation between the two worlds might hence even be handy?
Also, let's not forget that we lack the manpower to even create such an audio chimera.
So, where to from here? Well, I think we should put the focus on cooperation instead of amalgamation: teach PulseAudio to get out of the way as soon as Jack needs access to the device, and optionally make PulseAudio a normal Jack client while both are running. That way, the user has the option to use the PulseAudio-supplied streams, but normally does not see them in his pipeline. The first part of this has already been implemented: Jack2 and PulseAudio do not fight over the audio device; a friendly handover takes place. Jack takes precedence and PulseAudio takes the back seat. The second part is still missing: you still have to manually hook up PulseAudio to Jack if you are interested in its streams. Once both parts are implemented, starting Jack basically has the effect of replacing PulseAudio's core with the Jack core, while still providing full compatibility with PulseAudio clients.
And that I guess is all I have to say on the entire Jack and PulseAudio story.
Oh, one more thing, while we are at clearing things up: some news sites claim that PulseAudio's not necessarily stellar reputation in some parts of the community comes from Ubuntu and other distributions having integrated it too early. Well, let me stress here explicitly that while they might have made a mistake or two in packaging PulseAudio, and I publicly pointed that out (and probably not in too friendly a way), I do believe that the point in time at which they adopted it was right. Why? Basically, it's a chicken-and-egg problem: if it is not used in the distributions it is not tested, and there is no pressure to fix what then turns out to be broken, in PulseAudio itself and in the layers both above and below it. Don't forget that pushing a new layer into an existing stack will break a lot of assumptions that the neighboring layers made; doing this must break things. Most Free Software projects could probably use more developers, and that is particularly true for audio on Linux. And given that this is how it is, pushing the feature in at that point in time was the right thing to do. Or in other words: if the features are right, and things do work correctly as far as the limited test base the developers control shows, then one day you need to push into the distributions, even if this might break setups and software that had not previously been tested, unless you want to stay stuck in development indefinitely. So yes, Ubuntu, I think you did well with adopting PulseAudio when you did.
Footnotes
[1] Side note: yes, consumers tend not to know what dB is, and expect volume settings in "percentages", a mostly meaningless unit in audio. This even spills over into projects like VLC or Amarok, which expose linear volume controls (which is a really bad idea).
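To make that concrete, here is a tiny sketch of why linear sliders are a bad fit. The cubic taper shown is one common approximation of perceived loudness (similar in spirit to what PulseAudio's software volumes use); the exact curve and function name are my choice for the example:

```c
#include <math.h>
#include <stdio.h>

/* Map a slider position (0.0 .. 1.0) to a gain factor via a cubic taper.
   A plain linear mapping crams most of the audible dB range into the very
   bottom of the slider. Compile with -lm. */
static double slider_to_gain(double slider) {
    return slider * slider * slider;
}

int main(void) {
    for (double s = 0.25; s <= 1.0001; s += 0.25) {
        double gain = slider_to_gain(s);
        printf("slider %3.0f%% -> gain %.4f (%6.1f dB)\n",
               s * 100, gain, 20 * log10(gain));
    }
    return 0;
}
```

With this taper, 50% maps to roughly -18 dB rather than the -6 dB a linear control would give, which matches the intuition that "half volume" should sound about half as loud.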
[2] In case you are wondering why that is the case: if the latency is low the buffers must be sized smaller. And if the buffers are sized smaller then the CPU will have to wake up more often to fill them up for the same playback time. This drives up the CPU load since less actual payload can be processed for the amount of housekeeping that the CPU has to do during each buffer iteration. Also, frequent wake-ups make it impossible for the CPU to go to deeper sleep states. Sleep states are the primary way for modern CPUs to save power.
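As a back-of-the-envelope illustration of that footnote (the double-buffering assumption and the concrete latency figures are made up for the example):

```c
#include <stdio.h>

int main(void) {
    /* Target latencies in seconds: relaxed music playback, something
       telephony-like, and a pro-audio-style setup. */
    const double latencies[] = { 2.0, 0.020, 0.005 };

    for (int i = 0; i < 3; i++) {
        /* With double buffering the CPU must refill one half of the buffer
           while the device plays the other half. */
        double refill_period = latencies[i] / 2;
        printf("latency %6.3f s -> ~%5.0f wakeups/s\n",
               latencies[i], 1.0 / refill_period);
    }
    return 0;
}
```

This prints roughly 1 wakeup/s for the 2 s case but 400 wakeups/s for the 5 ms case, which is why low-latency operation keeps the CPU out of its deeper sleep states.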