File Descriptor Limits

TL;DR: don't use select() + bump the RLIMIT_NOFILE soft limit to the hard limit in your modern programs.

The primary way to reference, allocate and pin runtime OS resources on Linux today are file descriptors ("fds"). Originally they were used to reference open files and directories and maybe a bit more, but today they may be used to reference almost any kind of runtime resource in Linux userspace, including open devices, memory (memfd_create(2)), timers (timefd_create(2)) and even processes (with the new pidfd_open(2) system call). In a way, the philosophically skewed UNIX concept of "everything is a file" through the proliferation of fds actually acquires a bit of sensible meaning: "everything has a file descriptor" is certainly a much better motto to adopt.

Because of this proliferation of fds, non-trivial modern programs tend to have to deal with substantially more fds at the same time than they traditionally did. Today, you'll often encounter real-life programs that have a few thousand fds open at the same time.

Like on most runtime resources on Linux limits are enforced on file descriptors: once you hit the resource limit configured via RLIMIT_NOFILE any attempt to allocate more is refused with the EMFILE error — until you close a couple of those you already have open.

Because fds weren't such a universal concept traditionally, the limit of RLIMIT_NOFILE used to be quite low. Specifically, when the Linux kernel first invokes userspace it still sets RLIMIT_NOFILE to a low value of 1024 (soft) and 4096 (hard). (Quick explanation: the soft limit is what matters and causes the EMFILE issues, the hard limit is a secondary limit that processes may bump their soft limit to — if they like — without requiring further privileges to do so. Bumping the limit further would require privileges however.). A limit of 1024 fds made fds a scarce resource: APIs tried to be careful with using fds, since you simply couldn't have that many of them at the same time. This resulted in some questionable coding decisions and concepts at various places: often secondary descriptors that are very similar to fds — but were not actually fds — were introduced (e.g. inotify watch descriptors), simply to avoid for them the low limits enforced on true fds. Or code tried to aggressively close fds when not absolutely needing them (e.g. ftw()/nftw()), losing the nice + stable "pinning" effect of open fds.

Worse though is that certain OS level APIs were designed having only the low limits in mind. The worst offender being the BSD/POSIX select(2) system call: it only works with fds in the numeric range of 0…1023 (aka FD_SETSIZE-1). If you have an fd outside of this range, tough luck: select() won't work, and only if you are lucky you'll detect that and can handle it somehow.

Linux fds are exposed as simple integers, and for most calls it is guaranteed that the lowest unused integer is allocated for new fds. Thus, as long as the RLIMIT_NOFILE soft limit is set to 1024 everything remains compatible with select(): the resulting fds will also be below 1024. Yay. If we'd bump the soft limit above this threshold though and at some point in time an fd higher than the threshold is allocated, this fd would not be compatible with select() anymore.

Because of that, indiscriminately increasing the soft RLIMIT_NOFILE resource limit today for every userspace process is problematic: as long as there's userspace code still using select() doing so will risk triggering hard-to-handle, hard-to-debug errors all over the place.

However, given the nowadays ubiquitous use of fds for all kinds of resources (did you know, an eBPF program is an fd? and a cgroup too? and attaching an eBPF program to cgroup is another fd? …), we'd really like to raise the limit anyway. 🤔

So before we continue thinking about this problem, let's make the problem more complex (…uh, I mean… "more exciting") first. Having just one hard and one soft per-process limit on fds is boring. Let's add more limits on fds to the mix. Specifically on Linux there are two system-wide sysctls: fs.nr_open and fs.file-max. (Don't ask me why one uses a dash and the other an underscore, or why there are two of them...) On today's kernels they kinda lost their relevance. They had some originally, because fds weren't accounted by any other counter. But today, the kernel tracks fds mostly as small pieces of memory allocated on userspace requests — because that's ultimately what they are —, and thus charges them to the memory accounting done anyway.

So now, we have four limits (actually: five if you count the memory accounting) on the same kind of resource, and all of them make a resource artificially scarce that we don't want to be scarce. So what to do?

Back in systemd v240 already (i.e. 2019) we decided to do something about it. Specifically:

Automatically at boot we'll now bump the two sysctls to their maximum, making them effectively ineffective. This one was easy. We got rid of two pretty much redundant knobs. Nice!
The RLIMIT_NOFILE hard limit is bumped substantially to 512K. Yay, cheap fds! You may have an fd, and you, and you as well, everyone may have an fd!
But … we left the soft RLIMIT_NOFILE limit at 1024. We weren't quite ready to break all programs still using select() in 2019 yet. But it's not as bad as it might sound I think: given the hard limit is bumped every program can easily opt-in to a larger number of fds, by setting the soft limit to the hard limit early on — without requiring privileges.

So effectively, with this approach fds should be much less scarce (at least for programs that opt into that), and the limits should be much easier to configure, since there are only two knobs now one really needs to care about:

Configure the RLIMIT_NOFILE hard limit to the maximum number of fds you actually want to allow a process.
In the program code then either bump the soft to the hard limit, or not. If you do, you basically declare "I understood the problem, I promise to not use select(), drown me fds please!". If you don't then effectively everything remains as it always was.

Apparently this approach worked, since the negative feedback on change was even scarcer than fds traditionally were (ha, fun!). We got reports from pretty much only two projects that were bitten by the change (one being a JVM implementation): they already bumped their soft limit automatically to their hard limit during program initialization, and then allocated an array with one entry per possible fd. With the new high limit this resulted in one massive allocation that traditionally was just a few K, and this caused memory checks to be hit.

Anyway, here's the take away of this blog story:

Don't use select() anymore in 2021. Use poll(), epoll, iouring, …, but for heaven's sake don't use select(). It might have been all the rage in the 1990s but it doesn't scale and is simply not designed for today's programs. I wished the man page of select() would make clearer how icky it is and that there are plenty of more preferably APIs.
If you hack on a program that potentially uses a lot of fds, add some simple code somewhere to its start-up that bumps the RLIMIT_NOFILE soft limit to the hard limit. But if you do this, you have to make sure your code (and any code that you link to from it) refrains from using select(). (Note: there's at least one glibc NSS plugin using select() internally. Given that NSS modules can end up being loaded into pretty much any process such modules should probably be considered just buggy.)
If said program you hack on forks off foreign programs, make sure to reset the RLIMIT_NOFILE soft limit back to 1024 for them. Just because your program might be fine with fds >= 1024 it doesn't mean that those foreign programs might. And unfortunately RLIMIT_NOFILE is inherited down the process tree unless explicitly set.

And that's all I have for today. I hope this was enlightening.

Category: projects