Posted on Mo 09 Oktober 2017

IP Accounting and Access Lists with systemd

TL;DR: systemd now can do per-service IP traffic accounting, as well as access control for IP address ranges.

Last Friday we released systemd 235. I already blogged about its Dynamic User feature in detail, but there's one more piece of new functionality that I think deserves special attention: IP accounting and access control.

Before v235 systemd already provided per-unit resource management hooks for a number of different kinds of resources: consumed CPU time, disk I/O, memory usage and number of tasks. With v235 another kind of resource can be controlled per-unit with systemd: network traffic (specifically IP).

Three new unit file settings have been added in this context:

  1. IPAccounting= is a boolean setting. If enabled for a unit, all IP traffic sent and received by processes associated with it is counted both in terms of bytes and of packets.

  2. IPAddressDeny= takes an IP address prefix (that means: an IP address with a network mask). All traffic from and to this address will be prohibited for processes of the service.

  3. IPAddressAllow= is the matching positive counterpart to IPAddressDeny=. All traffic matching this IP address/network mask combination will be allowed, even if otherwise listed in IPAddressDeny=.

The three options are thin wrappers around kernel functionality introduced with Linux 4.11: the control group eBPF hooks. The actual work is done by the kernel, systemd just provides a number of new settings to configure this facet of it. Note that cgroup/eBPF is unrelated to classic Linux firewalling, i.e. NetFilter/iptables. It's up to you whether you use one or the other, or both in combination (or of course neither).

IP Accounting

Let's have a closer look at the IP accounting logic mentioned above. Let's write a simple unit /etc/systemd/system/ip-accounting-test.service:

[Service]
ExecStart=/usr/bin/ping 8.8.8.8
IPAccounting=yes

This simple unit invokes the ping(8) command to send a series of ICMP/IP ping packets to the IP address 8.8.8.8 (which is the Google DNS server IP; we use it for testing here, since it's easy to remember, reachable everywhere and known to react to ICMP pings; any other IP address responding to pings would be fine to use, too). The IPAccounting= option is used to turn on IP accounting for the unit.

Let's start this service after writing the file. Let's then have a look at the status output of systemctl:

# systemctl daemon-reload
# systemctl start ip-accounting-test
# systemctl status ip-accounting-test ip-accounting-test.service
   Loaded: loaded (/etc/systemd/system/ip-accounting-test.service; static; vendor preset: disabled)
   Active: active (running) since Mon 2017-10-09 18:05:47 CEST; 1s ago
 Main PID: 32152 (ping)
       IP: 168B in, 168B out
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/ip-accounting-test.service
           └─32152 /usr/bin/ping 8.8.8.8

Okt 09 18:05:47 sigma systemd[1]: Started ip-accounting-test.service.
Okt 09 18:05:47 sigma ping[32152]: PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
Okt 09 18:05:47 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=1 ttl=59 time=29.2 ms
Okt 09 18:05:48 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=2 ttl=59 time=28.0 ms

This shows the ping command running — it's currently at its second ping cycle as we can see in the logs at the end of the output. More interesting however is the IP: line further up showing the current IP byte counters. It currently shows 168 bytes have been received, and 168 bytes have been sent. That the two counters are at the same value is not surprising: ICMP ping requests and responses are supposed to have the same size. Note that this line is shown only if IPAccounting= is turned on for the service, as only then this data is collected.

Let's wait a bit, and invoke systemctl status again:

# systemctl status ip-accounting-test ip-accounting-test.service
   Loaded: loaded (/etc/systemd/system/ip-accounting-test.service; static; vendor preset: disabled)
   Active: active (running) since Mon 2017-10-09 18:05:47 CEST; 4min 28s ago
 Main PID: 32152 (ping)
       IP: 22.2K in, 22.2K out
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/ip-accounting-test.service
           └─32152 /usr/bin/ping 8.8.8.8

Okt 09 18:10:07 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=260 ttl=59 time=27.7 ms
Okt 09 18:10:08 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=261 ttl=59 time=28.0 ms
Okt 09 18:10:09 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=262 ttl=59 time=33.8 ms
Okt 09 18:10:10 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=263 ttl=59 time=48.9 ms
Okt 09 18:10:11 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=264 ttl=59 time=27.2 ms
Okt 09 18:10:12 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=265 ttl=59 time=27.0 ms
Okt 09 18:10:13 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=266 ttl=59 time=26.8 ms
Okt 09 18:10:14 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=267 ttl=59 time=27.4 ms
Okt 09 18:10:15 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=268 ttl=59 time=29.7 ms
Okt 09 18:10:16 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=269 ttl=59 time=27.6 ms

As we can see, after 269 pings the counters are much higher: at 22K.

Note that while systemctl status shows only the byte counters, packet counters are kept as well. Use the low-level systemctl show command to query the current raw values of the in and out packet and byte counters:

# systemctl show ip-accounting-test -p IPIngressBytes -p IPIngressPackets -p IPEgressBytes -p IPEgressPackets
IPIngressBytes=37776
IPIngressPackets=449
IPEgressBytes=37776
IPEgressPackets=449

Of course, the same information is also available via the D-Bus APIs. If you want to process this data further consider talking proper D-Bus, rather than scraping the output of systemctl show.

Now, let's stop the service again:

# systemctl stop ip-accounting-test

When a service with such accounting turned on terminates, a log line about all its consumed resources is written to the logs. Let's check with journalctl:

# journalctl -u ip-accounting-test -n 5
-- Logs begin at Thu 2016-08-18 23:09:37 CEST, end at Mon 2017-10-09 18:17:02 CEST. --
Okt 09 18:15:50 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=603 ttl=59 time=26.9 ms
Okt 09 18:15:51 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=604 ttl=59 time=27.2 ms
Okt 09 18:15:52 sigma systemd[1]: Stopping ip-accounting-test.service...
Okt 09 18:15:52 sigma systemd[1]: Stopped ip-accounting-test.service.
Okt 09 18:15:52 sigma systemd[1]: ip-accounting-test.service: Received 49.5K IP traffic, sent 49.5K IP traffic

The last line shown is the interesting one, that shows the accounting data. It's actually a structured log message, and among its metadata fields it contains the more comprehensive raw data:

# journalctl -u ip-accounting-test -n 1 -o verbose
-- Logs begin at Thu 2016-08-18 23:09:37 CEST, end at Mon 2017-10-09 18:18:50 CEST. --
Mon 2017-10-09 18:15:52.649028 CEST [s=89a2cc877fdf4dafb2269a7631afedad;i=14d7;b=4c7e7adcba0c45b69d612857270716d3;m=137592e75e;t=55b1f81298605;x=c3c9b57b28c9490e]
    PRIORITY=6
    _BOOT_ID=4c7e7adcba0c45b69d612857270716d3
    _MACHINE_ID=e87bfd866aea4ae4b761aff06c9c3cb3
    _HOSTNAME=sigma
    SYSLOG_FACILITY=3
    SYSLOG_IDENTIFIER=systemd
    _UID=0
    _GID=0
    _TRANSPORT=journal
    _PID=1
    _COMM=systemd
    _EXE=/usr/lib/systemd/systemd
    _CAP_EFFECTIVE=3fffffffff
    _SYSTEMD_CGROUP=/init.scope
    _SYSTEMD_UNIT=init.scope
    _SYSTEMD_SLICE=-.slice
    CODE_FILE=../src/core/unit.c
    _CMDLINE=/usr/lib/systemd/systemd --switched-root --system --deserialize 25
    _SELINUX_CONTEXT=system_u:system_r:init_t:s0
    UNIT=ip-accounting-test.service
    CODE_LINE=2115
    CODE_FUNC=unit_log_resources
    MESSAGE_ID=ae8f7b866b0347b9af31fe1c80b127c0
    INVOCATION_ID=98a6e756fa9d421d8dfc82b6df06a9c3
    IP_METRIC_INGRESS_BYTES=50880
    IP_METRIC_INGRESS_PACKETS=605
    IP_METRIC_EGRESS_BYTES=50880
    IP_METRIC_EGRESS_PACKETS=605
    MESSAGE=ip-accounting-test.service: Received 49.6K IP traffic, sent 49.6K IP traffic
    _SOURCE_REALTIME_TIMESTAMP=1507565752649028

The interesting fields of this log message are of course IP_METRIC_INGRESS_BYTES=, IP_METRIC_INGRESS_PACKETS=, IP_METRIC_EGRESS_BYTES=, IP_METRIC_EGRESS_PACKETS= that show the consumed data.

The log message carries a message ID that may be used to quickly search for all such resource log messages (ae8f7b866b0347b9af31fe1c80b127c0). We can combine a search term for messages of this ID with journalctl's -u switch to quickly find out about the resource usage of any invocation of a specific service. Let's try:

# journalctl -u ip-accounting-test MESSAGE_ID=ae8f7b866b0347b9af31fe1c80b127c0
-- Logs begin at Thu 2016-08-18 23:09:37 CEST, end at Mon 2017-10-09 18:25:27 CEST. --
Okt 09 18:15:52 sigma systemd[1]: ip-accounting-test.service: Received 49.6K IP traffic, sent 49.6K IP traffic

Of course, the output above shows only one message at the moment, since we started the service only once, but a new one will appear every time you start and stop it again.

The IP accounting logic is also hooked up with systemd-run, which is useful for transiently running a command as systemd service with IP accounting turned on. Let's try it:

# systemd-run -p IPAccounting=yes --wait wget https://cfp.all-systems-go.io/en/ASG2017/public/schedule/2.pdf
Running as unit: run-u2761.service
Finished with result: success
Main processes terminated with: code=exited/status=0
Service runtime: 878ms
IP traffic received: 231.0K
IP traffic sent: 3.7K

This uses wget to download the PDF version of the 2nd day schedule of everybody's favorite Linux user-space conference All Systems Go! 2017 (BTW, have you already booked your ticket? We are very close to selling out, be quick!). The IP traffic this command generated was 231K ingress and 4K egress. In the systemd-run command line two parameters are important. First of all, we use -p IPAccounting=yes to turn on IP accounting for the transient service (as above). And secondly we use --wait to tell systemd-run to wait for the service to exit. If --wait is used, systemd-run will also show you various statistics about the service that just ran and terminated, including the IP statistics you are seeing if IP accounting has been turned on.

It's fun to combine this sort of IP accounting with interactive transient units. Let's try that:

# systemd-run -p IPAccounting=1 -t /bin/sh
Running as unit: run-u2779.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4# dnf update
…
sh-4.4# dnf install firefox
…
sh-4.4# exit
Finished with result: success
Main processes terminated with: code=exited/status=0
Service runtime: 5.297s
IP traffic received: …B
IP traffic sent: …B

This uses systemd-run's --pty switch (or short: -t), which opens an interactive pseudo-TTY connection to the invoked service process, which is a bourne shell in this case. Doing this means we have a full, comprehensive shell with job control and everything. Since the shell is running as part of a service with IP accounting turned on, all IP traffic we generate or receive will be accounted for. And as soon as we exit the shell, we'll see what it consumed. (For the sake of brevity I actually didn't paste the whole output above, but truncated core parts. Try it out for yourself, if you want to see the output in full.)

Sometimes it might make sense to turn on IP accounting for a unit that is already running. For that, use systemctl set-property foobar.service IPAccounting=yes, which will instantly turn on accounting for it. Note that it won't count retroactively though: only the traffic sent/received after the point in time you turned it on will be collected. You may turn off accounting for the unit with the same command.

Of course, sometimes it's interesting to collect IP accounting data for all services, and turning on IPAccounting=yes in every single unit is cumbersome. To deal with that there's a global option DefaultIPAccounting= available which can be set in /etc/systemd/system.conf.

IP Access Lists

So much about IP accounting. Let's now have a look at IP access control with systemd 235. As mentioned above, the two new unit file settings, IPAddressAllow= and IPAddressDeny= maybe be used for that. They operate in the following way:

  1. If the source address of an incoming packet or the destination address of an outgoing packet matches one of the IP addresses/network masks in the relevant unit's IPAddressAllow= setting then it will be allowed to go through.

  2. Otherwise, if a packet matches an IPAddressDeny= entry configured for the service it is dropped.

  3. If the packet matches neither of the above it is allowed to go through.

Or in other words, IPAddressDeny= implements a blacklist, but IPAddressAllow= takes precedence.

Let's try that out. Let's modify our last example above in order to get a transient service running an interactive shell which has such an access list set:

# systemd-run -p IPAddressDeny=any -p IPAddressAllow=8.8.8.8 -p IPAddressAllow=127.0.0.0/8 -t /bin/sh
Running as unit: run-u2850.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4# ping 8.8.8.8 -c1
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=59 time=27.9 ms

--- 8.8.8.8 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 27.957/27.957/27.957/0.000 ms
sh-4.4# ping 8.8.4.4 -c1
PING 8.8.4.4 (8.8.4.4) 56(84) bytes of data.
ping: sendmsg: Operation not permitted
^C
--- 8.8.4.4 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
sh-4.4# ping 127.0.0.2 -c1
PING 127.0.0.1 (127.0.0.2) 56(84) bytes of data.
64 bytes from 127.0.0.2: icmp_seq=1 ttl=64 time=0.116 ms

--- 127.0.0.2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.116/0.116/0.116/0.000 ms
sh-4.4# exit

The access list we set up uses IPAddressDeny=any in order to define an IP white-list: all traffic will be prohibited for the session, except for what is explicitly white-listed. In this command line, we white-listed two address prefixes: 8.8.8.8 (with no explicit network mask, which means the mask with all bits turned on is implied, i.e. /32), and 127.0.0.0/8. Thus, the service can communicate with Google's DNS server and everything on the local loop-back, but nothing else. The commands run in this interactive shell show this: First we try pinging 8.8.8.8 which happily responds. Then, we try to ping 8.8.4.4 (that's Google's other DNS server, but excluded from this white-list), and as we see it is immediately refused with an Operation not permitted error. As last step we ping 127.0.0.2 (which is on the local loop-back), and we see it works fine again, as expected.

In the example above we used IPAddressDeny=any. The any identifier is a shortcut for writing 0.0.0.0/0 ::/0, i.e. it's a shortcut for everything, on both IPv4 and IPv6. A number of other such shortcuts exist. For example, instead of spelling out 127.0.0.0/8 we could also have used the more descriptive shortcut localhost which is expanded to 127.0.0.0/8 ::1/128, i.e. everything on the local loopback device, on both IPv4 and IPv6.

Being able to configure IP access lists individually for each unit is pretty nice already. However, typically one wants to configure this comprehensively, not just for individual units, but for a set of units in one go or even the system as a whole. In systemd, that's possible by making use of .slice units (for those who don't know systemd that well, slice units are a concept for organizing services in hierarchical tree for the purpose of resource management): the IP access list in effect for a unit is the combination of the individual IP access lists configured for the unit itself and those of all slice units it is contained in.

By default, system services are assigned to system.slice, which in turn is a child of the root slice -.slice. Either of these two slice units are hence suitable for locking down all system services at once. If an access list is configured on system.slice it will only apply to system services, however, if configured on -.slice it will apply to all user processes of the system, including all user session processes (i.e. which are by default assigned to user.slice which is a child of -.slice) in addition to the system services.

Let's make use of this:

# systemctl set-property system.slice IPAddressDeny=any IPAddressAllow=localhost
# systemctl set-property apache.service IPAddressAllow=10.0.0.0/8

The two commands above are a very powerful way to first turn off all IP communication for all system services (with the exception of loop-back traffic), followed by an explicit white-listing of 10.0.0.0/8 (which could refer to the local company network, you get the idea) but only for the Apache service.

Use-cases

After playing around a bit with this, let's talk about use-cases. Here are a few ideas:

  1. The IP access list logic can in many ways provide a more modern replacement for the venerable TCP Wrapper, but unlike it it applies to all IP sockets of a service unconditionally, and requires no explicit support in any way in the service's code: no patching required. On the other hand, TCP wrappers have a number of features this scheme cannot cover, most importantly systemd's IP access lists operate solely on the level of IP addresses and network masks, there is no way to configure access by DNS name (though quite frankly, that is a very dubious feature anyway, as doing networking — unsecured networking even – in order to restrict networking sounds quite questionable, at least to me).

  2. It can also replace (or augment) some facets of IP firewalling, i.e. Linux NetFilter/iptables. Right now, systemd's access lists are of course a lot more minimal than NetFilter, but they have one major benefit: they understand the service concept, and thus are a lot more context-aware than NetFilter. Classic firewalls, such as NetFilter, derive most service context from the IP port number alone, but we live in a world where IP port numbers are a lot more dynamic than they used to be. As one example, a BitTorrent client or server may use any IP port it likes for its file transfer, and writing IP firewalling rules matching that precisely is hence hard. With the systemd IP access list implementing this is easy: just set the list for your BitTorrent service unit, and all is good.

    Let me stress though that you should be careful when comparing NetFilter with systemd's IP address list logic, it's really like comparing apples and oranges: to start with, the IP address list logic has a clearly local focus, it only knows what a local service is and manages access of it. NetFilter on the other hand may run on border gateways, at a point where the traffic flowing through is pure IP, carrying no information about a systemd unit concept or anything like that.

  3. It's a simple way to lock down distribution/vendor supplied system services by default. For example, if you ship a service that you know never needs to access the network, then simply set IPAddressDeny=any (possibly combined with IPAddressAllow=localhost) for it, and it will live in a very tight networking sand-box it cannot escape from. systemd itself makes use of this for a number of its services by default now. For example, the logging service systemd-journald.service, the login manager systemd-logind or the core-dump processing unit systemd-coredump@.service all have such a rule set out-of-the-box, because we know that neither of these services should be able to access the network, under any circumstances.

  4. Because the IP access list logic can be combined with transient units, it can be used to quickly and effectively sandbox arbitrary commands, and even include them in shell pipelines and such. For example, let's say we don't trust our curl implementation (maybe it got modified locally by a hacker, and phones home?), but want to use it anyway to download the the slides of my most recent casync talk in order to print it, but want to make sure it doesn't connect anywhere except where we tell it to (and to make this even more fun, let's minimize privileges further, by setting DynamicUser=yes):

    # systemd-resolve 0pointer.de
    0pointer.de: 85.214.157.71
                 2a01:238:43ed:c300:10c3:bcf3:3266:da74
    -- Information acquired via protocol DNS in 2.8ms.
    -- Data is authenticated: no
    # systemd-run --pipe -p IPAddressDeny=any \
                         -p IPAddressAllow=85.214.157.71 \
                         -p IPAddressAllow=2a01:238:43ed:c300:10c3:bcf3:3266:da74 \
                         -p DynamicUser=yes \
                         curl http://0pointer.de/public/casync-kinvolk2017.pdf | lp
    

So much about use-cases. This is by no means a comprehensive list of what you can do with it, after all both IP accounting and IP access lists are very generic concepts. But I do hope the above inspires your fantasy.

What does that mean for packagers?

IP accounting and IP access control are primarily concepts for the local administrator. However, As suggested above, it's a very good idea to ship services that by design have no network-facing functionality with an access list of IPAddressDeny=any (and possibly IPAddressAllow=localhost), in order to improve the out-of-the-box security of our systems.

An option for security-minded distributions might be a more radical approach: ship the system with -.slice or system.slice configured to IPAddressDeny=any by default, and ask the administrator to punch holes into that for each network facing service with systemctl set-property … IPAddressAllow=…. But of course, that's only an option for distributions willing to break compatibility with what was before.

Notes

A couple of additional notes:

  1. IP accounting and access lists may be mixed with socket activation. In this case, it's a good idea to configure access lists and accounting for both the socket unit that activates and the service unit that is activated, as both units maintain fully separate settings. Note that IP accounting and access lists configured on the socket unit applies to all sockets created on behalf of that unit, and even if these sockets are passed on to the activated services, they will still remain in effect and belong to the socket unit. This also means that IP traffic done on such sockets will be accounted to the socket unit, not the service unit. The fact that IP access lists are maintained separately for the kernel sockets created on behalf of the socket unit and for the kernel sockets created by the service code itself enables some interesting uses. For example, it's possible to set a relatively open access list on the socket unit, but a very restrictive access list on the service unit, thus making the sockets configured through the socket unit the only way in and out of the service.

  2. systemd's IP accounting and access lists apply to IP sockets only, not to sockets of any other address families. That also means that AF_PACKET (i.e. raw) sockets are not covered. This means it's a good idea to combine IP access lists with RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6 in order to lock this down.

  3. You may wonder if the per-unit resource log message and systemd-run --wait may also show you details about other types or resources consumed by a service. The answer is yes: if you turn on CPUAccounting= for a service, you'll also see a summary of consumed CPU time in the log message and the command output. And we are planning to hook-up IOAccounting= the same way too, soon.

  4. Note that IP accounting and access lists aren't entirely free. systemd inserts an eBPF program into the IP pipeline to make this functionality work. However, eBPF execution has been optimized for speed in the last kernel versions already, and given that it currently is in the focus of interest to many I'd expect to be optimized even further, so that the cost for enabling these features will be negligible, if it isn't already.

  5. IP accounting is currently not recursive. That means you cannot use a slice unit to join the accounting of multiple units into one. This is something we definitely want to add, but requires some more kernel work first.

  6. You might wonder how the PrivateNetwork= setting relates to IPAccessDeny=any. Superficially they have similar effects: they make the network unavailable to services. However, looking more closely there are a number of differences. PrivateNetwork= is implemented using Linux network name-spaces. As such it entirely detaches all networking of a service from the host, including non-IP networking. It does so by creating a private little environment the service lives in where communication with itself is still allowed though. In addition using the JoinsNamespaceOf= dependency additional services may be added to the same environment, thus permitting communication with each other but not with anything outside of this group. IPAddressAllow= and IPAddressDeny= are much less invasive. First of all they apply to IP networking only, and can match against specific IP addresses. A service running with PrivateNetwork= turned off but IPAddressDeny=any turned on, may enumerate the network interfaces and their IP configured even though it cannot actually do any IP communication. On the other hand if you turn on PrivateNetwork= all network interfaces besides lo disappear. Long story short: depending on your use-case one, the other, both or neither might be suitable for sand-boxing of your service. If possible I'd always turn on both, for best security, and that's what we do for all of systemd's own long-running services.

And that's all for now. Have fun with per-unit IP accounting and access lists!

© Lennart Poettering. Built using Pelican. Theme by Giulio Fidente on github. .