TL;DR: systemd can now do per-service IP traffic accounting, as well as access control for IP address ranges.
Last Friday we released systemd 235. I already blogged about its Dynamic User feature in detail, but there's one more piece of new functionality that I think deserves special attention: IP accounting and access control.
Before v235 systemd already provided per-unit resource management hooks for a number of different kinds of resources: consumed CPU time, disk I/O, memory usage and number of tasks. With v235 another kind of resource can be controlled per-unit with systemd: network traffic (specifically IP).
Three new unit file settings have been added in this context:
-
IPAccounting= is a boolean setting. If enabled for a unit, all IP traffic sent and received by processes associated with it is counted both in terms of bytes and of packets.
-
IPAddressDeny= takes an IP address prefix (that means: an IP address with a network mask). All traffic from and to addresses matching this prefix will be prohibited for processes of the service.
-
IPAddressAllow= is the matching positive counterpart to IPAddressDeny=. All traffic matching this IP address/network mask combination will be allowed, even if otherwise listed in IPAddressDeny=.
The three options are thin wrappers around kernel functionality
introduced with Linux 4.11: the control group eBPF hooks. The actual
work is done by the kernel; systemd just provides a number of new
settings to configure this facet of it. Note that cgroup/eBPF is
unrelated to classic Linux firewalling, i.e. NetFilter/iptables. It's
up to you whether you use one or the other, or both in combination
(or of course neither).
IP Accounting
Let's have a closer look at the IP accounting logic mentioned
above. Let's write a simple unit,
/etc/systemd/system/ip-accounting-test.service:
[Service]
ExecStart=/usr/bin/ping 8.8.8.8
IPAccounting=yes
This simple unit invokes the
ping(8) command to
send a series of ICMP/IP ping packets to the IP address 8.8.8.8 (which
is the Google DNS server IP; we use it for testing here, since it's
easy to remember, reachable everywhere and known to react to ICMP
pings; any other IP address responding to pings would be fine to use,
too). The IPAccounting=
option is used to turn on IP accounting for
the unit.
Let's start this service after writing the file. Let's then have a
look at the status output of systemctl:
# systemctl daemon-reload
# systemctl start ip-accounting-test
# systemctl status ip-accounting-test
● ip-accounting-test.service
Loaded: loaded (/etc/systemd/system/ip-accounting-test.service; static; vendor preset: disabled)
Active: active (running) since Mon 2017-10-09 18:05:47 CEST; 1s ago
Main PID: 32152 (ping)
IP: 168B in, 168B out
Tasks: 1 (limit: 4915)
CGroup: /system.slice/ip-accounting-test.service
└─32152 /usr/bin/ping 8.8.8.8
Okt 09 18:05:47 sigma systemd[1]: Started ip-accounting-test.service.
Okt 09 18:05:47 sigma ping[32152]: PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
Okt 09 18:05:47 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=1 ttl=59 time=29.2 ms
Okt 09 18:05:48 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=2 ttl=59 time=28.0 ms
This shows the ping
command running — it's currently at its second
ping cycle as we can see in the logs at the end of the output. More
interesting however is the IP:
line further up showing the current
IP byte counters. It currently shows 168 bytes have been received, and
168 bytes have been sent. That the two counters are at the same value
is not surprising: ICMP ping requests and responses are supposed to
have the same size. Note that this line is shown only if
IPAccounting=
is turned on for the service, as only then this data
is collected.
Let's wait a bit, and invoke systemctl status
again:
# systemctl status ip-accounting-test
● ip-accounting-test.service
Loaded: loaded (/etc/systemd/system/ip-accounting-test.service; static; vendor preset: disabled)
Active: active (running) since Mon 2017-10-09 18:05:47 CEST; 4min 28s ago
Main PID: 32152 (ping)
IP: 22.2K in, 22.2K out
Tasks: 1 (limit: 4915)
CGroup: /system.slice/ip-accounting-test.service
└─32152 /usr/bin/ping 8.8.8.8
Okt 09 18:10:07 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=260 ttl=59 time=27.7 ms
Okt 09 18:10:08 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=261 ttl=59 time=28.0 ms
Okt 09 18:10:09 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=262 ttl=59 time=33.8 ms
Okt 09 18:10:10 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=263 ttl=59 time=48.9 ms
Okt 09 18:10:11 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=264 ttl=59 time=27.2 ms
Okt 09 18:10:12 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=265 ttl=59 time=27.0 ms
Okt 09 18:10:13 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=266 ttl=59 time=26.8 ms
Okt 09 18:10:14 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=267 ttl=59 time=27.4 ms
Okt 09 18:10:15 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=268 ttl=59 time=29.7 ms
Okt 09 18:10:16 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=269 ttl=59 time=27.6 ms
As we can see, after 269 pings the counters are much higher: at 22K.
Note that while systemctl status
shows only the byte counters,
packet counters are kept as well. Use the low-level systemctl show
command to query the current raw values of the in and out packet and
byte counters:
# systemctl show ip-accounting-test -p IPIngressBytes -p IPIngressPackets -p IPEgressBytes -p IPEgressPackets
IPIngressBytes=37776
IPIngressPackets=449
IPEgressBytes=37776
IPEgressPackets=449
Of course, the same information is also available via the D-Bus
APIs. If you want to process this data further, consider talking proper
D-Bus rather than scraping the output of systemctl show.
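As a rough sketch of what "talking proper D-Bus" could look like from a shell, the counters can be read with busctl get-property. The helper names unit_bus_path and ip_counter are made up for this illustration, and I am assuming the counters are exposed under the same property names on the org.freedesktop.systemd1.Service interface of the unit's bus object; check with busctl introspect on your system before relying on that.

```shell
# Map a unit name to its object path on the systemd bus. Letters are kept,
# digits are kept except in first position, and every other character
# becomes "_" followed by its two-digit lowercase hex code.
# (unit_bus_path is a hypothetical helper name.)
unit_bus_path() (
    LC_ALL=C                      # byte-wise, locale-independent matching
    rest=$1
    out=
    while [ -n "$rest" ]; do
        c=${rest%"${rest#?}"}     # first character of $rest
        rest=${rest#?}
        case $c in
            [a-zA-Z]) out=$out$c ;;
            [0-9])
                # $out is only empty while we look at the first character
                if [ -n "$out" ]; then out=$out$c
                else out=$out$(printf '_%02x' "'$c"); fi ;;
            *) out=$out$(printf '_%02x' "'$c") ;;
        esac
    done
    printf '/org/freedesktop/systemd1/unit/%s\n' "${out:-_}"
)

# Read one counter (e.g. IPIngressBytes) of a service unit via D-Bus.
# Assumption: the property lives on the .Service interface of the unit.
ip_counter() {
    busctl get-property org.freedesktop.systemd1 \
        "$(unit_bus_path "$1")" \
        org.freedesktop.systemd1.Service "$2"
}
```

With these defined, something like ip_counter ip-accounting-test.service IPIngressBytes should print the D-Bus type signature followed by the current value. A real program would of course use a D-Bus library rather than shelling out.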
Now, let's stop the service again:
# systemctl stop ip-accounting-test
When a service with such accounting turned on terminates, a log line
about all its consumed resources is written to the logs. Let's check
with journalctl:
# journalctl -u ip-accounting-test -n 5
-- Logs begin at Thu 2016-08-18 23:09:37 CEST, end at Mon 2017-10-09 18:17:02 CEST. --
Okt 09 18:15:50 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=603 ttl=59 time=26.9 ms
Okt 09 18:15:51 sigma ping[32152]: 64 bytes from 8.8.8.8: icmp_seq=604 ttl=59 time=27.2 ms
Okt 09 18:15:52 sigma systemd[1]: Stopping ip-accounting-test.service...
Okt 09 18:15:52 sigma systemd[1]: Stopped ip-accounting-test.service.
Okt 09 18:15:52 sigma systemd[1]: ip-accounting-test.service: Received 49.6K IP traffic, sent 49.6K IP traffic
The last line shown is the interesting one: it shows the accounting data. It's actually a structured log message, and among its metadata fields it contains the more comprehensive raw data:
# journalctl -u ip-accounting-test -n 1 -o verbose
-- Logs begin at Thu 2016-08-18 23:09:37 CEST, end at Mon 2017-10-09 18:18:50 CEST. --
Mon 2017-10-09 18:15:52.649028 CEST [s=89a2cc877fdf4dafb2269a7631afedad;i=14d7;b=4c7e7adcba0c45b69d612857270716d3;m=137592e75e;t=55b1f81298605;x=c3c9b57b28c9490e]
PRIORITY=6
_BOOT_ID=4c7e7adcba0c45b69d612857270716d3
_MACHINE_ID=e87bfd866aea4ae4b761aff06c9c3cb3
_HOSTNAME=sigma
SYSLOG_FACILITY=3
SYSLOG_IDENTIFIER=systemd
_UID=0
_GID=0
_TRANSPORT=journal
_PID=1
_COMM=systemd
_EXE=/usr/lib/systemd/systemd
_CAP_EFFECTIVE=3fffffffff
_SYSTEMD_CGROUP=/init.scope
_SYSTEMD_UNIT=init.scope
_SYSTEMD_SLICE=-.slice
CODE_FILE=../src/core/unit.c
_CMDLINE=/usr/lib/systemd/systemd --switched-root --system --deserialize 25
_SELINUX_CONTEXT=system_u:system_r:init_t:s0
UNIT=ip-accounting-test.service
CODE_LINE=2115
CODE_FUNC=unit_log_resources
MESSAGE_ID=ae8f7b866b0347b9af31fe1c80b127c0
INVOCATION_ID=98a6e756fa9d421d8dfc82b6df06a9c3
IP_METRIC_INGRESS_BYTES=50880
IP_METRIC_INGRESS_PACKETS=605
IP_METRIC_EGRESS_BYTES=50880
IP_METRIC_EGRESS_PACKETS=605
MESSAGE=ip-accounting-test.service: Received 49.6K IP traffic, sent 49.6K IP traffic
_SOURCE_REALTIME_TIMESTAMP=1507565752649028
The interesting fields of this log message are of course
IP_METRIC_INGRESS_BYTES=, IP_METRIC_INGRESS_PACKETS=,
IP_METRIC_EGRESS_BYTES= and IP_METRIC_EGRESS_PACKETS=, which show the
consumed data.
The log message carries a message ID
that may be used to quickly search for all such resource log messages
(ae8f7b866b0347b9af31fe1c80b127c0). We can combine a search term for
messages of this ID with journalctl's -u switch to quickly find
out about the resource usage of any invocation of a specific
service. Let's try:
# journalctl -u ip-accounting-test MESSAGE_ID=ae8f7b866b0347b9af31fe1c80b127c0
-- Logs begin at Thu 2016-08-18 23:09:37 CEST, end at Mon 2017-10-09 18:25:27 CEST. --
Okt 09 18:15:52 sigma systemd[1]: ip-accounting-test.service: Received 49.6K IP traffic, sent 49.6K IP traffic
Of course, the output above shows only one message at the moment, since we started the service only once, but a new one will appear every time you start and stop it again.
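Once a service has been started and stopped a few times, you might want totals across all invocations. Here is a small sketch for that, assuming the journal entries are fed in with one FIELD=value pair per line, which is how journalctl -o export prints plain-text fields. The function name sum_ip_metrics is made up for this example.

```shell
# Sum the IP_METRIC_* fields of all journal entries read from stdin and
# print one NAME=total line per field, sorted by field name.
# (sum_ip_metrics is a hypothetical helper name.)
sum_ip_metrics() {
    awk -F= '
        /^IP_METRIC_(INGRESS|EGRESS)_(BYTES|PACKETS)=/ { total[$1] += $2 }
        END { for (k in total) printf "%s=%.0f\n", k, total[k] }
    ' | sort
}
```

It could be used like this: journalctl -u ip-accounting-test MESSAGE_ID=ae8f7b866b0347b9af31fe1c80b127c0 -o export | sum_ip_metrics. Keep in mind this is exactly the kind of output scraping that a proper journal API client would avoid, so treat it as a quick interactive helper only.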
The IP accounting logic is also hooked up with systemd-run,
which is useful for transiently running a command as a systemd service
with IP accounting turned on. Let's try it:
# systemd-run -p IPAccounting=yes --wait wget https://cfp.all-systems-go.io/en/ASG2017/public/schedule/2.pdf
Running as unit: run-u2761.service
Finished with result: success
Main processes terminated with: code=exited/status=0
Service runtime: 878ms
IP traffic received: 231.0K
IP traffic sent: 3.7K
This uses wget
to download the
PDF version of the 2nd day
schedule
of everybody's favorite Linux user-space conference All Systems Go!
2017 (BTW, have you already booked your
ticket? We are very close to
selling out, be quick!). The IP traffic this command generated was
231K ingress and 4K egress. In the systemd-run
command line two
parameters are important. First of all, we use -p IPAccounting=yes
to turn on IP accounting for the transient service (as above). And
secondly we use --wait
to tell systemd-run
to wait for the service
to exit. If --wait
is used, systemd-run
will also show you various
statistics about the service that just ran and terminated, including
the IP statistics you are seeing if IP accounting has been turned on.
It's fun to combine this sort of IP accounting with interactive transient units. Let's try that:
# systemd-run -p IPAccounting=1 -t /bin/sh
Running as unit: run-u2779.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4# dnf update
…
sh-4.4# dnf install firefox
…
sh-4.4# exit
Finished with result: success
Main processes terminated with: code=exited/status=0
Service runtime: 5.297s
IP traffic received: …B
IP traffic sent: …B
This uses systemd-run's --pty switch (or short: -t), which opens
an interactive pseudo-TTY connection to the invoked service process,
which is a Bourne shell in this case. Doing this means we have a full,
comprehensive shell with job control and everything. Since the shell
is running as part of a service with IP accounting turned on, all IP
traffic we generate or receive will be accounted for. And as soon as
we exit the shell, we'll see what it consumed. (For the sake of
brevity I actually didn't paste the whole output above, but truncated
core parts. Try it out for yourself, if you want to see the output in
full.)
Sometimes it might make sense to turn on IP accounting for a unit that
is already running. For that, use systemctl set-property
foobar.service IPAccounting=yes, which will instantly turn on
accounting for it. Note that it won't count retroactively though: only
the traffic sent/received after the point in time you turned it on
will be collected. You may turn off accounting for the unit with the
same command.
Of course, sometimes it's interesting to collect IP accounting data
for all services, and turning on IPAccounting=yes
in every single
unit is cumbersome. To deal with that there's a global option,
DefaultIPAccounting=, which can be set in /etc/systemd/system.conf.
IP Access Lists
So much for IP accounting. Let's now have a look at IP access
control with systemd 235. As mentioned above, the two new unit file
settings IPAddressAllow= and IPAddressDeny= may be used for
that. They operate in the following way:
-
If the source address of an incoming packet or the destination address of an outgoing packet matches one of the IP addresses/network masks in the relevant unit's IPAddressAllow= setting then it will be allowed to go through.
-
Otherwise, if a packet matches an IPAddressDeny= entry configured for the service it is dropped.
-
If the packet matches neither of the above it is allowed to go through.
Or in other words, IPAddressDeny=
implements a blacklist, but
IPAddressAllow=
takes precedence.
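To make the evaluation order concrete, here is a toy model of the three rules in plain shell. It is only an illustration: the real matching is done by the eBPF program in the kernel, this sketch handles IPv4 only, it does not know shortcuts such as any, and all function names are made up.

```shell
# Convert a dotted-quad IPv4 address to a single integer.
ip4_to_int() (
    IFS=.
    set -- $1                     # split the address into its four octets
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Succeed if address $1 lies within prefix $2 ("a.b.c.d" or "a.b.c.d/len").
ip4_in_prefix() (
    addr=$(ip4_to_int "$1")
    case $2 in
        */*) net=${2%/*}; len=${2#*/} ;;
        *)   net=$2; len=32 ;;    # no mask given: all bits set, i.e. /32
    esac
    net=$(ip4_to_int "$net")
    if [ "$len" -eq 0 ]; then
        mask=0
    else
        mask=$(( (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
    fi
    [ $(( addr & mask )) -eq $(( net & mask )) ]
)

# Print the verdict for peer address $1, given a space-separated allow
# list in $2 and a space-separated deny list in $3.
ip4_verdict() {
    for p in $2; do               # rule 1: an allow match always wins
        ip4_in_prefix "$1" "$p" && { echo allow; return 0; }
    done
    for p in $3; do               # rule 2: otherwise a deny match drops
        ip4_in_prefix "$1" "$p" && { echo drop; return 0; }
    done
    echo allow                    # rule 3: no match at all, let it through
}
```

For example, ip4_verdict 8.8.4.4 "8.8.8.8 127.0.0.0/8" "0.0.0.0/0" prints drop, while the same lists with 127.0.0.2 as the address print allow, because the allow list is consulted first.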
Let's try that out. Let's modify our last example above in order to get a transient service running an interactive shell which has such an access list set:
# systemd-run -p IPAddressDeny=any -p IPAddressAllow=8.8.8.8 -p IPAddressAllow=127.0.0.0/8 -t /bin/sh
Running as unit: run-u2850.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4# ping 8.8.8.8 -c1
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=59 time=27.9 ms
--- 8.8.8.8 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 27.957/27.957/27.957/0.000 ms
sh-4.4# ping 8.8.4.4 -c1
PING 8.8.4.4 (8.8.4.4) 56(84) bytes of data.
ping: sendmsg: Operation not permitted
^C
--- 8.8.4.4 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
sh-4.4# ping 127.0.0.2 -c1
PING 127.0.0.2 (127.0.0.2) 56(84) bytes of data.
64 bytes from 127.0.0.2: icmp_seq=1 ttl=64 time=0.116 ms
--- 127.0.0.2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.116/0.116/0.116/0.000 ms
sh-4.4# exit
The access list we set up uses IPAddressDeny=any
in order to define
an IP white-list: all traffic will be prohibited for the session,
except for what is explicitly white-listed. In this command line, we
white-listed two address prefixes: 8.8.8.8 (with no explicit network
mask, which means the mask with all bits turned on is implied,
i.e. /32), and 127.0.0.0/8. Thus, the service can communicate with
Google's DNS server and everything on the local loop-back, but nothing
else. The commands run in this interactive shell show this: First we
try pinging 8.8.8.8 which happily responds. Then, we try to ping
8.8.4.4 (that's Google's other DNS server, but excluded from this
white-list), and as we see it is immediately refused with an Operation
not permitted error. As a last step we ping 127.0.0.2 (which is on the
local loop-back), and we see it works fine again, as expected.
In the example above we used IPAddressDeny=any. The any
identifier is a shortcut for writing 0.0.0.0/0 ::/0, i.e. it's a
shortcut for everything, on both IPv4 and IPv6. A number of other
such shortcuts exist. For example, instead of spelling out
127.0.0.0/8 we could also have used the more descriptive shortcut
localhost, which is expanded to 127.0.0.0/8 ::1/128, i.e. everything
on the local loopback device, on both IPv4 and IPv6.
Being able to configure IP access lists individually for each unit is
pretty nice already. However, typically one wants to configure this
comprehensively, not just for individual units, but for a set of units
in one go or even the system as a whole. In systemd, that's possible
by making use of .slice
units (for those who don't know systemd that well, slice units are a
concept for organizing services in a hierarchical tree for the purpose of
resource management): the IP access list in effect for a unit is the
combination of the individual IP access lists configured for the unit
itself and those of all slice units it is contained in.
By default, system services are assigned to system.slice,
which in turn is a child of the root slice -.slice. Either
of these two slice units is hence suitable for locking down all
system services at once. If an access list is configured on
system.slice it will only apply to system services; if, however, it is
configured on -.slice it will apply to all user processes of the
system, including all user session processes (which are by
default assigned to user.slice, which is a child of -.slice) in
addition to the system services.
Let's make use of this:
# systemctl set-property system.slice IPAddressDeny=any IPAddressAllow=localhost
# systemctl set-property apache.service IPAddressAllow=10.0.0.0/8
The two commands above are a very powerful combination: the first turns off all IP communication for all system services (with the exception of loop-back traffic), and the second explicitly white-lists 10.0.0.0/8 (which could refer to the local company network, you get the idea), but only for the Apache service.
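If you want to see which units contribute to the access list in effect for a service, it helps to walk up the slice tree. The sketch below relies on the slice naming rule (foo-bar.slice lives inside foo.slice, which lives inside -.slice). The function names are made up, and I am assuming that systemctl show prints the Slice, IPAddressAllow and IPAddressDeny properties on your version.

```shell
# Print a slice and all of its ancestors, one per line, ending with the
# root slice. (slice_ancestors is a hypothetical helper name.)
slice_ancestors() (
    [ -n "$1" ] || exit 0
    name=${1%.slice}
    while [ "$name" != "-" ]; do
        printf '%s.slice\n' "$name"
        case $name in
            *-*) name=${name%-*} ;;   # strip the last dash-separated part
            *)   name=- ;;            # top-level slice: parent is the root
        esac
    done
    printf '%s\n' '-.slice'
)

# Show the access lists configured on a unit and on every slice above it.
show_access_lists() {
    unit=$1
    slice=$(systemctl show -p Slice --value "$unit") || return
    for u in "$unit" $(slice_ancestors "$slice"); do
        # "--" keeps "-.slice" from being parsed as an option
        systemctl show -p Id -p IPAddressAllow -p IPAddressDeny -- "$u"
    done
}
```

For the Apache example above, show_access_lists apache.service would print the lists of apache.service, system.slice and -.slice in turn.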
Use-cases
After playing around a bit with this, let's talk about use-cases. Here are a few ideas:
-
The IP access list logic can in many ways provide a more modern replacement for the venerable TCP Wrapper, but unlike TCP Wrapper it applies to all IP sockets of a service unconditionally, and requires no explicit support in the service's code: no patching required. On the other hand, TCP Wrapper has a number of features this scheme cannot cover. Most importantly, systemd's IP access lists operate solely on the level of IP addresses and network masks; there is no way to configure access by DNS name (though quite frankly, that is a very dubious feature anyway, as doing networking, unsecured networking even, in order to restrict networking sounds quite questionable, at least to me).
-
It can also replace (or augment) some facets of IP firewalling, i.e. Linux NetFilter/iptables. Right now, systemd's access lists are of course a lot more minimal than NetFilter, but they have one major benefit: they understand the service concept, and thus are a lot more context-aware than NetFilter. Classic firewalls, such as NetFilter, derive most service context from the IP port number alone, but we live in a world where IP port numbers are a lot more dynamic than they used to be. As one example, a BitTorrent client or server may use any IP port it likes for its file transfer, and writing IP firewalling rules matching that precisely is hence hard. With the systemd IP access list implementing this is easy: just set the list for your BitTorrent service unit, and all is good.
Let me stress though that you should be careful when comparing NetFilter with systemd's IP address list logic; it's really like comparing apples and oranges. To start with, the IP address list logic has a clearly local focus: it only knows what a local service is and manages access to and from it. NetFilter on the other hand may run on border gateways, at a point where the traffic flowing through is pure IP, carrying no information about a systemd unit concept or anything like that.
-
It's a simple way to lock down distribution/vendor supplied system services by default. For example, if you ship a service that you know never needs to access the network, then simply set IPAddressDeny=any (possibly combined with IPAddressAllow=localhost) for it, and it will live in a very tight networking sand-box it cannot escape from. systemd itself makes use of this for a number of its services by default now. For example, the logging service systemd-journald.service, the login manager systemd-logind or the core-dump processing unit systemd-coredump@.service all have such a rule set out-of-the-box, because we know that none of these services should be able to access the network, under any circumstances.
-
Because the IP access list logic can be combined with transient units, it can be used to quickly and effectively sandbox arbitrary commands, and even include them in shell pipelines and such. For example, let's say we don't trust our curl implementation (maybe it got modified locally by a hacker, and phones home?), but want to use it anyway to download the slides of my most recent casync talk in order to print them, but want to make sure it doesn't connect anywhere except where we tell it to (and to make this even more fun, let's minimize privileges further, by setting DynamicUser=yes):
# systemd-resolve 0pointer.de
0pointer.de: 85.214.157.71
             2a01:238:43ed:c300:10c3:bcf3:3266:da74
-- Information acquired via protocol DNS in 2.8ms.
-- Data is authenticated: no
# systemd-run --pipe -p IPAddressDeny=any \
              -p IPAddressAllow=85.214.157.71 \
              -p IPAddressAllow=2a01:238:43ed:c300:10c3:bcf3:3266:da74 \
              -p DynamicUser=yes \
              curl http://0pointer.de/public/casync-kinvolk2017.pdf | lp
So much for use-cases. This is by no means a comprehensive list of what you can do with it; after all, both IP accounting and IP access lists are very generic concepts. But I do hope the above inspires your imagination.
What does that mean for packagers?
IP accounting and IP access control are primarily concepts for the
local administrator. However, as suggested above, it's a very good
idea to ship services that by design have no network-facing
functionality with an access list of IPAddressDeny=any (and possibly
IPAddressAllow=localhost), in order to improve the out-of-the-box
security of our systems.
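A packager would simply put those two lines into the [Service] section of the shipped unit file. For an administrator who wants to retrofit the same lockdown onto an existing unit, a drop-in is one way to do it. The following is a minimal sketch: the function name and the drop-in file name 50-ip-lockdown.conf are arbitrary choices of mine, and the optional second argument only exists so the function can be tried out against a scratch directory.

```shell
# Write a drop-in that cuts a service off from all IP traffic except
# loop-back. $1 is the unit name, $2 the unit directory (defaults to
# /etc/systemd/system). Function and file names are hypothetical.
write_ip_lockdown_dropin() {
    unit=$1
    root=${2:-/etc/systemd/system}
    dir=$root/$unit.d
    mkdir -p "$dir" || return
    printf '%s\n' \
        '[Service]' \
        'IPAddressDeny=any' \
        'IPAddressAllow=localhost' \
        > "$dir/50-ip-lockdown.conf"
}
```

After something like write_ip_lockdown_dropin foobar.service, run systemctl daemon-reload and restart the service so the new settings are picked up.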
An option for security-minded distributions might be a more radical
approach: ship the system with -.slice
or system.slice
configured
to IPAddressDeny=any
by default, and ask the administrator to punch
holes into that for each network facing service with systemctl
set-property … IPAddressAllow=…. But of course, that's only an
option for distributions willing to break compatibility with what was
before.
Notes
A couple of additional notes:
-
IP accounting and access lists may be mixed with socket activation. In this case, it's a good idea to configure access lists and accounting for both the socket unit that activates and the service unit that is activated, as both units maintain fully separate settings. Note that IP accounting and access lists configured on the socket unit apply to all sockets created on behalf of that unit, and even if these sockets are passed on to the activated services, they will still remain in effect and belong to the socket unit. This also means that IP traffic done on such sockets will be accounted to the socket unit, not the service unit. The fact that IP access lists are maintained separately for the kernel sockets created on behalf of the socket unit and for the kernel sockets created by the service code itself enables some interesting uses. For example, it's possible to set a relatively open access list on the socket unit, but a very restrictive access list on the service unit, thus making the sockets configured through the socket unit the only way in and out of the service.
-
systemd's IP accounting and access lists apply to IP sockets only, not to sockets of any other address families. That also means that AF_PACKET (i.e. raw) sockets are not covered. This means it's a good idea to combine IP access lists with RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6 in order to lock this down.
-
You may wonder if the per-unit resource log message and systemd-run --wait may also show you details about other types of resources consumed by a service. The answer is yes: if you turn on CPUAccounting= for a service, you'll also see a summary of consumed CPU time in the log message and the command output. And we are planning to hook up IOAccounting= the same way too, soon.
-
Note that IP accounting and access lists aren't entirely free. systemd inserts an eBPF program into the IP pipeline to make this functionality work. However, eBPF execution has been optimized for speed in the last kernel versions already, and given that it currently is in the focus of interest to many I'd expect it to be optimized even further, so that the cost for enabling these features will be negligible, if it isn't already.
-
IP accounting is currently not recursive. That means you cannot use a slice unit to join the accounting of multiple units into one. This is something we definitely want to add, but requires some more kernel work first.
-
You might wonder how the PrivateNetwork= setting relates to IPAddressDeny=any. Superficially they have similar effects: they make the network unavailable to services. However, looking more closely there are a number of differences. PrivateNetwork= is implemented using Linux network name-spaces. As such it entirely detaches all networking of a service from the host, including non-IP networking. It does so by creating a private little environment the service lives in where communication with itself is still allowed though. In addition, using the JoinsNamespaceOf= dependency additional services may be added to the same environment, thus permitting communication with each other but not with anything outside of this group. IPAddressAllow= and IPAddressDeny= are much less invasive. First of all they apply to IP networking only, and can match against specific IP addresses. A service running with PrivateNetwork= turned off but IPAddressDeny=any turned on may enumerate the network interfaces and their IP configuration even though it cannot actually do any IP communication. On the other hand, if you turn on PrivateNetwork= all network interfaces besides lo disappear. Long story short: depending on your use-case one, the other, both or neither might be suitable for sand-boxing of your service. If possible I'd always turn on both, for best security, and that's what we do for all of systemd's own long-running services.
And that's all for now. Have fun with per-unit IP accounting and access lists!