Use TemporaryFileSystem to hide files or directories from systemd services

For decades people used chroot to restrict the access a service has to the filesystem tree. This was never meant to be a security boundary and there are many ways to break out of it. Implementations got better over time, but were never good enough to be called a security feature. Built on this idea of creating a limited environment for a service, containers like LXC and Docker were born. Containers are useful in many areas but add a lot of complexity if you just want to hide files and directories from your services. This is where systemd's TemporaryFileSystem option can shine.
TemporaryFileSystem
TemporaryFileSystem is a relatively new option added in systemd 238. When using Ubuntu LTS you need Ubuntu 20.04 or later. Before that you could use the RootDirectory option, systemd's implementation of chroot. TemporaryFileSystem mounts an empty tmpfs filesystem over each path in the space-separated list you pass it. You can then use BindReadOnlyPaths and BindPaths to mount parts of the original filesystem back on top of this tmpfs.
Example from the systemd docs: If a unit has the following,
TemporaryFileSystem=/var:ro
BindReadOnlyPaths=/var/lib/systemd
then the invoked processes by the unit cannot see any files or directories under /var/ except for /var/lib/systemd or its contents.
Hiding everything but the necessary minimum
Using TemporaryFileSystem=/:ro you can mount an entirely empty root filesystem, and the service would not see a single file. But it also could not start, since its binary and everything else it needs would be missing too.
You need to mount everything necessary for the service to operate on top of this empty tmpfs. You don't have to think of every little detail, since many systemd options already include mounts for paths many services need to function.
RuntimeDirectory, LogsDirectory, CacheDirectory, ConfigurationDirectory, StateDirectory, and PrivateTmp mount the corresponding paths on top of the tmpfs and make them writeable if needed. PrivateDevices mounts /dev/null, /dev/zero, /dev/random, and other important devices. ProtectControlGroups, ProtectKernelModules, and ProtectKernelTunables all imply MountAPIVFS, which mounts /sys, /proc, and /dev. MountAPIVFS mounts these directories writeable, so use the three options to make the content the service doesn't need to modify read-only again. The very recent systemd 247 also adds the option ProtectProc=invisible, which hides all processes except the service's own in /proc. If your systemd is not that current, I recommend setting the hidepid=2 option for the /proc mount in /etc/fstab.
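The options from this section can be combined into one [Service] section. The following is a minimal sketch, not a tested unit: "myservice" is a placeholder directory name, and a real service still needs its binary and libraries bound in before it can start.

```ini
[Service]
# Start from an empty, read-only root.
TemporaryFileSystem=/:ro

# "myservice" is a placeholder. The options below correspond to
# /run/myservice, /var/log/myservice, /var/cache/myservice,
# /etc/myservice and /var/lib/myservice, in that order.
RuntimeDirectory=myservice
LogsDirectory=myservice
CacheDirectory=myservice
ConfigurationDirectory=myservice
StateDirectory=myservice
PrivateTmp=yes

# Private /dev with null, zero, random and a few other devices.
PrivateDevices=yes

# The three Protect* options discussed above.
ProtectControlGroups=yes
ProtectKernelModules=yes
ProtectKernelTunables=yes

# Only understood by systemd 247 or later.
ProtectProc=invisible
```

Comments have to sit on their own lines, because systemd treats a trailing "#" as part of the value.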
To find the paths most services need, I read the AppArmor base profile and used trial and error on a few services. I came up with the following paths that were needed by most services:
BindReadOnlyPaths=/lib/ /lib64/ /usr/lib/ /usr/lib64/ /etc/ld.so.cache /etc/ld.so.conf /etc/ld.so.conf.d/ /etc/bindresvport.blacklist /usr/share/zoneinfo/ /usr/share/locale/ /etc/localtime /usr/share/common-licenses/ /etc/ssl/certs/ /etc/alternatives/
Not all services need every path, but I would say these are not security relevant, so I just always include them. The systemd docs also recommend (see Example 1) the following mounts to allow Type=notify and logging to journald:
BindReadOnlyPaths=/dev/log /run/systemd/journal/socket /run/systemd/journal/stdout /run/systemd/notify
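Since BindReadOnlyPaths may be given more than once and each use appends to the list, the base paths can be split into commented groups. The sketch below does that. The leading "-" on /lib64/ and /usr/lib64/ is my addition, not part of the original list: it tells systemd to skip a source path that does not exist, which I assume is useful on distributions without those directories.

```ini
[Service]
TemporaryFileSystem=/:ro

# Shared libraries and dynamic linker configuration.
# A leading "-" makes systemd ignore a source path that does not exist.
BindReadOnlyPaths=/lib/ -/lib64/ /usr/lib/ -/usr/lib64/
BindReadOnlyPaths=/etc/ld.so.cache /etc/ld.so.conf /etc/ld.so.conf.d/

# Time zone, locale and certificate data.
BindReadOnlyPaths=/usr/share/zoneinfo/ /usr/share/locale/ /etc/localtime /etc/ssl/certs/

# Remaining entries from the base list.
BindReadOnlyPaths=/etc/bindresvport.blacklist /usr/share/common-licenses/ /etc/alternatives/

# Sockets for journald logging and Type=notify.
BindReadOnlyPaths=/dev/log /run/systemd/journal/socket /run/systemd/journal/stdout /run/systemd/notify
```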
Using BindReadOnlyPaths and BindPaths you can then mount additional needed directories and files, like the service binary itself. For example, my nginx service includes the following additional paths:
BindReadOnlyPaths=/usr/sbin/nginx /bin/kill /srv/ /run/php/
/srv/ is where my web apps are located.
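BindPaths works the same way but leaves the mount writeable. As a sketch only, a service that has to write somewhere outside the directories systemd manages could look like this. /var/uploads/ is a hypothetical path chosen for illustration and is not part of my nginx setup.

```ini
[Service]
TemporaryFileSystem=/:ro

# Read-only paths from the nginx example.
BindReadOnlyPaths=/usr/sbin/nginx /bin/kill /srv/ /run/php/

# Hypothetical writeable directory, bound read-write from the host.
BindPaths=/var/uploads/
```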
Finding paths that are needed
Finding all files and directories that a service needs to function is very frustrating, because the error messages you get can be misleading. For example, if I don't include the library paths like /lib/ and try to start the nginx service, I get the following error message in journalctl:
systemd[1]: Starting nginx...
systemd[1365084]: nginx.service: Failed to execute command: No such file or directory
systemd[1365084]: nginx.service: Failed at step EXEC spawning /usr/sbin/nginx: No such file or directory
systemd[1]: nginx.service: Main process exited, code=exited, status=203/EXEC
systemd[1]: nginx.service: Failed with result 'exit-code'.
systemd[1]: Failed to start nginx.
It looks like /usr/sbin/nginx is missing even though it is included with BindReadOnlyPaths, but in reality /lib/x86_64-linux-gnu is missing. The kernel reports the same "No such file or directory" when it cannot find the dynamic loader the binary asks for. Once I had my basic list of needed libraries, finding service-specific dependencies got easier. Since most binaries could now start, even if they failed immediately afterwards, I at least got more informative error messages from the app itself. These error messages would then complain about missing, inaccessible, or non-writeable paths that I could add. If the service crashed without an error message, I used strace to find the last path the service tried to access before crashing. To do this, I included strace in BindReadOnlyPaths and then added it at the beginning of ExecStart like this:
BindReadOnlyPaths=/usr/bin/strace
ExecStart=/usr/bin/strace -e trace=%%file /usr/sbin/nginx -c /etc/nginx/nginx.conf
On the command line you would type strace -e trace=%file with only one % sign, but in a unit file you need to escape it with a second one or systemd will interpret it as a specifier. You can then use journalctl to see every file operation the service tries to make before crashing.
To debug other failed starts it can be helpful to enable systemd debugging. This can only be applied globally and applies to all services until you disable it.
sudo systemctl log-level debug
This can help find other problems you have starting the service that might or might not be related to other options you put into your systemd service file.
Don't use ProtectSystem in conjunction with TemporaryFileSystem
While implementing TemporaryFileSystem I stumbled into a problem that took me some time to understand. I had been using ProtectSystem=strict to keep my service from writing to the filesystem. But if you use this together with TemporaryFileSystem, it mounts the entire filesystem on top of your empty tmpfs and all files are accessible to the service again.
Testing if it works
To see if hiding the filesystem worked, I run ls inside ExecStartPre and look at the journalctl output. For example, I temporarily add this to my nginx service:
BindReadOnlyPaths=/usr/bin/ls
ExecStartPre=/usr/bin/ls -l / /bin /dev /etc /run /var
And I see this in journalctl:
systemd[1]: Starting nginx...
ls[1378398]: /:
ls[1378398]: total 16
ls[1378398]: drwxr-xr-x 2 0 0 60 May 8 14:35 bin
ls[1378398]: drwxr-xr-x 7 0 0 400 May 8 14:35 dev
ls[1378398]: drwxr-xr-x 7 0 0 220 May 8 14:35 etc
ls[1378398]: drwxr-xr-x 102 0 0 4096 May 7 04:08 lib
ls[1378398]: drwxr-xr-x 2 0 0 4096 Apr 27 00:45 lib64
ls[1378398]: dr-xr-xr-x 639 0 0 0 May 8 14:35 proc
ls[1378398]: drwxr-xr-x 5 0 0 100 May 8 14:35 run
ls[1378398]: drwxr-xr-x 9 0 0 4096 Mar 22 21:32 srv
ls[1378398]: dr-xr-xr-x 13 0 0 0 Apr 28 12:27 sys
ls[1378398]: drwxrwxrwt 2 0 0 4096 May 8 14:35 tmp
ls[1378398]: drwxr-xr-x 7 0 0 140 May 8 14:35 usr
ls[1378398]: drwxr-xr-x 7 0 0 140 May 8 14:35 var
ls[1378398]: /bin:
ls[1378398]: total 32
ls[1378398]: -rwxr-xr-x 1 0 0 30952 Mar 24 19:51 kill
ls[1378398]: /dev:
ls[1378398]: total 0
ls[1378398]: drwxr-xr-x 2 0 0 180 May 8 14:35 char
ls[1378398]: lrwxrwxrwx 1 0 0 11 May 8 14:35 core -> /proc/kcore
ls[1378398]: lrwxrwxrwx 1 0 0 13 May 8 14:35 fd -> /proc/self/fd
ls[1378398]: crw-rw-rw- 1 0 0 1, 7 May 8 14:35 full
ls[1378398]: drwxr-xr-x 2 0 0 0 Apr 28 12:27 hugepages
ls[1378398]: lrwxrwxrwx 1 0 0 28 May 8 14:35 log -> /run/systemd/journal/dev-log
ls[1378398]: drwxrwxrwt 2 0 0 40 Apr 28 12:27 mqueue
ls[1378398]: crw-rw-rw- 1 0 0 1, 3 May 8 14:35 null
ls[1378398]: crw-rw-rw- 1 0 0 5, 2 May 8 14:35 ptmx
ls[1378398]: drwxr-xr-x 2 0 0 0 Apr 28 12:27 pts
ls[1378398]: crw-rw-rw- 1 0 0 1, 8 May 8 14:35 random
ls[1378398]: drwxrwxrwt 4 0 0 100 May 8 14:14 shm
ls[1378398]: lrwxrwxrwx 1 0 0 15 May 8 14:35 stderr -> /proc/self/fd/2
ls[1378398]: lrwxrwxrwx 1 0 0 15 May 8 14:35 stdin -> /proc/self/fd/0
ls[1378398]: lrwxrwxrwx 1 0 0 15 May 8 14:35 stdout -> /proc/self/fd/1
ls[1378398]: crw-rw-rw- 1 0 0 5, 0 May 8 14:35 tty
ls[1378398]: crw-rw-rw- 1 0 0 1, 9 May 8 14:35 urandom
ls[1378398]: crw-rw-rw- 1 0 0 1, 5 May 8 14:35 zero
ls[1378398]: /etc:
ls[1378398]: total 88
ls[1378398]: drwxr-xr-x 2 0 0 20480 Dec 11 09:03 alternatives
ls[1378398]: -rw-r--r-- 1 0 0 367 Apr 14 2020 bindresvport.blacklist
ls[1378398]: drwxr-xr-x 2 1001 1001 4096 May 1 2020 certificates
ls[1378398]: -rw-r--r-- 1 0 0 43222 May 7 04:10 ld.so.cache
ls[1378398]: -rw-r--r-- 1 0 0 34 Apr 14 2020 ld.so.conf
ls[1378398]: drwxr-xr-x 2 0 0 4096 Apr 27 00:46 ld.so.conf.d
ls[1378398]: -rw-r--r-- 1 0 0 2326 Jan 27 22:32 localtime
ls[1378398]: drwxr-xr-x 3 0 0 4096 Apr 14 00:22 nginx
ls[1378398]: drwxr-xr-x 3 0 0 60 May 8 14:35 ssl
ls[1378398]: /run:
ls[1378398]: total 0
ls[1378398]: drwxr-xr-x 2 113 120 40 May 8 14:35 nginx
ls[1378398]: drwxr-xr-x 2 0 0 80 Apr 28 12:30 php
ls[1378398]: drwxr-xr-x 3 0 0 80 May 8 14:35 systemd
ls[1378398]: /var:
ls[1378398]: total 4
ls[1378398]: drwxr-xr-x 3 0 0 60 May 8 14:35 cache
ls[1378398]: drwxr-xr-x 3 0 0 60 May 8 14:35 lib
ls[1378398]: drwxr-xr-x 3 0 0 60 May 8 14:35 log
ls[1378398]: drwxr-xr-x 2 0 0 80 May 8 14:35 opt
ls[1378398]: drwxrwxrwt 2 0 0 4096 May 8 14:35 tmp
systemd[1]: Started nginx.
Examples
On my GitHub https://github.com/stephan13360/systemd-services I have a few systemd services I built over time. Some of them, like nginx and php-fpm, now include TemporaryFileSystem, so go check them out.