Use TemporaryFileSystem to hide files or directories from systemd services

For decades, people used chroot to restrict the access a service has to the filesystem tree. chroot was never meant to be a security boundary, and there are many ways to break out of it. Implementations got better over time but were never good enough to be called a security feature. Containers like LXC and Docker were built on the same idea of creating a limited environment for a service. Containers are useful in many areas, but they add a lot of complexity if you just want to hide files and directories from your services. This is where systemd's TemporaryFileSystem option can shine.

TemporaryFileSystem

TemporaryFileSystem is a relatively new option, added in systemd 238. On Ubuntu LTS you need Ubuntu 20.04 or later. Before that you could use the RootDirectory option, systemd's implementation of chroot. TemporaryFileSystem mounts an empty tmpfs over each path in the space-separated list you pass it. You can then use BindReadOnlyPaths and BindPaths to mount parts of the original filesystem back on top of this tmpfs.

Example from the systemd docs: If a unit has the following,

TemporaryFileSystem=/var:ro
BindReadOnlyPaths=/var/lib/systemd

then the invoked processes by the unit cannot see any files or directories under /var/ except for /var/lib/systemd or its contents.

Hiding everything but the necessary minimum

Using TemporaryFileSystem=/:ro you can mount an empty tmpfs over the entire root filesystem, and the service will not see a single file. But it also cannot start, since its binary and everything else it needs are missing as well.

You need to mount everything necessary for the service to operate on top of this empty tmpfs. You don't have to think of every little detail, since many systemd options already include mounts for paths many services need to function.

RuntimeDirectory, LogsDirectory, CacheDirectory, ConfigurationDirectory, StateDirectory, and PrivateTmp mount the corresponding paths on top of the tmpfs and make them writable if needed. PrivateDevices mounts /dev/null, /dev/zero, /dev/random, and other important devices. ProtectControlGroups, ProtectKernelModules, and ProtectKernelTunables all imply MountAPIVFS, which mounts /sys, /proc, and /dev. MountAPIVFS mounts these directories writable, so use the three Protect options to make the content the service doesn't need to modify read-only again. The very recent systemd 247 also adds the option ProtectProc=invisible, which hides all processes except the service's own in /proc. If your systemd is not that current, I recommend setting the hidepid=2 option for the /proc mount in /etc/fstab.
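
As a rough sketch, the options from the previous paragraph could look like this in a unit file. The directory name nginx is only an illustration, and which of these options a given service actually needs is an assumption you have to verify yourself.

```ini
[Service]
# Start from an empty, read-only tmpfs as the root filesystem.
TemporaryFileSystem=/:ro

# Each of these mounts its directory into the sandbox.
# "nginx" is an illustrative name; the resulting paths are /run/nginx,
# /var/log/nginx, /var/cache/nginx, /etc/nginx, and /var/lib/nginx.
RuntimeDirectory=nginx
LogsDirectory=nginx
CacheDirectory=nginx
ConfigurationDirectory=nginx
StateDirectory=nginx
PrivateTmp=yes

# A minimal /dev with null, zero, random, and similar devices.
PrivateDevices=yes

# These imply MountAPIVFS and make parts of /sys and /proc read-only again.
ProtectControlGroups=yes
ProtectKernelModules=yes
ProtectKernelTunables=yes

# Requires systemd 247 or later.
ProtectProc=invisible
```

Unit files do not support trailing comments, so every comment sits on its own line.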

To find the paths most services need, I read the AppArmor base profile and used trial and error on a few services. I came up with the following paths that were needed by most services: BindReadOnlyPaths=/lib/ /lib64/ /usr/lib/ /usr/lib64/ /etc/ld.so.cache /etc/ld.so.conf /etc/ld.so.conf.d/ /etc/bindresvport.blacklist /usr/share/zoneinfo/ /usr/share/locale/ /etc/localtime /usr/share/common-licenses/ /etc/ssl/certs/ /etc/alternatives/. Not all services need every path, but I would say these are not security relevant, so I just always include them. The systemd docs also recommend (see Example 1) the following mounts to allow Type=notify and logging to journald: BindReadOnlyPaths=/dev/log /run/systemd/journal/socket /run/systemd/journal/stdout /run/systemd/notify.

Using BindReadOnlyPaths and BindPaths you can then mount additional directories and files the service needs, such as the service binary itself. For example, my nginx service includes the following additional paths: BindReadOnlyPaths=/usr/sbin/nginx /bin/kill /srv/ /run/php/. /srv/ is where my web apps are located.
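
Put together, the bind mounts described above might look like the following for the nginx example. This is only a sketch that repeats the paths listed in the text, grouped under comments of my own choosing. BindReadOnlyPaths may be given more than once, and the entries are combined.

```ini
[Service]
TemporaryFileSystem=/:ro

# Libraries and the dynamic loader configuration.
BindReadOnlyPaths=/lib/ /lib64/ /usr/lib/ /usr/lib64/
BindReadOnlyPaths=/etc/ld.so.cache /etc/ld.so.conf /etc/ld.so.conf.d/

# Time zone, locale, certificates, and other shared data.
BindReadOnlyPaths=/etc/bindresvport.blacklist /usr/share/zoneinfo/ /usr/share/locale/
BindReadOnlyPaths=/etc/localtime /usr/share/common-licenses/ /etc/ssl/certs/ /etc/alternatives/

# Logging to journald and Type=notify.
BindReadOnlyPaths=/dev/log /run/systemd/journal/socket /run/systemd/journal/stdout /run/systemd/notify

# Service-specific paths for nginx.
BindReadOnlyPaths=/usr/sbin/nginx /bin/kill /srv/ /run/php/
```

If a path does not exist on every machine, prefixing it with - (for example -/usr/lib64/) tells systemd to skip it instead of failing.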

Finding paths that are needed

Finding all files and directories that a service needs to function is very frustrating, because the error messages you get can be misleading. For example, if I don't include the library paths like /lib/ and try to start the nginx service, I get the following error message in journalctl:

systemd[1]: Starting nginx...
systemd[1365084]: nginx.service: Failed to execute command: No such file or directory
systemd[1365084]: nginx.service: Failed at step EXEC spawning /usr/sbin/nginx: No such file or directory
systemd[1]: nginx.service: Main process exited, code=exited, status=203/EXEC
systemd[1]: nginx.service: Failed with result 'exit-code'.
systemd[1]: Failed to start nginx.

It looks like /usr/sbin/nginx is missing even though it is included with BindReadOnlyPaths, but in reality /lib/x86_64-linux-gnu is missing. The kernel reports the same "No such file or directory" error when the dynamic loader a binary requests cannot be found. Once I had my basic list of needed libraries, finding service-specific dependencies got easier. Since most binaries could now start, even if they failed immediately afterwards, I at least got more informative error messages from the app itself. These error messages would complain about missing, inaccessible, or non-writable paths that I could then add. If the service crashed without an error message, I used strace to find the last path the service tried to access before crashing. To do this, I included strace in BindReadOnlyPaths and then added it at the beginning of ExecStart like this:

BindReadOnlyPaths=/usr/bin/strace
ExecStart=/usr/bin/strace -e trace=%%file /usr/sbin/nginx -c /etc/nginx/nginx.conf

On the command line you would type strace -e trace=%file with only one % sign, but in a unit file you need to escape it with a second one, or systemd will interpret it as a specifier. You can then use journalctl to see every file operation the service attempts before crashing.
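
If you would rather not edit the main unit file for a one-off debugging session, a drop-in can carry the same two lines. The sketch below reuses the command from above; the drop-in path in the comment is an assumption, not something taken from my setup.

```ini
# /etc/systemd/system/nginx.service.d/strace-debug.conf (hypothetical path)
[Service]
BindReadOnlyPaths=/usr/bin/strace
# An empty assignment clears the ExecStart= inherited from the main unit;
# otherwise systemd would see two commands.
ExecStart=
ExecStart=/usr/bin/strace -e trace=%%file /usr/sbin/nginx -c /etc/nginx/nginx.conf
```

Removing the drop-in and running systemctl daemon-reload restores the original ExecStart.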

To debug other failed starts, it can be helpful to enable systemd debug logging. This can only be set globally and applies to all services until you disable it.

sudo systemctl log-level debug

This can help you find other problems with starting the service, which might or might not be related to other options in your systemd service file. When you are done, sudo systemctl log-level info switches back to the default level.

Don't use ProtectSystem in conjunction with TemporaryFileSystem

While implementing TemporaryFileSystem I stumbled into a problem that took me some time to understand. I had been using ProtectSystem=strict to keep my service from writing to the filesystem. But if you use this together with TemporaryFileSystem, it mounts the entire filesystem on top of your empty tmpfs, and all files are accessible to the service again.
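
One way to keep the service from writing without ProtectSystem is to rely on the mounts themselves: the tmpfs is mounted with :ro, paths in BindReadOnlyPaths are read-only, and only paths listed in BindPaths stay writable. A minimal sketch, assuming a hypothetical writable directory /var/lib/example/:

```ini
[Service]
# No ProtectSystem= here.
TemporaryFileSystem=/:ro
# Read-only view of the web apps.
BindReadOnlyPaths=/srv/
# Hypothetical directory the service is allowed to write to.
BindPaths=/var/lib/example/
```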

Testing if it works

To see if hiding the filesystem worked, I run ls inside ExecStartPre and look at the journalctl output. For example, I temporarily add this to my nginx service:

BindReadOnlyPaths=/usr/bin/ls
ExecStartPre=/usr/bin/ls -l / /bin /dev /etc /run /var

And I see this in journalctl:

systemd[1]: Starting nginx...
ls[1378398]: /:
ls[1378398]: total 16
ls[1378398]: drwxr-xr-x   2 0 0   60 May  8 14:35 bin
ls[1378398]: drwxr-xr-x   7 0 0  400 May  8 14:35 dev
ls[1378398]: drwxr-xr-x   7 0 0  220 May  8 14:35 etc
ls[1378398]: drwxr-xr-x 102 0 0 4096 May  7 04:08 lib
ls[1378398]: drwxr-xr-x   2 0 0 4096 Apr 27 00:45 lib64
ls[1378398]: dr-xr-xr-x 639 0 0    0 May  8 14:35 proc
ls[1378398]: drwxr-xr-x   5 0 0  100 May  8 14:35 run
ls[1378398]: drwxr-xr-x   9 0 0 4096 Mar 22 21:32 srv
ls[1378398]: dr-xr-xr-x  13 0 0    0 Apr 28 12:27 sys
ls[1378398]: drwxrwxrwt   2 0 0 4096 May  8 14:35 tmp
ls[1378398]: drwxr-xr-x   7 0 0  140 May  8 14:35 usr
ls[1378398]: drwxr-xr-x   7 0 0  140 May  8 14:35 var
ls[1378398]: /bin:
ls[1378398]: total 32
ls[1378398]: -rwxr-xr-x 1 0 0 30952 Mar 24 19:51 kill
ls[1378398]: /dev:
ls[1378398]: total 0
ls[1378398]: drwxr-xr-x 2 0 0  180 May  8 14:35 char
ls[1378398]: lrwxrwxrwx 1 0 0   11 May  8 14:35 core -> /proc/kcore
ls[1378398]: lrwxrwxrwx 1 0 0   13 May  8 14:35 fd -> /proc/self/fd
ls[1378398]: crw-rw-rw- 1 0 0 1, 7 May  8 14:35 full
ls[1378398]: drwxr-xr-x 2 0 0    0 Apr 28 12:27 hugepages
ls[1378398]: lrwxrwxrwx 1 0 0   28 May  8 14:35 log -> /run/systemd/journal/dev-log
ls[1378398]: drwxrwxrwt 2 0 0   40 Apr 28 12:27 mqueue
ls[1378398]: crw-rw-rw- 1 0 0 1, 3 May  8 14:35 null
ls[1378398]: crw-rw-rw- 1 0 0 5, 2 May  8 14:35 ptmx
ls[1378398]: drwxr-xr-x 2 0 0    0 Apr 28 12:27 pts
ls[1378398]: crw-rw-rw- 1 0 0 1, 8 May  8 14:35 random
ls[1378398]: drwxrwxrwt 4 0 0  100 May  8 14:14 shm
ls[1378398]: lrwxrwxrwx 1 0 0   15 May  8 14:35 stderr -> /proc/self/fd/2
ls[1378398]: lrwxrwxrwx 1 0 0   15 May  8 14:35 stdin -> /proc/self/fd/0
ls[1378398]: lrwxrwxrwx 1 0 0   15 May  8 14:35 stdout -> /proc/self/fd/1
ls[1378398]: crw-rw-rw- 1 0 0 5, 0 May  8 14:35 tty
ls[1378398]: crw-rw-rw- 1 0 0 1, 9 May  8 14:35 urandom
ls[1378398]: crw-rw-rw- 1 0 0 1, 5 May  8 14:35 zero
ls[1378398]: /etc:
ls[1378398]: total 88
ls[1378398]: drwxr-xr-x 2    0    0 20480 Dec 11 09:03 alternatives
ls[1378398]: -rw-r--r-- 1    0    0   367 Apr 14  2020 bindresvport.blacklist
ls[1378398]: drwxr-xr-x 2 1001 1001  4096 May  1  2020 certificates
ls[1378398]: -rw-r--r-- 1    0    0 43222 May  7 04:10 ld.so.cache
ls[1378398]: -rw-r--r-- 1    0    0    34 Apr 14  2020 ld.so.conf
ls[1378398]: drwxr-xr-x 2    0    0  4096 Apr 27 00:46 ld.so.conf.d
ls[1378398]: -rw-r--r-- 1    0    0  2326 Jan 27 22:32 localtime
ls[1378398]: drwxr-xr-x 3    0    0  4096 Apr 14 00:22 nginx
ls[1378398]: drwxr-xr-x 3    0    0    60 May  8 14:35 ssl
ls[1378398]: /run:
ls[1378398]: total 0
ls[1378398]: drwxr-xr-x 2 113 120 40 May  8 14:35 nginx
ls[1378398]: drwxr-xr-x 2   0   0 80 Apr 28 12:30 php
ls[1378398]: drwxr-xr-x 3   0   0 80 May  8 14:35 systemd
ls[1378398]: /var:
ls[1378398]: total 4
ls[1378398]: drwxr-xr-x 3 0 0   60 May  8 14:35 cache
ls[1378398]: drwxr-xr-x 3 0 0   60 May  8 14:35 lib
ls[1378398]: drwxr-xr-x 3 0 0   60 May  8 14:35 log
ls[1378398]: drwxr-xr-x 2 0 0   80 May  8 14:35 opt
ls[1378398]: drwxrwxrwt 2 0 0 4096 May  8 14:35 tmp
systemd[1]: Started nginx.

Examples

On my GitHub (https://github.com/stephan13360/systemd-services) I have a few systemd services I built over time. Some of them, like nginx and php-fpm, now include TemporaryFileSystem, so go check them out.