Rootless podman best practice
Posted on Sat 13 May 2023 in misc
Podman is a daemonless container engine that runs OCI and Docker compatible containers.
A podman container can be executed in four ways:
- A rootful container (i.e., the container is started as the host user "root") with a container application running as root. With this, the container application has access to system resources comparable to any application running as root. An example is the busybox container.
- A rootful container with a container application running as an unprivileged user. In this case the application voluntarily relinquishes its root privileges. An example of this is the official MariaDB container.
- A rootless container with the application running as root. Within the container everything appears to run as root, but on the host the processes run as the user who started the container.
- A rootless container with an application running as an unprivileged user. The user ID and group ID of the container user are mapped to host IDs in a range assigned to each user in /etc/subuid and /etc/subgid.
This blog will focus on the last two use cases.
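The mapping that is in effect for a user can be inspected from inside the user namespace that podman sets up, for example with:
# each output line shows: first ID inside the container, first mapped ID on the host, size of the range
podman unshare cat /proc/self/uid_map
podman unshare cat /proc/self/gid_map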
Contrary to Docker, there is no system-wide daemon that spawns the containers; the container processes are independent entities. The standard procedure is to start pods and containers via systemd and, where rootless containers are concerned, to use "systemctl --user". To enable starting a rootless container at boot time, a user-specific systemd instance should be running. This is accomplished by issuing the command "loginctl enable-linger username" as root.
An important drawback of running rootless containers is that it is not possible to create data volumes on dedicated block storage (via "podman volume create") or to create your own private networks (note that there is a limited option for the latter, see "Netavark networking" below).
Starting containers in a pod
To enable inter-container communication, the containers should be run together in a pod. Network port forwarding from the host to containers can happen at the pod level as well.
Imagine that you have a WordPress and a MediaWiki environment, both running as containers on the same host. For this you create two pods, one containing the MediaWiki container and a MariaDB container and the other with a WordPress and a MariaDB container. In both cases, the database access is configured as TCP port 3306 on localhost. Each web application finds its own database because the pod confines the access.
Starting or stopping a pod starts or stops all containers assigned to the pod. Creating a pod with containers happens like this:
podman pod create --name podname [ -p hostport:containerport ] ...
podman create --pod podname --name container1 [ -v hostdirectory:mountpoint ] ... imagename[:tag] [args]
podman create --pod podname --name container2 [ -v hostdirectory:mountpoint ] ... imagename[:tag] [args]
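As a concrete sketch of the WordPress pod from the example above (the host port, volume paths and credential values are illustrative assumptions; the environment variables are those of the official wordpress and mariadb images):
podman pod create --name wordpress-pod -p 8080:80
podman create --pod wordpress-pod --name wp-db -v ./wp-db:/var/lib/mysql:Z \
    -e MARIADB_RANDOM_ROOT_PASSWORD=yes -e MARIADB_DATABASE=wordpress \
    -e MARIADB_USER=wordpress -e MARIADB_PASSWORD=changeme mariadb
podman create --pod wordpress-pod --name wp-app -v ./wp-data:/var/www/html:Z \
    -e WORDPRESS_DB_HOST=127.0.0.1 -e WORDPRESS_DB_USER=wordpress \
    -e WORDPRESS_DB_PASSWORD=changeme -e WORDPRESS_DB_NAME=wordpress wordpress
podman pod start wordpress-pod
WordPress reaches MariaDB on 127.0.0.1:3306 because both containers share the pod's network namespace, and "podman pod stop wordpress-pod" stops both containers again.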
It is a good idea to generate systemd files for the pods and containers so starting and stopping them can be delegated to systemd. That enables starting the pod and containers at boot time. Generating the files is done with:
podman generate systemd --files --name --new podname
Creating the service file for the pod creates service files for the relevant containers as well. The --files option writes the service files to disk instead of dumping them to STDOUT. --name uses the pod and container names instead of IDs in the generated files, so you don't need to generate new files whenever a new version of the container image is used. Only make sure that the image name and tag of the new version are the same as those of the previous image. --new causes the container to not only be stopped when systemctl stop is used but to actually be removed (comparable to using the podman rm -f command).
When the files are generated, they can be copied to $HOME/.config/systemd/user
and enabled with:
systemctl --user enable pod-podname container-container1 container-container2
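Note that systemd has to reload its unit files before the new services can be enabled, and enabling alone does not start the pod; a short sketch of the remaining steps:
systemctl --user daemon-reload
systemctl --user start pod-podname.service
systemctl --user status pod-podname.service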
Persistent storage
In Docker (and rootful podman) it is possible to assign block storage dedicated to Docker or podman (with the "docker volume create --opt device=xxx" command). In a rootless environment you cannot connect a device. Instead, a directory is created under .local:
[poduser@uxhost ~]$ ls .local/share/containers/storage
defaultNetworkBackend  networks            overlay-images  storage.lock
libpod                 overlayfs           overlay-layers  tmp
mounts                 overlay-containers  secrets         userns.lock
[poduser@uxhost ~]$ podman volume create MyVolume
MyVolume
[poduser@uxhost ~]$ ls .local/share/containers/storage/volumes/MyVolume
_data
[poduser@uxhost ~]$ podman run -it --rm -v MyVolume:/root/mydata alpine sh
/ # exit
[poduser@uxhost ~]$
It's also possible to mount another directory in a container. With that you can decide yourself where the data is placed (given proper write access). Do note that if SELinux restrictions are enforced, the file context for the mounted directory should be labeled as "container_file_t". You can delegate this to podman by appending ":z" to the mount definition. You can also opt for appending ":Z"; then access to the mount is further restricted via sVirt to only the relevant pod or container.
[poduser@uxhost ~]$ mkdir mariadb privateshare
[poduser@uxhost ~]$ ls -Zd mariadb privateshare
system_u:object_r:user_home_t:s0 mariadb
system_u:object_r:user_home_t:s0 privateshare
[poduser@uxhost ~]$ podman run -d --rm --name mariadb -e MARIADB_RANDOM_ROOT_PASSWORD=yes \
    -v ./mariadb:/var/lib/mysql:z -v ./privateshare:/privateshare:Z mariadb
b9a95df84be5f7feaf0b40ba2dd004eee4e4b83bd0fb2ad50692fab2a50bb06e
[poduser@uxhost ~]$ ls -Zd mariadb privateshare
system_u:object_r:container_file_t:s0 mariadb
system_u:object_r:container_file_t:s0:c119,c552 privateshare
[poduser@uxhost ~]$ ps -Zp $(pgrep mariadb)
LABEL                                         PID TTY          TIME CMD
system_u:system_r:container_t:s0:c119,c552 15827 ?        00:00:01 mariadbd
[poduser@uxhost ~]$
Files that are created as the root user or group within the container are owned on the host by the user starting the container or her primary group. Files that are created as an unprivileged user or non-root group in the container are mapped to a user or group within the /etc/subuid or /etc/subgid range of the user starting the container. Also make sure that the directory that is mounted in the container and its files are either owned by the user starting the container and her primary group, or by a uid or gid within that user's subuid/subgid range.
[poduser@uxhost ~]$ podman exec -it mariadb bash
root@b9a95df84be5:/# touch /privateshare/{rootfile,userfile}
root@b9a95df84be5:/# chown mysql:mysql /privateshare/userfile
root@b9a95df84be5:/# ls -l /privateshare
total 0
-rw-r--r--. 1 root  root  0 Feb 20 14:38 rootfile
-rw-r--r--. 1 mysql mysql 0 Feb 20 14:38 userfile
root@b9a95df84be5:/# id mysql
uid=999(mysql) gid=999(mysql) groups=999(mysql)
root@b9a95df84be5:/# exit
[poduser@uxhost ~]$ ls -l ./privateshare
total 0
-rw-r--r--. 1 poduser poduser 0 Feb 20 15:38 rootfile
-rw-r--r--. 1 494214  494214  0 Feb 20 15:38 userfile
[poduser@uxhost ~]$ grep poduser /etc/sub?id
/etc/subgid:poduser:493216:65536
/etc/subuid:poduser:493216:65536
[poduser@uxhost ~]$
Outside of the container the starting user does not have the right to modify files that are not owned by root in the container (like the "userfile" above). To enable this, the user has to temporarily become "container root" with the "podman unshare" command. Note that owners and groups are then mapped according to /etc/passwd and /etc/group from the host, not those of the container.
[poduser@uxhost ~]$ echo foo > ./privateshare/userfile
bash: permission denied: ./privateshare/userfile
[poduser@uxhost ~]$ podman unshare bash
[root@uxhost ~]# ls -l ./privateshare
total 0
-rw-r--r--. 1 root             root  0 Feb 20 15:38 rootfile
-rw-r--r--. 1 systemd-coredump input 0 Feb 20 15:38 userfile
[root@uxhost ~]# id systemd-coredump
uid=999(systemd-coredump) gid=997(systemd-coredump) groups=997(systemd-coredump)
[root@uxhost ~]# echo foo > ./privateshare/userfile
[root@uxhost ~]# cat ./privateshare/userfile
foo
[root@uxhost ~]#
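The same trick is useful to prepare a mounted directory for a container application that does not run as root inside the container; a hedged sketch, assuming the application runs as uid/gid 999 (as the mysql user in the MariaDB image does):
# chown inside the user namespace so container uid/gid 999 can write to the mount
podman unshare chown 999:999 ./privateshare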
Rootless nested containers
Rootless podman containers use fuse-overlayfs by default to store container images. This is centrally configured in /etc/containers/storage.conf. It has better performance than native VFS storage, but the drawback is that you can't run nested rootless containers (or e.g. K3S Kubernetes) because /dev/fuse is unavailable. For such a use case it's better to use VFS for container image storage, but it's not possible to mix fuse-overlayfs and VFS for different images by the same user. To see which storage is used (given that you have at least one container image present), just list the storage directory:
[poduser@uxhost ~]$ ls .local/share/containers/storage/
libpod  mounts  overlay  overlay-containers  overlay-images  overlay-layers  storage.lock  tmp  userns.lock
[poduser@uxhost ~]$
As the directory names imply, overlayfs is used. To switch to VFS, all containers and container images need to be deleted first. That can be accomplished with:
podman system reset
To use VFS given this clean state, create the file
$HOME/.config/containers/storage.conf
with the following content:
[storage]
driver = "vfs"
Now if you download or create an image it is stored as VFS:
[poduser@uxhost ~]$ ls .local/share/containers/storage/
libpod  mounts  storage.lock  tmp  userns.lock  vfs  vfs-containers  vfs-images  vfs-layers
[poduser@uxhost ~]$
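The active storage driver can also be queried directly; podman info reports it in its Store section:
podman info --format '{{.Store.GraphDriverName}}'
This prints "vfs" after the switch, or "overlay" while fuse-overlayfs is still in use.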
Netavark networking
With rootful containers it is possible to create your own network and give each container its own IP address in that range. There are use cases where that is preferable (e.g. a CI/CD pipeline where multiple similar containers are spawned, each of which listens on the same network port). That is also possible for rootless containers by using the "Netavark" network backend. For this (assuming Enterprise Linux) you install the netavark RPM. Installing that will also install the aardvark-dns RPM. There are a few issues with using a rootless network:
- The network is not available to the host, only to containers in the network.
- Containers in the network are not limited to other containers in the same pod but can connect to all other containers in the same network.
- Port publishing via the pod only works with CNI networks, so if a port of a container in a netavark network needs to be published to the host, it should be done on the container itself (see the sketch after this list).
- For every netavark network an aardvark-dns process is started so containers can resolve other containers by name. This only works for containers within the same netavark network, so not for the host and not for containers in other networks.
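A hedged sketch of such container-level port publishing, matching the CI/CD use case mentioned earlier (the container names are only examples; the rootlessnet network is created further below):
podman run -d --rm --network rootlessnet -p 8080:80 --name web1 nginx
podman run -d --rm --network rootlessnet -p 8081:80 --name web2 nginx
Both containers listen on port 80 inside the network, and each is published to the host on its own port.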
Each podman user has the same standard network available, named "podman". This network has a "cni" backend, which is used for forwarding exposed container ports at the pod level as explained in Starting containers in a pod. To enable networks with a "netavark" backend, the following configuration should be added to "$HOME/.config/containers/containers.conf":
[network]
network_backend = "netavark"
To use this, all existing containers should be stopped and removed. Creating a netavark network is done like this:
[poduser@uxhost ~]$ podman network create --subnet 172.16.1.0/24 --gateway 172.16.1.1 rootlessnet
rootlessnet
[poduser@uxhost ~]$ podman network ls
NETWORK ID    NAME         DRIVER
2f259bab93aa  podman       bridge
8a0786c87d36  rootlessnet  bridge
[poduser@uxhost ~]$
It is now possible to create containers in the new network and they can connect to each other using the container name:
[poduser@uxhost ~]$ podman run -d --rm --network rootlessnet --name host10 alpine sleep 86400
e4c70309bc2b020772c873716174d38d18aaa51e7618e788dd18fcc0f21174cf
[poduser@uxhost ~]$ podman run -d --rm --network rootlessnet --name host11 alpine sleep 86400
4ab2be743a68b3e4a645e0545532a5de09e0564635a7057a8275b3bacecf152e
[poduser@uxhost ~]$ podman exec -it host10 ping -c2 host11
PING host11 (172.16.1.4): 56 data bytes
64 bytes from 172.16.1.4: seq=0 ttl=64 time=0.044 ms
64 bytes from 172.16.1.4: seq=1 ttl=64 time=0.114 ms

--- host11 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.044/0.079/0.114 ms
[poduser@uxhost ~]$
There's one drawback to this approach: the netavark backend is now hard-wired and all new containers will use it whether that's preferred or not. It's better to leave the backend configuration empty, which enables a cni backend unless explicitly stated otherwise. This way it's still possible to contain exposed ports within a pod and not have them available to all containers in a network. To achieve this, use the following content for "$HOME/.config/containers/containers.conf":
[network]
network_backend = ""
It's now possible to combine containers with cni and netavark network backends in the same pod:
[poduser@uxhost ~]$ podman pod create --name mixed -p 8888:80
ea6215f74bfbed0c5db456af187c041e0fd2ffce4b3b1673a832eaa30e838b9e
[poduser@uxhost ~]$ podman pod start mixed
ea6215f74bfbed0c5db456af187c041e0fd2ffce4b3b1673a832eaa30e838b9e
[poduser@uxhost ~]$ podman run -d --rm --pod mixed --name nginx nginx
804044254b846b26e281925ea8b1bba076167afe2203cfadd4bdc0ec021abe5b
[poduser@uxhost ~]$ podman run -d --rm --pod mixed --name alpcni curl sleep 86400
587afafd81b2d2e6ad49ef518cf1396f99e6ba661ac9bfc9df53a85e2386e66f
[poduser@uxhost ~]$ podman run -d --rm --pod mixed --name alpavark --network rootlessnet curl sleep 43200
bba06b34f0b22efb1db5ae33c93a60100e2a94d81bb372691812eef94cd24a56
[poduser@uxhost ~]$ podman ps
CONTAINER ID  IMAGE                                     COMMAND               CREATED             STATUS                 PORTS                 NAMES
7e53fabf7e0a  localhost/podman-pause:4.2.0-1673519486                         About a minute ago  Up About a minute ago  0.0.0.0:8888->80/tcp  ea6215f74bfb-infra
804044254b84  localhost/nginx:latest                    nginx -g daemon o...  About a minute ago  Up About a minute ago  0.0.0.0:8888->80/tcp  nginx
587afafd81b2  localhost/curl:latest                     sleep 86400           24 seconds ago      Up 24 seconds ago      0.0.0.0:8888->80/tcp  alpcni
bba06b34f0b2  localhost/curl:latest                     sleep 43200           4 seconds ago       Up 4 seconds ago                             alpavark
The port forwarding in the list shows that the "nginx" and "alpcni" containers use the cni network and the "alpavark" container doesn't. The host can connect to the web server via the forwarded port, and the Alpine container in the cni network can connect to it via the exposed port (80), but the container in the netavark network can't reach the exposed port, only the forwarded port on the host:
[poduser@uxhost ~]$ curl -s http://localhost:8888 |grep title
<title>Welcome to nginx!</title>
[poduser@uxhost ~]$ podman exec -it alpcni curl -s http://localhost:80 |grep title
<title>Welcome to nginx!</title>
[poduser@uxhost ~]$ podman exec -it alpavark curl -s http://localhost:80 |grep title
[poduser@uxhost ~]$ podman exec -it alpavark curl -s http://localhost:8888 |grep title
[poduser@uxhost ~]$ ip a s enp1s0|grep ' inet '
    inet 10.24.1.128/24 brd 10.24.1.255 scope global noprefixroute enp1s0
[poduser@uxhost ~]$ podman exec -it alpavark curl -s http://10.24.1.128:8888 |grep title
<title>Welcome to nginx!</title>
[poduser@uxhost ~]$
(BTW, the "curl
" image is is just a standard Alpine image with the "curl"
tool added). In fact the advantage for a container with a netavark network to
be part of a pod is limited. About the only achievement is that the container
gets the same sVirt context as the pod and other containers in it while other
containers get their own sVirt context:
[poduser@uxhost ~]$ podman run -d --name nonpod --network rootlessnet curl sleep 22600
5a1d1d2ff4db7fb7e319f44b8516f9cbbb31263a6cd285e92acb4e9ec4921f3e
[poduser@uxhost ~]$ ps -Zu $(id -u)|grep :container_t:
system_u:system_r:container_t:s0:c125,c698 14196 ?        00:00:00 catatonit
system_u:system_r:container_t:s0:c125,c698 14227 ?        00:00:00 nginx
system_u:system_r:container_t:s0:c125,c698 14282 ?        00:00:00 sleep
system_u:system_r:container_t:s0:c125,c698 14391 ?        00:00:00 sleep
system_u:system_r:container_t:s0:c584,c600 14735 ?        00:00:00 sleep
[poduser@uxhost ~]$
With that, the container can access mounts that other containers in the same pod have configured as private storage with the ":Z" mount option.
Podman Secrets
Container secrets are available for rootless containers and work as usual. You can create a secret with the "podman secret create" command, and if you pass that secret to the container with the --secret argument, the secret is available in the /run/secrets directory of the container. This is a tmpfs filesystem, so even if you would use the podman commit command to create a new image of the running container, the secret would not be part of the image. Note that by default the "file" driver is used, which stores the secret base64-encoded (i.e. unencrypted) on the host filesystem, but if the container application is capable of using a password file, this is still better than passing the password in an environment variable.
[poduser@uxhost ~]$ echo 'D33pS3cr37'| podman secret create blogeg -
666eb268d629db07752367da4
[poduser@uxhost ~]$ grep 666eb268d629db07752367da4 .local/share/containers/storage/secrets/filedriver/secretsdata.json
  "666eb268d629db07752367da4": "RDMzcFMzY3IzNwo=",
[poduser@uxhost ~]$ echo RDMzcFMzY3IzNwo=|base64 -d
D33pS3cr37
[poduser@uxhost ~]$ podman run -it --rm --secret blogeg alpine sh
/ # cat /run/secrets/blogeg
D33pS3cr37
/ #
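As an example of the password-file approach mentioned above, the official MariaDB image can read its root password from a file via the MARIADB_ROOT_PASSWORD_FILE variable; a hedged sketch reusing the secret created above (container name and volume path are illustrative):
podman run -d --rm --name mariadb-secret -v ./mariadb-secret:/var/lib/mysql:Z \
    --secret blogeg -e MARIADB_ROOT_PASSWORD_FILE=/run/secrets/blogeg mariadb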
There are two more drivers available for podman secrets but they are still undocumented. One is the "pass" driver, which uses GPG in the same way as the GNU pass command, and the other is the "shell" driver, where you define shell scripts for the four operations list, lookup, store and delete. These can be specified in the containers.conf file or as --driver-opts flags at secret creation time.
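Which driver a given secret uses can be checked afterwards with the regular secret commands, for example:
podman secret ls
podman secret inspect blogeg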