Rootless podman best practice
Posted on Sat 13 May 2023 in misc
Podman is a daemonless container engine that runs OCI and Docker compatible containers.
A podman container can be executed in four ways:
- A rootful container (i.e., the container is started as the host user "root") with a container application running as root. With this, the container application has access to system resources comparable to any application running as root. An example is the busybox container.
- A rootful container with a container application running as an unprivileged user. In this case the application voluntarily relinquishes its root privileges. An example of this is the official MariaDB container.
- A rootless container with the application running as root. Within the container everything appears to run as root, but on the host the processes run as the user who started the container.
- A rootless container with an application running as an unprivileged user. The user ID and group ID of the container user are mapped to host IDs in a range assigned to each user in /etc/subuid and /etc/subgid.
This blog will focus on the last two use cases.
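The mapping that is in effect for a user can be inspected from inside the user namespace that podman sets up, for example with:
# each output line shows: first ID inside the container, first mapped ID on the host, size of the range
podman unshare cat /proc/self/uid_map
podman unshare cat /proc/self/gid_map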
Contrary to Docker, there is no system-wide daemon that spawns the containers; the container processes are independent entities. The standard procedure is to start pods and containers via systemd and, where rootless containers are concerned, to use "systemctl --user". To enable starting a rootless container at boot time, a user-specific systemd instance should be running. This is accomplished by issuing the command "loginctl enable-linger username" as root.
An important drawback of running rootless containers is that it is not possible to create data volumes on dedicated block storage (via "podman volume create") or to create your own private networks (note that there is a limited option for the latter, see "Netavark networking" below).
Starting containers in a pod
To enable inter-container communication, the containers should be run together in a pod. Network port forwarding from the host to containers can happen at the pod level as well.
Imagine that you have a WordPress and a MediaWiki environment, both running as containers on the same host. For this you create two pods, one containing the MediaWiki container and a MariaDB container and the other with a WordPress and a MariaDB container. In both cases, the database access is configured as TCP port 3306 on localhost. Each web application finds its own database because the pod confines the access.
Starting or stopping a pod starts or stops all containers assigned to the pod. Creating a pod with containers happens like this:
podman pod create --name podname [ -p hostport:containerport ] ...
podman create --pod podname --name container1 [ -v hostdirectory:mountpoint ] ... imagename[:tag] [args]
podman create --pod podname --name container2 [ -v hostdirectory:mountpoint ] ... imagename[:tag] [args]
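As a concrete sketch of the WordPress pod from the example above (the host port, volume paths and credential values are illustrative assumptions; the environment variables are those of the official wordpress and mariadb images):
podman pod create --name wordpress-pod -p 8080:80
podman create --pod wordpress-pod --name wp-db -v ./wp-db:/var/lib/mysql:Z \
    -e MARIADB_RANDOM_ROOT_PASSWORD=yes -e MARIADB_DATABASE=wordpress \
    -e MARIADB_USER=wordpress -e MARIADB_PASSWORD=changeme mariadb
podman create --pod wordpress-pod --name wp-app -v ./wp-data:/var/www/html:Z \
    -e WORDPRESS_DB_HOST=127.0.0.1 -e WORDPRESS_DB_USER=wordpress \
    -e WORDPRESS_DB_PASSWORD=changeme -e WORDPRESS_DB_NAME=wordpress wordpress
podman pod start wordpress-pod
WordPress reaches MariaDB on 127.0.0.1:3306 because both containers share the pod's network namespace, and "podman pod stop wordpress-pod" stops both containers again.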
It is a good idea to generate systemd files for the pods and containers so starting and stopping them can be delegated to systemd. That enables starting the pod and containers at boot time. Generating the files is done with:
podman generate systemd --files --name --new podname
Creating the service file for the pod creates service files for the relevant containers as well. The --files option writes the service files to disk instead of dumping them to STDOUT. --name uses the pod and container names instead of IDs in the generated files, so you don't need to generate new files whenever a new version of the container image is used. Only make sure that the image name and tag of the new version are the same as those of the previous image. --new causes the container to not only be stopped when systemctl stop is used but to actually be removed (comparable to using the podman rm -f command).
When the files are generated, they can be copied to $HOME/.config/systemd/user
and enabled with:
systemctl --user enable pod-podname container-container1 container-container2
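Note that systemd has to reload its unit files before the new services can be enabled, and enabling alone does not start the pod; a short sketch of the remaining steps:
systemctl --user daemon-reload
systemctl --user start pod-podname.service
systemctl --user status pod-podname.service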
Persistent storage
In Docker (and rootful podman) it is possible to assign block storage dedicated to Docker or podman (with the "docker volume create --opt device=xxx" command). In a rootless environment you cannot connect a device. Instead, a directory is created under .local:
[poduser@uxhost ~]$ ls .local/share/containers/storage
defaultNetworkBackend  networks            overlay-images  storage.lock
libpod                 overlayfs           overlay-layers  tmp
mounts                 overlay-containers  secrets         userns.lock
[poduser@uxhost ~]$ podman volume create MyVolume
MyVolume
[poduser@uxhost ~]$ ls .local/share/containers/storage/volumes/MyVolume
_data
[poduser@uxhost ~]$ podman run -it --rm -v MyVolume:/root/mydata alpine sh
/ # exit
[poduser@uxhost ~]$
It's also possible to mount another directory in a container. With that you can decide yourself where the data is placed (given proper write access). Do note that if SELinux restrictions are enforced, the file context for the mounted directory should be labeled as "container_file_t". You can delegate this to podman by appending ":z" to the mount definition. You can also opt for appending ":Z"; then access to the mount is further restricted via sVirt to only the relevant pod or container.
[poduser@uxhost ~]$ mkdir mariadb privateshare
[poduser@uxhost ~]$ ls -Zd mariadb privateshare
system_u:object_r:user_home_t:s0 mariadb
system_u:object_r:user_home_t:s0 privateshare
[poduser@uxhost ~]$ podman run -d --rm --name mariadb -e MARIADB_RANDOM_ROOT_PASSWORD=yes \
    -v ./mariadb:/var/lib/mysql:z -v ./privateshare:/privateshare:Z mariadb
b9a95df84be5f7feaf0b40ba2dd004eee4e4b83bd0fb2ad50692fab2a50bb06e
[poduser@uxhost ~]$ ls -Zd mariadb privateshare
system_u:object_r:container_file_t:s0 mariadb
system_u:object_r:container_file_t:s0:c119,c552 privateshare
[poduser@uxhost ~]$ ps -Zp $(pgrep mariadb)
LABEL                                         PID TTY          TIME CMD
system_u:system_r:container_t:s0:c119,c552 15827 ?        00:00:01 mariadbd
[poduser@uxhost ~]$
Files that are created as the root user or group within the container are owned on the host by the user starting the container or her primary group. Files that are created as an unprivileged user or non-root group in the container are mapped to a user or group within the /etc/subuid or /etc/subgid range of the user starting the container. Also make sure that the directory that is mounted in the container and its files are either owned by the user starting the container and her primary group, or by a uid or gid within that user's subuid/subgid range.
[poduser@uxhost ~]$ podman exec -it mariadb bash
root@b9a95df84be5:/# touch /privateshare/{rootfile,userfile}
root@b9a95df84be5:/# chown mysql:mysql /privateshare/userfile
root@b9a95df84be5:/# ls -l /privateshare
total 0
-rw-r--r--. 1 root  root  0 Feb 20 14:38 rootfile
-rw-r--r--. 1 mysql mysql 0 Feb 20 14:38 userfile
root@b9a95df84be5:/# id mysql
uid=999(mysql) gid=999(mysql) groups=999(mysql)
root@b9a95df84be5:/# exit
[poduser@uxhost ~]$ ls -l ./privateshare
total 0
-rw-r--r--. 1 poduser poduser 0 Feb 20 15:38 rootfile
-rw-r--r--. 1 494214  494214  0 Feb 20 15:38 userfile
[poduser@uxhost ~]$ grep poduser /etc/sub?id
/etc/subgid:poduser:493216:65536
/etc/subuid:poduser:493216:65536
[poduser@uxhost ~]$
Outside of the container the starting user does not have the right to modify files that are not owned by root in the container (like the "userfile" above). To enable this, the user has to temporarily become "container root" with the "podman unshare" command. Note that owners and groups are then mapped according to /etc/passwd and /etc/group from the host, not those of the container.
[poduser@uxhost ~]$ echo foo > ./privateshare/userfile
bash: permission denied: ./privateshare/userfile
[poduser@uxhost ~]$ podman unshare bash
[root@uxhost ~]# ls -l ./privateshare
total 0
-rw-r--r--. 1 root             root  0 Feb 20 15:38 rootfile
-rw-r--r--. 1 systemd-coredump input 0 Feb 20 15:38 userfile
[root@uxhost ~]# id systemd-coredump
uid=999(systemd-coredump) gid=997(systemd-coredump) groups=997(systemd-coredump)
[root@uxhost ~]# echo foo > ./privateshare/userfile
[root@uxhost ~]# cat ./privateshare/userfile
foo
[root@uxhost ~]#
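The same trick is useful to prepare a mounted directory for a container application that does not run as root inside the container; a hedged sketch, assuming the application runs as uid/gid 999 (as the mysql user in the MariaDB image does):
# chown inside the user namespace so container uid/gid 999 can write to the mount
podman unshare chown 999:999 ./privateshare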
Rootless nested containers
Rootless podman containers use fuse-overlayfs by default to store container images. This is centrally configured in /etc/containers/storage.conf. It has better performance than native VFS storage, but the drawback is that you can't run nested rootless containers (or e.g. K3S Kubernetes) because /dev/fuse is unavailable. For such a use case it's better to use VFS for container image storage, but it's not possible to mix fuse-overlayfs and VFS for different images by the same user. To see which storage is used (given that you have at least one container image present), just list the storage directory:
[poduser@uxhost ~]$ ls .local/share/containers/storage/
libpod  mounts  overlay  overlay-containers  overlay-images  overlay-layers  storage.lock  tmp  userns.lock
[poduser@uxhost ~]$
As the directory names imply, overlayfs is used. To switch to VFS, all containers and container images need to be deleted first. That can be accomplished with:
podman system reset
To use VFS given this clean state, create the file
$HOME/.config/containers/storage.conf
with the following content:
[storage]
driver = "vfs"
Now if you download or create an image it is stored as VFS:
[poduser@uxhost ~]$ ls .local/share/containers/storage/
libpod  mounts  storage.lock  tmp  userns.lock  vfs  vfs-containers  vfs-images  vfs-layers
[poduser@uxhost ~]$
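The active storage driver can also be queried directly; podman info reports it in its Store section:
podman info --format '{{.Store.GraphDriverName}}'
This prints "vfs" after the switch, or "overlay" while fuse-overlayfs is still in use.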
Netavark networking
With rootful containers it is possible to create your own network and give each container its own IP address in that range. There are use cases where that is preferable (e.g. a CI/CD pipeline where multiple similar containers are spawned, each of which listens on the same network port). That is also possible for rootless containers by using the "Netavark" network backend. For this (assuming Enterprise Linux) you install the netavark RPM. Installing that will also install the aardvark-dns RPM. There are a few issues with using a rootless network:
- The network is not available to the host, only to containers in the network.
- Containers in the network are not limited to other containers in the same pod but can connect to all other containers in the same network.
- Port publishing via the pod only works with CNI networks, so if a port of a container in a netavark network needs to be published to the host, it should be done on the container itself (see the sketch after this list).
- For every netavark network an aardvark-dns process is started so containers can resolve other containers by name. This only works for containers within the same netavark network, so not for the host and not for containers in other networks.
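A hedged sketch of such container-level port publishing, matching the CI/CD use case mentioned earlier (the container names are only examples; the rootlessnet network is created further below):
podman run -d --rm --network rootlessnet -p 8080:80 --name web1 nginx
podman run -d --rm --network rootlessnet -p 8081:80 --name web2 nginx
Both containers listen on port 80 inside the network, and each is published to the host on its own port.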
Each podman user has the same standard network available, named "podman". This network has a "cni" backend, which is used for forwarding exposed container ports at the pod level as explained in Starting containers in a pod. To enable networks with a "netavark" backend, the following configuration should be added to "$HOME/.config/containers/containers.conf":
[network]
network_backend = "netavark"
To use this, all existing containers should be stopped and removed. Creating a netavark network is done like this:
[poduser@uxhost ~]$ podman network create --subnet 172.16.1.0/24 --gateway 172.16.1.1 rootlessnet
rootlessnet
[poduser@uxhost ~]$ podman network ls
NETWORK ID    NAME         DRIVER
2f259bab93aa  podman       bridge
8a0786c87d36  rootlessnet  bridge
[poduser@uxhost ~]$
It is now possible to create containers in the new network and they can connect to each other using the container name:
[poduser@uxhost ~]$ podman run -d --rm --network rootlessnet --name host10 alpine sleep 86400
e4c70309bc2b020772c873716174d38d18aaa51e7618e788dd18fcc0f21174cf
[poduser@uxhost ~]$ podman run -d --rm --network rootlessnet --name host11 alpine sleep 86400
4ab2be743a68b3e4a645e0545532a5de09e0564635a7057a8275b3bacecf152e
[poduser@uxhost ~]$ podman exec -it host10 ping -c2 host11
PING host11 (172.16.1.4): 56 data bytes
64 bytes from 172.16.1.4: seq=0 ttl=64 time=0.044 ms
64 bytes from 172.16.1.4: seq=1 ttl=64 time=0.114 ms

--- host11 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.044/0.079/0.114 ms
[poduser@uxhost ~]$
There's one drawback to this approach: the netavark backend is now hard-wired and all new containers will use it whether that's preferred or not. It's better to leave the backend configuration empty, which enables a cni backend unless explicitly stated otherwise. This way it's still possible to contain exposed ports within a pod and not have them available to all containers in a network. To achieve this, use the following content for "$HOME/.config/containers/containers.conf":
[network]
network_backend = ""
It's now possible to combine containers with cni and netavark network backends in the same pod:
[poduser@uxhost ~]$ podman pod create --name mixed -p 8888:80
ea6215f74bfbed0c5db456af187c041e0fd2ffce4b3b1673a832eaa30e838b9e
[poduser@uxhost ~]$ podman pod start mixed
ea6215f74bfbed0c5db456af187c041e0fd2ffce4b3b1673a832eaa30e838b9e
[poduser@uxhost ~]$ podman run -d --rm --pod mixed --name nginx nginx
804044254b846b26e281925ea8b1bba076167afe2203cfadd4bdc0ec021abe5b
[poduser@uxhost ~]$ podman run -d --rm --pod mixed --name alpcni curl sleep 86400
587afafd81b2d2e6ad49ef518cf1396f99e6ba661ac9bfc9df53a85e2386e66f
[poduser@uxhost ~]$ podman run -d --rm --pod mixed --name alpavark --network rootlessnet curl sleep 43200
bba06b34f0b22efb1db5ae33c93a60100e2a94d81bb372691812eef94cd24a56
[poduser@uxhost ~]$ podman ps
CONTAINER ID  IMAGE                                     COMMAND               CREATED             STATUS                 PORTS                 NAMES
7e53fabf7e0a  localhost/podman-pause:4.2.0-1673519486                         About a minute ago  Up About a minute ago  0.0.0.0:8888->80/tcp  ea6215f74bfb-infra
804044254b84  localhost/nginx:latest                    nginx -g daemon o...  About a minute ago  Up About a minute ago  0.0.0.0:8888->80/tcp  nginx
587afafd81b2  localhost/curl:latest                     sleep 86400           24 seconds ago      Up 24 seconds ago      0.0.0.0:8888->80/tcp  alpcni
bba06b34f0b2  localhost/curl:latest                     sleep 43200           4 seconds ago       Up 4 seconds ago                             alpavark
The port forwarding in the list shows that the "nginx" and "alpcni" containers use the cni network and the "alpavark" container doesn't. The host can connect to the web server via the forwarded port, and the Alpine container in the cni network can connect to it via the exposed port (80), but the container in the netavark network can't reach the exposed port, only the forwarded port on the host:
[poduser@uxhost ~]$ curl -s http://localhost:8888 |grep title
<title>Welcome to nginx!</title>
[poduser@uxhost ~]$ podman exec -it alpcni curl -s http://localhost:80 |grep title
<title>Welcome to nginx!</title>
[poduser@uxhost ~]$ podman exec -it alpavark curl -s http://localhost:80 |grep title
[poduser@uxhost ~]$ podman exec -it alpavark curl -s http://localhost:8888 |grep title
[poduser@uxhost ~]$ ip a s enp1s0|grep ' inet '
    inet 10.24.1.128/24 brd 10.24.1.255 scope global noprefixroute enp1s0
[poduser@uxhost ~]$ podman exec -it alpavark curl -s http://10.24.1.128:8888 |grep title
<title>Welcome to nginx!</title>
[poduser@uxhost ~]$
(BTW, the "curl
" image is is just a standard Alpine image with the "curl"
tool added). In fact the advantage for a container with a netavark network to
be part of a pod is limited. About the only achievement is that the container
gets the same sVirt context as the pod and other containers in it while other
containers get their own sVirt context:
[poduser@uxhost ~]$ podman run -d --name nonpod --network rootlessnet curl sleep 22600
5a1d1d2ff4db7fb7e319f44b8516f9cbbb31263a6cd285e92acb4e9ec4921f3e
[poduser@uxhost ~]$ ps -Zu $(id -u)|grep :container_t:
system_u:system_r:container_t:s0:c125,c698 14196 ?        00:00:00 catatonit
system_u:system_r:container_t:s0:c125,c698 14227 ?        00:00:00 nginx
system_u:system_r:container_t:s0:c125,c698 14282 ?        00:00:00 sleep
system_u:system_r:container_t:s0:c125,c698 14391 ?        00:00:00 sleep
system_u:system_r:container_t:s0:c584,c600 14735 ?        00:00:00 sleep
[poduser@uxhost ~]$
With that, the container can access mounts that other containers in the same pod have configured as private storage with the ":Z" mount option.
Podman Secrets
Container secrets are available for rootless containers and work as usual. You can create a secret with the "podman secret create" command, and if you pass that secret to the container with the --secret argument, the secret is available in the /run/secrets directory of the container. This is a tmpfs filesystem, so even if you would use the podman commit command to create a new image of the running container, the secret would not be part of the image. Note that by default the "file" driver is used, which stores the secret base64-encoded (i.e. unencrypted) on the host filesystem, but if the container application is capable of using a password file, this is still better than passing the password in an environment variable.
[poduser@uxhost ~]$ echo 'D33pS3cr37'| podman secret create blogeg -
666eb268d629db07752367da4
[poduser@uxhost ~]$ grep 666eb268d629db07752367da4 .local/share/containers/storage/secrets/filedriver/secretsdata.json
  "666eb268d629db07752367da4": "RDMzcFMzY3IzNwo=",
[poduser@uxhost ~]$ echo RDMzcFMzY3IzNwo=|base64 -d
D33pS3cr37
[poduser@uxhost ~]$ podman run -it --rm --secret blogeg alpine sh
/ # cat /run/secrets/blogeg
D33pS3cr37
/ #
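As an example of the password-file approach mentioned above, the official MariaDB image can read its root password from a file via the MARIADB_ROOT_PASSWORD_FILE variable; a hedged sketch reusing the secret created above (container name and volume path are illustrative):
podman run -d --rm --name mariadb-secret -v ./mariadb-secret:/var/lib/mysql:Z \
    --secret blogeg -e MARIADB_ROOT_PASSWORD_FILE=/run/secrets/blogeg mariadb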
There are two more drivers available for podman secrets but they are still undocumented. One is the "pass" driver, which uses GPG in the same way as the GNU pass command, and the other is the "shell" driver, where you define shell scripts for the four operations list, lookup, store and delete. These can be specified in the containers.conf file or as --driver-opts flags at secret creation time.
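Which driver a given secret uses can be checked afterwards with the regular secret commands, for example:
podman secret ls
podman secret inspect blogeg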