Container capabilities: a short tour

A while ago, for reasons that I no longer remember clearly, I spent some time investigating how different Docker container configurations affect the Linux capabilities that the resulting containers have. I find that advice around Docker security is often a bit muddled. A common example is the use of ‘privileged’ and ‘root’ as interchangeable terms, when the two choices have different implications for a container.

The topic as a whole is surprisingly complex, but I thought it might be useful to compare Linux capabilities across different container configurations. In short, I’m trying to answer the question: what Linux capabilities does a particular configuration running on Docker give you? I’m going to consider four combinations: a root container with the --privileged flag, a root container without it, a non-root container without it, and a non-root container with it (or with --cap-add).

While this list is not exhaustive, I think the comparison is useful for highlighting some interesting nuances of Docker security.

For simplicity, I’m not going to consider containerization engines other than Docker, and I’m also going to (mostly) ignore other ways of setting capabilities. I’m going to focus exclusively on capabilities, although both the --privileged flag and the root user have other security implications.

A root container running with the --privileged flag

To begin, I’m going to start at the highly privileged end of the scale, looking at a containerized process running as root, with the --privileged flag enabled. This is probably the most straightforward case. Running an image built from a Dockerfile like the following:

FROM ubuntu:latest  
# Set before the install step so that apt does not prompt during the build  
ENV DEBIAN_FRONTEND=noninteractive  
RUN apt update && apt -y -q install libcap2-bin iputils-ping python3 python3-pip  

with the --privileged flag (docker run -it --privileged my-root-container bash) puts us in a container with the following capability sets:

root@91d6878472c6:/# grep Cap /proc/$$/task/$$/status  
CapInh:	0000000000000000  
CapPrm:	000001ffffffffff  
CapEff:	000001ffffffffff  
CapBnd:	000001ffffffffff  
CapAmb:	0000000000000000  
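If you would rather read these fields from a script than with grep, the lines are easy to parse. The following is a minimal sketch: the function name parse_cap_status and the SAMPLE constant are my own, and the sample text is simply the output shown above.

```python
def parse_cap_status(status_text: str) -> dict[str, int]:
    """Parse the Cap* lines of /proc/<pid>/status into {field name: bitmask}."""
    caps = {}
    for line in status_text.splitlines():
        # Capability lines look like "CapEff:\t000001ffffffffff".
        if line.startswith("Cap"):
            name, _, value = line.partition(":")
            caps[name] = int(value.strip(), 16)
    return caps


# Sample text copied from the privileged root container above.
SAMPLE = """\
CapInh:\t0000000000000000
CapPrm:\t000001ffffffffff
CapEff:\t000001ffffffffff
CapBnd:\t000001ffffffffff
CapAmb:\t0000000000000000
"""

if __name__ == "__main__":
    for name, mask in parse_cap_status(SAMPLE).items():
        print(f"{name}: {mask:#x}")
```

Inside a container you could pass the contents of /proc/self/status instead of SAMPLE, for example open("/proc/self/status").read().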

The capabilities man page explains the algorithm used to calculate the Effective capability set from the various capability sets above (as well as what the sets mean).

Decoding the Effective set using capsh shows a pretty extensive list of capabilities:

root@91d6878472c6:/# capsh --decode=000001ffffffffff  
0x000001ffffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,cap_perfmon,cap_bpf,cap_checkpoint_restore  
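The decoding itself is just a bitmask lookup: bit N of the mask corresponds to capability number N. As a rough sketch of what capsh --decode is doing, here is a Python equivalent. The names CAP_NAMES and decode_caps are mine, and the table only covers the 41 capabilities that appear in the output above, so any higher bits are silently ignored.

```python
# Capability names in bit order (bit 0 = cap_chown), in the same order
# that capsh prints them in the output above.
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap",
    "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
    "cap_mac_override", "cap_mac_admin", "cap_syslog", "cap_wake_alarm",
    "cap_block_suspend", "cap_audit_read", "cap_perfmon", "cap_bpf",
    "cap_checkpoint_restore",
]


def decode_caps(mask: int) -> list[str]:
    """Return the names of the capabilities whose bits are set in mask."""
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


if __name__ == "__main__":
    print(",".join(decode_caps(0x000001FFFFFFFFFF)))
```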

This includes the highly privileged CAP_SYS_ADMIN capability, but there’s no real surprise here. A simple check using Python shows that we can, for example, set the process’s UID (using cap_setuid):

root@91d6878472c6:/# python3  
Python 3.12.3 (main, Nov  6 2024, 18:32:19) [GCC 13.2.0] on linux  
Type "help", "copyright", "credits" or "license" for more information.  
>>> import os  
>>> os.setuid(65534)  
>>> import pwd  
>>> print(pwd.getpwuid(os.getuid()))  
pwd.struct_passwd(pw_name='nobody', pw_passwd='x', pw_uid=65534, pw_gid=65534, pw_gecos='nobody', pw_dir='/nonexistent', pw_shell='/usr/sbin/nologin')  

A root container running without the --privileged flag

Dropping the --privileged flag grants, as expected, a smaller set of default capabilities:

root@4a54f2ed2f00:/# grep Cap /proc/$$/task/$$/status  
CapInh:	0000000000000000  
CapPrm:	00000000a80425fb  
CapEff:	00000000a80425fb  
CapBnd:	00000000a80425fb  
CapAmb:	0000000000000000  
root@4a54f2ed2f00:/# capsh --decode=00000000a80425fb  
0x00000000a80425fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap  

As with the --privileged example, these capabilities are already in the effective set, which means that processes that need the included capabilities will be able to run successfully (although you may need to use setcap or similar to set the required file capabilities on the applications you wish to execute – note that the list of effective process capabilities includes CAP_SETFCAP).

Because of the way capabilities are calculated, you are strictly limited by the Bounding set in this situation, so any further capabilities you need must be added via --cap-add when starting the container. Adding file-level capabilities that are outside the process’ bounding set and then attempting to execute those files will not succeed:

root@3c456e81068c:/# getcap /usr/bin/ping  
/usr/bin/ping cap_net_raw=ep  
root@3c456e81068c:/# setcap 'cap_net_raw=ep cap_sys_admin=ep' /usr/bin/ping  
root@3c456e81068c:/# getcap /usr/bin/ping  
/usr/bin/ping cap_net_raw,cap_sys_admin=ep  
root@3c456e81068c:/# ping  
bash: /usr/bin/ping: Operation not permitted  
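The failure above can be reproduced on paper with a single mask operation: any bit that is in the file's permitted set but not in the bounding set cannot be granted. A small sketch, assuming the capability numbers from linux/capability.h (CAP_NET_RAW is 13 and CAP_SYS_ADMIN is 21); the helper name outside_bounding is hypothetical.

```python
CAP_NET_RAW = 13    # capability numbers as defined in linux/capability.h
CAP_SYS_ADMIN = 21


def outside_bounding(file_permitted: int, bounding: int) -> int:
    """Return the file capabilities that the bounding set does not allow."""
    return file_permitted & ~bounding


# Bounding set of the non-privileged root container shown above.
bounding = 0x00000000A80425FB
# File capabilities after the setcap call: cap_net_raw and cap_sys_admin.
file_caps = (1 << CAP_NET_RAW) | (1 << CAP_SYS_ADMIN)

missing = outside_bounding(file_caps, bounding)
print(f"outside the bounding set: {missing:#x}")
```

The only bit left over is the one for cap_sys_admin. According to the capabilities man page, when a file has its effective bit set and the process cannot obtain the file's full permitted set, execve fails with EPERM, which matches the "Operation not permitted" error.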

This is all still pretty much as you would expect, although the default list of capabilities provides a relatively significant list of effective permissions that could have security implications depending on your threat model.

A non-root, non-privileged Docker container

What happens if you then switch away from the root user on a non-privileged container?

My Dockerfile looks like this:

FROM ubuntu:latest  
# Set before the install step so that apt does not prompt during the build  
ENV DEBIAN_FRONTEND=noninteractive  
RUN apt update && apt -y -q install libcap2-bin iputils-ping  
USER nobody  

Running the container and looking at the capability sets shows us the following:

nobody@f1bf44469f4a:/$ grep Cap /proc/$$/task/$$/status  
CapInh:	0000000000000000  
CapPrm:	0000000000000000  
CapEff:	0000000000000000  
CapBnd:	00000000a80425fb  
CapAmb:	0000000000000000  
nobody@8f013c859738:/$ capsh --decode=00000000a80425fb  
0x00000000a80425fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap  

Even though the process’ Effective and Permitted sets are empty, this does not mean that you can’t run processes that require capabilities. For example, ping, as installed by our Dockerfile, again carries CAP_NET_RAW as a file-level capability:

nobody@8f013c859738:/$ getcap /usr/bin/ping  
/usr/bin/ping cap_net_raw=ep  

and therefore you can run ping without issues:

nobody@8f013c859738:/$ ping -c 1 1.1.1.1  
PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.  
64 bytes from 1.1.1.1: icmp_seq=1 ttl=63 time=26.4 ms

--- 1.1.1.1 ping statistics ---  
1 packets transmitted, 1 received, 0% packet loss, time 0ms  
rtt min/avg/max/mdev = 26.355/26.355/26.355/0.000 ms  

This was initially unintuitive to me (as the Permitted and Effective capability sets are empty for the bash process that we examined), but it again comes down to the way in which the Permitted set for a new process is calculated:

P'(permitted) = (P(inheritable) & F(inheritable)) | (F(permitted) & cap_bset)  

That is, ping is permitted to use CAP_NET_RAW because CAP_NET_RAW is both in the bounding set (CapBnd above) and the permitted set for the /usr/bin/ping file.
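To make the formula concrete, here is a minimal Python sketch of the calculation for the ping case. It implements only the formula quoted above plus the rule, taken from the capabilities man page, that the new Effective set equals the new Permitted set when the file's effective bit is set. The ambient set is left out because it is empty in every example in this post, and the names caps_after_exec and ExecResult are my own.

```python
from typing import NamedTuple


class ExecResult(NamedTuple):
    permitted: int
    effective: int


def caps_after_exec(p_inheritable: int, f_inheritable: int,
                    f_permitted: int, f_effective_bit: bool,
                    bounding: int) -> ExecResult:
    """Apply the execve() capability transformation (ambient set assumed empty)."""
    permitted = (p_inheritable & f_inheritable) | (f_permitted & bounding)
    # With the file effective bit set ("=ep"), the new process starts with
    # everything in its permitted set already effective.
    effective = permitted if f_effective_bit else 0
    return ExecResult(permitted, effective)


CAP_NET_RAW = 13  # capability number from linux/capability.h

# bash running as nobody: empty inheritable set, default Docker bounding set.
# /usr/bin/ping: cap_net_raw=ep, no inheritable file capabilities.
result = caps_after_exec(
    p_inheritable=0,
    f_inheritable=0,
    f_permitted=1 << CAP_NET_RAW,
    f_effective_bit=True,
    bounding=0x00000000A80425FB,
)
print(f"permitted={result.permitted:#x} effective={result.effective:#x}")
```

Both resulting sets contain only the CAP_NET_RAW bit, even though the calling shell had no capabilities at all.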

If there’s one takeaway from this section, it should be that when using Docker containers, dropping capabilities (potentially at the file level, but ideally at the bounding set level) can be quite important in pruning the attack paths available to an attacker. Simply running as non-root and without the --privileged flag may not be sufficient, depending on your threat model.

For example, in a container started with docker run -it --cap-drop NET_RAW non-root bash:

nobody@f51647bac472:/$ capsh --print  
…  
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap  
…  
nobody@f51647bac472:/$ ping  
bash: /usr/bin/ping: Operation not permitted  

A non-root, privileged container (or a non-root container with --cap-add)

This was the edge case that initially drove me down this rabbit hole, and the Docker permissions model in this context is curious.

Using the non-root Docker image from the previous example, let’s also use the --privileged flag when running the container. My initial expectation was that the Permitted and Effective capability sets would be the same as in the root container, but that is not the case:

nobody@b802f123279c:/$ grep Cap /proc/$$/task/$$/status  
CapInh:	0000000000000000  
CapPrm:	0000000000000000  
CapEff:	0000000000000000  
CapBnd:	000001ffffffffff  
CapAmb:	0000000000000000  
nobody@b802f123279c:/$ capsh --decode=000001ffffffffff  
0x000001ffffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,cap_perfmon,cap_bpf,cap_checkpoint_restore  

So the Bounding set contains every capability, but the Permitted and Effective sets are completely empty. This effectively means that the capabilities you need have to be set via setcap on the application file you wish to execute, before the container starts running. This won’t work, for example:

nobody@418e021432ff:/$ python3  
Python 3.12.3 (main, Nov  6 2024, 18:32:19) [GCC 13.2.0] on linux  
Type "help", "copyright", "credits" or "license" for more information.  
>>> import os  
>>> os.setuid(0)  
Traceback (most recent call last):  
  File "<stdin>", line 1, in <module>  
PermissionError: [Errno 1] Operation not permitted  
>>> import prctl  
>>> prctl.cap_effective.setuid = True  
Traceback (most recent call last):  
…  
    return _prctl.set_caps(*_parse_caps(True, *args))  
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^  
PermissionError: [Errno 1] Operation not permitted  
>>>  

Instead, your Dockerfile needs to look like this (note the setcap step):

FROM ubuntu:latest  
# Set before the install step so that apt does not prompt during the build  
ENV DEBIAN_FRONTEND=noninteractive  
RUN apt update && apt -y -q install libcap2-bin iputils-ping python3 python3-prctl  
RUN setcap 'cap_setuid=ep' /usr/bin/python3.12  
USER nobody  

And then:

nobody@d31b4e5dc664:/$ python3  
…  
>>> import os  
>>> os.setuid(0)  
>>> os.getuid()  
0  

It’s non-intuitive to me that using the --privileged flag with a non-root user does not immediately grant the user all capabilities. The Docker docs state that “The --privileged flag gives all capabilities to the container.”, which is straightforwardly true for root users but not so much for other users. The --cap-add flag works similarly: it only changes the Bounding set rather than the Effective or Permitted sets, so using setcap is still required.
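One way to picture the gap between the two sets is to compute which capabilities the Bounding set allows but the process does not currently hold. The sketch below does that for the masks shown earlier in this post; "latent" is my own label for these capabilities rather than a kernel or Docker term, and the helper names are hypothetical.

```python
def latent_caps(effective: int, bounding: int) -> int:
    """Capabilities permitted by the bounding set but not currently effective."""
    return bounding & ~effective


def count_caps(mask: int) -> int:
    """Number of capability bits set in mask."""
    return bin(mask).count("1")


ALL_CAPS = 0x000001FFFFFFFFFF  # CapBnd of a --privileged container above

# Root with --privileged: everything in the bounding set is already effective.
root_privileged = latent_caps(effective=ALL_CAPS, bounding=ALL_CAPS)
# nobody with --privileged: nothing is effective, everything is latent.
nonroot_privileged = latent_caps(effective=0, bounding=ALL_CAPS)

print("root, privileged:    ", count_caps(root_privileged), "latent")
print("non-root, privileged:", count_caps(nonroot_privileged), "latent")
```

For the non-root user, every capability in the list is latent: none of them can be used until something grants it, for example a file capability applied with setcap as in the Dockerfile above.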

I’m not the first person to bump into this; there is more discussion of it elsewhere on the internet.

I’m not sure what to make of the security impact, other than that the behaviour is slightly unintuitive to me and is worth understanding when evaluating a container’s security posture.

Summary

Running through the various permutations of the root user and the --privileged flag was useful for me to understand a lot more about how capabilities work in Linux. In addition, there are a handful of things that stand out to me from this review:

*****
Written by Feroz Salam on 15 December 2024