Container Escapes 101 - What capabilities do I have?
Capabilities define what a process is allowed to do. Instead of an all-or-nothing choice between being root with every permission or a user with none, capabilities allow for granularity. They let you selectively grant a process scoped permissions, such as binding a service to a port below 1024 (CAP_NET_BIND_SERVICE) or reading the kernel’s audit log (CAP_AUDIT_READ). There are about 40 unique capabilities, which is far more than can be covered today. Some of these are far more powerful than others, so we’ll want to look for those.
Granting minimal permissions to each part of your containerized application is tricky. It requires folks to deeply understand what the app needs to do and how that translates to kernel capabilities. It’s tempting to just “give it everything” and move on, which is why we’ll talk more about CAP_SYS_ADMIN than many others.
How do I know my host capabilities from inside a guest container?
Let’s find out using another interesting property of processes. Processes need to know information about themselves, but may not need to know their own PID. They “see” a lot of information about themselves in /proc/self/, which is always a symlink to the /proc entry of whichever process is reading it. Taking a look at this directory within a container means we’re reading information about ourselves.
user@escapes:~$ docker run -it redhat/ubi9:9.6
[root@0d21143df54b /]# ls -lah /proc/self/
total 0
dr-xr-xr-x 9 root root 0 Jul 17 01:35 .
dr-xr-xr-x 162 root root 0 Jul 17 01:35 ..
dr-xr-xr-x 2 root root 0 Jul 17 01:35 attr
-rw-r--r-- 1 root root 0 Jul 17 01:35 autogroup
# # # and so much more # # #
-rw-r--r-- 1 root root 0 Jul 17 01:35 uid_map
-r--r--r-- 1 root root 0 Jul 17 01:35 wchan
But a caveat
What’s going on here??
Each time I list the symlink itself rather than its contents (no trailing /), I get a new number! 🤔
[root@0d21143df54b /]# ls -lah /proc/self
lrwxrwxrwx 1 root root 0 Jul 17 01:35 /proc/self -> 20
[root@0d21143df54b /]# ls -lah /proc/self
lrwxrwxrwx 1 root root 0 Jul 17 01:35 /proc/self -> 21
[root@0d21143df54b /]# ls -lah /proc/self
lrwxrwxrwx 1 root root 0 Jul 17 01:35 /proc/self -> 22
Each ls is a brand-new process with its own PID, so /proc/self points somewhere different on every run. From a shell, anything an external command reads under /proc/self/ describes that command, not the shell. In this last instance, ls ran as PID 22, so /proc/self/ held the same information as /proc/22/ - which is information about ls listing a directory.
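One way to see the same effect without ls is to compare the shell's own PID with what a child process finds behind /proc/self. This is a minimal sketch, assuming a Linux /proc and a POSIX shell with readlink available. The variable names are only for illustration.

```shell
# $$ expands to the PID of the shell itself and stays fixed for the session.
shell_pid=$$

# readlink runs as a separate process, so /proc/self resolves to *its* PID.
child_pid=$(readlink /proc/self)

echo "shell PID:             $shell_pid"
echo "/proc/self (readlink): $child_pid"
```

The two numbers should differ, and the second one should change on every run, just like the ls output above.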
Init funny?
To work around that, it’s alright to read the status of another process. Many containers with shells have an init system or the shell itself running as PID 1. This means we can inspect it instead to figure out our capabilities just as well.
[root@0d21143df54b /]# cat /proc/1/status | grep Cap
CapInh: 0000000000000000
CapPrm: 00000000a80425fb
CapEff: 00000000a80425fb
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
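If PID 1 is not a fair stand-in for your shell (for example, when you are not running as the same user), the shell can name itself explicitly. This is a small sketch, assuming a POSIX shell and awk are present in the image. cap_set_of is a hypothetical helper name, not something the image provides.

```shell
# cap_set_of FIELD PID: print the hex mask for one Cap* field of a process.
# The shell expands $$ before awk starts, so this reads the shell's entry,
# not the entry of the awk process doing the reading.
cap_set_of() {
  awk -v field="$1:" '$1 == field { print $2 }' "/proc/$2/status"
}

cap_set_of CapEff "$$"   # effective set of this shell
cap_set_of CapBnd 1      # bounding set of PID 1
```

Using /proc/$$/ sidesteps the moving /proc/self target, because the PID is fixed before any child process is created.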
There are lots of caps?
There are five values returned in that status output. They all mean different things. To learn more about this in depth, I strongly recommend the capabilities(7) man page of the Linux docs.
- CapInh - inheritable capabilities, the set this process can pass along when it executes another program
- CapPrm - permitted capabilities, the limiting superset of what this process may raise into its effective set
- CapEff - effective capabilities, what the kernel checks at runtime when you do a thing (eg, when you bind a network socket to a port below 1024, you’ll need CAP_NET_BIND_SERVICE in this set for this process)
- CapBnd - bounding capabilities, the upper limit on what the process can gain when it executes a program
- CapAmb - ambient capabilities, which are preserved when an unprivileged program is executed.
We’re most interested in the effective and ambient capabilities. I can rely on having whatever is in those sets right now, but anything that only appears in my permitted or bounding set has to be raised or gained before I can use it.
Making sense of that hexadecimal string
So … these values are expressed in hex and aren’t easy to read. Luckily, there’s the capsh utility to decode those.
[root@0d21143df54b /]# capsh --decode=00000000a80425fb
0x00000000a80425fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap
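Not every image ships capsh, so it can help to know what the decoding involves: each bit position in the mask stands for one capability. Below is a rough stand-in written for a POSIX shell. It assumes the bit order from the kernel's linux/capability.h header (bit 0 is cap_chown, bit 40 is cap_checkpoint_restore) and does not know about any capability added after that. CAP_NAMES and decode_caps are names made up for this sketch.

```shell
# Capability names in bit order, bit 0 first (assumed from linux/capability.h).
CAP_NAMES="cap_chown cap_dac_override cap_dac_read_search cap_fowner cap_fsetid
cap_kill cap_setgid cap_setuid cap_setpcap cap_linux_immutable
cap_net_bind_service cap_net_broadcast cap_net_admin cap_net_raw cap_ipc_lock
cap_ipc_owner cap_sys_module cap_sys_rawio cap_sys_chroot cap_sys_ptrace
cap_sys_pacct cap_sys_admin cap_sys_boot cap_sys_nice cap_sys_resource
cap_sys_time cap_sys_tty_config cap_mknod cap_lease cap_audit_write
cap_audit_control cap_setfcap cap_mac_override cap_mac_admin cap_syslog
cap_wake_alarm cap_block_suspend cap_audit_read cap_perfmon cap_bpf
cap_checkpoint_restore"

# decode_caps HEXMASK: print the comma-separated names of every set bit.
# The body runs in a subshell so its variables do not leak into the caller.
decode_caps() (
  mask=$(( 0x$1 ))
  bit=0
  out=""
  for name in $CAP_NAMES; do
    # Test the current bit, then move on to the next name and bit together.
    if [ $(( ($mask >> $bit) & 1 )) -eq 1 ]; then
      out="${out:+$out,}$name"
    fi
    bit=$(( bit + 1 ))
  done
  printf '%s\n' "$out"
)

decode_caps 00000000a80425fb
```

For the mask above this should print the same capability list that capsh reports, minus the leading 0x...= prefix.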
There are a lot of entries. The two that catch my eye for being reasonably privileged are cap_sys_chroot, for using chroot to move about the filesystem or switching mount namespaces with setns, and cap_net_raw, for opening raw and packet sockets or binding to any address for transparent proxying.
Exercise #1 - Now you try!
❓ What process is running as PID 1 in this redhat/ubi9:9.6 image?
hint
You're looking for the name of what's at PID 1.
example answer
[root@0d21143df54b /]# cat /proc/1/status | grep Name
Name: bash
❓ How does this change if you run the container with --privileged?
hint
$ docker run --privileged -it redhat/ubi9:9.6
example answer
[root@afe8703ef7f4 /]# cat /proc/self/status | grep Cap
CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
[root@afe8703ef7f4 /]# capsh --decode=000001ffffffffff
0x000001ffffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,cap_perfmon,cap_bpf,cap_checkpoint_restore
Wow, that's a lot more capabilities, including my favorite `cap_sys_admin` ... or basically root access.
❓ Now let’s change the runtime. Use podman and the same container image to compare capabilities in the process. How does it differ?
hint
user@escapes:~$ sudo apt install podman -y
user@escapes:~$ podman run -it docker.io/redhat/ubi9:9.6
example answer
[root@02ddecfe8555 /]# cat /proc/self/status | grep Cap
CapInh: 0000000000000000
CapPrm: 00000000800405fb
CapEff: 00000000800405fb
CapBnd: 00000000800405fb
CapAmb: 0000000000000000
[root@02ddecfe8555 /]# capsh --decode=00000000800405fb
0x00000000800405fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_sys_chroot,cap_setfcap
‼️ This means our identical container is now running with fewer privileges on the host. Specifically, it now lacks `cap_net_raw`, `cap_mknod`, and `cap_audit_write`.
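The comparison can also be done on the raw masks instead of eyeballing two decoded lists. This is a sketch for a POSIX shell. caps_only_in is a hypothetical helper, and mapping the resulting bit numbers to names relies on the kernel's linux/capability.h numbering.

```shell
# caps_only_in A B: print the bit numbers set in hex mask A but not in B.
caps_only_in() (
  delta=$(( 0x$1 & ~0x$2 ))   # keep only the bits that B is missing
  bit=0
  out=""
  while [ "$delta" -ne 0 ]; do
    if [ $(( $delta & 1 )) -eq 1 ]; then
      out="${out:+$out }$bit"
    fi
    delta=$(( $delta >> 1 ))
    bit=$(( bit + 1 ))
  done
  printf '%s\n' "$out"
)

# Docker's default mask versus Podman's, both taken from the output above.
caps_only_in 00000000a80425fb 00000000800405fb
```

This prints 13 27 29, which are the bit numbers for cap_net_raw, cap_mknod, and cap_audit_write. Swapping the arguments prints an empty line, since Podman grants nothing here that Docker does not.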
Why the difference?
Container runtimes provide easy-to-use wrappers around a bunch of kernel primitives. They’re not particularly special or isolated compared to other processes. The opinions of what permissions are needed and how to best provide these wrappers differ across runtimes.
What capabilities matter?
I mean … all of them matter in their own way? But some are more risky or allow more tasks than others. I try to look for the following:
- CAP_SYS_ADMIN means I’m more or less root. It gates mounting filesystems and a long list of other admin operations.
- CAP_SYS_CHROOT lets me change the root directory of the container, which is a common way to escape a container to read or write data outside of what’s explicitly allowed.
- CAP_NET_ADMIN or CAP_NET_RAW allow me to manipulate network interfaces or craft raw packets, which can be used to exfiltrate data or create new network connections.
- CAP_SYS_MODULE allows loading kernel modules, which can be used to escalate privileges or modify the kernel’s behavior.
- CAP_SYS_PTRACE allows tracing other processes, which can be used to read memory or manipulate other processes.
- CAP_SYS_BOOT lets me reboot the host or load a new kernel with kexec.
- CAP_SYSLOG allows privileged system logging access, which can be used to gather information about the host and other processes.
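To turn that shortlist into a quick check, the hex mask from /proc can be tested against just those bits. This is a sketch for a POSIX shell. RISKY_CAPS and risky_caps are made-up names, and the bit numbers are assumed from the kernel's linux/capability.h.

```shell
# name:bit pairs for the capabilities in the list above (bit numbers assumed
# from linux/capability.h).
RISKY_CAPS="cap_sys_admin:21 cap_sys_chroot:18 cap_net_admin:12 cap_net_raw:13
cap_sys_module:16 cap_sys_ptrace:19 cap_sys_boot:22 cap_syslog:34"

# risky_caps HEXMASK: print the shortlisted capabilities present in the mask.
risky_caps() (
  mask=$(( 0x$1 ))
  out=""
  for pair in $RISKY_CAPS; do
    name=${pair%%:*}   # text before the colon
    bit=${pair##*:}    # number after the colon
    if [ $(( ($mask >> $bit) & 1 )) -eq 1 ]; then
      out="${out:+$out }$name"
    fi
  done
  printf '%s\n' "$out"
)

risky_caps 00000000a80425fb   # Docker's default effective set from earlier
```

For Docker's default mask this prints cap_sys_chroot cap_net_raw, while the --privileged mask 000001ffffffffff lights up all eight.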
📚 tl;dr - knowing what we’re able to do means we can understand what naughty things are possible. This will be important as we look at planning our escapes. Next up … exploring seccomp profiles and filtering.
Back to the index.