Your Container Is Running as Root and That Should Terrify You
Picture this: your containerized nginx needs to bind to port 80. So you run it as root. Fine, right? It’s just a container. Isolated. Safe.
Except now that nginx process (the one that accepts raw internet traffic, parses untrusted input, and runs third-party modules) is one misconfiguration or container escape away from being able to load kernel modules, modify system time, bypass file permissions, and reboot your host. You handed your web server the master keys to the building because it needed to unlock one specific door.
This is the problem Linux capabilities solve. And once you understand them, running anything as full root starts to feel like a design smell.
What Are Linux Capabilities, Actually?
Traditionally, Unix had two privilege levels: root (UID 0, can do anything) and everyone else (can do almost nothing interesting). This binary model is fine until you need a process to do one privileged thing — like bind to a low-numbered port — without giving it everything else.
Linux capabilities break the monolithic root privilege into about 40 discrete units. Each capability grants a specific superpower. Your process can have exactly the capabilities it needs and nothing more.
Some capabilities you’ll encounter constantly:
| Capability | What It Allows |
|---|---|
| CAP_NET_BIND_SERVICE | Bind to ports below 1024 |
| CAP_NET_ADMIN | Configure network interfaces, routing, firewall rules |
| CAP_SYS_PTRACE | Trace arbitrary processes (debuggers, strace) |
| CAP_CHOWN | Change file ownership arbitrarily |
| CAP_DAC_OVERRIDE | Bypass file read/write/execute permission checks |
| CAP_SYS_TIME | Set system clock |
| CAP_KILL | Send signals to arbitrary processes |
| CAP_SYS_BOOT | Reboot the host or load a new kernel via kexec |
| CAP_SYS_MODULE | Load and unload kernel modules |
| CAP_SETUID | Switch to arbitrary UIDs (what lets sudo change users) |
| CAP_AUDIT_WRITE | Write to kernel audit log |
A process running as root has all of these by default. A non-root process has none of them by default. The interesting space is in between.
The Capability Sets
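Each process (strictly, each thread) maintains five sets of capabilities:
- Permitted: The upper limit on what the process can make effective. A capability dropped from this set cannot be regained without executing a privileged file
- Effective: The capabilities currently active (what the kernel actually checks)
- Inheritable: Capabilities preserved across execve. They only reach the new program's permitted set if the executed file also lists them as inheritable
- Bounding: Hard ceiling on what execve can grant. A capability missing from this set can never be gained from file capabilities, even if you ask nicely
- Ambient: Capabilities that survive execve of an ordinary, unprivileged binary, which is how a non-root process passes capabilities to its children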
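How the sets interact is easiest to see at execve time. The sketch below transcribes the transformation rules from capabilities(7) into Python, with each set modelled as a frozenset of names. `ProcCaps`, `FileCaps`, and `execve_caps` are illustrative names of my own, and the special cases for UID 0 and set-user-ID programs are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcCaps:
    permitted: frozenset = frozenset()
    effective: frozenset = frozenset()
    inheritable: frozenset = frozenset()
    bounding: frozenset = frozenset()
    ambient: frozenset = frozenset()


@dataclass(frozen=True)
class FileCaps:
    permitted: frozenset = frozenset()
    inheritable: frozenset = frozenset()
    effective: bool = False  # on files this is a single bit, not a set


def execve_caps(proc: ProcCaps, file: FileCaps) -> ProcCaps:
    """Apply the capabilities(7) execve rules for a non-root process."""
    # Executing a file that carries capabilities clears the ambient set.
    privileged_file = bool(file.permitted or file.inheritable or file.effective)
    ambient = frozenset() if privileged_file else proc.ambient

    from_file = (proc.inheritable & file.inheritable) | (file.permitted & proc.bounding)

    # With the effective bit set, the kernel refuses to run the program
    # unless every file-permitted capability was actually obtained.
    if file.effective and not file.permitted <= from_file:
        raise PermissionError("execve: EPERM")

    permitted = from_file | ambient
    effective = permitted if file.effective else ambient
    # Inheritable and bounding sets pass through unchanged.
    return ProcCaps(permitted, effective, proc.inheritable, proc.bounding, ambient)
```

Feeding it a file with `cap_net_raw` permitted and the effective bit set yields a process with exactly that capability, while an empty bounding set makes the same call raise `PermissionError`.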
You can inspect your current process’s capabilities:
cat /proc/self/status | grep Cap
You’ll see hex values like CapEff: 0000000000000000. You can decode these:
capsh --decode=0000003fffffffff
Or use the more human-friendly:
# Install the libcap tools (Debian/Ubuntu package name)
sudo apt install libcap2-bin
# Check capabilities of a running process
getpcaps <PID>
# Check capabilities of a binary
getcap /usr/bin/ping
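If capsh isn't installed, the hex masks are easy to decode by hand: bit N of the mask corresponds to capability number N in <linux/capability.h>. Here is a minimal Python sketch of that idea. The `decode_caps` helper is hypothetical, and the name table covers capabilities 0 through 40, so anything newer would need adding.

```python
from typing import List

# Capability names indexed by bit number, as in <linux/capability.h>.
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap",
    "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
    "cap_mac_override", "cap_mac_admin", "cap_syslog", "cap_wake_alarm",
    "cap_block_suspend", "cap_audit_read", "cap_perfmon", "cap_bpf",
    "cap_checkpoint_restore",
]


def decode_caps(hex_mask: str) -> List[str]:
    """Return the names of the capabilities whose bits are set in hex_mask."""
    mask = int(hex_mask, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1]


print(decode_caps("0000000000000400"))  # ['cap_net_bind_service']
```

Bits above the end of the table are silently ignored here; a more careful version would report them as unknown.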
Setting Capabilities on Binaries with setcap
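The classic example is ping. Old-school ping was setuid root because it needed raw socket access. Many distributions switched it to file capabilities instead:
getcap /usr/bin/ping
# /usr/bin/ping cap_net_raw=ep
# (older libcap prints "/usr/bin/ping = cap_net_raw+ep"; no output typically means your distro relies on unprivileged ICMP sockets instead)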
cap_net_raw=ep means: grant CAP_NET_RAW in the effective (e) and permitted (p) sets when this binary is executed. No root needed.
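There is no separate database behind getcap: file capabilities are stored in the file's security.capability extended attribute. As a rough sketch, assuming the `vfs_cap_data` layout from <linux/capability.h> (a little-endian magic word followed by 32-bit permitted/inheritable pairs), the attribute can be read and unpacked like this. `parse_file_caps` and `read_file_caps` are names I made up for illustration.

```python
import os
import struct
from typing import Optional

VFS_CAP_REVISION_MASK = 0xFF000000
VFS_CAP_FLAGS_EFFECTIVE = 0x000001
# Number of 32-bit (permitted, inheritable) pairs stored by each revision.
PAIRS_BY_REVISION = {0x01000000: 1, 0x02000000: 2, 0x03000000: 2}


def parse_file_caps(raw: bytes) -> dict:
    """Unpack a raw security.capability value into integer bitmasks."""
    (magic,) = struct.unpack_from("<I", raw, 0)
    pairs = PAIRS_BY_REVISION[magic & VFS_CAP_REVISION_MASK]
    permitted = inheritable = 0
    for i in range(pairs):
        low_p, low_i = struct.unpack_from("<II", raw, 4 + 8 * i)
        permitted |= low_p << (32 * i)
        inheritable |= low_i << (32 * i)
    return {
        "permitted": permitted,
        "inheritable": inheritable,
        "effective": bool(magic & VFS_CAP_FLAGS_EFFECTIVE),
    }


def read_file_caps(path: str) -> Optional[dict]:
    """Return the file capabilities of path, or None if it has none."""
    try:
        raw = os.getxattr(path, "security.capability")  # Linux only
    except OSError:
        return None  # attribute missing or xattrs unsupported
    return parse_file_caps(raw)
```

For a binary marked cap_net_raw=ep, you would expect `permitted` to have bit 13 set and `effective` to be True. Revision 3 attributes carry an extra root ID field for user namespaces, which this sketch ignores.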
You can set capabilities on your own binaries:
# Allow your Go binary to bind to port 443 without root
sudo setcap cap_net_bind_service=ep /usr/local/bin/myapp
# Remove all capabilities
sudo setcap -r /usr/local/bin/myapp
# Verify
getcap /usr/local/bin/myapp
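This is particularly useful for services that traditionally required root just for port binding. It works less well for interpreted languages: file capabilities attach to the executable, so for a Python web app you would have to set them on the interpreter itself:
sudo setcap cap_net_bind_service=ep /usr/bin/python3.11
That lets every Python script run by any user bind to low ports, and setcap refuses symlinks, so you must target the real binary rather than /usr/bin/python3. For interpreted apps, a better solution is to run on a high port and put a reverse proxy in front. But setcap is there when you need it.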
Capabilities in Docker: cap_add and cap_drop
Docker containers run with a default capability set that’s a subset of full root. It’s not terrible, but it’s not minimal either. The default set has 14 capabilities: CAP_NET_BIND_SERVICE, CAP_CHOWN, CAP_DAC_OVERRIDE, and CAP_FOWNER, plus ten others such as CAP_SETUID, CAP_SETGID, CAP_NET_RAW, and CAP_KILL.
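To make that default set concrete, here is a small Python sketch that turns it into the hex mask you would see in /proc. The capability list is my reading of Docker's documented defaults and may differ between engine versions, so treat it as an assumption. The bit numbers come from <linux/capability.h>.

```python
# Assumed Docker default capabilities, mapped to their kernel bit numbers.
DOCKER_DEFAULT_CAPS = {
    "CHOWN": 0, "DAC_OVERRIDE": 1, "FOWNER": 3, "FSETID": 4, "KILL": 5,
    "SETGID": 6, "SETUID": 7, "SETPCAP": 8, "NET_BIND_SERVICE": 10,
    "NET_RAW": 13, "SYS_CHROOT": 18, "MKNOD": 27, "AUDIT_WRITE": 29,
    "SETFCAP": 31,
}


def caps_to_mask(bits) -> int:
    """OR together one bit per capability number."""
    mask = 0
    for bit in bits:
        mask |= 1 << bit
    return mask


print(f"{caps_to_mask(DOCKER_DEFAULT_CAPS.values()):016x}")  # 00000000a80425fb
```

Compare that with a container started with only NET_BIND_SERVICE, which comes out as 0000000000000400: one bit instead of fourteen.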
The nuclear option that security-conscious folks recommend:
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE myimage
Drop everything. Add back only what you actually need. This is the correct approach. Real images often need a little more than one capability, as the nginx pattern below shows.
Common Docker Capability Patterns
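Web server that needs port 80 (the official nginx image starts as root, chowns its cache directories, then drops its workers to an unprivileged user, hence the three extras):
docker run \
  --cap-drop=ALL \
  --cap-add=NET_BIND_SERVICE \
  --cap-add=CHOWN \
  --cap-add=SETGID \
  --cap-add=SETUID \
  nginx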
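Application that needs to change file ownership (DAC_OVERRIDE is only needed if it must reach files its UID has no permission on; leave it out if you can, since it bypasses every file permission check):
docker run \
  --cap-drop=ALL \
  --cap-add=CHOWN \
  --cap-add=DAC_OVERRIDE \
  myapp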
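Debugging container that needs strace (SYS_PTRACE lets it attach to processes it did not start; seccomp=unconfined is only needed on older Docker releases whose default seccomp profile blocks ptrace):
docker run \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  myapp-debug
Note: capability names in Docker are conventionally written without the CAP_ prefix, though the flags accept both forms.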
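If you launch containers from scripts, the drop-all-then-add-back pattern is worth wrapping in a helper so nobody forgets the first half. A minimal Python sketch follows. `docker_run_args` is a hypothetical helper, and it only builds the argument list; pass the result to `subprocess.run` yourself.

```python
import shlex
from typing import List


def docker_run_args(image: str, caps: List[str]) -> List[str]:
    """Build a `docker run` argv that drops all capabilities, then adds `caps`."""
    args = ["docker", "run", "--cap-drop=ALL"]
    for cap in caps:
        name = cap.upper()
        if name.startswith("CAP_"):
            name = name[4:]  # normalise to Docker's prefix-free spelling
        args.append(f"--cap-add={name}")
    args.append(image)
    return args


print(shlex.join(docker_run_args("myapp", ["cap_net_bind_service", "CHOWN"])))
# docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE --cap-add=CHOWN myapp
```

Building a list rather than a shell string avoids quoting problems when the image name or capability list comes from configuration.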
In Docker Compose
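services:
  nginx:
    image: nginx:alpine
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE
      - CHOWN
      - SETGID
      - SETUID
    read_only: true
    # With a read-only root, nginx still needs somewhere to write its pid and temp files
    tmpfs:
      - /var/cache/nginx
      - /var/run
    security_opt:
      - no-new-privileges:true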
The no-new-privileges security option is worth adding alongside capability dropping — it prevents processes inside the container from gaining new privileges through setuid binaries or file capabilities.
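You can confirm the flag took effect from inside the container: since Linux 4.10, /proc/<pid>/status carries a NoNewPrivs field. The sketch below parses it from the status text. `no_new_privs` is an illustrative helper, not part of any library.

```python
from typing import Optional


def no_new_privs(status_text: str) -> Optional[bool]:
    """Return the NoNewPrivs bit from /proc/<pid>/status text, if present."""
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key == "NoNewPrivs":
            return value.strip() == "1"
    return None  # field missing, e.g. on kernels older than 4.10
```

Inside a container you would call it as `no_new_privs(open("/proc/1/status").read())` and expect True when the security option is active.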
Practical Dockerfile Examples
Rather than relying on runtime flags, you can bake the capability requirement into the image itself. The goal is to never run as root at all. Here the binary carries a file capability and an unprivileged user executes it:
FROM golang:1.22-alpine AS builder
WORKDIR /app
COPY . .
RUN go build -o server .
FROM alpine:3.19
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
COPY --from=builder /app/server /usr/local/bin/server
# Set capability so non-root user can bind to port 80
RUN apk add --no-cache libcap && \
    setcap cap_net_bind_service=ep /usr/local/bin/server
USER appuser
EXPOSE 80
CMD ["server"]
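Now your container runs as a non-root user but can still bind to port 80. No --cap-add is required at runtime as long as NET_BIND_SERVICE stays in the container’s bounding set, which it does by default. Two caveats: with --cap-drop=ALL you must still add NET_BIND_SERVICE back, or the kernel refuses to execute the binary, and no-new-privileges makes the kernel ignore file capabilities, so this pattern does not combine with that option.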
The “Just Use a High Port” Pattern
The even simpler approach: don’t fight the system. Run your app on port 8080, put Traefik or nginx in front on port 80:
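FROM node:20-alpine
RUN addgroup -S app && adduser -S app -G app
WORKDIR /app
# Copy the application code (server.js is expected at the root of the build context)
COPY --chown=app:app . .
USER app
EXPOSE 8080
CMD ["node", "server.js"]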
# docker-compose.yml
services:
  app:
    build: .
    cap_drop:
      - ALL
    # No cap_add needed - port 8080 doesn't require NET_BIND_SERVICE
  traefik:
    image: traefik:v3
    ports:
      - "80:80"
      - "443:443"
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE
Traefik handles port 80/443, needs NET_BIND_SERVICE. Your app handles port 8080, needs nothing. Clean separation.
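A quick way to see which side of the line a port falls on is to try binding it. This Python sketch (the `try_bind` helper is hypothetical) reports whether a bind succeeds, is refused for lack of privilege, or collides with another listener.

```python
import errno
import socket


def try_bind(port: int, host: str = "127.0.0.1") -> str:
    """Attempt to bind a TCP socket and classify the outcome."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind((host, port))
        except PermissionError:
            # EACCES: the port is privileged and we lack NET_BIND_SERVICE
            return "denied"
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return "in use"
            raise
    return "ok"
```

Run as an unprivileged user with no capabilities, `try_bind(8080)` should normally return "ok" while `try_bind(80)` returns "denied", unless the host has lowered the net.ipv4.ip_unprivileged_port_start sysctl.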
Reading Capability State from /proc
When debugging capability issues, /proc is your friend:
# Check capabilities of process with PID 1234
cat /proc/1234/status | grep -i cap
# Decode the hex values
# CapInh: inheritable set
# CapPrm: permitted set
# CapEff: effective set (what actually matters)
# CapBnd: bounding set
# CapAmb: ambient set
Inside a container, check what your process actually has:
docker exec mycontainer cat /proc/1/status | grep Cap
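If CapEff is all zeros, your process has no capabilities. If it’s 000001ffffffffff (or 0000003fffffffff on older kernels), every capability is set and you’re basically root. Docker’s default set shows up as 00000000a80425fb. Somewhere sensible between zero and that is the goal.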
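If you check this often, it is handy to pull the five Cap fields into integers so you can test individual bits. A minimal sketch, assuming the status text has already been read from /proc (for instance via docker exec). `parse_cap_fields` and `has_cap` are illustrative names.

```python
from typing import Dict

CAP_FIELDS = ("CapInh", "CapPrm", "CapEff", "CapBnd", "CapAmb")
CAP_NET_BIND_SERVICE = 10  # bit number from <linux/capability.h>


def parse_cap_fields(status_text: str) -> Dict[str, int]:
    """Extract the capability masks from /proc/<pid>/status text."""
    caps = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in CAP_FIELDS:
            caps[key] = int(value.strip(), 16)
    return caps


def has_cap(caps: Dict[str, int], bit: int, field: str = "CapEff") -> bool:
    """True if capability number `bit` is set in the given field."""
    return bool((caps[field] >> bit) & 1)
```

With a status dump from a default Docker container, `has_cap(caps, CAP_NET_BIND_SERVICE)` comes back True. After --cap-drop=ALL it should be False for every bit.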
Identifying What Capabilities Your App Actually Needs
The hard part isn’t dropping capabilities — it’s knowing which ones to keep. A few approaches:
Run with capabilities dropped and see which calls get refused:
# Use strace to find syscalls that fail for lack of privilege (-f follows child processes)
strace -f myapp 2>&1 | grep -E 'EPERM|EACCES'
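Use audit rules to catch processes changing their own capability sets (this logs capset(2) calls, not the kernel’s capability checks themselves; for those, a tracer such as capable from the BCC tools is a better fit):
auditctl -a always,exit -F arch=b64 -S capset -k capability_check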
Trial and error with minimal sets — start with --cap-drop=ALL, add capabilities back as your app fails, document why each is needed. Yes, this is tedious. Security usually is.
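The trial-and-error loop goes faster if you summarise the failures instead of eyeballing raw strace output. The sketch below counts syscalls that returned EPERM or EACCES. The regex assumes strace's default line format (optionally prefixed with a pid when -f is used), and `denied_syscalls` is a name I made up.

```python
import re
from collections import Counter

# Matches e.g.: bind(3, {...}, 16) = -1 EACCES (Permission denied)
# with an optional "[pid 123] " or "123 " prefix added by strace -f.
DENIED = re.compile(
    r"^(?:\[pid\s+\d+\]\s+)?(?:\d+\s+)?(\w+)\(.*=\s+-1\s+(EPERM|EACCES)\b"
)


def denied_syscalls(strace_output: str) -> Counter:
    """Count (syscall, errno) pairs that failed with a permission error."""
    hits = Counter()
    for line in strace_output.splitlines():
        match = DENIED.match(line)
        if match:
            hits[(match.group(1), match.group(2))] += 1
    return hits
```

A denied bind on port 80 shows up as `("bind", "EACCES")`, and a refused chown as `("chown", "EPERM")`. Mapping each syscall back to the capability that would allow it is still a job for capabilities(7) and your own judgment.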
The Broader Security Picture
Capabilities are one layer in a defense-in-depth strategy. They work alongside:
- Seccomp profiles: Restrict which syscalls a process can make
- AppArmor/SELinux: Mandatory access control for files and networks
- Namespaces: Process, network, and mount isolation
- User namespaces: Map container root to unprivileged host user
A container with --cap-drop=ALL but running as root in the default user namespace is still more dangerous than a container running as UID 1000. Combine all the layers.
The mental model: capabilities control what a process can do in terms of privileged operations. Seccomp controls which system calls it can invoke. AppArmor controls which files and network resources it can access. Use all three.
Quick Reference
# List all available capabilities
man capabilities
# Check capabilities of a binary
getcap /path/to/binary
# Set capability on a binary
sudo setcap cap_net_bind_service=ep /path/to/binary
# Remove capabilities from a binary
sudo setcap -r /path/to/binary
# Check process capabilities
getpcaps <PID>
cat /proc/<PID>/status | grep Cap
# Docker: drop all, add specific
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE myimage
# Docker Compose equivalent
# cap_drop: [ALL]
# cap_add: [NET_BIND_SERVICE]
The first time you successfully run a web server as non-root with zero capabilities except NET_BIND_SERVICE, you’ll feel unreasonably smug. That smugness is earned. You’ve correctly understood that “I need root” usually means “I need one specific thing that root can do” — and now you know how to give it exactly that, nothing more.