SumGuy's Ramblings
Linux Capabilities: Drop Root Without Breaking Everything

Your Container Is Running as Root and That Should Terrify You

Picture this: your containerized nginx needs to bind to port 80. So you run it as root. Fine, right? It’s just a container. Isolated. Safe.

Except now that nginx process — the one that accepts raw internet traffic, parses untrusted input, and runs third-party modules — has the theoretical ability to load kernel modules, modify system time, bypass file permissions, and reboot your host if the container escape goes just right. You handed your web server the master keys to the building because it needed to unlock one specific door.

This is the problem Linux capabilities solve. And once you understand them, running anything as full root starts to feel like a design smell.

What Are Linux Capabilities, Actually?

Traditionally, Unix had two privilege levels: root (UID 0, can do anything) and everyone else (can do almost nothing interesting). This binary model is fine until you need a process to do one privileged thing — like bind to a low-numbered port — without giving it everything else.

Linux capabilities break the monolithic root privilege into about 40 discrete units. Each capability grants a specific superpower. Your process can have exactly the capabilities it needs and nothing more.

Some capabilities you’ll encounter constantly:

| Capability | What It Allows |
| --- | --- |
| CAP_NET_BIND_SERVICE | Bind to ports below 1024 |
| CAP_NET_ADMIN | Configure network interfaces, routing, firewall rules |
| CAP_SYS_PTRACE | Trace arbitrary processes (debuggers, strace) |
| CAP_CHOWN | Change file ownership arbitrarily |
| CAP_DAC_OVERRIDE | Bypass file read/write/execute permission checks |
| CAP_SYS_TIME | Set the system clock |
| CAP_KILL | Send signals to arbitrary processes |
| CAP_SYS_BOOT | Reboot, or load a new kernel via kexec (loading kernel modules is CAP_SYS_MODULE) |
| CAP_SETUID | Switch to arbitrary UIDs (what tools like sudo and su rely on) |
| CAP_AUDIT_WRITE | Write to the kernel audit log |

A process running as root has all of these by default. A non-root process has none of them by default. The interesting space is in between.
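Under the hood, each capability is just a numbered bit in a 64-bit mask, which is what makes "some but not all" possible. A minimal Python sketch, using the bit numbers from linux/capability.h for the capabilities in the table above (the helper name has_cap is made up for illustration):

```python
# Bit numbers from <linux/capability.h> for the capabilities listed above.
CAP_BITS = {
    "CAP_CHOWN": 0,
    "CAP_DAC_OVERRIDE": 1,
    "CAP_KILL": 5,
    "CAP_SETUID": 7,
    "CAP_NET_BIND_SERVICE": 10,
    "CAP_NET_ADMIN": 12,
    "CAP_SYS_PTRACE": 19,
    "CAP_SYS_BOOT": 22,
    "CAP_SYS_TIME": 25,
    "CAP_AUDIT_WRITE": 29,
}


def has_cap(mask: int, name: str) -> bool:
    """Return True if the capability's bit is set in the mask."""
    return bool((mask >> CAP_BITS[name]) & 1)


# A process holding only CAP_NET_BIND_SERVICE has exactly one bit set.
only_bind = 1 << CAP_BITS["CAP_NET_BIND_SERVICE"]
print(hex(only_bind))                               # 0x400
print(has_cap(only_bind, "CAP_NET_BIND_SERVICE"))   # True
print(has_cap(only_bind, "CAP_SYS_TIME"))           # False
```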

The Capability Sets

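Each process actually maintains five sets of capabilities:

Permitted (CapPrm): the ceiling on what the process may make effective.
Effective (CapEff): what the kernel actually checks when the process attempts a privileged operation.
Inheritable (CapInh): capabilities that can survive an execve() when the new binary's file capabilities also list them.
Bounding (CapBnd): a hard limit on what the process can ever gain through execve().
Ambient (CapAmb): capabilities kept across an execve() of binaries that carry no file capabilities of their own.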

You can inspect your current process’s capabilities:

cat /proc/self/status | grep Cap

You’ll see hex values like CapEff: 0000000000000000. You can decode these:

capsh --decode=0000003fffffffff
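If capsh isn't installed, the decoding is easy to approximate by hand. The following is a rough Python sketch of what capsh --decode does, assuming the capability numbering from linux/capability.h (41 entries as of Linux 5.9); decode_caps is a hypothetical helper name, not part of any library:

```python
# Capability names indexed by bit number, as defined in <linux/capability.h>.
# Newer kernels may append entries; unknown bits are reported by number.
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]


def decode_caps(hex_mask: str) -> list[str]:
    """Translate a CapEff-style hex string into capability names."""
    mask = int(hex_mask, 16)
    names = []
    bit = 0
    while mask >> bit:
        if (mask >> bit) & 1:
            known = bit < len(CAP_NAMES)
            names.append(CAP_NAMES[bit] if known else f"cap_{bit}")
        bit += 1
    return names


print(decode_caps("0000000000000400"))       # ['CAP_NET_BIND_SERVICE']
print(len(decode_caps("0000003fffffffff")))  # 38
```

The second call shows that 0000003fffffffff is simply "the first 38 capabilities, all switched on", which is what a fully privileged process looked like on kernels that predate the last three entries.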

Or use the more human-friendly:

# Install libcap2-bin (Debian/Ubuntu; provides capsh, getcap, setcap and getpcaps)
sudo apt install libcap2-bin

# Check capabilities of a running process
getpcaps <PID>

# Check capabilities of a binary
getcap /usr/bin/ping

Setting Capabilities on Binaries with setcap

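The classic example is ping. Old-school ping was setuid root because it needed raw socket access. Many distributions have since shipped it with a file capability instead (some newer ones grant it nothing at all and rely on the net.ipv4.ping_group_range sysctl, in which case getcap prints nothing):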

getcap /usr/bin/ping
# /usr/bin/ping cap_net_raw=ep

cap_net_raw=ep means: grant CAP_NET_RAW in the effective (e) and permitted (p) sets when this binary is executed. No root needed.
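The notation is compact enough to misread, so here is a small Python sketch that spells a clause out. It only handles a single clause with one = or + operator; the full libcap text format allows more, and parse_cap_clause is a made-up helper for illustration:

```python
def parse_cap_clause(clause: str) -> dict[str, set[str]]:
    """Expand a simple clause such as 'cap_net_raw=ep' into named sets.

    Returns {capability name: {set names}}. Only one operator per clause
    is supported here; real libcap syntax is richer.
    """
    flag_names = {"e": "effective", "i": "inheritable", "p": "permitted"}
    for op in ("=", "+"):
        if op in clause:
            caps, flags = clause.split(op, 1)
            break
    else:
        raise ValueError(f"no '=' or '+' operator in {clause!r}")

    unknown = set(flags) - set(flag_names)
    if unknown:
        raise ValueError(f"unknown flag letters: {sorted(unknown)}")

    wanted = {flag_names[f] for f in flags}
    return {cap.strip(): set(wanted) for cap in caps.split(",")}


print(parse_cap_clause("cap_net_raw=ep"))
```

For cap_net_raw=ep this yields one entry, cap_net_raw, mapped to the effective and permitted sets.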

You can set capabilities on your own binaries:

# Allow your Go binary to bind to port 443 without root
sudo setcap cap_net_bind_service=ep /usr/local/bin/myapp

# Remove all capabilities
sudo setcap -r /usr/local/bin/myapp

# Verify
getcap /usr/local/bin/myapp

This is particularly useful for services that traditionally required root just for port binding. For an interpreted app the capability has to go on the interpreter itself, which means every script run with that interpreter can bind low ports too. setcap also refuses symlinks, so point it at the real binary:

sudo setcap cap_net_bind_service=ep /usr/bin/python3.11

Though honestly, a better solution is to run on a high port and put a reverse proxy in front. But setcap is there when you need it.
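To check whether a capability change actually took effect, a quick probe from the process's point of view helps. This is a minimal Python sketch (try_bind is a hypothetical helper); it only distinguishes "permission denied" from success and lets other errors, such as the port already being in use, propagate:

```python
import socket


def try_bind(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if this process is allowed to bind a TCP socket to port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind((host, port))
        except PermissionError:
            # EACCES: the port is privileged and CAP_NET_BIND_SERVICE is missing.
            return False
        return True
```

Calling try_bind(80) as an ordinary user would typically return False, and True once the interpreter carries cap_net_bind_service. The result also depends on the net.ipv4.ip_unprivileged_port_start sysctl, so treat it as a probe rather than proof.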

Capabilities in Docker: cap_add and cap_drop

Docker containers run with a default capability set that’s a subset of full root. It’s not terrible, but it’s not minimal either. At the time of writing the default set has 14 entries, including CAP_NET_BIND_SERVICE, CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FOWNER, CAP_SETUID, CAP_SETGID, CAP_KILL, and CAP_NET_RAW.

The nuclear option that security-conscious folks recommend:

docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE myimage

Drop everything. Add back only what you actually need. This is the correct approach.

Common Docker Capability Patterns

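Web server that needs port 80 (the stock nginx image also chowns its cache directories and switches its workers to an unprivileged user, hence the extra three):

docker run \
  --cap-drop=ALL \
  --cap-add=NET_BIND_SERVICE \
  --cap-add=CHOWN \
  --cap-add=SETGID \
  --cap-add=SETUID \
  nginx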

Application that needs to change file ownership:

docker run \
  --cap-drop=ALL \
  --cap-add=CHOWN \
  --cap-add=DAC_OVERRIDE \
  myapp

Debugging container that needs strace:

docker run \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  myapp-debug

Note: capability names in Docker drop the CAP_ prefix.
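If you generate docker run commands from a script, the prefix handling and the drop-everything-first rule are easy to encode once. A small Python sketch, where docker_cap_args is a hypothetical helper and myimage is a placeholder image name:

```python
def docker_cap_args(needed: list[str]) -> list[str]:
    """Build --cap-drop/--cap-add flags for a least-privilege docker run."""
    # Normalise to Docker's convention: upper case, no CAP_ prefix, no duplicates.
    shorts = sorted({name.upper().removeprefix("CAP_") for name in needed})
    return ["--cap-drop=ALL"] + [f"--cap-add={short}" for short in shorts]


cmd = ["docker", "run", *docker_cap_args(["CAP_NET_BIND_SERVICE"]), "myimage"]
print(" ".join(cmd))
# docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE myimage
```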

In Docker Compose

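services:
  nginx:
    image: nginx:alpine
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE
      - CHOWN   # the stock image chowns its cache directories at startup
      - SETGID
      - SETUID
    read_only: true
    tmpfs:      # writable scratch space nginx still needs (paths assumed for the stock image)
      - /var/cache/nginx
      - /run
    security_opt:
      - no-new-privileges:true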

The no-new-privileges security option is worth adding alongside capability dropping — it prevents processes inside the container from gaining new privileges through setuid binaries or file capabilities.
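You can confirm the flag is actually set from inside the container, because the kernel exposes it as the NoNewPrivs field in /proc/<pid>/status (Linux 4.10 and later). A minimal Python sketch, with no_new_privs as a made-up helper name and a hand-written sample in place of a real status file:

```python
from typing import Optional


def no_new_privs(status_text: str) -> Optional[bool]:
    """Return the NoNewPrivs flag from /proc/<pid>/status text, or None if absent."""
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key == "NoNewPrivs":
            return value.strip() == "1"
    return None  # older kernels do not expose the field


# Hand-written sample; on a real system read /proc/1/status instead.
sample = "Name:\tnginx\nNoNewPrivs:\t1\nSeccomp:\t2\n"
print(no_new_privs(sample))  # True
```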

Practical Dockerfile Examples

Rather than relying on runtime flags, you can bake capability requirements into the image itself. The goal is to start as a non-root user that holds only the one capability it needs:

FROM golang:1.22-alpine AS builder
WORKDIR /app
COPY . .
RUN go build -o server .

FROM alpine:3.19
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
COPY --from=builder /app/server /usr/local/bin/server

# Set capability so non-root user can bind to port 80
RUN apk add --no-cache libcap && \
    setcap cap_net_bind_service=ep /usr/local/bin/server

USER appuser
EXPOSE 80
CMD ["server"]

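Now your container runs as a non-root user but can still bind to port 80. No --cap-add is required at runtime, provided NET_BIND_SERVICE stays in the container's bounding set, which it does under Docker's defaults. Two combinations break this: --cap-drop=ALL without adding NET_BIND_SERVICE back makes the binary fail to start, and no-new-privileges tells the kernel to ignore file capabilities.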
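Why this works comes down to the rule the kernel applies at execve(), documented in capabilities(7). The Python sketch below models that rule for a non-root process. It is a simplification: it ignores the special cases for UID 0, setuid binaries, and no-new-privileges, and the names ProcCaps and caps_after_exec are invented for illustration.

```python
from dataclasses import dataclass

NET_BIND = 1 << 10  # CAP_NET_BIND_SERVICE


@dataclass
class ProcCaps:
    inheritable: int = 0
    permitted: int = 0
    effective: int = 0
    bounding: int = 0
    ambient: int = 0


def caps_after_exec(p: ProcCaps, f_permitted: int,
                    f_inheritable: int = 0, f_effective: bool = False) -> ProcCaps:
    """Model the capability transformation for a non-root execve()."""
    file_privileged = bool(f_permitted or f_inheritable)
    # Ambient capabilities are cleared when the file has its own capabilities.
    ambient = 0 if file_privileged else p.ambient
    permitted = ((p.inheritable & f_inheritable)
                 | (f_permitted & p.bounding)
                 | ambient)
    # A binary with the effective bit must receive everything it asked for.
    if f_effective and (f_permitted & ~permitted):
        raise PermissionError("execve would fail with EPERM")
    effective = permitted if f_effective else ambient
    return ProcCaps(p.inheritable, permitted, effective, p.bounding, ambient)


# appuser under Docker's default bounding set: the file capability is granted.
default = ProcCaps(bounding=0xA80425FB)
print(hex(caps_after_exec(default, NET_BIND, f_effective=True).effective))  # 0x400
```

Passing ProcCaps(bounding=0), which models --cap-drop=ALL with nothing added back, makes the same call raise PermissionError.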

The “Just Use a High Port” Pattern

The even simpler approach: don’t fight the system. Run your app on port 8080, put Traefik or nginx in front on port 80:

FROM node:20-alpine
RUN addgroup -S app && adduser -S app -G app
WORKDIR /app
COPY . .
USER app
EXPOSE 8080
CMD ["node", "server.js"]

# docker-compose.yml
services:
  app:
    build: .
    cap_drop:
      - ALL
    # No cap_add needed - port 8080 doesn't require NET_BIND_SERVICE

  traefik:
    image: traefik:v3
    ports:
      - "80:80"
      - "443:443"
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE

Traefik handles port 80/443, needs NET_BIND_SERVICE. Your app handles port 8080, needs nothing. Clean separation.

Reading Capability State from /proc

When debugging capability issues, /proc is your friend:

# Check capabilities of process with PID 1234
cat /proc/1234/status | grep -i cap

# Decode the hex values
# CapInh: inheritable set
# CapPrm: permitted set  
# CapEff: effective set (what actually matters)
# CapBnd: bounding set
# CapAmb: ambient set

Inside a container, check what your process actually has:

docker exec mycontainer cat /proc/1/status | grep Cap

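If CapEff is all zeros, your process has no capabilities. If it’s 000001ffffffffff or similar (every capability bit the kernel knows about), you’re basically root. Docker’s default for a root process, 00000000a80425fb, sits in between; somewhere sensible and smaller is the goal.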
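For scripted checks, the Cap* lines are trivial to parse. A short Python sketch, using a hand-written sample that mirrors what a root process in a default Docker container typically reports (read_cap_sets is a hypothetical helper name):

```python
def read_cap_sets(status_text: str) -> dict[str, int]:
    """Extract the Cap* bitmasks from /proc/<pid>/status text."""
    sets = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key.startswith("Cap"):
            sets[key] = int(value.strip(), 16)
    return sets


# Hand-written sample; on a real system read /proc/<pid>/status instead.
sample = """\
Name:\tnginx
CapInh:\t0000000000000000
CapPrm:\t00000000a80425fb
CapEff:\t00000000a80425fb
CapBnd:\t00000000a80425fb
CapAmb:\t0000000000000000
"""
caps = read_cap_sets(sample)
print(bin(caps["CapEff"]).count("1"))  # 14 capabilities in the effective set
```

Counting set bits is a crude but useful health metric: zero is ideal, a handful is normal, and around forty means the process is effectively root.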

Identifying What Capabilities Your App Actually Needs

The hard part isn’t dropping capabilities — it’s knowing which ones to keep. A few approaches:

Run with a reduced set and see what the kernel refuses:

# Follow child processes (-f) and show only the calls that were denied
strace -f myapp 2>&1 | grep -E "EPERM|EACCES"
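Raw strace output gets noisy fast, so it can help to boil it down to "which syscalls were refused, and how often". A rough Python sketch, assuming strace's usual one-call-per-line format; the sample lines are hand-written and denied_syscalls is a made-up helper:

```python
import re
from collections import Counter

# Matches lines such as:
#   bind(3, {...}, 16) = -1 EACCES (Permission denied)
# optionally prefixed by "[pid 123] " when strace runs with -f.
DENIED = re.compile(r"^(?:\[pid\s+\d+\]\s+)?(\w+)\(.*=\s-1\s(EPERM|EACCES)\b")


def denied_syscalls(strace_output: str) -> Counter:
    """Count (syscall, errno) pairs that failed for lack of permission."""
    counts = Counter()
    for line in strace_output.splitlines():
        match = DENIED.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


sample = '''\
openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = 3
bind(3, {sa_family=AF_INET, sin_port=htons(80)}, 16) = -1 EACCES (Permission denied)
[pid   42] chown("/var/cache/app", 101, 101) = -1 EPERM (Operation not permitted)
'''
for (syscall, errno_name), count in denied_syscalls(sample).items():
    print(syscall, errno_name, count)
```

A denied bind() on a low port points at NET_BIND_SERVICE and a denied chown() at CHOWN; mapping syscalls to capabilities is still a manual step with capabilities(7) open in another tab.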

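Use audit rules to log processes that change their own capability sets (this records capset() calls, not every capability check):

auditctl -a always,exit -F arch=b64 -S capset -k capability_check

To trace the checks themselves, the capable tool from the bcc collection hooks the kernel's cap_capable() function.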

Trial and error with minimal sets — start with --cap-drop=ALL, add capabilities back as your app fails, document why each is needed. Yes, this is tedious. Security usually is.
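The tedium can be partly automated. The sketch below works from the other direction than the manual process just described: it starts from a candidate list that is known to work and greedily removes anything the app survives without. The runs_ok callback is hypothetical; in practice it would start the container with the given capabilities and run a health check.

```python
from typing import Callable, Iterable


def minimal_caps(candidates: Iterable[str],
                 runs_ok: Callable[[frozenset], bool]) -> list[str]:
    """Greedily drop each capability the app can live without."""
    kept = list(dict.fromkeys(candidates))  # de-duplicate, keep order
    if not runs_ok(frozenset(kept)):
        raise ValueError("app fails even with every candidate capability")
    for cap in list(kept):
        trial = [c for c in kept if c != cap]
        if runs_ok(frozenset(trial)):
            kept = trial  # still works without it, so leave it out
    return kept


# Stand-in for a real container test: pretend the app needs exactly these two.
def fake_app(caps: frozenset) -> bool:
    return {"NET_BIND_SERVICE", "SETUID"} <= caps


print(minimal_caps(["CHOWN", "NET_BIND_SERVICE", "SETGID", "SETUID"], fake_app))
# ['NET_BIND_SERVICE', 'SETUID']
```

Each candidate costs one test run, and the result is only as trustworthy as the health check, so the "document why each is needed" step still applies.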

The Broader Security Picture

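Capabilities are one layer in a defense-in-depth strategy. They work alongside non-root users and user namespaces, seccomp profiles, AppArmor, read-only root filesystems, and no-new-privileges.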

A container with --cap-drop=ALL but running as root in the default user namespace is still more dangerous than a container running as UID 1000. Combine all the layers.

The mental model: capabilities control what a process can do in terms of privileged operations. Seccomp controls which system calls it can invoke. AppArmor controls which files and network resources it can access. Use all three.

Quick Reference

# List all available capabilities
man capabilities

# Check capabilities of a binary
getcap /path/to/binary

# Set capability on a binary
sudo setcap cap_net_bind_service=ep /path/to/binary

# Remove capabilities from a binary
sudo setcap -r /path/to/binary

# Check process capabilities
getpcaps <PID>
cat /proc/<PID>/status | grep Cap

# Docker: drop all, add specific
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE myimage

# Docker Compose equivalent
# cap_drop: [ALL]
# cap_add: [NET_BIND_SERVICE]

The first time you successfully run a web server as non-root with zero capabilities except NET_BIND_SERVICE, you’ll feel unreasonably smug. That smugness is earned. You’ve correctly understood that “I need root” usually means “I need one specific thing that root can do” — and now you know how to give it exactly that, nothing more.

