Container Escape: How to Stop It

Containers Don’t Isolate. They Pretend To.

Here’s the uncomfortable truth: if you’ve been thinking of containers as lightweight virtual machines, you’re playing with fire and you just don’t know it yet. That Docker container running on your homelab server? It’s not a fortress. It’s more like a cardboard box with a “KEEP OUT” sign taped to it. Given a determined attacker or a sneaky enough vulnerability, the app inside reaches your host kernel as if the box was never there.

Containers use Linux namespaces and cgroups to create the illusion of isolation. Keyword: illusion. Underneath, every container on your host is sharing the same kernel. That’s not a feature. That’s a liability with excellent marketing.

Let me walk you through the real escape vectors, the ones that keep security-conscious homelabbers up at 2 AM, and then we’ll talk about what actually helps.

The Bad Decisions: Classic Escape Vectors

Privileged Mode: The Nuclear Option

You’ve probably seen this in some guide you found at 11 PM while trying to get your volume mounts working:

services:
  bad-idea:
    image: some-app:latest
    privileged: true

Don’t. Just… don’t. --privileged mode is like handing someone the keys to your house because they asked nicely.

When you run a container in privileged mode, it gets every Linux capability (CAP_SYS_ADMIN, CAP_NET_ADMIN, CAP_SYS_MODULE, and the rest), access to the host’s devices, and none of the default seccomp or AppArmor confinement. CAP_SYS_ADMIN alone is a skeleton key. With it, an attacker can:

Mount filesystems directly (including the host’s root)
Load kernel modules
Fiddle with device nodes
Manipulate namespaces directly

A compromised app with CAP_SYS_ADMIN doesn’t need to exploit a kernel CVE. It just… escapes. It walks out the front door with your host’s filesystem hanging off its arm.

When you actually need privileged mode: Almost never. If you think you do, step back. You probably need one specific capability or a better device mounting strategy.
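
If you want to find out whether anything on your host is already running this way, the JSON printed by docker inspect has what you need. Here’s a minimal sketch, not an official tool: the function name and the shortlist of “risky” capabilities are my own picks, and it relies on the HostConfig.Privileged and HostConfig.CapAdd fields in that JSON.

```python
import json

# Illustrative shortlist, not an exhaustive or official one.
RISKY_CAPS = {"SYS_ADMIN", "SYS_MODULE", "SYS_PTRACE", "NET_ADMIN", "SYS_BOOT"}


def find_overprivileged(inspect_json: str) -> dict[str, list[str]]:
    """Map container name -> reasons it looks over-privileged.

    `inspect_json` is the text printed by `docker inspect <container>...`,
    which is a JSON array with one object per container.
    """
    findings: dict[str, list[str]] = {}
    for container in json.loads(inspect_json):
        host_config = container.get("HostConfig", {})
        reasons = []
        if host_config.get("Privileged"):
            reasons.append("privileged")
        # CapAdd is null when nothing was added; names may carry a CAP_ prefix.
        for cap in host_config.get("CapAdd") or []:
            name = cap.upper().removeprefix("CAP_")
            if name == "ALL" or name in RISKY_CAPS:
                reasons.append(f"cap_add {name}")
        if reasons:
            findings[container.get("Name", "?")] = reasons
    return findings


# Hand-written sample in the shape of `docker inspect` output.
sample = json.dumps([
    {"Name": "/bad-idea", "HostConfig": {"Privileged": True, "CapAdd": None}},
    {"Name": "/vpn", "HostConfig": {"Privileged": False, "CapAdd": ["NET_ADMIN"]}},
    {"Name": "/web", "HostConfig": {"Privileged": False, "CapAdd": None}},
])
print(find_overprivileged(sample))
```

On a real host you’d feed it the output of docker inspect $(docker ps -q) instead of the hand-written sample.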

The Docker Socket: Self-Destruct Button

This one’s sneaky because it seems innocent:

services:
  docker-client:
    image: myapp:latest
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

You mounted the Docker socket so your app can orchestrate other containers, right? Classic move. Totally understandable. Also totally game-over if that container gets compromised.

The Docker daemon runs as root by default (rootless mode, where it runs as an unprivileged user, is the exception, and you have to set it up on purpose). When you mount /var/run/docker.sock into a container, you’re giving that container a direct line to the daemon. A compromised app can:

Spawn a new container with --privileged
Mount the host filesystem into a new container
Execute commands as root via container exec
Access secrets in environment variables from other containers
Basically do anything the daemon can do, which is everything

There’s no “escape” here because the app doesn’t need to break out. It just instructs the daemon to build a jailbreak tool and hand it over.

Mounted Host Filesystems and Device Files

Some people get creative:

services:
  sketchy-app:
    image: sketchy:latest
    volumes:
      - /proc:/host_proc
      - /dev:/host_dev
      - /sys:/host_sys

Why? Usually because they needed access to one thing and just mounted the entire /proc filesystem to be safe. Classic DevOps move: “Yeah, I’ll just give it everything and lock it down later.” (Narrator: they never locked it down.)

Now your container has read/write access to:

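/proc: Process info, kernel tunables under /proc/sys, security parameters
/dev: Device nodes, potentially including the host’s storage
/sys: Kernel parameters and hardware info

With these, an attacker can rewrite host kernel settings (a writable /proc/sys/kernel/core_pattern, for example, lets them name a program the host kernel will run as root on the next crash), reach host block devices if the container is otherwise allowed to open them, or, where the kernel still permits it, use /dev/mem to read and write physical memory. It’s not a fun time.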
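
Both of the last two mistakes (the Docker socket and the wholesale host mounts) show up as plain strings in the volumes list, so they’re easy to catch before anything starts. Here’s a rough sketch. The function name and the list of risky paths are my own assumptions, it only understands compose’s short source:target[:mode] syntax, and it only matches whole paths (so /proc is flagged but /proc/sys is not).

```python
# Host paths that should rarely, if ever, be bind-mounted whole.
# Illustrative list; extend it for your own setup.
RISKY_SOURCES = (
    "/", "/proc", "/dev", "/sys", "/etc",
    "/var/run/docker.sock", "/run/docker.sock",
)


def risky_binds(volumes: list[str]) -> list[str]:
    """Return the short-syntax volume entries that expose risky host paths."""
    flagged = []
    for entry in volumes:
        source = entry.split(":", 1)[0]
        # Named volumes and relative paths are skipped; only absolute
        # host paths are compared against the list.
        if not source.startswith("/"):
            continue
        normalized = source.rstrip("/") or "/"
        if normalized in RISKY_SOURCES:
            flagged.append(entry)
    return flagged


print(risky_binds([
    "/proc:/host_proc",
    "config-volume:/etc/myapp:ro",
    "/var/run/docker.sock:/var/run/docker.sock",
]))
```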

Capability Misuse: The Slow Burn

Containers run with a default set of Linux capabilities. Some apps claim they need specific ones. Let’s talk about dangerous ones:

CAP_SYS_PTRACE: Attach a debugger to any process the container can see. Share the host’s PID namespace (pid: host) as well, and that includes the host’s own processes.
CAP_NET_ADMIN: Reconfigure networking, set up tunnels, manipulate routing. Useful for network tools. Dangerous if you’re not careful.
CAP_SYS_MODULE: Load/unload kernel modules. Goodbye host kernel integrity.
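CAP_SYS_BOOT: Call reboot. In a container with its own PID namespace that normally just stops the container, but share the host’s PID namespace and it takes the host down. Have fun explaining why production went down at 3 AM.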

I’ve seen containers running with capabilities they don’t need, “just in case”. This is how you end up with a compromised web server that can inject kernel modules or hijack all network traffic on your host.
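
To see what a process actually holds, look at the CapEff line in /proc/<pid>/status. The kernel prints it as a hex bitmask, which nobody reads for fun, so here’s a small decoder. It’s a sketch: the function names are mine, and the bit positions follow the kernel’s capability.h header.

```python
# Capability names indexed by bit position, per <linux/capability.h>.
CAP_NAMES = [
    "CHOWN", "DAC_OVERRIDE", "DAC_READ_SEARCH", "FOWNER", "FSETID", "KILL",
    "SETGID", "SETUID", "SETPCAP", "LINUX_IMMUTABLE", "NET_BIND_SERVICE",
    "NET_BROADCAST", "NET_ADMIN", "NET_RAW", "IPC_LOCK", "IPC_OWNER",
    "SYS_MODULE", "SYS_RAWIO", "SYS_CHROOT", "SYS_PTRACE", "SYS_PACCT",
    "SYS_ADMIN", "SYS_BOOT", "SYS_NICE", "SYS_RESOURCE", "SYS_TIME",
    "SYS_TTY_CONFIG", "MKNOD", "LEASE", "AUDIT_WRITE", "AUDIT_CONTROL",
    "SETFCAP", "MAC_OVERRIDE", "MAC_ADMIN", "SYSLOG", "WAKE_ALARM",
    "BLOCK_SUSPEND", "AUDIT_READ", "PERFMON", "BPF", "CHECKPOINT_RESTORE",
]


def decode_caps(hex_mask: str) -> set[str]:
    """Turn a hex capability mask into the set of capability names it contains."""
    mask = int(hex_mask, 16)
    return {name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1}


def effective_caps(status_text: str) -> set[str]:
    """Find the CapEff line in /proc/<pid>/status text and decode it."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return decode_caps(line.split()[1])
    raise ValueError("no CapEff line found")


# 00000000a80425fb is the mask that Docker's default capability set adds up to.
print(sorted(decode_caps("00000000a80425fb")))
```

Inside a container, effective_caps(open("/proc/self/status").read()) tells you what the current process really has. If SYS_ADMIN or SYS_MODULE shows up and you can’t say why, that’s your cue.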

Kernel CVEs: The Luck Factor

Sometimes, none of this matters. A vulnerability in the kernel itself, or in the runtime that sets the container up, can blow holes straight through the namespace boundaries. Clever container configs won’t reliably save you here. You need patches.

Some infamous ones:

Dirty Pipe (CVE-2022-0847): A kernel bug that let an unprivileged process overwrite data in read-only files via the pipe buffer, allowing privilege escalation from inside a container.
runc escape (CVE-2019-5736): Mishandling of /proc/self/exe let a malicious container overwrite the host’s runc binary. The next runc invocation ran the attacker’s code as root.
Leaky Vessels (CVE-2024-21626): A file descriptor leaked by runc let a container reach the host filesystem through a crafted working directory (WORKDIR). Newer, still scary.

These aren’t configuration problems. They’re code problems, in the kernel or in the runtime. Your container setup could be flawless, and one of these still ruins your day.

User Namespace Shenanigans

Containers run as a UID/GID inside their namespace. By default, that maps to the same UID/GID on the host. So root (UID 0) in a container is root on the host in terms of file ownership.

User namespace remapping helps, but only if it’s enabled. Many container runtimes, Docker included, don’t enable it by default, so a compromised container running as root can read and write any root-owned host files it can reach, whether through a bind mount or after an escape. Not a kernel escape, but a data escape, often just as bad.

How to Actually Protect Yourself

Okay, enough doom. Here’s what actually works.

1. Drop Capabilities Aggressively

Start with the principle of least privilege. Docker containers come with these default capabilities:

CHOWN, DAC_OVERRIDE, FSETID, FOWNER, MKNOD, NET_RAW, SETGID, SETUID,
SETFCAP, SETPCAP, NET_BIND_SERVICE, SYS_CHROOT, KILL, AUDIT_WRITE

Most apps don’t need all of these. Drop what you don’t use:

services:
  hardened-app:
    image: myapp:latest
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE

If your app only needs to bind to a privileged port (below 1024), drop everything and add back only NET_BIND_SERVICE. If it needs nothing? cap_drop: [ALL] and call it a day. Yes, some apps will break. That’s the point: a breakage tells you which capability the app really depends on, and you can add that one back.
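
If the drop-then-add logic feels backwards, it helps to model it as set arithmetic. The sketch below is a simplified model of how cap_drop and cap_add combine, not Docker’s actual code. The function name is made up, DEFAULT_CAPS is just the list from this post, and cap_add: [ALL] is left out on purpose.

```python
DEFAULT_CAPS = {
    "CHOWN", "DAC_OVERRIDE", "FSETID", "FOWNER", "MKNOD", "NET_RAW", "SETGID",
    "SETUID", "SETFCAP", "SETPCAP", "NET_BIND_SERVICE", "SYS_CHROOT", "KILL",
    "AUDIT_WRITE",
}


def resulting_caps(cap_drop=(), cap_add=()) -> set[str]:
    """Simplified model: start from the defaults, apply drops, then apply adds."""
    drop = {c.upper() for c in cap_drop}
    add = {c.upper() for c in cap_add}
    if "ALL" in add:
        raise ValueError("cap_add: [ALL] is not modelled in this sketch")
    # Dropping ALL empties the set; otherwise only the named ones go.
    base = set() if "ALL" in drop else DEFAULT_CAPS - drop
    return base | add


print(resulting_caps(cap_drop=["ALL"], cap_add=["NET_BIND_SERVICE"]))
print(len(resulting_caps()))  # nothing dropped: the full default set
```

The useful takeaway is the middle case: dropping two or three capabilities by name still leaves you with most of the defaults, which is why starting from ALL is the safer habit.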

2. Use User Namespaces / Rootless Mode

If you’re running your own host, enable user namespace remapping. It maps container UIDs to unprivileged host UIDs. (Rootless mode goes a step further and runs the Docker daemon itself as an unprivileged user.)

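# On the host, as root, create a subordinate UID/GID range.
# This assumes a "dockremap" user already exists. If you set
# "userns-remap": "default" instead, Docker creates the user and ranges itself.
echo "dockremap:231072:65536" >> /etc/subuid
echo "dockremap:231072:65536" >> /etc/subgid

# Configure Docker to use userns-remap
# In /etc/docker/daemon.json:
{
  "userns-remap": "dockremap"
}

# Restart the daemon so the setting takes effect
systemctl restart docker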

Now even if container root escapes, it’s actually UID 231072 on the host, unprivileged. Game-changer.
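
The arithmetic behind that number is simple enough to check yourself. Each mapping line (the same format the kernel shows in /proc/<pid>/uid_map) is “first ID inside, first ID outside, count”. Here’s a small sketch of the translation; the function name is mine.

```python
from typing import Optional


def host_uid(container_uid: int, uid_map_text: str) -> Optional[int]:
    """Translate a container UID to a host UID using uid_map-style lines.

    Each line reads "<first id inside> <first id outside> <count>".
    Returns None when the UID falls outside every mapped range.
    """
    for line in uid_map_text.splitlines():
        if not line.strip():
            continue
        inside, outside, count = (int(field) for field in line.split())
        if inside <= container_uid < inside + count:
            return outside + (container_uid - inside)
    return None


remapped = "0 231072 65536"      # the range configured above
not_remapped = "0 0 4294967295"  # what you see without any remapping

print(host_uid(0, remapped))       # container root under remapping
print(host_uid(0, not_remapped))   # container root without it
```

Without remapping, container UID 0 comes out as host UID 0, which is the whole problem in one line.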

3. Read-Only Root and no-new-privileges

Make the container’s root filesystem read-only. It forces the app to use tmpfs or volumes for writable state, which is how it should be anyway:

services:
  hardened-app:
    image: myapp:latest
    read_only: true
    tmpfs:
      - /tmp
      - /var/log
    security_opt:
      - no-new-privileges:true

read_only: true makes the container’s root filesystem read-only, even for root; only the tmpfs mounts and volumes stay writable. no-new-privileges stops setuid/setgid binaries from granting a process more privileges than it started with. Together, they force good hygiene.
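
Trust, but verify: both settings leave fingerprints inside the container. The NoNewPrivs line in /proc/self/status reads 1 when the flag is set, and /proc/mounts lists the root mount with an ro option. A minimal sketch of both checks follows (the function names are mine, and the NoNewPrivs line is missing on older kernels, which this treats as “not set”).

```python
def no_new_privs(status_text: str) -> bool:
    """True if the NoNewPrivs line in /proc/<pid>/status text reads 1."""
    for line in status_text.splitlines():
        if line.startswith("NoNewPrivs:"):
            return line.split()[1] == "1"
    return False  # field absent: treat as not set


def root_is_read_only(mounts_text: str) -> bool:
    """True if the last "/" entry in /proc/mounts text carries the ro option.

    Fields per line: device, mount point, fs type, options, dump, pass.
    """
    read_only = False
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == "/":
            # A later mount on "/" hides an earlier one, so the last wins.
            read_only = "ro" in fields[3].split(",")
    return read_only


sample_status = "Name:\tapp\nNoNewPrivs:\t1\nSeccomp:\t2\n"
sample_mounts = "overlay / overlay ro,relatime 0 0\ntmpfs /tmp tmpfs rw,nosuid 0 0\n"
print(no_new_privs(sample_status), root_is_read_only(sample_mounts))
```

Inside a real container, pass in open("/proc/self/status").read() and open("/proc/mounts").read() instead of the samples.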

4. seccomp and AppArmor/SELinux Profiles

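seccomp filters the syscalls a container can make. Docker ships with a default profile that blocks around 44 of the 300+ syscalls (including reboot, kexec_load, and mount). It applies automatically, so using it mostly means not turning it off like this:

services:
  app:
    image: myapp:latest
    security_opt:
      - seccomp=unconfined  # DON'T DO THIS

Don’t use seccomp=unconfined. Stick with the default. If you need a custom profile, build one, starting from a skeleton like this (a real profile has to allow far more syscalls than the three shown, or the container won’t even start):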

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 1,
  "archMap": [
    {
      "architecture": "SCMP_ARCH_X86_64",
      "subArchitectures": ["SCMP_ARCH_X86", "SCMP_ARCH_X32"]
    }
  ],
  "syscalls": [
    {
      "names": ["accept4", "bind", "listen"],
      "action": "SCMP_ACT_ALLOW"
    },
    },
    {
      "names": ["reboot", "syslog"],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}
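
When you’re editing a profile like this, it’s handy to ask “what happens to syscall X?” without starting a container. The sketch below answers that with a deliberately simplified lookup: the first rule naming the syscall wins, otherwise defaultAction applies. It ignores per-rule argument filters and capability conditions, and the real decision is made in the kernel by the filter the runtime compiles, so treat this as a reading aid only. The function name and the trimmed profile are mine.

```python
import json


def action_for(profile: dict, syscall: str) -> str:
    """Simplified lookup: first rule that names the syscall, else defaultAction."""
    for rule in profile.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"]
    return profile["defaultAction"]


# Trimmed-down profile in the same shape as the JSON above.
profile = json.loads("""
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {"names": ["accept4", "bind"], "action": "SCMP_ACT_ALLOW"},
    {"names": ["reboot", "syslog"], "action": "SCMP_ACT_ERRNO"}
  ]
}
""")

for name in ("bind", "reboot", "ptrace"):
    print(name, action_for(profile, name))
```

Note that ptrace isn’t listed anywhere, so it falls through to the default. With a deny-by-default profile, that fall-through is what does most of the work.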

AppArmor/SELinux profiles go deeper. They whitelist what files and resources a process can access. It’s more powerful but also more complex to maintain.

5. Hardened Compose Example

Here’s what a reasonably hardened container looks like:

services:
  secure-app:
    image: myapp:latest
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE
    read_only: true
    tmpfs:
      - /tmp
      - /var/log
      - /var/cache
    security_opt:
      - no-new-privileges:true
    environment:
      - LOG_LEVEL=warn
    user: "1000:1000"
    volumes:
      - config-volume:/etc/myapp:ro
    networks:
      - private-net
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3

volumes:
  config-volume:
    driver: local

networks:
  private-net:
    driver: bridge

What this does:

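Drops all capabilities, adds back only what’s needed (if your app listens on an unprivileged port such as 8080, as the health check here implies, you can leave out NET_BIND_SERVICE too)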
Read-only root with tmpfs mounts for logs/cache
no-new-privileges prevents escalation tricks
Non-root user (UID 1000) runs the process
Isolated network so lateral movement is harder
Health checks so you know when something’s wrong

6. Kernel Patching Is Non-Negotiable

Even with all the above, a kernel CVE can still hurt you. Keep your host kernel patched, and the container runtime (Docker, containerd, runc) along with it. Period. Not “when you get around to it.” I mean actually doing it on a schedule.

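# Debian/Ubuntu shown; use your distro's package manager elsewhere.
# Check for updates
sudo apt update && apt list --upgradable | grep linux-image

# Install them (linux-image-generic is Ubuntu's kernel metapackage;
# on Debian it is linux-image-amd64 or similar)
sudo apt install -y linux-image-generic

# Reboot to activate
sudo reboot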
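
That last step is the one people skip: the new kernel is installed, but the old one is still running. A quick heuristic is to compare the running release against the kernel images sitting in /boot. This sketch assumes the Debian/Ubuntu vmlinuz-<release> naming convention and compares versions purely by the numbers in the release string, so it’s a hint, not proof. Function names are mine.

```python
import re


def version_key(release: str) -> tuple[int, ...]:
    """Turn '5.15.0-101-generic' into (5, 15, 0, 101) for numeric comparison."""
    return tuple(int(number) for number in re.findall(r"\d+", release))


def reboot_pending(running: str, boot_files: list[str]) -> bool:
    """True if /boot holds a kernel image newer than the running release."""
    installed = [
        name.removeprefix("vmlinuz-")
        for name in boot_files
        if name.startswith("vmlinuz-")
    ]
    return any(version_key(r) > version_key(running) for r in installed)


print(reboot_pending(
    "5.15.0-91-generic",
    ["vmlinuz-5.15.0-91-generic", "vmlinuz-5.15.0-101-generic"],
))
```

On a live host you’d call reboot_pending(platform.release(), os.listdir("/boot")), with both modules from the standard library. The numeric key matters: compared as plain strings, “-91-” sorts after “-101-” and you’d miss the update.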

Kernel and runtime CVEs are rare but brutal. Dirty Pipe affected every kernel from 5.8 until the fixes in 5.16.11, 5.15.25, and 5.10.102. Leaky Vessels hit runc and BuildKit. Stay current.

7. Go Deeper: gVisor and Kata Containers

For critical workloads, consider stronger isolation:

gVisor: A userspace kernel that intercepts syscalls. Every syscall from a container goes through gVisor’s Go-based kernel. Slower, but much safer. Great for untrusted code.
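Kata Containers: Lightweight VMs using QEMU or similar. Each container gets its own guest kernel, with hardware virtualization as the boundary.

Both trade speed for safety, and the cost depends heavily on the workload: syscall-heavy and I/O-heavy apps pay the most. As a rough ballpark, gVisor is ~10-20% slower and Kata ~5-10%, so measure your own workload before committing. Worth it for untrusted code.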

# Example with gVisor (assuming it's installed)
docker run --runtime=runsc --cap-drop=ALL my-app:latest

The Honest Take

There’s no magic bullet. Container isolation is defense-in-depth:

Drop unnecessary capabilities
Use rootless mode / user namespaces
Make filesystems read-only
Use seccomp profiles
Keep the kernel patched
Monitor what’s running

If someone gets access to your host kernel or the Docker daemon, the game’s over. If they compromise an app inside the container, they should be stuck inside it. Should be. With the right config, they almost always will be.

The reason so many container escapes work is that people skip the basics. They run with --privileged, mount the Docker socket, and wonder why they got pwned. Don’t be that person.

You’re running containers on your home lab or small infrastructure. You can afford to be paranoid. Do it.

Quick Checklist

Before deploying any container:

No --privileged or privileged: true
No /var/run/docker.sock mounted (unless absolutely necessary, and you know why)
No broad host filesystem mounts (/proc, /dev, /sys)
cap_drop: [ALL] and explicit cap_add for what’s needed
read_only: true with tmpfs for logs/cache
no-new-privileges: true
Non-root user running the app
Seccomp enabled (default Docker config is fine)
Host kernel is current
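
Most of this checklist can be checked mechanically before anything runs. Here’s a rough sketch that audits one compose service once it has been parsed into a dict. The function name and the messages are mine, it only understands short-syntax volume strings, and it can’t tell you whether the host kernel is current, so that item stays manual.

```python
def audit_service(service: dict) -> list[str]:
    """Return checklist violations for one compose service (a parsed dict)."""
    problems = []
    if service.get("privileged"):
        problems.append("privileged: true")

    # Only short-syntax "source:target[:mode]" volume strings are inspected.
    sources = [
        v.split(":", 1)[0].rstrip("/") or "/"
        for v in service.get("volumes", [])
        if isinstance(v, str)
    ]
    if any(s.endswith("docker.sock") for s in sources):
        problems.append("Docker socket mounted")
    if any(s in ("/", "/proc", "/dev", "/sys") for s in sources):
        problems.append("broad host filesystem mount")

    if "ALL" not in {str(c).upper() for c in service.get("cap_drop", [])}:
        problems.append("cap_drop does not include ALL")
    if not service.get("read_only"):
        problems.append("root filesystem is writable")

    options = [str(o).replace(" ", "") for o in service.get("security_opt", [])]
    if not any(
        o.startswith("no-new-privileges") and not o.endswith("false")
        for o in options
    ):
        problems.append("no-new-privileges not set")
    if any(o.startswith("seccomp") and o.endswith("unconfined") for o in options):
        problems.append("seccomp disabled")

    # An empty user means "whatever the image says", which is often root.
    user = str(service.get("user", "")).split(":")[0]
    if user in ("", "0", "root"):
        problems.append("runs as root or image default user")
    return problems


hardened = {
    "cap_drop": ["ALL"],
    "cap_add": ["NET_BIND_SERVICE"],
    "read_only": True,
    "security_opt": ["no-new-privileges:true"],
    "user": "1000:1000",
    "volumes": ["config-volume:/etc/myapp:ro"],
}
print(audit_service(hardened))
print(audit_service({"privileged": True}))
```

To run it against a real file you’d load the YAML first (for example with yaml.safe_load from the third-party PyYAML package) and loop over the entries under services.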

Get those right, and you’ve eliminated most of the low-hanging fruit. A determined attacker with a kernel CVE exploit will still be a problem. But casual lateral movement? Privilege escalation via sloppy config? Those are off the table.

Stay paranoid. Your 2 AM self will thank you.
