Container Escape: How to Stop It

By SumGuy 10 min read

Containers Don’t Isolate. They Pretend To.

Here’s the uncomfortable truth: if you’ve been thinking of containers as lightweight virtual machines, you’re playing with fire and you just don’t know it yet. That Docker container running on your homelab server? It’s not a fortress. It’s more like a cardboard box with a “KEEP OUT” sign taped to it. Add someone determined enough, or a vulnerability sneaky enough, and that app breaks out onto your host as if the box was never there.

Containers use Linux namespaces and cgroups to create the illusion of isolation. Keyword: illusion. Underneath, every container on your host is sharing the same kernel. That’s not a feature. That’s a liability with excellent marketing.

Let me walk you through the real escape vectors—the ones that keep security-conscious homelabbers up at 2 AM—and then we’ll talk about what actually helps.

The Bad Decisions: Classic Escape Vectors

Privileged Mode: The Nuclear Option

You’ve probably seen this in some guide you found at 11 PM trying to get your volume mounts working:

docker-compose.yml
services:
  bad-idea:
    image: some-app:latest
    privileged: true

Don’t. Just… don’t. --privileged mode is like handing someone the keys to your house because they asked nicely.

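When you run a container in privileged mode, it gets every Linux capability (CAP_SYS_ADMIN, CAP_NET_ADMIN, CAP_SYS_MODULE, and the rest), access to the host’s devices, and no seccomp or AppArmor confinement. CAP_SYS_ADMIN alone is a skeleton key. With it and that device access, an attacker can, for example, mount the host’s disks from inside the container and read or rewrite anything on them.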

A compromised app with CAP_SYS_ADMIN doesn’t need to exploit a kernel CVE. It just… escapes. It walks out the front door with your host’s filesystem hanging off its arm.

When you actually need privileged mode: Almost never. If you think you do, step back. You probably need one specific capability or a better device mounting strategy.
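Not sure whether something on your box is already running privileged? Here’s a minimal sketch for finding out. It assumes the docker CLI is installed on the host and your user can talk to the daemon; the function names filter_privileged and list_privileged are made up for this post.

```shell
# Input: one "<name> <true|false>" pair per line.
# Output: the names whose second column is "true".
filter_privileged() {
  awk '$2 == "true" { print $1 }'
}

# Ask Docker for every running container's name and privileged flag.
# Only defined here, not run; call it yourself on a host with Docker.
list_privileged() {
  ids=$(docker ps -q)
  [ -z "$ids" ] && return 0
  # $ids is left unquoted on purpose so each ID becomes its own argument.
  docker inspect --format '{{.Name}} {{.HostConfig.Privileged}}' $ids | filter_privileged
}
```

Run list_privileged on the host. Any name it prints is a container worth a second look.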

The Docker Socket: Self-Destruct Button

This one’s sneaky because it seems innocent:

docker-compose.yml
services:
  docker-client:
    image: myapp:latest
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

You mounted the Docker socket so your app can orchestrate other containers, right? Classic move. Totally understandable. Also totally game-over if that container gets compromised.

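The Docker daemon runs as root (unless you’ve gone to the trouble of setting up rootless mode). When you mount /var/run/docker.sock into a container, you’re giving that container a direct line to the daemon. A compromised app can ask the daemon to start a new privileged container with the host’s root filesystem mounted, then run whatever it likes in there as root.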

There’s no “escape” here because the app doesn’t need to break out—it just instructs the daemon to create a jailbreak tool and hand it back.
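To see which running containers have that direct line, you can list each container’s mount sources and look for the socket. This is a sketch under the same assumptions as any host-side audit (docker CLI present, daemon reachable); filter_socket_mounts and list_socket_mounts are hypothetical names.

```shell
# Input: one line per container, "<name> <mount source> <mount source> ...".
# Output: names of containers with any mount source ending in docker.sock.
filter_socket_mounts() {
  awk '{ for (i = 2; i <= NF; i++) if ($i ~ /docker\.sock$/) { print $1; next } }'
}

# Collect name + mount sources for every running container.
# Only defined here, not run; call it yourself on a host with Docker.
list_socket_mounts() {
  ids=$(docker ps -q)
  [ -z "$ids" ] && return 0
  # $ids is left unquoted on purpose so each ID becomes its own argument.
  docker inspect --format '{{.Name}}{{range .Mounts}} {{.Source}}{{end}}' $ids | filter_socket_mounts
}
```

Anything this prints can drive the daemon, so treat it as if it were running as root on the host.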

Mounted Host Filesystems and Device Files

Some people get creative:

docker-compose.yml
services:
  sketchy-app:
    image: sketchy:latest
    volumes:
      - /proc:/host_proc
      - /dev:/host_dev
      - /sys:/host_sys

Why? Usually because they needed access to one thing and just mounted the entire /proc filesystem to be safe. Classic DevOps move: “Yeah, I’ll just give it everything and lock it down later.” (Narrator: they never locked it down.)

Now your container has read/write access to the host’s /proc, /dev, and /sys: process and kernel state, device nodes, and kernel and hardware tunables.

With these, an attacker can tamper with kernel settings, go after host block devices, or abuse /dev/mem to read and write physical memory. It’s not a fun time.

Capability Misuse: The Slow Burn

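Containers run with a default set of Linux capabilities. Some apps claim they need extra ones. The dangerous ones include CAP_SYS_ADMIN (mounting filesystems, plus a grab bag of other admin operations), CAP_SYS_MODULE (loading kernel modules), and CAP_NET_ADMIN (reconfiguring interfaces, routes, and firewall rules).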

I’ve seen containers running with capabilities they don’t need, “just in case”. This is how you end up with a compromised web server that can inject kernel modules or hijack all network traffic on your host.

Kernel CVEs: The Luck Factor

Sometimes, none of this matters. A vulnerability in the kernel itself can blow holes straight through the namespace boundaries. You can’t mitigate these with clever container configs. You need kernel patches.

An infamous example is Dirty Pipe (CVE-2022-0847).

These aren’t configuration problems. They’re code problems. Your container setup could be flawless, and a kernel CVE still ruins your day.

User Namespace Shenanigans

Containers run as a UID/GID inside their namespace. By default, that maps to the same UID/GID on the host. So root (UID 0) in a container is root on the host in terms of file ownership.

User namespace remapping helps, but only if it’s enabled. Many container runtimes don’t enable it by default, so a compromised container running as root can read/write host files owned by root. Not a kernel escape, but a data escape—often just as bad.

How to Actually Protect Yourself

Okay, enough doom. Here’s what actually works.

1. Drop Capabilities Aggressively

Start with the principle of least privilege. Docker containers start with this default set of capabilities:

CHOWN, DAC_OVERRIDE, FOWNER, FSETID, KILL, MKNOD, NET_BIND_SERVICE, NET_RAW, SETFCAP, SETGID, SETPCAP, SETUID, SYS_CHROOT, AUDIT_WRITE

Most apps don’t need all of these. Drop what you don’t use:

docker-compose.yml
services:
  hardened-app:
    image: myapp:latest
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE

If your app only needs to bind to a privileged port (anything below 1024), drop everything and add back only NET_BIND_SERVICE. If it needs nothing? cap_drop: [ALL] and call it a day. Yes, some apps will break. That’s intentional: it means they were asking for more than they needed.
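To confirm what a container actually ended up with, read the CapEff line from /proc/1/status inside it and decode the hex mask. The sketch below does the decoding in plain shell; decode_caps is a made-up helper name, and the bit order is taken from linux/capability.h.

```shell
# Capability names in bit order (bit 0 first), as defined in linux/capability.h.
CAP_NAMES="CHOWN DAC_OVERRIDE DAC_READ_SEARCH FOWNER FSETID KILL SETGID SETUID
SETPCAP LINUX_IMMUTABLE NET_BIND_SERVICE NET_BROADCAST NET_ADMIN NET_RAW
IPC_LOCK IPC_OWNER SYS_MODULE SYS_RAWIO SYS_CHROOT SYS_PTRACE SYS_PACCT
SYS_ADMIN SYS_BOOT SYS_NICE SYS_RESOURCE SYS_TIME SYS_TTY_CONFIG MKNOD LEASE
AUDIT_WRITE AUDIT_CONTROL SETFCAP MAC_OVERRIDE MAC_ADMIN SYSLOG WAKE_ALARM
BLOCK_SUSPEND AUDIT_READ PERFMON BPF CHECKPOINT_RESTORE"

# Turn a hex mask (the CapEff value from /proc/<pid>/status)
# into one capability name per line.
decode_caps() {
  mask=$((0x$1))
  bit=0
  for name in $CAP_NAMES; do
    if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
      echo "CAP_$name"
    fi
    bit=$((bit + 1))
  done
}
```

Get the mask with something like docker exec <container> grep CapEff /proc/1/status, then pass the hex value to decode_caps. Docker’s default set shows up as 00000000a80425fb; a container with cap_drop ALL plus NET_BIND_SERVICE should show 0000000000000400. If you have libcap’s tools installed, capsh --decode=<mask> does the same job.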

2. Use User Namespaces / Rootless Mode

If you’re running your own host, enable user namespace remapping. It maps container UIDs to unprivileged host UIDs:

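Terminal window
# On the host, reserve a subordinate UID/GID range for the remap user
# (skip this if /etc/subuid and /etc/subgid already have a dockremap entry)
echo "dockremap:231072:65536" | sudo tee -a /etc/subuid
echo "dockremap:231072:65536" | sudo tee -a /etc/subgid
# Configure Docker to use userns-remap
# In /etc/docker/daemon.json ("default" makes Docker create and use the dockremap user):
{
  "userns-remap": "default"
}
# Restart the daemon so the setting takes effect
sudo systemctl restart docker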

Now even if container root escapes, it’s actually UID 231072 on the host—unprivileged. Game-changer.
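The mapping is plain arithmetic: host UID = start of the subordinate range + UID inside the container. Here’s a small sketch of that calculation; map_uid is a hypothetical helper, and it only looks at the single entry you hand it.

```shell
# $1: an /etc/subuid entry ("name:start:count"); $2: a UID inside the container.
# Prints the host UID it maps to, or fails if the UID is outside the range.
map_uid() {
  start=$(echo "$1" | cut -d: -f2)
  count=$(echo "$1" | cut -d: -f3)
  if [ "$2" -ge "$count" ]; then
    echo "UID $2 is outside the mapped range" >&2
    return 1
  fi
  echo $((start + $2))
}
```

So map_uid dockremap:231072:65536 0 gives 231072 for container root, and a process running as UID 1000 in the container lands on 232072 on the host.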

3. Read-Only Root and no-new-privileges

Make the container’s root filesystem read-only. It forces the app to use tmpfs or volumes for writable state, which is how it should be anyway:

docker-compose.yml
services:
  hardened-app:
    image: myapp:latest
    read_only: true
    tmpfs:
      - /tmp
      - /var/log
    security_opt:
      - no-new-privileges:true

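read_only: true prevents even root from writing to the container’s root filesystem; only the tmpfs mounts and volumes you list stay writable. no-new-privileges stops a process from gaining privileges through setuid/setgid binaries or file capabilities. Together, they force good hygiene.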

4. seccomp and AppArmor/SELinux Profiles

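seccomp filters the syscalls a container can make. Docker ships with a default profile that blocks around 44 of the 300+ syscalls (like reboot, syslog, keyctl). It’s applied automatically, so using it mostly means not switching it off like this:

docker-compose.yml
services:
  app:
    image: myapp:latest
    security_opt:
      - seccomp=unconfined # DON'T DO THIS

Seriously, don’t use seccomp=unconfined. Stick with the default. If you need a custom profile, build one. The fragment below shows the shape; a working allowlist needs far more syscalls than this: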

seccomp-profile.json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 1,
  "archMap": [
    {
      "architecture": "SCMP_ARCH_X86_64",
      "subArchitectures": ["SCMP_ARCH_X86", "SCMP_ARCH_X32"]
    }
  ],
  "syscalls": [
    {
      "names": ["accept4", "bind"],
      "action": "SCMP_ACT_ALLOW"
    },
    {
      "names": ["reboot", "syslog"],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}

AppArmor/SELinux profiles go deeper. They whitelist what files and resources a process can access. They’re more powerful but also more complex to maintain.

5. Hardened Compose Example

Here’s what a reasonably hardened container looks like:

docker-compose.yml
services:
  secure-app:
    image: myapp:latest
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE
    read_only: true
    tmpfs:
      - /tmp
      - /var/log
      - /var/cache
    security_opt:
      - no-new-privileges:true
    environment:
      - LOG_LEVEL=warn
    user: "1000:1000"
    volumes:
      - config-volume:/etc/myapp:ro
    networks:
      - private-net
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
volumes:
  config-volume:
    driver: local
networks:
  private-net:
    driver: bridge

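What this does: drops every capability and adds back only NET_BIND_SERVICE, makes the root filesystem read-only with tmpfs for the few paths that need writes, blocks privilege escalation with no-new-privileges, runs as an unprivileged user (1000:1000) instead of root, mounts the config volume read-only, keeps the service on its own bridge network, and health-checks it every 30 seconds.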
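Config files lie less often than people, but it’s still worth checking that the settings took effect. One way, sketched here, is to read /proc/1/status from inside the running container and look at the NoNewPrivs and Seccomp fields. hardening_report is a hypothetical helper, and it assumes the image has cat available.

```shell
# Read the text of /proc/<pid>/status on stdin and summarise two fields.
# Seccomp values: 0 = disabled, 1 = strict, 2 = filter (what Docker profiles use).
# A field that is missing (very old kernels) is reported as off/disabled.
hardening_report() {
  awk '
    $1 == "NoNewPrivs:" { nnp = $2 }
    $1 == "Seccomp:"    { sc = $2 }
    END {
      print "no-new-privileges: " (nnp == 1 ? "on" : "off")
      print "seccomp: " (sc == 2 ? "filter" : (sc == 1 ? "strict" : "disabled"))
    }'
}
```

Use it as docker exec <container> cat /proc/1/status | hardening_report. For the compose file above you’d want to see no-new-privileges on and seccomp in filter mode.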

6. Kernel Patching Is Non-Negotiable

Even with all the above, a kernel CVE can still hurt you. Keep your host kernel patched. Period. Not “when you get around to it”—I mean actually doing it on a schedule.

Terminal window
# Check for updates
sudo apt update && apt list --upgradable | grep linux-image
# Install them (linux-image-generic is Ubuntu's metapackage; Debian uses linux-image-amd64)
sudo apt install -y linux-image-generic
# Reboot to activate
sudo reboot
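The reboot is the step people skip: the patched kernel sits in /boot while the old one keeps running. A rough way to spot that, assuming a Debian/Ubuntu-style layout where images are named /boot/vmlinuz-<version>, is to compare uname -r against the newest installed image. newest_kernel and reboot_pending are made-up names, and sort -V needs GNU coreutils (or a recent BusyBox).

```shell
# Read kernel image paths on stdin, print the highest version among them.
# Version sort matters: plain sort would put -101 before -91.
newest_kernel() {
  sed 's|.*/vmlinuz-||' | sort -V | tail -n 1
}

# Succeeds when the running kernel ($1) is not the newest installed one ($2).
reboot_pending() {
  [ -n "$2" ] && [ "$1" != "$2" ]
}
```

On the host: reboot_pending "$(uname -r)" "$(ls /boot/vmlinuz-* | newest_kernel)" && echo "reboot needed".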

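Kernel CVEs are rare but brutal. Dirty Pipe (CVE-2022-0847) let an unprivileged process overwrite files it was only supposed to read, container or not. And it’s not just the kernel: Leaky Vessels hit runc and BuildKit. Stay current.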

7. Go Deeper: gVisor and Kata Containers

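For critical workloads, consider stronger isolation: gVisor, which puts a user-space kernel between the container and the host kernel, or Kata Containers, which runs each container inside its own lightweight VM.

Both trade speed for safety. As rough figures, gVisor is ~10-20% slower and Kata is ~5-10% slower, though the real cost depends heavily on how syscall- and I/O-heavy the workload is. Worth it for untrusted code.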

Terminal window
# Example with gVisor (assuming it's installed)
docker run --runtime=runsc --cap-drop=ALL my-app:latest

The Honest Take

There’s no magic bullet. Container isolation is defense-in-depth:

  1. Drop unnecessary capabilities
  2. Use rootless mode / user namespaces
  3. Make filesystems read-only
  4. Use seccomp profiles
  5. Keep the kernel patched
  6. Monitor what’s running

If someone gets access to your host kernel or the Docker daemon, the game’s over. If they compromise an app inside the container, they should be stuck inside it. Should be. With the right config, they will be.

So many container escapes work because people skip the basics. They run with --privileged, mount the Docker socket, and wonder why they got pwned. Don’t be that person.

You’re running containers on your home lab or small infrastructure. You can afford to be paranoid. Do it.

Quick Checklist

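Before deploying any container, check that it isn’t running privileged, doesn’t mount the Docker socket or host paths like /proc, /dev, and /sys, drops all capabilities and adds back only what it needs, runs as a non-root user, has a read-only root filesystem with no-new-privileges set, keeps the default seccomp profile, and sits on a host with a patched kernel.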

Get those right, and you’ve eliminated 90% of the low-hanging fruit. The determined hacker with a kernel CVE exploit will still be a problem. But casual lateral movement? Privilege escalation via sloppy config? That’s done.

Stay paranoid. Your 2 AM self will thank you.

