Containers Don’t Isolate. They Pretend To.
Here’s the uncomfortable truth: if you’ve been thinking of containers as lightweight virtual machines, you’re playing with fire and you just don’t know it yet. That Docker container running on your homelab server? It’s not a fortress. It’s more like a cardboard box with a “KEEP OUT” sign taped to it. Someone determined enough, or a vulnerability sneaky enough, and that app breaks out onto your host as if the box was never there.
Containers use Linux namespaces and cgroups to create the illusion of isolation. Keyword: illusion. Underneath, every container on your host is sharing the same kernel. That’s not a feature. That’s a liability with excellent marketing.
Let me walk you through the real escape vectors—the ones that keep security-conscious homelabbers up at 2 AM—and then we’ll talk about what actually helps.
The Bad Decisions: Classic Escape Vectors
Privileged Mode: The Nuclear Option
You’ve probably seen this in some guide you found at 11 PM trying to get your volume mounts working:
```yaml
services:
  bad-idea:
    image: some-app:latest
    privileged: true
```

Don’t. Just… don’t. Running with --privileged (or privileged: true in Compose) is like handing someone the keys to your house because they asked nicely.
When you run a container in privileged mode, it gets every Linux capability (CAP_SYS_ADMIN, CAP_NET_ADMIN, CAP_SYS_MODULE, and the rest), access to the host’s devices, and none of the default seccomp or AppArmor confinement. CAP_SYS_ADMIN alone is a skeleton key. With it, an attacker can:
- Mount filesystems directly (including the host’s root)
- Load kernel modules
- Fiddle with device nodes
- Manipulate namespaces directly
A compromised app with CAP_SYS_ADMIN doesn’t need to exploit a kernel CVE. It just… escapes. It walks out the front door with your host’s filesystem hanging off its arm.
When you actually need privileged mode: Almost never. If you think you do, step back. You probably need one specific capability or a better device mounting strategy.
The Docker Socket: Self-Destruct Button
This one’s sneaky because it seems innocent:
```yaml
services:
  docker-client:
    image: myapp:latest
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
```

You mounted the Docker socket so your app can orchestrate other containers, right? Classic move. Totally understandable. Also totally game-over if that container gets compromised.
The Docker daemon runs as root by default (unless you’ve set up rootless mode, if you’re fancy). Membership in the docker group doesn’t change that: it only decides who may talk to the socket. When you mount /var/run/docker.sock into a container, you’re giving that container a direct line to the daemon. A compromised app can:
- Spawn a new container with --privileged
- Mount the host filesystem into a new container
- Execute commands as root via container exec
- Access secrets in environment variables from other containers
- Basically do anything the daemon can do, which is everything
There’s no “escape” here because the app doesn’t need to break out—it just instructs the daemon to create a jailbreak tool and hand it back.
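To make the "direct line to the daemon" concrete, here is a minimal sketch of how little code it takes to talk to the Docker Engine API once the socket is mounted. It assumes the socket sits at /var/run/docker.sock as in the compose example above and uses the API’s `/containers/json` listing endpoint; the helper names `build_request` and `ask_daemon` are made up for this illustration.

```python
import os
import socket

DOCKER_SOCK = "/var/run/docker.sock"  # path taken from the compose example above


def build_request(path: str) -> bytes:
    """Build a minimal HTTP/1.0 GET request for the Docker Engine API."""
    return f"GET {path} HTTP/1.0\r\nHost: docker\r\n\r\n".encode("ascii")


def ask_daemon(path: str, sock_path: str = DOCKER_SOCK) -> bytes:
    """Send one GET request over the Unix socket and return the raw response."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        s.sendall(build_request(path))
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:  # HTTP/1.0: the server closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks)


if __name__ == "__main__":
    if os.path.exists(DOCKER_SOCK):
        try:
            # Lists running containers. Note that no credentials are involved.
            print(ask_daemon("/containers/json").decode("utf-8", "replace"))
        except OSError as exc:
            print(f"socket present but not usable: {exc}")
    else:
        print("no Docker socket mounted here")
```

If a read-only listing works from inside a container, so do the API calls that create and start new containers, which is the whole problem.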
Mounted Host Filesystems and Device Files
Some people get creative:
```yaml
services:
  sketchy-app:
    image: sketchy:latest
    volumes:
      - /proc:/host_proc
      - /dev:/host_dev
      - /sys:/host_sys
```

Why? Usually because they needed access to one thing and just mounted the entire /proc filesystem to be safe. Classic DevOps move: “Yeah, I’ll just give it everything and lock it down later.” (Narrator: they never locked it down.)
Now your container has read/write access to:
- /proc: Kernel memory, process info, security parameters
- /dev: Raw device access, potentially to the host’s storage
- /sys: Kernel parameters and hardware info
With these, an attacker can tamper with kernel tunables through /proc/sys and /sys, go after host block devices, or abuse /dev/mem to read and write physical memory wherever the kernel config and the container’s device restrictions allow it. It’s not a fun time.
Capability Misuse: The Slow Burn
Containers run with a default set of Linux capabilities. Some apps claim they need specific ones. Let’s talk about dangerous ones:
- CAP_SYS_PTRACE: Attach a debugger to any process the container can see. That includes host processes if the container also shares the host’s PID namespace.
- CAP_NET_ADMIN: Reconfigure networking, set up tunnels, manipulate routing. Useful for network tools. Dangerous if you’re not careful.
- CAP_SYS_MODULE: Load/unload kernel modules. Goodbye host kernel integrity.
- CAP_SYS_BOOT: Reboot or power-off the host. Have fun explaining why production went down at 3 AM.
I’ve seen containers running with capabilities they don’t need, “just in case”. This is how you end up with a compromised web server that can inject kernel modules or hijack all network traffic on your host.
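One way to find out what a container is actually holding is to decode the `CapEff` bitmask from /proc/self/status. The sketch below does that, using the capability bit numbers from the kernel’s capability header, and flags the capabilities called out above. The sample mask `00000000a80425fb` is the value commonly seen for Docker’s default set, which is an assumption worth checking against your own containers; the function names are illustrative.

```python
import os

# Capability names indexed by bit number (Linux <linux/capability.h>).
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]

# The capabilities this article singles out as dangerous.
DANGEROUS = {
    "CAP_SYS_ADMIN", "CAP_SYS_PTRACE", "CAP_NET_ADMIN",
    "CAP_SYS_MODULE", "CAP_SYS_BOOT",
}


def decode_caps(mask_hex: str) -> list:
    """Turn a hex bitmask such as the CapEff value into capability names."""
    mask = int(mask_hex, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1]


def risky_caps(mask_hex: str) -> list:
    """Return the dangerous capabilities present in the mask, sorted by name."""
    return sorted(set(decode_caps(mask_hex)) & DANGEROUS)


def read_cap_eff(status_path: str = "/proc/self/status"):
    """Read the effective capability mask of the current process (Linux only)."""
    with open(status_path) as fh:
        for line in fh:
            if line.startswith("CapEff:"):
                return line.split()[1]
    return None


if __name__ == "__main__":
    if os.path.exists("/proc/self/status"):
        mask = read_cap_eff()
        if mask is not None:
            print("effective:", decode_caps(mask))
            print("risky:", risky_caps(mask))
```

Run it inside the container (for example via an exec into a shell with Python available) and an empty "risky" list is what you want to see.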
Kernel CVEs: The Luck Factor
Sometimes, none of this matters. A vulnerability in the kernel itself, or in the container runtime sitting on top of it, can blow holes straight through the namespace boundaries. You can’t mitigate these with clever container configs. You need patches.
Some infamous ones:
- Dirty Pipe (CVE-2022-0847): Overwrite read-only files via the pipe buffer, allowing privilege escalation from a container.
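- runc escape (CVE-2019-5736): A flaw in how runc handled /proc/self/exe let a malicious container overwrite the host’s runc binary itself. Root on the host the next time runc runs.
- Leaky Vessels (CVE-2024-21626): A file descriptor leaked by runc let a container start with its working directory inside the host filesystem. The same disclosure included three related BuildKit CVEs. Newer, still scary.

These aren’t configuration problems. They’re code problems. Your container setup could be flawless, and a kernel or runtime CVE still ruins your day.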
User Namespace Shenanigans
Containers run as a UID/GID inside their namespace. By default, that maps to the same UID/GID on the host. So root (UID 0) in a container is root on the host in terms of file ownership.
User namespace remapping helps, but only if it’s enabled. Many container runtimes don’t enable it by default, so a compromised container running as root can read and write any root-owned host files that are mounted into it. Not a kernel escape, but a data escape, and often just as bad.
How to Actually Protect Yourself
Okay, enough doom. Here’s what actually works.
1. Drop Capabilities Aggressively
Start with the principle of least privilege. Docker containers come with these default capabilities:
```
CHOWN, DAC_OVERRIDE, FOWNER, FSETID, KILL, SETGID, SETUID, SETPCAP,
NET_BIND_SERVICE, NET_RAW, SYS_CHROOT, MKNOD, AUDIT_WRITE, SETFCAP
```

Most apps don’t need all of these. Drop what you don’t use:
```yaml
services:
  hardened-app:
    image: myapp:latest
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE
```

If your app only needs to bind to a privileged port (below 1024), drop everything and add back only NET_BIND_SERVICE. If it needs nothing? cap_drop: [ALL] and call it a day. Yes, some apps will break. That’s intentional: it means they were asking for more than they needed.
2. Use User Namespaces / Rootless Mode
If you’re running your own host, enable user namespace remapping. It maps container UIDs to unprivileged host UIDs:
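```bash
# On the host (as root), create a subordinate UID/GID range
echo "dockremap:231072:65536" >> /etc/subuid
echo "dockremap:231072:65536" >> /etc/subgid
```

Then configure Docker to use userns-remap in /etc/docker/daemon.json:

```json
{
  "userns-remap": "dockremap"
}
```

The dockremap user has to exist for this to work; setting the value to "default" instead tells Docker to create it for you. Restart the daemon afterwards. Now even if container root escapes, it’s actually UID 231072 on the host, which is unprivileged. Game-changer.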
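The remapping itself is plain arithmetic, and the kernel exposes it in /proc/<pid>/uid_map as lines of "inside-start outside-start length". The sketch below parses that format and translates a container UID to the host UID it really is. The range `0 231072 65536` matches the subuid entry above; the function names are illustrative.

```python
def parse_id_map(text: str) -> list:
    """Parse uid_map/gid_map content into (inside, outside, length) tuples."""
    ranges = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            ranges.append((int(parts[0]), int(parts[1]), int(parts[2])))
    return ranges


def host_id(container_id: int, ranges: list):
    """Translate an ID inside the namespace to the host ID, or None if unmapped."""
    for inside, outside, length in ranges:
        if inside <= container_id < inside + length:
            return outside + (container_id - inside)
    return None


# With userns-remap and the subuid range from above:
remapped = parse_id_map("0 231072 65536")
print(host_id(0, remapped))     # container root on the host
print(host_id(1000, remapped))  # container UID 1000 on the host

# Without remapping, the map is the identity and root stays root:
identity = parse_id_map("         0          0 4294967295")
print(host_id(0, identity))
```

Reading /proc/self/uid_map from inside a running container and feeding it to `parse_id_map` is a quick way to confirm whether remapping is really in effect.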
3. Read-Only Root and no-new-privileges
Make the container’s root filesystem read-only. It forces the app to use tmpfs or volumes for writable state, which is how it should be anyway:
```yaml
services:
  hardened-app:
    image: myapp:latest
    read_only: true
    tmpfs:
      - /tmp
      - /var/log
    security_opt:
      - no-new-privileges:true
```

read_only: true prevents even root from writing to the container’s root filesystem. no-new-privileges prevents setuid/setgid binaries from escalating privileges. Together, they force good hygiene.
4. seccomp and AppArmor/SELinux Profiles
seccomp filters the syscalls a container can make. Docker ships with a default profile that blocks around 44 of the 300+ available syscalls (like reboot, kexec_load, and mount). It’s applied automatically, so the main thing is to never switch it off like this:

```yaml
services:
  app:
    image: myapp:latest
    security_opt:
      - seccomp=unconfined # DON'T DO THIS
```

Stick with the default. If you need a custom profile, build one:
```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 1,
  "archMap": [
    {
      "architecture": "SCMP_ARCH_X86_64",
      "subArchitectures": ["SCMP_ARCH_X86", "SCMP_ARCH_X32"]
    }
  ],
  "syscalls": [
    {
      "names": ["accept4", "bind"],
      "action": "SCMP_ACT_ALLOW"
    },
    {
      "names": ["reboot", "syslog"],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}
```

This is heavily abbreviated: a real allowlist needs far more syscalls before a container will even start, so begin from a copy of Docker’s default profile and trim from there.

AppArmor/SELinux profiles go deeper. They whitelist what files and resources a process can access. It’s more powerful but also more complex to maintain.
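To confirm that a seccomp filter and no-new-privileges are really in effect, the kernel reports both in /proc/<pid>/status through the `Seccomp` and `NoNewPrivs` fields. The sketch below parses that file’s "Key: value" format; the sample text at the bottom is made up for illustration, and the function names are hypothetical.

```python
import os

# Values of the Seccomp field in /proc/<pid>/status.
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def parse_status(text: str) -> dict:
    """Parse the 'Key:<tab>value' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def hardening_summary(text: str) -> dict:
    """Report the seccomp mode and whether no_new_privs is set."""
    fields = parse_status(text)
    mode = int(fields.get("Seccomp", "0"))
    return {
        "seccomp": SECCOMP_MODES.get(mode, "unknown"),
        "no_new_privs": fields.get("NoNewPrivs") == "1",
    }


# Illustrative sample, not captured from a real system.
sample = "Name:\tmyapp\nNoNewPrivs:\t1\nSeccomp:\t2\n"
print(hardening_summary(sample))

if __name__ == "__main__" and os.path.exists("/proc/self/status"):
    with open("/proc/self/status") as fh:
        print(hardening_summary(fh.read()))
```

Inside a container using Docker’s default profile you would expect the mode to be "filter"; "disabled" usually means someone set seccomp=unconfined or ran the container privileged.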
5. Hardened Compose Example
Here’s what a reasonably hardened container looks like:
```yaml
services:
  secure-app:
    image: myapp:latest
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE
    read_only: true
    tmpfs:
      - /tmp
      - /var/log
      - /var/cache
    security_opt:
      - no-new-privileges:true
    environment:
      - LOG_LEVEL=warn
    user: "1000:1000"
    volumes:
      - config-volume:/etc/myapp:ro
    networks:
      - private-net
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3

volumes:
  config-volume:
    driver: local

networks:
  private-net:
    driver: bridge
```

What this does:
- Drops all capabilities, adds back only what’s needed
- Read-only root with tmpfs mounts for logs/cache
- no-new-privileges prevents escalation tricks
- Non-root user (UID 1000) runs the process
- Isolated network so lateral movement is harder
- Health checks so you know when something’s wrong
6. Kernel Patching Is Non-Negotiable
Even with all the above, a kernel CVE can still hurt you. Keep your host kernel patched. Period. Not “when you get around to it”—I mean actually doing it on a schedule.
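```bash
# Check for updates (Debian/Ubuntu)
sudo apt update && apt list --upgradable | grep linux-image

# Install them (Ubuntu metapackage; Debian uses linux-image-amd64)
sudo apt install -y linux-image-generic

# Reboot to activate
sudo reboot
```

Kernel CVEs are rare but brutal. Dirty Pipe affected every kernel from 5.8 until the fix landed in 5.16.11, 5.15.25, and 5.10.102. Leaky Vessels hit runc and BuildKit, so keep the container runtime updated too. Stay current.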
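As a rough illustration of what "is my kernel current" means for one specific bug, the sketch below compares a kernel release string against the upstream versions that fixed Dirty Pipe (introduced in 5.8; fixed in 5.16.11, 5.15.25, and 5.10.102). Treat it as a first-pass hint only: distribution kernels backport fixes without changing the upstream-looking version number, so a "True" here is a prompt to check your distro’s security tracker, not a verdict. The function names are illustrative.

```python
import platform
import re

# Upstream stable releases that fixed Dirty Pipe (CVE-2022-0847), per branch.
FIXED_IN = {(5, 10): 102, (5, 15): 25, (5, 16): 11}
INTRODUCED = (5, 8)   # first affected upstream series
LAST_AFFECTED = (5, 16)  # 5.17 and later shipped with the fix


def parse_release(release: str) -> tuple:
    """Extract (major, minor, patch) from a string like '5.15.0-91-generic'."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not m:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(m.group(1)), int(m.group(2)), int(m.group(3) or 0)


def upstream_dirty_pipe_affected(release: str) -> bool:
    """True if the UPSTREAM version number falls in the affected range.

    Assumption: no distro backports. Vendor kernels often carry the fix
    under an older-looking version, so this can report false positives.
    """
    major, minor, patch = parse_release(release)
    series = (major, minor)
    if series < INTRODUCED or series > LAST_AFFECTED:
        return False
    fixed_patch = FIXED_IN.get(series)
    if fixed_patch is None:
        # 5.8, 5.9, 5.11-5.14: affected series with no fix release listed above.
        return True
    return patch < fixed_patch


if __name__ == "__main__":
    release = platform.release()
    print(release, "->", upstream_dirty_pipe_affected(release))
```

The same pattern (a small table of fixed versions per branch) extends to other CVEs, though in practice your distribution’s advisory feed is the more reliable source.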
7. Go Deeper: gVisor and Kata Containers
For critical workloads, consider stronger isolation:
- gVisor: A userspace kernel that intercepts syscalls. Every syscall from a container goes through gVisor’s Go-based kernel. Slower, but much safer. Great for untrusted code.
- Kata Containers: Lightweight VMs using QEMU or similar. Near-native performance with actual hardware-level isolation.
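Both trade speed for safety, and how much depends heavily on the workload: syscall- and I/O-heavy apps pay the most under gVisor, while Kata mostly costs you startup time and memory per VM. Ballpark figures like 10-20% for gVisor and 5-10% for Kata are rough guides at best, so benchmark your own app. Worth it for untrusted code.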
```bash
# Example with gVisor (assuming it's installed)
docker run --runtime=runsc --cap-drop=ALL my-app:latest
```

The Honest Take
There’s no magic bullet. Container isolation is defense-in-depth:
- Drop unnecessary capabilities
- Use rootless mode / user namespaces
- Make filesystems read-only
- Use seccomp profiles
- Keep the kernel patched
- Monitor what’s running
If someone gets access to your host kernel or the Docker daemon, the game’s over. If they compromise an app inside the container, they should be stuck inside it. Should be. With the right config, and barring an unpatched kernel or runtime bug, they will be.
The reason so many container escapes work is because people skip the basics. They run with --privileged, mount the Docker socket, and wonder why they got pwned. Don’t be that person.
You’re running containers on your home lab or small infrastructure. You can afford to be paranoid. Do it.
Quick Checklist
Before deploying any container:
- No --privileged or privileged: true
- No /var/run/docker.sock mounted (unless absolutely necessary, and you know why)
- No broad host filesystem mounts (/proc, /dev, /sys)
- cap_drop: [ALL] and explicit cap_add for what’s needed
- read_only: true with tmpfs for logs/cache
- no-new-privileges:true in security_opt
- Non-root user running the app
- Seccomp enabled (default Docker config is fine)
- Host kernel is current
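Most of this checklist can be checked mechanically. Here is a minimal sketch that audits one Compose service definition, assuming it has already been parsed into a Python dict (for example with a YAML library, which is not shown here). The function name, finding messages, and the list of risky mount sources are all choices made for this illustration, not part of any Docker tooling, and it cannot see the kernel version or what user the image itself defaults to.

```python
# Host paths the article warns against mounting wholesale.
RISKY_SOURCES = {"/", "/proc", "/dev", "/sys",
                 "/var/run/docker.sock", "/run/docker.sock"}


def _volume_source(vol) -> str:
    """Return the host side of a short ('src:dst[:mode]') or long-form volume."""
    if isinstance(vol, dict):
        return str(vol.get("source", ""))
    return str(vol).split(":", 1)[0]


def audit_service(svc: dict) -> list:
    """Return a list of checklist violations for one Compose service."""
    findings = []
    if svc.get("privileged"):
        findings.append("privileged mode is enabled")
    for vol in svc.get("volumes") or []:
        source = _volume_source(vol)
        normalized = source.rstrip("/") or "/"
        if source and normalized in RISKY_SOURCES:
            findings.append(f"risky host mount: {normalized}")
    if "ALL" not in [str(c).upper() for c in svc.get("cap_drop") or []]:
        findings.append("cap_drop does not include ALL")
    if not svc.get("read_only"):
        findings.append("root filesystem is writable")
    # Compose accepts both 'key:value' and 'key=value' in security_opt.
    opts = [str(o).replace("=", ":").lower() for o in svc.get("security_opt") or []]
    if not any(o in ("no-new-privileges", "no-new-privileges:true") for o in opts):
        findings.append("no-new-privileges is not set")
    if "seccomp:unconfined" in opts:
        findings.append("seccomp is disabled")
    user = str(svc.get("user", "")).split(":", 1)[0]
    if user in ("", "0", "root"):
        findings.append("no non-root user is set")
    return findings


hardened = {
    "image": "myapp:latest",
    "cap_drop": ["ALL"],
    "cap_add": ["NET_BIND_SERVICE"],
    "read_only": True,
    "security_opt": ["no-new-privileges:true"],
    "user": "1000:1000",
    "volumes": ["config-volume:/etc/myapp:ro"],
}
bad = {
    "image": "some-app:latest",
    "privileged": True,
    "volumes": ["/var/run/docker.sock:/var/run/docker.sock"],
}

print(audit_service(hardened))
for finding in audit_service(bad):
    print("-", finding)
```

Wiring something like this into a pre-deploy script keeps the checklist from depending on your memory at 11 PM.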
Get those right, and you’ve eliminated 90% of the low-hanging fruit. The determined hacker with a kernel CVE exploit will still be a problem. But casual lateral movement? Privilege escalation via sloppy config? That’s done.
Stay paranoid. Your 2 AM self will thank you.