Your App Has Never Actually Been Tested
I don’t mean unit tested or integration tested. I mean tested under the conditions it will eventually face: a network partition, a slow database, a memory-starved container, a dependency that returns 503 for 30 seconds.
Your tests run in clean, fast, cooperative environments. Production is chaotic, slow, and occasionally on fire. The gap between those two environments is where bugs live. Specifically, it’s where the bugs that take down your system at 11pm on a Friday live.
Chaos engineering is the discipline of deliberately introducing failures in controlled ways to discover how your system actually behaves when things go wrong — before your users find out the embarrassing answer.
The Netflix Origin Story
Netflix popularized chaos engineering with Chaos Monkey, a tool that randomly terminated EC2 instances in production. The reasoning: if your system could go down randomly at any time, you had two choices — build it to handle that, or get paged constantly. Netflix chose to build resilience.
The Simian Army followed: Chaos Gorilla (took down entire availability zones), Latency Monkey (introduced artificial delays), Conformity Monkey (checked instances against best practices). The entire Netflix approach was “find the weaknesses before the weaknesses find you.”
Most of us aren’t running Netflix-scale infrastructure. But the principle is universal: the failure you practice is the failure you survive. The failure you’ve never seen is the one that pages you at 2am.
Blast Radius: Don’t Blow Up Everything at Once
The most important concept in chaos engineering isn’t the chaos itself — it’s the blast radius.
Blast radius is the scope of impact when your experiment goes wrong. And chaos experiments do go wrong. That’s the point. You need to contain how wrong they can go.
Reducing blast radius:
- Run experiments in staging first, always
- Start with individual containers, not entire services
- Run experiments during business hours when your team is awake
- Have a clearly defined “abort” procedure before starting
- Monitor during the experiment, not after
- Start small: kill one container, not the whole pod
The goal is to discover failures, not create new production incidents. “We found out our retry logic is broken” is a win. “We took down prod because we got overconfident about blast radius” is not.
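The abort procedure is worth making concrete before anything runs. As a minimal, tool-agnostic sketch: `inject`, `revert`, and `check_abort` below are placeholders for your own wrappers around pumba or Toxiproxy, and the harness guarantees the chaos is reverted no matter how the run ends.

```python
import time

def run_experiment(inject, revert, check_abort, duration_s, poll_s=1.0):
    """Minimal chaos-experiment harness (illustrative, tool-agnostic).

    inject/revert are callables wrapping your chaos tool; check_abort
    returns True when an abort criterion is met (say, error rate above
    a threshold). revert always runs, no matter how the run ends.
    """
    aborted = False
    inject()
    try:
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            if check_abort():  # monitor during the experiment, not after
                aborted = True
                break
            time.sleep(poll_s)
    finally:
        revert()  # the abort path always executes, even on a crash
    return aborted
```

If `check_abort` polls your monitoring system's error-rate endpoint, the abort criteria from your runbook become executable rather than aspirational.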
Pumba: Chaos for Docker Containers
Pumba is a chaos engineering tool specifically built for Docker. It can kill containers, pause them, delay their network traffic, corrupt packets, and limit bandwidth. All configurable, all targeted.
Installing Pumba
```shell
# Pull the Docker image (easiest approach)
docker pull gaiaadm/pumba
```
```shell
# Or install the binary
wget https://github.com/alexei-led/pumba/releases/latest/download/pumba_linux_amd64
chmod +x pumba_linux_amd64
sudo mv pumba_linux_amd64 /usr/local/bin/pumba
```
Basic Container Kill
```shell
# Kill a container (let Docker's restart policy bring it back)
pumba kill myapp

# Kill with a specific signal
pumba kill --signal SIGTERM myapp

# Kill random containers matching a pattern every 30 seconds
pumba --random --interval 30s kill re2:myapp.*

# Kill and remove the container
pumba rm myapp
```
Test how your application handles container restarts. Does it reconnect to databases? Does it drain in-flight requests? Does anything upstream notice?
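Reconnect logic is the piece this experiment most often exposes. A sketch of the usual pattern, where `connect` stands in for whatever your driver provides (a `psycopg2.connect` call bound to your DSN, for example; that pairing is an assumption, not something Pumba requires):

```python
import time

def connect_with_retry(connect, attempts=6, base_delay=0.5):
    """Retry a (re)connection across a container restart.

    connect is any zero-argument callable that opens a connection and
    raises on failure. Exponential delays mean a slow restart (Postgres
    can take tens of seconds) doesn't burn every attempt in the first
    second.
    """
    last_err = None
    for attempt in range(attempts):
        try:
            return connect()
        except Exception as err:  # narrow this to your driver's errors
            last_err = err
            time.sleep(base_delay * 2 ** attempt)
    raise ConnectionError("database unreachable after retries") from last_err
```

Run the kill experiment with and without this in place; the difference in your error logs is the value of the exercise.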
Pausing Containers
```shell
# Pause a container for 10 seconds (simulates a hung process)
pumba pause --duration 10s myapp
```
This is nastier than killing — a paused process doesn’t fail, it just stops responding. HTTP connections time out instead of getting refused. This is how you find out your timeouts are set to 60 seconds when they should be 5.
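The defense to verify here is that every outbound call carries an explicit timeout. A stdlib-only sketch (the URL and the return-None contract are illustrative choices, not a prescribed API):

```python
from urllib.request import urlopen

def fetch_with_timeout(url, timeout_s=5.0):
    """A paused upstream doesn't refuse connections; it accepts them and
    never answers. Without an explicit timeout this call hangs for the
    OS default, which can be minutes."""
    try:
        with urlopen(url, timeout=timeout_s) as resp:
            return resp.read()
    except OSError:  # covers URLError, socket timeouts, resets
        return None  # caller decides: retry, serve a fallback, or surface an error
```

Pause the upstream container with Pumba while a loop calls this; the call should return None after `timeout_s`, not hang for a minute.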
Network Chaos
This is where Pumba gets interesting. It uses tc (traffic control) under the hood to introduce real network problems:
```shell
# Add 200ms latency to all traffic from a container
pumba netem --duration 1m delay --time 200 myapp

# Add 200ms latency with 50ms jitter (more realistic)
pumba netem --duration 1m delay --time 200 --jitter 50 myapp

# Add packet loss (10% of packets dropped)
pumba netem --duration 1m loss --percent 10 myapp

# Corrupt packets (2% corruption)
pumba netem --duration 1m corrupt --percent 2 myapp

# Rate limit bandwidth to 100kbps
pumba netem --duration 1m rate --rate 100kbit myapp

# Apply chaos to specific targets only
pumba netem --tc-image "gaiaadm/tc-netem" \
  --duration 1m \
  --target 10.0.0.5/32 \
  delay --time 300 myapp
```
Running Pumba Against a Compose Stack
```shell
# Target all containers with "api" in the name
pumba netem --duration 2m delay --time 100 re2:.*api.*

# Kill the database every 30 seconds, see if the API handles reconnection
pumba --interval 30s kill myproject_db_1
```
What to watch during network chaos:
- Error rates in your application logs
- Client-facing response times
- Whether retries are working (or causing thundering herd)
- Whether circuit breakers trip (if you have them)
- Whether timeouts are appropriately configured
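The thundering-herd item deserves a concrete shape. A common fix is full-jitter exponential backoff; this sketch is generic, with function names of my own choosing:

```python
import random
import time

def backoff_with_jitter(attempt, base=0.5, cap=30.0):
    """Full-jitter backoff: sleep a random amount up to an exponentially
    growing cap. After an outage clears, clients retry at scattered
    times instead of all at once."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def call_with_retries(fn, attempts=5, base=0.5):
    """Retry fn with jittered backoff; re-raise on the final failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:  # narrow to retryable errors in real code
            if attempt == attempts - 1:
                raise
            time.sleep(backoff_with_jitter(attempt, base=base))
```

Compare this against fixed-interval retries during a packet-loss experiment: fixed intervals tend to synchronize clients after recovery and show up as spikes on your dashboards.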
Toxiproxy: Network Failure Simulation with Surgical Precision
Toxiproxy (from Shopify) takes a different approach. Instead of injecting chaos into existing containers, it sits as a proxy between your services and lets you configure failure conditions via an HTTP API.
This is excellent for testing specific service-to-service communication. “What happens when the database is slow?” becomes a repeatable, scriptable test.
Running Toxiproxy
```yaml
# docker-compose.yml (add to your existing stack)
services:
  toxiproxy:
    image: ghcr.io/shopify/toxiproxy
    ports:
      - "8474:8474"   # API port
      - "5432:5432"   # Proxied PostgreSQL
      - "6379:6379"   # Proxied Redis
    restart: unless-stopped
```
Configuring Proxies via CLI
```shell
# Install toxiproxy-cli
wget https://github.com/Shopify/toxiproxy/releases/latest/download/toxiproxy-cli-linux-amd64
chmod +x toxiproxy-cli-linux-amd64
sudo mv toxiproxy-cli-linux-amd64 /usr/local/bin/toxiproxy-cli

# Create a proxy: myapp connects to toxiproxy:5432, which forwards to the real db:5432
toxiproxy-cli create --listen 0.0.0.0:5432 --upstream db:5432 postgres_proxy

# Create a Redis proxy
toxiproxy-cli create --listen 0.0.0.0:6379 --upstream redis:6379 redis_proxy
```
In your application’s config, point database connections at toxiproxy:5432 instead of db:5432. Everything passes through transparently until you add toxics.
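One low-friction way to do that swap is to resolve the endpoint from the environment, so only the test stack points at the proxy. The `DB_HOST`/`DB_PORT` variable names and the DSN shape below are my own conventions, not anything Toxiproxy prescribes:

```python
import os

def database_dsn(default_host="db", default_port="5432"):
    """Build the database endpoint from the environment. Normal runs use
    the defaults; the chaos/integration stack sets DB_HOST=toxiproxy so
    all traffic flows through the proxy without code changes."""
    host = os.environ.get("DB_HOST", default_host)
    port = os.environ.get("DB_PORT", default_port)
    return f"postgresql://app@{host}:{port}/app"
```

The same trick works for the Redis proxy; anything your app dials should be swappable per environment.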
Adding Failure Conditions (Toxics)
```shell
# Add 500ms latency to postgres connections
toxiproxy-cli toxic add --type latency --attribute latency=500 postgres_proxy

# Add 100ms latency with 50ms jitter
toxiproxy-cli toxic add \
  --type latency \
  --attribute latency=100 \
  --attribute jitter=50 \
  postgres_proxy

# Limit bandwidth to 56 KB/s (the rate attribute is in kilobytes per second)
toxiproxy-cli toxic add \
  --type bandwidth \
  --attribute rate=56 \
  postgres_proxy

# Reset 10% of connections (TCP RST)
toxiproxy-cli toxic add \
  --type reset_peer \
  --toxicity 0.1 \
  postgres_proxy

# Stop all data and close the connection after 1 second
toxiproxy-cli toxic add \
  --type timeout \
  --attribute timeout=1000 \
  postgres_proxy

# Delay the connection close by 5 seconds
toxiproxy-cli toxic add \
  --type slow_close \
  --attribute delay=5000 \
  postgres_proxy
```
Remove Toxics When Done
```shell
# List active toxics
toxiproxy-cli inspect postgres_proxy

# Remove a specific toxic
toxiproxy-cli toxic remove --toxicName latency_downstream postgres_proxy

# Reset all toxics on a proxy
toxiproxy-cli toxic reset postgres_proxy
```
Toxiproxy via HTTP API
For automated testing, the API is cleaner than the CLI:
```shell
# Create a proxy
curl -X POST http://toxiproxy:8474/proxies \
  -H "Content-Type: application/json" \
  -d '{"name":"db","listen":"0.0.0.0:5432","upstream":"db:5432","enabled":true}'

# Add a latency toxic
curl -X POST http://toxiproxy:8474/proxies/db/toxics \
  -H "Content-Type: application/json" \
  -d '{"name":"db_latency","type":"latency","attributes":{"latency":500}}'

# Remove it
curl -X DELETE http://toxiproxy:8474/proxies/db/toxics/db_latency

# Disable the proxy entirely (complete connection failure)
curl -X POST http://toxiproxy:8474/proxies/db \
  -H "Content-Type: application/json" \
  -d '{"enabled":false}'
```
This lets you write integration tests that inject failures:
```python
import requests

TOXIPROXY_API = "http://localhost:8474"

def add_db_latency(ms=500):
    requests.post(f"{TOXIPROXY_API}/proxies/db/toxics", json={
        "name": "db_latency",
        "type": "latency",
        "attributes": {"latency": ms},
    })

def remove_db_latency():
    requests.delete(f"{TOXIPROXY_API}/proxies/db/toxics/db_latency")

def test_api_handles_slow_database():
    add_db_latency(2000)
    try:
        response = requests.get("http://app:8080/api/users", timeout=5)
        assert response.status_code == 200
        assert response.elapsed.total_seconds() < 4
    finally:
        remove_db_latency()
```
That your app never properly handled a 2-second database response is better discovered in a test than in production.
Writing a Chaos Runbook
Before running any chaos experiment, write a runbook. This forces you to think through the experiment clearly and gives you an abort path.
```markdown
## Chaos Experiment: Database Latency Spike

**Hypothesis**: When database response time exceeds 1 second, the API
should return cached responses for reads and queue writes, with <5%
error rate and median latency under 500ms.

**Blast Radius**: API service in staging environment only.

**Duration**: 10 minutes.

**Pre-conditions**:
- [ ] Staging traffic is under 10 req/s baseline
- [ ] Monitoring dashboard is open
- [ ] Team channel is notified

**Experiment Steps**:
1. Establish baseline: capture error rate, p50/p95 latency
2. Add Toxiproxy toxic: db latency 1000ms
3. Run load test for 5 minutes
4. Record observed behavior
5. Remove toxic
6. Verify system recovers

**Abort Criteria**:
- Error rate exceeds 50% for >30 seconds
- Experiment escapes staging

**Expected Results**:
- Read endpoints serve cached data within 100ms
- Write endpoints queue and eventually drain
- Error rate < 5%

**Actual Results**: [fill in after]

**Follow-up Actions**: [fill in after]
```
The runbook is living documentation. When the experiment reveals something unexpected (it will), the follow-up actions section is how chaos engineering actually improves your system.
Starting Small: The Chaos Engineering Maturity Ladder
Don’t start with “randomly kill production databases.” Work up to it.
Level 1: Staging, manual, single component
Kill one container in staging. Watch what happens. Write it down.

Level 2: Staging, manual, network conditions
Add latency between services in staging. Check your dashboards.

Level 3: Staging, scheduled, multiple scenarios
Run a suite of chaos experiments as part of your staging pipeline.

Level 4: Production, controlled, with feature flags
Run experiments in production against a percentage of traffic, with kill switches.

Level 5: GameDays
Scheduled exercises where the team practices responding to failure. Netflix runs these regularly. The goal is making failure response as practiced as deployment.
Most home labs and small teams live at Level 1-2, and that’s completely fine. The first time you run Pumba against your Docker Compose stack and discover your Postgres container can take 45 seconds to restart and nothing handles it gracefully, you’ve gotten real value from chaos engineering.
The chaos was always there. Now you can see it.