Your App Has Never Actually Been Tested
I don’t mean unit tested or integration tested. I mean tested under the conditions it will eventually face: a network partition, a slow database, a memory-starved container, a dependency that returns 503 for 30 seconds.
Your tests run in clean, fast, cooperative environments. Production is chaotic, slow, and occasionally on fire. The gap between those two environments is where bugs live. Specifically, it’s where the bugs that take down your system at 11pm on a Friday live.
Chaos engineering is the discipline of deliberately introducing failures in controlled ways to discover how your system actually behaves when things go wrong — before your users find out the embarrassing answer.
The Netflix Origin Story
Netflix popularized chaos engineering with Chaos Monkey, a tool that randomly terminated EC2 instances in production. The reasoning: if your system could go down randomly at any time, you had two choices — build it to handle that, or get paged constantly. Netflix chose to build resilience.
The Simian Army followed: Chaos Gorilla (took down entire availability zones), Latency Monkey (introduced artificial delays), Conformity Monkey (checked instances against best practices). The entire Netflix approach was “find the weaknesses before the weaknesses find you.”
Most of us aren’t running Netflix-scale infrastructure. But the principle is universal: the failure you practice is the failure you survive. The failure you’ve never seen is the one that pages you at 2am.
Blast Radius: Don’t Blow Up Everything at Once
The most important concept in chaos engineering isn’t the chaos itself — it’s the blast radius.
Blast radius is the scope of impact when your experiment goes wrong. And chaos experiments do go wrong. That’s the point. You need to contain how wrong they can go.
Reducing blast radius:
- Run experiments in staging first, always
- Start with individual containers, not entire services
- Run experiments during business hours when your team is awake
- Have a clearly defined “abort” procedure before starting
- Monitor during the experiment, not after
- Start small: kill one container, not the whole pod
The goal is to discover failures, not create new production incidents. “We found out our retry logic is broken” is a win. “We took down prod because we got overconfident about blast radius” is not.
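The abort procedure is worth making concrete before anything runs. As a minimal, tool-agnostic sketch: `inject`, `revert`, and `check_abort` below are placeholders for your own wrappers around pumba or Toxiproxy, and the harness guarantees the chaos is reverted no matter how the run ends.

```python
import time

def run_experiment(inject, revert, check_abort, duration_s, poll_s=1.0):
    """Minimal chaos-experiment harness (illustrative, tool-agnostic).

    inject/revert are callables wrapping your chaos tool; check_abort
    returns True when an abort criterion is met (say, error rate above
    a threshold). revert always runs, no matter how the run ends.
    """
    aborted = False
    inject()
    try:
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            if check_abort():  # monitor during the experiment, not after
                aborted = True
                break
            time.sleep(poll_s)
    finally:
        revert()  # the abort path always executes, even on a crash
    return aborted
```

If `check_abort` polls your monitoring system's error-rate endpoint, the abort criteria from your runbook become executable rather than aspirational.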
Pumba: Chaos for Docker Containers
Pumba is a chaos engineering tool specifically built for Docker. It can kill containers, pause them, delay their network traffic, corrupt packets, and limit bandwidth. All configurable, all targeted.
Installing Pumba
```shell
# Pull the Docker image (easiest approach)
docker pull gaiaadm/pumba
```
```shell
# Or install the binary
wget https://github.com/alexei-led/pumba/releases/latest/download/pumba_linux_amd64
chmod +x pumba_linux_amd64
sudo mv pumba_linux_amd64 /usr/local/bin/pumba
```
Basic Container Kill
```shell
# Kill a container (let Docker's restart policy bring it back)
pumba kill myapp

# Kill with a specific signal
pumba kill --signal SIGTERM myapp

# Kill random containers matching a pattern every 30 seconds
pumba --random --interval 30s kill re2:myapp.*

# Kill and remove the container
pumba rm myapp
```
Test how your application handles container restarts. Does it reconnect to databases? Does it drain in-flight requests? Does anything upstream notice?
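Reconnect logic is the piece this experiment most often exposes. A sketch of the usual pattern, where `connect` stands in for whatever your driver provides (a `psycopg2.connect` call bound to your DSN, for example; that pairing is an assumption, not something Pumba requires):

```python
import time

def connect_with_retry(connect, attempts=6, base_delay=0.5):
    """Retry a (re)connection across a container restart.

    connect is any zero-argument callable that opens a connection and
    raises on failure. Exponential delays mean a slow restart (Postgres
    can take tens of seconds) doesn't burn every attempt in the first
    second.
    """
    last_err = None
    for attempt in range(attempts):
        try:
            return connect()
        except Exception as err:  # narrow this to your driver's errors
            last_err = err
            time.sleep(base_delay * 2 ** attempt)
    raise ConnectionError("database unreachable after retries") from last_err
```

Run the kill experiment with and without this in place; the difference in your error logs is the value of the exercise.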
Pausing Containers
```shell
# Pause a container for 10 seconds (simulates a hung process)
pumba pause --duration 10s myapp
```
This is nastier than killing — a paused process doesn’t fail, it just stops responding. HTTP connections time out instead of getting refused. This is how you find out your timeouts are set to 60 seconds when they should be 5.
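The defense to verify here is that every outbound call carries an explicit timeout. A stdlib-only sketch (the URL and the return-None contract are illustrative choices, not a prescribed API):

```python
from urllib.request import urlopen

def fetch_with_timeout(url, timeout_s=5.0):
    """A paused upstream doesn't refuse connections; it accepts them and
    never answers. Without an explicit timeout this call hangs for the
    OS default, which can be minutes."""
    try:
        with urlopen(url, timeout=timeout_s) as resp:
            return resp.read()
    except OSError:  # covers URLError, socket timeouts, resets
        return None  # caller decides: retry, serve a fallback, or surface an error
```

Pause the upstream container with Pumba while a loop calls this; the call should return None after `timeout_s`, not hang for a minute.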
Network Chaos
This is where Pumba gets interesting. It uses tc (traffic control) under the hood to introduce real network problems:
```shell
# Add 200ms latency to all traffic from a container
pumba netem --duration 1m delay --time 200 myapp

# Add 200ms latency with 50ms jitter (more realistic)
pumba netem --duration 1m delay --time 200 --jitter 50 myapp

# Add packet loss (10% of packets dropped)
pumba netem --duration 1m loss --percent 10 myapp

# Corrupt packets (2% corruption)
pumba netem --duration 1m corrupt --percent 2 myapp

# Rate limit bandwidth to 100kbps
pumba netem --duration 1m rate --rate 100kbit myapp

# Apply chaos to specific targets only
pumba netem --tc-image "gaiaadm/tc-netem" \
  --duration 1m \
  --target 10.0.0.5/32 \
  delay --time 300 myapp
```
Running Pumba Against a Compose Stack
```shell
# Target all containers with "api" in the name
pumba netem --duration 2m delay --time 100 re2:.*api.*

# Kill the database every 30 seconds, see if the API handles reconnection
pumba --interval 30s kill myproject_db_1
```
What to watch during network chaos:
- Error rates in your application logs
- Client-facing response times
- Whether retries are working (or causing thundering herd)
- Whether circuit breakers trip (if you have them)
- Whether timeouts are appropriately configured
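The thundering-herd item deserves a concrete shape. A common fix is full-jitter exponential backoff; this sketch is generic, with function names of my own choosing:

```python
import random
import time

def backoff_with_jitter(attempt, base=0.5, cap=30.0):
    """Full-jitter backoff: sleep a random amount up to an exponentially
    growing cap. After an outage clears, clients retry at scattered
    times instead of all at once."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def call_with_retries(fn, attempts=5, base=0.5):
    """Retry fn with jittered backoff; re-raise on the final failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:  # narrow to retryable errors in real code
            if attempt == attempts - 1:
                raise
            time.sleep(backoff_with_jitter(attempt, base=base))
```

Compare this against fixed-interval retries during a packet-loss experiment: fixed intervals tend to synchronize clients after recovery and show up as spikes on your dashboards.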
Toxiproxy: Network Failure Simulation with Surgical Precision
Toxiproxy (from Shopify) takes a different approach. Instead of injecting chaos into existing containers, it sits as a proxy between your services and lets you configure failure conditions via an HTTP API.
This is excellent for testing specific service-to-service communication. “What happens when the database is slow?” becomes a repeatable, scriptable test.
Running Toxiproxy
```yaml
# docker-compose.yml (add to your existing stack)
services:
  toxiproxy:
    image: ghcr.io/shopify/toxiproxy
    ports:
      - "8474:8474"   # API port
      - "5432:5432"   # Proxied PostgreSQL
      - "6379:6379"   # Proxied Redis
    restart: unless-stopped
```
Configuring Proxies via CLI
```shell
# Install toxiproxy-cli
wget https://github.com/Shopify/toxiproxy/releases/latest/download/toxiproxy-cli-linux-amd64
chmod +x toxiproxy-cli-linux-amd64
sudo mv toxiproxy-cli-linux-amd64 /usr/local/bin/toxiproxy-cli

# Create a proxy: myapp connects to toxiproxy:5432, which forwards to the real db:5432
toxiproxy-cli create --listen 0.0.0.0:5432 --upstream db:5432 postgres_proxy

# Create a Redis proxy
toxiproxy-cli create --listen 0.0.0.0:6379 --upstream redis:6379 redis_proxy
```
In your application’s config, point database connections at toxiproxy:5432 instead of db:5432. Everything passes through transparently until you add toxics.
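One low-friction way to do that swap is to resolve the endpoint from the environment, so only the test stack points at the proxy. The `DB_HOST`/`DB_PORT` variable names and the DSN shape below are my own conventions, not anything Toxiproxy prescribes:

```python
import os

def database_dsn(default_host="db", default_port="5432"):
    """Build the database endpoint from the environment. Normal runs use
    the defaults; the chaos/integration stack sets DB_HOST=toxiproxy so
    all traffic flows through the proxy without code changes."""
    host = os.environ.get("DB_HOST", default_host)
    port = os.environ.get("DB_PORT", default_port)
    return f"postgresql://app@{host}:{port}/app"
```

The same trick works for the Redis proxy; anything your app dials should be swappable per environment.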
Adding Failure Conditions (Toxics)
```shell
# Add 500ms latency to postgres connections
toxiproxy-cli toxic add --type latency --attribute latency=500 postgres_proxy

# Add 100ms latency with 50ms jitter
toxiproxy-cli toxic add \
  --type latency \
  --attribute latency=100 \
  --attribute jitter=50 \
  postgres_proxy

# Limit bandwidth to 56 KB/s (the rate attribute is in kilobytes per second)
toxiproxy-cli toxic add \
  --type bandwidth \
  --attribute rate=56 \
  postgres_proxy

# Reset 10% of connections (TCP RST)
toxiproxy-cli toxic add \
  --type reset_peer \
  --toxicity 0.1 \
  postgres_proxy

# Stop all data and close the connection after 1 second
toxiproxy-cli toxic add \
  --type timeout \
  --attribute timeout=1000 \
  postgres_proxy

# Delay the connection close by 5 seconds
toxiproxy-cli toxic add \
  --type slow_close \
  --attribute delay=5000 \
  postgres_proxy
```
Remove Toxics When Done
```shell
# List active toxics
toxiproxy-cli inspect postgres_proxy

# Remove a specific toxic
toxiproxy-cli toxic remove --toxicName latency_downstream postgres_proxy

# Reset all toxics on a proxy
toxiproxy-cli toxic reset postgres_proxy
```
Toxiproxy via HTTP API
For automated testing, the API is cleaner than the CLI:
```shell
# Create a proxy
curl -X POST http://toxiproxy:8474/proxies \
  -H "Content-Type: application/json" \
  -d '{"name":"db","listen":"0.0.0.0:5432","upstream":"db:5432","enabled":true}'

# Add a latency toxic
curl -X POST http://toxiproxy:8474/proxies/db/toxics \
  -H "Content-Type: application/json" \
  -d '{"name":"db_latency","type":"latency","attributes":{"latency":500}}'

# Remove it
curl -X DELETE http://toxiproxy:8474/proxies/db/toxics/db_latency

# Disable the proxy entirely (complete connection failure)
curl -X POST http://toxiproxy:8474/proxies/db \
  -H "Content-Type: application/json" \
  -d '{"enabled":false}'
```
This lets you write integration tests that inject failures:
```python
import requests

TOXIPROXY_API = "http://localhost:8474"

def add_db_latency(ms=500):
    requests.post(f"{TOXIPROXY_API}/proxies/db/toxics", json={
        "name": "db_latency",
        "type": "latency",
        "attributes": {"latency": ms},
    })

def remove_db_latency():
    requests.delete(f"{TOXIPROXY_API}/proxies/db/toxics/db_latency")

def test_api_handles_slow_database():
    add_db_latency(2000)
    try:
        response = requests.get("http://app:8080/api/users", timeout=5)
        assert response.status_code == 200
        assert response.elapsed.total_seconds() < 4
    finally:
        remove_db_latency()
```
That your app never properly handled a 2-second database response is better discovered in a test than in production.
Writing a Chaos Runbook
Before running any chaos experiment, write a runbook. This forces you to think through the experiment clearly and gives you an abort path.
```markdown
## Chaos Experiment: Database Latency Spike

**Hypothesis**: When database response time exceeds 1 second, the API
should return cached responses for reads and queue writes, with <5%
error rate and median latency under 500ms.

**Blast Radius**: API service in staging environment only.

**Duration**: 10 minutes.

**Pre-conditions**:
- [ ] Staging traffic is under 10 req/s baseline
- [ ] Monitoring dashboard is open
- [ ] Team channel is notified

**Experiment Steps**:
1. Establish baseline: capture error rate, p50/p95 latency
2. Add Toxiproxy toxic: db latency 1000ms
3. Run load test for 5 minutes
4. Record observed behavior
5. Remove toxic
6. Verify system recovers

**Abort Criteria**:
- Error rate exceeds 50% for >30 seconds
- Experiment escapes staging

**Expected Results**:
- Read endpoints serve cached data within 100ms
- Write endpoints queue and eventually drain
- Error rate < 5%

**Actual Results**: [fill in after]

**Follow-up Actions**: [fill in after]
```
The runbook is living documentation. When the experiment reveals something unexpected (it will), the follow-up actions section is how chaos engineering actually improves your system.
Starting Small: The Chaos Engineering Maturity Ladder
Don’t start with “randomly kill production databases.” Work up to it.
Level 1: Staging, manual, single component
Kill one container in staging. Watch what happens. Write it down.

Level 2: Staging, manual, network conditions
Add latency between services in staging. Check your dashboards.

Level 3: Staging, scheduled, multiple scenarios
Run a suite of chaos experiments as part of your staging pipeline.

Level 4: Production, controlled, with feature flags
Run experiments in production against a percentage of traffic, with kill switches.

Level 5: GameDays
Scheduled exercises where the team practices responding to failure. Netflix runs these regularly. The goal is making failure response as practiced as deployment.
Most home labs and small teams live at Level 1-2, and that’s completely fine. The first time you run Pumba against your Docker Compose stack and discover your Postgres container can take 45 seconds to restart and nothing handles it gracefully, you’ve gotten real value from chaos engineering.
The chaos was always there. Now you can see it.