The Confusion
You’ve got a container running. It crashes sometimes. You add a restart policy (restart: always in Compose). Now it restarts automatically. Problem solved.
Then you notice something weird. The container is running, but it’s not working. It responds with 502 errors. You check the logs — it’s stuck in an infinite restart loop, barely staying up long enough for your health checks to pass.
You needed a healthcheck, not just a restart policy. Or maybe both, but they’re doing different things.
Restart Policy: “Is The Container Running?”
A restart policy answers one question: if the container exits, should we start it again?
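    services:
      app:
        image: myapp:latest
        restart: on-failure:5

This says: if the container exits with a non-zero code, restart it, but give up after 5 attempts. (Under Swarm, the equivalent setting lives at deploy.restart_policy, with condition and max_attempts.)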
The key word: exits. The container process actually stops. The container itself terminates.
Common restart policies:
- no: don’t restart
- always: always restart, no matter what
- on-failure: only restart if the exit code is non-zero
- unless-stopped: always restart unless explicitly stopped
This is blunt. It’s “if the process dies, bring it back up.” But what if the process is running but completely broken? What if it’s stuck in an infinite loop? What if it’s consuming 100% CPU and hanging?
The restart policy won’t help. The container is still running.
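To make the "exits" condition concrete, here is a minimal sketch of the decision a restart policy makes. It is a simplified model, not Docker's implementation: the function name should_restart and its parameters are made up for illustration, and the difference between always and unless-stopped after a manual docker stop is deliberately left out.

```python
from typing import Optional


def should_restart(policy: str, exit_code: int, attempts: int = 0,
                   max_retries: Optional[int] = None) -> bool:
    """Decide whether to restart after the container process exits on its own.

    Simplified, hypothetical model of the four restart policies.
    """
    if policy == "no":
        return False
    if policy in ("always", "unless-stopped"):
        # Both restart after any exit. They differ only in how a manual
        # `docker stop` is treated, which this sketch does not model.
        return True
    if policy == "on-failure":
        if exit_code == 0:
            return False  # a clean exit is not a failure
        # Non-zero exit: restart unless the retry budget is used up.
        return max_retries is None or attempts < max_retries
    raise ValueError(f"unknown restart policy: {policy}")
```

Notice that every branch starts from an exit code. A hung process never produces one, so this decision is never even reached.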
Healthcheck: “Is The Container Healthy?”
A healthcheck answers a different question: is the container actually functioning?
    services:
      app:
        image: myapp:latest
        healthcheck:
          test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
          interval: 30s
          timeout: 10s
          retries: 3
          start_period: 40s

This healthcheck runs curl -f http://localhost:8080/health every 30 seconds. If it fails 3 times in a row, the container is marked as “unhealthy.”
But note: marking it unhealthy doesn’t automatically restart the container. A restart policy won’t do it either, because the process hasn’t exited. Something else has to act on the unhealthy status.
The states:
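- starting: the container has just started and no check has passed yet; failures during the first 40 seconds (start_period) don’t count against retries
- healthy: health checks pass
- unhealthy: health checks have failed retries times in a row
- (there is no exited state here; the container is still running)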
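The way these states follow from interval, start_period, and retries can be sketched as a small state machine. This is a simplified model under stated assumptions, not Docker's code: the name health_status is hypothetical, probes are assumed to run exactly one interval apart starting one interval after launch, and each probe is reduced to a pass/fail boolean.

```python
from typing import Iterable


def health_status(results: Iterable[bool], interval: int = 30,
                  start_period: int = 40, retries: int = 3) -> str:
    """Replay a sequence of probe results and return the resulting status."""
    status = "starting"
    streak = 0        # consecutive failures that count against `retries`
    started = False   # becomes True after the first successful probe

    for i, ok in enumerate(results):
        elapsed = (i + 1) * interval  # assumed time of this probe, in seconds
        if ok:
            status = "healthy"
            streak = 0
            started = True
            continue
        if not started and elapsed <= start_period:
            # Failures inside the grace period are ignored.
            continue
        streak += 1
        if streak >= retries:
            status = "unhealthy"
    return status
```

With the defaults above, four failed probes in a row are needed before the first unhealthy report: the probe at 30 seconds falls inside the 40-second start period and is not counted.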
Why You Need Both
A real example: your Node.js app has a memory leak. It’s running. It’s accepting connections. But it’s using 5GB of RAM and responding slowly.
- Restart policy alone: useless. The container is running, the process never exits, and so there is nothing to restart.
- Healthcheck alone: the health check fails. The container gets marked unhealthy. But nothing happens.
You need both:
    services:
      app:
        image: myapp:latest

        # If the process dies, restart it
        restart: on-failure:5

        # Monitor if it's actually healthy
        healthcheck:
          test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
          interval: 30s
          timeout: 10s
          retries: 3

        # When healthcheck fails, kill and restart
        # (Docker doesn't do this automatically; you need orchestration)

Wait, there’s a problem. Docker marks the container unhealthy, but it doesn’t automatically restart it. You need container orchestration (Docker Swarm, Kubernetes, etc.) to actually restart unhealthy containers.
Without orchestration, a failing healthcheck just sets a flag.
Docker Compose Only (No Orchestration)
If you’re just using Docker Compose locally or on a single server, only the restart policy triggers restarts. The healthcheck is informational.

    services:
      app:
        image: myapp:latest
        restart: on-failure:5

A healthcheck, if you add one, tells you (via docker ps) that something’s wrong, but the container won’t restart on its own.
To get automatic restarts based on health, you need:
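- Docker Swarm (with --health-cmd in the service definition)
- Kubernetes (with liveness probes that restart the failing container)
- A helper container such as willfarrell/autoheal, which restarts containers that report unhealthy (Watchtower is sometimes suggested for this, but it updates images and does not act on health status)
- Custom scripts that poll docker inspect and restart unhealthy containers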
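The last option, a custom script, can be sketched in a few lines of Python. This is one possible shape, not a hardened tool: the function names are hypothetical, and it uses docker ps with the health=unhealthy filter instead of inspecting each container one by one. The run parameter exists only so the Docker calls can be swapped out in tests.

```python
import subprocess
from typing import Callable, List


def find_unhealthy(run: Callable = subprocess.run) -> List[str]:
    """Return the names of running containers whose health status is unhealthy."""
    result = run(
        ["docker", "ps", "--filter", "health=unhealthy",
         "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    )
    return [name.strip() for name in result.stdout.splitlines() if name.strip()]


def restart_unhealthy(run: Callable = subprocess.run) -> List[str]:
    """Restart every unhealthy container and return the names that were restarted."""
    names = find_unhealthy(run)
    for name in names:
        run(["docker", "restart", name], check=True)
    return names
```

Calling restart_unhealthy() from cron or a systemd timer every minute or so gives a single server roughly the behaviour an orchestrator provides. There is no backoff here, so a container that never becomes healthy will be restarted on every pass.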
Docker Swarm + Healthcheck
If you’re using Swarm (not Compose), you can get automatic restarts:
    docker service create \
      --name myapp \
      --health-cmd="curl -f http://localhost:8080/health || exit 1" \
      --health-interval=30s \
      --health-timeout=10s \
      --health-retries=3 \
      myapp:latest

Swarm monitors the healthcheck and replaces unhealthy tasks.
Kubernetes Version
In Kubernetes, this is called a “liveness probe”:
    apiVersion: v1
    kind: Pod
    metadata:
      name: myapp
    spec:
      containers:
        - name: app
          image: myapp:latest
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 40
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 3

If the probe fails 3 times in a row, the kubelet kills the container and restarts it according to the pod’s restartPolicy (Always by default).
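The liveness probe fields line up closely with the Compose healthcheck fields used earlier. The sketch below spells out that correspondence as a small conversion function. The helper names are hypothetical, only simple durations such as 30s or 1m30s are handled, and treating start_period as initialDelaySeconds is an approximation: Docker still runs probes during the start period and merely ignores failures, while Kubernetes does not probe at all until the delay has passed.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def to_seconds(duration: str) -> int:
    """Convert a simple Compose duration such as '30s' or '1m30s' to seconds."""
    parts = re.findall(r"(\d+)([smh])", duration)
    # Reject anything the pattern did not fully consume (e.g. '500ms').
    if not parts or "".join(n + u for n, u in parts) != duration:
        raise ValueError(f"unsupported duration: {duration!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in parts)


def compose_to_liveness(healthcheck: dict) -> dict:
    """Map Compose healthcheck timing fields to liveness probe fields."""
    probe = {}
    if "interval" in healthcheck:
        probe["periodSeconds"] = to_seconds(healthcheck["interval"])
    if "timeout" in healthcheck:
        probe["timeoutSeconds"] = to_seconds(healthcheck["timeout"])
    if "retries" in healthcheck:
        probe["failureThreshold"] = int(healthcheck["retries"])
    if "start_period" in healthcheck:
        # Approximation; see the note above the code.
        probe["initialDelaySeconds"] = to_seconds(healthcheck["start_period"])
    return probe
```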
Writing A Good Healthcheck
Your healthcheck should be realistic. It should test the actual thing that matters.
Bad healthcheck:
    healthcheck:
      test: ["CMD", "test", "-f", "/tmp/app.pid"]

This just checks if a PID file exists. The app could be hung, and this would still pass.
Good healthcheck:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]

Makes an actual HTTP request. The app has to respond.
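The HTTP check is only as good as what the /health endpoint does on the other side. A minimal sketch of that side, assuming a Python app: the function name health_response and the check names are hypothetical, and wiring it into your web framework's /health route is left out.

```python
import json
from typing import Callable, Dict, Tuple


def health_response(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Run each named check and build an HTTP status code and JSON body.

    A check that raises is treated the same as a check that returns False.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False

    healthy = all(results.values())
    body = json.dumps({"status": "ok" if healthy else "fail", "checks": results})
    # curl -f treats any status of 400 or above as a failure, so 503 fails the probe.
    return (200 if healthy else 503), body
```

Returning 503 when any dependency check fails is a design choice. It makes the container report unhealthy whenever, say, the database is down, which may or may not be what you want if restarting the app can't fix the database.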
Better healthcheck:
    #!/bin/bash

    # Check if the process is running
    if ! pgrep -f "python app.py" > /dev/null; then
      exit 1
    fi

    # Check if HTTP endpoint responds
    if ! curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/health | grep -q "200"; then
      exit 1
    fi

    # Check if database is responsive
    if ! python -c "import psycopg2; psycopg2.connect('dbname=mydb')" 2>/dev/null; then
      exit 1
    fi

    exit 0

Then in Compose:
    healthcheck:
      test: /app/health.sh
      interval: 30s
      timeout: 10s
      retries: 3

Debugging Healthcheck Issues
Check the status:
    docker ps

Look at the STATUS column. It shows something like Up 5 minutes (healthy) or Up 2 minutes (unhealthy).
Or inspect:
    docker inspect --format='{{json .State.Health}}' <container>

Shows something like:

    {
      "Status": "unhealthy",
      "FailingStreak": 3,
      "Log": [
        {
          "Start": "2026-01-18T10:30:00Z",
          "End": "2026-01-18T10:30:05Z",
          "ExitCode": 1,
          "Output": "curl: (7) Failed to connect"
        }
      ]
    }

The Output tells you why it failed.
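If you check this often, the JSON can be boiled down to one line. A small sketch, assuming the shape shown above and that docker inspect prints null when the container has no healthcheck; the function name summarize_health is hypothetical.

```python
import json


def summarize_health(raw: str) -> str:
    """Turn the JSON from `docker inspect --format='{{json .State.Health}}'` into one line."""
    health = json.loads(raw)
    if not health:
        # Assumed to be `null` when no healthcheck is defined.
        return "no healthcheck configured"

    summary = (f"{health.get('Status', 'unknown')} "
               f"(failing streak: {health.get('FailingStreak', 0)})")

    # Keep only probe runs that failed, then report the last one in the log.
    failures = [entry for entry in health.get("Log") or []
                if entry.get("ExitCode", 0) != 0]
    if failures:
        last = failures[-1]
        summary += (f"; last failure exit {last['ExitCode']}: "
                    f"{last.get('Output', '').strip()}")
    return summary
```

Fed the example output above, this returns the status, the failing streak, and the curl error on a single line, which is convenient to drop into an alert or a log message.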
The Bottom Line
- Restart policy: answers “did the process exit?” Handles crashes.
- Healthcheck: answers “is the app actually working?” Detects stuck processes.
- For single-server setups: the restart policy does the restarting. Add healthchecks anyway so docker ps shows you when something is wrong.
- For Swarm/Kubernetes: healthchecks trigger automatic restarts.
- For everything: a good healthcheck is the hardest part. Make it test what actually matters.
Set them both up, and your containers will recover from most failures without human intervention.