Healthcheck vs Restart Policy: The Difference Matters

The Confusion

You’ve got a container running. It crashes sometimes. You add a restart policy: restart: always. Now it restarts automatically. Problem solved.

Then you notice something weird. The container is running, but it’s not working. Requests come back as 502 errors. You check docker ps and the container shows as Up. The process is hung, so it never exits, and the restart policy never fires.

You needed a healthcheck, not just a restart policy. Or maybe both, but they’re doing different things.

Restart Policy: “Is The Container Running?”

A restart policy answers one question: if the container exits, should we start it again?

services:
  app:
    image: myapp:latest
    restart: on-failure:5

This says: if the container exits with a non-zero code, restart it. But only retry 5 times.

The key word: exits. The container process actually stops. The container itself terminates.

Common restart policies:

no — don’t restart
always — always restart, no matter what
on-failure — only restart if the exit code is non-zero
unless-stopped — always restart unless explicitly stopped

This is blunt. It’s “if the process dies, bring it back up.” But what if the process is running but completely broken? What if it’s stuck in an infinite loop? What if it’s consuming 100% CPU and hanging?

The restart policy won’t help. The container is still running.
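
To make the point concrete, here is a minimal Python sketch of the decision a restart policy makes. It illustrates the rules listed above and is not Docker’s actual implementation; the function name should_restart and its parameters are made up for this example. Notice that it only ever runs after an exit, so a hung process never reaches it.

```python
from typing import Optional


def should_restart(policy: str, exit_code: int, manually_stopped: bool = False,
                   restarts_so_far: int = 0,
                   max_retries: Optional[int] = None) -> bool:
    """Approximate the restart decision made after a container has exited."""
    if manually_stopped:
        # An explicit `docker stop` suppresses the immediate restart
        # under every policy.
        return False
    if policy == "no":
        return False
    if policy in ("always", "unless-stopped"):
        return True
    if policy == "on-failure":
        if exit_code == 0:
            return False  # a clean exit is not treated as a failure
        return max_retries is None or restarts_so_far < max_retries
    raise ValueError(f"unknown restart policy: {policy}")
```

There is no branch for “running but broken” because the function has no input that describes it. That gap is what a healthcheck fills.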

Healthcheck: “Is The Container Healthy?”

A healthcheck answers a different question: is the container actually functioning?

services:
  app:
    image: myapp:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

This healthcheck runs curl http://localhost:8080/health every 30 seconds. If it fails 3 times in a row, the container is marked as “unhealthy.”

But note: marking it unhealthy doesn’t restart the container. A restart policy won’t do it either, because the process hasn’t exited. Something else has to act on that status.

The states:

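starting: no check has passed yet; failures during the first 40 seconds (start_period) don’t count toward retries
healthy: a check has passed, and fewer than 3 in a row have failed since
unhealthy: 3 checks in a row (retries) have failed
(no exited state: the container is still running)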
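
The transitions between these states can be sketched as a small simulation. This is a simplified model based on Docker’s documented behavior, not Docker’s code; simulate_health and its arguments are hypothetical names chosen for the example.

```python
from typing import Iterable, List, Tuple


def simulate_health(results: Iterable[Tuple[float, bool]], retries: int = 3,
                    start_period: float = 40.0) -> List[str]:
    """Return the health status after each probe.

    `results` holds (seconds_since_container_start, probe_passed) pairs.
    """
    status = "starting"
    failing_streak = 0
    history = []
    for elapsed, passed in results:
        if passed:
            # Any success resets the streak and marks the container healthy.
            failing_streak = 0
            status = "healthy"
        elif status == "starting" and elapsed < start_period:
            # Failures inside the start period are ignored until
            # the first success.
            pass
        else:
            failing_streak += 1
            if failing_streak >= retries:
                status = "unhealthy"
        history.append(status)
    return history
```

Feeding it five failed probes at 0, 30, 60, 90 and 120 seconds with the settings above gives four "starting" entries followed by "unhealthy": the first two failures fall inside the start period, and the next three use up the retries.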

Why You Need Both

A real example: your Node.js app has a memory leak. It’s running. It’s accepting connections. But it’s using 5GB of RAM and responding slowly.

Restart policy alone: useless. The container is running. The process never exits, so there’s no exit code to react to.
Healthcheck alone: once responses take longer than the timeout, the health check fails. The container gets marked unhealthy. But nothing happens.

You need both:

services:
  app:
    image: myapp:latest

    # If the process dies, restart it
    restart: on-failure:5

    # Monitor if it's actually healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3

    # When healthcheck fails, kill and restart
    # (Docker doesn't do this automatically — you need orchestration)

Wait, there’s a problem. Docker marks the container unhealthy, but it doesn’t automatically restart it. You need container orchestration (Docker Swarm, Kubernetes, etc.) to actually restart unhealthy containers.

Without orchestration, a failing healthcheck just sets a flag.

Docker Compose Only (No Orchestration)

If you’re just using Docker Compose locally or on a single server, only the restart policy triggers restarts. A healthcheck reports status and nothing more.

services:
  app:
    image: myapp:latest
    restart: on-failure:5

A healthcheck, if you add one, tells you (via docker ps) that something’s wrong, but the container won’t restart on its own.

To get automatic restarts based on health, you need:

Docker Swarm (with --health-cmd in the service definition)
Kubernetes (with liveness probes that trigger container restarts)
autoheal (the willfarrell/autoheal image, which watches for unhealthy containers and restarts them)
Custom scripts that poll docker inspect and restart unhealthy containers
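
The last option is simpler than it sounds. Here is a rough sketch, assuming the docker CLI is on the PATH and the script has permission to use it; restart_unhealthy and run_docker are names invented for this example. It asks docker ps for containers matching the health=unhealthy filter instead of inspecting each container in turn.

```python
import subprocess
from typing import Callable, List


def run_docker(args: List[str]) -> str:
    """Run a docker CLI command and return its stdout."""
    result = subprocess.run(["docker", *args], capture_output=True,
                            text=True, check=True)
    return result.stdout


def restart_unhealthy(run: Callable[[List[str]], str] = run_docker) -> List[str]:
    """Restart every container reported as unhealthy; return their IDs."""
    output = run(["ps", "--filter", "health=unhealthy", "--format", "{{.ID}}"])
    container_ids = output.split()
    for container_id in container_ids:
        run(["restart", container_id])
    return container_ids

# One way to use it: call restart_unhealthy() every 30 seconds from a loop,
# a cron job, or a systemd timer.
```

The run parameter exists so the logic can be exercised without Docker. A real watchdog would also want logging and some backoff, so that a container which never becomes healthy isn’t restarted forever.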

Docker Swarm + Healthcheck

If you’re using Swarm (not Compose), you can get automatic restarts:

docker service create \
  --name myapp \
  --health-cmd="curl -f http://localhost:8080/health || exit 1" \
  --health-interval=30s \
  --health-timeout=10s \
  --health-retries=3 \
  myapp:latest

Swarm monitors the healthcheck and replaces unhealthy tasks.

Kubernetes Version

In Kubernetes, this is called a “liveness probe”:

apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
  - name: app
    image: myapp:latest
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 40
      periodSeconds: 30
      timeoutSeconds: 10
      failureThreshold: 3

If the probe fails 3 times in a row, the kubelet kills the container and restarts it according to the pod’s restartPolicy.

Writing A Good Healthcheck

Your healthcheck should be realistic. It should test the actual thing that matters.

Bad healthcheck:

healthcheck:
  test: ["CMD", "test", "-f", "/tmp/app.pid"]

This just checks if a PID file exists. The app could be hung, and this would still pass.

Good healthcheck:

healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8080/health"]

Makes an actual HTTP request. The app has to respond.
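
What sits behind /health matters as much as the check itself. Here is one possible shape for that endpoint, sketched with Python’s standard library. The names health_response, make_health_server and the checks dictionary are assumptions for illustration; the actual app’s endpoint isn’t shown in this post.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple

Checks = Dict[str, Callable[[], bool]]


def health_response(checks: Checks) -> Tuple[int, str]:
    """Run each named check; return (HTTP status, body) for /health."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that raises counts as a failure
        if not ok:
            failed.append(name)
    if failed:
        # curl -f treats any status >= 400 as an error, so 503 fails the probe.
        return 503, "failing: " + ", ".join(failed) + "\n"
    return 200, "ok\n"


def make_health_server(port: int, checks: Checks,
                       host: str = "127.0.0.1") -> HTTPServer:
    """Build a server exposing /health. Bind to 0.0.0.0 instead if the
    probe comes from outside the container."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            if self.path != "/health":
                self.send_error(404)
                return
            status, body = health_response(checks)
            payload = body.encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return HTTPServer((host, port), HealthHandler)

# Example (db_is_up is a hypothetical function returning True or False):
# make_health_server(8080, {"db": db_is_up}).serve_forever()
```

Naming the failing checks in the body pays off later, because that text ends up in the healthcheck output that docker inspect shows.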

Better healthcheck:

#!/bin/bash
# Check if the process is running
if ! pgrep -f "python app.py" > /dev/null; then
  exit 1
fi

# Check if HTTP endpoint responds
if ! curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/health | grep -q "200"; then
  exit 1
fi

# Check if database is responsive
if ! python -c "import psycopg2; psycopg2.connect('dbname=mydb')" 2>/dev/null; then
  exit 1
fi

exit 0

Then in Compose:

healthcheck:
  test: /app/health.sh
  interval: 30s
  timeout: 10s
  retries: 3
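
Slim base images often ship without curl. If the image already has Python, the HTTP part of the check can be done with the standard library instead. A minimal sketch, where probe is a hypothetical helper and the URL is the one used throughout this post:

```python
import http.client
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 5.0) -> int:
    """Return 0 if `url` answers with a 2xx status, otherwise 1."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 0 if 200 <= response.status < 300 else 1
    except (urllib.error.URLError, http.client.HTTPException,
            OSError, ValueError):
        # Refused connections, timeouts, HTTP error statuses,
        # broken responses, and malformed URLs all count as unhealthy.
        return 1

# As a standalone healthcheck script, finish with:
# raise SystemExit(probe("http://localhost:8080/health"))
```

Docker only looks at the exit code, so returning 0 or 1 is all the script has to do. Keep the timeout passed to probe below the healthcheck timeout so the script reports the failure itself instead of being killed.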

Debugging Healthcheck Issues

Check the status:

docker ps

Look for the STATUS column. It shows something like: Up 5 minutes (healthy) or Up 2 minutes (unhealthy).

Or inspect:

docker inspect <container> --format='{{json .State.Health}}'

Shows something like:

{
  "Status": "unhealthy",
  "FailingStreak": 3,
  "Log": [
    {
      "Start": "2026-01-18T10:30:00Z",
      "End": "2026-01-18T10:30:05Z",
      "ExitCode": 1,
      "Output": "curl: (7) Failed to connect"
    }
  ]
}

The Output tells you why it failed.
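
If you want this in a script or an alert, the JSON is easy to pick apart. A small sketch, assuming the output of the docker inspect command above is passed in as a string; last_failure is a made-up helper name.

```python
import json
from typing import Optional


def last_failure(health_json: str) -> Optional[str]:
    """Return the output of the most recent failed probe, or None."""
    # `.State.Health` is null for containers without a healthcheck.
    health = json.loads(health_json) or {}
    for entry in reversed(health.get("Log") or []):
        if entry.get("ExitCode", 0) != 0:
            return entry.get("Output", "").strip()
    return None
```

The function walks the log from the end, on the assumption that Docker appends newer probe results last.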

The Bottom Line

Restart policy: answers “did the process exit?” Handles crashes.
Healthcheck: answers “is the app actually working?” Detects stuck processes.
For single-server setups: a restart policy handles crashes. Add healthchecks so you can spot hung containers, and a watchdog if you want them restarted automatically.
For Swarm/Kubernetes: Swarm healthchecks and Kubernetes liveness probes trigger automatic restarts.
For everything: a good healthcheck is the hardest part. Make it test what actually matters.

Set them both up, make sure something acts on the health status, and your containers will recover from most failures without human intervention.