SumGuy's Ramblings

Docker Health Checks: Because "It's Running" Doesn't Mean "It's Working"

The Lies Docker Tells You

You check your containers. docker ps says everything is Up 3 hours. You breathe a sigh of relief and go grab a coffee.

Meanwhile, your PostgreSQL container has been refusing connections for the past 45 minutes. Your API container is returning 500 errors to everyone who tries to use it. Your Redis instance is alive in the same way a houseplant you forgot about is alive — technically present, contributing nothing.

The problem is simple: Docker’s default idea of “running” means “the main process hasn’t exited.” That’s it. That’s the whole check. Your app could be stuck in an infinite loop, deadlocked, out of memory but not quite enough to crash, or just sitting there contemplating the void. As long as PID 1 hasn’t died, Docker gives it the green light.

This is like asking “is the patient alive?” and only checking if they have a pulse. Technically useful, but you might want to ask a few more questions before declaring them fit for duty.

That’s where health checks come in.

What Docker Health Checks Actually Do

A Docker health check is a command that runs inside your container on a schedule. If the command exits with code 0, the container is “healthy.” If it exits with code 1, it’s “unhealthy.” (Exit code 2 is reserved by Docker; don’t use it.) If it doesn’t respond in time, that counts as a failure too.

Docker tracks three states:

  - starting: the container is inside its start period and hasn’t passed a check yet
  - healthy: the most recent check passed
  - unhealthy: the check has failed the configured number of times in a row

That’s it. Simple concept, massive impact. Once you have real health information, you can do things like:

  - see actual status in docker ps instead of a bare “Up 3 hours”
  - make dependent services wait with depends_on and condition: service_healthy
  - automatically restart unhealthy containers with a tool like autoheal

HEALTHCHECK in Dockerfiles

The first place you can define a health check is directly in your Dockerfile using the HEALTHCHECK instruction. This bakes the check into the image itself, which means every container spawned from that image gets the check automatically.

Basic Syntax

HEALTHCHECK [OPTIONS] CMD command

The options you can set:

Option            Default  What It Does
--interval        30s      Time between health checks
--timeout         30s      Max time to wait for a check to complete
--start-period    0s       Grace period for container startup
--start-interval  5s       Interval during the start period (Docker 25+)
--retries         3        Consecutive failures before “unhealthy”
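With the defaults above, here is a back-of-envelope worst case for how long a dead container can sit before Docker flags it: roughly one interval per retry, plus up to one timeout if the last check hangs. A rough model, not an exact guarantee:

```shell
#!/bin/sh
# Rough worst case before "unhealthy" with the default settings:
# one interval per consecutive failure, plus up to one timeout
# if the final check hangs rather than failing fast.
INTERVAL=30; TIMEOUT=30; RETRIES=3
echo "worst case: ~$(( INTERVAL * RETRIES + TIMEOUT ))s before Docker flags the container"
```

Two minutes of undetected downtime with the defaults. Tightening `--timeout` alone already shaves a good chunk off that.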

A Simple Web App Example

FROM node:20-alpine

WORKDIR /app
COPY . .
RUN npm ci --production

HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
  CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1

CMD ["node", "server.js"]

This checks every 30 seconds whether your Node app responds to HTTP requests on the /health endpoint. It gives the app 10 seconds to boot up before it starts counting failures, and it needs 3 consecutive failures before declaring the container unhealthy.

Why wget Over curl?

You’ll notice I used wget there instead of curl. Here’s the thing — Alpine-based images (which a lot of Docker images use) include wget by default but not curl. If you’re using a Debian-based image, curl is usually available. Use whatever’s already in your image to avoid installing extra packages just for health checks.

# Alpine-based (wget available)
HEALTHCHECK CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1

# Debian-based (curl available)
HEALTHCHECK CMD curl --fail http://localhost:3000/health || exit 1

The --fail flag on curl is important. Without it, curl exits 0 even on HTTP 500 responses, which completely defeats the purpose. The --spider flag on wget makes it check without downloading the body. Both keep things lightweight.

Disabling a Health Check

If a parent image has a HEALTHCHECK and you want to remove it:

HEALTHCHECK NONE

You probably shouldn’t do this unless you have a very good reason, but the option exists.

Health Checks in Docker Compose

The more common (and flexible) approach is defining health checks in your docker-compose.yml. This is where most people should start, especially if you’re not building custom images.

Basic Compose Health Check

services:
  api:
    image: my-api:latest
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s

The test field supports three formats:

# Shell form (runs in /bin/sh -c)
test: curl --fail http://localhost:8080/health || exit 1

# Exec form (no shell involved, slightly more reliable)
test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]

# CMD-SHELL form (like shell form but explicit)
test: ["CMD-SHELL", "curl --fail http://localhost:8080/health || exit 1"]

The exec form (CMD) is generally preferred because it doesn’t spawn a shell process and is more predictable. Use CMD-SHELL when you need shell features like pipes or ||.
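One way to see the difference without Docker at all: when no shell is involved, variable references are passed through as literal text. A stand-alone simulation of the two forms:

```shell
#!/bin/sh
# Shell form: a shell expands variables and understands || and pipes.
export PORT=8080
sh -c 'echo "shell form sees port $PORT"'

# Exec form: arguments reach the command untouched, so no expansion.
echo 'exec form sees port $PORT'
```

The first line prints the actual port; the second prints the literal string `$PORT`. This is exactly why an exec-form health check that references an environment variable quietly does the wrong thing.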

Tuning Interval, Timeout, and Retries

Getting these values right matters more than people think. Bad tuning leads to either containers marked unhealthy during temporary blips, or problems going undetected for way too long.

Interval

This is how often the health check runs. The default of 30 seconds is fine for most things, but consider:

  - Shorter intervals (5-10s) catch failures faster, at the cost of more overhead and noisier logs
  - Longer intervals (60s+) are fine for stable services where a minute of undetected downtime doesn’t matter
  - Services that gate others via depends_on benefit from shorter intervals, so dependents start sooner

Don’t set this to 1 second. You’re not monitoring the Space Shuttle. You’re running Nextcloud in your closet.
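To put intervals in perspective, here is the check volume a single container generates per day at a few common settings:

```shell
#!/bin/sh
# How many health checks per day one container generates
# at a given interval (86400 seconds in a day).
for interval in 5 30 60; do
  echo "interval=${interval}s -> $(( 86400 / interval )) checks/day"
done
```

At 30 seconds that’s 2,880 checks a day per container. Multiply by a dozen containers and you see why health checks should stay cheap.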

Timeout

How long to wait for the health check command to complete before calling it a failure. The default 30 seconds is usually too generous. If your health endpoint takes 30 seconds to respond, your app has bigger problems.

Retries

How many consecutive failures before the container is marked unhealthy. Default is 3, which is sensible. One failure could be a hiccup. Two might be a coincidence. Three is a pattern.
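The key word is consecutive: any passing check resets the streak. Here’s a toy simulation of that counting logic (the pass/fail pattern is invented for illustration):

```shell
#!/bin/sh
# Toy model of Docker's retry counting: only `retries` failures
# in a row flip the state; any success resets the streak.
RETRIES=3
streak=0
state=healthy
for result in pass fail fail pass fail fail fail; do
  if [ "$result" = pass ]; then
    streak=0
    state=healthy
  else
    streak=$((streak + 1))
    [ "$streak" -ge "$RETRIES" ] && state=unhealthy
  fi
  echo "check=$result streak=$streak state=$state"
done
```

Notice the two failures in the middle never trip the threshold because a success lands between them; only the final run of three flips the state.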

Start Period

The grace period after a container starts during which health check failures don’t count toward the retry limit. This is critical for apps with slow startup times.

healthcheck:
  test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 60s  # Give the app a full minute to boot

If your Java app takes 45 seconds to start (and let’s be honest, it probably does), a start_period of 60 seconds prevents it from being marked unhealthy before it even finishes loading its 47 Spring Boot dependencies.

Start Interval (Docker 25+)

This is a newer addition. During the start period, you might want to check more frequently so you know the moment the app becomes ready, rather than waiting for the regular interval.

healthcheck:
  test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 60s
  start_interval: 5s  # Check every 5s during startup, then switch to 30s

Neat feature. Use it.

Common Health Check Patterns

Here’s where the rubber meets the road. Let’s look at health check configurations for the services you’re actually running.

PostgreSQL

services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: myapp
      POSTGRES_USER: myuser
      POSTGRES_PASSWORD: supersecret
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U myuser -d myapp"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 30s

pg_isready is a utility that ships with PostgreSQL specifically for this purpose. It checks whether the server is accepting connections. No need to install anything extra, no need to run actual queries. It’s fast, lightweight, and exactly what you want.

If you need a deeper check that verifies the database is actually functional (not just accepting connections):

    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U myuser -d myapp && psql -U myuser -d myapp -c 'SELECT 1'"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s

The SELECT 1 confirms you can actually execute queries. Overkill for most setups, but useful if you’ve ever had Postgres accept connections while silently being in recovery mode.

Redis

services:
  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s

Redis ships with redis-cli, and ping returns PONG if the server is healthy. Redis starts fast, so a 10-second start period is plenty.

If you’re using Redis with authentication:

    healthcheck:
      test: ["CMD-SHELL", "redis-cli -a $${REDIS_PASSWORD} ping | grep PONG"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s

Note the $$ to escape the dollar sign in Compose — this ensures the variable is expanded inside the container, not by Compose.

MySQL / MariaDB

services:
  mysql:
    image: mysql:8
    environment:
      MYSQL_ROOT_PASSWORD: supersecret
      MYSQL_DATABASE: myapp
    healthcheck:
      test: ["CMD-SHELL", "mysqladmin ping -h localhost -u root -p$$MYSQL_ROOT_PASSWORD"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

MySQL is notoriously slow to start, so that generous start_period is doing important work. The mysqladmin ping command checks if the server is alive without running queries. Note the CMD-SHELL form: the exec form never invokes a shell, so $$MYSQL_ROOT_PASSWORD would be passed along literally instead of expanded inside the container.

Nginx

services:
  nginx:
    image: nginx:alpine
    healthcheck:
      test: ["CMD-SHELL", "curl --fail http://localhost:80/ || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 5s

Wait, Nginx on Alpine doesn’t have curl by default. Let’s fix that:

    healthcheck:
      test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:80/ || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 5s

Or if you want to avoid even that overhead, Nginx has a built-in approach. Add a simple health location to your config:

location /health {
    access_log off;
    return 200 "healthy\n";
    add_header Content-Type text/plain;
}

Then check that endpoint specifically. The access_log off prevents your health checks from flooding your logs — because nobody wants 2,880 “GET /health 200” entries per day cluttering things up.

Node.js / Express

For the Dockerfile approach:

HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
  CMD node -e "require('http').get('http://localhost:3000/health', (r) => process.exit(r.statusCode === 200 ? 0 : 1)).on('error', () => process.exit(1))"

For Compose:

services:
  api:
    build: .
    healthcheck:
      test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 15s

And in your Express app, create the health endpoint:

app.get('/health', (req, res) => {
  // Basic check
  res.status(200).json({ status: 'ok' });
});

For a more thorough check that verifies your app can actually do its job:

app.get('/health', async (req, res) => {
  try {
    // Check database connection
    await db.query('SELECT 1');
    // Check Redis connection
    await redis.ping();
    res.status(200).json({ 
      status: 'ok',
      db: 'connected',
      cache: 'connected'
    });
  } catch (error) {
    res.status(503).json({ 
      status: 'error',
      message: error.message 
    });
  }
});

This kind of deep health check tells you not just “the process is running” but “the app can actually serve requests and reach its dependencies.” That’s the good stuff.

depends_on with condition: service_healthy

This is arguably the most practical reason to set up health checks. Docker Compose’s depends_on by itself only waits for a container to start, not for it to be ready. Your API container will happily try to connect to a database that hasn’t finished initializing yet, crash, and leave you staring at logs wondering why.

The Problem

services:
  api:
    build: .
    depends_on:
      - postgres  # Only waits for the container to start, not for Postgres to be ready
  
  postgres:
    image: postgres:16

This is the “I told you to wait for me!” problem. The API starts, tries to connect to Postgres, Postgres is still in its initialization phase, connection refused, app crashes. Classic.

The Solution

services:
  api:
    build: .
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 15s

  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: myapp
      POSTGRES_USER: myuser
      POSTGRES_PASSWORD: supersecret
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U myuser -d myapp"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s

  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 10s

Now Compose will wait until both Postgres and Redis report as healthy before starting the API. No more race conditions. No more “retry connecting in a loop and hope for the best” logic in your application code (though you should still have that — belt and suspenders).
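That application-side retry loop can be a small POSIX-sh helper in your entrypoint. The sketch below fakes the probe with a counter so it runs anywhere; in real use the probe would be something like pg_isready -h postgres (hostname assumed from the example above):

```shell
#!/bin/sh
# Generic retry loop: run a probe command until it succeeds or give up.
wait_for() {
  max=$1; shift
  n=0
  until "$@"; do
    n=$((n + 1))
    if [ "$n" -ge "$max" ]; then
      echo "gave up after $max tries"
      return 1
    fi
    sleep 0.1   # use 1-2s when probing a real dependency
  done
  echo "ready after $n retries"
}

# Stand-in probe: succeeds on the 3rd call (simulates a DB coming up).
rm -f /tmp/probe_count
probe() {
  count=$(( $(cat /tmp/probe_count 2>/dev/null || echo 0) + 1 ))
  echo "$count" > /tmp/probe_count
  [ "$count" -ge 3 ]
}

wait_for 10 probe
```

In an entrypoint you’d end with `exec node server.js` (or whatever your main process is) after the wait succeeds.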

Other Conditions

Besides service_healthy, you can also use service_started (the default: just waits for the container to start) and service_completed_successfully (waits for a one-shot container to exit with code 0). The latter is perfect for migration jobs:

services:
  api:
    depends_on:
      postgres:
        condition: service_healthy
      migrations:
        condition: service_completed_successfully

  migrations:
    build: .
    command: npm run migrate
    depends_on:
      postgres:
        condition: service_healthy

  postgres:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U myuser -d myapp"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s

This pattern is chef’s kiss. Postgres starts and becomes healthy, migrations run and complete, then the API starts. Proper sequencing without hacky sleep scripts.

Custom Health Check Scripts

Sometimes a simple curl or ping isn’t enough. Maybe you need to check multiple things, or your health logic is complex enough that cramming it into a one-liner is a crime against readability.

Writing a Custom Script

Create a healthcheck.sh in your project:

#!/bin/sh
set -e

# Check if the web server responds
wget --no-verbose --tries=1 --spider http://localhost:8080/health || exit 1

# Check if the worker process is running
pgrep -f "worker" > /dev/null || exit 1

# Check if disk space is above threshold
USAGE=$(df / | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$USAGE" -gt 90 ]; then
  exit 1
fi

exit 0

Then in your Dockerfile:

COPY healthcheck.sh /usr/local/bin/healthcheck.sh
RUN chmod +x /usr/local/bin/healthcheck.sh

HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
  CMD /usr/local/bin/healthcheck.sh

Or in Compose, mount it as a volume:

services:
  app:
    image: my-app:latest
    volumes:
      - ./healthcheck.sh:/usr/local/bin/healthcheck.sh:ro
    healthcheck:
      test: ["CMD", "/usr/local/bin/healthcheck.sh"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 15s

Custom scripts let you check whatever you want — disk space, file existence, queue depth, connection counts, whether Mercury is in retrograde. Go wild (within reason).

Monitoring Integration

Health checks are great, but they’re only half the story if nobody’s watching. Here’s how to integrate them with monitoring.

Checking Health Status from the CLI

# See health status for all containers
docker ps --format "table {{.Names}}\t{{.Status}}"

# Get detailed health info for a specific container
docker inspect --format='{{json .State.Health}}' container_name | jq .

# Watch health status in real-time
watch -n 5 'docker ps --format "table {{.Names}}\t{{.Status}}"'

The docker inspect output gives you the last few health check results, including stdout/stderr from each check. Incredibly useful for debugging why something is being marked unhealthy.
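For reference, that health JSON looks roughly like the abridged sample below (a real Log entry also carries Start and End timestamps). If jq isn’t installed, plain sed can still pull out the headline fields:

```shell
#!/bin/sh
# Abridged sample of the shape `docker inspect --format='{{json .State.Health}}'`
# returns; values here are canned for illustration.
HEALTH='{"Status":"unhealthy","FailingStreak":4,"Log":[{"ExitCode":1,"Output":"connection refused"}]}'

# Crude field extraction with sed (fine for eyeballing, use jq for real work).
status=$(echo "$HEALTH" | sed -n 's/.*"Status":"\([^"]*\)".*/\1/p')
streak=$(echo "$HEALTH" | sed -n 's/.*"FailingStreak":\([0-9]*\).*/\1/p')
echo "status=$status failing_streak=$streak"
```

The Output field in each Log entry is the part you’ll stare at most: it’s the stdout/stderr of the failing check itself.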

Autoheal: Auto-Restart Unhealthy Containers

Autoheal watches for containers marked as unhealthy and restarts them automatically:

services:
  autoheal:
    image: willfarrell/autoheal
    container_name: autoheal
    restart: unless-stopped
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - AUTOHEAL_CONTAINER_LABEL=all
      - AUTOHEAL_INTERVAL=60
      - AUTOHEAL_START_PERIOD=300

This is the “have you tried turning it off and on again” approach, automated. It works surprisingly well for transient issues — memory leaks that build up slowly, connections that go stale, processes that get wedged.

A word of caution: if your container is unhealthy because of a configuration error, autoheal will dutifully restart it every 60 seconds forever. You’ll end up with a container that’s been restarted 1,440 times in a day and still doesn’t work. Check your logs.

Webhook Notifications

You can combine health monitoring with notification tools. Here’s a pattern using a simple script:

#!/bin/bash
UNHEALTHY=$(docker ps --filter "health=unhealthy" --format "{{.Names}}" 2>/dev/null)

if [ -n "$UNHEALTHY" ]; then
  curl -X POST "https://your-webhook-url" \
    -H "Content-Type: application/json" \
    -d "{\"text\": \"Unhealthy containers detected: $UNHEALTHY\"}"
fi

Throw that in a cron job and you’ve got poor man’s container monitoring. It’s not Datadog, but it’ll wake you up when things go sideways.
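The crontab entry for that might look like this; the script path is made up, so adjust it to wherever you saved the script:

```
# Check for unhealthy containers every 5 minutes
*/5 * * * * /usr/local/bin/check-unhealthy.sh
```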

Integration with Uptime Kuma

If you’re running Uptime Kuma (and if you’re not, you probably should be), you can point it at your health endpoints directly:

  1. Add a new monitor in Uptime Kuma
  2. Set the type to HTTP(s)
  3. Point it at http://your-service:port/health
  4. Set your check interval and alert thresholds
  5. Configure notifications (Discord, Slack, email, carrier pigeon)

Now you’ve got external health monitoring on top of Docker’s internal checks. Defense in depth, baby.

Putting It All Together: A Full Stack Example

Here’s a complete docker-compose.yml for a typical web application stack with proper health checks everywhere:

services:
  postgres:
    image: postgres:16
    container_name: myapp-db
    environment:
      POSTGRES_DB: myapp
      POSTGRES_USER: myuser
      POSTGRES_PASSWORD: supersecret
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U myuser -d myapp"]
      interval: 30s
      timeout: 5s
      retries: 5
      start_period: 30s
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    container_name: myapp-cache
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s
    restart: unless-stopped

  api:
    build: ./api
    container_name: myapp-api
    environment:
      DATABASE_URL: postgres://myuser:supersecret@postgres:5432/myapp
      REDIS_URL: redis://redis:6379
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 20s
    restart: unless-stopped

  nginx:
    image: nginx:alpine
    container_name: myapp-proxy
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      api:
        condition: service_healthy
    healthcheck:
      test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:80/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 5s
    restart: unless-stopped

  autoheal:
    image: willfarrell/autoheal
    container_name: myapp-autoheal
    restart: unless-stopped
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - AUTOHEAL_CONTAINER_LABEL=all
      - AUTOHEAL_INTERVAL=60

volumes:
  postgres_data:

Look at that startup chain: Postgres and Redis start first, the API waits until both are healthy, and Nginx waits until the API is healthy. Everything comes up in the right order, every time. No race conditions, no retry hacks, no “just add a sleep 10” nonsense.

Common Mistakes (and How to Avoid Them)

Let’s wrap up with the greatest hits of health check misconfigurations.

1. No Start Period

# Bad: App takes 30 seconds to start, fails health check immediately
healthcheck:
  test: curl --fail http://localhost:8080/health
  interval: 10s
  retries: 3

Your app starts, the first three checks fail before it’s ready, Docker marks it unhealthy, autoheal restarts it, and now you’re in a restart loop. Always set start_period to at least as long as your app takes to boot.
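The fix is mechanical: a start_period at least as long as the boot time, as covered earlier.

```yaml
# Fixed: failures during the first 45s don't count toward retries
healthcheck:
  test: curl --fail http://localhost:8080/health
  interval: 10s
  retries: 3
  start_period: 45s
```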

2. Timeout Longer Than Interval

# Bad: Check runs every 10s but waits 30s for a response
healthcheck:
  test: curl --fail http://localhost:8080/health
  interval: 10s
  timeout: 30s

If the check hangs, you’ll stack up multiple checks running simultaneously. Keep the timeout shorter than the interval.

3. Checking the Wrong Thing

# Bad: This only checks if a port is open, not if the app works
healthcheck:
  test: ["CMD-SHELL", "nc -z localhost 8080"]

A port being open means the process is listening. It doesn’t mean it can serve requests. Use an actual HTTP request to a health endpoint that exercises real functionality.
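A better version exercises the HTTP path with the same wget pattern used throughout this post (port and endpoint assumed):

```yaml
# Better: exercise the HTTP path, not just the socket
healthcheck:
  test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:8080/health || exit 1"]
```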

4. Health Check That’s Too Expensive

# Bad: Running a full database query with joins on every check
healthcheck:
  test: ["CMD-SHELL", "psql -U user -d db -c 'SELECT COUNT(*) FROM users JOIN orders ON ...'"]

Your health check should be fast and cheap. SELECT 1 or pg_isready, not a query that takes 5 seconds and hits every table. The health check runs constantly — treat it like a heartbeat, not a stress test.

5. Forgetting to Handle Authentication

# Bad: Redis has a password but health check doesn't use it
healthcheck:
  test: ["CMD", "redis-cli", "ping"]

If your service requires authentication, your health check needs those credentials too. Otherwise it’ll fail every time and you’ll spend 20 minutes debugging before the face-palm moment.

The Bottom Line

Docker health checks take about 5 minutes to set up and save you hours of “why is this broken and I didn’t notice” debugging. They’re one of those things that feel optional until the day they would have saved your weekend.

Start simple:

  1. Add pg_isready to your Postgres containers
  2. Add redis-cli ping to your Redis containers
  3. Add a /health endpoint to your web apps
  4. Use depends_on with condition: service_healthy
  5. Consider autoheal for auto-recovery

Then iterate. Tune your intervals. Add deeper checks. Integrate with monitoring. Before you know it, you’ll have a stack that not only runs but actually works — and tells you the second it doesn’t.

Because “it’s running” was never good enough. You just didn’t have a better answer until now.

