The Lies Docker Tells You
You check your containers. docker ps says everything is Up 3 hours. You breathe a sigh of relief and go grab a coffee.
Meanwhile, your PostgreSQL container has been refusing connections for the past 45 minutes. Your API container is returning 500 errors to everyone who tries to use it. Your Redis instance is alive in the same way a houseplant you forgot about is alive — technically present, contributing nothing.
The problem is simple: Docker’s default idea of “running” means “the main process hasn’t exited.” That’s it. That’s the whole check. Your app could be stuck in an infinite loop, deadlocked, out of memory but not quite enough to crash, or just sitting there contemplating the void. As long as PID 1 hasn’t died, Docker gives it the green light.
This is like asking “is the patient alive?” and only checking if they have a pulse. Technically useful, but you might want to ask a few more questions before declaring them fit for duty.
That’s where health checks come in.
What Docker Health Checks Actually Do
A Docker health check is a command that runs inside your container on a schedule. If the command exits with code 0, the container is “healthy.” Any other exit code marks it “unhealthy”; use 1 by convention, since Docker reserves exit code 2. If the command doesn’t respond in time, that counts as a failure too.
Docker tracks three states:
- starting — The container just started and is in its grace period
- healthy — The health check is passing
- unhealthy — The health check has failed enough consecutive times
That’s it. Simple concept, massive impact. Once you have real health information, you can do things like:
- Prevent traffic from hitting broken containers
- Make other containers wait until dependencies are actually ready
- Get alerted when something silently breaks
- Trigger automatic restarts through orchestration tools
- Stop pretending everything is fine at 3 AM
HEALTHCHECK in Dockerfiles
The first place you can define a health check is directly in your Dockerfile using the HEALTHCHECK instruction. This bakes the check into the image itself, which means every container spawned from that image gets the check automatically.
Basic Syntax
HEALTHCHECK [OPTIONS] CMD command
The options you can set:
| Option | Default | What It Does |
|---|---|---|
| --interval | 30s | Time between health checks |
| --timeout | 30s | Max time to wait for a check to complete |
| --start-period | 0s | Grace period for container startup |
| --start-interval | 5s | Interval during the start period (Docker 25+) |
| --retries | 3 | Consecutive failures before “unhealthy” |
A Simple Web App Example
FROM node:20-alpine
WORKDIR /app
COPY . .
RUN npm ci --production
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1
CMD ["node", "server.js"]
This checks every 30 seconds whether your Node app responds to HTTP requests on the /health endpoint. It gives the app 10 seconds to boot up before it starts counting failures, and it needs 3 consecutive failures before declaring the container unhealthy.
Why wget Over curl?
You’ll notice I used wget there instead of curl. Here’s the thing — Alpine-based images (which a lot of Docker images use) include wget by default but not curl. If you’re using a Debian-based image, curl is usually available. Use whatever’s already in your image to avoid installing extra packages just for health checks.
# Alpine-based (wget available)
HEALTHCHECK CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1
# Debian-based (curl available)
HEALTHCHECK CMD curl --fail http://localhost:3000/health || exit 1
The --fail flag on curl is important. Without it, curl exits 0 even on HTTP 500 responses, which completely defeats the purpose. The --spider flag on wget makes it check without downloading the body. Both keep things lightweight.
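You can verify the --fail behavior locally without Docker. A quick sketch, assuming python3 and curl are installed and port 8099 is free: a throwaway HTTP server answers 404 for a missing path, and curl only reports that as a failure when --fail is set.

```shell
# Throwaway server: any missing path returns HTTP 404
python3 -m http.server 8099 >/dev/null 2>&1 &
SERVER_PID=$!
sleep 1

# Without --fail, curl exits 0 even though the server said 404
curl -s http://localhost:8099/missing >/dev/null
PLAIN=$?

# With --fail, the 404 becomes a non-zero exit, which a health check can act on
curl -s --fail http://localhost:8099/missing >/dev/null
WITH_FAIL=$?

kill $SERVER_PID
echo "without --fail: $PLAIN, with --fail: $WITH_FAIL"
```

The same logic applies to HTTP 500 from a broken app: without --fail, the transfer “succeeded,” so the health check passes.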
Disabling a Health Check
If a parent image has a HEALTHCHECK and you want to remove it:
HEALTHCHECK NONE
You probably shouldn’t do this unless you have a very good reason, but the option exists.
Health Checks in Docker Compose
The more common (and flexible) approach is defining health checks in your docker-compose.yml. This is where most people should start, especially if you’re not building custom images.
Basic Compose Health Check
services:
api:
image: my-api:latest
healthcheck:
test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
interval: 30s
timeout: 5s
retries: 3
start_period: 10s
The test field supports two formats:
# Shell form (runs in /bin/sh -c)
test: curl --fail http://localhost:8080/health || exit 1
# Exec form (no shell involved, slightly more reliable)
test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
# CMD-SHELL form (like shell form but explicit)
test: ["CMD-SHELL", "curl --fail http://localhost:8080/health || exit 1"]
The exec form (CMD) is generally preferred because it doesn’t spawn a shell process and is more predictable. Use CMD-SHELL when you need shell features like pipes or ||.
Tuning Interval, Timeout, and Retries
Getting these values right matters more than people think. Bad tuning leads to either containers marked unhealthy during temporary blips, or problems going undetected for way too long.
Interval
This is how often the health check runs. The default of 30 seconds is fine for most things, but consider:
- High-traffic APIs: 10-15 seconds. You want to know fast.
- Databases: 30-60 seconds. They’re usually stable, and checking too often adds overhead.
- Background workers: 60 seconds or more. If they process jobs on a queue, momentary delays are expected.
Don’t set this to 1 second. You’re not monitoring the Space Shuttle. You’re running Nextcloud in your closet.
Timeout
How long to wait for the health check command to complete before calling it a failure. The default 30 seconds is usually too generous. If your health endpoint takes 30 seconds to respond, your app has bigger problems.
- Web apps: 3-5 seconds
- Databases: 5-10 seconds (connections can take a moment)
- Services under heavy load: 10-15 seconds
Retries
How many consecutive failures before the container is marked unhealthy. Default is 3, which is sensible. One failure could be a hiccup. Two might be a coincidence. Three is a pattern.
- Critical services: 2-3 retries (detect fast)
- Services with known jitter: 5 retries (avoid false alarms)
- Anything behind a load balancer: 3 retries is the sweet spot
Start Period
The grace period after a container starts during which health check failures don’t count toward the retry limit. This is critical for apps with slow startup times.
healthcheck:
test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
interval: 30s
timeout: 5s
retries: 3
start_period: 60s # Give the app a full minute to boot
If your Java app takes 45 seconds to start (and let’s be honest, it probably does), a start_period of 60 seconds prevents it from being marked unhealthy before it even finishes loading its 47 Spring Boot dependencies.
Start Interval (Docker 25+)
This is a newer addition. During the start period, you might want to check more frequently so you know the moment the app becomes ready, rather than waiting for the regular interval.
healthcheck:
test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
interval: 30s
timeout: 5s
retries: 3
start_period: 60s
start_interval: 5s # Check every 5s during startup, then switch to 30s
Neat feature. Use it.
Common Health Check Patterns
Here’s where the rubber meets the road. Let’s look at health check configurations for the services you’re actually running.
PostgreSQL
services:
postgres:
image: postgres:16
environment:
POSTGRES_DB: myapp
POSTGRES_USER: myuser
POSTGRES_PASSWORD: supersecret
healthcheck:
test: ["CMD-SHELL", "pg_isready -U myuser -d myapp"]
interval: 30s
timeout: 5s
retries: 3
start_period: 30s
pg_isready is a utility that ships with PostgreSQL specifically for this purpose. It checks whether the server is accepting connections. No need to install anything extra, no need to run actual queries. It’s fast, lightweight, and exactly what you want.
If you need a deeper check that verifies the database is actually functional (not just accepting connections):
healthcheck:
test: ["CMD-SHELL", "pg_isready -U myuser -d myapp && psql -U myuser -d myapp -c 'SELECT 1'"]
interval: 30s
timeout: 10s
retries: 3
start_period: 30s
The SELECT 1 confirms you can actually execute queries. Overkill for most setups, but useful if you’ve ever had Postgres accept connections while silently being in recovery mode.
Redis
services:
redis:
image: redis:7-alpine
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 30s
timeout: 5s
retries: 3
start_period: 10s
Redis ships with redis-cli, and ping returns PONG if the server is healthy. Redis starts fast, so a 10-second start period is plenty.
If you’re using Redis with authentication:
healthcheck:
test: ["CMD-SHELL", "redis-cli -a $${REDIS_PASSWORD} ping | grep PONG"]
interval: 30s
timeout: 5s
retries: 3
start_period: 10s
Note the $$ to escape the dollar sign in Compose — this ensures the variable is expanded inside the container, not by Compose.
MySQL / MariaDB
services:
mysql:
image: mysql:8
environment:
MYSQL_ROOT_PASSWORD: supersecret
MYSQL_DATABASE: myapp
healthcheck:
test: ["CMD-SHELL", "mysqladmin ping -h localhost -u root -p$$MYSQL_ROOT_PASSWORD"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
MySQL is notoriously slow to start, so that generous start_period is doing important work. The mysqladmin ping command checks if the server is alive without running queries. Note the CMD-SHELL form: the exec (CMD) form never invokes a shell, so $$MYSQL_ROOT_PASSWORD would reach mysqladmin as a literal string instead of being expanded inside the container.
Nginx
services:
nginx:
image: nginx:alpine
healthcheck:
test: ["CMD-SHELL", "curl --fail http://localhost:80/ || exit 1"]
interval: 30s
timeout: 5s
retries: 3
start_period: 5s
Wait, Nginx on Alpine doesn’t have curl by default. Let’s fix that:
healthcheck:
test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:80/ || exit 1"]
interval: 30s
timeout: 5s
retries: 3
start_period: 5s
Or if you want to avoid even that overhead, Nginx has a built-in approach. Add a simple health location to your config:
location /health {
access_log off;
return 200 "healthy\n";
add_header Content-Type text/plain;
}
Then check that endpoint specifically. The access_log off prevents your health checks from flooding your logs — because nobody wants 2,880 “GET /health 200” entries per day cluttering things up.
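With that location block in place, point the Compose health check at the new endpoint (adjust the port if your listen directive differs):

```yaml
healthcheck:
  test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:80/health || exit 1"]
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 5s
```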
Node.js / Express
For the Dockerfile approach:
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
CMD node -e "require('http').get('http://localhost:3000/health', (r) => { process.exit(r.statusCode === 200 ? 0 : 1) })" || exit 1
For Compose:
services:
api:
build: .
healthcheck:
test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1"]
interval: 30s
timeout: 5s
retries: 3
start_period: 15s
And in your Express app, create the health endpoint:
app.get('/health', (req, res) => {
// Basic check
res.status(200).json({ status: 'ok' });
});
For a more thorough check that verifies your app can actually do its job:
app.get('/health', async (req, res) => {
try {
// Check database connection
await db.query('SELECT 1');
// Check Redis connection
await redis.ping();
res.status(200).json({
status: 'ok',
db: 'connected',
cache: 'connected'
});
} catch (error) {
res.status(503).json({
status: 'error',
message: error.message
});
}
});
This kind of deep health check tells you not just “the process is running” but “the app can actually serve requests and reach its dependencies.” That’s the good stuff.
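One caveat with deep checks: if a dependency probe hangs (a wedged connection pool, say), the /health request hangs with it, and Docker only learns about it when the check times out. A small guard helps; withTimeout here is a hypothetical helper, not an Express or Docker API:

```javascript
// Hypothetical helper: reject if a dependency probe takes too long,
// so the health endpoint answers quickly instead of hanging.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage sketch: a probe that never resolves gets cut off at 100ms.
const hungProbe = new Promise(() => {});
withTimeout(hungProbe, 100)
  .then(() => console.log('healthy'))
  .catch((err) => console.log('unhealthy:', err.message));
// prints "unhealthy: timed out after 100ms"
```

Wrap each probe (e.g. await withTimeout(db.query('SELECT 1'), 2000)) and keep the combined budget below the health check’s own timeout.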
depends_on with condition: service_healthy
This is arguably the most practical reason to set up health checks. Docker Compose’s depends_on by itself only waits for a container to start, not for it to be ready. Your API container will happily try to connect to a database that hasn’t finished initializing yet, crash, and leave you staring at logs wondering why.
The Problem
services:
api:
build: .
depends_on:
- postgres # Only waits for the container to start, not for Postgres to be ready
postgres:
image: postgres:16
This is the “I told you to wait for me!” problem. The API starts, tries to connect to Postgres, Postgres is still in its initialization phase, connection refused, app crashes. Classic.
The Solution
services:
api:
build: .
depends_on:
postgres:
condition: service_healthy
redis:
condition: service_healthy
healthcheck:
test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1"]
interval: 30s
timeout: 5s
retries: 3
start_period: 15s
postgres:
image: postgres:16
environment:
POSTGRES_DB: myapp
POSTGRES_USER: myuser
POSTGRES_PASSWORD: supersecret
healthcheck:
test: ["CMD-SHELL", "pg_isready -U myuser -d myapp"]
interval: 10s
timeout: 5s
retries: 5
start_period: 30s
redis:
image: redis:7-alpine
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 10s
timeout: 5s
retries: 5
start_period: 10s
Now Compose will wait until both Postgres and Redis report as healthy before starting the API. No more race conditions. No more “retry connecting in a loop and hope for the best” logic in your application code (though you should still have that — belt and suspenders).
Other Conditions
Besides service_healthy, you can also use:
- service_started — The default; just waits for the container to start (barely useful)
- service_completed_successfully — Waits for a container to finish with exit code 0 (useful for migrations or seed scripts)
services:
api:
depends_on:
postgres:
condition: service_healthy
migrations:
condition: service_completed_successfully
migrations:
build: .
command: npm run migrate
depends_on:
postgres:
condition: service_healthy
postgres:
image: postgres:16
healthcheck:
test: ["CMD-SHELL", "pg_isready -U myuser -d myapp"]
interval: 10s
timeout: 5s
retries: 5
start_period: 30s
This pattern is chef’s kiss. Postgres starts and becomes healthy, migrations run and complete, then the API starts. Proper sequencing without hacky sleep scripts.
Custom Health Check Scripts
Sometimes a simple curl or ping isn’t enough. Maybe you need to check multiple things, or your health logic is complex enough that cramming it into a one-liner is a crime against readability.
Writing a Custom Script
Create a healthcheck.sh in your project:
#!/bin/sh
set -e
# Check if the web server responds
wget --no-verbose --tries=1 --spider http://localhost:8080/health || exit 1
# Check if the worker process is running
pgrep -f "worker" > /dev/null || exit 1
# Check if disk space is above threshold
USAGE=$(df / | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$USAGE" -gt 90 ]; then
exit 1
fi
exit 0
Then in your Dockerfile:
COPY healthcheck.sh /usr/local/bin/healthcheck.sh
RUN chmod +x /usr/local/bin/healthcheck.sh
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
CMD /usr/local/bin/healthcheck.sh
Or in Compose, mount it as a volume:
services:
app:
image: my-app:latest
volumes:
- ./healthcheck.sh:/usr/local/bin/healthcheck.sh:ro
healthcheck:
test: ["CMD", "/usr/local/bin/healthcheck.sh"]
interval: 30s
timeout: 10s
retries: 3
start_period: 15s
Custom scripts let you check whatever you want — disk space, file existence, queue depth, connection counts, whether Mercury is in retrograde. Go wild (within reason).
Monitoring Integration
Health checks are great, but they’re only half the story if nobody’s watching. Here’s how to integrate them with monitoring.
Checking Health Status from the CLI
# See health status for all containers
docker ps --format "table {{.Names}}\t{{.Status}}"
# Get detailed health info for a specific container
docker inspect --format='{{json .State.Health}}' container_name | jq .
# Watch health status in real-time
watch -n 5 'docker ps --format "table {{.Names}}\t{{.Status}}"'
The docker inspect output gives you the last few health check results, including stdout/stderr from each check. Incredibly useful for debugging why something is being marked unhealthy.
Autoheal: Auto-Restart Unhealthy Containers
Autoheal watches for containers marked as unhealthy and restarts them automatically:
services:
autoheal:
image: willfarrell/autoheal
container_name: autoheal
restart: unless-stopped
volumes:
- /var/run/docker.sock:/var/run/docker.sock
environment:
- AUTOHEAL_CONTAINER_LABEL=all
- AUTOHEAL_INTERVAL=60
- AUTOHEAL_START_PERIOD=300
This is the “have you tried turning it off and on again” approach, automated. It works surprisingly well for transient issues — memory leaks that build up slowly, connections that go stale, processes that get wedged.
A word of caution: if your container is unhealthy because of a configuration error, autoheal will dutifully restart it every 60 seconds forever. You’ll end up with a container that’s been restarted 1,440 times in a day and still doesn’t work. Check your logs.
Webhook Notifications
You can combine health monitoring with notification tools. Here’s a pattern using a simple script:
#!/bin/bash
UNHEALTHY=$(docker ps --filter "health=unhealthy" --format "{{.Names}}" 2>/dev/null)
if [ -n "$UNHEALTHY" ]; then
curl -X POST "https://your-webhook-url" \
-H "Content-Type: application/json" \
-d "{\"text\": \"Unhealthy containers detected: $UNHEALTHY\"}"
fi
Throw that in a cron job and you’ve got poor man’s container monitoring. It’s not Datadog, but it’ll wake you up when things go sideways.
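For instance, save the script somewhere like /usr/local/bin/notify-unhealthy.sh (name and path are arbitrary), make it executable, and add a crontab entry so it runs every five minutes:

```
# crontab -e (as a user with access to the Docker socket)
*/5 * * * * /usr/local/bin/notify-unhealthy.sh
```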
Integration with Uptime Kuma
If you’re running Uptime Kuma (and if you’re not, you probably should be), you can point it at your health endpoints directly:
- Add a new monitor in Uptime Kuma
- Set the type to HTTP(s)
- Point it at http://your-service:port/health
- Set your check interval and alert thresholds
- Configure notifications (Discord, Slack, email, carrier pigeon)
Now you’ve got external health monitoring on top of Docker’s internal checks. Defense in depth, baby.
Putting It All Together: A Full Stack Example
Here’s a complete docker-compose.yml for a typical web application stack with proper health checks everywhere:
services:
postgres:
image: postgres:16
container_name: myapp-db
environment:
POSTGRES_DB: myapp
POSTGRES_USER: myuser
POSTGRES_PASSWORD: supersecret
volumes:
- postgres_data:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U myuser -d myapp"]
interval: 30s
timeout: 5s
retries: 5
start_period: 30s
restart: unless-stopped
redis:
image: redis:7-alpine
container_name: myapp-cache
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 30s
timeout: 5s
retries: 3
start_period: 10s
restart: unless-stopped
api:
build: ./api
container_name: myapp-api
environment:
DATABASE_URL: postgres://myuser:supersecret@postgres:5432/myapp
REDIS_URL: redis://redis:6379
depends_on:
postgres:
condition: service_healthy
redis:
condition: service_healthy
healthcheck:
test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1"]
interval: 30s
timeout: 5s
retries: 3
start_period: 20s
restart: unless-stopped
nginx:
image: nginx:alpine
container_name: myapp-proxy
ports:
- "80:80"
- "443:443"
volumes:
- ./nginx.conf:/etc/nginx/nginx.conf:ro
depends_on:
api:
condition: service_healthy
healthcheck:
test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:80/health || exit 1"]
interval: 30s
timeout: 5s
retries: 3
start_period: 5s
restart: unless-stopped
autoheal:
image: willfarrell/autoheal
container_name: myapp-autoheal
restart: unless-stopped
volumes:
- /var/run/docker.sock:/var/run/docker.sock
environment:
- AUTOHEAL_CONTAINER_LABEL=all
- AUTOHEAL_INTERVAL=60
volumes:
postgres_data:
Look at that startup chain: Postgres and Redis start first, the API waits until both are healthy, and Nginx waits until the API is healthy. Everything comes up in the right order, every time. No race conditions, no retry hacks, no “just add a sleep 10” nonsense.
Common Mistakes (and How to Avoid Them)
Let’s wrap up with the greatest hits of health check misconfigurations.
1. No Start Period
# Bad: App takes 30 seconds to start, fails health check immediately
healthcheck:
test: curl --fail http://localhost:8080/health
interval: 10s
retries: 3
Your app starts, the first three checks fail before it’s ready, Docker marks it unhealthy, autoheal restarts it, and now you’re in a restart loop. Always set start_period to at least as long as your app takes to boot.
2. Timeout Longer Than Interval
# Bad: Check runs every 10s but waits 30s for a response
healthcheck:
test: curl --fail http://localhost:8080/health
interval: 10s
timeout: 30s
If the check hangs, Docker waits out the full 30-second timeout before killing it and recording a failure, so detection lags far behind what the 10-second interval implies. Keep the timeout shorter than the interval.
3. Checking the Wrong Thing
# Bad: This only checks if a port is open, not if the app works
healthcheck:
test: ["CMD-SHELL", "nc -z localhost 8080"]
A port being open means the process is listening. It doesn’t mean it can serve requests. Use an actual HTTP request to a health endpoint that exercises real functionality.
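The fix is the pattern used throughout this post: make a real HTTP request and fail on a bad status.

```yaml
# Better: an actual request to a health endpoint
healthcheck:
  test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:8080/health || exit 1"]
```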
4. Health Check That’s Too Expensive
# Bad: Running a full database query with joins on every check
healthcheck:
test: ["CMD-SHELL", "psql -U user -d db -c 'SELECT COUNT(*) FROM users JOIN orders ON ...'"]
Your health check should be fast and cheap. SELECT 1 or pg_isready, not a query that takes 5 seconds and hits every table. The health check runs constantly — treat it like a heartbeat, not a stress test.
5. Forgetting to Handle Authentication
# Bad: Redis has a password but health check doesn't use it
healthcheck:
test: ["CMD", "redis-cli", "ping"]
If your service requires authentication, your health check needs those credentials too. Otherwise it’ll fail every time and you’ll spend 20 minutes debugging before the face-palm moment.
The Bottom Line
Docker health checks take about 5 minutes to set up and save you hours of “why is this broken and I didn’t notice” debugging. They’re one of those things that feel optional until the day they would have saved your weekend.
Start simple:
- Add pg_isready to your Postgres containers
- Add redis-cli ping to your Redis containers
- Add a /health endpoint to your web apps
- Use depends_on with condition: service_healthy
- Consider autoheal for auto-recovery
Then iterate. Tune your intervals. Add deeper checks. Integrate with monitoring. Before you know it, you’ll have a stack that not only runs but actually works — and tells you the second it doesn’t.
Because “it’s running” was never good enough. You just didn’t have a better answer until now.