The Lies Docker Tells You
You check your containers. docker ps says everything is Up 3 hours. You breathe a sigh of relief and go grab a coffee.
Meanwhile, your PostgreSQL container has been refusing connections for the past 45 minutes. Your API container is returning 500 errors to everyone who tries to use it. Your Redis instance is alive in the same way a houseplant you forgot about is alive — technically present, contributing nothing.
The problem is simple: Docker’s default idea of “running” means “the main process hasn’t exited.” That’s it. That’s the whole check. Your app could be stuck in an infinite loop, deadlocked, out of memory but not quite enough to crash, or just sitting there contemplating the void. As long as PID 1 hasn’t died, Docker gives it the green light.
This is like asking “is the patient alive?” and only checking if they have a pulse. Technically useful, but you might want to ask a few more questions before declaring them fit for duty.
That’s where health checks come in.
What Docker Health Checks Actually Do
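A Docker health check is a command that runs inside your container on a schedule. If the command exits with code 0, that check passes. If it exits with code 1, the check fails. (Exit code 2 is reserved, so stick to 0 and 1.) If it doesn’t finish within the timeout, that counts as a failure too. A single failure doesn’t flip the container to “unhealthy”; it takes several consecutive failures, as set by the retries option.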
Docker tracks three states:
- starting — The container just started and is in its grace period
- healthy — The health check is passing
- unhealthy — The health check has failed enough consecutive times
That’s it. Simple concept, massive impact. Once you have real health information, you can do things like:
- Prevent traffic from hitting broken containers
- Make other containers wait until dependencies are actually ready
- Get alerted when something silently breaks
- Trigger automatic restarts through orchestration tools
- Stop pretending everything is fine at 3 AM
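To make the three states concrete, here is a minimal JavaScript sketch of how a status could be derived from a series of check results. It is a simplified model, not Docker's actual code: the function name `healthStatus` and the `{ at, exitCode }` input shape are made up for illustration, and it assumes a failure inside the start period is ignored only while the container is still "starting".

```javascript
// Simplified model of the starting / healthy / unhealthy transitions.
// `checks` is a time-ordered list of { at: secondsSinceStart, exitCode }.
function healthStatus(checks, { retries = 3, startPeriod = 0 } = {}) {
  let status = "starting";
  let failingStreak = 0;

  for (const { at, exitCode } of checks) {
    if (exitCode === 0) {
      // Any passing check resets the streak and marks the container healthy.
      failingStreak = 0;
      status = "healthy";
      continue;
    }
    // Failures during the grace period do not count toward `retries`.
    const inGracePeriod = status === "starting" && at < startPeriod;
    if (inGracePeriod) continue;

    failingStreak += 1;
    if (failingStreak >= retries) status = "unhealthy";
  }
  return status;
}
```

Feeding it three failures with no start period yields "unhealthy", while the same three failures with a 60-second start period leave the container in "starting" if only the last one falls outside the grace period.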
HEALTHCHECK in Dockerfiles
The first place you can define a health check is directly in your Dockerfile using the HEALTHCHECK instruction. This bakes the check into the image itself, which means every container spawned from that image gets the check automatically.
Basic Syntax

    HEALTHCHECK [OPTIONS] CMD command

The options you can set:

| Option | Default | What It Does |
|---|---|---|
| --interval | 30s | Time between health checks |
| --timeout | 30s | Max time to wait for a check to complete |
| --start-period | 0s | Grace period for container startup |
| --start-interval | 5s | Interval during the start period (Docker 25+) |
| --retries | 3 | Consecutive failures before “unhealthy” |
A Simple Web App Example

    FROM node:20-alpine

    WORKDIR /app
    COPY . .
    RUN npm ci --production

    HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
      CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1

    CMD ["node", "server.js"]

This checks every 30 seconds whether your Node app responds to HTTP requests on the /health endpoint. It gives the app 10 seconds to boot up before it starts counting failures, and it needs 3 consecutive failures before declaring the container unhealthy.
Why wget Over curl?
You’ll notice I used wget there instead of curl. Here’s the thing: Alpine-based images (which a lot of Docker images use) include BusyBox wget by default but usually not curl. If you’re using a Debian-based image, curl is often available, though slim variants tend to leave it out. Use whatever’s already in your image to avoid installing extra packages just for health checks.

    # Alpine-based (wget available)
    HEALTHCHECK CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1

    # Debian-based (curl available)
    HEALTHCHECK CMD curl --fail http://localhost:3000/health || exit 1

The --fail flag on curl is important. Without it, curl exits 0 even on HTTP 500 responses, which completely defeats the purpose. The --spider flag on wget makes it check without downloading the body. Both keep things lightweight.
Disabling a Health Check
If a parent image has a HEALTHCHECK and you want to remove it:

    HEALTHCHECK NONE

You probably shouldn’t do this unless you have a very good reason, but the option exists.
Health Checks in Docker Compose
The more common (and flexible) approach is defining health checks in your docker-compose.yml. This is where most people should start, especially if you’re not building custom images.
Basic Compose Health Check

    services:
      api:
        image: my-api:latest
        healthcheck:
          test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
          interval: 30s
          timeout: 5s
          retries: 3
          start_period: 10s

The test field supports three notations:

    # Shell form (runs in /bin/sh -c)
    test: curl --fail http://localhost:8080/health || exit 1

    # Exec form (no shell involved, slightly more reliable)
    test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]

    # CMD-SHELL form (like shell form but explicit)
    test: ["CMD-SHELL", "curl --fail http://localhost:8080/health || exit 1"]

The exec form (CMD) is generally preferred because it doesn’t spawn a shell process and is more predictable. Use CMD-SHELL when you need shell features like pipes, || or environment variable expansion.
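The difference between the notations is easier to see as the command line that ends up running in the container. The sketch below is an illustration in JavaScript, not Compose's implementation; `healthcheckArgv` is a hypothetical helper, and it assumes a Linux container where the shell is /bin/sh.

```javascript
// Sketch: map a Compose `test` value to the argv executed inside the container.
function healthcheckArgv(test) {
  // A plain string is shell form: the whole string is handed to the shell.
  if (typeof test === "string") return ["/bin/sh", "-c", test];

  const [kind, ...rest] = test;
  if (kind === "NONE") return null; // health check disabled
  if (kind === "CMD") return rest; // exec form: arguments passed as-is, no shell
  if (kind === "CMD-SHELL") return ["/bin/sh", "-c", rest[0]];
  throw new Error(`unknown test form: ${kind}`);
}
```

Because the exec form never passes through a shell, anything like `||`, a pipe, or `$VAR` in its arguments reaches the program as literal text.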
Tuning Interval, Timeout, and Retries
Getting these values right matters more than people think. Bad tuning leads to either containers marked unhealthy during temporary blips, or problems going undetected for way too long.
Interval
This is how often the health check runs. The default of 30 seconds is fine for most things, but consider:
- High-traffic APIs: 10-15 seconds. You want to know fast.
- Databases: 30-60 seconds. They’re usually stable, and checking too often adds overhead.
- Background workers: 60 seconds or more. If they process jobs on a queue, momentary delays are expected.
Don’t set this to 1 second. You’re not monitoring the Space Shuttle. You’re running Nextcloud in your closet.
Timeout
How long to wait for the health check command to complete before calling it a failure. The default 30 seconds is usually too generous. If your health endpoint takes 30 seconds to respond, your app has bigger problems.
- Web apps: 3-5 seconds
- Databases: 5-10 seconds (connections can take a moment)
- Services under heavy load: 10-15 seconds
Retries
How many consecutive failures before the container is marked unhealthy. Default is 3, which is sensible. One failure could be a hiccup. Two might be a coincidence. Three is a pattern.
- Critical services: 2-3 retries (detect fast)
- Services with known jitter: 5 retries (avoid false alarms)
- Anything behind a load balancer: 3 retries is the sweet spot
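A quick way to sanity-check a combination of these values is to estimate how long a dead service can go unnoticed. The helper below is a rough upper bound, not an exact model: `worstCaseDetectionSeconds` is a made-up name, and it assumes Docker waits `interval` after each check completes and that every failing check runs all the way to its timeout.

```javascript
// Rough upper bound on the time between a service breaking and Docker
// marking it unhealthy. All values are in seconds.
function worstCaseDetectionSeconds({ interval = 30, timeout = 30, retries = 3 } = {}) {
  // Each failed attempt costs one wait (interval) plus one hung check (timeout).
  return retries * (interval + timeout);
}

// Example: the web app settings used in this article.
const seconds = worstCaseDetectionSeconds({ interval: 30, timeout: 5, retries: 3 });
console.log(`worst case: ${seconds}s`);
```

With the defaults (30s interval, 30s timeout, 3 retries) the same estimate comes out at three minutes, which is a good argument for trimming the timeout.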
Start Period
The grace period after a container starts during which health check failures don’t count toward the retry limit. This is critical for apps with slow startup times.

    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 60s  # Give the app a full minute to boot

If your Java app takes 45 seconds to start (and let’s be honest, it probably does), a start_period of 60 seconds prevents it from being marked unhealthy before it even finishes loading its 47 Spring Boot dependencies.
Start Interval (Docker 25+)
This is a newer addition. During the start period, you might want to check more frequently so you know the moment the app becomes ready, rather than waiting for the regular interval.

    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 60s
      start_interval: 5s  # Check every 5s during startup, then switch to 30s

Neat feature. Use it.
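To see what the two intervals do to the check schedule, here is a small JavaScript sketch that lists the times at which checks would fire. It is a deliberately simplified model and the function name `checkTimes` is hypothetical: it assumes every check completes instantly and it ignores what happens once a check succeeds, so treat the output as an approximation of the schedule rather than Docker's exact timing.

```javascript
// Sketch: seconds-since-start at which checks would run, up to `until`.
function checkTimes({ interval, startPeriod = 0, startInterval = 5, until }) {
  if (interval <= 0 || startInterval <= 0) {
    throw new Error("interval and startInterval must be positive");
  }
  const times = [];
  let t = 0;
  while (true) {
    // Inside the start period the shorter start interval applies.
    t += t < startPeriod ? startInterval : interval;
    if (t > until) break;
    times.push(t);
  }
  return times;
}
```

For example, `checkTimes({ interval: 30, startPeriod: 10, startInterval: 5, until: 70 })` gives checks at 5, 10, 40 and 70 seconds, while the same settings without a start period would not run the first check until the 30-second mark.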
Common Health Check Patterns
Here’s where the rubber meets the road. Let’s look at health check configurations for the services you’re actually running.
PostgreSQL

    services:
      postgres:
        image: postgres:16
        environment:
          POSTGRES_DB: myapp
          POSTGRES_USER: myuser
          POSTGRES_PASSWORD: supersecret
        healthcheck:
          test: ["CMD-SHELL", "pg_isready -U myuser -d myapp"]
          interval: 30s
          timeout: 5s
          retries: 3
          start_period: 30s

pg_isready is a utility that ships with PostgreSQL specifically for this purpose. It checks whether the server is accepting connections. No need to install anything extra, no need to run actual queries. It’s fast, lightweight, and exactly what you want.

If you need a deeper check that verifies the database is actually functional (not just accepting connections):

    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U myuser -d myapp && psql -U myuser -d myapp -c 'SELECT 1'"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s

The SELECT 1 confirms you can actually execute queries. Note the single quotes around the query: nesting double quotes inside the double-quoted YAML string would break it. Overkill for most setups, but useful if you’ve ever had Postgres accept connections while being unable to actually serve queries.
Redis

    services:
      redis:
        image: redis:7-alpine
        healthcheck:
          test: ["CMD", "redis-cli", "ping"]
          interval: 30s
          timeout: 5s
          retries: 3
          start_period: 10s

Redis ships with redis-cli, and ping returns PONG if the server is healthy. Redis starts fast, so a 10-second start period is plenty.

If you’re using Redis with authentication:

    healthcheck:
      test: ["CMD-SHELL", "redis-cli -a $${REDIS_PASSWORD} ping | grep PONG"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s

Note the $$ to escape the dollar sign in Compose: this ensures the variable is expanded inside the container, not by Compose.
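The escaping rule is easy to get backwards, so here is a small JavaScript sketch of what Compose-style interpolation does to a string before the container ever sees it. It is a simplified imitation that only handles `$$`, `$VAR` and `${VAR}` (the real feature also supports defaults such as `${VAR:-x}`), and `composeInterpolate` is a hypothetical name.

```javascript
// Sketch: simplified Compose variable interpolation.
// `env` stands in for the environment Compose itself runs with.
function composeInterpolate(value, env) {
  return value.replace(/\$\$|\$\{(\w+)\}|\$(\w+)/g, (match, braced, bare) => {
    if (match === "$$") return "$"; // escaped dollar: passed through as a literal $
    const name = braced ?? bare;
    return env[name] ?? ""; // unset variables become an empty string
  });
}

const hostEnv = { REDIS_PASSWORD: "from-the-host" };
console.log(composeInterpolate("redis-cli -a $${REDIS_PASSWORD} ping", hostEnv));
console.log(composeInterpolate("redis-cli -a ${REDIS_PASSWORD} ping", hostEnv));
```

The first call keeps `${REDIS_PASSWORD}` intact for the container's shell to expand; the second bakes in whatever value the host happened to have, or nothing at all if it was unset there.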
MySQL / MariaDB

    services:
      mysql:
        image: mysql:8
        environment:
          MYSQL_ROOT_PASSWORD: supersecret
          MYSQL_DATABASE: myapp
        healthcheck:
          test: ["CMD-SHELL", "mysqladmin ping -h localhost -u root -p$$MYSQL_ROOT_PASSWORD"]
          interval: 30s
          timeout: 10s
          retries: 3
          start_period: 60s

MySQL is notoriously slow to start, so that generous start_period is doing important work. The mysqladmin ping command checks if the server is alive without running queries. The check uses CMD-SHELL rather than the exec form because only a shell will expand $MYSQL_ROOT_PASSWORD; in exec form the variable name would be passed to mysqladmin as a literal string.
Nginx

    services:
      nginx:
        image: nginx:alpine
        healthcheck:
          test: ["CMD-SHELL", "curl --fail http://localhost:80/ || exit 1"]
          interval: 30s
          timeout: 5s
          retries: 3
          start_period: 5s

Wait, you can’t count on curl being present in every Alpine-based image. BusyBox wget is the safer bet:

    healthcheck:
      test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:80/ || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 5s

Or if you want to avoid even that overhead, Nginx has a built-in approach. Add a simple health location to your config:

    location /health {
        access_log off;
        default_type text/plain;
        return 200 "healthy\n";
    }

Then check that endpoint specifically. The access_log off prevents your health checks from flooding your logs, because nobody wants 2,880 “GET /health 200” entries per day cluttering things up.
Node.js / Express
For the Dockerfile approach:

    HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
      CMD node -e "require('http').get('http://localhost:3000/health', (r) => process.exit(r.statusCode === 200 ? 0 : 1)).on('error', () => process.exit(1))" || exit 1

For Compose:

    services:
      api:
        build: .
        healthcheck:
          test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1"]
          interval: 30s
          timeout: 5s
          retries: 3
          start_period: 15s

And in your Express app, create the health endpoint:

    app.get("/health", (req, res) => {
      // Basic check
      res.status(200).json({ status: "ok" });
    });

For a more thorough check that verifies your app can actually do its job:

    app.get("/health", async (req, res) => {
      try {
        // Check database connection
        await db.query("SELECT 1");
        // Check Redis connection
        await redis.ping();
        res.status(200).json({ status: "ok", db: "connected", cache: "connected" });
      } catch (error) {
        res.status(503).json({ status: "error", message: error.message });
      }
    });

This kind of deep health check tells you not just “the process is running” but “the app can actually serve requests and reach its dependencies.” That’s the good stuff.
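One thing a deep check like this can run into is a dependency that hangs instead of failing, which leaves the endpoint waiting until Docker's own timeout fires. A possible variation, sketched below with no framework dependencies, runs each dependency check under its own deadline and reports them individually. The helper names (`withTimeout`, `buildHealthResponse`, `runHealthChecks`) and the 2-second default are assumptions for illustration, not part of Express or any library.

```javascript
// Reject if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// Fold Promise.allSettled results into a status code and JSON body.
function buildHealthResponse(names, settled) {
  const checks = {};
  settled.forEach((result, i) => {
    checks[names[i]] =
      result.status === "fulfilled"
        ? "ok"
        : `error: ${result.reason?.message ?? result.reason}`;
  });
  const healthy = settled.every((result) => result.status === "fulfilled");
  return {
    statusCode: healthy ? 200 : 503,
    body: { status: healthy ? "ok" : "error", checks },
  };
}

// `checks` maps a name to a function returning a promise, e.g.
// { db: () => db.query("SELECT 1"), cache: () => redis.ping() }.
async function runHealthChecks(checks, timeoutMs = 2000) {
  const names = Object.keys(checks);
  const settled = await Promise.allSettled(
    names.map((name) => withTimeout(Promise.resolve().then(() => checks[name]()), timeoutMs))
  );
  return buildHealthResponse(names, settled);
}
```

Inside the route handler this would be used as `const r = await runHealthChecks({...}); res.status(r.statusCode).json(r.body);`. Keep the per-check deadline below the health check's own timeout so the endpoint answers with a 503 rather than being cut off.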
depends_on with condition: service_healthy
This is arguably the most practical reason to set up health checks. Docker Compose’s depends_on by itself only waits for a container to start, not for it to be ready. Your API container will happily try to connect to a database that hasn’t finished initializing yet, crash, and leave you staring at logs wondering why.
The Problem

    services:
      api:
        build: .
        depends_on:
          - postgres  # Only waits for the container to start, not for Postgres to be ready

      postgres:
        image: postgres:16

This is the “I told you to wait for me!” problem. The API starts, tries to connect to Postgres, Postgres is still in its initialization phase, connection refused, app crashes. Classic.
The Solution

    services:
      api:
        build: .
        depends_on:
          postgres:
            condition: service_healthy
          redis:
            condition: service_healthy
        healthcheck:
          test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1"]
          interval: 30s
          timeout: 5s
          retries: 3
          start_period: 15s

      postgres:
        image: postgres:16
        environment:
          POSTGRES_DB: myapp
          POSTGRES_USER: myuser
          POSTGRES_PASSWORD: supersecret
        healthcheck:
          test: ["CMD-SHELL", "pg_isready -U myuser -d myapp"]
          interval: 10s
          timeout: 5s
          retries: 5
          start_period: 30s

      redis:
        image: redis:7-alpine
        healthcheck:
          test: ["CMD", "redis-cli", "ping"]
          interval: 10s
          timeout: 5s
          retries: 5
          start_period: 10s

Now Compose will wait until both Postgres and Redis report as healthy before starting the API. No more race conditions. No more “retry connecting in a loop and hope for the best” logic in your application code (though you should still have that: belt and suspenders).
Other Conditions
Besides service_healthy, you can also use:
- service_started: The default, just waits for the container to start (barely useful)
- service_completed_successfully: Waits for a container to finish with exit code 0 (useful for migrations or seed scripts)

    services:
      api:
        depends_on:
          postgres:
            condition: service_healthy
          migrations:
            condition: service_completed_successfully

      migrations:
        build: .
        command: npm run migrate
        depends_on:
          postgres:
            condition: service_healthy

      postgres:
        image: postgres:16
        healthcheck:
          test: ["CMD-SHELL", "pg_isready -U myuser -d myapp"]
          interval: 10s
          timeout: 5s
          retries: 5
          start_period: 30s

This pattern is chef’s kiss. Postgres starts and becomes healthy, migrations run and complete, then the API starts. Proper sequencing without hacky sleep scripts.
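The ordering that falls out of a set of depends_on entries is a plain dependency sort. The JavaScript sketch below shows the idea on an object shaped like the parsed `services` section; it illustrates the ordering only and is not how Compose is implemented. `startOrder` is a hypothetical name, and the sketch ignores the conditions themselves and simply orders by dependency.

```javascript
// Sketch: derive a start order from depends_on using a depth-first sort.
function startOrder(services) {
  const order = [];
  const state = new Map(); // service name -> "visiting" | "done"

  const visit = (name) => {
    if (state.get(name) === "done") return;
    if (state.get(name) === "visiting") throw new Error(`dependency cycle at ${name}`);
    if (!(name in services)) throw new Error(`unknown service: ${name}`);

    state.set(name, "visiting");
    // Every dependency must be started (and satisfied) before this service.
    for (const dep of Object.keys(services[name].depends_on ?? {})) visit(dep);
    state.set(name, "done");
    order.push(name);
  };

  Object.keys(services).forEach(visit);
  return order;
}

const stack = {
  api: {
    depends_on: {
      postgres: { condition: "service_healthy" },
      migrations: { condition: "service_completed_successfully" },
    },
  },
  migrations: { depends_on: { postgres: { condition: "service_healthy" } } },
  postgres: {},
};
console.log(startOrder(stack).join(" -> "));
```

For the stack above this prints the chain described in the text: postgres, then migrations, then api. A cycle (two services waiting on each other) throws instead of hanging, which is also roughly what you want your tooling to do.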
Custom Health Check Scripts
Sometimes a simple curl or ping isn’t enough. Maybe you need to check multiple things, or your health logic is complex enough that cramming it into a one-liner is a crime against readability.
Writing a Custom Script
Create a healthcheck.sh in your project:

    #!/bin/sh
    set -e

    # Check if the web server responds
    wget --no-verbose --tries=1 --spider http://localhost:8080/health || exit 1

    # Check if the worker process is running
    pgrep -f "worker" > /dev/null || exit 1

    # Check if disk usage is above the 90% threshold
    # (single quotes keep the shell from expanding $5 before awk sees it)
    USAGE=$(df -P / | tail -1 | awk '{print $5}' | sed 's/%//')
    if [ "$USAGE" -gt 90 ]; then
      exit 1
    fi

    exit 0

Then in your Dockerfile:

    COPY healthcheck.sh /usr/local/bin/healthcheck.sh
    RUN chmod +x /usr/local/bin/healthcheck.sh

    HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
      CMD /usr/local/bin/healthcheck.sh

Or in Compose, mount it as a volume (the file needs to be executable on the host, since the bind mount keeps its permissions):

    services:
      app:
        image: my-app:latest
        volumes:
          - ./healthcheck.sh:/usr/local/bin/healthcheck.sh:ro
        healthcheck:
          test: ["CMD", "/usr/local/bin/healthcheck.sh"]
          interval: 30s
          timeout: 10s
          retries: 3
          start_period: 15s

Custom scripts let you check whatever you want: disk space, file existence, queue depth, connection counts, whether Mercury is in retrograde. Go wild (within reason).
Monitoring Integration
Health checks are great, but they’re only half the story if nobody’s watching. Here’s how to integrate them with monitoring.
Checking Health Status from the CLI

    # See health status for all containers
    docker ps --format "table {{.Names}}\t{{.Status}}"

    # Get detailed health info for a specific container
    docker inspect --format="{{json .State.Health}}" container_name | jq .

    # Watch health status in real-time
    watch -n 5 'docker ps --format "table {{.Names}}\t{{.Status}}"'

The docker inspect output gives you the last few health check results, including stdout/stderr from each check. Incredibly useful for debugging why something is being marked unhealthy.
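If you want that inspect output in a more readable form than raw JSON, a few lines of JavaScript will do. The sketch below works on the object found under `.State.Health`, which carries `Status`, `FailingStreak` and a `Log` array whose entries include `Start`, `ExitCode` and `Output`. The function name `summarizeHealth` and the sample data are illustrative only.

```javascript
// Sketch: turn the `.State.Health` object from `docker inspect` into short lines.
function summarizeHealth(health) {
  const lines = [`status=${health.Status} failingStreak=${health.FailingStreak}`];
  for (const entry of health.Log ?? []) {
    // Keep only the first line of the check's output to stay compact.
    const output = (entry.Output ?? "").trim().split("\n")[0];
    lines.push(`${entry.Start} exit=${entry.ExitCode} ${output}`);
  }
  return lines;
}

// Illustrative sample, shaped like real output but with made-up values.
const sample = {
  Status: "unhealthy",
  FailingStreak: 3,
  Log: [
    { Start: "2024-01-01T00:00:00Z", End: "2024-01-01T00:00:01Z", ExitCode: 1, Output: "wget: can't connect\n" },
  ],
};
console.log(summarizeHealth(sample).join("\n"));
```

In practice you would feed it `JSON.parse` of whatever the docker inspect command above prints, for instance by reading stdin in a small script.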
Autoheal: Auto-Restart Unhealthy Containers
Autoheal watches for containers marked as unhealthy and restarts them automatically:

    services:
      autoheal:
        image: willfarrell/autoheal
        container_name: autoheal
        restart: unless-stopped
        volumes:
          - /var/run/docker.sock:/var/run/docker.sock
        environment:
          - AUTOHEAL_CONTAINER_LABEL=all
          - AUTOHEAL_INTERVAL=60
          - AUTOHEAL_START_PERIOD=300

This is the “have you tried turning it off and on again” approach, automated. It works surprisingly well for transient issues: memory leaks that build up slowly, connections that go stale, processes that get wedged.
A word of caution: if your container is unhealthy because of a configuration error, autoheal will dutifully restart it every 60 seconds forever. You’ll end up with a container that’s been restarted 1,440 times in a day and still doesn’t work. Check your logs.
Webhook Notifications
You can combine health monitoring with notification tools. Here’s a pattern using a simple script:

    #!/bin/bash
    # Join the names onto one line so a multi-container result doesn't break the JSON
    UNHEALTHY=$(docker ps --filter "health=unhealthy" --format "{{.Names}}" 2>/dev/null | tr '\n' ' ')

    if [ -n "$UNHEALTHY" ]; then
      curl -X POST "https://your-webhook-url" \
        -H "Content-Type: application/json" \
        -d "{\"text\": \"Unhealthy containers detected: $UNHEALTHY\"}"
    fi

Throw that in a cron job and you’ve got poor man’s container monitoring. It’s not Datadog, but it’ll wake you up when things go sideways.
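Hand-assembling JSON in a shell string gets fragile once container names contain unusual characters or there is more than one of them. If you would rather build the payload in code, a minimal JavaScript sketch could look like this. `buildWebhookPayload` is a hypothetical helper, and the `text` field simply mirrors the payload used in the script; your webhook may expect a different shape.

```javascript
// Sketch: build the webhook body from the raw output of
// `docker ps --filter "health=unhealthy" --format "{{.Names}}"`.
function buildWebhookPayload(dockerPsOutput) {
  const names = dockerPsOutput
    .split("\n")
    .map((line) => line.trim())
    .filter(Boolean);

  // Nothing unhealthy: signal that no notification should be sent.
  if (names.length === 0) return null;

  // JSON.stringify takes care of quoting and escaping.
  return JSON.stringify({ text: `Unhealthy containers detected: ${names.join(", ")}` });
}
```

The returned string can be sent as the request body with any HTTP client; a `null` result means there is nothing to report.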
Integration with Uptime Kuma
If you’re running Uptime Kuma (and if you’re not, you probably should be), you can point it at your health endpoints directly:
- Add a new monitor in Uptime Kuma
- Set the type to HTTP(s)
- Point it at http://your-service:port/health
- Set your check interval and alert thresholds
- Configure notifications (Discord, Slack, email, carrier pigeon)
Now you’ve got external health monitoring on top of Docker’s internal checks. Defense in depth, baby.
Putting It All Together: A Full Stack Example
Here’s a complete docker-compose.yml for a typical web application stack with proper health checks everywhere:

    services:
      postgres:
        image: postgres:16
        container_name: myapp-db
        environment:
          POSTGRES_DB: myapp
          POSTGRES_USER: myuser
          POSTGRES_PASSWORD: supersecret
        volumes:
          - postgres_data:/var/lib/postgresql/data
        healthcheck:
          test: ["CMD-SHELL", "pg_isready -U myuser -d myapp"]
          interval: 30s
          timeout: 5s
          retries: 5
          start_period: 30s
        restart: unless-stopped

      redis:
        image: redis:7-alpine
        container_name: myapp-cache
        healthcheck:
          test: ["CMD", "redis-cli", "ping"]
          interval: 30s
          timeout: 5s
          retries: 3
          start_period: 10s
        restart: unless-stopped

      api:
        build: ./api
        container_name: myapp-api
        environment:
          DATABASE_URL: postgres://myuser:supersecret@postgres:5432/myapp
          REDIS_URL: redis://redis:6379
        depends_on:
          postgres:
            condition: service_healthy
          redis:
            condition: service_healthy
        healthcheck:
          test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1"]
          interval: 30s
          timeout: 5s
          retries: 3
          start_period: 20s
        restart: unless-stopped

      nginx:
        image: nginx:alpine
        container_name: myapp-proxy
        ports:
          - "80:80"
          - "443:443"
        volumes:
          - ./nginx.conf:/etc/nginx/nginx.conf:ro
        depends_on:
          api:
            condition: service_healthy
        healthcheck:
          test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:80/health || exit 1"]
          interval: 30s
          timeout: 5s
          retries: 3
          start_period: 5s
        restart: unless-stopped

      autoheal:
        image: willfarrell/autoheal
        container_name: myapp-autoheal
        restart: unless-stopped
        volumes:
          - /var/run/docker.sock:/var/run/docker.sock
        environment:
          - AUTOHEAL_CONTAINER_LABEL=all
          - AUTOHEAL_INTERVAL=60

    volumes:
      postgres_data:

Look at that startup chain: Postgres and Redis start first, the API waits until both are healthy, and Nginx waits until the API is healthy. Everything comes up in the right order, every time. No race conditions, no retry hacks, no “just add a sleep 10” nonsense.
Common Mistakes (and How to Avoid Them)
Let’s wrap up with the greatest hits of health check misconfigurations.
1. No Start Period

    # Bad: App takes 30 seconds to start, fails health check immediately
    healthcheck:
      test: curl --fail http://localhost:8080/health
      interval: 10s
      retries: 3

Your app starts, the first three checks fail before it’s ready, Docker marks it unhealthy, autoheal restarts it, and now you’re in a restart loop. Always set start_period to at least as long as your app takes to boot.
2. Timeout Longer Than Interval
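
    # Bad: Check runs every 10s but waits 30s for a response
    healthcheck:
      test: curl --fail http://localhost:8080/health
      interval: 10s
      timeout: 30s

Docker runs checks one at a time and waits the full interval after each check finishes, so a hanging check won’t pile up copies of itself. What it does do is stretch every failure to 30 seconds: with the default three retries, a hung app takes roughly two minutes to be marked unhealthy instead of about 30 seconds. Keep the timeout shorter than the interval.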
3. Checking the Wrong Thing

    # Bad: This only checks if a port is open, not if the app works
    healthcheck:
      test: ["CMD-SHELL", "nc -z localhost 8080"]

A port being open means the process is listening. It doesn’t mean it can serve requests. Use an actual HTTP request to a health endpoint that exercises real functionality.
4. Health Check That’s Too Expensive

    # Bad: Running a full database query with joins on every check
    healthcheck:
      test: ["CMD-SHELL", "psql -U user -d db -c 'SELECT COUNT(*) FROM users JOIN orders ON ...'"]

Your health check should be fast and cheap. SELECT 1 or pg_isready, not a query that takes 5 seconds and hits every table. The health check runs constantly, so treat it like a heartbeat, not a stress test.
5. Forgetting to Handle Authentication

    # Bad: Redis has a password but health check doesn't use it
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]

If your service requires authentication, your health check needs those credentials too. Otherwise it’ll fail every time and you’ll spend 20 minutes debugging before the face-palm moment.
The Bottom Line
Docker health checks take about 5 minutes to set up and save you hours of “why is this broken and I didn’t notice” debugging. They’re one of those things that feel optional until the day they would have saved your weekend.
Start simple:
- Add pg_isready to your Postgres containers
- Add redis-cli ping to your Redis containers
- Add a /health endpoint to your web apps
- Use depends_on with condition: service_healthy
- Consider autoheal for auto-recovery
Then iterate. Tune your intervals. Add deeper checks. Integrate with monitoring. Before you know it, you’ll have a stack that not only runs but actually works — and tells you the second it doesn’t.
Because “it’s running” was never good enough. You just didn’t have a better answer until now.