Your container’s serving requests, but the health check thinks it’s dead. So your orchestrator kills it. Then it restarts. Then it checks again. Restart. Repeat.
Health checks matter. Let’s do them right.
The HEALTHCHECK Instruction
Every Dockerfile can define a health check:
    HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
      CMD curl -f http://localhost:8080/health || exit 1

What each flag does:
- --interval=30s: Run the check every 30 seconds (the default)
- --timeout=3s: The check must complete within 3 seconds (the default is 30s)
- --start-period=5s: Startup grace period. Checks still run, but failures during it don’t count toward --retries (the default is 0s)
- --retries=3: Mark the container unhealthy after 3 consecutive check failures (the default)
After --retries consecutive failures, Docker marks the container as unhealthy. Note: Docker doesn’t automatically kill unhealthy containers. Swarm replaces unhealthy tasks on its own; Kubernetes ignores HEALTHCHECK and relies on its own liveness and readiness probes. For single containers, unhealthy just means “the status says so.”
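To make the interplay of --start-period and --retries concrete, here is a simplified model of the status transitions in Python. It is a sketch, not Docker's implementation: the names HealthState, record_probe and simulate are made up for illustration, and it assumes the first probe fires one interval after the container starts.

```python
from dataclasses import dataclass


@dataclass
class HealthState:
    status: str = "starting"   # "starting", "healthy" or "unhealthy"
    failing_streak: int = 0


def record_probe(state, passed, elapsed_s, start_period_s=5.0, retries=3):
    """Update state after one probe that began elapsed_s seconds after start."""
    if passed:
        # Any success resets the streak and marks the container healthy.
        state.failing_streak = 0
        state.status = "healthy"
        return state
    # Failures are ignored only while the container has never been healthy
    # and the start period has not yet elapsed.
    in_grace = state.status == "starting" and elapsed_s < start_period_s
    if not in_grace:
        state.failing_streak += 1
        if state.failing_streak >= retries:
            state.status = "unhealthy"
    return state


def simulate(results, interval_s=30.0, start_period_s=5.0, retries=3):
    """Return the status after each probe result (True = probe passed)."""
    state = HealthState()
    history = []
    for i, passed in enumerate(results, start=1):
        # Assumption: probe i starts i * interval_s seconds after startup.
        record_probe(state, passed, i * interval_s, start_period_s, retries)
        history.append(state.status)
    return history


print(simulate([False, False, False]))
# ['starting', 'starting', 'unhealthy']
```

With the flags from the example above, three failed probes in a row are needed before the status flips, and a single passing probe in between resets the count.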
Check Types: What Works
curl (Most Common)
    HEALTHCHECK --interval=10s --timeout=2s --retries=2 \
      CMD curl -f http://localhost:8080/health || exit 1

The -f flag exits nonzero if HTTP status is >= 400. Simple and effective.
Gotcha: curl must be installed in the image. Alpine and slim base images (including lightweight Python images) usually don’t ship it, so install it explicitly or use a tool the image already has.
wget (Lightweight)
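    HEALTHCHECK --interval=10s --timeout=2s --retries=2 \
      CMD wget --quiet --tries=1 --spider http://localhost:8080/health || exit 1

BusyBox-based images such as Alpine ship a wget; Debian slim images usually don’t. --spider doesn’t download the body, just checks the status. BusyBox’s wget accepts fewer options than GNU wget, so test the exact command in your image.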
native tools (Best)
If your app has a built-in health endpoint in the binary, use it:
    HEALTHCHECK --interval=10s --timeout=2s --retries=2 \
      CMD ["/app/server", "health-check"]

Or if your language has a standard check tool:
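    # Python (exit status must reflect the result; a bare comparison is discarded)
    HEALTHCHECK --interval=10s --timeout=2s --retries=2 \
      CMD python -c "import sys, urllib.request; sys.exit(0 if urllib.request.urlopen('http://localhost:8080/health', timeout=2).getcode() == 200 else 1)"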
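The Python one-liner gets hard to read once it needs a timeout and error handling. The same check as a small module might look like the sketch below; the file name healthcheck.py and the function name probe are assumptions for illustration, not anything Docker requires.

```python
import http.client
import urllib.error
import urllib.request


def probe(url, timeout_s=2.0):
    """Return 0 if url answers HTTP 200 within timeout_s, otherwise 1."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return 0 if response.status == 200 else 1
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, malformed reply
        # or malformed URL: all of these count as a failed check.
        return 1
```

Assuming the module is importable from the container's working directory, the health check could call it with CMD python -c "import sys, healthcheck; sys.exit(healthcheck.probe('http://localhost:8080/health'))". Returning the exit code from a function, instead of exiting inside it, keeps the probe easy to test.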
    # Node.js (exit explicitly so connection errors and non-200 responses both fail)
    HEALTHCHECK --interval=10s --timeout=2s --retries=2 \
      CMD node -e "require('http').get('http://localhost:8080/health', (r) => process.exit(r.statusCode === 200 ? 0 : 1)).on('error', () => process.exit(1))"

nc (netcat) for TCP
Just checking if a port is open:
    HEALTHCHECK --interval=10s --timeout=2s --retries=2 \
      CMD nc -z localhost 8080 || exit 1

Works but doesn’t verify the app actually works, just that the port is listening.
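If nc isn't in the image but a Python interpreter is, the same port-level check can be sketched with the standard library. The function name port_is_open is hypothetical, and like nc -z it only proves that something accepts connections.

```python
import socket


def port_is_open(host, port, timeout_s=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        # create_connection performs the TCP handshake; the socket is closed
        # immediately when the with-block exits.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Refused, unreachable, timed out or unresolvable host.
        return False
```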
Good vs Bad Health Checks
Bad: Checking too much
    # Don't do this
    HEALTHCHECK CMD curl http://localhost/users && curl http://localhost/products && curl http://localhost/orders

If one endpoint is slow, the whole check times out and the container is marked unhealthy. Overkill.
Good: Simple and focused
    HEALTHCHECK --interval=10s --timeout=2s --retries=3 \
      CMD curl -f http://localhost:8080/health || exit 1

One lightweight endpoint that returns quickly.
Bad: Checking external dependencies
    # Don't do this
    HEALTHCHECK CMD curl http://external-api.example.com/status || exit 1

If the external service is down, your container is marked unhealthy and, under an orchestrator, restarted. That’s not a health issue, that’s a dependency issue.
Good: Check yourself
    HEALTHCHECK --interval=10s --timeout=2s --retries=3 \
      CMD curl -f http://localhost:8080/health || exit 1

The /health endpoint can check internal state (database connection, queues, etc.) without external calls.
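As a rough illustration of a /health endpoint that reports internal state, here is a standard-library Python sketch. The two entries in CHECKS are placeholders (assumptions); a real app would plug in cheap checks of its own state. The endpoint returns 200 when every check passes and 503 otherwise, which is what curl -f keys off.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical internal checks: each is a cheap callable returning True/False.
CHECKS = {
    "worker_thread_alive": lambda: True,
    "queue_below_limit": lambda: True,
}


def evaluate(checks):
    """Run every check; a check that raises counts as failed."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, results = evaluate(CHECKS)
        body = json.dumps(
            {"status": "ok" if status == 200 else "fail", "checks": results}
        ).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the access log


def serve(port=8080):
    """Blocking call: serve /health on the given port until interrupted."""
    ThreadingHTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Keeping the checks in a plain dictionary makes it easy to see at a glance what the endpoint depends on, and easy to notice when a slow or external call sneaks in.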
Tuning Parameters
Interval: How often to check?
- API servers: 10-15s
- Databases: 30s
- Batch workers: 60s (they might be working on something)
- Long-running tasks: 120s+
Default 30s is usually fine.
Timeout: How long to wait?
- Quick endpoints: 1-2s
- Database checks: 5s
- Slow queries: 10s+
A 3s timeout works for most HTTP endpoints; Docker’s own default is 30s.
Start-period: Grace period on startup
- Java apps: 30-60s (JVM startup is slow)
- Python: 5-10s
- Go: 1-2s
- Databases: 10-30s (init + recovery)
This is critical. If the grace period is shorter than your app’s startup time, early failures count against --retries and you’ll see false failures.
    # Java app with slow startup
    HEALTHCHECK --interval=10s --timeout=5s --start-period=60s --retries=3 \
      CMD curl -f http://localhost:8080/health || exit 1

Retries: How many failures trigger unhealthy?
- API servers: 2-3 (fail fast, but tolerate brief hiccups)
- Databases: 5+ (tolerate maintenance operations)
- Critical services: 1-2 (fail immediately)
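These three knobs together decide how long a broken container keeps its healthy status. The sketch below estimates that window for an already-running container, assuming each probe is scheduled one interval after the previous one completes, as Docker's documentation describes; the function name detection_window_s is made up for illustration and the numbers are approximations.

```python
def detection_window_s(interval_s, timeout_s, retries):
    """Approximate (fastest, slowest) seconds from app failure to 'unhealthy'."""
    # Fastest: the app breaks just before a probe fires and every probe
    # fails instantly, so only the waits between probes add up.
    fastest = (retries - 1) * interval_s
    # Slowest: the app breaks right after a passing probe, and every
    # failing probe runs until it hits the timeout.
    slowest = retries * (interval_s + timeout_s)
    return fastest, slowest


print(detection_window_s(10, 2, 3))   # API server settings from above
# (20, 36)
print(detection_window_s(30, 30, 3))  # Docker's defaults
# (60, 180)
```

If the slowest figure is longer than you can tolerate serving errors, shorten the interval or timeout before lowering retries, since a low retry count is what makes brief hiccups look like outages.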
Using Health Status in Compose
depends_on with health checks:
    version: '3.8'
    services:
      db:
        image: postgres:latest
        healthcheck:
          test: ["CMD", "pg_isready", "-U", "postgres"]
          interval: 10s
          timeout: 5s
          retries: 5

      api:
        image: myapi:latest
        depends_on:
          db:
            condition: service_healthy

Now the API won’t start until Postgres is actually ready, not just running.
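Outside Compose, for example in a deploy script, the same "wait until healthy" gate can be approximated by polling the health status. This is a sketch of the idea, not how Compose implements it: wait_until_healthy and docker_health_status are hypothetical names, and the status lookup shells out to the Docker CLI.

```python
import subprocess
import time


def docker_health_status(container):
    """Ask the Docker CLI for the container's health status string."""
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.Health.Status}}", container],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def wait_until_healthy(container, get_status=docker_health_status,
                       poll_s=1.0, deadline_s=60.0, sleep=time.sleep):
    """Poll until the container is healthy; give up on unhealthy or deadline."""
    waited = 0.0
    while True:
        status = get_status(container)
        if status == "healthy":
            return True
        if status == "unhealthy" or waited >= deadline_s:
            return False
        sleep(poll_s)
        waited += poll_s
```

The get_status and sleep parameters exist so the loop can be exercised without a Docker daemon; in a real script you would call wait_until_healthy("db") and abort the deploy when it returns False.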
Checking Health Status
    docker ps --format='table {{.Names}}\t{{.Status}}'
    # NAMES   STATUS
    # api     Up 2 minutes (healthy)
    # db      Up 2 minutes (unhealthy)
    # docker inspect prints a JSON array, so index into the first element
    docker inspect mycontainer | jq '.[0].State.Health'
    # {
    #   "Status": "healthy",
    #   "FailingStreak": 0,
    #   "Log": [
    #     {
    #       "Start": "2025-02-26T15:30:00.123Z",
    #       "End": "2025-02-26T15:30:01.456Z",
    #       "ExitCode": 0,
    #       "Output": ""
    #     }
    #   ]
    # }

A Real Example: Node.js API
    FROM node:20-alpine
    WORKDIR /app
    COPY package*.json ./
    RUN npm ci --omit=dev
    COPY . .

    # Health endpoint built into the app.
    # node:20-alpine ships BusyBox wget but not curl, so use wget here.
    HEALTHCHECK --interval=15s --timeout=3s --start-period=10s --retries=3 \
      CMD wget -q --spider http://localhost:3000/health || exit 1

    EXPOSE 3000
    CMD ["node", "server.js"]

In your Node app, include a simple health endpoint:
    app.get('/health', (req, res) => {
      res.status(200).json({ status: 'ok' });
    });

In docker-compose:
    services:
      api:
        build: .
        depends_on:
          redis:
            condition: service_healthy
        environment:
          REDIS_HOST: redis

      redis:
        image: redis:alpine
        healthcheck:
          test: ["CMD", "redis-cli", "ping"]
          interval: 10s
          timeout: 2s
          retries: 3

Common Mistakes
No start-period: Container fails health checks while booting. Add --start-period.
Timeout too short: Check always times out. Increase --timeout.
Checking external services: Container goes unhealthy when your ISP hiccups. Check yourself only.
Not implementing an endpoint: If your app doesn’t have /health, add one. It’s 3 lines of code.
Health checks are your first line of defense. Get them right, and you won’t wake up to a page at 2 AM because your orchestrator killed a container that was working fine.