SumGuy's Ramblings

Prometheus + Grafana on Docker: Know When Your Server Is Crying Before It Dies

Your Server Was Screaming and You Didn’t Hear It

Here’s a scenario that should feel uncomfortably familiar: you’re going about your day, completely unbothered, maybe even smug about your self-hosted setup, when someone pings you asking why your app has been down for two hours. You check. It has been down for two hours. The disk filled up. You had no idea.

This is what flying blind looks like. And most home lab setups run exactly like this — totally dark until something explodes and users (or worse, yourself, at 11pm) discover the carnage.

The fix is a proper monitoring stack. Specifically: Prometheus for collecting metrics and Grafana for turning those metrics into dashboards so beautiful you’ll spend three hours tuning the colors instead of doing actual work.

This guide gets you a full working stack in Docker Compose: Prometheus, Grafana, Node Exporter (host metrics), and cAdvisor (container metrics). By the end, you’ll have CPU, RAM, disk, and container stats flowing into a live dashboard — plus a basic alert so something yells at you before the disk fills up again.


What Prometheus Actually Does

Prometheus is a pull-based time series database. That’s the key word: pull. Instead of your apps shoving metrics at a central collector, Prometheus goes out on a schedule and scrapes metrics from targets that expose an HTTP endpoint (usually /metrics).

Every scrape interval (default: 15 seconds), Prometheus hits your configured targets and stores whatever they return as time-stamped key-value data. That data lives in its own embedded time series database (TSDB) — no Postgres, no Redis, no drama.
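To make "targets that expose an HTTP endpoint" concrete, here's a minimal sketch of a scrape target using only Python's standard library. The metric name and port are made up for illustration; a real exporter like Node Exporter does exactly this, just with hundreds of metrics:

```python
# Minimal sketch of a Prometheus scrape target: any HTTP endpoint that
# returns plain text in the exposition format. The metric name and port
# here are illustrative, not from a real exporter.
from http.server import BaseHTTPRequestHandler, HTTPServer

METRICS = (
    "# HELP app_requests_total Total requests served.\n"
    "# TYPE app_requests_total counter\n"
    "app_requests_total 1027\n"
)

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = METRICS.encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # keep console output quiet
        pass

if __name__ == "__main__":
    # Prometheus would scrape http://<host>:9099/metrics on its schedule.
    HTTPServer(("127.0.0.1", 9099), MetricsHandler).serve_forever()
```

Point a scrape job at it and Prometheus does the rest: it pulls, timestamps, and stores whatever the endpoint returns.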

You query this data with PromQL — Prometheus Query Language. PromQL is basically SQL if SQL was written by someone who really hates joins but loves functions. It looks weird at first, but it’s surprisingly readable once you stop fighting it.

A few quick examples to demystify it:

# CPU usage percentage (across all cores)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Available memory in bytes
node_memory_MemAvailable_bytes

# Disk space used percentage
100 - ((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes)

# Container CPU usage rate
rate(container_cpu_usage_seconds_total[5m])

You don’t need to memorize these. The community dashboards (more on that later) come pre-loaded with all the queries you’ll ever need.
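You can also run these queries programmatically through Prometheus's HTTP API (`/api/v1/query`). Here's a hedged sketch: the JSON below is a canned example of the API's instant-query response shape with made-up values, so the parsing runs without a live server — swap in a real request against your instance:

```python
# Sketch of querying Prometheus over its HTTP API. The response below is
# a canned example of the instant-query JSON shape (illustrative values);
# against a live server you'd fetch `url` with urllib or requests.
import json
from urllib.parse import urlencode

query = 'node_memory_MemAvailable_bytes'
url = "http://localhost:9090/api/v1/query?" + urlencode({"query": query})

canned = json.loads("""
{"status": "success",
 "data": {"resultType": "vector",
          "result": [{"metric": {"instance": "node-exporter:9100",
                                 "job": "node-exporter"},
                      "value": [1700000000, "8589934592"]}]}}
""")

# Each result carries its labels plus a [timestamp, value] pair;
# values come back as strings.
for r in canned["data"]["result"]:
    ts, val = r["value"]
    print(r["metric"]["instance"], int(val) // 1024**2, "MiB available")
    # -> node-exporter:9100 8192 MiB available
```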


What Grafana Does

Grafana is the visualization layer. It doesn’t store data — it connects to data sources (like Prometheus) and renders dashboards from queries you define.

Out of the box, Grafana gives you:

  - Panel types for graphs, gauges, tables, and heatmaps
  - A query editor with PromQL autocomplete
  - Built-in alerting with contact points (email, Slack, Telegram, and more)
  - A huge library of importable community dashboards

That last point is the cheat code for getting a gorgeous setup fast. The Node Exporter Full dashboard (ID: 1860) alone is worth the whole setup.


The Stack: Full Docker Compose Setup

Create a directory for your monitoring stack and drop these files in:

mkdir -p ~/monitoring/{prometheus,grafana}
cd ~/monitoring

docker-compose.yml

version: "3.8"

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data: {}
  grafana_data: {}

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alerts.yml:/etc/prometheus/alerts.yml
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=15d"
      - "--web.enable-lifecycle"
    ports:
      - "9090:9090"
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_USERS_ALLOW_SIGN_UP=false
    ports:
      - "3000:3000"
    networks:
      - monitoring
    depends_on:
      - prometheus

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - "--path.procfs=/host/proc"
      - "--path.rootfs=/rootfs"
      - "--path.sysfs=/host/sys"
      - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
    ports:
      - "9100:9100"
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    ports:
      - "8080:8080"
    networks:
      - monitoring

prometheus/prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: []

rule_files:
  - "alerts.yml"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]

  - job_name: "cadvisor"
    static_configs:
      - targets: ["cadvisor:8080"]

prometheus/alerts.yml

groups:
  - name: host_alerts
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 85% for more than 5 minutes."

      - alert: LowDiskSpace
        expr: 100 - ((node_filesystem_avail_bytes{fstype!="tmpfs"} * 100) / node_filesystem_size_bytes{fstype!="tmpfs"}) > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk usage is above 85% on {{ $labels.mountpoint }}."

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 90%."

Spin It Up

docker compose up -d

Give it 30 seconds to settle, then check:

  - Prometheus at http://localhost:9090
  - Grafana at http://localhost:3000 (log in with admin and the password you set)
  - Node Exporter's raw metrics at http://localhost:9100/metrics
  - cAdvisor at http://localhost:8080

In Prometheus, go to Status > Targets and verify all three scrape targets show UP. If something is DOWN, check the container logs:

docker logs prometheus
docker logs node-exporter

Connecting Grafana to Prometheus

  1. Log into Grafana at port 3000
  2. Go to Connections > Data Sources > Add data source
  3. Select Prometheus
  4. Set the URL to http://prometheus:9090 (use the container name — they’re on the same Docker network)
  5. Click Save & Test — you should see a green checkmark

If you see a connection error, double-check you used prometheus as the hostname, not localhost. Localhost inside a container is the container itself.


Import the Node Exporter Full Dashboard

This is the cheat code. Instead of building a dashboard from scratch (which you can do later, after you’ve accepted the dashboard addiction), import the community masterpiece:

  1. In Grafana, go to Dashboards > Import
  2. Enter dashboard ID: 1860
  3. Click Load
  4. Select your Prometheus data source
  5. Click Import

You now have a fully loaded system metrics dashboard showing CPU, memory, disk I/O, network traffic, system load, and more. Fair warning: you will immediately start tweaking panel colors and rearranging tiles. This is normal. Schedule the three hours now.

For container metrics from cAdvisor, import dashboard ID 14282 (Docker container monitoring).


A Few Useful PromQL Queries to Know

Once you’re poking around in Prometheus or building custom panels, these come up constantly:

# System uptime in seconds
node_time_seconds - node_boot_time_seconds

# Network received bytes per second (eth0)
rate(node_network_receive_bytes_total{device="eth0"}[5m])

# Number of running containers
count(container_last_seen{name!=""})

# Disk read bytes per second
rate(node_disk_read_bytes_total[5m])

PromQL’s rate() function is your best friend. It takes a counter metric and calculates how fast it’s increasing over a time window. Most raw metrics from Node Exporter are counters (they only go up), so rate() converts them into something actually readable, like “bytes per second” instead of “total bytes since boot.”
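Here's a simplified illustration of what rate() computes: the per-second increase of a counter over a window. The real function also handles counter resets and extrapolates to the window boundaries; this sketch, with made-up samples, does neither:

```python
# Simplified sketch of PromQL's rate(): per-second increase of a counter
# over a time window. Real rate() also handles counter resets and
# extrapolation; this illustration does not.
def simple_rate(samples):
    """samples: list of (unix_timestamp, counter_value) inside the window."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# A byte counter scraped every 15s, growing by ~3000 bytes per scrape
# (values are made up):
window = [(0, 0), (15, 3000), (30, 6000), (45, 9000), (60, 12000)]
print(simple_rate(window))  # 200.0 -> "bytes per second" instead of a raw total
```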


Setting Up Grafana Alerts

The alert rules in alerts.yml tell Prometheus to fire alerts — but right now there’s nowhere to send them. For a basic setup, configure Grafana’s alerting to notify you directly:

  1. Go to Alerting > Contact Points
  2. Add a new contact point (email, Slack webhook, Telegram bot — whatever you use)
  3. Go to Alerting > Notification Policies and set your default policy to use that contact point

For Prometheus alerts to show up in Grafana’s alert list, you need Alertmanager — but that’s a whole separate container and config. For home lab use, Grafana’s built-in alerting on top of your Prometheus data source is usually enough. Create alert rules directly in Grafana under Alerting > Alert Rules, using the same PromQL queries from above.


Retention and Storage

By default, the compose file above keeps 15 days of metrics (--storage.tsdb.retention.time=15d). Prometheus is efficient — a typical home lab setup with a handful of scrape targets uses maybe 1-3 GB for 15 days of data. If you’re tight on disk, drop it to 7d. If you want more history, bump it up. Just make sure your volume has space.
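A back-of-the-envelope estimate shows where that 1-3 GB figure comes from. Prometheus compresses samples to roughly 1-2 bytes each; the series count below is an assumption (Node Exporter plus cAdvisor can easily produce on the order of 10k active series), not a measurement of your setup:

```python
# Rough Prometheus disk estimate: samples = series * (retention / interval),
# at roughly 1-2 bytes per compressed sample. All inputs are assumptions.
bytes_per_sample = 1.5
active_series = 10_000        # node-exporter + cAdvisor, order of magnitude
scrape_interval_s = 15
retention_days = 15

samples = active_series * (retention_days * 86_400 / scrape_interval_s)
gib = samples * bytes_per_sample / 1024**3
print(f"~{gib:.1f} GiB")  # ~1.2 GiB
```

Double the series count or the retention and the estimate scales linearly, so it's easy to sanity-check your volume before bumping the retention flag.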


What You Have Now

After following this guide, you have:

  - Prometheus scraping host and container metrics every 15 seconds
  - Node Exporter and cAdvisor feeding it system and Docker stats
  - Grafana dashboards for both, imported in a couple of clicks
  - Alert rules watching CPU, memory, and disk usage

More importantly, the next time something goes wrong, you’ll have a timeline. You’ll know exactly when CPU spiked, when memory got eaten, or when disk started filling up. You go from “I have no idea what happened” to “disk usage climbed steadily from 3am and hit 100% at 6:47am” — which makes fixing problems dramatically less painful.

Stop flying blind. Your server is trying to tell you things. Now you can actually hear it.

