Prometheus + Grafana: Monitoring That Doesn't Lie to You

You’re running a Docker Compose stack in your home lab. Everything works great at 2 PM. Then at 2 AM your phone buzzes because a service died silently and nobody noticed. Sound familiar?

That’s what monitoring is for. Not to make fancy dashboards for your blog post. But to catch problems before they become “I should’ve checked logs four hours ago” problems.

Prometheus and Grafana are the boring tools that actually make this work. Prometheus pulls metrics from your infrastructure. Grafana turns those metrics into graphs you can understand. Alertmanager screams at you when something’s on fire. Together, they’re the monitoring stack that scales from a Raspberry Pi to a data center without lying to you.

Most tutorials show you Prometheus + Grafana and stop. They don’t show you Alertmanager. They don’t show you Node Exporter. They definitely don’t show you how to actually use PromQL to find the problems that matter. This article builds the whole thing, and you’ll have a working stack you can copy-paste into your lab right now.

The Architecture (30 seconds)

Prometheus is a pull-based monitoring system. Your services expose metrics at an HTTP endpoint (typically /metrics on whatever port the service runs). Prometheus scrapes those endpoints on a schedule and evaluates the alert rules you define against the data. When a rule fires, Prometheus hands the alert to Alertmanager, which groups, deduplicates, and routes the notification. Grafana reads from Prometheus and renders graphs.

Exporters are the secret sauce. They’re small programs that sit in front of your infrastructure and translate their native metrics into Prometheus format. Node Exporter watches your Linux server. cAdvisor watches your Docker containers. Redis Exporter watches Redis. You get the idea.

Here’s what we’re building: Prometheus scrapes Node Exporter, cAdvisor, and itself, and evaluates alert rules against those metrics. Alertmanager receives the firing alerts and sends notifications to email or Slack. Grafana dashboards visualize the data. All of it runs in Docker Compose.

The Full Stack (Docker Compose)

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert.rules.yml:/etc/prometheus/alert.rules.yml
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    command:
      - "--config.file=/etc/alertmanager/alertmanager.yml"
      - "--storage.path=/alertmanager"
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_SECURITY_ADMIN_USER=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
    networks:
      - monitoring
    depends_on:
      - prometheus

  node_exporter:
    image: prom/node-exporter:latest
    container_name: node_exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
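    command:
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"
      # Point filesystem collection at the host root mounted above;
      # without this, disk metrics describe the container's own root.
      - "--path.rootfs=/rootfs"
      - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"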
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    networks:
      - monitoring
    privileged: true

volumes:
  prometheus_data:
  alertmanager_data:
  grafana_data:

networks:
  monitoring:
    driver: bridge

Spin it up:

docker compose up -d

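One catch: prometheus.yml, alert.rules.yml, and alertmanager.yml (all shown in the sections below) have to exist next to the Compose file before you run this. If they don’t, Docker creates empty directories at those bind-mount paths and Prometheus and Alertmanager fail to start. With the files in place, you have Prometheus listening at http://localhost:9090, Alertmanager at http://localhost:9093, Grafana at http://localhost:3000, and Node Exporter and cAdvisor feeding data.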

Prometheus Config (The Scrape Targets)

Prometheus needs to know what to scrape and how often. That’s prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: "homelab"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - "alert.rules.yml"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node_exporter"
    static_configs:
      - targets: ["node_exporter:9100"]

  - job_name: "cadvisor"
    static_configs:
      - targets: ["cadvisor:8080"]

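That gives you a 15-second scrape interval, rules evaluated every 15 seconds by default, one Alertmanager target, and three jobs: Prometheus itself, Node Exporter, and cAdvisor. The hostnames are Compose service names, which Docker resolves on the monitoring network; localhost:9090 is Prometheus scraping its own /metrics endpoint.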
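Adding another exporter later is just one more job. As a sketch, here is what a job for the Redis exporter mentioned earlier might look like. It assumes you’ve added a service named redis_exporter to the Compose file, listening on port 9121; that service is not part of the stack above, so treat the name, port, and label as placeholders. It also shows a per-job scrape interval override:

```yaml
scrape_configs:
  # ...existing prometheus, node_exporter, and cadvisor jobs stay as they are...

  # Hypothetical extra job: "redis_exporter" is NOT defined in the Compose file above.
  - job_name: "redis_exporter"
    scrape_interval: 30s          # overrides the global 15s for this job only
    static_configs:
      - targets: ["redis_exporter:9121"]
        labels:
          env: "homelab"          # extra label attached to every series from this target
```

Prometheus only reads its config at startup or on reload, so restart the container (docker compose restart prometheus) after editing the file.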

Alert Rules (Wake You Up at 2 AM)

Here’s what actually triggers an alert. Save this as alert.rules.yml:

groups:
  - name: homelab
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
        for: 5m
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% for the last 5 minutes"

      - alert: HighDiskUsage
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20
        for: 5m
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Only {{ $value }}% disk available"

      - alert: MemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%"

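These are PromQL expressions that Prometheus evaluates on a schedule. The CPU rule takes the idle share of CPU time per instance and subtracts it from 100 to get percent used. The disk rule fires when free space on / drops below 20%. The memory rule fires when usage passes 85%, meaning less than 15% is available. A condition has to stay true for 5 minutes (the for: clause) before the alert fires, so short spikes don’t page you.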
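None of those three rules catches the scenario from the intro: a service that dies quietly. Prometheus records a synthetic up metric for every scrape target (1 if the last scrape succeeded, 0 if it failed), so one extra rule can cover that. A minimal sketch; the alert name, the 2-minute window, and the severity label are arbitrary choices, not something the stack requires:

```yaml
groups:
  - name: homelab
    rules:
      # ...existing HighCPUUsage, HighDiskUsage, and MemoryUsage rules...

      # Fires when Prometheus has been unable to scrape a target for 2 minutes.
      - alert: TargetDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical      # assumed label; the rules above don't set one
        annotations:
          summary: "{{ $labels.job }} target {{ $labels.instance }} is down"
          description: "Prometheus has failed to scrape this target for 2 minutes"
```

This only covers things Prometheus scrapes. If Prometheus itself goes down, nothing fires, so it’s worth checking the stack from somewhere else too.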

Alertmanager Config (Who Gets Woken Up)

global:
  resolve_timeout: 5m

route:
  receiver: default
  group_by: ["alertname", "instance"]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h

receivers:
  - name: default
    email_configs:
      - to: "your-email@example.com"
        from: "alertmanager@sumguy.local"
        smarthost: "smtp.gmail.com:587"
        auth_username: "your-email@gmail.com"
        auth_password: "your-app-password"
        require_tls: true

Replace the email config with yours. If you use Slack, add a receiver with slack_configs pointing at an incoming-webhook URL. Alertmanager groups related alerts and re-sends a still-firing alert every 12 hours (repeat_interval). Want critical alerts to nag you sooner? Add a severity-based route with a shorter repeat_interval. That needs a severity label on your alert rules, which the rules above don’t set.
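For reference, a severity-based route with a Slack receiver could look roughly like this. It’s a sketch under a few assumptions: your alert rules carry a labels: block with severity: critical, the receiver name slack-critical and the channel are made up, and the webhook URL is a placeholder you’d replace with your own:

```yaml
route:
  receiver: default
  group_by: ["alertname", "instance"]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  routes:
    # Alerts labelled severity="critical" go to Slack and repeat hourly;
    # everything else falls through to the default email receiver.
    - matchers:
        - 'severity="critical"'
      receiver: slack-critical
      repeat_interval: 1h

receivers:
  - name: default
    # ...email_configs exactly as in the config above...
  - name: slack-critical            # hypothetical receiver name
    slack_configs:
      - api_url: "https://hooks.slack.com/services/REPLACE/WITH/YOURS"  # placeholder
        channel: "#homelab-alerts"  # hypothetical channel
        send_resolved: true         # also notify when the alert clears
```

Child routes inherit grouping settings from the parent, so only the receiver and repeat_interval need to be overridden here.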

Grafana: Add Prometheus as a Data Source

Log in to Grafana at http://localhost:3000 (admin/admin)
Configuration → Data Sources (in newer Grafana releases: Connections → Data sources)
Click Add Data Source
Pick Prometheus
URL: http://prometheus:9090
Click Save & Test

Done. Now Grafana can query Prometheus.
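Clicking through works, but the data source then lives only inside Grafana’s volume. Grafana can also load data sources from provisioning files at startup, which keeps the setup reproducible. A sketch, assuming you mount a local ./grafana/provisioning directory to /etc/grafana/provisioning in the grafana service (that mount is not in the Compose file above, and the file name is arbitrary):

```yaml
# ./grafana/provisioning/datasources/prometheus.yml  (hypothetical host path)
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                  # Grafana's backend makes the requests
    url: http://prometheus:9090    # Compose service name, same as in the UI steps
    isDefault: true
```

After adding the volume mount, restart Grafana and the data source shows up without any clicking.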

Building a Dashboard from Scratch

Create → Dashboard → Add a new panel (in newer Grafana releases: Dashboards → New → New dashboard → Add visualization)
For a CPU graph:

Query: (100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100))
Legend: CPU Usage
Panel title: “CPU Usage”

For memory:

Query: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
Legend: Memory Usage %

For disk:

Query: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100
Legend: Disk Usage %

Save the dashboard. Call it “Home Lab Overview.”

Alternatively, import a community dashboard: Dashboards → Import, paste dashboard ID 1860 (Node Exporter Full). It’s maintained, complete, and saves you hours.
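If you’d rather keep dashboards as files next to the Compose stack, Grafana can load dashboard JSON from disk through a provisioning provider. A minimal sketch; the provider name and both paths are assumptions, and it expects you to mount the provisioning directory and a folder of exported dashboard JSON into the container yourself:

```yaml
# ./grafana/provisioning/dashboards/homelab.yml  (hypothetical host path,
# mounted under /etc/grafana/provisioning/dashboards/ in the container)
apiVersion: 1

providers:
  - name: "homelab"                # arbitrary provider name
    type: file
    updateIntervalSeconds: 30      # how often Grafana rescans the folder
    allowUiUpdates: true           # let edits made in the UI be saved
    options:
      path: /var/lib/grafana/dashboards   # assumed container path holding *.json files
```

Export “Home Lab Overview” as JSON from the dashboard’s share menu, drop it in that folder, and it comes back after a rebuild.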

Node Exporter: What’s Actually Exposed?

Node Exporter exposes metrics prefixed with node_. Key ones:

node_cpu_seconds_total: CPU time per core and mode (user, system, idle)
node_memory_MemTotal_bytes, node_memory_MemAvailable_bytes: RAM
node_filesystem_size_bytes, node_filesystem_avail_bytes: disk
node_load1, node_load5, node_load15: load averages
node_network_transmit_bytes_total, node_network_receive_bytes_total: network I/O
node_processes_running: running process count

Hit http://localhost:9100/metrics and you’ll see them all. Thousands of metrics. But you only care about a handful for a small homelab.

cAdvisor: Docker Container Metrics

cAdvisor watches Docker containers and exposes metrics like container_cpu_usage_seconds_total, container_memory_usage_bytes, container_network_transmit_bytes_total. Graph them the same way. cAdvisor is available at http://localhost:8080.
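The same metrics work in alert rules. As an illustration, this rule group flags any container that burns more than 80% of one CPU core for 5 minutes. The group name, alert name, threshold, and severity label are all example values; the name label is what cAdvisor attaches to Docker containers, and name!="" filters out the non-container cgroup series:

```yaml
groups:
  - name: containers
    rules:
      - alert: ContainerHighCPU
        # CPU seconds consumed per second, per container: 0.8 means 80% of one core.
        expr: sum by (name) (rate(container_cpu_usage_seconds_total{name!=""}[5m])) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} is using a lot of CPU"
          description: "{{ $value | humanize }} CPU cores over the last 5 minutes"
```

It can go in alert.rules.yml as a second group alongside homelab.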

The PromQL You Need to Know

You don’t need to be a PromQL wizard, but here are the essentials:

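rate(metric[5m]): per-second rate of a counter, averaged over the last 5 minutes (smooths spiky data)
irate(metric[5m]): instant rate from the last two samples in the window (reacts faster, noisier; better for graphs than alerts)
avg(metric): average across all matching series; add by (instance) to get one value per host
max(metric): highest value across series at that moment; for the peak over time, use max_over_time(metric[1h])
histogram_quantile(0.95, sum by (le) (rate(metric_bucket[5m]))): 95th percentile from a histogram’s _bucket series
Operators: +, -, *, /, and, or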

That’s 95% of what you’ll write.
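To see a couple of these working together, here they are as recording rules, which precompute an expression and store the result under a new name. This is a sketch: the group and rule names are made up (they follow the usual level:metric:operation naming habit), and the second rule uses Prometheus’s own HTTP latency histogram because it’s the one histogram this stack is guaranteed to have:

```yaml
groups:
  - name: homelab_recordings
    rules:
      # rate() + sum by(): inbound bytes per second per host, skipping loopback.
      - record: instance:node_network_receive_bytes:rate5m
        expr: sum by (instance) (rate(node_network_receive_bytes_total{device!="lo"}[5m]))

      # histogram_quantile() over bucket rates: p95 latency of Prometheus's own HTTP API.
      - record: job:prometheus_http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum by (job, le) (rate(prometheus_http_request_duration_seconds_bucket[5m])))
```

Recording rules can live in the same file as alert rules, and you can paste either expr straight into a Grafana panel to try it first.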

What About Backups?

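Prometheus stores TSDB data in /prometheus, which is a Docker volume. Compose prefixes volume names with the project name (usually the directory name), so check docker volume ls for the real name; it will look something like myproject_prometheus_data rather than plain prometheus_data. Stop Prometheus first (docker compose stop prometheus) so the archive isn’t taken mid-write, then back it up like any other volume: docker run --rm -v myproject_prometheus_data:/data:ro -v /backup:/backup alpine tar czf /backup/prometheus.tar.gz -C /data .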

Grafana dashboards are stored in /var/lib/grafana, also a volume. Same backup strategy.
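How big the Prometheus backup gets depends on how much history it keeps. By default Prometheus retains 15 days of data. If you want a different window, the --storage.tsdb.retention.time flag sets it. A sketch of the prometheus service’s command list with 30 days of retention (30d is just an example value):

```yaml
services:
  prometheus:
    # ...image, ports, volumes, and networks unchanged...
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"   # example value; the default is 15d
```

Longer retention means a bigger volume and a bigger tarball, so pick a window you’ll actually look back over.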

The Truth About Monitoring

Monitoring is boring. But silence at 2 AM is worse. This stack finds problems before they cascade. You find out before your users do, which is the right order.

No fancy ML. No predictive alerting. No dashboards with 47 panels. Just numbers, thresholds, and a loud bell.

That’s the whole thing. Fire it up.
