You’re running a Docker Compose stack in your home lab. Everything works great at 2 PM. Then at 2 AM your phone buzzes because a service died silently and nobody noticed. Sound familiar?
That’s what monitoring is for. Not to make fancy dashboards for your blog post. But to catch problems before they become “I should’ve checked logs four hours ago” problems.
Prometheus and Grafana are the boring, boring tools that actually make this work. Prometheus pulls metrics from your infrastructure. Grafana turns those metrics into graphs you can understand. Alertmanager screams at you when something’s on fire. Together, they’re the monitoring stack that scales from a Raspberry Pi to a data center without lying to you.
Here’s the thing: most tutorials show you Prometheus + Grafana and stop. They don’t show you Alertmanager. They don’t show you Node Exporter. They definitely don’t show you how to actually use PromQL to find the problems that matter. This article builds the whole thing—and you’ll have a working stack you can copy-paste into your lab right now.
The Architecture (30 seconds)
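Prometheus is a pull-based monitoring system. Your services expose metrics at an HTTP endpoint (Node Exporter, for example, serves :9100/metrics). Prometheus scrapes those endpoints on a schedule and evaluates the alert rules you define against the stored metrics. When a rule fires, Prometheus hands the alert to Alertmanager, which groups it and sends the notification. Grafana reads from Prometheus and renders graphs.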
Exporters are the secret sauce. They’re small programs that sit in front of your infrastructure and translate their native metrics into Prometheus format. Node Exporter watches your Linux server. cAdvisor watches your Docker containers. Redis Exporter watches Redis. You get the idea.
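To make "Prometheus format" concrete, here is a minimal sketch of what a scrape target returns and how it could be read. The sample text and its values are made up for illustration, and parse_metrics is a hypothetical helper, not part of any Prometheus library. It ignores the optional timestamps and escaped quotes that the real format allows.

```python
import re

# Illustrative /metrics output; the values are invented for this example.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
node_cpu_seconds_total{cpu="0",mode="user"} 67.8
"""

# metric_name, optional {labels}, then the sample value
LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE lines and blanks carry no sample
        match = LINE.match(line)
        if match is None:
            continue  # not a shape this simplified parser understands
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

Every exporter in this stack speaks this same line-oriented format, which is why Prometheus can scrape Node Exporter, cAdvisor, and itself with identical scrape config.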
Here’s what we’re building: Prometheus scrapes Node Exporter, cAdvisor, and itself, and evaluates alert rules against those metrics. Alertmanager receives the firing alerts and can send them to email or Slack. Grafana dashboards visualize the data. All of it runs in Docker Compose.
The Full Stack (Docker Compose)
version: "3.8"

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert.rules.yml:/etc/prometheus/alert.rules.yml
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    command:
      - "--config.file=/etc/alertmanager/alertmanager.yml"
      - "--storage.path=/alertmanager"
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_SECURITY_ADMIN_USER=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
    networks:
      - monitoring
    depends_on:
      - prometheus

  node_exporter:
    image: prom/node-exporter:latest
    container_name: node_exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"
      # Point filesystem stats at the host root mounted above, not the container's own root
      - "--path.rootfs=/rootfs"
      - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    networks:
      - monitoring
    privileged: true

volumes:
  prometheus_data:
  alertmanager_data:
  grafana_data:

networks:
  monitoring:
    driver: bridge

Create prometheus.yml, alert.rules.yml, and alertmanager.yml (covered in the next sections) next to the Compose file first. If a bind-mounted file is missing, Docker creates an empty directory in its place and the container fails to start. Then spin it up:

docker-compose up -d

That’s it. You now have Prometheus listening at http://localhost:9090, Grafana at http://localhost:3000, and Node Exporter feeding data.
Prometheus Config (The Scrape Targets)
Prometheus needs to know what to scrape and how often. That’s prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: "homelab"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - "alert.rules.yml"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node_exporter"
    static_configs:
      - targets: ["node_exporter:9100"]

  - job_name: "cadvisor"
    static_configs:
      - targets: ["cadvisor:8080"]

That’s 15-second scrape intervals, Alertmanager targets defined, and three jobs: Prometheus itself, Node Exporter, and cAdvisor. Each target is a service in the Compose stack.
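Once Prometheus is running, the built-in up metric reports 1 for every target it scraped successfully and 0 for every target it could not reach. The sketch below shows one way to check that through the Prometheus HTTP query API. The sample response is hand-written to match the three jobs above (the timestamp and values are invented), and instant_query_url and down_targets are hypothetical helper names.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL for an instant query against /api/v1/query."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def down_targets(response):
    """Return (job, instance) pairs whose `up` sample is 0."""
    if response.get("status") != "success":
        raise ValueError("query failed")
    down = []
    for series in response["data"]["result"]:
        _timestamp, value = series["value"]  # the value arrives as a string
        if float(value) == 0:
            labels = series["metric"]
            down.append((labels.get("job"), labels.get("instance")))
    return down


# Hand-written example of the response shape; the numbers are illustrative.
SAMPLE_RESPONSE = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "up", "job": "prometheus", "instance": "localhost:9090"},
       "value": [1700000000, "1"]},
      {"metric": {"__name__": "up", "job": "node_exporter", "instance": "node_exporter:9100"},
       "value": [1700000000, "1"]},
      {"metric": {"__name__": "up", "job": "cadvisor", "instance": "cadvisor:8080"},
       "value": [1700000000, "0"]}
    ]
  }
}
""")

print(instant_query_url("http://localhost:9090", "up"))
print(down_targets(SAMPLE_RESPONSE))
```

Against a live stack, the same URL can be fetched with urllib.request.urlopen and the body passed through json.load before calling down_targets. The Status → Targets page in the Prometheus UI shows the same information without any code.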
Alert Rules (Wake You Up at 2 AM)
Here’s what actually triggers an alert. Save this as alert.rules.yml:
groups:
  - name: homelab
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: (100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
        for: 5m
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% for the last 5 minutes"

      - alert: HighDiskUsage
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20
        for: 5m
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Only {{ $value }}% disk available"

      - alert: MemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%"

These are PromQL queries. The CPU one calculates 100 minus idle CPU (inverse = used). Disk one checks if free space is under 20%. Memory checks if available is under 15%. If any condition is true for 5 minutes, an alert fires.
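The CPU expression is easier to trust once the arithmetic is spelled out. node_cpu_seconds_total{mode="idle"} is a counter of idle seconds per core, so its per-second rate is the fraction of time that core sat idle. The sketch below reproduces the calculation on made-up samples. It is a simplification: the real irate also handles counter resets, which this version ignores.

```python
def irate(samples):
    """Per-second increase between the last two (timestamp, value) samples."""
    (t1, v1), (t2, v2) = samples[-2], samples[-1]
    return (v2 - v1) / (t2 - t1)


def cpu_usage_percent(idle_samples_by_core):
    """100 minus the average idle percentage across all cores."""
    idle_rates = [irate(samples) for samples in idle_samples_by_core.values()]
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


# Invented idle-counter samples, 15 seconds apart (one scrape interval).
idle = {
    "0": [(0, 1000.0), (15, 1003.75)],  # 3.75 idle seconds in 15s
    "1": [(0, 2000.0), (15, 2007.5)],   # 7.5 idle seconds in 15s
}

usage = cpu_usage_percent(idle)
print(f"{usage:.1f}% used, alert condition met: {usage > 80}")
```

Here core 0 was idle a quarter of the time and core 1 half of the time, so average usage lands well under the 80% threshold and the rule stays quiet. Even above the threshold, the alert only fires after the condition has held for the full for: 5m window.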
Alertmanager Config (Who Gets Woken Up)
Save this as alertmanager.yml:

global:
  resolve_timeout: 5m

route:
  receiver: default
  group_by: ["alertname", "instance"]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h

receivers:
  - name: default
    email_configs:
      - to: "your-email@example.com"
        from: "alertmanager@sumguy.local"
        smarthost: "smtp.gmail.com:587"
        auth_username: "your-email@gmail.com"
        auth_password: "your-app-password"
        require_tls: true

Replace the email config with yours. If you use Slack, add a Slack receiver with a webhook URL. Alertmanager groups related alerts by alertname and instance, waits 10 seconds to batch them, then sends one notification. An alert that keeps firing is re-sent every 12 hours. This config has a single route, so every alert follows that schedule. To repeat critical alerts faster, add a severity label to the rule and a child route with a shorter repeat_interval.
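The three timing settings in the route interact in a way that is easy to misread. The sketch below is a deliberately simplified model of one alert group that fires continuously and never changes: the group is flushed for the first time after group_wait, re-checked every group_interval, and only re-sent once repeat_interval has passed since the last notification. Real Alertmanager also notifies when alerts join or leave the group and when they resolve, and notification_times is a hypothetical name.

```python
def notification_times(group_wait, group_interval, repeat_interval, firing_for):
    """Seconds (since the alert started firing) at which a notification goes out.

    Simplified model: one unchanging alert group, no resolves, no silences.
    """
    times = []
    last_sent = None
    tick = group_wait  # the first flush happens after group_wait
    while tick <= firing_for:
        if last_sent is None or tick - last_sent >= repeat_interval:
            times.append(tick)
            last_sent = tick
        tick += group_interval  # later flushes happen every group_interval
    return times


HOUR = 3600

# The values from alertmanager.yml above, for an alert that fires for 24 hours.
print(notification_times(group_wait=10, group_interval=10,
                         repeat_interval=12 * HOUR, firing_for=24 * HOUR))
```

Under this model the first notification arrives about 10 seconds after the alert starts firing, and a reminder follows roughly 12 hours later. Nothing is held back for 12 hours before the first send.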
Grafana: Add Prometheus as a Data Source
- Log in to Grafana at http://localhost:3000 (admin/admin)
- Configuration → Data Sources
- Click Add Data Source
- Pick Prometheus
- URL: http://prometheus:9090
- Click Save & Test
Done. Now Grafana can query Prometheus.
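If you rebuild the stack often, the same data source can be created without clicking through the UI. The sketch below builds the request for Grafana's data source HTTP API using the admin/admin credentials from the Compose file. It assumes basic auth is enabled on your Grafana instance, and build_datasource_request is a hypothetical helper name. The request is only constructed here, not sent.

```python
import base64
import json
import urllib.request


def build_datasource_request(grafana_url, user, password):
    """Build a POST request that registers Prometheus as a Grafana data source."""
    payload = {
        "name": "Prometheus",
        "type": "prometheus",
        "access": "proxy",                 # Grafana's backend makes the queries
        "url": "http://prometheus:9090",   # service name on the Compose network
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


request = build_datasource_request("http://localhost:3000", "admin", "admin")
print(request.get_method(), request.full_url)

# To send it against a running Grafana, uncomment:
# urllib.request.urlopen(request)
```

The URL stays http://prometheus:9090 rather than localhost because Grafana reaches Prometheus over the Compose network, exactly as in the manual steps above.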
Building a Dashboard from Scratch
- Create → Dashboard → Add a new panel
- For a CPU graph:
  - Query: (100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100))
  - Legend: CPU Usage
  - Panel title: “CPU Usage”
- For memory:
  - Query: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
  - Legend: Memory Usage %
- For disk:
  - Query: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100
  - Legend: Disk Usage %
Save the dashboard. Call it “Home Lab Overview.”
Alternatively, import a community dashboard: Dashboards → Import, paste dashboard ID 1860 (Node Exporter Full). It’s maintained, complete, and saves you hours.
Node Exporter: What’s Actually Exposed?
Node Exporter exposes metrics prefixed with node_. Key ones:
- node_cpu_seconds_total: CPU time per core and mode (user, system, idle)
- node_memory_MemTotal_bytes, node_memory_MemAvailable_bytes: RAM
- node_filesystem_size_bytes, node_filesystem_avail_bytes: disk
- node_load1, node_load5, node_load15: load averages
- node_network_transmit_bytes_total, node_network_receive_bytes_total: network I/O
- node_processes_running: running process count
Hit http://localhost:9100/metrics and you’ll see them all. Thousands of metrics. But you only care about a handful for a small homelab.
cAdvisor: Docker Container Metrics
cAdvisor watches Docker containers and exposes metrics like container_cpu_usage_seconds_total, container_memory_usage_bytes, container_network_transmit_bytes_total. Graph them the same way. cAdvisor is available at http://localhost:8080.
The PromQL You Need to Know
You don’t need to be a PromQL wizard, but here are the essentials:
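- rate(metric[5m]): average per-second rate over 5 minutes (smooths spiky data)
- irate(metric[5m]): instant rate from the last two samples in the window (reacts faster, noisier)
- avg(metric): average across all matching series
- max(metric): highest value across all matching series
- histogram_quantile(0.95, rate(metric_bucket[5m])): 95th percentile, computed from a histogram’s buckets
- Operators: +, -, *, /, and, or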
That’s 95% of what you’ll write.
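The difference between rate and irate is easiest to see on a counter that sits flat and then jumps. The sketch below uses simplified versions of both functions on invented samples. The real rate also extrapolates to the window edges and both handle counter resets, which these versions skip.

```python
def rate(samples):
    """Average per-second increase between the first and last sample in the window."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def irate(samples):
    """Per-second increase between only the last two samples in the window."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)


# A counter scraped every 15s: flat for 45 seconds, then a burst of 60.
window = [(0, 100.0), (15, 100.0), (30, 100.0), (45, 100.0), (60, 160.0)]

print("rate: ", rate(window))
print("irate:", irate(window))
```

rate spreads the burst over the whole window, while irate reports only the last interval and so shows a value four times higher. That is why irate suits fast-moving graphs, and why rate is usually the calmer choice for alert expressions.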
What About Backups?
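Prometheus stores TSDB data in /prometheus. That directory is a Docker volume. Back it up like any other volume:

docker run --rm -v prometheus_data:/data -v /backup:/backup alpine tar czf /backup/prometheus.tar.gz -C /data .

Compose prefixes volume names with the project name, so the volume on disk is usually called something like <project>_prometheus_data. Run docker volume ls to get the exact name before you back it up.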
Grafana dashboards are stored in /var/lib/grafana, also a volume. Same backup strategy.
The Truth About Monitoring
Monitoring is boring. But silence at 2 AM is worse. This stack finds problems before they cascade. You find out before your users do, which is the right order.
No fancy ML. No predictive alerting. No dashboards with 47 panels. Just numbers, thresholds, and a loud bell.
That’s the whole thing. Fire it up.