# Your Server Was Screaming and You Didn’t Hear It
Here’s a scenario that should feel uncomfortably familiar: you’re going about your day, completely unbothered, maybe even smug about your self-hosted setup, when someone pings you asking why your app has been down for two hours. You check. It has been down for two hours. The disk filled up. You had no idea.
This is what flying blind looks like. And most home lab setups run exactly like this — totally dark until something explodes and users (or worse, yourself, at 11pm) discover the carnage.
The fix is a proper monitoring stack. Specifically: Prometheus for collecting metrics and Grafana for turning those metrics into dashboards so beautiful you’ll spend three hours tuning the colors instead of doing actual work.
This guide gets you a full working stack in Docker Compose: Prometheus, Grafana, Node Exporter (host metrics), and cAdvisor (container metrics). By the end, you’ll have CPU, RAM, disk, and container stats flowing into a live dashboard — plus a basic alert so something yells at you before the disk fills up again.
## What Prometheus Actually Does
Prometheus is a pull-based time series database. That’s the key word: pull. Instead of your apps shoving metrics at a central collector, Prometheus goes out on a schedule and scrapes metrics from targets that expose an HTTP endpoint (usually /metrics).
Every scrape interval (default: 15 seconds), Prometheus hits your configured targets and stores whatever they return as time-stamped key-value data. That data lives in its own embedded time series database (TSDB) — no Postgres, no Redis, no drama.
You query this data with PromQL — Prometheus Query Language. PromQL is basically SQL if SQL was written by someone who really hates joins but loves functions. It looks weird at first, but it’s surprisingly readable once you stop fighting it.
A few quick examples to demystify it:
```promql
# CPU usage percentage (across all cores)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Available memory in bytes
node_memory_MemAvailable_bytes

# Disk space used percentage
100 - ((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes)

# Container CPU usage rate
rate(container_cpu_usage_seconds_total[5m])
```
You don’t need to memorize these. The community dashboards (more on that later) come pre-loaded with all the queries you’ll ever need.
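If you'd rather poke at these outside the web UI, Prometheus also answers PromQL over a plain HTTP API at `/api/v1/query`. Here's a rough sketch using only Python's standard library; it assumes the default `localhost:9090` address from this guide, and the function names are mine, not part of any client library:

```python
import json
import urllib.parse
import urllib.request

def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant query against Prometheus's HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})

def instant_query(base_url: str, promql: str) -> dict:
    """Run an instant PromQL query and return the decoded JSON response."""
    with urllib.request.urlopen(instant_query_url(base_url, promql)) as resp:
        return json.load(resp)

# With the stack from this guide running, something like:
#   result = instant_query("http://localhost:9090", "up")
# returns one series per scrape target; 'up' is 1 for every target
# Prometheus can currently reach, 0 for anything that's down.
```

The same endpoint accepts any query from the examples above, which makes it handy for scripting quick health checks without opening a browser.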
## What Grafana Does
Grafana is the visualization layer. It doesn’t store data — it connects to data sources (like Prometheus) and renders dashboards from queries you define.
Out of the box, Grafana gives you:
- A drag-and-drop dashboard builder
- Support for dozens of data sources beyond Prometheus
- A built-in alerting engine that can notify you via email, Slack, PagerDuty, webhooks, and more
- A massive community library of pre-built dashboards you can import with a single ID
That last point is the cheat code for getting a gorgeous setup fast. The Node Exporter Full dashboard (ID: 1860) alone is worth the whole setup.
## The Stack: Full Docker Compose Setup
Create a directory for your monitoring stack and drop these files in:
```bash
mkdir -p ~/monitoring/{prometheus,grafana}
cd ~/monitoring
```
**docker-compose.yml**

```yaml
version: "3.8"

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data: {}
  grafana_data: {}

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alerts.yml:/etc/prometheus/alerts.yml
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=15d"
      - "--web.enable-lifecycle"
    ports:
      - "9090:9090"
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_USERS_ALLOW_SIGN_UP=false
    ports:
      - "3000:3000"
    networks:
      - monitoring
    depends_on:
      - prometheus

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - "--path.procfs=/host/proc"
      - "--path.rootfs=/rootfs"
      - "--path.sysfs=/host/sys"
      - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
    ports:
      - "9100:9100"
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    ports:
      - "8080:8080"
    networks:
      - monitoring
```
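If you want Docker itself to notice when Prometheus stops answering, you can bolt a healthcheck onto the service. This is an optional extra, not part of the stack above; it assumes the busybox `wget` shipped in the `prom/prometheus` image and Prometheus's `/-/healthy` endpoint:

```yaml
services:
  prometheus:
    # ...same as above, plus:
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:9090/-/healthy"]
      interval: 30s
      timeout: 5s
      retries: 3
```

With this in place, `docker ps` shows the container as `healthy` or `unhealthy` instead of just `running`.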
**prometheus/prometheus.yml**

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: []

rule_files:
  - "alerts.yml"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]
  - job_name: "cadvisor"
    static_configs:
      - targets: ["cadvisor:8080"]
```
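One nice consequence of the `--web.enable-lifecycle` flag in the compose file: you can make Prometheus re-read this config without restarting the container by POSTing to `/-/reload` (a plain `curl -X POST http://localhost:9090/-/reload` does it). A minimal Python sketch of the same thing, assuming Prometheus is reachable on `localhost:9090`; the helper names are mine:

```python
import urllib.request

def reload_request(base_url: str) -> urllib.request.Request:
    """Build the POST that asks Prometheus to re-read prometheus.yml and alerts.yml."""
    return urllib.request.Request(f"{base_url}/-/reload", method="POST")

def reload_prometheus(base_url: str = "http://localhost:9090") -> int:
    """Trigger a live config reload. The endpoint only exists when
    Prometheus runs with --web.enable-lifecycle, as it does here."""
    with urllib.request.urlopen(reload_request(base_url)) as resp:
        return resp.status
```

Handy when you're iterating on scrape configs or alert thresholds and don't want to lose your in-memory data to a restart.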
**prometheus/alerts.yml**

```yaml
groups:
  - name: host_alerts
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 85% for more than 5 minutes."

      - alert: LowDiskSpace
        expr: 100 - ((node_filesystem_avail_bytes{fstype!="tmpfs"} * 100) / node_filesystem_size_bytes{fstype!="tmpfs"}) > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk usage is above 85% on {{ $labels.mountpoint }}."

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 90%."
```
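If the alert expressions look opaque, the arithmetic behind them is simple. Here's the `HighMemoryUsage` condition redone as plain Python, just to show when it would fire; the helper names are mine, and remember Prometheus additionally requires the condition to hold for the full `for: 5m` window before the alert actually fires:

```python
def memory_used_percent(mem_available_bytes: float, mem_total_bytes: float) -> float:
    """Same arithmetic as the HighMemoryUsage expr:
    (1 - MemAvailable / MemTotal) * 100."""
    return (1 - mem_available_bytes / mem_total_bytes) * 100

def high_memory_firing(mem_available_bytes: float, mem_total_bytes: float,
                       threshold: float = 90.0) -> bool:
    """True when the alert condition is met at a single evaluation."""
    return memory_used_percent(mem_available_bytes, mem_total_bytes) > threshold

# A 16 GiB box with 1 GiB available is 93.75% used, so the
# condition is met; with 8 GiB available (50% used), it isn't.
```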
## Spin It Up

```bash
docker compose up -d
```
Give it 30 seconds to settle, then check:
- Prometheus: http://your-server-ip:9090
- Grafana: http://your-server-ip:3000 (login: admin / changeme — change this immediately)
- Node Exporter metrics: http://your-server-ip:9100/metrics
In Prometheus, go to Status > Targets and verify all three scrape targets show UP. If something is DOWN, check the container logs:
```bash
docker logs prometheus
docker logs node-exporter
```
## Connecting Grafana to Prometheus
- Log into Grafana at port 3000
- Go to Connections > Data Sources > Add data source
- Select Prometheus
- Set the URL to http://prometheus:9090 (use the container name — they’re on the same Docker network)
- Click Save & Test — you should see a green checkmark
If you see a connection error, double-check you used prometheus as the hostname, not localhost. Localhost inside a container is the container itself.
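If you'd rather not click through this every time you rebuild the container, Grafana can also provision the data source from a file at startup. A sketch using Grafana's provisioning format; mount it by adding `- ./grafana/provisioning:/etc/grafana/provisioning` to the grafana service's volumes:

```yaml
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```

On the next container start, the data source just exists, already pointed at the right hostname.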
## Import the Node Exporter Full Dashboard
This is the cheat code. Instead of building a dashboard from scratch (which you can do later, after you’ve accepted the dashboard addiction), import the community masterpiece:
- In Grafana, go to Dashboards > Import
- Enter dashboard ID: 1860
- Click Load
- Select your Prometheus data source
- Click Import
You now have a fully loaded system metrics dashboard showing CPU, memory, disk I/O, network traffic, system load, and more. Fair warning: you will immediately start tweaking panel colors and rearranging tiles. This is normal. Schedule the three hours now.
For container metrics from cAdvisor, import dashboard ID 14282 (Docker container monitoring).
## A Few Useful PromQL Queries to Know
Once you’re poking around in Prometheus or building custom panels, these come up constantly:
```promql
# System uptime in seconds
node_time_seconds - node_boot_time_seconds

# Network received bytes per second (eth0)
rate(node_network_receive_bytes_total{device="eth0"}[5m])

# Number of running containers
count(container_last_seen{name!=""})

# Disk read bytes per second
rate(node_disk_read_bytes_total[5m])
```
PromQL’s rate() function is your best friend. It takes a counter metric and calculates how fast it’s increasing over a time window. Most raw metrics from Node Exporter are counters (they only go up), so rate() converts them into something actually readable, like “bytes per second” instead of “total bytes since boot.”
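You can see what rate() buys you with a toy calculation. This is a simplified Python sketch of the idea, not what Prometheus literally runs: real rate() also extrapolates to the window edges and handles counter resets, both of which this ignores:

```python
def per_second_rate(samples: list[tuple[float, float]]) -> float:
    """Roughly what rate() does: the counter's increase across the
    window, divided by the time the window spans."""
    (t0, v0), (tn, vn) = samples[0], samples[-1]
    return (vn - v0) / (tn - t0)

# A counter scraped every 15s: total bytes received since boot.
# Raw values are huge and ever-growing; the rate is immediately legible.
window = [(0, 1_000_000), (15, 1_750_000), (30, 2_500_000), (45, 3_250_000)]
print(per_second_rate(window))  # 50000.0 bytes/second
```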
## Setting Up Grafana Alerts
The alert rules in alerts.yml tell Prometheus to fire alerts — but right now there’s nowhere to send them. For a basic setup, configure Grafana’s alerting to notify you directly:
- Go to Alerting > Contact Points
- Add a new contact point (email, Slack webhook, Telegram bot — whatever you use)
- Go to Alerting > Notification Policies and set your default policy to use that contact point
For Prometheus alerts to show up in Grafana’s alert list, you need Alertmanager — but that’s a whole separate container and config. For home lab use, Grafana’s built-in alerting on top of your Prometheus data source is usually enough. Create alert rules directly in Grafana under Alerting > Alert Rules, using the same PromQL queries from above.
## Retention and Storage
By default, the compose file above keeps 15 days of metrics (--storage.tsdb.retention.time=15d). Prometheus is efficient — a typical home lab setup with a handful of scrape targets uses maybe 1-3 GB for 15 days of data. If you’re tight on disk, drop it to 7d. If you want more history, bump it up. Just make sure your volume has space.
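If you want a rough number before committing disk, the sizing math is just samples per second times bytes per sample times retention. A back-of-envelope Python sketch; the ~1.7 bytes per compressed sample and the per-target series counts are ballpark assumptions, not measurements from this stack:

```python
def tsdb_size_estimate_gb(targets: int, series_per_target: int,
                          scrape_interval_s: float, retention_days: int,
                          bytes_per_sample: float = 1.7) -> float:
    """Back-of-envelope Prometheus TSDB sizing.
    Prometheus averages roughly 1-2 bytes per compressed sample;
    index and WAL overhead are not included here."""
    samples_per_sec = targets * series_per_target / scrape_interval_s
    retention_seconds = retention_days * 24 * 3600
    return samples_per_sec * bytes_per_sample * retention_seconds / 1e9

# 3 targets at ~1500 series each, 15s scrapes, 15 days of retention:
# roughly 0.66 GB of raw samples, landing in the 1-3 GB ballpark
# once indexes and overhead are added.
```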
## What You Have Now
After following this guide, you have:
- A Prometheus instance scraping your host and container metrics every 15 seconds
- Grafana with a beautiful dashboard showing everything worth knowing about your server
- Alert rules that will fire if CPU, memory, or disk hit dangerous levels
- The ability to look at graphs and say “yep, that’s when the backup job ran” with complete confidence
More importantly, the next time something goes wrong, you’ll have a timeline. You’ll know exactly when CPU spiked, when memory got eaten, or when disk started filling up. You go from “I have no idea what happened” to “disk usage climbed steadily from 3am and hit 100% at 6:47am” — which makes fixing problems dramatically less painful.
Stop flying blind. Your server is trying to tell you things. Now you can actually hear it.