The Curse
Your phone buzzes. You check Slack. Another alert. You click it. Disk at 85%. You clear the notification.
Five minutes later: Memory at 78%. CPU at 92%. These aren’t emergencies. They’re just… data.
This is alert fatigue. Your team gets so many alerts that you’ve stopped responding to any of them. The one alert that actually matters goes unnoticed.
Why This Happens
Most teams start with a reasonable alerting strategy. Then something breaks. They create an alert so it doesn’t happen again. And again. And again.
Suddenly you’re alerting on:
- CPU > 70%
- Memory > 80%
- Disk > 85%
- High latency (undefined threshold)
- Any exception
- Any 5xx response
- Slow queries
You’re not alerting on problems. You’re alerting on… metrics.
The Difference
Alert: “The service is down and users are seeing errors.”
Metric to graph: “CPU at 73%.”
Do I need to wake someone up for high CPU? No. It’s data. Worth monitoring, not alerting on.
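In Prometheus terms, "monitor it, don't alert on it" can be a recording rule: precompute the metric for a dashboard and attach no alert to it. A minimal sketch; the metric name `cpu_usage_percent` and the rule name are assumptions, not anything from a real setup:

```yaml
# Hypothetical recording rule: keep CPU data for dashboards,
# with no alert attached to it.
groups:
  - name: dashboards
    rules:
      - record: instance:cpu_usage_percent:avg5m  # assumed naming convention
        expr: avg_over_time(cpu_usage_percent[5m])
```

The data still lands on a graph you can check tomorrow; nobody's phone buzzes tonight.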
The Rule of Thumb
An alert should trigger when you’d actually get out of bed at 3 AM to fix it.
High CPU? Probably not. Your service still works. You could optimize tomorrow.
Database connection pool exhausted? Yes. New requests hang. Users see errors. Wake someone up.
Building Real Alerts
1. Alert on Symptoms, Not Causes
Bad:
```
# Alerting on raw resource metrics, not the actual problem
cpu > 70
memory > 80
disk_free < 1000000000  # 1GB
```

Good:

```
# Alert when users are actually affected
rate(http_requests_total{status=~"5.."}[5m]) > 10   # >10 errors/sec
http_request_duration_seconds{quantile="0.95"} > 2  # 95th percentile latency > 2s
db_connections_available < 5                        # Running out of DB capacity
```

The first set fires all the time and doesn't mean anything. The second set fires when users are actually impacted.
2. Distinguish Between Pages and Notifications
PagerDuty pages (wake someone up immediately):
- Service down
- Cascading failures
- Data loss in progress
- User-impacting errors above threshold
Slack notifications (check when you get a chance):
- Disk usage trending up
- Slow queries increasing
- New security warnings
- Code deployment succeeded
```yaml
# Prometheus alerting rules
groups:
  - name: critical
    rules:
      - alert: ServiceDown
        expr: up{job="api"} == 0
        for: 1m
        labels:
          severity: critical  # This goes to PagerDuty
      - alert: HighErrorRate
        expr: rate(http_500_errors[5m]) > 10
        for: 2m
        labels:
          severity: critical

  - name: warnings
    rules:
      - alert: HighMemory
        expr: memory_usage_percent > 90
        for: 10m
        labels:
          severity: warning  # This goes to Slack only
      - alert: DiskTrendingUp
        expr: rate(disk_usage_bytes[1h]) > 1000000000
        labels:
          severity: warning
```

Critical alerts page. Warnings go to Slack. Done.
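The severity split maps directly onto Alertmanager routing. A minimal sketch, assuming receivers named `pagerduty` and `slack` are already defined elsewhere in the config and that alerts carry a `severity` label:

```yaml
# Hypothetical Alertmanager routes: critical pages, everything else goes to Slack.
route:
  receiver: slack          # default receiver for anything unmatched
  routes:
    - match:
        severity: critical
      receiver: pagerduty  # wake someone up
    - match:
        severity: warning
      receiver: slack      # read it in the morning
```

One routing tree, two escalation paths. Every new alert you add only has to pick a severity.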
3. Add Context to Alerts
```yaml
- alert: HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High latency detected"
    description: |
      95th percentile latency is {{ $value }}s on {{ $labels.instance }}
      Environment: {{ $labels.environment }}
      Service: {{ $labels.service }}
      Runbook: https://runbooks.internal/high-latency
```

When the alert fires, the on-call person sees:
- What’s wrong
- Where it’s happening
- A link to a runbook
Not: “latency alert fired” with no context.
4. Use Alerting Rules to Reduce Noise
Require the condition to persist with a `for` window:
```yaml
- alert: HighCPU
  expr: cpu_usage_percent > 80
  for: 10m  # Only alert if high CPU persists for 10 minutes
```

A 1-second spike doesn't matter. But 10 minutes of sustained high CPU? Investigate.
Track state changes, not absolute values:
```yaml
- alert: ServiceRestarted
  expr: increase(service_restarts_total[5m]) > 0
  annotations:
    description: "Service restarted {{ $value }} times in 5 minutes"
```

This alerts on the event (a restart happened), not the value. Much less noise.
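Alertmanager can also mute cause-level alerts while a symptom-level alert is already firing, which cuts noise during an incident. A sketch, assuming alerts carry `severity` and `service` labels; the label names are assumptions:

```yaml
# Hypothetical inhibit rule: while a critical alert fires for a service,
# suppress warning-level alerts for that same service.
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['service']  # only inhibit when both alerts share a service label
```

When the database is down, nobody needs the "slow queries" warning too.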
Real Example: Database Alerts
```yaml
groups:
  - name: database
    rules:
      # Page on this
      - alert: DatabaseDown
        expr: pg_up == 0
        for: 1m
        labels:
          severity: critical

      # Page on this
      - alert: ConnectionPoolExhausted
        expr: pg_stat_activity_count >= pg_settings_max_connections
        for: 2m
        labels:
          severity: critical

      # Slack notification
      - alert: LongRunningQueries
        expr: max(pg_stat_statements_mean_time) > 5000  # 5+ seconds
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow queries detected"
          description: "Longest query is {{ $value }}ms"

      # Don't alert on this - just graph it
      # - alert: HighConnections
      #   expr: pg_stat_activity_count > 50
```

Database down → page. Running out of connections → page. Slow queries → Slack. High connection count → graph only, no alert.
The Test
For each alert: “Would I interrupt my weekend for this?”
If the answer is “no,” it goes to Slack, not PagerDuty.
If the answer is “I don’t know what I’d do about it,” delete the alert and add a dashboard instead.
The Payoff
Your team stops ignoring alerts because alerts actually mean something. When your phone buzzes at 3 AM, it’s worth getting up.
That’s not monitoring theater. That’s actual observability.
Start by cutting your alerts in half. You’ll be surprised how few you actually need.