
Alert Fatigue: Why Your Alerts Are Meaningless

By SumGuy 4 min read

The Curse

Your phone buzzes. You check Slack. Another alert. You click it. Disk at 85%. You clear the notification.

Five minutes later: Memory at 78%. CPU at 92%. These aren’t emergencies. They’re just… data.

This is alert fatigue. Your team gets so many alerts that you’ve stopped responding to any of them. The one alert that actually matters goes unnoticed.

Why This Happens

Most teams start with a reasonable alerting strategy. Then something breaks. They create an alert so it doesn’t happen again. And again. And again.

Suddenly you're alerting on:

- CPU above 70%
- Memory above 80%
- Disk above 85%
- Every restart and every threshold anyone ever tripped

You're not alerting on problems. You're alerting on… metrics.

The Difference

Alert: “The service is down and users are seeing errors.”

Metric to graph: “CPU at 73%.”

Do I need to wake someone up for high CPU? No. It’s data. Worth monitoring, not alerting on.

The Rule of Thumb

An alert should trigger when you’d actually get out of bed at 3 AM to fix it.

High CPU? Probably not. Your service still works. You could optimize tomorrow.

Database connection pool exhausted? Yes. New requests hang. Users see errors. Wake someone up.

Building Real Alerts

1. Alert on Symptoms, Not Causes

Bad:

# Alerting on causes (resource usage), not on what users actually experience
cpu > 70
memory > 80
disk_free < 1000000000 # 1GB

Good:

# Alert when users are actually affected
rate(http_requests_total{status=~"5.."}[5m]) > 10 # >10 errors/sec
http_request_duration_seconds{quantile="0.95"} > 2 # 95th percentile > 2s latency
db_connections_available < 5 # Running out of DB capacity

The first set triggers all the time and doesn’t mean anything. The second set triggers when users are impacted.

2. Distinguish Between Pages and Notifications

PagerDuty pages (wake someone up immediately):

- The service is down
- Error rates are spiking
- Users can't complete requests

Slack notifications (check when you get a chance):

- Memory creeping up
- Disk slowly filling
- Anything you'd fix during business hours

# Prometheus alerting rules — route on the severity label
groups:
  - name: critical
    rules:
      # These go to PagerDuty
      - alert: ServiceDown
        expr: up{job="api"} == 0
        for: 1m
        labels:
          severity: critical
      - alert: HighErrorRate
        expr: rate(http_500_errors[5m]) > 10
        for: 2m
        labels:
          severity: critical
  - name: warnings
    rules:
      # These go to Slack only
      - alert: HighMemory
        expr: memory_usage_percent > 90
        for: 10m
        labels:
          severity: warning
      - alert: DiskTrendingUp
        expr: rate(disk_usage_bytes[1h]) > 1000000000
        labels:
          severity: warning

Critical alerts page. Warnings slack. Done.
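The severity label is what makes that split work: a routing layer like Alertmanager fans alerts out based on it. A minimal sketch of the routing side — the receiver names, integration key, and webhook URL here are placeholders, not anything from a real setup:

```yaml
# alertmanager.yml (sketch): route critical to PagerDuty, everything else to Slack
route:
  receiver: slack-warnings        # default: unmatched alerts go to Slack
  routes:
    - match:
        severity: critical
      receiver: pagerduty-oncall  # critical alerts page a human

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: <your-pagerduty-integration-key>
  - name: slack-warnings
    slack_configs:
      - api_url: <your-slack-webhook-url>
        channel: "#alerts"
```

With this in place, promoting or demoting an alert is a one-line change to its severity label, not a rewrite of your notification plumbing.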

3. Add Context to Alerts

- alert: HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High latency detected"
    description: |
      95th percentile latency is {{ $value }}s on {{ $labels.instance }}
      Environment: {{ $labels.environment }}
      Service: {{ $labels.service }}
      Runbook: https://runbooks.internal/high-latency

When the alert fires, the on-call person sees:

- The actual latency value and which instance is affected
- The environment and service
- A direct link to the runbook

Not: "latency alert fired" with no context.

4. Use Alerting Rules to Reduce Noise

Window with for:

- alert: HighCPU
  expr: cpu_usage_percent > 80
  for: 10m # Only alert if high CPU persists for 10 minutes

A 1-second spike doesn’t matter. But 10 minutes of high CPU? Investigate.

Track state changes, not absolute values:

- alert: ServiceRestarted
  expr: increase(service_restarts_total[5m]) > 0
  annotations:
    description: "Service restarted {{ $value }} times in 5 minutes"

This alerts on the event (restart happened), not the value. Much less noise.

Real Example: Database Alerts

groups:
  - name: database
    rules:
      # Page on this
      - alert: DatabaseDown
        expr: pg_up == 0
        for: 1m
        labels:
          severity: critical
      # Page on this
      - alert: ConnectionPoolExhausted
        expr: pg_stat_activity_count >= pg_settings_max_connections
        for: 2m
        labels:
          severity: critical
      # Slack notification
      - alert: LongRunningQueries
        expr: max(pg_stat_statements_mean_time) > 5000 # 5+ seconds
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow queries detected"
          description: "Longest query is {{ $value }}ms"
      # Don't alert on this — just graph it
      # - alert: HighConnections
      #   expr: pg_stat_activity_count > 50

Database is down → page. Running out of connections → page. Slow queries → slack. High connection count → graph only, no alert.

The Test

For each alert: “Would I interrupt my weekend for this?”

If the answer is “no,” it goes to Slack, not PagerDuty.

If the answer is “I don’t know what I’d do about it,” delete the alert and add a dashboard instead.

The Payoff

Your team stops ignoring alerts because alerts actually mean something. When your phone buzzes at 3 AM, it’s worth getting up.

That’s not monitoring theater. That’s actual observability.

Start by cutting your alerts in half. You’ll be surprised how few you actually need.
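One way to find the cuts, assuming you're on Prometheus: it exposes its own built-in `ALERTS` metric, so you can rank alerts by how often they've fired and scrutinize the top of the list. A sketch of that audit query:

```promql
# Which alerts spent the most time firing over the last 30 days?
# The chronic firers are your candidates for demotion to a dashboard.
sort_desc(
  sum by (alertname) (
    count_over_time(ALERTS{alertstate="firing"}[30d])
  )
)
```

If an alert tops this list and nobody remembers acting on it, that's your answer.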

