The Curse
Your phone buzzes. You check Slack. Another alert. You click it. Disk at 85%. You clear the notification.
Five minutes later: Memory at 78%. CPU at 92%. These aren’t emergencies. They’re just… data.
This is alert fatigue. Your team gets so many alerts that you’ve stopped responding to any of them. The one alert that actually matters goes unnoticed.
Why This Happens
Most teams start with a reasonable alerting strategy. Then something breaks. They create an alert so it doesn’t happen again. And again. And again.
Suddenly you’re alerting on:
- CPU > 70%
- Memory > 80%
- Disk > 85%
- High latency (undefined threshold)
- Any exception
- Any 5xx response
- Slow queries
You’re not alerting on problems. You’re alerting on… metrics.
The Difference
Alert: “The service is down and users are seeing errors.”
Metric to graph: “CPU at 73%.”
Do I need to wake someone up for high CPU? No. It’s data. Worth monitoring, not alerting on.
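In Prometheus terms, "monitor it, don't alert on it" can be a recording rule: precompute the metric for a dashboard and attach no alert to it. A minimal sketch; the metric name `cpu_usage_percent` and the rule name are assumptions, not anything from a real setup:

```yaml
# Hypothetical recording rule: keep CPU data for dashboards,
# with no alert attached to it.
groups:
  - name: dashboards
    rules:
      - record: instance:cpu_usage_percent:avg5m  # assumed naming convention
        expr: avg_over_time(cpu_usage_percent[5m])
```

The data still lands on a graph you can check tomorrow; nobody's phone buzzes tonight.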
The Rule of Thumb
An alert should trigger when you’d actually get out of bed at 3 AM to fix it.
High CPU? Probably not. Your service still works. You could optimize tomorrow.
Database connection pool exhausted? Yes. New requests hang. Users see errors. Wake someone up.
Building Real Alerts
1. Alert on Symptoms, Not Causes
Bad:
```
# Alerting on raw resource metrics, not the actual problem
cpu > 70
memory > 80
disk_free < 1000000000  # 1GB
```

Good:

```
# Alert when users are actually affected
rate(http_requests_total{status=~"5.."}[5m]) > 10   # >10 errors/sec
http_request_duration_seconds{quantile="0.95"} > 2  # 95th percentile latency > 2s
db_connections_available < 5                        # Running out of DB capacity
```

The first set fires all the time and doesn't mean anything. The second set fires when users are actually impacted.
2. Distinguish Between Pages and Notifications
PagerDuty pages (wake someone up immediately):
- Service down
- Cascading failures
- Data loss in progress
- User-impacting errors above threshold
Slack notifications (check when you get a chance):
- Disk usage trending up
- Slow queries increasing
- New security warnings
- Code deployment succeeded
```yaml
# Prometheus alerting rules
groups:
  - name: critical
    rules:
      - alert: ServiceDown
        expr: up{job="api"} == 0
        for: 1m
        labels:
          severity: critical  # This goes to PagerDuty
      - alert: HighErrorRate
        expr: rate(http_500_errors[5m]) > 10
        for: 2m
        labels:
          severity: critical

  - name: warnings
    rules:
      - alert: HighMemory
        expr: memory_usage_percent > 90
        for: 10m
        labels:
          severity: warning  # This goes to Slack only
      - alert: DiskTrendingUp
        expr: rate(disk_usage_bytes[1h]) > 1000000000
        labels:
          severity: warning
```

Critical alerts page. Warnings go to Slack. Done.
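The severity split maps directly onto Alertmanager routing. A minimal sketch, assuming receivers named `pagerduty` and `slack` are already defined elsewhere in the config and that alerts carry a `severity` label:

```yaml
# Hypothetical Alertmanager routes: critical pages, everything else goes to Slack.
route:
  receiver: slack          # default receiver for anything unmatched
  routes:
    - match:
        severity: critical
      receiver: pagerduty  # wake someone up
    - match:
        severity: warning
      receiver: slack      # read it in the morning
```

One routing tree, two escalation paths. Every new alert you add only has to pick a severity.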
3. Add Context to Alerts
```yaml
- alert: HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High latency detected"
    description: |
      95th percentile latency is {{ $value }}s on {{ $labels.instance }}
      Environment: {{ $labels.environment }}
      Service: {{ $labels.service }}
      Runbook: https://runbooks.internal/high-latency
```

When the alert fires, the on-call person sees:
- What’s wrong
- Where it’s happening
- A link to a runbook
Not: “latency alert fired” with no context.
4. Use Alerting Rules to Reduce Noise
Require the condition to persist with a `for` window:
```yaml
- alert: HighCPU
  expr: cpu_usage_percent > 80
  for: 10m  # Only alert if high CPU persists for 10 minutes
```

A 1-second spike doesn't matter. But 10 minutes of sustained high CPU? Investigate.
Track state changes, not absolute values:
```yaml
- alert: ServiceRestarted
  expr: increase(service_restarts_total[5m]) > 0
  annotations:
    description: "Service restarted {{ $value }} times in 5 minutes"
```

This alerts on the event (a restart happened), not the value. Much less noise.
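Alertmanager can also mute cause-level alerts while a symptom-level alert is already firing, which cuts noise during an incident. A sketch, assuming alerts carry `severity` and `service` labels; the label names are assumptions:

```yaml
# Hypothetical inhibit rule: while a critical alert fires for a service,
# suppress warning-level alerts for that same service.
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['service']  # only inhibit when both alerts share a service label
```

When the database is down, nobody needs the "slow queries" warning too.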
Real Example: Database Alerts
```yaml
groups:
  - name: database
    rules:
      # Page on this
      - alert: DatabaseDown
        expr: pg_up == 0
        for: 1m
        labels:
          severity: critical

      # Page on this
      - alert: ConnectionPoolExhausted
        expr: pg_stat_activity_count >= pg_settings_max_connections
        for: 2m
        labels:
          severity: critical

      # Slack notification
      - alert: LongRunningQueries
        expr: max(pg_stat_statements_mean_time) > 5000  # 5+ seconds
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow queries detected"
          description: "Longest query is {{ $value }}ms"

      # Don't alert on this - just graph it
      # - alert: HighConnections
      #   expr: pg_stat_activity_count > 50
```

Database down → page. Running out of connections → page. Slow queries → Slack. High connection count → graph only, no alert.
The Test
For each alert: “Would I interrupt my weekend for this?”
If the answer is “no,” it goes to Slack, not PagerDuty.
If the answer is “I don’t know what I’d do about it,” delete the alert and add a dashboard instead.
The Payoff
Your team stops ignoring alerts because alerts actually mean something. When your phone buzzes at 3 AM, it’s worth getting up.
That’s not monitoring theater. That’s actual observability.
Start by cutting your alerts in half. You’ll be surprised how few you actually need.