Alert Best Practices

Follow these best practices to keep your alert system effective and your team happy.

1. Reduce Alert Fatigue

Alert fatigue happens when you get too many alerts, causing you to ignore important ones.

Signs of Alert Fatigue:

  • More than 10 alerts per day
  • Team ignoring notifications
  • People disabling alerts
  • False alarms outnumber real issues

How to Fix It

Increase Thresholds

Before: CPU > 70% = 50 alerts per day
After: CPU > 85% = 5 alerts per day
Result: Only alert on serious issues

Disable Non-Critical Rules

  • Remove low-priority alerts
  • Focus on what matters
  • Add back later if needed

Use Digests

Instead of: 100 individual emails
Use: 1 daily digest email with 100 alerts
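
A sketch of how a digest works, assuming alerts arrive as simple dicts (the names here are illustrative, not a specific product's API):

  from collections import Counter

  def build_daily_digest(alerts):
      """Collapse a day's alerts into one summary message."""
      counts = Counter(a["rule"] for a in alerts)
      lines = [f"Daily alert digest: {len(alerts)} alerts total"]
      for rule, count in counts.most_common():
          lines.append(f"  {count:3d} x {rule}")
      return "\n".join(lines)

  # Example: 100 alerts become one message
  alerts = [{"rule": "High CPU"}] * 60 + [{"rule": "Slow API"}] * 40
  print(build_daily_digest(alerts))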

Combine Related Alerts

Before: 3 separate memory alerts for 3 services
After: 1 alert for "Any service memory > 90%"
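
A minimal sketch of the combined rule in Python, with hypothetical per-service memory readings expressed as fractions:

  # One rule covering all services instead of three separate ones
  memory_usage = {"api": 0.82, "worker": 0.93, "scheduler": 0.71}  # hypothetical readings

  if any(usage > 0.90 for usage in memory_usage.values()):
      offenders = [name for name, usage in memory_usage.items() if usage > 0.90]
      print(f"ALERT: memory > 90% on: {', '.join(offenders)}")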

2. Write Clear Rule Names

A good rule name tells you exactly what's wrong when you see it.

Good Rule Names

  • ✅ "Production API - Response Time Over 5 Seconds"
  • ✅ "Database Server - Memory Usage High"
  • ✅ "Payment Service - Error Rate Above 5%"

Bad Rule Names

  • ❌ "Alert 1"
  • ❌ "API"
  • ❌ "CPU"
  • ❌ "Monitoring"

Naming Formula

[Service] - [What's Wrong] - [Threshold/Condition]

Example:
Production API - High Response Time - > 5000ms
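
If you create rules programmatically, a small helper can enforce the formula (illustrative only, not any particular product's API):

  def rule_name(service: str, problem: str, condition: str) -> str:
      """Build a name following [Service] - [What's Wrong] - [Threshold/Condition]."""
      return f"{service} - {problem} - {condition}"

  print(rule_name("Production API", "High Response Time", "> 5000ms"))
  # Production API - High Response Time - > 5000ms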

3. Set Thresholds Based on Reality

Thresholds should be based on:

  • Your application's normal operating range
  • Your Service Level Agreement (SLA)
  • What value actually needs action

Finding the Right Threshold

Step 1: Monitor for 1 Week

  • Watch your metric without alerting
  • Note the normal range
  • Note the peak values

Step 2: Set Threshold Above Normal

Normal range: 20-40%
Peak values: 50-60%
Alert threshold: 75%
(15 points above the observed peak, so routine spikes don't fire)
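
One way to derive that number from a week of observations, sketched in Python (the 15-point buffer is an assumption to tune for your system):

  def suggest_threshold(samples, buffer_points=15, cap=100):
      """Suggest a threshold: observed peak plus a safety buffer, capped at 100%."""
      return min(max(samples) + buffer_points, cap)

  week_of_cpu_peaks = [22, 35, 41, 38, 55, 60, 47]  # hypothetical daily peaks, in %
  print(suggest_threshold(week_of_cpu_peaks))  # 75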

Step 3: Test

  • Watch if alerts fire naturally
  • Adjust if needed

Step 4: Document

  • Note why you chose this threshold
  • Update if conditions change

Examples

CPU Usage

Typical: 30-50%
Peak: 70%
Alert: 80% (Warning)
Alert: 95% (Critical)

API Response Time

Normal: 200-400ms
Acceptable: < 2000ms
Alert: > 5000ms (Warning)
Alert: > 10000ms (Critical)

Error Rate

Normal: 0.1%
Alert: > 1% (Warning)
Alert: > 5% (Critical)
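
These warning/critical pairs map naturally to a small classifier; the sketch below uses the example thresholds from above:

  def classify(value, warning, critical):
      """Map a metric reading to a severity using warning/critical thresholds."""
      if value >= critical:
          return "critical"
      if value >= warning:
          return "warning"
      return "ok"

  print(classify(3.2, warning=1, critical=5))          # error rate in % -> warning
  print(classify(7200, warning=5000, critical=10000))  # response time in ms -> warning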

4. Document Your Rules

Good documentation saves time during incidents.

What to Document

Rule Description:

  • What metric does it monitor?
  • What threshold triggers it?
  • Why is this threshold important?

Common Causes:

Rule: "High Database CPU"
Causes:
1. Slow SQL queries (run EXPLAIN PLAN)
2. High concurrent connections
3. Missing indexes
4. Data corruption

How to Fix:

Quick fixes:
1. Check slow query log
2. Kill long-running queries
3. Bounce database if needed

Permanent fixes:
1. Optimize queries
2. Add indexes
3. Scale database resources

Who Should Know:

  • DBA team
  • Application team
  • SRE team
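
If your team keeps runbooks in a repository, the same information can live next to the rule definition. A minimal sketch, with illustrative field names:

  HIGH_DB_CPU = {
      "name": "Database Server - High CPU - > 80%",
      "metric": "db.cpu.percent",
      "threshold": 80,
      "why": "Sustained CPU above 80% precedes query timeouts",
      "common_causes": ["slow SQL queries", "connection spikes", "missing indexes"],
      "quick_fixes": ["check slow query log", "kill long-running queries"],
      "owners": ["DBA team", "Application team", "SRE team"],
  }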

5. Use Appropriate Severity Levels

Choose severity based on impact to users and system.

Severity Decision Tree

Can users use the service?
├─ NO  → Critical 🔴
└─ YES → Is performance significantly degraded?
         ├─ YES → Warning 🟠
         └─ NO  → Info 🟡
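
The same tree as a tiny function, assuming you already know the answers to both questions:

  def severity(users_can_use_service: bool, performance_degraded: bool) -> str:
      """Walk the decision tree above to pick a severity."""
      if not users_can_use_service:
          return "critical"
      return "warning" if performance_degraded else "info"

  print(severity(users_can_use_service=True, performance_degraded=True))  # warning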

Examples

Critical 🔴

  • Service is completely down
  • Data corruption risk
  • Security breach

Warning 🟠

  • Service is slow
  • Users can still use it, but the experience is frustrating
  • Approaching critical threshold

Info 🟡

  • Deployment completed
  • Scheduled maintenance
  • Metrics for awareness only

6. Start Simple and Add Gradually

Don't create 100 rules on day one.

Week 1: Critical Only

  • Website/API Down
  • Database Down
  • Deployment Failures

Week 2: Add Performance Alerts

  • High Response Time
  • High CPU Usage
  • High Error Rate

Week 3: Add Resource Alerts

  • Low Disk Space
  • High Memory Usage
  • Network Issues

Week 4+: Refine and Optimize

  • Adjust thresholds
  • Remove false alarms
  • Add team-specific rules

7. Route by Severity

Different severities need different response channels.

Critical Alert
├─ PagerDuty (immediate page)
├─ Email (documentation)
└─ Slack (team visibility)

Warning Alert
├─ Slack (team discussion)
└─ Email digest (daily)

Info Alert
└─ Email digest (daily or weekly)
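
As a sketch, routing can be a plain severity-to-channels map; the channel names are placeholders for whatever integrations you actually use:

  ROUTES = {
      "critical": ["pagerduty", "email", "slack"],
      "warning": ["slack", "email_digest"],
      "info": ["email_digest"],
  }

  def route(alert):
      """Return the channels an alert should go to, defaulting to the digest."""
      return ROUTES.get(alert["severity"], ["email_digest"])

  print(route({"severity": "critical"}))  # ['pagerduty', 'email', 'slack']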

Benefits

  • Critical alerts get immediate attention
  • Warnings allow team discussion
  • Info alerts don't interrupt everyone

8. Review and Adjust Regularly

An alert system needs regular maintenance.

Weekly Review

  • Check which alerts fired
  • Any false alarms?
  • Any alert fatigue signals?

Monthly Review

  • Are thresholds still appropriate?
  • Any new patterns?
  • Rules that should be removed?
  • Channels that need updating?

Quarterly Review

  • Full alert system audit
  • Update documentation
  • Team training if needed
  • Adjust for seasonal changes

9. Test Before Relying

Always test new alerts and channels.

Test Checklist

For New Rules:

  • Rule name is clear
  • Threshold makes sense
  • Notification channel selected
  • Test alert fires correctly
  • Received notification
  • Acknowledged and resolved it

For New Channels:

  • Credentials configured
  • Send a test notification (see the sketch after this list)
  • Verify you received it
  • Check message format
  • Special characters don't break the message
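
A minimal channel test, assuming a generic incoming-webhook URL (replace the placeholder with your own); the message deliberately includes special characters to catch formatting problems:

  import json
  import urllib.request

  WEBHOOK_URL = "https://example.com/hooks/REPLACE_ME"  # placeholder, not a real endpoint

  payload = {"text": "Test alert: quotes \"'\", <angle brackets> & an emoji 🎉"}
  request = urllib.request.Request(
      WEBHOOK_URL,
      data=json.dumps(payload).encode("utf-8"),
      headers={"Content-Type": "application/json"},
  )
  with urllib.request.urlopen(request) as response:
      print("Channel responded with HTTP", response.status)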

10. Keep Contact Information Updated

Alerts are useless if they go to the wrong person.

Quarterly Audit

  • Email addresses current?
  • Slack members still on team?
  • PagerDuty escalation policy updated?
  • Webhook URLs still valid?
  • All channels operational?

When Team Changes

  • Add new team members to channels
  • Remove people who left
  • Update escalation paths
  • Test all channels after changes

11. Document Root Causes

Learn from each incident.

After Resolving Alert

Write down:

What Happened:

Production API CPU spiked to 95% at 2:30 PM
Response time increased from 200ms to 5000ms
Users reported page timeouts

Root Cause:

New campaign drove 10x traffic
Caching was misconfigured after deployment
Database connection pool was too small

How We Fixed It:

1. Scaled API servers horizontally (15 min)
2. Fixed cache configuration (10 min)
3. Increased connection pool (5 min)
4. Traffic normalized after 30 minutes

How to Prevent Next Time:

1. Load test before major campaigns
2. Implement auto-scaling
3. Better monitoring on traffic metrics
4. Pre-deployment checklist for configs

12. Communicate with Your Team

An alert system is a team tool.

Share Knowledge

  • Document common issues
  • Share troubleshooting guides
  • Teach new team members
  • Review incidents together

Team Alerts Meeting

Monthly 15-minute meeting:

  • Review alert trends
  • Discuss improvements
  • Update documentation
  • Share learnings

Common Mistakes to Avoid

❌ Too Many Alerts

Creates alert fatigue. You'll ignore important ones.

Fix: Increase thresholds and disable non-critical rules

❌ Alerts with No Action

Alert fires but there's nothing to do about it.

Fix: Delete alerts you can't act on

❌ Never Adjusting Thresholds

Rules become outdated as system changes.

Fix: Review and adjust monthly

❌ Not Testing Channels

Notifications don't work when you need them.

Fix: Test quarterly

❌ Vague Rule Names

Team doesn't understand what's wrong.

Fix: Use specific, descriptive names

❌ No Documentation

Everyone asks what alert means.

Fix: Document each rule's purpose

❌ Not Learning from Incidents

Same problems keep happening.

Fix: Document root causes and fixes


Alert Checklist

Use this checklist when creating new rules:

Planning:

  • What metric should we monitor?
  • What threshold makes sense?
  • What's the normal range?
  • What's the current peak?
  • How often would this naturally trigger?

Configuration:

  • Clear, specific rule name
  • Correct threshold value
  • Appropriate severity level
  • Documented description
  • Notification channel selected

Testing:

  • Rule saves without errors
  • Rule enables successfully
  • Test notification received
  • Team aware of new rule
  • Know how to respond if it fires

Monitoring:

  • Watch for 1 week
  • Adjust if too many false alarms
  • Adjust if never fires
  • Document findings

Alert Audit Checklist

Run quarterly to keep system healthy:

Rules:

  • All active rules still needed?
  • Thresholds still appropriate?
  • Names still make sense?
  • Descriptions current?
  • Any duplicates?

Notifications:

  • Email addresses correct?
  • Slack channels still exist?
  • PagerDuty policy updated?
  • All channels tested?

Team:

  • Team trained on alerts?
  • Escalation path clear?
  • On-call rotation current?
  • Response procedures documented?

Getting Help

To improve your alert system:

  1. Pick one best practice from this guide
  2. Implement it this week
  3. Measure the improvement
  4. Pick next best practice
  5. Repeat monthly

Next Steps

  1. Creating Alert Rules - Create your rules
  2. Alert Configuration - Set up notifications
  3. Responding to Alerts - Handle active alerts

Summary

Remember:

  • ✅ Start simple and add gradually
  • ✅ Use clear names and documentation
  • ✅ Set realistic thresholds
  • ✅ Test before relying
  • ✅ Review and adjust regularly
  • ✅ Learn from incidents
  • ✅ Keep your team informed

Happy alerting! 🎉