Alert Best Practices
Follow these best practices to keep your alert system effective and your team happy.
1. Reduce Alert Fatigue
Alert fatigue happens when you get too many alerts, causing you to ignore important ones.
Signs of Alert Fatigue:
- More than 10 alerts per day
- Team ignoring notifications
- People disabling alerts
- False alarms outnumber real issues
How to Fix It
Increase Thresholds
Before: CPU > 70% = 50 alerts per day
After: CPU > 85% = 5 alerts per day
Result: Only alert on serious issues
Disable Non-Critical Rules
- Remove low-priority alerts
- Focus on what matters
- Add back later if needed
Use Digests
Instead of: 100 individual emails
Use: 1 daily digest email with 100 alerts
Combine Related Alerts
Before: 3 separate memory alerts for 3 services
After: 1 alert for "Any service memory > 90%"
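If your alerting tool lets you script rules or notifications, both ideas (combined rules and digests) are easy to prototype. The sketch below is illustrative only: the `Alert` structure and its field names are assumptions for the example, not a real alerting API.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Alert:
    service: str
    metric: str
    value: float

def any_service_over(alerts, metric, threshold):
    """Combined rule: fire one alert if ANY service exceeds the threshold."""
    offenders = sorted({a.service for a in alerts
                        if a.metric == metric and a.value > threshold})
    if offenders:
        return f"{metric} > {threshold}% on: {', '.join(offenders)}"  # 1 alert instead of N
    return None

def daily_digest(alerts):
    """Digest: one summary message instead of one email per alert."""
    counts = defaultdict(int)
    for a in alerts:
        counts[f"{a.service} / {a.metric}"] += 1
    lines = [f"- {key}: {n} alert(s)" for key, n in sorted(counts.items())]
    return "Daily alert digest\n" + "\n".join(lines)

# any_service_over(todays_alerts, "memory", 90)
# daily_digest(todays_alerts)
```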
2. Write Clear Rule Names
A good rule name tells you exactly what's wrong when you see it.
Good Rule Names
- ✅ "Production API - Response Time Over 5 Seconds"
- ✅ "Database Server - Memory Usage High"
- ✅ "Payment Service - Error Rate Above 5%"
Bad Rule Names
- ❌ "Alert 1"
- ❌ "API"
- ❌ "CPU"
- ❌ "Monitoring"
Naming Formula
[Service] - [What's Wrong] - [Threshold/Condition]
Example:
Production API - High Response Time - > 5000ms
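If your rules are defined in code or templates, a tiny helper can enforce the formula so every name comes out the same way. A minimal sketch (the function and its arguments are illustrative, not part of any particular tool):

```python
def rule_name(service: str, problem: str, condition: str) -> str:
    """Build a '[Service] - [What's Wrong] - [Threshold/Condition]' rule name."""
    return f"{service} - {problem} - {condition}"

# rule_name("Production API", "High Response Time", "> 5000ms")
# -> "Production API - High Response Time - > 5000ms"
```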
3. Set Thresholds Based on Reality
Thresholds should be based on:
- Your application's normal operating range
- Your Service Level Agreement (SLA)
- The value at which someone actually needs to take action
Finding the Right Threshold
Step 1: Monitor for 1 Week
- Watch your metric without alerting
- Note the normal range
- Note the peak values
Step 2: Set Threshold Above Normal
Normal range: 20-40%
Peak values: 50-60%
Alert threshold: 75%
(Gives a 15-point buffer above the observed peak before things become critical)
Step 3: Test
- Watch if alerts fire naturally
- Adjust if needed
Step 4: Document
- Note why you chose this threshold
- Update if conditions change
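One way to turn a week of observations into a starting point is to compute the normal range and peak from the samples and add a buffer above the peak. The sketch below is a simplified illustration; the percentile choice and buffer size are assumptions to tune for your own metric.

```python
import statistics

def suggest_threshold(samples, buffer=15.0):
    """Suggest a starting alert threshold from roughly one week of metric samples."""
    peak = max(samples)
    p95 = statistics.quantiles(samples, n=20)[18]   # 95th percentile ~ top of "normal"
    return {
        "normal_high": round(p95, 1),
        "peak": peak,
        "suggested_threshold": peak + buffer,        # buffer above anything seen so far
    }

# With normal values of 20-40% and peaks near 60%, this suggests roughly 75%,
# matching the example above.
```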
Examples
CPU Usage
- Typical: 30-50%
- Peak: 70%
- Warning alert: 80%
- Critical alert: 95%
API Response Time
- Normal: 200-400ms
- Acceptable: < 2000ms
- Warning alert: > 5000ms
- Critical alert: > 10000ms
Error Rate
- Normal: 0.1%
- Warning alert: > 1%
- Critical alert: > 5%
4. Document Your Rules
Good documentation saves time during incidents.
What to Document
Rule Description:
- What metric does it monitor?
- What threshold triggers it?
- Why is this threshold important?
Common Causes:
Rule: "High Database CPU"
Causes:
1. Slow SQL queries (run EXPLAIN PLAN)
2. High concurrent connections
3. Missing indexes
4. Data corruption
How to Fix:
Quick fixes:
1. Check slow query log
2. Kill long-running queries
3. Restart the database if needed
Permanent fixes:
1. Optimize queries
2. Add indexes
3. Scale database resources
Who Should Know:
- DBA team
- Application team
- SRE team
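If you keep rule documentation next to the rules themselves, a small structured record is often enough. The shape below is one possible sketch, not a required format; the threshold value shown is illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class RuleDoc:
    name: str
    metric: str                      # what it monitors
    threshold: str                   # what triggers it
    rationale: str                   # why this threshold matters
    common_causes: list[str] = field(default_factory=list)
    quick_fixes: list[str] = field(default_factory=list)
    permanent_fixes: list[str] = field(default_factory=list)
    owners: list[str] = field(default_factory=list)   # who should know

high_db_cpu = RuleDoc(
    name="High Database CPU",
    metric="Database CPU usage",
    threshold="> 90% for 5 minutes",            # illustrative value
    rationale="Sustained CPU saturation leads to query timeouts",
    common_causes=["Slow SQL queries", "High concurrent connections", "Missing indexes"],
    quick_fixes=["Check slow query log", "Kill long-running queries"],
    permanent_fixes=["Optimize queries", "Add indexes", "Scale database resources"],
    owners=["DBA team", "Application team", "SRE team"],
)
```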
5. Use Appropriate Severity Levels
Choose severity based on the impact on users and the system.
Severity Decision Tree
Can users use the service?
├─ NO → Critical 🔴
└─ YES → Performance degraded significantly?
   ├─ YES → Warning 🟠
   └─ NO → Info 🟡
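If your pipeline assigns severity programmatically, the decision tree maps directly onto a small function. A minimal sketch, assuming you can answer the two questions as booleans:

```python
def classify_severity(users_can_use_service: bool, performance_degraded: bool) -> str:
    """Mirror the decision tree: outage -> critical, degraded -> warning, else info."""
    if not users_can_use_service:
        return "critical"
    if performance_degraded:
        return "warning"
    return "info"

# classify_severity(False, True)  -> "critical"  (service down)
# classify_severity(True, True)   -> "warning"   (slow but usable)
# classify_severity(True, False)  -> "info"      (awareness only)
```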
Examples
Critical 🔴
- Service is completely down
- Data corruption risk
- Security breach
Warning 🟠
- Service is slow
- Users can still use it, but they're frustrated
- Approaching critical threshold
Info 🟡
- Deployment completed
- Scheduled maintenance
- Metrics for awareness only
6. Start Simple and Add Gradually
Don't create 100 rules on day one.
Recommended Approach
Week 1: Critical Only
- Website/API Down
- Database Down
- Deployment Failures
Week 2: Add Performance Rules
- High Response Time
- High CPU Usage
- High Error Rate
Week 3: Add Resource Rules
- Low Disk Space
- High Memory Usage
- Network Issues
Week 4+: Refine and Optimize
- Adjust thresholds
- Remove false alarms
- Add team-specific rules
7. Route by Severity
Different severities need different response channels.
Recommended Routing
Critical Alert
├─ PagerDuty (immediate page)
├─ Email (documentation)
└─ Slack (team visibility)
Warning Alert
├─ Slack (team discussion)
└─ Email digest (daily)
Info Alert
└─ Email digest (daily or weekly)
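Expressed as configuration, the routing policy is just a severity-to-channels map. The channel names below are placeholders for whatever integrations you actually have:

```python
# Placeholder channel names; substitute your real integrations.
ROUTES = {
    "critical": ["pagerduty", "email", "slack"],
    "warning":  ["slack", "daily_email_digest"],
    "info":     ["daily_email_digest"],
}

def channels_for(severity: str) -> list[str]:
    """Return notification channels for a severity; unknown severities fall back to the digest."""
    return ROUTES.get(severity, ["daily_email_digest"])

# channels_for("critical") -> ["pagerduty", "email", "slack"]
```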
Benefits
- Critical alerts get immediate attention
- Warnings allow team discussion
- Info alerts don't interrupt everyone
8. Review and Adjust Regularly
Your alert system needs regular maintenance.
Weekly Review
- Check which alerts fired
- Any false alarms?
- Any alert fatigue signals?
Monthly Review
- Are thresholds still appropriate?
- Any new patterns?
- Rules that should be removed?
- Channels that need updating?
Quarterly Review
- Full alert system audit
- Update documentation
- Team training if needed
- Adjust for seasonal changes
9. Test Before Relying
Always test new alerts and channels.
Test Checklist
For New Rules:
- Rule name is clear
- Threshold makes sense
- Notification channel selected
- Test alert fires correctly
- Received notification
- Acknowledged and resolved it
For New Channels:
- Credentials configured
- Send test notification
- Verify you received it
- Check message format
- Special characters don't break the message
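For channels that accept a webhook (Slack incoming webhooks, for example, take a JSON body with a `text` field), a test notification only needs a few lines. The URL below is a placeholder, and this is a sketch rather than a full test harness:

```python
import json
import urllib.request

def send_test_notification(webhook_url: str) -> int:
    """POST a small test message and return the HTTP status code."""
    payload = {"text": "Test alert: please acknowledge (special chars: <>&'\" ✅)"}
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# status = send_test_notification("https://hooks.slack.com/services/XXX/YYY/ZZZ")
# A 200 response plus a visible, correctly formatted message means the channel works.
```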
10. Keep Contact Information Updated
Alerts are useless if they go to the wrong person.
Quarterly Audit
- Email addresses current?
- Slack members still on team?
- PagerDuty escalation policy updated?
- Webhook URLs still valid?
- All channels operational?
When Team Changes
- Add new team members to channels
- Remove people who left
- Update escalation paths
- Test all channels after changes
11. Document Root Causes
Learn from each incident.
After Resolving Alert
Write down:
What Happened:
Production API CPU spiked to 95% at 2:30 PM
Response time increased from 200ms to 5000ms
Users reported page timeouts
Root Cause:
New campaign drove 10x traffic
Caching was misconfigured after deployment
Database connection pool was too small
How We Fixed It:
1. Scaled API servers horizontally (15 min)
2. Fixed cache configuration (10 min)
3. Increased connection pool (5 min)
4. Traffic normalized after 30 minutes
How to Prevent Next Time:
1. Load test before major campaigns
2. Implement auto-scaling
3. Better monitoring on traffic metrics
4. Pre-deployment checklist for configs
12. Communicate with Your Team
The alert system is a team tool.
Share Knowledge
- Document common issues
- Share troubleshooting guides
- Teach new team members
- Review incidents together
Team Alerts Meeting
Monthly 15-minute meeting:
- Review alert trends
- Discuss improvements
- Update documentation
- Share learnings
Common Mistakes to Avoid
❌ Too Many Alerts
Creates alert fatigue. You'll ignore important ones.
Fix: Increase thresholds and disable non-critical rules
❌ Alerts with No Action
Alert fires but there's nothing to do about it.
Fix: Delete alerts you can't act on
❌ Never Adjusting Thresholds
Rules become outdated as system changes.
Fix: Review and adjust monthly
❌ Not Testing Channels
Notifications don't work when you need them.
Fix: Test quarterly
❌ Vague Rule Names
Team doesn't understand what's wrong.
Fix: Use specific, descriptive names
❌ No Documentation
Everyone asks what the alert means.
Fix: Document each rule's purpose
❌ Not Learning from Incidents
Same problems keep happening.
Fix: Document root causes and fixes
Alert Checklist
Use this checklist when creating new rules:
Planning:
- What metric should we monitor?
- What threshold makes sense?
- What's the normal range?
- What's the current peak?
- How often would this naturally trigger?
Configuration:
- Clear, specific rule name
- Correct threshold value
- Appropriate severity level
- Documented description
- Notification channel selected
Testing:
- Rule saves without errors
- Rule enables successfully
- Test notification received
- Team aware of new rule
- Know how to respond if it fires
Monitoring:
- Watch for 1 week
- Adjust if too many false alarms
- Adjust if never fires
- Document findings
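To see how noisy a candidate rule will be before you enable it, you can replay the threshold against historical samples. A simplified sketch; the "sustained for N samples" counting is an assumption, so adjust it to match how your rule actually evaluates:

```python
def estimate_fires(samples, threshold, consecutive=3):
    """Count how often the metric stays above the threshold for `consecutive` samples in a row."""
    fires, streak = 0, 0
    for value in samples:
        if value > threshold:
            streak += 1
            if streak == consecutive:   # count each sustained breach once
                fires += 1
        else:
            streak = 0
    return fires

# estimate_fires(week_of_cpu_samples, threshold=80, consecutive=3)
# -> number of times "CPU > 80% sustained" would have alerted last week
```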
Alert Audit Checklist
Run quarterly to keep system healthy:
Rules:
- All active rules still needed?
- Thresholds still appropriate?
- Names still make sense?
- Descriptions current?
- Any duplicates?
Notifications:
- Email addresses correct?
- Slack channels still exist?
- PagerDuty policy updated?
- All channels tested?
Team:
- Team trained on alerts?
- Escalation path clear?
- On-call rotation current?
- Response procedures documented?
Getting Help
For alert best practices questions:
- Check this guide
- Review examples from successful teams
- Contact: [email protected]
To improve your alert system:
- Pick one best practice from this guide
- Implement this week
- Measure the improvement
- Pick next best practice
- Repeat monthly
Next Steps
- Creating Alert Rules - Create your rules
- Alert Configuration - Set up notifications
- Responding to Alerts - Handle active alerts
Summary
Remember:
- ✅ Start simple and add gradually
- ✅ Use clear names and documentation
- ✅ Set realistic thresholds
- ✅ Test before relying
- ✅ Review and adjust regularly
- ✅ Learn from incidents
- ✅ Keep your team informed
Happy alerting! 🎉