Responding to Alerts
When an alert fires, you need to respond quickly. This guide shows you how to handle active alerts.
Getting an Alert
What You'll See
When an alert fires, you'll be notified through your configured channels:
If Email:
Subject: CRITICAL: Production API Down
Body: Your production API is not responding
Severity: Critical
Time: 2024-01-15 14:32:00 UTC
If Slack:
🚨 CRITICAL: Production API Down
Status: Firing
Resource: api.example.com
Click here to view details
If PagerDuty:
- You'll get paged immediately
- Incident automatically created
- Escalates if not acknowledged
Quick Response Workflow
1. Get Notification
↓
2. Open SRE Alerts page
↓
3. Click "Acknowledge"
↓
4. Investigate the issue
↓
5. Fix the problem
↓
6. Click "Resolve"
↓
7. Confirm everything is working
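The workflow above is essentially a small state machine: an alert moves from Firing to Acknowledged to Resolved, and never backwards. A minimal sketch in Python (the class and method names are illustrative, not this product's API):

```python
class Alert:
    """Sketch of the alert lifecycle: Firing -> Acknowledged -> Resolved."""

    VALID_TRANSITIONS = {
        "Firing": {"Acknowledged"},
        "Acknowledged": {"Resolved"},
        "Resolved": set(),  # closed; no further transitions allowed
    }

    def __init__(self, title):
        self.title = title
        self.status = "Firing"
        self.handler = None  # who acknowledged or resolved it

    def _transition(self, new_status, user):
        if new_status not in self.VALID_TRANSITIONS[self.status]:
            raise ValueError(f"Cannot go from {self.status} to {new_status}")
        self.status = new_status
        self.handler = user

    def acknowledge(self, user):
        """Step 3: claim the alert so others don't duplicate work."""
        self._transition("Acknowledged", user)

    def resolve(self, user):
        """Step 6: close the alert after the fix is verified."""
        self._transition("Resolved", user)
```

The transition table is what enforces the rules described below: you can't resolve an alert that was never acknowledged, and a resolved alert stays closed.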
Step 1: Access the Alert
Via Notification Link
Most notifications include a direct link:
- Click the link
- Goes straight to the alert details
Via Dashboard
- Go to SRE → Alerts
- Look for the alert with Firing status (red badge)
- Usually at the top of the list
Find Specific Alert
Use filters to find your alert quickly:
Status: Firing (most urgent)
Severity: Critical (highest priority)
Step 2: Review Alert Details
When you open the alert, you'll see:
Alert Information:
- 🔔 Alert status (Firing/Acknowledged/Resolved)
- Alert title and description
- Resource affected
- Severity level
- When it fired (e.g., "5 minutes ago")
Example Alert:
Status: Firing 🔔
Title: High CPU Usage on Production API
Severity: Critical
Resource: prod-api-server-01
Fired: 3 minutes ago
Description: CPU usage exceeded 80% threshold
Current value: 87%
Step 3: Acknowledge the Alert
Why Acknowledge?
Tells your team:
- You've seen the alert
- You're investigating it
- Others don't need to also respond
How to Acknowledge
- Click the Acknowledge button
- Confirm in the popup dialog
- Alert status changes from "Firing" to "Acknowledged"
- Your name appears as the investigator
What Happens:
Before: Status = Firing (everyone should look at it)
↓
You click Acknowledge
↓
After: Status = Acknowledged (I'm looking at it)
When to Acknowledge:
- ✅ Immediately when you start investigating
- ✅ Even if you can't fix it right away
- ✅ So team knows you're on it
When NOT to Acknowledge:
- ❌ If you don't actually know what's happening
- ❌ If someone else should handle it
- ❌ If you can't take action
Step 4: Investigate the Issue
What to Do
1. Understand the Alert
- What is it monitoring?
- What threshold triggered it?
- What's the current value?
2. Check the Resource
- Log in to the system
- View metrics/logs for that resource
- Check for errors or anomalies
3. Identify the Problem
- What's causing the issue?
- When did it start?
- Is it affecting users?
4. Document Your Findings
- Write down what you found
- Note what you're trying
- Keep the team informed in Slack if needed
Investigation Tips
If CPU is High:
- Check what process is using CPU
- Look for runaway queries or loops
- Check if traffic spike occurred
If API is Slow:
- Check database performance
- Review error logs
- Check if upstream service is down
If Memory is High:
- Look for memory leaks
- Check if cache is bloated
- Verify application version
If Service is Down:
- Check if it's running
- Look at recent deployments
- Check network connectivity
- Review error logs
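For the high-CPU case, the first triage step is usually sorting processes by CPU usage. A small sketch that parses `ps axo pid,pcpu,comm`-style output; the sample data here is made up for illustration:

```python
def top_cpu_processes(ps_output, n=3):
    """Parse `ps axo pid,pcpu,comm`-style output and return the n
    processes using the most CPU as (pid, cpu%, command) tuples."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        pid, pcpu, comm = line.split(None, 2)
        rows.append((int(pid), float(pcpu), comm))
    return sorted(rows, key=lambda row: row[1], reverse=True)[:n]

# Made-up sample; in practice feed it the real output of
# `ps axo pid,pcpu,comm` (or use a library such as psutil).
sample = """\
  PID %CPU COMMAND
  101  3.2 nginx
  202 87.5 api-server
  303  0.4 cron
"""
```

Running `top_cpu_processes(sample, 1)` surfaces the runaway process (`api-server` at 87.5%) immediately, which tells you where to look next.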
Step 5: Fix the Problem
Take Action
Based on your investigation, take appropriate action:
Common Fixes:
- Restart the service
- Scale up the application
- Clear cache
- Kill runaway process
- Deploy a fix
- Adjust configuration
- Route traffic elsewhere
Verify the Fix
After taking action, verify it worked:
- Check the metric that triggered the alert
- Confirm it's back to normal
- Test the functionality
- Confirm with users that it's working
Before Resolving: Make absolutely sure the issue is fixed. Don't resolve the alert only to have it fire again five minutes later.
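One way to make "verify before resolving" concrete is to require several consecutive healthy samples rather than trusting a single good reading. A hedged sketch, assuming you can poll the metric that fired:

```python
def is_recovered(samples, threshold, required=3):
    """Return True only if the last `required` metric samples are all
    below the alert threshold, guarding against one lucky reading."""
    if len(samples) < required:
        return False
    return all(value < threshold for value in samples[-required:])
```

With the CPU example from Step 2 (threshold 80%), `is_recovered([87, 91, 76, 72, 68], 80)` is True because the last three readings are healthy, while `is_recovered([87, 76, 91], 80)` is False because the latest reading is still above threshold.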
Step 6: Resolve the Alert
How to Resolve
- Click the Resolve button
- Confirm in the popup dialog
- Alert status changes to "Resolved" (✓)
- Your name appears as who resolved it
What This Means:
Status: Resolved ✓ = Issue is fixed, no more action needed
When to Resolve:
- ✅ After you've fixed the underlying issue
- ✅ After you've verified the fix works
- ✅ When the metric is back to normal
Don't Resolve If:
- ❌ The issue isn't completely fixed
- ❌ You haven't verified the fix works
- ❌ The metric hasn't returned to an acceptable level
Step 7: Verify and Document
Final Verification
Check that:
- The alert status changed to "Resolved"
- The metric is back to normal
- No related alerts are firing
- Users aren't reporting issues
- Team is aware it's resolved
Document for the Team
Post an update in Slack:
✅ RESOLVED: Production API High CPU
Issue: Memory leak in v2.1.0
Fix: Rolled back to v2.0.5
Status: All systems normal, no user impact
ETA for permanent fix: Tuesday
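If you post resolution updates often, a small formatter keeps them consistent. A sketch that produces the message format shown above (the field names are just a convention, not a required format):

```python
def format_resolution(title, issue, fix, status, eta=None):
    """Build a Slack-ready resolution update in the style shown above."""
    lines = [
        f"✅ RESOLVED: {title}",
        f"Issue: {issue}",
        f"Fix: {fix}",
        f"Status: {status}",
    ]
    if eta:  # only mention an ETA when a permanent fix is still pending
        lines.append(f"ETA for permanent fix: {eta}")
    return "\n".join(lines)
```

Posting the result to Slack (e.g. via an incoming webhook) is left out here, since that depends on your workspace setup.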
Alert Statuses Reference
Firing Status 🔔 (Red)
What it means: Alert condition is currently true. Action is needed.
What you should do:
- Acknowledge it
- Investigate
- Fix it
- Resolve it
Duration: Until you acknowledge or resolve it
Acknowledged Status ⏱️ (Yellow)
What it means: Someone is already investigating. No need for others to duplicate work.
What it shows:
- Shows who acknowledged it
- Shows when they acknowledged it
Next step: Resolve once the issue is fixed
Duration: While being investigated
Resolved Status ✓ (Green)
What it means: The issue has been fixed. Alert is closed.
What it shows:
- Shows who resolved it
- Shows when it was resolved
Historical: Kept for records and trend analysis
Duration: Forever (historical record)
Filtering to Find Alerts
Filter by Status
Firing:
- Currently active alerts
- Need attention NOW
- Most urgent
Acknowledged:
- Someone is investigating
- Not yet resolved
- Don't need to duplicate work
Resolved:
- Historical view
- Issue is fixed
- Use for trend analysis
Filter by Severity
Critical:
- Immediate action needed
- System down or critical function broken
High:
- Needs urgent attention
- User-impacting
Medium:
- Should be addressed
- Non-critical issues
Low:
- Nice to know
- Can handle when you have time
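The two filters compose: during an incident you usually want Firing + Critical first. A sketch over plain dictionaries, where the field names mirror the UI labels rather than any actual API:

```python
def filter_alerts(alerts, status=None, severity=None):
    """Filter a list of alert dicts by status and/or severity.
    Passing None for a field means 'match anything'."""
    return [
        alert for alert in alerts
        if (status is None or alert["status"] == status)
        and (severity is None or alert["severity"] == severity)
    ]
```

For example, `filter_alerts(alerts, status="Firing", severity="Critical")` returns only the alerts that need attention right now.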
Team Coordination
When Multiple People Need to Help
- First person: Acknowledge the alert
- Post in Slack: "I'm on the X alert, currently investigating"
- Assign tasks: "Can someone check the database?"
- Coordinate: "Try restarting service on server-02"
- Report: "Found the issue, deploying fix now"
- Resolve: Once fixed
Escalation if Stuck
If you can't resolve it:
- Post in Slack asking for help
- Escalate to team lead if urgent
- If critical, create incident ticket
- Keep alert Acknowledged so others know it's being worked on
Common Alert Scenarios
Scenario 1: False Alarm
Problem: Alert fired but nothing is actually wrong
Solution:
- Acknowledge it
- Verify the metric
- Resolve if confirmed to be false
- Later: Adjust the alert threshold to prevent false alarms
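To adjust the threshold with data rather than guesswork, one common approach is to set it slightly above a high percentile of recent normal readings, so routine peaks stop tripping the alert. A sketch (the nearest-rank percentile and 10% margin are illustrative choices, not a fixed rule):

```python
def suggest_threshold(history, percentile=95, margin=1.1):
    """Suggest a threshold a little above the given percentile of
    historical metric values, so normal peaks no longer fire the alert."""
    values = sorted(history)
    # nearest-rank percentile: index of the p-th percentile value
    idx = int(round(percentile / 100 * len(values))) - 1
    idx = min(max(idx, 0), len(values) - 1)
    return values[idx] * margin
```

For a metric whose normal readings span 1–100, this suggests a threshold around 104.5 (the 95th-percentile value of 95, plus a 10% margin).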
Scenario 2: Recurring Alert
Problem: Same alert keeps firing over and over
Solution:
- First time: Acknowledge and investigate
- Second time: Find root cause
- Implement permanent fix
- Adjust alert threshold if needed
Scenario 3: Cascading Alerts
Problem: One issue causes multiple alerts to fire
Example:
Database goes down
→ API Alert fires
→ Website Alert fires
→ Scheduled Job Alert fires
Solution:
- Fix the root cause (the database)
- Related alerts clear once their underlying conditions recover; resolve any that remain
- Document it for future reference
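When alerts cascade, the failing resource that nothing else upstream depends on is the place to start. A sketch using a dependency map (resource -> what it depends on; the resource names are illustrative):

```python
def find_root_causes(failing, depends_on):
    """Given failing resources and a map of resource -> dependencies,
    return the failing resources that don't depend on any other
    failing resource -- the likely root causes."""
    failing = set(failing)
    return sorted(
        resource for resource in failing
        if not any(dep in failing for dep in depends_on.get(resource, ()))
    )
```

In the database example above, the API, website, and scheduled job all trace back to the database, so it alone is reported as the root cause.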
Scenario 4: Alert During Maintenance
Problem: Alert fires while you're doing planned maintenance
Solution:
- Acknowledge it, even though it's expected
- Note in the alert: "Expected, maintenance in progress"
- Resolve it when maintenance is complete
Best Practices for Responding
1. Respond Quickly
- Acknowledge within 5 minutes
- Team should see it's being handled
- Time is critical for critical alerts
2. Acknowledge Immediately
- Don't wait until you have the solution
- Let team know you're investigating
- Prevents duplicate work
3. Keep Team Informed
- Post updates in Slack
- Let others know what you found
- Ask for help if needed
4. Verify Before Resolving
- Don't resolve until truly fixed
- Verify the metric is back to normal
- Check downstream systems
5. Document What Happened
- Write it down for future reference
- Include root cause
- Note how you fixed it
6. Learn from It
- Why did it happen?
- How can we prevent it next time?
- Do we need to adjust alert thresholds?
- Do we need better monitoring?
Quick Response Checklist
- Alert notification received
- Opened SRE Alerts page
- Found the firing alert
- Clicked Acknowledge
- Reviewed alert details
- Investigated the issue
- Found the root cause
- Implemented a fix
- Verified the fix works
- Metric back to normal
- Clicked Resolve
- Posted update in Slack
- Documented for future reference
Getting Help During an Alert
Need to ask for help?
- Post in Slack channel
- @mention the relevant team
- Include alert details
- Ask specific questions
Example:
@backend-team: Production API CPU alert
Currently at 87%, investigating.
Can someone check database performance?
Last deployment was 2 hours ago, could be related.
Next Steps
- Best Practices - Learn more advanced strategies
- Alert Management - Manage your rules
- Monitoring Guide - Overall monitoring strategy
Quick Links
| Need | Location |
|---|---|
| View alerts | SRE → Alerts |
| Create rule | Alerts → Alert Rules |
| Configure notifications | Alerts → Alert Config |
| Help | Click ? icon |
Contact Support
For issues with responding to alerts:
- Email: [email protected]
- Dashboard chat: Available 24/7