Responding to Alerts - SRE Guide | Nife Deploy
When an alert fires, you need to respond quickly. This guide shows you how to handle active alerts.
Getting an Alert#
What You'll See#
When an alert fires, you'll be notified through your configured channels:
If Email:
- You'll receive an email with the alert details and a link to the dashboard
If Slack:
- A message is posted to your configured Slack channel
If PagerDuty:
- You'll get paged immediately
- Incident automatically created
- Escalates if not acknowledged
Quick Response Workflow#
Step 1: Access the Alert#
Via Notification Link#
Most notifications include a direct link:
- Click the link
- Goes straight to the alert details
Via Dashboard#
- Go to SRE → Alerts
- Look for the alert with Firing status (red badge)
- Usually at the top of the list
Find Specific Alert#
Use the status and severity filters to find your alert quickly.
Step 2: Review Alert Details#
When you open the alert, you'll see:
Alert Information:
- Alert status (Firing/Acknowledged/Resolved)
- Alert title and description
- Resource affected
- Severity level
- When it fired (e.g., "5 minutes ago")
Example Alert (illustrative):
- 🔴 Firing: "High CPU Usage" on web-server-01
- Severity: Critical, fired 5 minutes ago
- Description: CPU usage is 95%, above the 80% threshold
Step 3: Acknowledge the Alert#
Why Acknowledge?#
Tells your team:
- You've seen the alert
- You're investigating it
- Others don't need to also respond
How to Acknowledge#
- Click the Acknowledge button
- Confirm in the popup dialog
- Alert status changes from "Firing" to "Acknowledged"
- Your name appears as the investigator
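If your Nife Deploy setup exposes an HTTP API, acknowledging can also be scripted. The base URL, endpoint path, and payload below are assumptions for illustration (check your own API docs); the sketch only builds the request so you can inspect it before sending anything:

```python
import json
import urllib.request

# Hypothetical sketch: the endpoint and payload are assumptions,
# not the documented Nife Deploy API.
API_BASE = "https://api.example.com/sre/alerts"  # placeholder base URL

def build_ack_request(alert_id: str, token: str, note: str = "") -> urllib.request.Request:
    """Build (but don't send) an acknowledge request for one alert."""
    payload = json.dumps({"status": "acknowledged", "note": note}).encode()
    return urllib.request.Request(
        url=f"{API_BASE}/{alert_id}/acknowledge",
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

req = build_ack_request("alrt-123", token="YOUR_TOKEN", note="Investigating high CPU")
print(req.method, req.full_url)
```

Sending it is a one-liner (`urllib.request.urlopen(req)`); keep the token out of source control.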
What Happens:#
The alert stops showing as unhandled: it's marked Acknowledged with your name and a timestamp, so teammates know to skip it.
When to Acknowledge:#
- ✅ Immediately when you start investigating
- ✅ Even if you can't fix it right away
- ✅ So the team knows you're on it
When NOT to Acknowledge:#
- ❌ If you don't actually know what's happening
- ❌ If someone else should handle it
- ❌ If you can't take action
Step 4: Investigate the Issue#
What to Do#
Understand the Alert
- What is it monitoring?
- What threshold triggered it?
- What's the current value?
Check the Resource
- Log in to the system
- View metrics/logs for that resource
- Check for errors or anomalies
Identify the Problem
- What's causing the issue?
- When did it start?
- Is it affecting users?
Document Your Findings
- Write down what you found
- Note what you're trying
- Keep team informed in Slack if needed
Investigation Tips#
If CPU is High:
- Check what process is using CPU
- Look for runaway queries or loops
- Check if traffic spike occurred
If API is Slow:
- Check database performance
- Review error logs
- Check if upstream service is down
If Memory is High:
- Look for memory leaks
- Check if cache is bloated
- Verify application version
If Service is Down:
- Check if it's running
- Look at recent deployments
- Check network connectivity
- Review error logs
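A few of the checks above can be scripted with nothing but the standard library. This is an illustrative, Linux-only sketch (it reads `/proc/meminfo`); your dashboards remain the primary source of truth:

```python
import os
import socket

# Illustrative triage helpers, standard library only (Linux assumed).

def cpu_pressure() -> float:
    """1-minute load average divided by core count; > 1.0 means saturation."""
    return os.getloadavg()[0] / (os.cpu_count() or 1)

def memory_available_pct() -> float:
    """Percent of memory still available, parsed from /proc/meminfo."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.strip().split()[0])  # values are in kB
    return 100.0 * info["MemAvailable"] / info["MemTotal"]

def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Cheap 'is the service even listening?' check."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(f"CPU pressure:  {cpu_pressure():.2f}")
print(f"Mem available: {memory_available_pct():.1f}%")
print(f"Port 443 open: {port_is_open('localhost', 443)}")
```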
Step 5: Fix the Problem#
Take Action#
Based on your investigation, take appropriate action:
Common Fixes:
- Restart the service
- Scale up the application
- Clear cache
- Kill runaway process
- Deploy a fix
- Adjust configuration
- Route traffic elsewhere
Verify the Fix#
After taking action, verify it worked:
- Check the metric that triggered alert
- Confirm it's back to normal
- Test the functionality
- Ask users to confirm it's working
Before Resolving: Make absolutely sure the issue is fixed. Don't resolve the alert only to have it fire again five minutes later.
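One way to make "verify before resolving" concrete is to poll the triggering metric until it stays healthy for a sustained window, not just one good sample. `get_metric` below is a placeholder for however you query your metrics:

```python
import time

def wait_until_healthy(get_metric, threshold, hold_seconds=300,
                       interval=30, timeout=1800):
    """Return True once the metric stays below threshold for hold_seconds."""
    healthy_since = None
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_metric() < threshold:
            if healthy_since is None:
                healthy_since = time.monotonic()
            if time.monotonic() - healthy_since >= hold_seconds:
                return True  # sustained recovery: safe to resolve
        else:
            healthy_since = None  # relapsed: one good sample isn't enough
        time.sleep(interval)
    return False  # never stabilized: keep investigating, don't resolve

# Demo with canned readings and fast parameters (illustration only):
readings = iter([95, 85, 70, 60, 55, 55, 55, 55, 55, 55])
ok = wait_until_healthy(lambda: next(readings), threshold=80,
                        hold_seconds=0.02, interval=0.01, timeout=2)
print("safe to resolve:", ok)
```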
Step 6: Resolve the Alert#
How to Resolve#
- Click the Resolve button
- Confirm in the popup dialog
- Alert status changes to "Resolved" (✅)
- Your name appears as who resolved it
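The acknowledge/resolve lifecycle these steps walk through is a tiny state machine, and modeling it makes the rules explicit. This class is illustrative bookkeeping, not the product's implementation:

```python
# Allowed status transitions, mirroring this guide's workflow.
# (Resolving straight from Firing is allowed for false alarms.)
ALLOWED = {
    "firing": {"acknowledged", "resolved"},
    "acknowledged": {"resolved"},
    "resolved": set(),  # closed alerts stay closed
}

class Alert:
    def __init__(self, title):
        self.title = title
        self.status = "firing"
        self.actor = None

    def transition(self, new_status, actor):
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"can't go from {self.status} to {new_status}")
        self.status = new_status
        self.actor = actor  # the dashboard shows who acknowledged/resolved

alert = Alert("High CPU on web-server-01")
alert.transition("acknowledged", "dana")
alert.transition("resolved", "dana")
print(alert.status, alert.actor)  # resolved dana
```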
What This Means:#
The underlying issue is fixed and the alert is closed; it's kept as a historical record for trend analysis.
When to Resolve:#
- ✅ After you've fixed the underlying issue
- ✅ After you've verified the fix works
- ✅ When the metric is back to normal
Don't Resolve Until:#
- ❌ The issue is completely fixed
- ❌ You've verified the fix works
- ❌ Metric is back to acceptable level
Step 7: Verify and Document#
Final Verification#
Check that:
- The alert status changed to "Resolved"
- The metric is back to normal
- No related alerts are firing
- Users aren't reporting issues
- Team is aware it's resolved
Document for the Team#
Post an update in Slack, for example:
"✅ Resolved: High CPU on web-server-01. Root cause: runaway query. Fix: killed the query and added an index. Total time: ~25 minutes."
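If your team uses Slack incoming webhooks, posting the update can be automated. The webhook URL below is a placeholder (real ones come from your Slack app configuration), and the sketch only builds the request:

```python
import json
import urllib.request

# Placeholder: substitute your real Slack incoming webhook URL.
WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"

def build_slack_update(text: str) -> urllib.request.Request:
    """Build the webhook POST; urllib.request.urlopen(req) would send it."""
    return urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps({"text": text}).encode(),
        method="POST",
        headers={"Content-Type": "application/json"},
    )

req = build_slack_update(
    ":white_check_mark: Resolved: High CPU on web-server-01. "
    "Root cause: runaway query. Fix: killed it and added an index."
)
print(req.method, req.full_url)
```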
Alert Statuses Reference#
Firing Status 🔴 (Red)#
What it means: Alert condition is currently true. Action is needed.
What you should do:
- Acknowledge it
- Investigate
- Fix it
- Resolve it
Duration: Until you resolve it
Acknowledged Status ⏱️ (Yellow)#
What it means: Someone is already investigating. No need for others to duplicate work.
What it shows:
- Shows who acknowledged it
- Shows when they acknowledged it
Next step: Resolve once the issue is fixed
Duration: While being investigated
Resolved Status ✅ (Green)#
What it means: The issue has been fixed. Alert is closed.
What it shows:
- Shows who resolved it
- Shows when it was resolved
Historical: Kept for records and trend analysis
Duration: Forever (historical record)
Filtering to Find Alerts#
Filter by Status#
Firing:
- Currently active alerts
- Need attention NOW
- Most urgent
Acknowledged:
- Someone is investigating
- Not yet resolved
- Don't need to duplicate work
Resolved:
- Historical view
- Issue is fixed
- Use for trend analysis
Filter by Severity#
Critical:
- Immediate action needed
- System down or critical function broken
High:
- Needs urgent attention
- User-impacting
Medium:
- Should be addressed
- Non-critical issues
Low:
- Nice to know
- Can handle when you have time
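If you also pull alerts through an API or export, the same status and severity filters reduce to a small predicate. The record shape here is assumed for illustration:

```python
# Severity ranking used to express "this severity or worse".
SEVERITY_ORDER = {"low": 0, "medium": 1, "high": 2, "critical": 3}

def filter_alerts(alerts, status=None, min_severity=None):
    """Keep alerts matching a status and/or at or above a severity."""
    result = []
    for a in alerts:
        if status and a["status"] != status:
            continue
        if min_severity and SEVERITY_ORDER[a["severity"]] < SEVERITY_ORDER[min_severity]:
            continue
        result.append(a)
    return result

alerts = [
    {"title": "High CPU", "status": "firing", "severity": "critical"},
    {"title": "Slow API", "status": "acknowledged", "severity": "high"},
    {"title": "Disk 70%", "status": "firing", "severity": "low"},
]
urgent = filter_alerts(alerts, status="firing", min_severity="high")
print([a["title"] for a in urgent])  # ['High CPU']
```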
Team Coordination#
When Multiple People Need to Help#
- First person: Acknowledge the alert
- Post in Slack: "I'm on the X alert, currently investigating"
- Assign tasks: "Can someone check the database?"
- Coordinate: "Try restarting service on server-02"
- Report: "Found the issue, deploying fix now"
- Resolve: Once fixed
Escalation if Stuck#
If you can't resolve it:
- Post in Slack asking for help
- Escalate to team lead if urgent
- If critical, create incident ticket
- Keep alert Acknowledged so others know it's being worked on
Common Alert Scenarios#
Scenario 1: False Alarm#
Problem: Alert fired but nothing is actually wrong
Solution:
- Acknowledge it
- Verify the metric
- Resolve if confirmed to be false
- Later: Adjust the alert threshold to prevent false alarms
Scenario 2: Recurring Alert#
Problem: Same alert keeps firing over and over
Solution:
- First time: Acknowledge and investigate
- Second time: Find root cause
- Implement permanent fix
- Adjust alert threshold if needed
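A recurring alert is easy to spot programmatically: count firings of the same rule inside a sliding window. This tracker is an illustrative sketch, not a product feature:

```python
from collections import deque
import time

class RecurrenceTracker:
    """Flags a rule as 'recurring' when it fires too often in a window."""

    def __init__(self, max_firings=3, window=3600):
        self.max_firings = max_firings
        self.window = window  # seconds
        self.firings = {}     # rule name -> deque of firing timestamps

    def record(self, rule, now=None):
        """Record a firing; return True if the rule is now recurring."""
        now = time.time() if now is None else now
        q = self.firings.setdefault(rule, deque())
        q.append(now)
        while q and now - q[0] > self.window:
            q.popleft()  # drop firings that aged out of the window
        return len(q) > self.max_firings

tracker = RecurrenceTracker(max_firings=2, window=600)
for t in (0, 100, 200):
    recurring = tracker.record("high-cpu", now=t)
print(recurring)  # True: third firing within 10 minutes
```

A `True` result is the signal to stop ack/resolve cycling and hunt for the root cause.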
Scenario 3: Cascading Alerts#
Problem: One issue causes multiple alerts to fire
Example: the database goes down, so the API latency, error rate, and service health alerts all fire at once.
Solution:
- Fix the root cause (database)
- All related alerts auto-resolve
- Document it for future reference
Scenario 4: Alert During Maintenance#
Problem: Alert fires while you're doing planned maintenance
Solution:
- Acknowledge it, even though you expected it
- Note in the alert: "Expected, maintenance in progress"
- Resolve when maintenance is complete
Best Practices for Responding#
1. Respond Quickly#
- Acknowledge within 5 minutes
- Team should see it's being handled
- Time is critical for critical alerts
2. Acknowledge Immediately#
- Don't wait until you have the solution
- Let team know you're investigating
- Prevents duplicate work
3. Keep Team Informed#
- Post updates in Slack
- Let others know what you found
- Ask for help if needed
4. Verify Before Resolving#
- Don't resolve until truly fixed
- Verify the metric is back to normal
- Check downstream systems
5. Document What Happened#
- Write it down for future reference
- Include root cause
- Note how you fixed it
6. Learn from It#
- Why did it happen?
- How can we prevent it next time?
- Do we need to adjust alert thresholds?
- Do we need better monitoring?
Quick Response Checklist#
- Alert notification received
- Opened SRE Alerts page
- Found the firing alert
- Clicked Acknowledge
- Reviewed alert details
- Investigated the issue
- Found the root cause
- Implemented a fix
- Verified the fix works
- Metric back to normal
- Clicked Resolve
- Posted update in Slack
- Documented for future reference
Getting Help During an Alert#
Need to ask for help?
- Post in Slack channel
- @mention the relevant team
- Include alert details
- Ask specific questions
Example:
"Need help with the High CPU alert on web-server-01. It's been at 95% for 10 minutes and a restart didn't help. @backend-team, can someone check for runaway queries?"
Next Steps#
- Best Practices - Learn more advanced strategies
- Alert Management - Manage your rules
- Monitoring Guide - Overall monitoring strategy
Quick Links#
| Need | Location |
|---|---|
| View alerts | SRE → Alerts |
| Create rule | Alerts → Alert Rules |
| Configure notifications | Alerts → Alert Config |
| Help | Click ? icon |
Contact Support#
For issues with responding to alerts:
- Email: [email protected]
- Dashboard chat: Available 24/7