Responding to Alerts - SRE Guide | Nife Deploy

When an alert fires, you need to respond quickly. This guide shows you how to handle active alerts.

Getting an Alert#

What You'll See#

When an alert fires, you'll be notified through your configured channels:

If Email:

Subject: CRITICAL: Production API Down
Body: Your production API is not responding
Severity: Critical
Time: 2024-01-15 14:32:00 UTC

If Slack:

🚨 CRITICAL: Production API Down
Status: Firing
Resource: api.example.com
Click here to view details

If PagerDuty:

  • You'll get paged immediately
  • Incident automatically created
  • Escalates if not acknowledged

Quick Response Workflow#

1. Get Notification
↓
2. Open SRE Alerts page
↓
3. Click "Acknowledge"
↓
4. Investigate the issue
↓
5. Fix the problem
↓
6. Click "Resolve"
↓
7. Confirm everything is working
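The workflow above can be modeled as a tiny state machine. This is an illustrative sketch of the documented transitions, not Nife Deploy's actual implementation:

```python
# Illustrative model of the alert lifecycle: Firing -> Acknowledged -> Resolved.
# The allowed transitions mirror the documented workflow, nothing more.

VALID_TRANSITIONS = {
    "firing": {"acknowledged"},
    "acknowledged": {"resolved"},
    "resolved": set(),  # terminal: kept as a historical record
}

class Alert:
    def __init__(self, title):
        self.title = title
        self.status = "firing"

    def transition(self, new_status):
        if new_status not in VALID_TRANSITIONS[self.status]:
            raise ValueError(f"cannot go from {self.status} to {new_status}")
        self.status = new_status

alert = Alert("Production API Down")
alert.transition("acknowledged")  # step 3: you're investigating
alert.transition("resolved")      # step 6: issue fixed and verified
```

Enforcing acknowledge-before-resolve in this model matches the guide's point that the team should always know someone is investigating before an alert is closed.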

Step 1: Access the Alert#

Via Notification Link#

Most notifications include a direct link:

  • Click the link
  • Goes straight to the alert details

Via Dashboard#

  1. Go to SRE → Alerts
  2. Look for the alert with Firing status (red badge)
  3. Usually at the top of the list

Find Specific Alert#

Use filters to find your alert quickly:

Status: Firing (most urgent)
Severity: Critical (highest priority)

Step 2: Review Alert Details#

When you open the alert, you'll see:

Alert Information:

  • 🔔 Alert status (Firing/Acknowledged/Resolved)
  • Alert title and description
  • Resource affected
  • Severity level
  • When it fired (e.g., "5 minutes ago")

Example Alert:

Status: Firing 🔔
Title: High CPU Usage on Production API
Severity: Critical
Resource: prod-api-server-01
Fired: 3 minutes ago
Description: CPU usage exceeded 80% threshold
Current value: 87%

Step 3: Acknowledge the Alert#

Why Acknowledge?#

Tells your team:

  • You've seen the alert
  • You're investigating it
  • Others don't need to also respond

How to Acknowledge#

  1. Click the Acknowledge button
  2. Confirm in the popup dialog
  3. Alert status changes from "Firing" to "Acknowledged"
  4. Your name appears as the investigator

What Happens:#

Before: Status = Firing (everyone should look at it)
↓
You click Acknowledge
↓
After: Status = Acknowledged (I'm looking at it)

When to Acknowledge:#

  • ✅ Immediately when you start investigating
  • ✅ Even if you can't fix it right away
  • ✅ So the team knows you're on it

When NOT to Acknowledge:#

  • โŒ If you don't actually know what's happening
  • โŒ If someone else should handle it
  • โŒ If you can't take action

Step 4: Investigate the Issue#

What to Do#

  1. Understand the Alert

    • What is it monitoring?
    • What threshold triggered it?
    • What's the current value?
  2. Check the Resource

    • Log in to the system
    • View metrics/logs for that resource
    • Check for errors or anomalies
  3. Identify the Problem

    • What's causing the issue?
    • When did it start?
    • Is it affecting users?
  4. Document Your Findings

    • Write down what you found
    • Note what you're trying
    • Keep team informed in Slack if needed

Investigation Tips#

If CPU is High:

  • Check what process is using CPU
  • Look for runaway queries or loops
  • Check if traffic spike occurred
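Alongside tools like top and ps, a quick scripted read on CPU pressure is to compare the 1-minute load average to the core count. A stdlib-only sketch (Unix-only; the "above 1.0 means saturated" rule of thumb is a general heuristic, not a product threshold):

```python
import os

def cpu_pressure():
    """Return 1-minute load average per core (Unix only).

    A value much above 1.0 roughly means more runnable work
    than the CPUs can keep up with.
    """
    load1, _, _ = os.getloadavg()
    cores = os.cpu_count() or 1
    return load1 / cores

print(f"load per core: {cpu_pressure():.2f}")
```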

If API is Slow:

  • Check database performance
  • Review error logs
  • Check if upstream service is down
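One way to judge whether latency is genuinely elevated, rather than one unlucky sample, is to look at a high percentile of recent request timings. A minimal nearest-rank p95, assuming you have already collected timings in milliseconds from logs or a probe:

```python
import math

def p95(samples_ms):
    """95th-percentile latency (nearest-rank method) from samples in ms."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    # nearest-rank: the ceil(0.95 * n)-th smallest value (1-indexed)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]
```

Comparing p95 before and after a suspect deployment is usually more telling than eyeballing individual slow requests.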

If Memory is High:

  • Look for memory leaks
  • Check if cache is bloated
  • Verify application version
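On Linux you can get a fast used-memory percentage straight from /proc/meminfo. A sketch that parses meminfo-style text (the field names are the standard kernel ones):

```python
def mem_used_percent(meminfo_text):
    """Percent of memory in use, parsed from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if rest:
            fields[key.strip()] = int(rest.split()[0])  # value in kB
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total

# On a Linux host:
#   with open("/proc/meminfo") as f:
#       print(mem_used_percent(f.read()))
```

MemAvailable (rather than MemFree) is the right field here because it already accounts for reclaimable cache, so a "bloated cache" won't look like a leak.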

If Service is Down:

  • Check if it's running
  • Look at recent deployments
  • Check network connectivity
  • Review error logs
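A basic reachability probe helps separate "process is down" from "network path is broken". A stdlib-only TCP check, assuming you know the host and port the service should listen on:

```python
import socket

def tcp_check(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If the port answers but the application still fails, the service process is up and the problem is more likely in the application layer or a dependency.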

Step 5: Fix the Problem#

Take Action#

Based on your investigation, take appropriate action:

Common Fixes:

  • Restart the service
  • Scale up the application
  • Clear cache
  • Kill runaway process
  • Deploy a fix
  • Adjust configuration
  • Route traffic elsewhere

Verify the Fix#

After taking action, verify it worked:

  • Check the metric that triggered the alert
  • Confirm it's back to normal
  • Test the functionality
  • Ask affected users to confirm things are working

Before Resolving: Make absolutely sure the issue is fixed. Don't resolve it only to have it fire again five minutes later.
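One way to honor that rule is to require several consecutive healthy readings before declaring recovery. A sketch, where read_metric is whatever function you use to fetch the current value:

```python
import time

def confirm_recovery(read_metric, threshold, checks=5, interval_s=0.0):
    """Return True only if the metric stays below `threshold` for
    `checks` consecutive reads -- guards against resolving too early."""
    for _ in range(checks):
        if read_metric() >= threshold:
            return False
        if interval_s:
            time.sleep(interval_s)
    return True

# Example (get_cpu_percent is hypothetical -- plug in your metric source):
#   confirm_recovery(get_cpu_percent, threshold=80, checks=5, interval_s=60)
```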


Step 6: Resolve the Alert#

How to Resolve#

  1. Click the Resolve button
  2. Confirm in the popup dialog
  3. Alert status changes to "Resolved" (✓)
  4. Your name appears as who resolved it

What This Means:#

Status: Resolved ✓ = Issue is fixed, no more action needed

When to Resolve:#

  • ✅ After you've fixed the underlying issue
  • ✅ After you've verified the fix works
  • ✅ When the metric is back to normal

Don't Resolve If:#

  • ❌ The issue isn't completely fixed
  • ❌ You haven't verified the fix works
  • ❌ The metric isn't back to an acceptable level

Step 7: Verify and Document#

Final Verification#

Check that:

  • The alert status changed to "Resolved"
  • The metric is back to normal
  • No related alerts are firing
  • Users aren't reporting issues
  • Team is aware it's resolved

Document for the Team#

Post an update in Slack:

✅ RESOLVED: Production API High CPU
Issue: Memory leak in v2.1.0
Fix: Rolled back to v2.0.5
Status: All systems normal, no user impact
ETA for permanent fix: Tuesday

Alert Statuses Reference#

Firing Status 🔔 (Red)#

What it means: Alert condition is currently true. Action is needed.

What you should do:

  1. Acknowledge it
  2. Investigate
  3. Fix it
  4. Resolve it

Duration: Until you resolve it

Acknowledged Status ⏱️ (Yellow)#

What it means: Someone is already investigating. No need for others to duplicate work.

What it shows:

  • Shows who acknowledged it
  • Shows when they acknowledged it

Next step: Resolve once the issue is fixed

Duration: While being investigated

Resolved Status ✓ (Green)#

What it means: The issue has been fixed. Alert is closed.

What it shows:

  • Shows who resolved it
  • Shows when it was resolved

Historical: Kept for records and trend analysis

Duration: Forever (historical record)


Filtering to Find Alerts#

Filter by Status#

Firing:

  • Currently active alerts
  • Need attention NOW
  • Most urgent

Acknowledged:

  • Someone is investigating
  • Not yet resolved
  • Don't need to duplicate work

Resolved:

  • Historical view
  • Issue is fixed
  • Use for trend analysis

Filter by Severity#

Critical:

  • Immediate action needed
  • System down or critical function broken

High:

  • Needs urgent attention
  • User-impacting

Medium:

  • Should be addressed
  • Non-critical issues

Low:

  • Nice to know
  • Can handle when you have time

Team Coordination#

When Multiple People Need to Help#

  1. First person: Acknowledge the alert
  2. Post in Slack: "I'm on the X alert, currently investigating"
  3. Assign tasks: "Can someone check the database?"
  4. Coordinate: "Try restarting service on server-02"
  5. Report: "Found the issue, deploying fix now"
  6. Resolve: Once fixed

Escalation if Stuck#

If you can't resolve it:

  1. Post in Slack asking for help
  2. Escalate to team lead if urgent
  3. If critical, create incident ticket
  4. Keep the alert Acknowledged so others know it's being worked on

Common Alert Scenarios#

Scenario 1: False Alarm#

Problem: Alert fired but nothing is actually wrong

Solution:

  1. Acknowledge it
  2. Verify the metric
  3. Resolve if confirmed to be false
  4. Later: Adjust the alert threshold to prevent false alarms
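When you later adjust the threshold (step 4), one common technique is requiring the breach to persist across several consecutive samples before firing, so brief spikes don't page anyone. An illustrative sketch, not a Nife Deploy setting:

```python
def should_fire(samples, threshold, sustained=3):
    """Fire only if the last `sustained` samples all breach the threshold.

    A common way to cut false alarms caused by momentary spikes.
    """
    recent = samples[-sustained:]
    return len(recent) == sustained and all(v > threshold for v in recent)
```

With sustained=3, a single 90% CPU blip in an otherwise healthy series stays quiet, while three breaching samples in a row still page you.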

Scenario 2: Recurring Alert#

Problem: Same alert keeps firing over and over

Solution:

  1. First time: Acknowledge and investigate
  2. Second time: Find root cause
  3. Implement permanent fix
  4. Adjust alert threshold if needed

Scenario 3: Cascading Alerts#

Problem: One issue causes multiple alerts to fire

Example:

Database goes down
→ API Alert fires
→ Website Alert fires
→ Scheduled Job Alert fires

Solution:

  1. Fix the root cause (database)
  2. All related alerts auto-resolve
  3. Document it for future reference
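With a cascade, a dependency map can point you at the likely root cause: the firing resource that has no firing dependency of its own. An illustrative heuristic (the resource names and dependency map below are made up):

```python
def likely_root_causes(firing, depends_on):
    """Return firing resources none of whose dependencies are also firing --
    a heuristic for where to start looking in a cascade."""
    firing = set(firing)
    return {
        r for r in firing
        if not any(dep in firing for dep in depends_on.get(r, ()))
    }

# Hypothetical topology matching the example above:
deps = {
    "api": ["database"],
    "website": ["api"],
    "scheduled-job": ["database"],
}
print(likely_root_causes(["api", "website", "scheduled-job", "database"], deps))
# Everything else depends (directly or indirectly) on the database,
# so the database is the place to start.
```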

Scenario 4: Alert During Maintenance#

Problem: Alert fires while you're doing planned maintenance

Solution:

  1. Confirm the alert matches your planned maintenance
  2. Acknowledge it anyway
  3. Note in the alert: "Expected, maintenance in progress"
  4. Resolve when maintenance is complete

Best Practices for Responding#

1. Respond Quickly#

  • Acknowledge within 5 minutes
  • Team should see it's being handled
  • Time is critical for critical alerts

2. Acknowledge Immediately#

  • Don't wait until you have the solution
  • Let team know you're investigating
  • Prevents duplicate work

3. Keep Team Informed#

  • Post updates in Slack
  • Let others know what you found
  • Ask for help if needed

4. Verify Before Resolving#

  • Don't resolve until truly fixed
  • Verify the metric is back to normal
  • Check downstream systems

5. Document What Happened#

  • Write it down for future reference
  • Include root cause
  • Note how you fixed it

6. Learn from It#

  • Why did it happen?
  • How can we prevent it next time?
  • Do we need to adjust alert thresholds?
  • Do we need better monitoring?

Quick Response Checklist#

  • Alert notification received
  • Opened SRE Alerts page
  • Found the firing alert
  • Clicked Acknowledge
  • Reviewed alert details
  • Investigated the issue
  • Found the root cause
  • Implemented a fix
  • Verified the fix works
  • Metric back to normal
  • Clicked Resolve
  • Posted update in Slack
  • Documented for future reference

Getting Help During an Alert#

Need to ask for help?

  • Post in Slack channel
  • @mention the relevant team
  • Include alert details
  • Ask specific questions

Example:

@backend-team: Production API CPU alert
Currently at 87%, investigating.
Can someone check database performance?
Last deployment was 2 hours ago, could be related.

Next Steps#

  1. Best Practices - Learn more advanced strategies
  2. Alert Management - Manage your rules
  3. Monitoring Guide - Overall monitoring strategy

Quick Links#

Need | Location
View alerts | SRE → Alerts
Create rule | Alerts → Alert Rules
Configure notifications | Alerts → Alert Config
Help | Click ? icon

Contact Support#

For issues with responding to alerts: