Responding to Alerts - SRE Guide | Nife Deploy

When an alert fires, you need to respond quickly. This guide shows you how to handle active alerts.

Getting an Alert#

What You'll See#

When an alert fires, you'll be notified through your configured channels:

If Email:

Subject: CRITICAL: Production API Down
Body: Your production API is not responding
Severity: Critical
Time: 2024-01-15 14:32:00 UTC

If Slack:

🚨 CRITICAL: Production API Down
Status: Firing
Resource: api.example.com
Click here to view details

If PagerDuty:

  • You'll get paged immediately
  • Incident automatically created
  • Escalates if not acknowledged

Quick Response Workflow#

1. Get Notification
↓
2. Open SRE Alerts page
↓
3. Click "Acknowledge"
↓
4. Investigate the issue
↓
5. Fix the problem
↓
6. Click "Resolve"
↓
7. Confirm everything is working
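The workflow above can be modeled as a tiny state machine. This is an illustrative sketch of the documented transitions, not Nife Deploy's actual implementation:

```python
# Illustrative model of the alert lifecycle: Firing -> Acknowledged -> Resolved.
# The allowed transitions mirror the documented workflow, nothing more.

VALID_TRANSITIONS = {
    "firing": {"acknowledged"},
    "acknowledged": {"resolved"},
    "resolved": set(),  # terminal: kept as a historical record
}

class Alert:
    def __init__(self, title):
        self.title = title
        self.status = "firing"

    def transition(self, new_status):
        if new_status not in VALID_TRANSITIONS[self.status]:
            raise ValueError(f"cannot go from {self.status} to {new_status}")
        self.status = new_status

alert = Alert("Production API Down")
alert.transition("acknowledged")  # step 3: you're investigating
alert.transition("resolved")      # step 6: issue fixed and verified
```

Enforcing acknowledge-before-resolve in this model matches the guide's point that the team should always know someone is investigating before an alert is closed.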

Step 1: Access the Alert#

Via Notification Link#

Most notifications include a direct link:

  • Click the link
  • Goes straight to the alert details

Via Dashboard#

  1. Go to SRE → Alerts
  2. Look for the alert with Firing status (red badge)
  3. Usually at the top of the list

Find Specific Alert#

Use filters to find your alert quickly:

Status: Firing (most urgent)
Severity: Critical (highest priority)

Step 2: Review Alert Details#

When you open the alert, you'll see:

Alert Information:

  • 🔔 Alert status (Firing/Acknowledged/Resolved)
  • Alert title and description
  • Resource affected
  • Severity level
  • When it fired (e.g., "5 minutes ago")

Example Alert:

Status: Firing 🔔
Title: High CPU Usage on Production API
Severity: Critical
Resource: prod-api-server-01
Fired: 3 minutes ago
Description: CPU usage exceeded 80% threshold
Current value: 87%

Step 3: Acknowledge the Alert#

Why Acknowledge?#

Tells your team:

  • You've seen the alert
  • You're investigating it
  • Others don't need to also respond

How to Acknowledge#

  1. Click the Acknowledge button
  2. Confirm in the popup dialog
  3. Alert status changes from "Firing" to "Acknowledged"
  4. Your name appears as the investigator

What Happens:#

Before: Status = Firing (everyone should look at it)
↓
You click Acknowledge
↓
After: Status = Acknowledged (I'm looking at it)

When to Acknowledge:#

  • ✅ Immediately when you start investigating
  • ✅ Even if you can't fix it right away
  • ✅ So the team knows you're on it

When NOT to Acknowledge:#

  • โŒ If you don't actually know what's happening
  • โŒ If someone else should handle it
  • โŒ If you can't take action

Step 4: Investigate the Issue#

What to Do#

  1. Understand the Alert

    • What is it monitoring?
    • What threshold triggered it?
    • What's the current value?
  2. Check the Resource

    • Log in to the system
    • View metrics/logs for that resource
    • Check for errors or anomalies
  3. Identify the Problem

    • What's causing the issue?
    • When did it start?
    • Is it affecting users?
  4. Document Your Findings

    • Write down what you found
    • Note what you're trying
    • Keep team informed in Slack if needed

Investigation Tips#

If CPU is High:

  • Check what process is using CPU
  • Look for runaway queries or loops
  • Check if traffic spike occurred
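Alongside tools like top and ps, a quick scripted read on CPU pressure is to compare the 1-minute load average to the core count. A stdlib-only sketch (Unix-only; the "above 1.0 means saturated" rule of thumb is a general heuristic, not a product threshold):

```python
import os

def cpu_pressure():
    """Return 1-minute load average per core (Unix only).

    A value much above 1.0 roughly means more runnable work
    than the CPUs can keep up with.
    """
    load1, _, _ = os.getloadavg()
    cores = os.cpu_count() or 1
    return load1 / cores

print(f"load per core: {cpu_pressure():.2f}")
```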

If API is Slow:

  • Check database performance
  • Review error logs
  • Check if upstream service is down
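One way to judge whether latency is genuinely elevated, rather than one unlucky sample, is to look at a high percentile of recent request timings. A minimal nearest-rank p95, assuming you have already collected timings in milliseconds from logs or a probe:

```python
import math

def p95(samples_ms):
    """95th-percentile latency (nearest-rank method) from samples in ms."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    # nearest-rank: the ceil(0.95 * n)-th smallest value (1-indexed)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]
```

Comparing p95 before and after a suspect deployment is usually more telling than eyeballing individual slow requests.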

If Memory is High:

  • Look for memory leaks
  • Check if cache is bloated
  • Verify application version
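On Linux you can get a fast used-memory percentage straight from /proc/meminfo. A sketch that parses meminfo-style text (the field names are the standard kernel ones):

```python
def mem_used_percent(meminfo_text):
    """Percent of memory in use, parsed from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if rest:
            fields[key.strip()] = int(rest.split()[0])  # value in kB
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total

# On a Linux host:
#   with open("/proc/meminfo") as f:
#       print(mem_used_percent(f.read()))
```

MemAvailable (rather than MemFree) is the right field here because it already accounts for reclaimable cache, so a "bloated cache" won't look like a leak.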

If Service is Down:

  • Check if it's running
  • Look at recent deployments
  • Check network connectivity
  • Review error logs
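A basic reachability probe helps separate "process is down" from "network path is broken". A stdlib-only TCP check, assuming you know the host and port the service should listen on:

```python
import socket

def tcp_check(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If the port answers but the application still fails, the service process is up and the problem is more likely in the application layer or a dependency.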

Step 5: Fix the Problem#

Take Action#

Based on your investigation, take appropriate action:

Common Fixes:

  • Restart the service
  • Scale up the application
  • Clear cache
  • Kill runaway process
  • Deploy a fix
  • Adjust configuration
  • Route traffic elsewhere

Verify the Fix#

After taking action, verify it worked:

  • Check the metric that triggered the alert
  • Confirm it's back to normal
  • Test the functionality
  • Ask affected users to confirm things are working

Before Resolving: Make absolutely sure the issue is fixed. Don't resolve it only to have it fire again five minutes later.
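One way to honor that rule is to require several consecutive healthy readings before declaring recovery. A sketch, where read_metric is whatever function you use to fetch the current value:

```python
import time

def confirm_recovery(read_metric, threshold, checks=5, interval_s=0.0):
    """Return True only if the metric stays below `threshold` for
    `checks` consecutive reads -- guards against resolving too early."""
    for _ in range(checks):
        if read_metric() >= threshold:
            return False
        if interval_s:
            time.sleep(interval_s)
    return True

# Example (get_cpu_percent is hypothetical -- plug in your metric source):
#   confirm_recovery(get_cpu_percent, threshold=80, checks=5, interval_s=60)
```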


Step 6: Resolve the Alert#

How to Resolve#

  1. Click the Resolve button
  2. Confirm in the popup dialog
  3. Alert status changes to "Resolved" (✓)
  4. Your name appears as who resolved it

What This Means:#

Status: Resolved ✓ = Issue is fixed, no more action needed

When to Resolve:#

  • ✅ After you've fixed the underlying issue
  • ✅ After you've verified the fix works
  • ✅ When the metric is back to normal

Don't Resolve If:#

  • ❌ The issue isn't completely fixed
  • ❌ You haven't verified the fix works
  • ❌ The metric isn't back to an acceptable level

Step 7: Verify and Document#

Final Verification#

Check that:

  • The alert status changed to "Resolved"
  • The metric is back to normal
  • No related alerts are firing
  • Users aren't reporting issues
  • Team is aware it's resolved

Document for the Team#

Post an update in Slack:

✅ RESOLVED: Production API High CPU
Issue: Memory leak in v2.1.0
Fix: Rolled back to v2.0.5
Status: All systems normal, no user impact
ETA for permanent fix: Tuesday

Alert Statuses Reference#

Firing Status 🔔 (Red)#

What it means: Alert condition is currently true. Action is needed.

What you should do:

  1. Acknowledge it
  2. Investigate
  3. Fix it
  4. Resolve it

Duration: Until you resolve it

Acknowledged Status ⏱️ (Yellow)#

What it means: Someone is already investigating. No need for others to duplicate work.

What it shows:

  • Shows who acknowledged it
  • Shows when they acknowledged it

Next step: Resolve once the issue is fixed

Duration: While being investigated

Resolved Status ✓ (Green)#

What it means: The issue has been fixed. Alert is closed.

What it shows:

  • Shows who resolved it
  • Shows when it was resolved

Historical: Kept for records and trend analysis

Duration: Forever (historical record)


Filtering to Find Alerts#

Filter by Status#

Firing:

  • Currently active alerts
  • Need attention NOW
  • Most urgent

Acknowledged:

  • Someone is investigating
  • Not yet resolved
  • Don't need to duplicate work

Resolved:

  • Historical view
  • Issue is fixed
  • Use for trend analysis

Filter by Severity#

Critical:

  • Immediate action needed
  • System down or critical function broken

High:

  • Needs urgent attention
  • User-impacting

Medium:

  • Should be addressed
  • Non-critical issues

Low:

  • Nice to know
  • Can handle when you have time

Team Coordination#

When Multiple People Need to Help#

  1. First person: Acknowledge the alert
  2. Post in Slack: "I'm on the X alert, currently investigating"
  3. Assign tasks: "Can someone check the database?"
  4. Coordinate: "Try restarting service on server-02"
  5. Report: "Found the issue, deploying fix now"
  6. Resolve: Once fixed

Escalation if Stuck#

If you can't resolve it:

  1. Post in Slack asking for help
  2. Escalate to team lead if urgent
  3. If critical, create incident ticket
  4. Keep the alert Acknowledged so others know it's being worked on

Common Alert Scenarios#

Scenario 1: False Alarm#

Problem: Alert fired but nothing is actually wrong

Solution:

  1. Acknowledge it
  2. Verify the metric
  3. Resolve if confirmed to be false
  4. Later: Adjust the alert threshold to prevent false alarms
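When you later adjust the threshold (step 4), one common technique is requiring the breach to persist across several consecutive samples before firing, so brief spikes don't page anyone. An illustrative sketch, not a Nife Deploy setting:

```python
def should_fire(samples, threshold, sustained=3):
    """Fire only if the last `sustained` samples all breach the threshold.

    A common way to cut false alarms caused by momentary spikes.
    """
    recent = samples[-sustained:]
    return len(recent) == sustained and all(v > threshold for v in recent)
```

With sustained=3, a single 90% CPU blip in an otherwise healthy series stays quiet, while three breaching samples in a row still page you.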

Scenario 2: Recurring Alert#

Problem: Same alert keeps firing over and over

Solution:

  1. First time: Acknowledge and investigate
  2. Second time: Find root cause
  3. Implement permanent fix
  4. Adjust alert threshold if needed

Scenario 3: Cascading Alerts#

Problem: One issue causes multiple alerts to fire

Example:

Database goes down
→ API Alert fires
→ Website Alert fires
→ Scheduled Job Alert fires

Solution:

  1. Fix the root cause (database)
  2. All related alerts auto-resolve
  3. Document it for future reference
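With a cascade, a dependency map can point you at the likely root cause: the firing resource that has no firing dependency of its own. An illustrative heuristic (the resource names and dependency map below are made up):

```python
def likely_root_causes(firing, depends_on):
    """Return firing resources none of whose dependencies are also firing --
    a heuristic for where to start looking in a cascade."""
    firing = set(firing)
    return {
        r for r in firing
        if not any(dep in firing for dep in depends_on.get(r, ()))
    }

# Hypothetical topology matching the example above:
deps = {
    "api": ["database"],
    "website": ["api"],
    "scheduled-job": ["database"],
}
print(likely_root_causes(["api", "website", "scheduled-job", "database"], deps))
# Everything else depends (directly or indirectly) on the database,
# so the database is the place to start.
```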

Scenario 4: Alert During Maintenance#

Problem: Alert fires while you're doing planned maintenance

Solution:

  1. Confirm the alert matches your planned maintenance
  2. Acknowledge it anyway
  3. Note in the alert: "Expected, maintenance in progress"
  4. Resolve when maintenance is complete

Best Practices for Responding#

1. Respond Quickly#

  • Acknowledge within 5 minutes
  • Team should see it's being handled
  • Time is critical for critical alerts

2. Acknowledge Immediately#

  • Don't wait until you have the solution
  • Let team know you're investigating
  • Prevents duplicate work

3. Keep Team Informed#

  • Post updates in Slack
  • Let others know what you found
  • Ask for help if needed

4. Verify Before Resolving#

  • Don't resolve until truly fixed
  • Verify the metric is back to normal
  • Check downstream systems

5. Document What Happened#

  • Write it down for future reference
  • Include root cause
  • Note how you fixed it

6. Learn from It#

  • Why did it happen?
  • How can we prevent it next time?
  • Do we need to adjust alert thresholds?
  • Do we need better monitoring?

Quick Response Checklist#

  • Alert notification received
  • Opened SRE Alerts page
  • Found the firing alert
  • Clicked Acknowledge
  • Reviewed alert details
  • Investigated the issue
  • Found the root cause
  • Implemented a fix
  • Verified the fix works
  • Metric back to normal
  • Clicked Resolve
  • Posted update in Slack
  • Documented for future reference

Getting Help During an Alert#

Need to ask for help?

  • Post in Slack channel
  • @mention the relevant team
  • Include alert details
  • Ask specific questions

Example:

@backend-team: Production API CPU alert
Currently at 87%, investigating.
Can someone check database performance?
Last deployment was 2 hours ago, could be related.

Next Steps#

  1. Best Practices - Learn more advanced strategies
  2. Alert Management - Manage your rules
  3. Monitoring Guide - Overall monitoring strategy

Quick Links#

Need | Location
View alerts | SRE → Alerts
Create rule | Alerts → Alert Rules
Configure notifications | Alerts → Alert Config
Help | Click ? icon

Contact Support#

For issues with responding to alerts: