Creating Alert Rules - Step-by-Step Guide | Nife Deploy

Alert rules are the foundation of your monitoring system. A rule defines WHEN an alert should fire.

What is an Alert Rule?

An alert rule is a condition you set up that automatically watches your system. When the condition becomes true, an alert fires.

Simple Example:

Rule: "Alert me if CPU usage reaches 80%"
When: CPU usage is at 80% or higher
Then: Fire an alert and notify me
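In Nife Deploy you create rules through the Alerts UI. Still, the idea behind a rule is small enough to sketch in code: a metric, a comparison, and a threshold. The Python below is a minimal illustration under that assumption. The `AlertRule` class, its fields, and the metric name `cpu_percent` are hypothetical and are not part of Nife Deploy.

```python
import operator
from dataclasses import dataclass

# Comparison operators a threshold condition might use (illustrative set).
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}


@dataclass
class AlertRule:
    """Hypothetical rule model: fires when `value <op> threshold` is true."""
    name: str
    metric: str
    op: str
    threshold: float
    enabled: bool = True

    def should_fire(self, value: float) -> bool:
        # A disabled rule never fires, whatever the current value is.
        if not self.enabled:
            return False
        return OPERATORS[self.op](value, self.threshold)


# Fires at 80% or higher, as in the "When" line of the example.
cpu_rule = AlertRule(name="Server CPU Too High", metric="cpu_percent", op=">=", threshold=80)
print(cpu_rule.should_fire(79.5))  # False: still below 80%
print(cpu_rule.should_fire(80.0))  # True: the condition is met, so an alert fires
```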

Step-by-Step: Create Your First Alert Rule

Step 1: Navigate to Alert Rules

  1. Go to Alerts in the main navigation menu
  2. Click the Alert Rules tab
  3. Click the New Rule button

Step 2: Choose What to Monitor

Select the metric or condition you want to watch.

Infrastructure Metrics:

  • CPU usage (%)
  • Memory usage (%)
  • Disk space remaining (%)
  • Network traffic (bytes/sec)

Application Metrics:

  • Error rate (%)
  • Response time (ms)
  • Request count (per second)
  • Success rate (%)

Service Status:

  • Service is up/down
  • Endpoint responding (yes/no)
  • Health check status

Step 3: Set the Threshold

Define the exact value that triggers the alert:

Examples:

  • CPU usage > 80%
  • Memory > 90%
  • Response time > 5000 ms (5 seconds)
  • Error rate > 5%
  • Disk space < 10% remaining
  • Failed requests ≥ 10

Tips for setting good thresholds:

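  • Start with a conservative value that only fires for clear problems (for "above" conditions, a higher threshold means fewer alerts; for "below" conditions such as disk space, a lower one does)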
  • You can always adjust it later
  • Consider your system's normal patterns
  • Think about what value actually needs action
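One way to act on "consider your system's normal patterns" is to look at recent history before picking a number. The sketch below shows one possible heuristic, not a Nife Deploy feature: take a high percentile of recent samples (your normal peak) and add headroom. The `suggest_threshold` function, the 95th percentile, and the 10-point margin are all assumptions to tune for your own system.

```python
from statistics import quantiles


def suggest_threshold(samples: list[float], percentile: int = 95,
                      margin: float = 10.0, cap: float = 100.0) -> float:
    """Suggest an upper threshold for a percentage metric such as CPU usage.

    Hypothetical heuristic: take the given percentile of recent samples
    (the normal peak) and add a safety margin so routine peaks stay quiet.
    The result is capped at `cap` (100 for a percentage).
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    # With n=100, quantiles() returns the 99 cut points for percentiles 1..99;
    # method="inclusive" keeps the estimate within the observed min and max.
    cut_points = quantiles(samples, n=100, method="inclusive")
    return min(cut_points[percentile - 1] + margin, cap)


# Stand-in data: 41 readings spread evenly from 0% to 40%.
recent_cpu = [float(x) for x in range(41)]
print(suggest_threshold(recent_cpu))  # 48.0 (95th percentile 38.0 + 10)
```

For metrics where low values are the problem, such as free disk space, you would mirror this: take a low percentile and subtract the margin.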

Step 4: Name Your Rule

Give it a clear, specific name that describes the problem:

Good Names:

  • "Production API - Response Time Over 5 Seconds"
  • "Database Server - Memory Usage High"
  • "Web Application - High Error Rate"
  • "Backup Job - Failed"

Bad Names:

  • "Alert 1"
  • "CPU"
  • "Problem"

Pro Tip: Name it so someone new to your team understands immediately what it monitors.
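If your team manages many rules, a small check in your own review scripts can keep names consistent with the "<What> - <Problem>" shape that the good examples share. This is a hypothetical helper: Nife Deploy does not require this format, and the pattern and minimum word count are assumptions.

```python
import re

# Hypothetical convention based on the good examples: "<What> - <Problem>",
# e.g. "Database Server - Memory Usage High".
NAME_PATTERN = re.compile(r"^\S.*\S - \S.*\S$")


def is_descriptive_name(name: str, min_words: int = 3) -> bool:
    """Return True if `name` has a "<What> - <Problem>" shape and enough words."""
    # Count real words only; the " - " separator is not a word.
    words = [w for w in name.split() if w != "-"]
    return bool(NAME_PATTERN.match(name)) and len(words) >= min_words


for name in ["Production API - Response Time Over 5 Seconds", "Backup Job - Failed",
             "Alert 1", "CPU", "Problem"]:
    print(f"{name!r}: {is_descriptive_name(name)}")
```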

Step 5: Choose Severity Level

Select how serious this problem is:

Critical 🔴

  • Use when: The system is down or a critical function is broken
  • Examples: App unreachable, database offline, data loss risk
  • Expected response: Immediate (within minutes)

Warning 🟠

  • Use when: A problem is affecting users or a metric is approaching a critical level
  • Examples: High response times, approaching resource limits, elevated error rate
  • Expected response: Soon (within 1 hour)

Info 🟡

  • Use when: Informational only, no urgent action needed
  • Examples: Deployment completed, scheduled maintenance finished
  • Expected response: As time allows (no rush)
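If you track or route alerts in your own scripts, the three levels map naturally onto an ordered enum. The sketch below is a hypothetical encoding of the guidance for each level (minutes for Critical, within an hour for Warning, no deadline for Info), not values exposed by Nife Deploy; reading "within minutes" as 15 minutes is an assumption.

```python
from datetime import timedelta
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    """Hypothetical encoding of the three levels; higher value = more urgent."""
    INFO = 1
    WARNING = 2
    CRITICAL = 3


# Target time to start responding. 15 minutes for Critical is an assumption.
RESPONSE_TARGET: dict[Severity, Optional[timedelta]] = {
    Severity.CRITICAL: timedelta(minutes=15),
    Severity.WARNING: timedelta(hours=1),
    Severity.INFO: None,  # no deadline: handle as time allows
}


def is_overdue(severity: Severity, waiting: timedelta) -> bool:
    """True if an alert has waited longer than its severity's response target."""
    target = RESPONSE_TARGET[severity]
    return target is not None and waiting > target


print(is_overdue(Severity.WARNING, timedelta(minutes=90)))  # True
print(is_overdue(Severity.INFO, timedelta(days=3)))         # False
```

Using `IntEnum` keeps the levels comparable, so sorting open alerts by severity puts Critical first when you sort in descending order.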

Step 6: Add Description (Optional but Recommended)

Explain what this alert means and what to do about it:

"Fires when API response time exceeds 5 seconds.
Usually caused by:
- Database slowness
- High traffic
To fix:
1. Check database performance
2. Review application logs
3. Scale if needed"

Step 7: Save and Enable

  1. Click Save Rule
  2. The rule is created and appears in your rules list
  3. Toggle the Enabled switch to ON
  4. The rule is now monitoring your system

Done! Your first alert rule is now active.


Common Alert Rule Examples

Example 1: Website Down

What to Monitor: HTTP Status Code
Condition: = 503 (Service Unavailable) or = 0 (no response received)
Severity: Critical
Name: "Website Is Down"
Description: "Service is not responding to requests"

Example 2: High CPU Usage

What to Monitor: CPU Usage
Condition: > 80%
Severity: Warning
Name: "Server CPU Too High"
Description: "CPU usage is above normal; performance may be degrading"

Example 3: Low Disk Space

What to Monitor: Disk Space Free
Condition: < 10%
Severity: Critical
Name: "Disk Space Running Out"
Description: "Less than 10% disk space remaining"

Example 4: Slow API Response

What to Monitor: API Response Time
Condition: > 5000 ms
Severity: Warning
Name: "API Response Time High"
Description: "API is responding slower than acceptable"

Example 5: High Error Rate

What to Monitor: Error Rate
Condition: > 5%
Severity: Warning
Name: "Application Error Rate High"
Description: "More than 5% of requests are failing"
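The five examples can also be written down as data, which makes it easy to check one snapshot of readings against all of them at once. This is an illustrative sketch: the metric keys such as `cpu_percent` and the sample readings are made up and are not Nife Deploy metric names.

```python
import operator

# (name, metric key, comparison, threshold, severity) for Examples 1-5.
# Metric keys are hypothetical placeholders.
RULES = [
    ("Website Is Down", "http_status", "in", {503, 0}, "Critical"),
    ("Server CPU Too High", "cpu_percent", ">", 80, "Warning"),
    ("Disk Space Running Out", "disk_free_percent", "<", 10, "Critical"),
    ("API Response Time High", "api_response_ms", ">", 5000, "Warning"),
    ("Application Error Rate High", "error_rate_percent", ">", 5, "Warning"),
]

COMPARE = {
    ">": operator.gt,
    "<": operator.lt,
    "in": lambda value, allowed: value in allowed,
}


def firing_rules(readings: dict) -> list[tuple[str, str]]:
    """Return (severity, name) for every example rule whose condition holds.
    Metrics missing from `readings` are skipped."""
    fired = []
    for name, metric, op, threshold, severity in RULES:
        if metric in readings and COMPARE[op](readings[metric], threshold):
            fired.append((severity, name))
    return fired


snapshot = {
    "http_status": 200,
    "cpu_percent": 91.0,
    "disk_free_percent": 35.0,
    "api_response_ms": 6200,
    "error_rate_percent": 1.2,
}
# Prints the two Warning rules: CPU (91 > 80) and API response time (6200 > 5000).
for severity, name in firing_rules(snapshot):
    print(f"[{severity}] {name}")
```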

Managing Your Rules

View All Rules

Go to Alerts → Alert Rules tab to see all your rules:

  • Rule name
  • Whether it's enabled
  • Current status
  • Last triggered time

Enable/Disable a Rule

When to disable:

  • Rule is creating too many false alerts
  • You're doing maintenance
  • Rule is outdated

How:

  1. Find the rule in the list
  2. Toggle the Enabled switch
  3. OFF = disabled, ON = enabled

Edit a Rule

Change a rule's condition, threshold, or name:

  1. Find the rule
  2. Click Edit or the rule name
  3. Change the settings
  4. Click Save

The updated rule applies immediately.

Delete a Rule

If you no longer need a rule:

  1. Find the rule
  2. Click Delete
  3. Confirm deletion

Note: Deleted rules cannot be recovered.


Best Practices for Alert Rules

1. Start Simple

  • Begin with 2-3 critical rules
  • Add more as you get comfortable
  • Don't try to monitor everything at once

2. Use Clear Names

  • Someone new should understand immediately
  • Include the metric and what triggers it
  • Use consistent naming across your rules

3. Set Realistic Thresholds

  • Base them on your normal operating levels
  • Not so sensitive that you get alert fatigue (for example, 100+ alerts per day)
  • Not so loose that you miss real problems

4. Document Your Rules

  • Explain what the rule monitors
  • Note what typically causes it
  • Include steps to fix the issue

5. Review Regularly

  • Check which rules fire most often
  • Identify false alarms
  • Adjust thresholds based on patterns
  • Remove rules that aren't useful
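Reviewing which rules fire most often is easier with numbers. Assuming you can export alert history as (rule name, was it a real issue?) pairs, for example from your own incident notes, the sketch below counts firings per rule and the share that turned out to be false alarms. The export format, the 50% cutoff, and the five-firing minimum are assumptions, not Nife Deploy features.

```python
from collections import Counter


def noisy_rules(history: list[tuple[str, bool]], max_false_share: float = 0.5,
                min_firings: int = 5) -> list[tuple[str, int, float]]:
    """Flag rules that fire often but are mostly false alarms.

    history: one (rule_name, was_real_issue) pair per fired alert.
    Returns (rule_name, firings, false_alarm_share), most frequent first.
    """
    fired = Counter(name for name, _ in history)
    false_alarms = Counter(name for name, real in history if not real)
    flagged = []
    for name, count in fired.most_common():
        share = false_alarms[name] / count
        if count >= min_firings and share > max_false_share:
            flagged.append((name, count, share))
    return flagged


# 20 CPU alerts, only 2 of them real; 3 genuine outage alerts.
history = [("Server CPU Too High", False)] * 18 + [("Server CPU Too High", True)] * 2
history += [("Website Is Down", True)] * 3
print(noisy_rules(history))  # [('Server CPU Too High', 20, 0.9)]
```

A flagged rule is a candidate for a higher threshold or for removal; rules that fire rarely are left out so a single false alarm does not dominate the review.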

6. Test Before Relying on Rules

  • Create a test rule with a threshold your system already crosses (for example, CPU > 1%)
  • Verify it fires when it should
  • Check that you receive notifications
  • Delete the test rule when done

Adjusting Your Rules

Rule Triggers Too Often

Problem: You're getting 20 alerts per day but only 2 are real issues.

Solution - Increase Threshold:

Original:

CPU > 70%

Updated:

CPU > 85%

Now it alerts only when CPU usage is high enough to need action.

Rule Never Triggers

Problem: You never get the alert, even when the problem exists.

Solution - Lower Threshold:

Original:

CPU > 95%

Updated:

CPU > 80%

Now it catches the problem earlier.
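Before changing a threshold in either direction, it can help to replay recent data and count how often each candidate value would have fired. The sketch assumes you have a list of recent CPU samples exported from your monitoring; the sample values here are invented. It treats consecutive samples above the threshold as one ongoing alert, so it counts only the moments the condition starts being true.

```python
def count_firings(samples: list[float], threshold: float) -> int:
    """Count how many times `value > threshold` would start being true.

    Consecutive samples above the threshold are one ongoing alert, so only
    transitions from "not firing" to "firing" are counted.
    """
    firings = 0
    above = False
    for value in samples:
        now_above = value > threshold
        if now_above and not above:
            firings += 1
        above = now_above
    return firings


# Invented CPU readings (%) with one real spike in the middle.
cpu_samples = [55, 72, 68, 74, 71, 60, 88, 91, 86, 62, 73, 58]
for threshold in (70, 85, 95):
    print(threshold, count_firings(cpu_samples, threshold))
# 70 4
# 85 1
# 95 0
```

With these made-up samples, 70% fires four times for routine bumps, 85% fires once for the real spike, and 95% misses it entirely, which mirrors the two problems described in this section.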

Rule No Longer Needed

Solution:

  • Disable: Keeps it but turns it off (easy to re-enable)
  • Delete: Removes it completely (good if truly not needed)

When to Create an Alert Rule

Create a rule for:

  • ✅ Production issues (outages, errors)
  • ✅ Performance degradation (slowness)
  • ✅ Resource exhaustion (disk full, memory high)
  • ✅ Security events
  • ✅ Important business metrics

Don't create a rule for:

  • ❌ Normal/expected variations
  • ❌ Things you can't act on
  • ❌ Low priority "nice to know" info
  • ❌ Duplicate alerts for the same issue

Alert Rule Templates

Use these templates to create rules for common scenarios:

API/Web Application:

  • Response Time High (> 5 seconds)
  • Error Rate High (> 5%)
  • Service Down (status code 503)

Database:

  • CPU Usage High (> 80%)
  • Memory Usage High (> 90%)
  • Connection Pool Exhausted

Infrastructure:

  • Disk Space Low (< 10%)
  • Network Latency High (> 100 ms)
  • Server Unreachable

Scheduled Jobs:

  • Job Failed (last run did not succeed)
  • Job Took Too Long (runtime exceeds the expected duration)
  • Job Didn't Run
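The three scheduled-job templates check different things: the outcome of the last run, its duration, and whether it ran at all. The sketch below shows that logic for a job you expect to run on a fixed interval. The `JobRun` record, the `job_alerts` function, and the 10-minute grace period are hypothetical; adapt them to whatever run history your jobs report.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class JobRun:
    """Hypothetical record of one job execution."""
    started: datetime
    duration: timedelta
    succeeded: bool


def job_alerts(last_run: Optional[JobRun], now: datetime, interval: timedelta,
               expected_duration: timedelta,
               grace: timedelta = timedelta(minutes=10)) -> list[str]:
    """Return the names of the scheduled-job templates that would fire."""
    # Job Didn't Run: no run recorded, or the last one started too long ago.
    # A stale run's old result is not reported again.
    if last_run is None or now - last_run.started > interval + grace:
        return ["Job Didn't Run"]
    alerts = []
    if not last_run.succeeded:
        alerts.append("Job Failed")
    if last_run.duration > expected_duration:
        alerts.append("Job Took Too Long")
    return alerts


now = datetime(2024, 1, 2, 3, 0)
run = JobRun(started=datetime(2024, 1, 2, 2, 0), duration=timedelta(minutes=50), succeeded=False)
print(job_alerts(run, now, timedelta(days=1), expected_duration=timedelta(minutes=30)))
# ['Job Failed', 'Job Took Too Long']
```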

Next Steps

Now that you've created alert rules:

  1. Configure Notifications - Set up how you're notified
  2. Respond to Alerts - Handle active alerts
  3. Alert Management - More advanced topics

Getting Help

Questions about creating rules?

  • Check the examples above
  • Click the ? icon on the Alerts page
  • Contact support: [email protected]

Rule isn't working?

  • Make sure it's Enabled
  • Check that the threshold is set correctly
  • Verify the metric you're monitoring
  • Test with a temporary rule first