Creating Alert Rules
Alert rules are the foundation of your monitoring system. A rule defines when an alert should fire.
What is an Alert Rule?
An alert rule is a condition you set up that automatically watches your system. When the condition becomes true, an alert fires.
Simple Example:
Rule: "Alert me if CPU usage goes above 80%"
When: CPU usage rises above 80%
Then: Fire an alert and notify me
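To make the idea concrete, a rule can be thought of as a metric, a comparison, and a threshold. The Python sketch below only illustrates that logic; it is not the product's API, and `AlertRule`, `should_fire`, and the metric key `cpu_usage_percent` are hypothetical names chosen for the example.

```python
import operator
from dataclasses import dataclass

# Comparison symbols a rule may use, mapped to Python's comparison functions.
COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical model of a rule: a metric, a comparison, and a threshold."""

    name: str
    metric: str
    comparator: str
    threshold: float

    def should_fire(self, value: float) -> bool:
        # The alert fires only when the condition evaluates to true.
        return COMPARATORS[self.comparator](value, self.threshold)


cpu_rule = AlertRule("Server CPU Too High", "cpu_usage_percent", ">", 80)
print(cpu_rule.should_fire(75))  # False
print(cpu_rule.should_fire(92))  # True
```

The same shape covers "below" conditions such as low disk space by switching the comparator to `<`.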
Step-by-Step: Create Your First Alert Rule
Step 1: Navigate to Alert Rules
- Go to Alerts in the main navigation menu
- Click the Alert Rules tab
- Click the New Rule button
Step 2: Choose What to Monitor
Select the metric or condition you want to watch.
Infrastructure Metrics:
- CPU usage (%)
- Memory usage (%)
- Disk space remaining (%)
- Network traffic (bytes/sec)
Application Metrics:
- Error rate (%)
- Response time (ms)
- Request count (per second)
- Success rate (%)
Service Status:
- Service is up/down
- Endpoint responding (yes/no)
- Health check status
Step 3: Set the Threshold
Define the exact value that triggers the alert:
Examples:
- CPU usage > 80%
- Memory > 90%
- Response time > 5000 ms (5 seconds)
- Error rate > 5%
- Disk space < 10% remaining
- Failed requests ≥ 10
Tips for setting good thresholds:
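- Start with a conservative value that produces fewer alerts (a higher threshold for "above" conditions such as CPU > 80%, a lower one for "below" conditions such as disk space < 10%)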
- You can always adjust it later
- Consider your system's normal patterns
- Pick the value at which someone actually needs to take action
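One way to ground a threshold in your system's normal patterns is to look at past readings and place the threshold just above what is typical. The sketch below is an assumption about how you might do that outside the product: the function name `suggest_threshold`, the choice of the 95th percentile, and the 10% headroom are illustrative, and the readings are made up.

```python
import statistics


def suggest_threshold(samples, headroom=1.10):
    """Suggest a '>' threshold slightly above the 95th percentile of normal readings."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(samples, n=20)[-1]
    # Add headroom so ordinary peaks do not trigger the alert.
    return round(p95 * headroom, 1)


# Hypothetical CPU readings (%) collected during a normal week.
normal_cpu = [35, 42, 38, 55, 61, 47, 52, 58, 44, 49,
              63, 40, 57, 51, 46, 60, 39, 54, 48, 66]

print(f"Suggested rule: CPU usage > {suggest_threshold(normal_cpu)}%")
```

Treat the result as a starting point and adjust it once you see how often the rule fires.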
Step 4: Name Your Rule
Give it a clear, specific name that describes the problem:
Good Names:
- "Production API - Response Time Over 5 Seconds"
- "Database Server - Memory Usage High"
- "Web Application - High Error Rate"
- "Backup Job - Failed"
Bad Names:
- "Alert 1"
- "CPU"
- "Problem"
Pro Tip: Name it so someone new to your team understands immediately what it monitors.
Step 5: Choose Severity Level
Select how serious this problem is:
Critical 🔴
- Use when: The system is down or a critical function is broken
- Examples: App unreachable, database offline, data loss risk
- Response time: Immediate (within minutes)
Warning 🟠
- Use when: A problem is affecting users or approaching a critical level
- Examples: High response times, approaching resource limits, elevated error rate
- Response time: Soon (within 1 hour)
Info 🟡
- Use when: Informational only, no urgent action needed
- Examples: Deployment completed, scheduled maintenance finished
- Response time: As time allows (no rush)
Step 6: Add Description (Optional but Recommended)
Explain what this alert means and what to do about it:
"Fires when API response time exceeds 5 seconds.
Usually caused by:
- Database slowness
- High traffic
To fix:
1. Check database performance
2. Review application logs
3. Scale if needed"
Step 7: Save and Enable
- Click Save Rule
- The rule is created and appears in your rules list
- Toggle the Enabled switch to ON
- The rule now monitors the selected metric
Done! Your first alert rule is now active.
Common Alert Rule Examples
Example 1: Website Down
What to Monitor: HTTP Status Code
Condition: = 503 or = 0 (0 typically means no response was received)
Severity: Critical
Name: "Website Is Down"
Description: "Service is not responding to requests"
Example 2: High CPU Usage
What to Monitor: CPU Usage
Condition: > 80%
Severity: Warning
Name: "Server CPU Too High"
Description: "CPU usage is above normal and performance is degrading"
Example 3: Low Disk Space
What to Monitor: Disk Space Free
Condition: < 10%
Severity: Critical
Name: "Disk Space Running Out"
Description: "Less than 10% disk space remaining"
Example 4: Slow API Response
What to Monitor: API Response Time
Condition: > 5000 ms
Severity: Warning
Name: "API Response Time High"
Description: "API is responding slower than acceptable"
Example 5: High Error Rate
What to Monitor: Error Rate
Condition: > 5%
Severity: Warning
Name: "Application Error Rate High"
Description: "More than 5% of requests are failing"
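The five examples above can be read as simple conditions checked against current readings. The sketch below expresses them in Python to show which alerts would fire for one snapshot of metrics. The metric keys (`http_status`, `cpu_usage_percent`, and so on) and the function `fired_alerts` are hypothetical; the product may name and evaluate things differently.

```python
# Each example expressed as (name, severity, metric key, condition).
# Metric keys are hypothetical identifiers used only for this sketch.
EXAMPLE_RULES = [
    ("Website Is Down", "Critical", "http_status", lambda v: v in (503, 0)),
    ("Server CPU Too High", "Warning", "cpu_usage_percent", lambda v: v > 80),
    ("Disk Space Running Out", "Critical", "disk_free_percent", lambda v: v < 10),
    ("API Response Time High", "Warning", "api_response_ms", lambda v: v > 5000),
    ("Application Error Rate High", "Warning", "error_rate_percent", lambda v: v > 5),
]


def fired_alerts(readings):
    """Return (severity, name) for every rule whose condition is true."""
    return [
        (severity, name)
        for name, severity, metric, condition in EXAMPLE_RULES
        if metric in readings and condition(readings[metric])
    ]


# A made-up snapshot: the site responds, but CPU is high and disk is nearly full.
snapshot = {
    "http_status": 200,
    "cpu_usage_percent": 91,
    "disk_free_percent": 7,
    "api_response_ms": 1200,
    "error_rate_percent": 0.4,
}

for severity, name in fired_alerts(snapshot):
    print(f"[{severity}] {name}")
# [Warning] Server CPU Too High
# [Critical] Disk Space Running Out
```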
Managing Your Rules
View All Rules
Go to Alerts → Alert Rules tab to see all your rules. For each rule, the list shows:
- Rule name
- Whether it's enabled
- Current status
- Last triggered time
Enable/Disable a Rule
When to disable:
- Rule is creating too many false alarms
- You're doing maintenance
- Rule is outdated
How:
- Find the rule in the list
- Toggle the Enabled switch
- OFF = disabled, ON = enabled
Edit a Rule
Change a rule's condition, threshold, or name:
- Find the rule
- Click Edit or the rule name
- Change the settings
- Click Save
The updated rule applies immediately.
Delete a Rule
If you no longer need a rule:
- Find the rule
- Click Delete
- Confirm deletion
Note: Deleted rules cannot be recovered.
Best Practices for Alert Rules
1. Start Simple
- Begin with 2-3 critical rules
- Add more as you get comfortable
- Don't try to monitor everything at once
2. Use Clear Names
- Someone new should understand immediately
- Include the metric and what triggers it
- Use consistent naming across your rules
3. Set Realistic Thresholds
- Base them on your normal operations
- Not so sensitive that you get alert fatigue (for example, 100+ alerts per day)
- Not so loose that you miss real problems
4. Document Your Rules
- Explain what the rule monitors
- Note what typically causes it
- Include steps to fix the issue
5. Review Regularly
- Check which rules fire most often
- Identify false alarms
- Adjust thresholds based on patterns
- Remove rules that aren't useful
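If you keep or export a record of past alerts, a short script can show which rules fire most often and how many of those alerts were false alarms. The sketch below assumes a hypothetical history of `(rule name, was it a real issue)` pairs; the data and the `review` function are illustrative, not a product feature.

```python
from collections import Counter

# Hypothetical alert history: (rule name, whether it turned out to be a real issue).
history = [
    ("Server CPU Too High", False),
    ("Server CPU Too High", False),
    ("Server CPU Too High", True),
    ("API Response Time High", True),
    ("Server CPU Too High", False),
    ("Disk Space Running Out", True),
]


def review(history):
    """Return (rule, times fired, false-alarm rate), most frequent rule first."""
    fired = Counter(name for name, _ in history)
    false_alarms = Counter(name for name, real in history if not real)
    return [
        (name, count, false_alarms[name] / count)
        for name, count in fired.most_common()
    ]


for name, count, false_rate in review(history):
    print(f"{name}: fired {count}x, {false_rate:.0%} false alarms")
```

A rule that fires often with a high false-alarm rate is a candidate for a threshold adjustment; one that never fires may be too loose or no longer needed.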
6. Test Before Relying
- Create a test rule with a very low threshold (for example, CPU > 1%) so that it is sure to fire
- Verify it fires when it should
- Check that you receive notifications
- Delete the test rule when done
Adjusting Your Rules
Rule Triggers Too Often
Problem: You're getting 20 alerts per day but only 2 are real issues.
Solution - Increase Threshold:
Original:
CPU > 70%
Updated:
CPU > 85%
Now it alerts only when CPU usage is high enough to need attention.
Rule Never Triggers
Problem: You never get the alert, even when the problem exists.
Solution - Lower Threshold:
Original:
CPU > 95%
Updated:
CPU > 80%
Now it catches the problem earlier.
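Before changing a threshold in either direction, it can help to estimate how many alerts each candidate value would have produced on past data. The sketch below replays a made-up day of CPU readings against three thresholds. `count_alerts` is a hypothetical helper, and it counts every reading above the threshold; a real monitoring system may group consecutive readings into a single alert.

```python
def count_alerts(readings, threshold):
    """Count readings that would satisfy a 'value > threshold' rule."""
    return sum(1 for value in readings if value > threshold)


# Hypothetical CPU readings (%) sampled over one day.
cpu_readings = [62, 71, 74, 88, 69, 73, 91, 77, 72, 66, 79, 84]

for threshold in (70, 85, 95):
    print(f"CPU > {threshold}%: {count_alerts(cpu_readings, threshold)} alerts")
# CPU > 70%: 9 alerts
# CPU > 85%: 2 alerts
# CPU > 95%: 0 alerts
```

In this sample, 70% is noisy, 95% never fires, and 85% isolates the two highest peaks.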
Rule No Longer Needed
Solution:
- Disable: Keeps it but turns it off (easy to re-enable)
- Delete: Removes it permanently (use only when it is truly no longer needed)
When to Create an Alert Rule
Create a rule for:
- ✅ Production issues (outages, errors)
- ✅ Performance degradation (slowness)
- ✅ Resource exhaustion (disk full, memory high)
- ✅ Security events
- ✅ Important business metrics
Don't create a rule for:
- ❌ Normal/expected variations
- ❌ Things you can't act on
- ❌ Low priority "nice to know" info
- ❌ Duplicate alerts for the same issue
Alert Rule Templates
Use these templates to create rules for common scenarios:
API/Web Application:
- Response Time High (> 5 seconds)
- Error Rate High (> 5%)
- Service Down (status code 503)
Database:
- CPU Usage High (> 80%)
- Memory Usage High (> 90%)
- Connection Pool Exhausted
Infrastructure:
- Disk Space Low (< 10%)
- Network Latency High (> 100 ms)
- Server Unreachable
Scheduled Jobs:
- Job Failed (success count = 0)
- Job Took Too Long (> expected time)
- Job Didn't Run
Next Steps
Now that you've created alert rules:
- Configure Notifications - Set up how you're notified
- Respond to Alerts - Handle active alerts
- Alert Management - More advanced topics
Getting Help
Questions about creating rules?
- Check the examples above
- Click the ? icon on the Alerts page
- Contact support: [email protected]
Rule isn't working?
- Make sure it's Enabled
- Check that the threshold is set correctly
- Verify the metric you're monitoring
- Test with a temporary rule first