Monitor VM Performance | Real-Time Metrics and Health Status Guide
Comprehensive monitoring is essential for maintaining optimal VM performance and identifying issues before they impact your applications.
Monitoring Overview#
Nife provides real-time and historical monitoring capabilities for all your VM instances across AWS, GCP, and Azure.
Key Monitoring Features#
- Real-time Metrics: Live performance data updated continuously
- Historical Data: Track metrics over time for trend analysis
- Performance Alerts: Get notified of performance issues
- Resource Tracking: Monitor CPU, memory, disk, and network usage
- Health Status: Quick view of instance health
- Activity Logs: Review recent instance actions and changes
Accessing Monitoring#
From Instance Card#
- Locate the instance in the VM Management list
- Click the Monitoring button in the instance actions
- The monitoring dashboard opens in a panel
From Detail Panel#
- Open the instance detail panel
- Click the Monitoring tab
- The full monitoring dashboard is displayed
Monitoring Dashboard#
The monitoring dashboard shows:
Overview Cards
- Current CPU usage percentage
- Memory utilization (GB and percentage)
- Network throughput (inbound/outbound)
- Disk usage (GB and percentage)
Performance Graphs
- CPU usage over time (last 24 hours, 7 days, 30 days)
- Memory usage trending
- Network I/O (bytes in/out)
- Disk I/O operations
Health Status
- Instance status (Running, Stopped, etc.)
- Uptime duration
- Last state change
- Network reachability
Key Metrics Explained#
CPU Metrics#
CPU Usage Percentage
- How much of the CPU is being utilized
- Normal: 20-60% for typical workloads
- High: >80% sustained indicates capacity issues
- Action: Consider scaling or optimizing application
CPU Cores
- Number of vCPUs allocated to instance
- Check if application can utilize all cores
- Consider upgrading if CPU-bound
Memory Metrics#
Memory Usage
- How much RAM is currently in use
- Shown in GB and as percentage
- Normal: 40-70% of available memory
- High: >85% may cause slowdowns or crashes
Memory Available
- Free memory available for applications
- Keep a free buffer of at least 10-20% of total memory
- Low memory can cause swapping and poor performance
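As a rough illustration of how the memory figures above relate, the sketch below derives the usage percentage, the available memory, and a free-buffer check from total and used GB. The function name, the returned keys, and the 10% default are assumptions for illustration, not part of Nife's product.

```python
def memory_summary(total_gb: float, used_gb: float, min_free_pct: float = 10.0) -> dict:
    """Summarize memory usage from total and used GB (hypothetical helper)."""
    if total_gb <= 0:
        raise ValueError("total_gb must be positive")
    used_pct = used_gb / total_gb * 100
    free_pct = 100 - used_pct
    return {
        "used_pct": round(used_pct, 1),
        "available_gb": round(total_gb - used_gb, 2),
        # True when the free share meets the minimum buffer (10% by default)
        "buffer_ok": free_pct >= min_free_pct,
    }


print(memory_summary(total_gb=16, used_gb=12))
```

With 16 GB total and 12 GB used, usage is 75% and the free share is 25%, which is above the minimum buffer.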
Network Metrics#
Network In
- Incoming data to the instance
- Measured in Mbps (megabits per second)
- Normal: Depends on application type
- Spikes: May indicate traffic surge or attack
Network Out
- Outgoing data from the instance
- Measured in Mbps
- Monitor for unexpected data transfers
- High: May indicate data exfiltration or misconfiguration
Packet Loss
- Percentage of network packets lost
- Should be <0.1% in a healthy network
- High: Indicates network issues
- Action: Check network configuration and cloud provider status
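If you collect raw byte and packet counters yourself, the units above can be derived as in the following sketch. It assumes decimal megabits (1 Mbps = 1,000,000 bits per second) and counters sampled at a known interval. Both function names are hypothetical.

```python
def throughput_mbps(bytes_delta: int, interval_seconds: float) -> float:
    """Convert a byte-counter difference over an interval into Mbps."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return bytes_delta * 8 / 1_000_000 / interval_seconds


def packet_loss_pct(sent: int, received: int) -> float:
    """Percentage of sent packets that were not received."""
    if sent <= 0:
        return 0.0
    return (sent - received) / sent * 100


# 75 MB transferred in 60 seconds is 10 Mbps
print(throughput_mbps(75_000_000, 60))
# 5 packets lost out of 10,000 is about 0.05%, within the <0.1% guideline
print(packet_loss_pct(10_000, 9_995))
```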
Disk Metrics#
Disk Usage
- How much storage is in use
- Shown in GB and percentage
- Target: Keep <80% for optimal performance
- >90%: Risk of out-of-disk errors
Disk I/O
- Read/write operations per second (IOPS)
- High: May indicate disk bottleneck
- Sustained high: Consider upgrading disk
Disk Latency
- Time taken for disk operations
- Normal: <5ms for SSD
- High: >20ms indicates performance issues
- Action: Check for background processes
Performance Analysis#
Setting Time Ranges#
View metrics for different time periods:
- 1 Hour: Recent performance and current issues
- 24 Hours: Daily patterns and peak usage times
- 7 Days: Weekly trends and recurring issues
- 30 Days: Long-term trends and capacity planning
- Custom Range: Specific date range analysis
Identifying Performance Issues#
High CPU Usage
- Check which processes are consuming CPU
- Review application logs for errors
- Check for runaway processes
- Monitor network I/O for correlation
- Consider application optimization or scaling
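One way to check which processes are consuming CPU is to run `ps` on the instance and rank the output. The sketch below assumes a Linux or macOS instance where `ps -eo pid,pcpu,comm` is available. It is a standalone helper, not a Nife feature, and the function names are hypothetical.

```python
import subprocess


def parse_ps_output(ps_output: str, top_n: int = 5) -> list[tuple[int, float, str]]:
    """Parse `ps -eo pid,pcpu,comm` output into (pid, cpu_pct, command), highest CPU first."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        pid, pcpu, command = parts
        try:
            rows.append((int(pid), float(pcpu), command.strip()))
        except ValueError:
            continue  # ignore rows that do not parse as numbers
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:top_n]


def top_cpu_processes(top_n: int = 5) -> list[tuple[int, float, str]]:
    """Run ps on the local machine and return the busiest processes."""
    result = subprocess.run(
        ["ps", "-eo", "pid,pcpu,comm"],
        capture_output=True, text=True, check=True,
    )
    return parse_ps_output(result.stdout, top_n)
```

Calling `top_cpu_processes(3)` on the instance returns the three processes with the highest CPU share at that moment. A single runaway process usually stands out at the top of the list.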
High Memory Usage
- Review running processes and services
- Check for memory leaks in applications
- Monitor for unnecessary background tasks
- Consider increasing memory allocation
- Check for caching issues
High Network Usage
- Verify application is performing as expected
- Check for data downloads/uploads
- Monitor for malware or unauthorized access
- Review firewall and security rules
- Check bandwidth costs and limits
Low Disk Space
- Identify large files and directories
- Clean up logs and temporary files
- Review application data growth
- Consider disk expansion
- Implement log rotation policies
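To identify large directories on the instance itself, a short script using only the Python standard library can help. This is a minimal sketch: each directory's size counts only the files directly inside it (not its subdirectories), and the function names are illustrative.

```python
import os
import shutil


def disk_usage_pct(path: str = "/") -> float:
    """Used space on the filesystem containing path, as a percentage."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def largest_directories(root: str, top_n: int = 5) -> list[tuple[str, int]]:
    """Return (directory, bytes) pairs for the directories holding the most file data."""
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        total = 0
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(full_path):
                    total += os.path.getsize(full_path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
        sizes[dirpath] = total
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)[:top_n]
```

For example, `largest_directories("/var/log")` is a reasonable first check when logs are the suspected cause of disk growth.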
Health Monitoring#
Instance Health Status#
Running
- Instance is active and operational
- Applications can be deployed and accessed
- Monitoring data is current
- Can perform all operations
Stopped
- Instance is powered down
- No monitoring data available (shows last known state)
- Cannot run applications
- Resources are released
Paused
- Instance is temporarily paused
- Minimal resource usage
- Monitoring paused
- Quick resume available
Degraded
- Instance is running but experiencing issues
- Some services may be unavailable
- Investigate alerts and logs
- May require restart or troubleshooting
Health Checks#
Automatic health checks monitor:
- Instance reachability via network
- System disk status
- Memory health
- CPU functionality
- Network connectivity
Status Indicators#
- Green: All systems healthy
- Yellow: Warning conditions detected
- Red: Critical issue requires attention
Setting Up Alerts#
Alert Types#
CPU Alerts
- Trigger when CPU exceeds threshold
- Typical threshold: 80%
- Duration: Sustained for 5+ minutes
Memory Alerts
- Trigger when memory usage exceeds threshold
- Typical threshold: 85%
- Duration: Sustained for 5+ minutes
Disk Alerts
- Trigger when disk usage exceeds threshold
- Typical threshold: 80%
- Action: Requires immediate attention
Network Alerts
- High traffic alerts
- Packet loss detection
- Connection timeouts
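The "sustained for 5+ minutes" condition used by the CPU and memory alerts above can be sketched as follows. This illustrates the general technique, not Nife's actual alert engine. The `Sample` type and `breached_for` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float  # seconds since epoch
    value: float      # metric value, for example CPU percent


def breached_for(samples: list[Sample], threshold: float, duration_seconds: float) -> bool:
    """True if the latest unbroken run of samples above threshold spans duration_seconds."""
    if not samples:
        return False
    ordered = sorted(samples, key=lambda s: s.timestamp)
    latest = ordered[-1].timestamp
    breach_start = None
    # Walk backwards from the newest sample until one is at or below the threshold
    for sample in reversed(ordered):
        if sample.value <= threshold:
            break
        breach_start = sample.timestamp
    return breach_start is not None and latest - breach_start >= duration_seconds


# One sample per minute for five minutes, all at 90% CPU
readings = [Sample(timestamp=t, value=90.0) for t in range(0, 301, 60)]
print(breached_for(readings, threshold=80.0, duration_seconds=300))
```

Requiring an unbroken run avoids alerting on short spikes. A single dip below the threshold resets the window.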
Creating Alerts#
- Navigate to monitoring dashboard
- Click Set Alert button
- Choose metric to monitor
- Set threshold value
- Set duration (5 minutes, 15 minutes, 1 hour)
- Choose notification method (Email, Slack, etc.)
- Save alert
Alert Notifications#
Alerts can be sent via:
- Email notifications
- Slack messages
- Webhook calls
- SMS (premium)
- PagerDuty integration
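For webhook notifications, the receiving service typically gets an HTTP POST with a JSON body. The payload fields below (`instance_id`, `metric`, `value`, `threshold`) are assumptions for illustration. The real payload format is not documented here, so check what your webhook actually receives before relying on any field name.

```python
import json
import urllib.request


def build_alert_request(webhook_url: str, instance_id: str, metric: str,
                        value: float, threshold: float) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request describing an alert."""
    payload = {
        "instance_id": instance_id,  # hypothetical field names
        "metric": metric,
        "value": value,
        "threshold": threshold,
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_alert_request("https://example.com/hooks/alerts", "vm-123", "cpu", 91.5, 80.0)
print(request.get_method(), request.data.decode("utf-8"))
```

A request built this way can be sent with `urllib.request.urlopen(request, timeout=10)`, which is useful for testing a webhook endpoint before pointing real alerts at it.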
Exporting Monitoring Data#
Export Formats#
CSV Export
- Timestamp
- CPU usage
- Memory usage
- Network In/Out
- Disk usage
- Custom metrics
JSON Export
- Full metric details
- Metadata information
- Custom fields
- API-ready format
Exporting Data#
- Open monitoring dashboard
- Select time range
- Click Export button
- Choose format (CSV or JSON)
- The file downloads to your computer
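Once exported, a CSV file can be analyzed with the Python standard library. The column names used here (`timestamp`, `cpu_usage`, `memory_usage`) are assumptions. Check the header row of your own export and adjust the `column` argument to match.

```python
import csv
import io


def summarize_column(csv_text: str, column: str = "cpu_usage") -> dict:
    """Return sample count, average, and peak for one numeric column of a CSV export."""
    reader = csv.DictReader(io.StringIO(csv_text))
    values = [float(row[column]) for row in reader if row.get(column)]
    if not values:
        raise ValueError(f"no values found in column {column!r}")
    return {
        "samples": len(values),
        "average": sum(values) / len(values),
        "peak": max(values),
    }


# Hypothetical export contents
export = (
    "timestamp,cpu_usage,memory_usage\n"
    "2024-01-01T00:00:00Z,40,55\n"
    "2024-01-01T00:05:00Z,60,58\n"
    "2024-01-01T00:10:00Z,80,61\n"
)
print(summarize_column(export))
```

To read a downloaded file, pass its contents in as text, for example `summarize_column(open("export.csv").read())`.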
Performance Optimization Tips#
CPU Optimization#
Identify CPU-bound Processes
- Use monitoring to identify high CPU processes
- Optimize application code
- Consider horizontal scaling
Reduce CPU Usage
- Disable unused services
- Optimize database queries
- Use caching strategies
- Implement rate limiting
Upgrade if Needed
- Consider instance type with more vCPUs
- Scale across multiple instances
- Use load balancing
Memory Optimization#
Monitor Memory Leaks
- Look for gradually increasing memory
- Restart services periodically
- Review application logs
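Gradually increasing memory shows up as a positive trend line. A minimal sketch of that check, using a least-squares slope over (hours elapsed, memory percent) pairs, is shown below. The 0.5 points-per-hour cutoff is an arbitrary assumption and should be tuned against your own baseline.

```python
def trend_per_hour(samples: list[tuple[float, float]]) -> float:
    """Least-squares slope of (hours_elapsed, value) pairs, in value units per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    denominator = sum((x - mean_x) ** 2 for x, _ in samples)
    if denominator == 0:
        raise ValueError("samples must span more than one point in time")
    return numerator / denominator


def looks_like_leak(samples: list[tuple[float, float]], min_growth_per_hour: float = 0.5) -> bool:
    """Flag steady growth at or above min_growth_per_hour (assumed cutoff)."""
    return trend_per_hour(samples) >= min_growth_per_hour


# Memory percent rising 2 points per hour
print(looks_like_leak([(0, 40.0), (1, 42.0), (2, 44.0), (3, 46.0)]))
```

A steady slope over many hours is a stronger signal than a single high reading, since normal workloads rise and fall with traffic.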
Optimize Memory Usage
- Increase garbage collection frequency
- Reduce cache sizes
- Optimize data structures
- Limit concurrent connections
Expand Memory
- Upgrade instance type
- Consider read replicas for database loads
- Implement distributed caching
Disk Optimization#
Manage Disk Space
- Implement log rotation
- Archive old data
- Remove temporary files
- Compress backups
Improve Disk I/O
- Use SSD storage
- Implement caching
- Optimize database indexing
- Separate read/write workloads
Network Optimization#
Reduce Latency
- Use Content Delivery Network (CDN)
- Deploy closer to users
- Optimize payload sizes
- Reduce hops in architecture
Optimize Bandwidth
- Compress data transfer
- Use regional endpoints
- Implement request batching
- Monitor for data leaks
Troubleshooting with Metrics#
Common Issues and Solutions#
Instance Shows as Running but Not Accessible
- Check network reachability metric
- Verify security group rules
- Check application status
- Review error logs
- Attempt restart
Sudden Performance Drop
- Check if metrics show resource exhaustion
- Look for spikes in CPU or memory
- Review recent deployments or changes
- Check for background processes
- Monitor network for DDoS
Intermittent Slowness
- Look for periodic spikes in metrics
- Correlate with scheduled tasks
- Check for backup operations
- Review disk I/O patterns
- Monitor network latency
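To correlate periodic spikes with scheduled tasks, it can help to count spikes by hour of day. The sketch below assumes samples are (Unix timestamp, value) pairs and groups them in UTC. The function name is hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def spike_hours(samples: list[tuple[float, float]], threshold: float) -> list[tuple[int, int]]:
    """Return (hour_of_day_utc, spike_count) pairs, most frequent hour first."""
    counts = Counter()
    for timestamp, value in samples:
        if value > threshold:
            hour = datetime.fromtimestamp(timestamp, tz=timezone.utc).hour
            counts[hour] += 1
    return counts.most_common()


# Spikes at 02:00 UTC on two consecutive days, plus one at 14:00 on the second day
samples = [(7200, 95.0), (93600, 92.0), (50400, 30.0), (136800, 91.0)]
print(spike_hours(samples, threshold=80.0))
```

If most spikes fall in the same hour, compare that hour with cron jobs, backups, and other scheduled work on the instance.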
High Costs Despite Low Usage
- Check for reserved instance mismatches
- Verify instance type allocation
- Monitor network transfer costs
- Check for data storage growth
- Review pricing for current tier
Best Practices#
- Regular Review: Check metrics weekly
- Set Baselines: Know your normal usage patterns
- Proactive Alerts: Set alerts before critical thresholds
- Archive Data: Export historical data for long-term analysis
- Document Issues: Keep records of problems and solutions
- Plan Capacity: Use trends to predict future needs
- Correlate Metrics: Look at multiple metrics together
- Test Alerts: Verify alert notifications work
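One simple way to set a baseline is to record the mean and standard deviation of a metric during normal operation, then flag readings that fall far outside it. The three-standard-deviation rule below is a common convention, used here as an assumption rather than a Nife default.

```python
import statistics


def baseline(values: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation of a metric (needs at least two values)."""
    return statistics.mean(values), statistics.stdev(values)


def is_anomalous(value: float, mean: float, stdev: float, k: float = 3.0) -> bool:
    """True when value is more than k standard deviations away from the mean."""
    return abs(value - mean) > k * stdev


normal_cpu = [40.0, 42.0, 38.0, 41.0, 39.0]
mean, stdev = baseline(normal_cpu)
print(is_anomalous(60.0, mean, stdev))
```

Baselines drift as workloads change, so recompute them periodically from recent exported data.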
Recommended Thresholds#
| Metric | Warning | Critical |
|---|---|---|
| CPU | 70% | 85% |
| Memory | 75% | 90% |
| Disk | 80% | 95% |
| Network Out | 1000 Mbps | 1500 Mbps |
| Packet Loss | 0.5% | 2% |
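The table above can be encoded as a lookup and applied to current readings, as in this sketch. The metric key names are hypothetical. Treating a value equal to a threshold as a breach, and mapping warning to Yellow and critical to Red, are assumptions made for illustration.

```python
# (warning, critical) pairs taken from the recommended thresholds table
THRESHOLDS = {
    "cpu": (70.0, 85.0),                   # percent
    "memory": (75.0, 90.0),                # percent
    "disk": (80.0, 95.0),                  # percent
    "network_out_mbps": (1000.0, 1500.0),  # Mbps
    "packet_loss": (0.5, 2.0),             # percent
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a single reading."""
    warning, critical = THRESHOLDS[metric]
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


def overall_status(readings: dict[str, float]) -> str:
    """Collapse several readings into one indicator colour (assumed mapping)."""
    levels = {classify(metric, value) for metric, value in readings.items()}
    if "critical" in levels:
        return "Red"
    if "warning" in levels:
        return "Yellow"
    return "Green"


print(overall_status({"cpu": 72.0, "memory": 60.0, "disk": 50.0}))
```

Here CPU at 72% crosses the 70% warning level while the other readings are fine, so the overall result is Yellow.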
Next Steps#
- Managing VM Instances - Manage instance operations
- Troubleshooting - Common issues and solutions
- Cloud Provider Setup - Configure providers