Monitor VM Performance | Real-Time Metrics and Health Status Guide

Comprehensive monitoring is essential for maintaining optimal VM performance and identifying issues before they impact your applications.

Monitoring Overview#

Nife provides real-time and historical monitoring capabilities for all your VM instances across AWS, GCP, and Azure.

Key Monitoring Features#

  • Real-time Metrics: Live performance data updated continuously
  • Historical Data: Track metrics over time for trend analysis
  • Performance Alerts: Get notified of performance issues
  • Resource Tracking: Monitor CPU, memory, disk, and network usage
  • Health Status: Quick view of instance health
  • Activity Logs: Review recent instance actions and changes

Accessing Monitoring#

From Instance Card#

  1. Locate the instance in the VM Management list
  2. Click the Monitoring button in the instance actions
  3. The monitoring dashboard opens in a panel

From Detail Panel#

  1. Open the instance detail panel
  2. Click the Monitoring tab
  3. The full monitoring dashboard is displayed

Monitoring Dashboard#

The monitoring dashboard shows:

Overview Cards

  • Current CPU usage percentage
  • Memory utilization (GB and percentage)
  • Network throughput (inbound/outbound)
  • Disk usage (GB and percentage)

Performance Graphs

  • CPU usage over time (last 24 hours, 7 days, 30 days)
  • Memory usage trending
  • Network I/O (bytes in/out)
  • Disk I/O operations

Health Status

  • Instance status (Running, Stopped, etc.)
  • Uptime duration
  • Last state change
  • Network reachability

Key Metrics Explained#

CPU Metrics#

CPU Usage Percentage

  • How much of the CPU is being utilized
  • Normal: 20-60% for typical workloads
  • High: >80% sustained indicates capacity issues
  • Action: Consider scaling or optimizing application

CPU Cores

  • Number of vCPUs allocated to instance
  • Check if application can utilize all cores
  • Consider upgrading if CPU-bound

Memory Metrics#

Memory Usage

  • How much RAM is currently in use
  • Shown in GB and as percentage
  • Normal: 40-70% of available memory
  • High: >85% may cause slowdowns or crashes

Memory Available

  • Free memory available for applications
  • Should have buffer (10-20% minimum)
  • Low memory can cause swapping and poor performance

Network Metrics#

Network In

  • Incoming data to the instance
  • Measured in Mbps (megabits per second)
  • Normal: Depends on application type
  • Spikes: May indicate traffic surge or attack

Network Out

  • Outgoing data from the instance
  • Measured in Mbps
  • Monitor for unexpected data transfers
  • High: May indicate data exfiltration or misconfiguration

Packet Loss

  • Percentage of network packets lost
  • Should be <0.1% in healthy network
  • High: Indicates network issues
  • Action: Check network configuration and cloud provider status
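The network figures above are derived from raw counters. As a minimal sketch, assuming you have byte and packet counters sampled at two points in time (the function names are illustrative and not part of any Nife API), the conversions look like this:

```python
def throughput_mbps(bytes_start: int, bytes_end: int, seconds: float) -> float:
    """Convert a byte-counter delta into megabits per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # 8 bits per byte, 1,000,000 bits per megabit
    return (bytes_end - bytes_start) * 8 / 1_000_000 / seconds


def packet_loss_percent(sent: int, received: int) -> float:
    """Percentage of sent packets that never arrived."""
    if sent <= 0:
        return 0.0
    return (sent - received) / sent * 100
```

For example, 125,000,000 bytes transferred in 10 seconds is 100 Mbps, and 5 packets lost out of 10,000 is 0.05%, which is inside the <0.1% range described above.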

Disk Metrics#

Disk Usage

  • How much storage is in use
  • Shown in GB and percentage
  • Target: Keep <80% for optimal performance
  • Above 90%: Risk of out-of-disk errors
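To cross-check the dashboard figure from inside the instance, a short script can read the same numbers locally. This is a sketch using Python's standard `shutil.disk_usage`; the 80% and 90% cut-offs come from the list above, and the function names and status labels are illustrative assumptions.

```python
import shutil

GB = 10 ** 9  # decimal gigabytes; use 1024 ** 3 if you prefer GiB


def classify_disk(used_pct: float, target_pct: float = 80.0, risk_pct: float = 90.0) -> str:
    """Map a usage percentage onto the ranges described above."""
    if used_pct >= risk_pct:
        return "risk"
    if used_pct >= target_pct:
        return "above target"
    return "ok"


def disk_report(path: str = "/") -> dict:
    """Report used space in GB and percent for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    used_pct = usage.used / usage.total * 100 if usage.total else 0.0
    return {
        "used_gb": round(usage.used / GB, 2),
        "used_pct": round(used_pct, 1),
        "status": classify_disk(used_pct),
    }
```

The locally computed percentage can differ slightly from the dashboard value, for instance when the filesystem reserves blocks for the root user.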

Disk I/O

  • Read/write operations per second (IOPS)
  • High: May indicate disk bottleneck
  • Sustained high: Consider upgrading disk

Disk Latency

  • Time taken for disk operations
  • Normal: <5ms for SSD
  • High: >20ms indicates performance issues
  • Action: Check for background processes

Performance Analysis#

Setting Time Ranges#

View metrics for different time periods:

  • 1 Hour: Recent performance and current issues
  • 24 Hours: Daily patterns and peak usage times
  • 7 Days: Weekly trends and recurring issues
  • 30 Days: Long-term trends and capacity planning
  • Custom Range: Specific date range analysis

Identifying Performance Issues#

High CPU Usage

  1. Check which processes are consuming CPU
  2. Review application logs for errors
  3. Check for runaway processes
  4. Check whether CPU spikes correlate with network I/O
  5. Consider application optimization or scaling

High Memory Usage

  1. Review running processes and services
  2. Check for memory leaks in applications
  3. Monitor for unnecessary background tasks
  4. Consider increasing memory allocation
  5. Check for caching issues

High Network Usage

  1. Verify application is performing as expected
  2. Check for data downloads/uploads
  3. Monitor for malware or unauthorized access
  4. Review firewall and security rules
  5. Check bandwidth costs and limits

Low Disk Space

  1. Identify large files and directories
  2. Clean up logs and temporary files
  3. Review application data growth
  4. Consider disk expansion
  5. Implement log rotation policies
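Step 1 above (identifying large directories) can be scripted on the instance itself. The following is a minimal sketch using only the standard library; `largest_dirs` is a hypothetical helper name, and it counts only the files stored directly in each directory, not in its subdirectories.

```python
import os


def largest_dirs(root: str, top_n: int = 5) -> list[tuple[str, int]]:
    """Return (directory, bytes) pairs for the directories under `root`
    that directly contain the most file data, largest first."""
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        total = 0
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                # Skip files that vanish or cannot be read mid-scan.
                continue
        sizes[dirpath] = total
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)[:top_n]
```

Running it against a log or data directory, for example `largest_dirs("/var/log")`, gives a quick list of cleanup candidates before you decide on disk expansion.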

Health Monitoring#

Instance Health Status#

Running

  • Instance is active and operational
  • Applications can be deployed and accessed
  • Monitoring data is current
  • Can perform all operations

Stopped

  • Instance is powered down
  • No live monitoring data (the dashboard shows the last known state)
  • Cannot run applications
  • Resources are released

Paused

  • Instance is temporarily paused
  • Minimal resource usage
  • Monitoring paused
  • Quick resume available

Degraded

  • Instance is running but experiencing issues
  • Some services may be unavailable
  • Investigate alerts and logs
  • May require restart or troubleshooting

Health Checks#

Automatic health checks monitor:

  • Instance reachability via network
  • System disk status
  • Memory health
  • CPU functionality
  • Network connectivity

Status Indicators#

  • Green: All systems healthy
  • Yellow: Warning conditions detected
  • Red: Critical issue requires attention

Setting Up Alerts#

Alert Types#

CPU Alerts

  • Trigger when CPU exceeds threshold
  • Typical threshold: 80%
  • Duration: Sustained for 5+ minutes

Memory Alerts

  • Trigger when memory usage exceeds threshold
  • Typical threshold: 85%
  • Duration: Sustained for 5+ minutes

Disk Alerts

  • Trigger when disk usage exceeds threshold
  • Typical threshold: 80%
  • Action: Requires immediate attention

Network Alerts

  • High traffic alerts
  • Packet loss detection
  • Connection timeouts
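The "sustained for 5+ minutes" condition matters because a single spike should not fire an alert. How Nife evaluates it internally is not documented here; the sketch below shows one common way to implement such a check, with `sustained_breach` as an illustrative name.

```python
from datetime import datetime, timedelta


def sustained_breach(samples, threshold: float, duration: timedelta) -> bool:
    """Return True if the metric has stayed above `threshold` without
    interruption for at least `duration`, up to the most recent sample.

    `samples` is a list of (timestamp, value) pairs sorted by time.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # start of the current streak
        else:
            breach_start = None  # any dip below the threshold resets the streak
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= duration
```

With one sample per minute, a CPU series that stays above 80% from minute 1 through minute 6 triggers a 5-minute alert, while the same series with a single dip in the middle does not.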

Creating Alerts#

  1. Navigate to monitoring dashboard
  2. Click Set Alert button
  3. Choose metric to monitor
  4. Set threshold value
  5. Set duration (5 minutes, 15 minutes, 1 hour)
  6. Choose notification method (Email, Slack, etc.)
  7. Save alert

Alert Notifications#

Alerts can be sent via:

  • Email notifications
  • Slack messages
  • Webhook calls
  • SMS (premium)
  • PagerDuty integration

Exporting Monitoring Data#

Export Formats#

CSV Export

  • Timestamp
  • CPU usage
  • Memory usage
  • Network In/Out
  • Disk usage
  • Custom metrics

JSON Export

  • Full metric details
  • Metadata information
  • Custom fields
  • API-ready format

Exporting Data#

  1. Open monitoring dashboard
  2. Select time range
  3. Click Export button
  4. Choose format (CSV or JSON)
  5. The file downloads to your computer
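Once exported, the CSV can be summarized with a few lines of Python. This sketch assumes the file has a header row and a column named `cpu_usage`; the actual column names in a Nife export may differ, so adjust the `column` argument to match your file.

```python
import csv
import io
from statistics import mean


def summarize_column(csv_text: str, column: str = "cpu_usage"):
    """Return sample count, average, and peak for one numeric column,
    or None if the column has no values."""
    rows = csv.DictReader(io.StringIO(csv_text))
    values = [float(row[column]) for row in rows if row.get(column)]
    if not values:
        return None
    return {"samples": len(values), "average": mean(values), "peak": max(values)}
```

To use it on a downloaded file, read the file contents first, for example `summarize_column(open("export.csv").read())`, where `export.csv` is a placeholder file name.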

Performance Optimization Tips#

CPU Optimization#

  1. Identify CPU-bound Processes

    • Use monitoring to identify high CPU processes
    • Optimize application code
    • Consider horizontal scaling
  2. Reduce CPU Usage

    • Disable unused services
    • Optimize database queries
    • Use caching strategies
    • Implement rate limiting
  3. Upgrade if Needed

    • Consider instance type with more vCPUs
    • Scale across multiple instances
    • Use load balancing

Memory Optimization#

  1. Monitor Memory Leaks

    • Look for gradually increasing memory
    • Restart services periodically
    • Review application logs
  2. Optimize Memory Usage

    • Increase garbage collection frequency
    • Reduce cache sizes
    • Optimize data structures
    • Limit concurrent connections
  3. Expand Memory

    • Upgrade instance type
    • Consider read replicas for database loads
    • Implement distributed caching
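The "gradually increasing memory" signal from step 1 can be quantified by fitting a straight line to memory samples and looking at its slope. This is a simple least-squares sketch, not a full leak detector; the function name and the input shape (hours since start, memory percent) are assumptions for illustration.

```python
def memory_trend_per_hour(samples) -> float:
    """Least-squares slope of memory usage, in percentage points per hour.

    `samples` is a list of (hours_since_start, memory_percent) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    denominator = sum((x - mean_x) ** 2 for x, _ in samples)
    if denominator == 0:
        raise ValueError("samples must span more than one point in time")
    return numerator / denominator
```

A slope that stays positive across several days, with no matching growth in traffic, is worth investigating as a possible leak; a slope near zero suggests memory usage is stable.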

Disk Optimization#

  1. Manage Disk Space

    • Implement log rotation
    • Archive old data
    • Remove temporary files
    • Compress backups
  2. Improve Disk I/O

    • Use SSD storage
    • Implement caching
    • Optimize database indexing
    • Separate read/write workloads

Network Optimization#

  1. Reduce Latency

    • Use Content Delivery Network (CDN)
    • Deploy closer to users
    • Optimize payload sizes
    • Reduce hops in architecture
  2. Optimize Bandwidth

    • Compress data transfer
    • Use regional endpoints
    • Implement request batching
    • Monitor for data leaks

Troubleshooting with Metrics#

Common Issues and Solutions#

Instance Shows as Running but Not Accessible

  1. Check network reachability metric
  2. Verify security group rules
  3. Check application status
  4. Review error logs
  5. Attempt restart

Sudden Performance Drop

  1. Check if metrics show resource exhaustion
  2. Look for spikes in CPU or memory
  3. Review recent deployments or changes
  4. Check for background processes
  5. Monitor network for DDoS

Intermittent Slowness

  1. Look for periodic spikes in metrics
  2. Correlate with scheduled tasks
  3. Check for backup operations
  4. Review disk I/O patterns
  5. Monitor network latency
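Steps 1 and 2 above (finding periodic spikes and matching them to scheduled tasks) are easier when spikes are grouped by hour of day. A minimal sketch, assuming you have (timestamp, value) samples from an export; `spike_hours` is an illustrative name.

```python
from collections import defaultdict
from datetime import datetime


def spike_hours(samples, threshold: float) -> dict:
    """Count, per hour of day, how many samples exceeded `threshold`.

    `samples` is a list of (datetime, value) pairs. The result maps
    hour (0-23) to a count, ordered from most to fewest spikes.
    """
    counts = defaultdict(int)
    for timestamp, value in samples:
        if value > threshold:
            counts[timestamp.hour] += 1
    return dict(sorted(counts.items(), key=lambda item: item[1], reverse=True))
```

If most spikes fall in the same hour every day, compare that hour against cron jobs, backups, and other scheduled work on the instance.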

High Costs Despite Low Usage

  1. Check for reserved instance mismatches
  2. Verify instance type allocation
  3. Monitor network transfer costs
  4. Check for data storage growth
  5. Review pricing for current tier

Best Practices#

  1. Regular Review: Check metrics weekly
  2. Set Baselines: Know your normal usage patterns
  3. Proactive Alerts: Set alerts before critical thresholds
  4. Archive Data: Export historical data for long-term analysis
  5. Document Issues: Keep records of problems and solutions
  6. Plan Capacity: Use trends to predict future needs
  7. Correlate Metrics: Look at multiple metrics together
  8. Test Alerts: Verify alert notifications work

Recommended Thresholds#

Metric         Warning      Critical
CPU            70%          85%
Memory         75%          90%
Disk           80%          95%
Network Out    1000 Mbps    1500 Mbps
Packet Loss    0.5%         2%
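The recommended thresholds above can be encoded as a small lookup for use in scripts that process exported metrics. This is a sketch; the metric keys and the `classify` function are illustrative names rather than part of Nife, and treating a value equal to the threshold as a breach is an assumption.

```python
# (warning, critical) pairs taken from the recommended thresholds.
THRESHOLDS = {
    "cpu": (70.0, 85.0),               # percent
    "memory": (75.0, 90.0),            # percent
    "disk": (80.0, 95.0),              # percent
    "network_out": (1000.0, 1500.0),   # Mbps
    "packet_loss": (0.5, 2.0),         # percent
}


def classify(metric: str, value: float) -> str:
    """Return "ok", "warning", or "critical" for a metric reading."""
    warning, critical = THRESHOLDS[metric]
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"
```

These three levels line up naturally with the Green, Yellow, and Red status indicators described under Health Monitoring.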

Next Steps#