Managing Cluster Resources & Performance | Nife Deploy
Monitor your cluster's health and performance with real-time metrics and resource tracking.
Understanding Cluster Resources#
Cluster resources include everything your cluster uses:
- CPU: Processing power
- Memory: RAM
- Disk: Storage space
- Network: Data transfer
- Pods: Running containers
Resource Dashboard#
Viewing Cluster Metrics#
- Go to Clusters page
- Click a cluster to see details
- Metrics tab shows resource data
Information Displayed:
- Current CPU usage %
- Current memory usage %
- Disk space used/available
- Pod count and status
- Node health
Real-time Monitoring#
Metrics update in real time once the monitoring agent is deployed:
- CPU and memory updates every 30 seconds
- Disk usage updates every 5 minutes
- Pod status updates immediately
- Node health updates continuously
CPU Metrics#
Understanding CPU Usage#
CPU is measured as a percentage (0-100%):
- 0-20%: Idle, plenty of capacity
- 20-50%: Normal operating range
- 50-80%: Moderate load
- 80-100%: High load, at capacity
CPU Guidelines#
Recommended Range:
- Development: 20-50% average
- Production: 30-60% average
- Peak: Should stay below 80%
- Never: Sustained 100% usage
Managing High CPU#
If CPU is consistently high:
Identify the cause:
- Check Pod Logs tab
- Look for error messages
- Check for runaway processes
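To see which pods are consuming the most CPU, you can sort the live metrics (a sketch; requires metrics-server to be installed in the cluster):

```bash
# List pods across all namespaces, sorted by CPU usage (needs metrics-server)
kubectl top pods --all-namespaces --sort-by=cpu
```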
Temporary solutions:
- Restart problematic pods
- Disable non-essential services
- Kill stuck processes
Long-term solutions:
- Scale up cluster (add nodes)
- Optimize application code
- Fix resource leaks
- Improve load balancing across pods and nodes
Memory Metrics#
Understanding Memory Usage#
Memory is also measured as a percentage (0-100%):
- 0-30%: Plenty of available memory
- 30-60%: Normal operating range
- 60-80%: Getting full, monitor closely
- 80-100%: Critical, pods may be evicted
Memory Guidelines#
Recommended Range:
- Keep at least 20% free for the OS and system processes
- Applications: 40-70% of total
- Peak: Should stay below 80%
- Emergency: Never above 90%
Managing High Memory#
If memory is consistently high:
Identify memory leaks:
- Check Pod Logs for leak messages
- Monitor memory trend over time
- Identify which pod is using most
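The same live metrics can be sorted by memory to find the heaviest pod (a sketch; requires metrics-server):

```bash
# List pods across all namespaces, sorted by memory usage (needs metrics-server)
kubectl top pods --all-namespaces --sort-by=memory
```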
Free up memory:
- Restart affected pods
- Delete unused deployments
- Clear caches
Optimize:
- Scale up (add more RAM)
- Optimize application memory
- Use smaller images
- Enable memory compression
Disk Space#
Understanding Disk Usage#
Disk shows used/available space:
- Free Space: How much is available
- Used Space: How much is being used
- Used Percentage: % of total
Disk Guidelines#
Healthy Disk State:
- Keep at least 10% free space
- Recommended: 20-30% free
- Never drop below 5% free
- Critical: Less than 2% free
Managing Low Disk#
If disk space is low:
Find what's using space:
- Check container images
- Look for log files
- Review persistent volumes
Free up space:
- Delete old images
- Clear temporary files
- Rotate old logs
- Delete unused volumes
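A sketch of common cleanup commands; the claim name is hypothetical, and `crictl` availability depends on your container runtime:

```bash
# Node-level: see what is filling the disk (run on the node itself)
df -h

# Prune unused container images (nodes with crictl installed)
crictl rmi --prune

# Delete an unused persistent volume claim (hypothetical claim name)
kubectl delete pvc old-data-claim
```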
Long-term:
- Add more storage
- Implement log rotation
- Use image cleanup policies
- Monitor usage regularly
Pod Monitoring#
What are Pods?#
Pods are running instances of your applications:
- One or more containers
- The smallest deployable unit in Kubernetes
- Can be created/destroyed dynamically
Pod Status#
| Status | Meaning | Action |
|---|---|---|
| Running | Pod is healthy and running | No action needed |
| Pending | Pod is starting | Wait for startup |
| Succeeded | Pod completed (job) | Normal completion |
| Failed | Pod crashed | Check logs and investigate |
| CrashLoopBackOff | Pod keeps restarting | Fix application error |
| Unknown | Cannot determine status | Check cluster health |
Pod Health Indicators#
Green (Healthy):
- All containers running
- No restarts
- Ready for traffic
Yellow (Warning):
- Frequent restarts
- High resource usage
- Slow response
Red (Error):
- Containers failing
- Crash loops
- Not responding
Node Health#
What are Nodes?#
Nodes are the machines that run your pods:
- Physical or virtual machines
- Have their own CPU, memory, disk
- Run multiple pods
Node Metrics#
Each node shows:
- Node name/ID
- CPU capacity and usage
- Memory capacity and usage
- Disk capacity and usage
- Pod count
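The same information is available from the command line (a sketch; `kubectl top` requires metrics-server, and the node name is a placeholder):

```bash
# Per-node CPU and memory usage (needs metrics-server)
kubectl top nodes

# Capacity, allocatable resources, conditions, taints, and pod count for one node
kubectl describe node <node-name>
```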
Node Status#
Healthy Node:
- Status: Ready
- No taints or conditions
- Resources available
- All components running
Problem Node:
- Status: NotReady
- May have taints
- Resources exhausted
- Components failing
Setting Resource Limits#
Understanding Limits#
Limits prevent containers from using too many resources:
- Request: Minimum guaranteed resources
- Limit: Maximum allowed resources
Setting Limits#
Via Nife Dashboard:
- Go to cluster details
- Click Configure
- Set CPU limits
- Set memory limits
- Click Save
Via kubectl:
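A minimal sketch of setting requests and limits from the command line, assuming a deployment named `my-app` (hypothetical):

```bash
# Set requests and limits on an existing deployment
kubectl set resources deployment my-app \
  --requests=cpu=100m,memory=128Mi \
  --limits=cpu=500m,memory=512Mi
```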
Limit Guidelines#
Frontend Application:
- CPU request: 100m
- CPU limit: 500m
- Memory request: 128Mi
- Memory limit: 512Mi
Backend API:
- CPU request: 200m
- CPU limit: 1000m
- Memory request: 256Mi
- Memory limit: 1Gi
Database:
- CPU request: 500m
- CPU limit: 2000m
- Memory request: 512Mi
- Memory limit: 4Gi
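As an example, the backend API numbers above translate into a container spec like this (a sketch; the deployment name and image are hypothetical):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-api                 # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: backend-api
  template:
    metadata:
      labels:
        app: backend-api
    spec:
      containers:
        - name: api
          image: example.com/backend-api:latest   # hypothetical image
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: 1000m
              memory: 1Gi
```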
Autoscaling#
Horizontal Pod Autoscaling (HPA)#
Automatically scales the number of pods based on metrics:
How it works:
- Monitor CPU/memory metrics
- If above threshold, scale up (add pods)
- If below threshold, scale down (remove pods)
- Maintains target metric percentage
When to use:
- Variable traffic patterns
- Cost optimization
- High availability needs
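The behavior above can be sketched as an HPA manifest using the `autoscaling/v2` API; the HPA and deployment names are hypothetical:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend-api-hpa             # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend-api               # hypothetical target
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60    # scale to keep average CPU near 60%
```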
Vertical Pod Autoscaling (VPA)#
Automatically adjusts resource requests/limits:
How it works:
- Monitor actual resource usage
- If using more, increase limits
- If using less, decrease limits
- Optimizes resource efficiency
When to use:
- Unknown resource requirements
- Right-sizing applications
- Cost optimization
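A minimal VPA sketch, assuming the Vertical Pod Autoscaler components are installed in the cluster (they are not part of core Kubernetes); names are hypothetical:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: backend-api-vpa             # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend-api               # hypothetical target
  updatePolicy:
    updateMode: "Auto"              # VPA may evict pods to apply new requests
```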
Cluster Autoscaling#
Automatically adds/removes nodes:
How it works:
- When pods can't fit, scale up cluster
- When nodes are idle, scale down cluster
- Maintains desired capacity
When to use:
- Dynamic workloads
- Cost savings
- Reclaiming unused capacity
Performance Optimization#
1. Right-Size Applications#
Review actual usage and reduce over-provisioned requests and limits to match it, leaving some headroom.
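A before/after sketch of right-sizing a container spec (the numbers are illustrative, not recommendations):

```yaml
# Before: over-provisioned (requests far above observed usage)
resources:
  requests:
    cpu: 1000m
    memory: 2Gi
  limits:
    cpu: 2000m
    memory: 4Gi
```

```yaml
# After: right-sized to observed usage plus headroom
resources:
  requests:
    cpu: 200m
    memory: 512Mi
  limits:
    cpu: 500m
    memory: 1Gi
```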
2. Use Resource Quotas#
Limit namespace resource usage:
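A ResourceQuota caps the total resources all pods in a namespace can request; a sketch with hypothetical names and illustrative numbers:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota                  # hypothetical name
  namespace: dev                    # hypothetical namespace
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"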
3. Enable Metrics Server#
Required for monitoring:
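The metrics-server project publishes a standard install manifest; applying it and verifying the deployment looks like this:

```bash
# Install metrics-server from the official release manifest
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify it is running
kubectl get deployment metrics-server -n kube-system
```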
4. Monitor Regularly#
- Check metrics daily
- Review trends weekly
- Plan capacity monthly
- Optimize quarterly
Troubleshooting Resource Issues#
Problem: Nodes Running Out of Memory#
Symptoms:
- Pod evictions
- Out of memory errors
- Node goes "NotReady"
Solutions:
- Delete unnecessary pods
- Increase node memory
- Adjust pod memory limits
- Identify memory leaks
Problem: Pods Keep Crashing#
Symptoms:
- CrashLoopBackOff status
- Frequent restarts
Solutions:
- Check pod logs
- Increase memory/CPU
- Fix application code
- Verify configuration
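A sketch of the first diagnostic steps for a crashing pod; `my-pod` is a placeholder for the real pod name:

```bash
# Logs from the current and the previous (crashed) container run
kubectl logs my-pod
kubectl logs my-pod --previous

# Events and last state, including OOMKilled or non-zero exit codes
kubectl describe pod my-pod
```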
Problem: Cluster Won't Scale#
Symptoms:
- Pods stuck in Pending
- Autoscale not working
Solutions:
- Check cluster capacity
- Verify autoscaling enabled
- Check resource quotas
- Add more nodes manually
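The scheduler records why a pod cannot be placed; a sketch of where to look (`my-pod` is a placeholder):

```bash
# Scheduling events explain why a pod stays Pending
kubectl describe pod my-pod

# Cluster-wide recent events, oldest first
kubectl get events --sort-by=.metadata.creationTimestamp
```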
Best Practices#
1. Monitor Continuously#
- Set up alerts for thresholds
- Review metrics regularly
- Plan for growth
2. Set Realistic Limits#
- Not too high (waste resources)
- Not too low (pod crashes)
- Based on actual usage
3. Plan Capacity#
- Monitor trends
- Forecast growth
- Scale before hitting limits
4. Use Autoscaling#
- Set up HPA for applications
- Enable cluster autoscaling
- Monitor autoscaling behavior
5. Clean Up Regularly#
- Delete old images
- Remove unused deployments
- Clear temporary files
- Archive old logs
Next Steps#
- View Pod Logs - Debug issues
- Security Findings - Check security
- Deploy Applications - Use your cluster
Support#
Questions about resources?
- Check the guidelines above
- Review your metrics
- Contact support: [email protected]