Managing Cluster Resources
Monitor your cluster's health and performance with real-time metrics and resource tracking.
Understanding Cluster Resources
Cluster resources include everything your cluster uses:
- CPU: Processing power
- Memory: RAM
- Disk: Storage space
- Network: Data transfer
- Pods: Running containers
Resource Dashboard
Viewing Cluster Metrics
- Go to Clusters page
- Click a cluster to see details
- Metrics tab shows resource data
Information Displayed:
- Current CPU usage %
- Current memory usage %
- Disk space used/available
- Pod count and status
- Node health
Real-time Monitoring
Metrics update in real time once the agent is deployed:
- CPU and memory updates every 30 seconds
- Disk usage updates every 5 minutes
- Pod status updates immediately
- Node health updates continuously
CPU Metrics
Understanding CPU Usage
CPU is measured as a percentage (0-100%):
- 0-20%: Idle, plenty of capacity
- 20-50%: Normal operating range
- 50-80%: Moderate load
- 80-100%: High load, at capacity
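The bands above can be sketched as a small helper; the band names and cut-offs are taken directly from the list (illustrative only, not part of the dashboard API):

```python
def cpu_band(usage_percent: float) -> str:
    """Map a CPU usage percentage to the bands described above."""
    if not 0 <= usage_percent <= 100:
        raise ValueError("usage must be between 0 and 100")
    if usage_percent < 20:
        return "idle"      # plenty of capacity
    if usage_percent < 50:
        return "normal"    # normal operating range
    if usage_percent < 80:
        return "moderate"  # moderate load
    return "high"          # high load, at capacity
```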
CPU Guidelines
Recommended Range:
- Development: 20-50% average
- Production: 30-60% average
- Peak: Should stay below 80%
- Never: Sustained 100% usage
Managing High CPU
If CPU is consistently high:
1. Identify the cause:
- Check Pod Logs tab
- Look for error messages
- Check for runaway processes
2. Temporary solutions:
- Restart problematic pods
- Disable non-essential services
- Kill stuck processes
3. Long-term solutions:
- Scale up cluster (add nodes)
- Optimize application code
- Fix resource leaks
- Load balance better
Memory Metrics
Understanding Memory Usage
Memory is also measured as a percentage (0-100%):
- 0-30%: Plenty of available memory
- 30-60%: Normal operating range
- 60-80%: Getting full, monitor closely
- 80-100%: Critical, pods may be evicted
Memory Guidelines
Recommended Range:
- Always keep at least 20% free for the OS
- Applications: 40-70% of total
- Peak: Should stay below 80%
- Emergency: Never above 90%
Managing High Memory
If memory is consistently high:
1. Identify memory leaks:
- Check Pod Logs for leak messages
- Monitor memory trend over time
- Identify which pod is using most
2. Free up memory:
- Restart affected pods
- Delete unused deployments
- Clear caches
3. Optimize:
- Scale up (add more RAM)
- Optimize application memory
- Use smaller images
- Enable memory compression
Disk Space
Understanding Disk Usage
Disk shows used/available space:
- Free Space: How much is available
- Used Space: How much is being used
- Used Percentage: % of total
Disk Guidelines
Healthy Disk State:
- Keep at least 10% free space
- Recommended: 20-30% free
- Never drop below 5% free
- Critical: Less than 2% free
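The free-space thresholds above can be checked locally with Python's standard library; the path argument is an assumption (on a real node you would point it at the volume being monitored):

```python
import shutil

def disk_state(path: str = "/") -> str:
    """Classify free disk space per the guidelines above:
    <2% critical, <5% danger, <10% low, otherwise healthy."""
    usage = shutil.disk_usage(path)
    free_pct = usage.free / usage.total * 100
    if free_pct < 2:
        return "critical"
    if free_pct < 5:
        return "danger"
    if free_pct < 10:
        return "low"
    return "healthy"
```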
Managing Low Disk
If disk space is low:
1. Find what's using space:
- Check container images
- Look for log files
- Review persistent volumes
2. Free up space:
- Delete old images
- Clear temporary files
- Rotate old logs
- Delete unused volumes
3. Long-term:
- Add more storage
- Implement log rotation
- Use image cleanup policies
- Monitor usage regularly
Pod Monitoring
What are Pods?
Pods are running instances of your applications:
- One or more containers
- The smallest deployable unit in Kubernetes
- Can be created/destroyed dynamically
Pod Status
| Status | Meaning | Action |
|---|---|---|
| Running | Pod is healthy and running | No action needed |
| Pending | Pod is starting | Wait for startup |
| Succeeded | Pod completed (job) | Normal completion |
| Failed | Pod crashed | Check logs and investigate |
| CrashLoopBackOff | Pod keeps restarting | Fix application error |
| Unknown | Cannot determine status | Check cluster health |
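The table above can be expressed as a simple lookup, useful when scripting against pod status (a sketch only; status strings use the full Kubernetes name `CrashLoopBackOff`):

```python
# Suggested action for each pod status, mirroring the table above.
POD_STATUS_ACTIONS = {
    "Running": "No action needed",
    "Pending": "Wait for startup",
    "Succeeded": "Normal completion",
    "Failed": "Check logs and investigate",
    "CrashLoopBackOff": "Fix application error",
    "Unknown": "Check cluster health",
}

def action_for(status: str) -> str:
    """Unrecognized statuses get the same advice as Unknown."""
    return POD_STATUS_ACTIONS.get(status, "Check cluster health")
```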
Pod Health Indicators
Green (Healthy):
- All containers running
- No restarts
- Ready for traffic
Yellow (Warning):
- Frequent restarts
- High resource usage
- Slow response
Red (Error):
- Containers failing
- Crash loops
- Not responding
Node Health
What are Nodes?
Nodes are the machines that run your pods:
- Physical or virtual machines
- Have their own CPU, memory, disk
- Run multiple pods
Node Metrics
Each node shows:
- Node name/ID
- CPU capacity and usage
- Memory capacity and usage
- Disk capacity and usage
- Pod count
Node Status
Healthy Node:
- Status: Ready
- No taints or conditions
- Resources available
- All components running
Problem Node:
- Status: NotReady
- May have taints
- Resources exhausted
- Components failing
Setting Resource Limits
Understanding Limits
Limits prevent containers from using too many resources:
- Request: Minimum guaranteed resources
- Limit: Maximum allowed resources
Setting Limits
Via Nife Dashboard:
- Go to cluster details
- Click Configure
- Set CPU limits
- Set memory limits
- Click Save
Via kubectl:
```yaml
resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
```
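The `resources` block above goes under each container in a pod spec. A minimal Deployment showing its placement (the name `web` and the image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web            # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.27   # placeholder image
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"
```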
Limit Guidelines
Frontend Application:
- CPU request: 100m
- CPU limit: 500m
- Memory request: 128Mi
- Memory limit: 512Mi
Backend API:
- CPU request: 200m
- CPU limit: 1000m
- Memory request: 256Mi
- Memory limit: 1Gi
Database:
- CPU request: 500m
- CPU limit: 2000m
- Memory request: 512Mi
- Memory limit: 4Gi
Autoscaling
Horizontal Pod Autoscaling (HPA)
Automatically scales the number of pod replicas based on observed metrics:
How it works:
- Monitor CPU/memory metrics
- If above threshold, scale up (add pods)
- If below threshold, scale down (remove pods)
- Maintains target metric percentage
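The scale-up/scale-down decision follows the standard HPA sizing rule, desired = ceil(current × currentMetric / targetMetric); a sketch:

```python
import math

def desired_replicas(current: int, current_utilization: float,
                     target_utilization: float) -> int:
    """Standard HPA sizing: scale replicas in proportion to how far
    the observed metric is from the target, rounding up."""
    return math.ceil(current * current_utilization / target_utilization)

# 4 replicas at 90% CPU with a 70%-ish target scales up;
# the same 4 replicas at 30% against a 60% target scales down.
```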
When to use:
- Variable traffic patterns
- Cost optimization
- High availability needs
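A minimal HPA manifest scaling on CPU looks like the following (requires the metrics server described later in this page; `web` and the thresholds are placeholders):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web          # placeholder target
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # example target
```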
Vertical Pod Autoscaling (VPA)
Automatically adjusts resource requests/limits:
How it works:
- Monitor actual resource usage
- If using more, increase limits
- If using less, decrease limits
- Optimizes resource efficiency
When to use:
- Unknown resource requirements
- Right-sizing applications
- Cost optimization
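If the cluster has the VPA controller installed (it is an add-on, not part of core Kubernetes), a minimal manifest looks like this; `web` is a placeholder:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web          # placeholder target
  updatePolicy:
    updateMode: "Auto"  # VPA may evict pods to apply new requests
```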
Cluster Autoscaling
Automatically adds/removes nodes:
How it works:
- When pods can't fit, scale up cluster
- When nodes are idle, scale down cluster
- Maintains desired capacity
When to use:
- Dynamic workloads
- Cost savings
- Reclaiming unused capacity
Performance Optimization
1. Right-Size Applications
Before:
Pod requests: 2 CPU, 2Gi memory
Actual usage: 200m CPU, 256Mi memory
Result: Wasting 90% of resources
After:
Pod requests: 250m CPU, 512Mi memory
Actual usage: 200m CPU, 256Mi memory
Result: Optimized allocation
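The right-sizing arithmetic above can be checked with a small helper (values taken from the example; CPU waste is 90%, memory waste 87.5%):

```python
def waste_percent(requested: float, used: float) -> float:
    """Percentage of the requested resource that goes unused."""
    if requested <= 0:
        raise ValueError("requested must be positive")
    return (1 - used / requested) * 100

# Before: 2 CPUs requested, 200m (0.2 CPU) used  -> ~90% wasted
#         2Gi (2048Mi) requested, 256Mi used     -> 87.5% wasted
```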
2. Use Resource Quotas
Limit namespace resource usage:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
spec:
  hard:
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
```
3. Enable Metrics Server
Required for monitoring:
```shell
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```
4. Monitor Regularly
- Check metrics daily
- Review trends weekly
- Plan capacity monthly
- Optimize quarterly
Troubleshooting Resource Issues
Problem: Nodes Running Out of Memory
Symptoms:
- Pod evictions
- Out of memory errors
- Node goes "NotReady"
Solutions:
- Delete unnecessary pods
- Increase node memory
- Adjust pod memory limits
- Identify memory leaks
Problem: Pods Keep Crashing
Symptoms:
- CrashLoopBackOff status
- Frequent restarts
Solutions:
- Check pod logs
- Increase memory/CPU
- Fix application code
- Verify configuration
Problem: Cluster Won't Scale
Symptoms:
- Pods stuck in Pending
- Autoscale not working
Solutions:
- Check cluster capacity
- Verify autoscaling enabled
- Check resource quotas
- Add more nodes manually
Best Practices
1. Monitor Continuously
- Set up alerts for thresholds
- Review metrics regularly
- Plan for growth
2. Set Realistic Limits
- Not too high (waste resources)
- Not too low (pod crashes)
- Based on actual usage
3. Plan Capacity
- Monitor trends
- Forecast growth
- Scale before hitting limits
4. Use Autoscaling
- Set up HPA for applications
- Enable cluster autoscaling
- Monitor autoscaling behavior
5. Clean Up Regularly
- Delete old images
- Remove unused deployments
- Clear temporary files
- Archive old logs
Next Steps
- View Pod Logs - Debug issues
- Security Findings - Check security
- Deploy Applications - Use your cluster
Support
Questions about resources?
- Check the guidelines above
- Review your metrics
- Contact support: [email protected]