Managing Cluster Resources & Performance | Nife Deploy

Monitor your cluster's health and performance with real-time metrics and resource tracking.

Understanding Cluster Resources#

Cluster resources include everything your cluster uses:

  • CPU: Processing power
  • Memory: RAM
  • Disk: Storage space
  • Network: Data transfer
  • Pods: Running containers

Resource Dashboard#

Viewing Cluster Metrics#

  1. Go to Clusters page
  2. Click a cluster to see details
  3. Metrics tab shows resource data

Information Displayed:

  • Current CPU usage %
  • Current memory usage %
  • Disk space used/available
  • Pod count and status
  • Node health

Real-time Monitoring#

Metrics update in real time once the monitoring agent is deployed:

  • CPU and memory update every 30 seconds
  • Disk usage updates every 5 minutes
  • Pod status updates immediately
  • Node health updates continuously

CPU Metrics#

Understanding CPU Usage#

CPU is measured as a percentage (0-100%):

  • 0-20%: Idle, plenty of capacity
  • 20-50%: Normal operating range
  • 50-80%: Moderate load
  • 80-100%: High load, at capacity

CPU Guidelines#

Recommended Range:

  • Development: 20-50% average
  • Production: 30-60% average
  • Peak: Should stay below 80%
  • Never: Sustained 100% usage

Managing High CPU#

If CPU is consistently high:

  1. Identify the cause:

    • Check Pod Logs tab
    • Look for error messages
    • Check for runaway processes
  2. Temporary solutions:

    • Restart problematic pods
    • Disable non-essential services
    • Kill stuck processes
  3. Long-term solutions:

    • Scale up cluster (add nodes)
    • Optimize application code
    • Fix resource leaks
    • Load balance better
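The steps above map onto a few kubectl commands. This is a sketch that assumes the Kubernetes metrics server is installed; `my-app-abc123` and `my-namespace` are placeholder names:

```shell
# Rank pods by CPU consumption to find the culprit
# (requires the metrics server).
kubectl top pods --all-namespaces --sort-by=cpu

# Restart a problematic pod by deleting it; its controller
# (Deployment, ReplicaSet, etc.) recreates it automatically.
kubectl delete pod my-app-abc123 -n my-namespace
```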

Memory Metrics#

Understanding Memory Usage#

Memory is also measured as a percentage (0-100%):

  • 0-30%: Plenty of available memory
  • 30-60%: Normal operating range
  • 60-80%: Getting full, monitor closely
  • 80-100%: Critical, pods may be evicted

Memory Guidelines#

Recommended Range:

  • Always keep at least 20% free for the OS
  • Applications: 40-70% of total
  • Peak: Should stay below 80%
  • Emergency: Never above 90%

Managing High Memory#

If memory is consistently high:

  1. Identify memory leaks:

    • Check Pod Logs for leak messages
    • Monitor memory trend over time
    • Identify which pod is using most
  2. Free up memory:

    • Restart affected pods
    • Delete unused deployments
    • Clear caches
  3. Optimize:

    • Scale up (add more RAM)
    • Optimize application memory
    • Use smaller images
    • Enable memory compression
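As a rough sketch, the same workflow in kubectl (metrics server required; `my-app` and `my-namespace` are placeholders):

```shell
# Rank pods by memory consumption to find the biggest consumer.
kubectl top pods --all-namespaces --sort-by=memory

# Restart all pods in a deployment to reclaim leaked memory
# without downtime (pods are replaced one at a time).
kubectl rollout restart deployment/my-app -n my-namespace

# Remove a deployment that is no longer needed.
kubectl delete deployment old-app -n my-namespace
```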

Disk Space#

Understanding Disk Usage#

Disk shows used/available space:

  • Free Space: How much is available
  • Used Space: How much is being used
  • Used Percentage: % of total

Disk Guidelines#

Healthy Disk State:

  • Keep at least 10% free space
  • Recommended: 20-30% free
  • Never drop below 5% free
  • Critical: Less than 2% free

Managing Low Disk#

If disk space is low:

  1. Find what's using space:

    • Check container images
    • Look for log files
    • Review persistent volumes
  2. Free up space:

    • Delete old images
    • Clear temporary files
    • Rotate old logs
    • Delete unused volumes
  3. Long-term:

    • Add more storage
    • Implement log rotation
    • Use image cleanup policies
    • Monitor usage regularly
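The cleanup steps above can be sketched as shell commands run on the node itself. Which prune command applies depends on the container runtime, so treat these as examples rather than a fixed recipe:

```shell
# Check overall disk usage on the node.
df -h /

# On containerd-based nodes: remove unused container images.
crictl rmi --prune

# On Docker-based nodes: prune unused images, containers,
# and build cache instead.
docker system prune -a
```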

Pod Monitoring#

What are Pods?#

Pods are running instances of your applications:

  • One or more containers
  • Basic deployable unit
  • Can be created/destroyed dynamically

Pod Status#

| Status | Meaning | Action |
| --- | --- | --- |
| Running | Pod is healthy and running | No action needed |
| Pending | Pod is starting | Wait for startup |
| Succeeded | Pod completed (job) | Normal completion |
| Failed | Pod crashed | Check logs and investigate |
| CrashLoopBackOff | Pod keeps restarting | Fix the application error |
| Unknown | Cannot determine status | Check cluster health |
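You can see these statuses directly with kubectl; `my-app-abc123` and `my-namespace` below are placeholders:

```shell
# List pods and their current status across all namespaces.
kubectl get pods --all-namespaces

# Watch status changes live as pods start and stop.
kubectl get pods -n my-namespace --watch

# Show recent events for a pod stuck in Pending or CrashLoopBackOff.
kubectl describe pod my-app-abc123 -n my-namespace
```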

Pod Health Indicators#

Green (Healthy):

  • All containers running
  • No restarts
  • Ready for traffic

Yellow (Warning):

  • Frequent restarts
  • High resource usage
  • Slow response

Red (Error):

  • Containers failing
  • Crash loops
  • Not responding

Node Health#

What are Nodes?#

Nodes are the machines that run your pods:

  • Physical or virtual machines
  • Have their own CPU, memory, disk
  • Run multiple pods

Node Metrics#

Each node shows:

  • Node name/ID
  • CPU capacity and usage
  • Memory capacity and usage
  • Disk capacity and usage
  • Pod count

Node Status#

Healthy Node:

  • Status: Ready
  • No taints or conditions
  • Resources available
  • All components running

Problem Node:

  • Status: NotReady
  • May have taints
  • Resources exhausted
  • Components failing
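A quick way to check these node properties from the command line (`kubectl top` assumes the metrics server; `my-node-1` is a placeholder):

```shell
# Node status: Ready vs NotReady, plus age and version.
kubectl get nodes

# Per-node CPU and memory usage (requires the metrics server).
kubectl top nodes

# Full detail: conditions, taints, capacity, and allocated resources.
kubectl describe node my-node-1
```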

Setting Resource Limits#

Understanding Limits#

Limits prevent containers from using too many resources:

  • Request: Minimum guaranteed resources
  • Limit: Maximum allowed resources

Setting Limits#

Via Nife Dashboard:

  1. Go to cluster details
  2. Click Configure
  3. Set CPU limits
  4. Set memory limits
  5. Click Save

Via kubectl, inside a container spec (spec.containers[].resources):

```yaml
resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
```

Limit Guidelines#

Frontend Application:

  • CPU request: 100m
  • CPU limit: 500m
  • Memory request: 128Mi
  • Memory limit: 512Mi

Backend API:

  • CPU request: 200m
  • CPU limit: 1000m
  • Memory request: 256Mi
  • Memory limit: 1Gi

Database:

  • CPU request: 500m
  • CPU limit: 2000m
  • Memory request: 512Mi
  • Memory limit: 4Gi

Autoscaling#

Horizontal Pod Autoscaling (HPA)#

Automatically scales the number of pods based on observed metrics:

How it works:

  1. Monitor CPU/memory metrics
  2. If above threshold, scale up (add pods)
  3. If below threshold, scale down (remove pods)
  4. Maintains target metric percentage
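The loop above can be expressed as a standard HPA manifest. This sketch targets 70% average CPU; the deployment name `my-app` and the replica bounds are placeholder choices:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```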

When to use:

  • Variable traffic patterns
  • Cost optimization
  • High availability needs

Vertical Pod Autoscaling (VPA)#

Automatically adjusts resource requests/limits:

How it works:

  1. Monitor actual resource usage
  2. If using more, increase limits
  3. If using less, decrease limits
  4. Optimizes resource efficiency
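A minimal VPA manifest looks like the sketch below. Note that VPA is an add-on (the Kubernetes autoscaler project), not built into every cluster; `my-app` is a placeholder:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto"  # VPA applies its recommendations automatically
```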

When to use:

  • Unknown resource requirements
  • Right-sizing applications
  • Cost optimization

Cluster Autoscaling#

Automatically adds/removes nodes:

How it works:

  1. When pods can't fit, scale up cluster
  2. When nodes are idle, scale down cluster
  3. Maintains desired capacity

When to use:

  • Dynamic workloads
  • Cost savings
  • Reclaiming unused capacity

Performance Optimization#

1. Right-Size Applications#

Before:

Pod requests: 2 CPU, 2Gi memory
Actual usage: 200m CPU, 256Mi memory
Result: Wasting 90% of resources

After:

Pod requests: 250m CPU, 512Mi memory
Actual usage: 200m CPU, 256Mi memory
Result: Optimized allocation

2. Use Resource Quotas#

Limit namespace resource usage:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
spec:
  hard:
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
```
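To apply the quota and check consumption against it (the filename `compute-quota.yaml` and namespace `my-namespace` are placeholders):

```shell
# Apply the quota to a namespace.
kubectl apply -f compute-quota.yaml -n my-namespace

# Show current usage vs the hard limits.
kubectl describe resourcequota compute-quota -n my-namespace
```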

3. Enable Metrics Server#

Required for monitoring:

```shell
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```

4. Monitor Regularly#

  • Check metrics daily
  • Review trends weekly
  • Plan capacity monthly
  • Optimize quarterly

Troubleshooting Resource Issues#

Problem: Nodes Running Out of Memory#

Symptoms:

  • Pod evictions
  • Out of memory errors
  • Node goes "NotReady"

Solutions:

  1. Delete unnecessary pods
  2. Increase node memory
  3. Adjust pod memory limits
  4. Identify memory leaks

Problem: Pods Keep Crashing#

Symptoms:

  • CrashLoopBackOff status
  • Frequent restarts

Solutions:

  1. Check pod logs
  2. Increase memory/CPU
  3. Fix application code
  4. Verify configuration
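For step 1, remember that a crash-looping container's logs are usually in its previous instance (`my-app-abc123` and `my-namespace` are placeholders):

```shell
# Logs from the current container instance.
kubectl logs my-app-abc123 -n my-namespace

# Logs from the previous (crashed) instance — often where
# the actual error message is.
kubectl logs my-app-abc123 -n my-namespace --previous

# Exit code, restart count, and last state of each container.
kubectl describe pod my-app-abc123 -n my-namespace
```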

Problem: Cluster Won't Scale#

Symptoms:

  • Pods stuck in Pending
  • Autoscale not working

Solutions:

  1. Check cluster capacity
  2. Verify autoscaling enabled
  3. Check resource quotas
  4. Add more nodes manually
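These checks can be sketched with kubectl; names are placeholders:

```shell
# Why is the pod Pending? The Events section usually says
# "Insufficient cpu", "Insufficient memory", or "exceeded quota".
kubectl describe pod my-app-abc123 -n my-namespace

# Compare requested vs allocatable resources on each node.
kubectl describe nodes | grep -A 5 "Allocated resources"

# Check for namespace quotas that may be blocking scheduling.
kubectl get resourcequota -n my-namespace
```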

Best Practices#

1. Monitor Continuously#

  • Set up alerts for thresholds
  • Review metrics regularly
  • Plan for growth

2. Set Realistic Limits#

  • Not too high (waste resources)
  • Not too low (pod crashes)
  • Based on actual usage

3. Plan Capacity#

  • Monitor trends
  • Forecast growth
  • Scale before hitting limits

4. Use Autoscaling#

  • Set up HPA for applications
  • Enable cluster autoscaling
  • Monitor autoscaling behavior

5. Clean Up Regularly#

  • Delete old images
  • Remove unused deployments
  • Clear temporary files
  • Archive old logs

Next Steps#

  1. View Pod Logs - Debug issues
  2. Security Findings - Check security
  3. Deploy Applications - Use your cluster

Support#

Questions about resources?