Managing Cluster Resources & Performance | Nife Deploy

Monitor your cluster's health and performance with real-time metrics and resource tracking.

Understanding Cluster Resources#

Cluster resources include everything your cluster uses:

  • CPU: Processing power
  • Memory: RAM
  • Disk: Storage space
  • Network: Data transfer
  • Pods: Running containers

Resource Dashboard#

Viewing Cluster Metrics#

  1. Go to Clusters page
  2. Click a cluster to see details
  3. Metrics tab shows resource data

Information Displayed:

  • Current CPU usage %
  • Current memory usage %
  • Disk space used/available
  • Pod count and status
  • Node health

Real-time Monitoring#

Metrics update in real time once the monitoring agent is deployed:

  • CPU and memory update every 30 seconds
  • Disk usage updates every 5 minutes
  • Pod status updates immediately
  • Node health updates continuously

CPU Metrics#

Understanding CPU Usage#

CPU is measured as a percentage (0-100%):

  • 0-20%: Idle, plenty of capacity
  • 20-50%: Normal operating range
  • 50-80%: Moderate load
  • 80-100%: High load, at capacity

CPU Guidelines#

Recommended Range:

  • Development: 20-50% average
  • Production: 30-60% average
  • Peak: Should stay below 80%
  • Never: Sustained 100% usage

Managing High CPU#

If CPU is consistently high:

  1. Identify the cause:

    • Check Pod Logs tab
    • Look for error messages
    • Check for runaway processes
  2. Temporary solutions:

    • Restart problematic pods
    • Disable non-essential services
    • Kill stuck processes
  3. Long-term solutions:

    • Scale up cluster (add nodes)
    • Optimize application code
    • Fix resource leaks
    • Load balance better
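The steps above map onto a few kubectl commands. This is a sketch that assumes the Kubernetes metrics server is installed; `my-app-abc123` and `my-namespace` are placeholder names:

```shell
# Rank pods by CPU consumption to find the culprit
# (requires the metrics server).
kubectl top pods --all-namespaces --sort-by=cpu

# Restart a problematic pod by deleting it; its controller
# (Deployment, ReplicaSet, etc.) recreates it automatically.
kubectl delete pod my-app-abc123 -n my-namespace
```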

Memory Metrics#

Understanding Memory Usage#

Memory is also measured as a percentage (0-100%):

  • 0-30%: Plenty of available memory
  • 30-60%: Normal operating range
  • 60-80%: Getting full, monitor closely
  • 80-100%: Critical, pods may be evicted

Memory Guidelines#

Recommended Range:

  • Always keep at least 20% free for the OS
  • Applications: 40-70% of total
  • Peak: Should stay below 80%
  • Emergency: Never above 90%

Managing High Memory#

If memory is consistently high:

  1. Identify memory leaks:

    • Check Pod Logs for leak messages
    • Monitor memory trend over time
    • Identify which pod is using most
  2. Free up memory:

    • Restart affected pods
    • Delete unused deployments
    • Clear caches
  3. Optimize:

    • Scale up (add more RAM)
    • Optimize application memory
    • Use smaller images
    • Enable memory compression
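As a rough sketch, the same workflow in kubectl (metrics server required; `my-app` and `my-namespace` are placeholders):

```shell
# Rank pods by memory consumption to find the biggest consumer.
kubectl top pods --all-namespaces --sort-by=memory

# Restart all pods in a deployment to reclaim leaked memory
# without downtime (pods are replaced one at a time).
kubectl rollout restart deployment/my-app -n my-namespace

# Remove a deployment that is no longer needed.
kubectl delete deployment old-app -n my-namespace
```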

Disk Space#

Understanding Disk Usage#

Disk shows used/available space:

  • Free Space: How much is available
  • Used Space: How much is being used
  • Used Percentage: % of total

Disk Guidelines#

Healthy Disk State:

  • Keep at least 10% free space
  • Recommended: 20-30% free
  • Never drop below 5% free
  • Critical: Less than 2% free

Managing Low Disk#

If disk space is low:

  1. Find what's using space:

    • Check container images
    • Look for log files
    • Review persistent volumes
  2. Free up space:

    • Delete old images
    • Clear temporary files
    • Rotate old logs
    • Delete unused volumes
  3. Long-term:

    • Add more storage
    • Implement log rotation
    • Use image cleanup policies
    • Monitor usage regularly
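The cleanup steps above can be sketched as shell commands run on the node itself. Which prune command applies depends on the container runtime, so treat these as examples rather than a fixed recipe:

```shell
# Check overall disk usage on the node.
df -h /

# On containerd-based nodes: remove unused container images.
crictl rmi --prune

# On Docker-based nodes: prune unused images, containers,
# and build cache instead.
docker system prune -a
```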

Pod Monitoring#

What are Pods?#

Pods are running instances of your applications:

  • One or more containers
  • Basic deployable unit
  • Can be created/destroyed dynamically

Pod Status#

| Status | Meaning | Action |
| --- | --- | --- |
| Running | Pod is healthy and running | No action needed |
| Pending | Pod is starting | Wait for startup |
| Succeeded | Pod completed (job) | Normal completion |
| Failed | Pod crashed | Check logs and investigate |
| CrashLoopBackOff | Pod keeps restarting | Fix the application error |
| Unknown | Cannot determine status | Check cluster health |
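You can see these statuses directly with kubectl; `my-app-abc123` and `my-namespace` below are placeholders:

```shell
# List pods and their current status across all namespaces.
kubectl get pods --all-namespaces

# Watch status changes live as pods start and stop.
kubectl get pods -n my-namespace --watch

# Show recent events for a pod stuck in Pending or CrashLoopBackOff.
kubectl describe pod my-app-abc123 -n my-namespace
```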

Pod Health Indicators#

Green (Healthy):

  • All containers running
  • No restarts
  • Ready for traffic

Yellow (Warning):

  • Frequent restarts
  • High resource usage
  • Slow response

Red (Error):

  • Containers failing
  • Crash loops
  • Not responding

Node Health#

What are Nodes?#

Nodes are the machines that run your pods:

  • Physical or virtual machines
  • Have their own CPU, memory, disk
  • Run multiple pods

Node Metrics#

Each node shows:

  • Node name/ID
  • CPU capacity and usage
  • Memory capacity and usage
  • Disk capacity and usage
  • Pod count

Node Status#

Healthy Node:

  • Status: Ready
  • No taints or conditions
  • Resources available
  • All components running

Problem Node:

  • Status: NotReady
  • May have taints
  • Resources exhausted
  • Components failing
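A quick way to check these node properties from the command line (`kubectl top` assumes the metrics server; `my-node-1` is a placeholder):

```shell
# Node status: Ready vs NotReady, plus age and version.
kubectl get nodes

# Per-node CPU and memory usage (requires the metrics server).
kubectl top nodes

# Full detail: conditions, taints, capacity, and allocated resources.
kubectl describe node my-node-1
```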

Setting Resource Limits#

Understanding Limits#

Limits prevent containers from using too many resources:

  • Request: Minimum guaranteed resources
  • Limit: Maximum allowed resources

Setting Limits#

Via Nife Dashboard:

  1. Go to cluster details
  2. Click Configure
  3. Set CPU limits
  4. Set memory limits
  5. Click Save

Via kubectl, inside a container spec (spec.containers[].resources):

```yaml
resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
```

Limit Guidelines#

Frontend Application:

  • CPU request: 100m
  • CPU limit: 500m
  • Memory request: 128Mi
  • Memory limit: 512Mi

Backend API:

  • CPU request: 200m
  • CPU limit: 1000m
  • Memory request: 256Mi
  • Memory limit: 1Gi

Database:

  • CPU request: 500m
  • CPU limit: 2000m
  • Memory request: 512Mi
  • Memory limit: 4Gi

Autoscaling#

Horizontal Pod Autoscaling (HPA)#

Automatically scales the number of pods based on observed metrics:

How it works:

  1. Monitor CPU/memory metrics
  2. If above threshold, scale up (add pods)
  3. If below threshold, scale down (remove pods)
  4. Maintains target metric percentage
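The loop above can be expressed as a standard HPA manifest. This sketch targets 70% average CPU; the deployment name `my-app` and the replica bounds are placeholder choices:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```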

When to use:

  • Variable traffic patterns
  • Cost optimization
  • High availability needs

Vertical Pod Autoscaling (VPA)#

Automatically adjusts resource requests/limits:

How it works:

  1. Monitor actual resource usage
  2. If using more, increase limits
  3. If using less, decrease limits
  4. Optimizes resource efficiency
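A minimal VPA manifest looks like the sketch below. Note that VPA is an add-on (the Kubernetes autoscaler project), not built into every cluster; `my-app` is a placeholder:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto"  # VPA applies its recommendations automatically
```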

When to use:

  • Unknown resource requirements
  • Right-sizing applications
  • Cost optimization

Cluster Autoscaling#

Automatically adds/removes nodes:

How it works:

  1. When pods can't fit, scale up cluster
  2. When nodes are idle, scale down cluster
  3. Maintains desired capacity

When to use:

  • Dynamic workloads
  • Cost savings
  • Reclaiming unused capacity

Performance Optimization#

1. Right-Size Applications#

Before:

Pod requests: 2 CPU, 2Gi memory
Actual usage: 200m CPU, 256Mi memory
Result: Wasting 90% of resources

After:

Pod requests: 250m CPU, 512Mi memory
Actual usage: 200m CPU, 256Mi memory
Result: Optimized allocation

2. Use Resource Quotas#

Limit namespace resource usage:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
spec:
  hard:
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
```
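To apply the quota and check consumption against it (the filename `compute-quota.yaml` and namespace `my-namespace` are placeholders):

```shell
# Apply the quota to a namespace.
kubectl apply -f compute-quota.yaml -n my-namespace

# Show current usage vs the hard limits.
kubectl describe resourcequota compute-quota -n my-namespace
```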

3. Enable Metrics Server#

Required for monitoring:

```shell
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```

4. Monitor Regularly#

  • Check metrics daily
  • Review trends weekly
  • Plan capacity monthly
  • Optimize quarterly

Troubleshooting Resource Issues#

Problem: Nodes Running Out of Memory#

Symptoms:

  • Pod evictions
  • Out of memory errors
  • Node goes "NotReady"

Solutions:

  1. Delete unnecessary pods
  2. Increase node memory
  3. Adjust pod memory limits
  4. Identify memory leaks

Problem: Pods Keep Crashing#

Symptoms:

  • CrashLoopBackOff status
  • Frequent restarts

Solutions:

  1. Check pod logs
  2. Increase memory/CPU
  3. Fix application code
  4. Verify configuration
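For step 1, remember that a crash-looping container's logs are usually in its previous instance (`my-app-abc123` and `my-namespace` are placeholders):

```shell
# Logs from the current container instance.
kubectl logs my-app-abc123 -n my-namespace

# Logs from the previous (crashed) instance — often where
# the actual error message is.
kubectl logs my-app-abc123 -n my-namespace --previous

# Exit code, restart count, and last state of each container.
kubectl describe pod my-app-abc123 -n my-namespace
```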

Problem: Cluster Won't Scale#

Symptoms:

  • Pods stuck in Pending
  • Autoscale not working

Solutions:

  1. Check cluster capacity
  2. Verify autoscaling enabled
  3. Check resource quotas
  4. Add more nodes manually
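These checks can be sketched with kubectl; names are placeholders:

```shell
# Why is the pod Pending? The Events section usually says
# "Insufficient cpu", "Insufficient memory", or "exceeded quota".
kubectl describe pod my-app-abc123 -n my-namespace

# Compare requested vs allocatable resources on each node.
kubectl describe nodes | grep -A 5 "Allocated resources"

# Check for namespace quotas that may be blocking scheduling.
kubectl get resourcequota -n my-namespace
```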

Best Practices#

1. Monitor Continuously#

  • Set up alerts for thresholds
  • Review metrics regularly
  • Plan for growth

2. Set Realistic Limits#

  • Not too high (waste resources)
  • Not too low (pod crashes)
  • Based on actual usage

3. Plan Capacity#

  • Monitor trends
  • Forecast growth
  • Scale before hitting limits

4. Use Autoscaling#

  • Set up HPA for applications
  • Enable cluster autoscaling
  • Monitor autoscaling behavior

5. Clean Up Regularly#

  • Delete old images
  • Remove unused deployments
  • Clear temporary files
  • Archive old logs

Next Steps#

  1. View Pod Logs - Debug issues
  2. Security Findings - Check security
  3. Deploy Applications - Use your cluster

Support#

Questions about resources?