Monitoring
Monitor cluster health, resource utilization, and application performance through the Platform interface.
Cluster Metrics
Overview Dashboard
View cluster-wide metrics at a glance:
- CPU Utilization: Total CPU usage across all nodes
- Memory Utilization: Total memory usage across all nodes
- Pod Count: Number of running pods
- Node Status: Healthy vs. unhealthy nodes
- Storage Usage: Persistent volume utilization
Node Metrics
Monitor individual node performance:
- CPU: Current usage, requests, limits, capacity
- Memory: Current usage, requests, limits, capacity
- Disk: Used vs. available disk space
- Network: Ingress/egress bandwidth
- Pods: Number of pods running on node
- Conditions: Ready, MemoryPressure, DiskPressure, PIDPressure
Pod Metrics
Track pod-level resource consumption:
- CPU Usage: Current CPU utilization
- Memory Usage: Current memory utilization
- Restart Count: Number of pod restarts
- Age: How long pod has been running
- Status: Running, Pending, Failed, etc.
Application Performance
Container Metrics
Monitor individual container performance:
- CPU Usage: Per-container CPU consumption
- Memory Usage: Per-container memory consumption
- Disk I/O: Read/write operations per second
- Network I/O: Bytes sent/received
- Resource Limits: Current usage relative to the configured limits
Service Metrics
Track service-level metrics:
- Request Rate: Requests per second
- Error Rate: Failed requests percentage
- Latency: Response time percentiles (p50, p95, p99); see the sketch after this list
- Active Connections: Current connection count
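The percentile figures are cut points over the observed response times. As a rough illustration of the arithmetic behind p50/p95/p99 (the sample latencies below are made up, not pulled from the platform):
# Illustration only: computing p50/p95/p99 from a list of response times.
# sample_latencies is fabricated data for the example.
import statistics

sample_latencies = [0.021, 0.034, 0.025, 0.480, 0.030, 0.027, 0.950, 0.022, 0.031, 0.029]

p50 = statistics.median(sample_latencies)
cuts = statistics.quantiles(sample_latencies, n=100)  # 1st..99th percentile cut points
p95, p99 = cuts[94], cuts[98]

print(f"p50={p50:.3f}s  p95={p95:.3f}s  p99={p99:.3f}s")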
Alerts & Notifications
Setting Up Alerts
Create alerts for critical conditions:
- Go to Platform → Monitoring → Alerts
- Click Create Alert
- Configure the alert:
  - Metric: CPU, memory, disk, etc.
  - Threshold: When to trigger the alert
  - Duration: How long the condition must persist
  - Severity: Critical, warning, or info
- Set notification channels:
  - Slack
  - PagerDuty
  - Webhook (a minimal receiver sketch follows these steps)
- Click Create Alert
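If you pick the webhook channel, the platform calls an HTTP endpoint you run, typically a POST with a JSON body. A minimal receiver sketch is below; the field names (alertname, severity) are assumptions for illustration, so check the platform's webhook documentation for the actual payload schema.
# Minimal webhook receiver sketch (payload fields are assumed, not the
# platform's documented schema).
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AlertWebhook(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        payload = json.loads(self.rfile.read(length) or b'{}')
        # Log whatever the platform sends; these field names are guesses.
        print("alert received:", payload.get('alertname'), payload.get('severity'))
        self.send_response(200)
        self.end_headers()

if __name__ == '__main__':
    HTTPServer(('0.0.0.0', 8080), AlertWebhook).serve_forever()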
Common Alert Rules
Pre-configured alerts for common issues:
- High CPU Usage: Node CPU > 80% for 5 minutes
- High Memory Usage: Node memory > 85% for 5 minutes
- Pod Crashes: Pod restart count > 5 in 10 minutes
- Disk Space Low: Node disk usage > 85%
- Node Not Ready: Node becomes NotReady
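If you manage alerting as code, the same thresholds can be written as PromQL. The expressions below are a sketch that assumes the common node-exporter, kubelet, and kube-state-metrics metric names; verify them against what your cluster actually exposes. The "for 5 minutes" part belongs to the alert's duration setting, not the expression itself.
# PromQL equivalents of the pre-configured rules above.
# Metric names assume node-exporter / kube-state-metrics defaults; adjust as needed.
ALERT_EXPRESSIONS = {
    "HighCPUUsage":    '100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80',
    "HighMemoryUsage": '(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85',
    "PodCrashes":      'increase(kube_pod_container_status_restarts_total[10m]) > 5',
    "DiskSpaceLow":    '(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 85',
    "NodeNotReady":    'kube_node_status_condition{condition="Ready",status="true"} == 0',
}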
Alert States
Alerts can be in these states:
- OK: Condition not met, no issue
- Pending: Condition met; waiting for the configured duration to elapse
- Firing: Alert triggered, notifications sent
- Resolved: Condition no longer met
Health Checks
Liveness Probes
Check if container is running:
- HTTP GET: Check endpoint returns 200-399
- TCP Socket: Check port is open
- Exec: Run command, check exit code 0
- Failure Action: Restart container
Readiness Probes
Check if the container is ready to serve traffic:
- Same Methods: HTTP, TCP, Exec
- Failure Action: Remove from service endpoints
- Use Case: Don't send traffic until ready
Startup Probes
Check if application has started:
- Use Case: Slow-starting applications
- Disables Other Probes: Liveness and readiness checks are held off until the startup probe succeeds
- Failure Action: Restart container
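For the HTTP GET variants of these probes, the application only has to answer with a status in the 200-399 range while healthy. A minimal sketch of the application side using the standard library; the /healthz and /readyz paths and port 8080 are conventions chosen for the example, not platform requirements:
# Minimal health endpoints for HTTP liveness/readiness probes.
from http.server import BaseHTTPRequestHandler, HTTPServer

ready = False  # flip to True once caches are warm, connections are open, etc.

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/healthz':
            self.send_response(200)                     # liveness: process is alive
        elif self.path == '/readyz':
            self.send_response(200 if ready else 503)   # readiness: OK only when ready
        else:
            self.send_response(404)
        self.end_headers()

if __name__ == '__main__':
    HTTPServer(('0.0.0.0', 8080), HealthHandler).serve_forever()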
Performance Dashboards
Pre-Built Dashboards
Access ready-made dashboards:
- Cluster Overview: High-level cluster health
- Node Performance: Per-node resource usage
- Namespace Usage: Resource consumption by namespace
- Pod Performance: Individual pod metrics
- Storage Metrics: PV/PVC utilization
Custom Dashboards
Create your own dashboards:
- Go to Monitoring → Dashboards
- Click Create Dashboard
- Add panels:
  - Select metric to visualize
  - Choose visualization type (line, bar, gauge)
  - Set time range and refresh interval
- Arrange panels in layout
- Save and share dashboard
Metrics Collection
Prometheus Integration
Metrics are collected via Prometheus:
- Scrape Interval: Every 30 seconds
- Retention: 15 days by default
- Storage: Persistent volume for metrics
- Query Language: PromQL for custom queries
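Beyond the built-in dashboards, you can run PromQL yourself against the Prometheus HTTP API. A minimal sketch, assuming the server is reachable at http://prometheus:9090 (the actual address depends on how the platform exposes its Prometheus instance) and the standard cAdvisor CPU metric:
# Ad-hoc PromQL query via the Prometheus HTTP API.
# The prometheus:9090 address is an assumption; use whatever endpoint the platform exposes.
import requests

PROM_URL = "http://prometheus:9090"

resp = requests.get(
    f"{PROM_URL}/api/v1/query",
    params={"query": 'sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)'},
    timeout=10,
)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"].get("namespace"), result["value"][1])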
Custom Metrics
Expose application metrics:
# Python example with prometheus_client
from prometheus_client import Counter, Gauge, Histogram, start_http_server
import time

# Define metrics
requests_total = Counter('app_requests_total', 'Total requests')
active_users = Gauge('app_active_users', 'Active users')
request_duration = Histogram('app_request_duration_seconds', 'Request duration')

# Instrument code
@request_duration.time()
def handle_request():
    requests_total.inc()
    # Your application logic
    time.sleep(0.1)

# Expose metrics on port 8000 (/metrics) so Prometheus can scrape them
start_http_server(8000)
Troubleshooting with Monitoring
High CPU Usage
- Check Node Metrics to identify affected node
- View Pod Metrics to find high-CPU pods
- Check pod Logs for issues
- Scale deployment or increase resources
Memory Leaks
- Monitor Memory Usage over time
- Look for steadily increasing memory (a query sketch follows this list)
- Check for OOMKilled pods in events
- Increase memory limits or fix application
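To confirm the upward trend, a range query over a few hours makes the pattern easy to see. The sketch below reuses the assumed http://prometheus:9090 endpoint and the cAdvisor container_memory_working_set_bytes metric; the pod name is a placeholder.
# Range query: working-set memory for one pod over the last 6 hours.
# The endpoint, metric name, and pod label are assumptions/placeholders.
import time
import requests

PROM_URL = "http://prometheus:9090"
now = time.time()

resp = requests.get(
    f"{PROM_URL}/api/v1/query_range",
    params={
        "query": 'container_memory_working_set_bytes{pod="my-app-xyz"}',
        "start": now - 6 * 3600,
        "end": now,
        "step": "5m",
    },
    timeout=10,
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    values = [float(v) for _, v in series["values"]]
    # A series that keeps rising and never flattens out is the classic leak signature.
    print(series["metric"].get("container"), "first:", values[0], "last:", values[-1])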
Slow Response Times
- Check Service Metrics for latency spikes
- Correlate with CPU/Memory usage
- Review Pod Logs for errors
- Check Network I/O for bottlenecks
Best Practice
Set up alerts before issues occur. Monitor trends over time to catch problems early and plan capacity.