Monitoring

Monitor cluster health, resource utilization, and application performance through the Platform interface.

Cluster Metrics

Overview Dashboard

View cluster-wide metrics at a glance:

  • CPU Utilization: Total CPU usage across all nodes
  • Memory Utilization: Total memory usage across all nodes
  • Pod Count: Number of running pods
  • Node Status: Healthy vs. unhealthy nodes
  • Storage Usage: Persistent volume utilization

Node Metrics

Monitor individual node performance:

  • CPU: Current usage, requests, limits, capacity
  • Memory: Current usage, requests, limits, capacity
  • Disk: Used vs. available disk space
  • Network: Ingress/egress bandwidth
  • Pods: Number of pods running on node
  • Conditions: Ready, MemoryPressure, DiskPressure, PIDPressure

Pod Metrics

Track pod-level resource consumption:

  • CPU Usage: Current CPU utilization
  • Memory Usage: Current memory utilization
  • Restart Count: Number of pod restarts
  • Age: How long pod has been running
  • Status: Running, Pending, Failed, etc.

Application Performance

Container Metrics

Monitor individual container performance:

  • CPU Usage: Per-container CPU consumption
  • Memory Usage: Per-container memory consumption
  • Disk I/O: Read/write operations per second
  • Network I/O: Bytes sent/received
  • Resource Limits: Current usage relative to configured limits

Service Metrics

Track service-level metrics:

  • Request Rate: Requests per second
  • Error Rate: Failed requests percentage
  • Latency: Response time percentiles (p50, p95, p99)
  • Active Connections: Current connection count
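
The service metrics above map naturally onto PromQL queries. A sketch, assuming a counter named `http_requests_total` (with a `status` label) and a histogram named `http_request_duration_seconds` — substitute your application's actual metric names:

```
# Request rate: requests per second over the last 5 minutes
sum(rate(http_requests_total[5m]))

# Error rate: 5xx responses as a percentage of all requests
100 * sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m]))

# p95 latency from the histogram's buckets
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```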

Alerts & Notifications

Setting Up Alerts

Create alerts for critical conditions:

  1. Go to Platform → Monitoring → Alerts
  2. Click Create Alert
  3. Configure alert:
    • Metric: CPU, memory, disk, etc.
    • Threshold: When to trigger alert
    • Duration: How long condition must persist
    • Severity: Critical, warning, info
  4. Set notification channels:
    • Email
    • Slack
    • PagerDuty
    • Webhook
  5. Click Create Alert

Common Alert Rules

Pre-configured alerts for common issues:

  • High CPU Usage: Node CPU > 80% for 5 minutes
  • High Memory Usage: Node memory > 85% for 5 minutes
  • Pod Crashes: Pod restart count > 5 in 10 minutes
  • Disk Space Low: Node disk usage > 85%
  • Node Not Ready: Node becomes NotReady
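
Alerts like these are configured through the UI steps above, but if your cluster's alerting is backed by Prometheus, the "High CPU Usage" rule could be expressed as an alerting rule roughly like the following. This is a sketch assuming node_exporter metrics (`node_cpu_seconds_total`); the group and alert names are illustrative:

```yaml
groups:
  - name: node-alerts
    rules:
      - alert: HighCPUUsage
        # CPU utilization = 100% minus the idle fraction, averaged per node
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m              # condition must persist for 5 minutes
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} CPU above 80% for 5 minutes"
```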

Alert States

Alerts can be in these states:

  • OK: Condition not met, no issue
  • Pending: Condition met, waiting for duration
  • Firing: Alert triggered, notifications sent
  • Resolved: Condition no longer met
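
The state transitions above can be sketched as a small evaluation loop: an alert moves from OK to Pending when its condition is first met, to Firing once the condition has persisted for the configured duration, and to Resolved when the condition clears. This is a minimal illustration of the lifecycle, not a real platform API:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    duration: int          # evaluations the condition must persist before firing
    state: str = "OK"
    met_for: int = 0       # consecutive evaluations with the condition met

    def evaluate(self, condition_met: bool) -> str:
        if condition_met:
            self.met_for += 1
            if self.met_for >= self.duration:
                self.state = "Firing"
            elif self.state in ("OK", "Resolved"):
                self.state = "Pending"
        else:
            # Condition cleared: a firing alert resolves, otherwise return to OK
            self.met_for = 0
            self.state = "Resolved" if self.state == "Firing" else "OK"
        return self.state

# Example: duration of 3 evaluations
alert = Alert(duration=3)
for met in (True, True, True, False):
    print(alert.evaluate(met))   # Pending, Pending, Firing, Resolved
```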

Health Checks

Liveness Probes

Check whether the container is still running:

  • HTTP GET: Check endpoint returns 200-399
  • TCP Socket: Check port is open
  • Exec: Run command, check exit code 0
  • Failure Action: Restart container

Readiness Probes

Check whether the container is ready to serve traffic:

  • Same Methods: HTTP, TCP, Exec
  • Failure Action: Remove from service endpoints
  • Use Case: Don't send traffic until ready

Startup Probes

Check if application has started:

  • Use Case: Slow-starting applications
  • Disables Other Probes: Until startup succeeds
  • Failure Action: Restart container
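
All three probe types can be declared together on a single container. A minimal sketch in standard Kubernetes pod-spec form — the image, ports, and endpoint paths are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: app
      image: example/app:latest        # placeholder image
      ports:
        - containerPort: 8080
      startupProbe:                    # gates the other probes until it succeeds
        httpGet: { path: /healthz, port: 8080 }
        failureThreshold: 30           # allow up to 30 x 10s for slow startup
        periodSeconds: 10
      livenessProbe:                   # failure restarts the container
        httpGet: { path: /healthz, port: 8080 }
        periodSeconds: 10
      readinessProbe:                  # failure removes pod from service endpoints
        httpGet: { path: /ready, port: 8080 }
        periodSeconds: 5
```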

Performance Dashboards

Pre-Built Dashboards

Access ready-made dashboards:

  • Cluster Overview: High-level cluster health
  • Node Performance: Per-node resource usage
  • Namespace Usage: Resource consumption by namespace
  • Pod Performance: Individual pod metrics
  • Storage Metrics: PV/PVC utilization

Custom Dashboards

Create your own dashboards:

  1. Go to Monitoring → Dashboards
  2. Click Create Dashboard
  3. Add panels:
    • Select metric to visualize
    • Choose visualization type (line, bar, gauge)
    • Set time range and refresh interval
  4. Arrange panels in layout
  5. Save and share dashboard

Metrics Collection

Prometheus Integration

Metrics are collected via Prometheus:

  • Scrape Interval: Every 30 seconds
  • Retention: 15 days by default
  • Storage: Persistent volume for metrics
  • Query Language: PromQL for custom queries
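
A few PromQL queries you might run against the collected metrics. These assume node_exporter and kube-state-metrics are installed (as is common in Prometheus-based cluster monitoring); adjust metric names if your setup differs:

```
# Cluster-wide CPU utilization (%)
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory utilization (%) per node
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Containers that restarted in the last 10 minutes
increase(kube_pod_container_status_restarts_total[10m]) > 0
```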

Custom Metrics

Expose application metrics:

# Python example with prometheus_client
from prometheus_client import Counter, Gauge, Histogram, start_http_server
import time

# Define metrics
requests_total = Counter('app_requests_total', 'Total requests')
active_users = Gauge('app_active_users', 'Active users')
request_duration = Histogram('app_request_duration_seconds', 'Request duration')

# Instrument code
@request_duration.time()
def handle_request():
    requests_total.inc()
    # Your application logic
    time.sleep(0.1)

# Serve metrics on :8000/metrics for Prometheus to scrape
start_http_server(8000)

Troubleshooting with Monitoring

High CPU Usage

  1. Check Node Metrics to identify affected node
  2. View Pod Metrics to find high-CPU pods
  3. Check pod Logs for issues
  4. Scale deployment or increase resources
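
Step 2 can also be done directly in PromQL. A sketch using cAdvisor's `container_cpu_usage_seconds_total` metric, which most Kubernetes Prometheus setups collect:

```
# Top 5 pods by CPU usage over the last 5 minutes
topk(5, sum by (namespace, pod) (rate(container_cpu_usage_seconds_total[5m])))
```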

Memory Leaks

  1. Monitor Memory Usage over time
  2. Look for steadily increasing memory
  3. Check for OOMKilled pods in events
  4. Increase memory limits or fix application
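
To watch memory over time (steps 1-2), graph the working-set metric per pod; a steady upward slope with no plateau suggests a leak. Queries assume cAdvisor metrics are collected:

```
# Working-set memory per pod
sum by (namespace, pod) (container_memory_working_set_bytes)

# Extrapolate memory 4 hours ahead from the last hour's trend
predict_linear(container_memory_working_set_bytes[1h], 4 * 3600)
```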

Slow Response Times

  1. Check Service Metrics for latency spikes
  2. Correlate with CPU/Memory usage
  3. Review Pod Logs for errors
  4. Check Network I/O for bottlenecks

Best Practice

Set up alerts before issues occur. Monitor trends over time to catch problems early and plan capacity.