Monitoring

Monitor cluster health, resource utilization, and application performance through the platform interface. The platform collects metrics from the Kubernetes Metrics API and exposes them through the dashboard and resource-specific endpoints.

Dashboard Overview

The platform dashboard (/api/v1/dashboard/overview) provides a cluster-wide summary:

  • Cluster health status: Overall cluster readiness
  • Resource utilization: CPU and memory usage across all nodes
  • Node status: Count of healthy vs. unhealthy nodes
  • Pod count: Total running pods
  • Recent events: Latest cluster events including warnings
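
As a rough illustration, a client might prepare a request for the overview endpoint as sketched here. The base URL `https://platform.example.com` and the bearer-token header are assumptions made for the example; this page does not specify how the platform authenticates requests.

```python
from urllib.request import Request

OVERVIEW_PATH = "/api/v1/dashboard/overview"  # endpoint named in this guide


def build_overview_request(base_url: str, token: str) -> Request:
    """Prepare (but do not send) a GET request for the dashboard overview.

    The bearer-token header is an assumed authentication scheme.
    """
    url = base_url.rstrip("/") + OVERVIEW_PATH
    headers = {"Accept": "application/json", "Authorization": f"Bearer {token}"}
    return Request(url, headers=headers, method="GET")


# Hypothetical host and token, used only to show the resulting request.
request = build_overview_request("https://platform.example.com/", "example-token")
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would send it; the shape of the JSON response is not documented here, so parse it defensively.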

Resource Utilization Summary

The /api/v1/dashboard/resources/utilization endpoint provides detailed resource consumption data:

  • Current CPU usage vs. total capacity
  • Current memory usage vs. total capacity
  • Per-node utilization breakdown
  • Resource requests vs. limits vs. actual usage
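
To make the usage-versus-capacity comparison concrete, the sketch below derives cluster-wide and per-node percentages from raw figures. The `NodeUsage` field names are hypothetical and are not the endpoint's actual response schema.

```python
from dataclasses import dataclass


@dataclass
class NodeUsage:
    # Field names are hypothetical; adapt them to the actual response.
    name: str
    cpu_used_millicores: int
    cpu_capacity_millicores: int
    memory_used_bytes: int
    memory_capacity_bytes: int


def percent(used: float, capacity: float) -> float:
    """Return used/capacity as a percentage, or 0.0 when capacity is unknown."""
    if capacity <= 0:
        return 0.0
    return round(100.0 * used / capacity, 1)


def summarize(nodes: list) -> dict:
    """Aggregate per-node figures into cluster-wide utilization percentages."""
    return {
        "cpu_percent": percent(
            sum(n.cpu_used_millicores for n in nodes),
            sum(n.cpu_capacity_millicores for n in nodes),
        ),
        "memory_percent": percent(
            sum(n.memory_used_bytes for n in nodes),
            sum(n.memory_capacity_bytes for n in nodes),
        ),
        "per_node": {
            n.name: {
                "cpu_percent": percent(n.cpu_used_millicores, n.cpu_capacity_millicores),
                "memory_percent": percent(n.memory_used_bytes, n.memory_capacity_bytes),
            }
            for n in nodes
        },
    }
```

Summing usage and capacity before dividing weights each node by its size, which a plain average of per-node percentages would not do.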

Node Metrics

Cluster-Wide Node Metrics

View metrics for all nodes via GET /api/v1/cluster/nodes/metrics:

  • CPU: Current usage, allocatable, and capacity per node
  • Memory: Current usage, allocatable, and capacity per node
  • Pod count: Number of pods running on each node
  • Conditions: Ready, MemoryPressure, DiskPressure, PIDPressure

Individual Node Metrics

View detailed metrics for a specific node via GET /api/v1/cluster/nodes/{name}/metrics:

  • CPU utilization percentage
  • Memory utilization percentage
  • Number of pods vs. pod capacity
  • Node conditions and their status

Advanced Node Information

The GET /api/v1/cluster/nodes/advanced endpoint provides enhanced node data including:

  • Instance type and availability zone
  • Kubernetes version
  • Operating system and architecture
  • Labels, annotations, and taints
  • Resource allocation details

Node Health Checks

Run comprehensive health checks on individual nodes via GET /api/v1/cluster/nodes/{name}/health-check:

  • Check all node conditions (Ready, MemoryPressure, DiskPressure, PIDPressure)
  • Verify node is schedulable (not cordoned)
  • Check resource pressure states
  • Get condition summaries via GET /api/v1/cluster/nodes/{name}/conditions-summary
  • View cluster-wide conditions overview via GET /api/v1/cluster/nodes/conditions-overview
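
The checks listed above can be approximated on the client side. This is a minimal sketch, assuming the conditions arrive as a mapping from condition type to the standard Kubernetes status strings ("True", "False", "Unknown"); the function name and input shape are illustrative, not the platform's implementation.

```python
PRESSURE_CONDITIONS = ("MemoryPressure", "DiskPressure", "PIDPressure")


def check_node(conditions: dict, unschedulable: bool = False) -> list:
    """Return a list of human-readable problems; an empty list means healthy.

    `conditions` maps a condition type to its status string. The input shape
    is an assumption made for this sketch.
    """
    problems = []
    # A healthy node reports Ready == "True"; "False" and "Unknown" both fail.
    if conditions.get("Ready") != "True":
        problems.append("node is not Ready")
    # Pressure conditions are healthy when they are NOT "True".
    for condition in PRESSURE_CONDITIONS:
        if conditions.get(condition) == "True":
            problems.append(f"{condition} is active")
    # A cordoned node has spec.unschedulable set in the Kubernetes API.
    if unschedulable:
        problems.append("node is cordoned")
    return problems
```

Note the inverted polarity: Ready is good when "True", while the pressure conditions are good when "False".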

Pod Metrics

All Pod Metrics

View metrics across all pods via GET /api/v1/pods/metrics:

  • CPU and memory usage for every pod
  • Filterable by namespace
  • Status counts by phase (Running, Pending, Failed, Succeeded)

Individual Pod Metrics

View metrics for a specific pod via GET /api/v1/pods/{namespace}/{name}/metrics:

  • Per-container CPU usage
  • Per-container memory usage
  • Container restart counts
  • Pod age and uptime

Pod Status Counts

Get a quick summary of pod states via GET /api/v1/pods/status-counts:

  • Running pods count
  • Pending pods count
  • Failed pods count
  • Succeeded pods count
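
For comparison, the same kind of summary could be computed locally from a list of pod phases. This sketch only illustrates the aggregation; it assumes you already have the phase strings and says nothing about how the endpoint computes its counts.

```python
from collections import Counter

# The four phases listed in this guide. Kubernetes also defines "Unknown",
# which this sketch deliberately leaves out of the result.
PHASES = ("Running", "Pending", "Failed", "Succeeded")


def status_counts(phases: list) -> dict:
    """Count pods per phase, reporting 0 for phases with no pods."""
    counts = Counter(phases)
    return {phase: counts.get(phase, 0) for phase in PHASES}


print(status_counts(["Running", "Running", "Pending", "Succeeded"]))
```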

Workload Metrics

Deployment Metrics

View deployment-level metrics via GET /api/v1/workloads/deployments/{namespace}/{name}/metrics:

  • Aggregate CPU and memory across all deployment pods
  • Ready vs. desired replica count
  • Rollout status and progress

HPA Metrics

Monitor Horizontal Pod Autoscaler (HPA) performance through these endpoints:

  • Current metrics: GET /api/v1/workloads/hpas/{namespace}/{name} shows current vs. target values
  • Metrics history: GET /api/v1/workloads/hpas/{namespace}/{name}/metrics-history shows historical scaling data
  • Scaling events: GET /api/v1/workloads/hpas/{namespace}/{name}/scaling-events shows when and why scaling occurred
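
When reading current versus target values, it helps to know the scaling rule that Kubernetes documents for the HPA: desired replicas = ceil(current replicas × current metric / target metric), with no change while the ratio stays within a tolerance (0.1 by default). The sketch below is a simplified version of that rule; it ignores min/max replica bounds, pod readiness, and stabilization windows, and the function name is illustrative.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_value: float,
    target_value: float,
    tolerance: float = 0.1,
) -> int:
    """Simplified HPA rule: scale in proportion to current/target."""
    ratio = current_value / target_value
    # Within the tolerance band the autoscaler leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# 4 replicas at 90% average CPU against a 60% target.
print(desired_replicas(4, 90, 60))
```

Comparing this estimate with the recorded scaling events is a quick way to check that an autoscaler is reacting the way its targets suggest.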

Storage Metrics

Persistent Volume Utilization

View PVC usage, and find out which pods reference a ConfigMap or Secret, via the storage endpoints:

  • PVC usage: GET /api/v1/storage/persistent-volume-claims/{namespace}/{name}/usage
  • ConfigMap usage: GET /api/v1/storage/configmaps/{namespace}/{name}/usage (which pods reference the ConfigMap)
  • Secret usage: GET /api/v1/storage/secrets/{namespace}/{name}/usage (which pods reference the secret)

Pod Volume Capacity

View volume usage within pods via GET /api/v1/pods/{namespace}/{name}/volumes/capacity:

  • Per-volume capacity and usage
  • Percentage utilized
  • Available space

Volume Snapshot Statistics

View snapshot storage metrics via GET /api/v1/storage/volume-snapshots/statistics:

  • Total number of snapshots
  • Storage consumed by snapshots
  • Snapshot age distribution

Cluster Events

Real-Time Event Stream

Events are the primary way to understand cluster activity. The following endpoints expose them:

  • All events: GET /api/v1/events with namespace, type, and time range filtering
  • Recent events: GET /api/v1/events/recent for the latest events
  • Event summary: GET /api/v1/events/summary for aggregated event statistics
  • Resource events: GET /api/v1/events/resource/{kind}/{name} for events related to a specific resource
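
A filtered query against the events endpoint might be assembled as in this sketch. The query parameter names (`namespace`, `type`, `since`) and the host are assumptions; this page says the filters exist but does not give their names.

```python
from typing import Optional
from urllib.parse import urlencode


def events_url(
    base_url: str,
    namespace: Optional[str] = None,
    event_type: Optional[str] = None,
    since: Optional[str] = None,
) -> str:
    """Build a URL for GET /api/v1/events with optional filters.

    Parameter names in the query string are hypothetical.
    """
    candidates = (("namespace", namespace), ("type", event_type), ("since", since))
    # Only include filters the caller actually supplied.
    params = {key: value for key, value in candidates if value is not None}
    url = base_url.rstrip("/") + "/api/v1/events"
    return url + ("?" + urlencode(params) if params else "")


print(events_url("https://platform.example.com", namespace="prod", event_type="Warning"))
```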

StatefulSet Events (Advanced)

The following endpoints provide event analysis for StatefulSets:

| Endpoint | Purpose |
|---|---|
| GET /events/statefulset/{name} | All events for a StatefulSet |
| GET /events/statefulset/{name}/summary | Event summary and statistics |
| GET /events/statefulset/{name}/lifecycle | Lifecycle transition events |
| GET /events/statefulset/{name}/pods | Events for StatefulSet pods |
| GET /events/statefulset/{name}/stream | Event stream |
| GET /events/statefulset/{name}/volume-events | Volume-related events |
| GET /events/statefulset/{name}/correlated | Correlated events across resources |
| GET /events/statefulset/{name}/analytics | Event pattern analysis |
| GET /events/statefulset/{name}/stream-realtime | Real-time event streaming |

Event Types

Events are categorized by type:

  • Normal: Standard lifecycle events (pod scheduled, container started, image pulled)
  • Warning: Issues requiring attention (OOMKilled, CrashLoopBackOff, FailedScheduling, ImagePullBackOff)
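
One way to spot recurring problems is to tally Warning events by reason. The sketch assumes each event is a dict carrying the `type` and `reason` fields that Kubernetes Event objects have; whether the platform returns events in exactly that shape is an assumption.

```python
from collections import Counter


def warning_reasons(events: list) -> Counter:
    """Count Warning events per reason, ignoring Normal events."""
    return Counter(
        event.get("reason", "Unknown")
        for event in events
        if event.get("type") == "Warning"
    )


# Hand-written sample events, not real cluster output.
sample_events = [
    {"type": "Normal", "reason": "Scheduled"},
    {"type": "Warning", "reason": "FailedScheduling"},
    {"type": "Warning", "reason": "FailedScheduling"},
    {"type": "Warning", "reason": "ImagePullBackOff"},
]
print(warning_reasons(sample_events).most_common())
```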

Capacity Planning

Capacity Analysis

Analyze current cluster capacity via GET /api/v1/capacity/analysis:

  • CPU headroom (available vs. allocated)
  • Memory headroom (available vs. allocated)
  • Pod capacity per node
  • Resource efficiency metrics

Capacity Recommendations

Get recommendations for capacity planning via GET /api/v1/capacity/recommendations:

  • Right-sizing suggestions for over/under-provisioned workloads
  • Node scaling recommendations
  • Cost optimization opportunities
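
A right-sizing suggestion generally comes down to comparing what a workload requests with what it actually uses. The sketch below shows one such comparison; the 50% and 90% thresholds and the labels are hypothetical and are not the rules the recommendations endpoint applies.

```python
def classify_workload(
    requested: float,
    used: float,
    low: float = 0.5,
    high: float = 0.9,
) -> str:
    """Label a workload by its usage-to-request ratio.

    `requested` and `used` must share a unit (for example CPU millicores).
    The thresholds are illustrative defaults.
    """
    if requested <= 0:
        return "no-request"  # nothing to compare against
    ratio = used / requested
    if ratio < low:
        return "over-provisioned"
    if ratio > high:
        return "under-provisioned"
    return "right-sized"


print(classify_workload(requested=1000, used=200))
```

In practice the usage figure should come from a trend over time rather than a single sample, since a momentary dip or spike would otherwise skew the label.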

Cluster Components

Component Health

View cluster component statuses via GET /api/v1/cluster/components:

  • API Server status
  • Scheduler status
  • Controller Manager status
  • etcd status (if available)

Cluster Capabilities

Check what the cluster supports via GET /api/v1/cluster/capabilities:

  • Supported API versions
  • Available resource types
  • Feature gates enabled

Metrics Collection Architecture

The platform collects metrics from your Kubernetes cluster:

  1. Metrics API: Queries the Kubernetes Metrics API (typically served by Metrics Server) for CPU and memory data
  2. Caching: Metrics are cached for historical queries when caching is enabled
  3. Polling intervals: Metrics are collected at configurable intervals
  4. Adaptive polling: When users are actively viewing a resource page, polling frequency increases for near-real-time data
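
The interplay of caching and adaptive polling can be pictured with a small sketch. It is not the platform's implementation: the class name, the 60-second idle and 5-second active intervals, and the viewer counter are all assumptions chosen to illustrate the idea.

```python
import time
from typing import Any, Callable, Optional


class AdaptiveMetricsCache:
    """Cache a fetched value and refresh it more often while someone is watching."""

    def __init__(
        self,
        fetch: Callable[[], Any],
        idle_interval: float = 60.0,   # hypothetical default, in seconds
        active_interval: float = 5.0,  # hypothetical default, in seconds
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._clock = clock
        self.idle_interval = idle_interval
        self.active_interval = active_interval
        self.active_viewers = 0
        self._value: Any = None
        self._fetched_at: Optional[float] = None

    def interval(self) -> float:
        """Use the short interval while at least one viewer is active."""
        return self.active_interval if self.active_viewers > 0 else self.idle_interval

    def get(self) -> Any:
        """Return the cached value, refetching once it is older than the interval."""
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self.interval():
            self._value = self._fetch()
            self._fetched_at = now
        return self._value
```

Injecting the clock keeps the class testable without sleeping; in a real poller the viewer count would be driven by page open and close events.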

Metrics API Service

The metrics API service provides endpoints for querying collected metrics:

  • Pod-level metrics
  • Node-level metrics
  • Namespace-level aggregations
  • Historical trends (when caching is enabled)

Monitoring Best Practices

Set Up Key Alerts

Monitor these critical conditions:

  • Node NotReady: Any node becoming unavailable
  • Pod CrashLoopBackOff: Applications repeatedly crashing
  • OOMKilled: Containers terminated for exceeding their memory limits
  • High CPU/Memory: Nodes approaching resource capacity
  • DiskPressure: Nodes running low on disk space
  • Pending pods: Pods that cannot be scheduled
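
Two of these conditions can be detected directly from a pod's container statuses. The sketch assumes pod objects in the standard Kubernetes API JSON shape, where `state.waiting.reason` carries CrashLoopBackOff and `lastState.terminated.reason` carries OOMKilled; whether the platform's pod endpoints return that exact shape is an assumption.

```python
def pod_alerts(pod: dict) -> list:
    """Return (container name, condition) pairs that deserve an alert."""
    alerts = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # Either key may be absent or null, so fall back to an empty dict.
        waiting = status.get("state", {}).get("waiting") or {}
        terminated = status.get("lastState", {}).get("terminated") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            alerts.append((status["name"], "CrashLoopBackOff"))
        if terminated.get("reason") == "OOMKilled":
            alerts.append((status["name"], "OOMKilled"))
    return alerts


# Hand-written sample pod, not real cluster output.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "app",
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {"terminated": {"reason": "OOMKilled"}},
            },
            {"name": "sidecar", "state": {"running": {}}},
        ]
    }
}
print(pod_alerts(sample_pod))
```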

Regular Health Checks

  • Check the dashboard overview daily for cluster health
  • Review node conditions weekly for degradation trends
  • Monitor HPA scaling events to verify autoscaling is working correctly
  • Review event analytics for recurring warning patterns

Capacity Planning

  • Track resource utilization trends over time
  • Plan for seasonal traffic increases
  • Use capacity recommendations to right-size workloads
  • Monitor PVC usage to prevent storage exhaustion

Tip: Use the events endpoints as your primary troubleshooting tool. Warning events almost always explain why pods fail to start, why volumes don't mount, or why deployments stall. Start with the event summary to identify patterns, then drill into specific resource events.