Monitoring

Monitor cluster health, resource utilization, and application performance through the Platform interface.

Cluster Metrics

Overview Dashboard

View cluster-wide metrics at a glance:

  • CPU Utilization: Total CPU usage across all nodes
  • Memory Utilization: Total memory usage across all nodes
  • Pod Count: Number of running pods
  • Node Status: Healthy vs. unhealthy nodes
  • Storage Usage: Persistent volume utilization

Node Metrics

Monitor individual node performance:

  • CPU: Current usage, requests, limits, capacity
  • Memory: Current usage, requests, limits, capacity
  • Disk: Used vs. available disk space
  • Network: Ingress/egress bandwidth
  • Pods: Number of pods running on node
  • Conditions: Ready, MemoryPressure, DiskPressure, PIDPressure

Pod Metrics

Track pod-level resource consumption:

  • CPU Usage: Current CPU utilization
  • Memory Usage: Current memory utilization
  • Restart Count: Number of pod restarts
  • Age: How long pod has been running
  • Status: Running, Pending, Failed, etc.

Application Performance

Container Metrics

Monitor individual container performance:

  • CPU Usage: Per-container CPU consumption
  • Memory Usage: Per-container memory consumption
  • Disk I/O: Read/write operations per second
  • Network I/O: Bytes sent/received
  • Resource Limits: Current usage relative to configured limits

Service Metrics

Track service-level metrics:

  • Request Rate: Requests per second
  • Error Rate: Failed requests percentage
  • Latency: Response time percentiles (p50, p95, p99)
  • Active Connections: Current connection count
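
The service metrics above map naturally onto PromQL queries. A sketch, assuming a counter named `http_requests_total` (with a `status` label) and a histogram named `http_request_duration_seconds` — substitute your application's actual metric names:

```
# Request rate: requests per second over the last 5 minutes
sum(rate(http_requests_total[5m]))

# Error rate: 5xx responses as a percentage of all requests
100 * sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m]))

# p95 latency from the histogram's buckets
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```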

Alerts & Notifications

Setting Up Alerts

Create alerts for critical conditions:

  1. Go to Platform → Monitoring → Alerts
  2. Click Create Alert
  3. Configure alert:
    • Metric: CPU, memory, disk, etc.
    • Threshold: When to trigger alert
    • Duration: How long condition must persist
    • Severity: Critical, warning, info
  4. Set notification channels:
    • Email
    • Slack
    • PagerDuty
    • Webhook
  5. Click Create Alert

Common Alert Rules

Pre-configured alerts for common issues:

  • High CPU Usage: Node CPU > 80% for 5 minutes
  • High Memory Usage: Node memory > 85% for 5 minutes
  • Pod Crashes: Pod restart count > 5 in 10 minutes
  • Disk Space Low: Node disk usage > 85%
  • Node Not Ready: Node becomes NotReady
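
Alerts like these are configured through the UI steps above, but if your cluster's alerting is backed by Prometheus, the "High CPU Usage" rule could be expressed as an alerting rule roughly like the following. This is a sketch assuming node_exporter metrics (`node_cpu_seconds_total`); the group and alert names are illustrative:

```yaml
groups:
  - name: node-alerts
    rules:
      - alert: HighCPUUsage
        # CPU utilization = 100% minus the idle fraction, averaged per node
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m              # condition must persist for 5 minutes
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} CPU above 80% for 5 minutes"
```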

Alert States

Alerts can be in these states:

  • OK: Condition not met, no issue
  • Pending: Condition met, waiting for duration
  • Firing: Alert triggered, notifications sent
  • Resolved: Condition no longer met
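
The state transitions above can be sketched as a small evaluation loop: an alert moves from OK to Pending when its condition is first met, to Firing once the condition has persisted for the configured duration, and to Resolved when the condition clears. This is a minimal illustration of the lifecycle, not a real platform API:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    duration: int          # evaluations the condition must persist before firing
    state: str = "OK"
    met_for: int = 0       # consecutive evaluations with the condition met

    def evaluate(self, condition_met: bool) -> str:
        if condition_met:
            self.met_for += 1
            if self.met_for >= self.duration:
                self.state = "Firing"
            elif self.state in ("OK", "Resolved"):
                self.state = "Pending"
        else:
            # Condition cleared: a firing alert resolves, otherwise return to OK
            self.met_for = 0
            self.state = "Resolved" if self.state == "Firing" else "OK"
        return self.state

# Example: duration of 3 evaluations
alert = Alert(duration=3)
for met in (True, True, True, False):
    print(alert.evaluate(met))   # Pending, Pending, Firing, Resolved
```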

Health Checks

Liveness Probes

Check whether the container is still running:

  • HTTP GET: Check endpoint returns 200-399
  • TCP Socket: Check port is open
  • Exec: Run command, check exit code 0
  • Failure Action: Restart container

Readiness Probes

Check whether the container is ready to serve traffic:

  • Same Methods: HTTP, TCP, Exec
  • Failure Action: Remove from service endpoints
  • Use Case: Don't send traffic until ready

Startup Probes

Check if application has started:

  • Use Case: Slow-starting applications
  • Disables Other Probes: Until startup succeeds
  • Failure Action: Restart container
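
All three probe types can be declared together on a single container. A minimal sketch in standard Kubernetes pod-spec form — the image, ports, and endpoint paths are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: app
      image: example/app:latest        # placeholder image
      ports:
        - containerPort: 8080
      startupProbe:                    # gates the other probes until it succeeds
        httpGet: { path: /healthz, port: 8080 }
        failureThreshold: 30           # allow up to 30 x 10s for slow startup
        periodSeconds: 10
      livenessProbe:                   # failure restarts the container
        httpGet: { path: /healthz, port: 8080 }
        periodSeconds: 10
      readinessProbe:                  # failure removes pod from service endpoints
        httpGet: { path: /ready, port: 8080 }
        periodSeconds: 5
```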

Performance Dashboards

Pre-Built Dashboards

Access ready-made dashboards:

  • Cluster Overview: High-level cluster health
  • Node Performance: Per-node resource usage
  • Namespace Usage: Resource consumption by namespace
  • Pod Performance: Individual pod metrics
  • Storage Metrics: PV/PVC utilization

Custom Dashboards

Create your own dashboards:

  1. Go to Monitoring → Dashboards
  2. Click Create Dashboard
  3. Add panels:
    • Select metric to visualize
    • Choose visualization type (line, bar, gauge)
    • Set time range and refresh interval
  4. Arrange panels in layout
  5. Save and share dashboard

Metrics Collection

Prometheus Integration

Metrics are collected via Prometheus:

  • Scrape Interval: Every 30 seconds
  • Retention: 15 days by default
  • Storage: Persistent volume for metrics
  • Query Language: PromQL for custom queries
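
A few PromQL queries you might run against the collected metrics. These assume node_exporter and kube-state-metrics are installed (as is common in Prometheus-based cluster monitoring); adjust metric names if your setup differs:

```
# Cluster-wide CPU utilization (%)
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory utilization (%) per node
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Containers that restarted in the last 10 minutes
increase(kube_pod_container_status_restarts_total[10m]) > 0
```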

Custom Metrics

Expose application metrics:

# Python example with prometheus_client
from prometheus_client import Counter, Gauge, Histogram, start_http_server
import time

# Define metrics
requests_total = Counter('app_requests_total', 'Total requests')
active_users = Gauge('app_active_users', 'Active users')
request_duration = Histogram('app_request_duration_seconds', 'Request duration')

# Instrument code
@request_duration.time()
def handle_request():
    requests_total.inc()
    # Your application logic
    time.sleep(0.1)

# Serve metrics on :8000/metrics for Prometheus to scrape
start_http_server(8000)

Troubleshooting with Monitoring

High CPU Usage

  1. Check Node Metrics to identify affected node
  2. View Pod Metrics to find high-CPU pods
  3. Check pod Logs for issues
  4. Scale deployment or increase resources
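
Step 2 can also be done directly in PromQL. A sketch using cAdvisor's `container_cpu_usage_seconds_total` metric, which most Kubernetes Prometheus setups collect:

```
# Top 5 pods by CPU usage over the last 5 minutes
topk(5, sum by (namespace, pod) (rate(container_cpu_usage_seconds_total[5m])))
```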

Memory Leaks

  1. Monitor Memory Usage over time
  2. Look for steadily increasing memory
  3. Check for OOMKilled pods in events
  4. Increase memory limits or fix application
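
To watch memory over time (steps 1-2), graph the working-set metric per pod; a steady upward slope with no plateau suggests a leak. Queries assume cAdvisor metrics are collected:

```
# Working-set memory per pod
sum by (namespace, pod) (container_memory_working_set_bytes)

# Extrapolate memory 4 hours ahead from the last hour's trend
predict_linear(container_memory_working_set_bytes[1h], 4 * 3600)
```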

Slow Response Times

  1. Check Service Metrics for latency spikes
  2. Correlate with CPU/Memory usage
  3. Review Pod Logs for errors
  4. Check Network I/O for bottlenecks

Best Practice

Set up alerts before issues occur. Monitor trends over time to catch problems early and plan capacity.