Monitoring
Monitor cluster health, resource utilization, and application performance through the platform interface. The platform collects metrics from the Kubernetes Metrics API and exposes them through the dashboard and resource-specific endpoints.
Dashboard Overview
The platform dashboard (/api/v1/dashboard/overview) provides a cluster-wide summary:
- Cluster health status: Overall cluster readiness
- Resource utilization: CPU and memory usage across all nodes
- Node status: Count of healthy vs. unhealthy nodes
- Pod count: Total running pods
- Recent events: Latest cluster events including warnings
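As a minimal sketch of calling this endpoint from a script, the helper below builds the URL and decodes the JSON response. The base URL, the bearer-token header, and the function names are assumptions for illustration; this page does not specify how the API is authenticated or what the response body contains.

```python
import json
import urllib.request
from urllib.parse import urljoin

# Hypothetical base URL; replace with the address of your platform.
BASE_URL = "https://platform.example.com"


def endpoint_url(path: str, base_url: str = BASE_URL) -> str:
    """Join the base URL and an API path taken from this page."""
    return urljoin(base_url.rstrip("/") + "/", path.lstrip("/"))


def get_json(path: str, token: str, base_url: str = BASE_URL, timeout: float = 10.0) -> dict:
    """GET an endpoint and decode its JSON body.

    Bearer-token authentication is an assumption; use whatever
    scheme your deployment actually requires.
    """
    request = urllib.request.Request(
        endpoint_url(path, base_url),
        headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With a valid token, `get_json("/api/v1/dashboard/overview", token)` would return the overview as a dictionary; inspect the keys it contains rather than relying on any particular field name.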
Resource Utilization Summary
The /api/v1/dashboard/resources/utilization endpoint provides detailed resource consumption data:
- Current CPU usage vs. total capacity
- Current memory usage vs. total capacity
- Per-node utilization breakdown
- Resource requests vs. limits vs. actual usage
Node Metrics
Cluster-Wide Node Metrics
View metrics for all nodes via GET /api/v1/cluster/nodes/metrics:
- CPU: Current usage, allocatable, and capacity per node
- Memory: Current usage, allocatable, and capacity per node
- Pod count: Number of pods running on each node
- Conditions: Ready, MemoryPressure, DiskPressure, PIDPressure
Individual Node Metrics
View detailed metrics for a specific node via GET /api/v1/cluster/nodes/{name}/metrics:
- CPU utilization percentage
- Memory utilization percentage
- Number of pods vs. pod capacity
- Node conditions and their status
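The utilization percentages are plain ratios of usage to a total. The sketch below shows the arithmetic, assuming the denominator is the node's allocatable amount; this page does not state whether the platform divides by allocatable or by capacity, and the function name is illustrative.

```python
def utilization_percent(used: float, total: float) -> float:
    """Return used/total as a percentage rounded to one decimal place.

    `used` and `total` must share a unit (millicores, bytes, or pod counts).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * used / total, 1)


# 1500 millicores used on a node with 4000 millicores allocatable.
cpu = utilization_percent(1500, 4000)
# 12 GiB used out of 16 GiB allocatable.
memory = utilization_percent(12 * 1024**3, 16 * 1024**3)
# 42 pods running on a node that can hold 110.
pods = utilization_percent(42, 110)
print(cpu, memory, pods)  # 37.5 75.0 38.2
```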
Advanced Node Information
The GET /api/v1/cluster/nodes/advanced endpoint provides enhanced node data including:
- Instance type and availability zone
- Kubernetes version
- Operating system and architecture
- Labels, annotations, and taints
- Resource allocation details
Node Health Checks
Run comprehensive health checks on individual nodes via GET /api/v1/cluster/nodes/{name}/health-check:
- Check all node conditions (Ready, MemoryPressure, DiskPressure, PIDPressure)
- Verify node is schedulable (not cordoned)
- Check resource pressure states
- Get condition summaries via GET /api/v1/cluster/nodes/{name}/conditions-summary
- View cluster-wide conditions overview via GET /api/v1/cluster/nodes/conditions-overview
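The checks listed above can be approximated in a few lines. This is a sketch of the logic only, not the platform's implementation: it assumes you have already extracted each condition's status as the strings Kubernetes uses ("True", "False", "Unknown") plus a cordoned flag, and the function name and input shape are hypothetical.

```python
PRESSURE_CONDITIONS = ("MemoryPressure", "DiskPressure", "PIDPressure")


def node_problems(conditions: dict[str, str], unschedulable: bool = False) -> list[str]:
    """List the reasons a node would fail a basic health check.

    `conditions` maps a condition type to its status string.
    An empty result means every check passed.
    """
    problems = []
    # A healthy node reports Ready=True; False and Unknown both count as failures.
    if conditions.get("Ready") != "True":
        problems.append("node is not Ready")
    # Pressure conditions are healthy when their status is False.
    for name in PRESSURE_CONDITIONS:
        if conditions.get(name) == "True":
            problems.append(f"{name} is active")
    # A cordoned node is marked unschedulable.
    if unschedulable:
        problems.append("node is cordoned")
    return problems
```

For example, a node with `Ready` set to "Unknown" and `DiskPressure` set to "True" yields two problems, and a third if it is also cordoned.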
Pod Metrics
All Pod Metrics
View metrics across all pods via GET /api/v1/pods/metrics:
- CPU and memory usage for every pod
- Filterable by namespace
- Status counts by phase (Running, Pending, Failed, Succeeded)
Individual Pod Metrics
View metrics for a specific pod via GET /api/v1/pods/{namespace}/{name}/metrics:
- Per-container CPU usage
- Per-container memory usage
- Container restart counts
- Pod age and uptime
Pod Status Counts
Get a quick summary of pod states via GET /api/v1/pods/status-counts:
- Running pods count
- Pending pods count
- Failed pods count
- Succeeded pods count
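If you need the same summary from data you already hold, for example a list of pod phases taken from another response, the tally is easy to reproduce. The sketch below is an illustration with a hypothetical function name; it does not describe the endpoint's response format.

```python
from collections import Counter

# The four phases this page reports counts for.
PHASES = ("Running", "Pending", "Failed", "Succeeded")


def status_counts(pod_phases: list[str]) -> dict[str, int]:
    """Count pods per phase, reporting zero for phases with no pods.

    Phases outside PHASES (such as "Unknown") are ignored.
    """
    counts = Counter(pod_phases)
    return {phase: counts.get(phase, 0) for phase in PHASES}


print(status_counts(["Running", "Running", "Pending", "Succeeded"]))
# {'Running': 2, 'Pending': 1, 'Failed': 0, 'Succeeded': 1}
```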
Workload Metrics
Deployment Metrics
View deployment-level metrics via GET /api/v1/workloads/deployments/{namespace}/{name}/metrics:
- Aggregate CPU and memory across all deployment pods
- Ready vs. desired replica count
- Rollout status and progress
HPA Metrics
Monitor autoscaler performance:
- Current metrics: GET /api/v1/workloads/hpas/{namespace}/{name} shows current vs. target values
- Metrics history: GET /api/v1/workloads/hpas/{namespace}/{name}/metrics-history shows historical scaling data
- Scaling events: GET /api/v1/workloads/hpas/{namespace}/{name}/scaling-events shows when and why scaling occurred
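To interpret the current vs. target values, it helps to know the core rule the Kubernetes Horizontal Pod Autoscaler applies: the desired replica count is the current count scaled by the ratio of current to target metric value, rounded up, with no change when the ratio is within a tolerance (0.1 by default). The sketch below shows only that rule; it leaves out minimum and maximum replica bounds, stabilization windows, and scaling policies, and the function name is illustrative.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_value: float,
    target_value: float,
    tolerance: float = 0.1,
) -> int:
    """Apply the basic HPA scaling rule to one metric."""
    ratio = current_value / target_value
    # Ratios close enough to 1.0 leave the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


print(desired_replicas(4, 90, 60))  # 6: usage is 1.5x the target
print(desired_replicas(4, 63, 60))  # 4: within tolerance, no change
print(desired_replicas(5, 20, 60))  # 2: usage is a third of the target
```

Comparing this estimate with the scaling events the platform reports can help explain why a scale-up or scale-down happened when it did.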
Storage Metrics
Persistent Volume Utilization
View storage usage via the storage endpoints:
- PVC usage: GET /api/v1/storage/persistent-volume-claims/{namespace}/{name}/usage
- ConfigMap usage: GET /api/v1/storage/configmaps/{namespace}/{name}/usage (which pods reference the ConfigMap)
- Secret usage: GET /api/v1/storage/secrets/{namespace}/{name}/usage (which pods reference the Secret)
Pod Volume Capacity
View volume usage within pods via GET /api/v1/pods/{namespace}/{name}/volumes/capacity:
- Per-volume capacity and usage
- Percentage utilized
- Available space
Volume Snapshot Statistics
View snapshot storage metrics via GET /api/v1/storage/volume-snapshots/statistics:
- Total number of snapshots
- Storage consumed by snapshots
- Snapshot age distribution
Cluster Events
Real-Time Event Stream
Events are the primary way to understand cluster activity:
- All events: GET /api/v1/events with namespace, type, and time range filtering
- Recent events: GET /api/v1/events/recent for the latest events
- Event summary: GET /api/v1/events/summary for aggregated event statistics
- Resource events: GET /api/v1/events/resource/{kind}/{name} for events related to a specific resource
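The sketch below shows one way to build filtered event requests. The query parameter names (`namespace`, `type`, `since`, `until`) are assumptions made for illustration; this page says the endpoint supports namespace, type, and time range filtering but does not name the parameters, so check the API reference before relying on them.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def events_path(
    namespace: Optional[str] = None,
    event_type: Optional[str] = None,
    since: Optional[str] = None,
    until: Optional[str] = None,
) -> str:
    """Build the path for the all-events endpoint with optional filters.

    Parameter names are hypothetical; only filters that are set are included.
    """
    params = {"namespace": namespace, "type": event_type, "since": since, "until": until}
    query = urlencode({key: value for key, value in params.items() if value is not None})
    return "/api/v1/events" + (f"?{query}" if query else "")


def resource_events_path(kind: str, name: str) -> str:
    """Build the path for events related to one resource, escaping both segments."""
    return f"/api/v1/events/resource/{quote(kind, safe='')}/{quote(name, safe='')}"


print(events_path(namespace="prod", event_type="Warning"))
# /api/v1/events?namespace=prod&type=Warning
print(resource_events_path("Pod", "web-0"))
# /api/v1/events/resource/Pod/web-0
```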
StatefulSet Events (Advanced)
Comprehensive event analysis for StatefulSets:
| Endpoint | Purpose |
|---|---|
| GET /events/statefulset/{name} | All events for a StatefulSet |
| GET /events/statefulset/{name}/summary | Event summary and statistics |
| GET /events/statefulset/{name}/lifecycle | Lifecycle transition events |
| GET /events/statefulset/{name}/pods | Events for StatefulSet pods |
| GET /events/statefulset/{name}/stream | Event stream |
| GET /events/statefulset/{name}/volume-events | Volume-related events |
| GET /events/statefulset/{name}/correlated | Correlated events across resources |
| GET /events/statefulset/{name}/analytics | Event pattern analysis |
| GET /events/statefulset/{name}/stream-realtime | Real-time event streaming |
Event Types
Events are categorized by type:
- Normal: Standard lifecycle events (pod scheduled, container started, image pulled)
- Warning: Issues requiring attention (OOMKilled, CrashLoopBackOff, FailedScheduling, ImagePullBackOff)
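Once events are fetched, grouping Warning events by reason quickly shows which problem dominates. The sketch below assumes each event is a dictionary with `type` and `reason` keys, mirroring the fields of a Kubernetes Event object; the platform's actual response shape may differ, and the function name is illustrative.

```python
from collections import Counter


def warning_reasons(events: list[dict]) -> list[tuple[str, int]]:
    """Return (reason, count) pairs for Warning events, most frequent first."""
    reasons = Counter(
        event["reason"] for event in events if event.get("type") == "Warning"
    )
    return reasons.most_common()


sample = [
    {"type": "Normal", "reason": "Scheduled"},
    {"type": "Warning", "reason": "FailedScheduling"},
    {"type": "Warning", "reason": "OOMKilled"},
    {"type": "Warning", "reason": "FailedScheduling"},
]
print(warning_reasons(sample))
# [('FailedScheduling', 2), ('OOMKilled', 1)]
```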
Capacity Planning
Capacity Analysis
Analyze current cluster capacity via GET /api/v1/capacity/analysis:
- CPU headroom (available vs. allocated)
- Memory headroom (available vs. allocated)
- Pod capacity per node
- Resource efficiency metrics
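Headroom is the difference between what is available and what is already allocated. The sketch below works through that arithmetic under the assumption that "available" means allocatable resources and "allocated" means the sum of resource requests; this page does not define either term precisely, and the names used here are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Headroom:
    free: float          # Unallocated amount, in the same unit as the inputs.
    free_percent: float  # Unallocated share of the allocatable total.


def headroom(allocatable: float, allocated: float) -> Headroom:
    """Compute remaining capacity for one resource (CPU, memory, or pods)."""
    if allocatable <= 0:
        raise ValueError("allocatable must be positive")
    free = allocatable - allocated
    return Headroom(free=free, free_percent=round(100.0 * free / allocatable, 1))


# 16000 millicores allocatable across the cluster, 12000 already requested.
print(headroom(16000, 12000))  # Headroom(free=4000, free_percent=25.0)
```

A negative `free` value would indicate overcommitment. Cluster-wide headroom can also overstate what is schedulable, because free capacity may be fragmented across nodes.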
Capacity Recommendations
Get recommendations for capacity planning via GET /api/v1/capacity/recommendations:
- Right-sizing suggestions for over/under-provisioned workloads
- Node scaling recommendations
- Cost optimization opportunities
Cluster Components
Component Health
View cluster component statuses via GET /api/v1/cluster/components:
- API Server status
- Scheduler status
- Controller Manager status
- etcd status (if available)
Cluster Capabilities
Check what the cluster supports via GET /api/v1/cluster/capabilities:
- Supported API versions
- Available resource types
- Feature gates enabled
Metrics Collection Architecture
The platform collects metrics from your Kubernetes cluster:
- Metrics API: Queries the Kubernetes Metrics Server for CPU and memory data
- Caching: Metrics are cached for historical queries when caching is enabled
- Polling intervals: Metrics are collected at configurable intervals
- Adaptive polling: When users are actively viewing a resource page, polling frequency increases for near-real-time data
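Adaptive polling can be pictured as choosing a shorter interval whenever someone is watching. The sketch below is a conceptual illustration only: the class, the viewer count input, and both interval values are assumptions, not the platform's actual configuration or defaults.

```python
from dataclasses import dataclass


@dataclass
class PollingPolicy:
    # Both intervals are illustrative values, not platform defaults.
    idle_seconds: float = 60.0
    active_seconds: float = 5.0

    def interval(self, active_viewers: int) -> float:
        """Poll faster while at least one user is viewing the resource."""
        return self.active_seconds if active_viewers > 0 else self.idle_seconds

    def next_poll_at(self, last_poll: float, active_viewers: int) -> float:
        """Return the timestamp, in seconds, at which the next collection is due."""
        return last_poll + self.interval(active_viewers)


policy = PollingPolicy()
print(policy.interval(0))             # 60.0 while nobody is viewing
print(policy.interval(3))             # 5.0 while the page is open
print(policy.next_poll_at(100.0, 1))  # 105.0
```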
Metrics API Service
The metrics API service provides endpoints for querying collected metrics:
- Pod-level metrics
- Node-level metrics
- Namespace-level aggregations
- Historical trends (when caching is enabled)
Monitoring Best Practices
Set Up Key Alerts
Monitor these critical conditions:
- Node NotReady: Any node becoming unavailable
- Pod CrashLoopBackOff: Applications repeatedly crashing
- OOMKilled: Applications exceeding memory limits
- High CPU/Memory: Nodes approaching resource capacity
- DiskPressure: Nodes running low on disk space
- Pending pods: Pods that cannot be scheduled
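The conditions above can be checked together against a single snapshot of cluster state. The sketch below is one possible shape for such a check, not a built-in platform feature: the snapshot structure, its field names, and the 85% utilization threshold are all hypothetical and would need to be mapped from the real endpoint responses.

```python
def evaluate_alerts(snapshot: dict, high_utilization_percent: float = 85.0) -> list[str]:
    """Return a message for every critical condition found in the snapshot."""
    alerts = []
    for node in snapshot.get("nodes", []):
        name = node["name"]
        if not node.get("ready", False):
            alerts.append(f"Node NotReady: {name}")
        if node.get("disk_pressure", False):
            alerts.append(f"DiskPressure: {name}")
        # Flag the node when either CPU or memory reaches the threshold.
        peak = max(node.get("cpu_percent", 0.0), node.get("memory_percent", 0.0))
        if peak >= high_utilization_percent:
            alerts.append(f"High CPU/Memory: {name} at {peak:.0f}%")
    # Counts of Warning events per reason, gathered from the events endpoints.
    for reason in ("CrashLoopBackOff", "OOMKilled"):
        count = snapshot.get("warning_reasons", {}).get(reason, 0)
        if count:
            alerts.append(f"{reason}: {count} event(s)")
    pending = snapshot.get("pending_pods", 0)
    if pending:
        alerts.append(f"Pending pods: {pending}")
    return alerts
```

Running a check like this on a schedule and forwarding any non-empty result to a notification channel is one simple way to cover the list above.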
Regular Health Checks
- Check the dashboard overview daily for cluster health
- Review node conditions weekly for degradation trends
- Monitor HPA scaling events to verify autoscaling is working correctly
- Review event analytics for recurring warning patterns
Capacity Planning
- Track resource utilization trends over time
- Plan for seasonal traffic increases
- Use capacity recommendations to right-size workloads
- Monitor PVC usage to prevent storage exhaustion
Use the events endpoints as your primary troubleshooting tool. Warning events almost always explain why pods fail to start, why volumes don't mount, or why deployments stall. Start with the event summary to identify patterns, then drill into specific resource events.