Troubleshooting
Common platform issues and how to resolve them using the platform's diagnostic tools.
Pod Issues
Pod Stuck in Pending
Symptoms: Pod remains in Pending state and never transitions to Running.
Diagnosis using the Platform UI:
- Navigate to Platform > Pods and find the pod
- Click the pod name and check the Events tab (GET /api/v1/pods/{namespace}/{name}/events)
- Look for specific event messages
Cause: Insufficient Resources
- Event message: FailedScheduling: Insufficient cpu or Insufficient memory
- Check node capacity: Platform > Nodes or GET /api/v1/cluster/nodes/metrics
- Check capacity analysis: GET /api/v1/capacity/analysis
- Solution: Scale down other workloads, reduce resource requests on over-provisioned pods, or add more nodes
Cause: Node Selector/Affinity Mismatch
- Event message: FailedScheduling: 0/N nodes are available: N node(s) didn't match node selector
- Check node labels: GET /api/v1/cluster/nodes/{name} and examine labels
- Solution: Update the pod's node selector to match existing node labels, or add labels to nodes via PATCH /api/v1/cluster/nodes/{name}/labels
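A node selector matches a node only when every key/value pair in the selector appears in the node's labels. An illustrative check (not a platform API) makes the Kubernetes-style semantics concrete:

```python
# Illustrative check, not a platform API: a nodeSelector matches a node
# only if every selector key/value pair appears in the node's labels.
def selector_matches(node_labels: dict, node_selector: dict) -> bool:
    return all(node_labels.get(k) == v for k, v in node_selector.items())
```

For example, a selector of disktype=ssd never matches a node labeled only zone=us-east-1a, which produces exactly the "didn't match node selector" event above.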
Cause: PVC Not Bound
- Event message: FailedScheduling: persistentvolumeclaim "xxx" not found, or the PVC shows as Pending
- Check PVC status: GET /api/v1/storage/persistent-volume-claims/{namespace}/{name}
- Solution: Create the required PVC, ensure the storage class exists, or check PV availability
Cause: Image Pull Secret Missing
- Event message: FailedScheduling followed by ErrImagePull after the pod is scheduled
- Solution: Create an image pull secret and add it to the pod's service account
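When triaging many Pending pods, the event messages above can be mapped to likely causes with a small classifier. This is a hypothetical helper, not part of the platform; the regex patterns mirror the event texts listed above and may need adjusting to your platform's exact wording:

```python
import re

# Hypothetical triage helper: map an event message (as returned by the
# events endpoint) to one of the Pending-pod causes described above.
def classify_pending_event(message: str) -> str:
    rules = [
        (r"Insufficient (cpu|memory)", "insufficient-resources"),
        (r"didn't match node selector", "node-selector-mismatch"),
        (r'persistentvolumeclaim ".*" not found', "pvc-not-bound"),
        (r"ErrImagePull|ImagePullBackOff", "image-pull-problem"),
    ]
    for pattern, cause in rules:
        if re.search(pattern, message):
            return cause
    return "unknown"
```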
Pod Stuck in CrashLoopBackOff
Symptoms: Pod starts, crashes, and restarts repeatedly with exponentially increasing backoff.
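For context on the timing: in Kubernetes, the restart delay starts at 10 seconds and doubles after each crash, capping at five minutes (it resets after the container runs cleanly for 10 minutes). A sketch of that schedule:

```python
# Approximate CrashLoopBackOff schedule: delay doubles per crash,
# starting at 10s and capped at 300s (5 minutes).
def backoff_delay(restart_count: int, base: int = 10, cap: int = 300) -> int:
    return min(base * 2 ** restart_count, cap)
```

So a pod on its fifth restart is already waiting the full five minutes between attempts, which is why CrashLoopBackOff pods can look "stuck" for long stretches.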
Diagnosis:
- View pod logs with the previous container: GET /api/v1/pods/{namespace}/{name}/logs?previous=true
- Check pod events: GET /api/v1/pods/{namespace}/{name}/events
- Inspect the pod status: GET /api/v1/pods/{namespace}/{name}/status
Cause: Application Error
- Logs show stack trace, exception, or error message
- Container exit code is non-zero
- Solution: Fix the application code based on the error in the logs
Cause: Missing Environment Variables or Config
- Logs show "config not found", "missing required setting", or connection string errors
- Check environment: GET /api/v1/pods/{namespace}/{name}/environment
- Solution: Add missing environment variables to the deployment, verify ConfigMap/Secret references
Cause: Health Check Failing Too Quickly
- Events show Unhealthy: Liveness probe failed shortly after the container starts
- The container doesn't have enough time to initialize before the liveness probe starts checking
- Solution: Increase initialDelaySeconds on the liveness probe, or add a startup probe
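Under common probe semantics (initialDelaySeconds before the first check, one check every periodSeconds, container restarted after failureThreshold consecutive failures; field names follow Kubernetes probes), the startup budget is easy to compute:

```python
# Roughly how long a container gets before its first liveness failure can
# restart it: the initial delay, plus one probe period per allowed failure.
def time_until_restart(initial_delay: int, period: int, failure_threshold: int) -> int:
    return initial_delay + period * failure_threshold
```

With initialDelaySeconds=0, periodSeconds=10, and failureThreshold=3, an application has only about 30 seconds to start answering its health endpoint, which is often too short for JVM or database-migration startup.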
Cause: Database or External Service Unavailable
- Logs show connection refused, timeout, or authentication failures
- Solution: Verify the dependent service is running, check network policies, verify credentials in secrets
ImagePullBackOff
Symptoms: Pod can't pull the container image.
Diagnosis: Check pod events for specific error messages.
Cause: Image Does Not Exist
- Event: Failed to pull image "xxx": manifest unknown
- Solution: Verify the image name and tag. Use GET /api/v1/pods/{namespace}/{name}/events to see the exact image reference that failed
Cause: Private Registry Authentication
- Event: Failed to pull image: unauthorized or no basic auth credentials
- Solution: Create a Docker registry secret and add it to the service account or pod spec
Cause: Docker Hub Rate Limiting
- Event: toomanyrequests: You have reached your pull rate limit
- Solution: Use authenticated Docker Hub pulls (add credentials as a secret) or mirror images to a private registry
Cause: Network Issues
- Event: Failed to pull image: dial tcp: lookup registry.example.com: no such host
- Solution: Check DNS resolution, network policies (GET /api/v1/networking/network-policies), and security group rules
OOMKilled (Out of Memory)
Symptoms: Pod is killed with OOMKilled reason, restart count increases.
Diagnosis:
- Check pod events for "OOMKilled" message
- Review pod metrics: GET /api/v1/pods/{namespace}/{name}/metrics
- Check container resource limits in the pod YAML: GET /api/v1/pods/{namespace}/{name}/yaml
Solutions:
- Increase memory limit: If the application legitimately needs more memory, update the deployment with a higher memory limit via PATCH /api/v1/workloads/deployments/{namespace}/{name}
- Fix memory leak: If memory grows unbounded, profile the application
- Optimize application: Reduce memory footprint (smaller batch sizes, connection pool limits, cache size limits)
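To catch OOM kills before they happen, the metrics and YAML endpoints can be combined into a simple headroom check. This is a hypothetical post-processing step, not a platform feature; both values are in bytes:

```python
# Hypothetical headroom check: how close a container's memory usage is
# to its configured limit, and whether it is at risk of an OOM kill.
def memory_headroom_pct(usage_bytes: int, limit_bytes: int) -> float:
    return 100.0 * (limit_bytes - usage_bytes) / limit_bytes

def likely_oom_risk(usage_bytes: int, limit_bytes: int, threshold_pct: float = 10.0) -> bool:
    return memory_headroom_pct(usage_bytes, limit_bytes) < threshold_pct
```

A container using 950 MiB of a 1 GiB limit has roughly 7% headroom and is a good candidate for a higher limit or a leak investigation.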
Networking Issues
Service Not Accessible
Symptoms: Cannot reach a service from other pods or externally.
Diagnosis Steps:
- Verify the service exists and has endpoints: GET /api/v1/networking/services/{namespace}/{name}/endpoints
- Test endpoint connectivity: POST /api/v1/networking/services/{namespace}/{name}/test-endpoint
- Test all endpoints: POST /api/v1/networking/services/{namespace}/{name}/test-all-endpoints
- Check related resources: GET /api/v1/networking/services/{namespace}/{name}/related
- Check network policies: GET /api/v1/networking/network-policies
Cause: No Endpoints
- Service has selector but no pods match it
- Check pod labels against service selector
- Solution: Fix the service selector or pod labels to match
Cause: Pods Not Ready
- Pods exist but are failing their readiness probes
- Endpoints only include pods that pass readiness checks
- Solution: Fix the readiness probe or the application's health endpoint
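The two causes above combine into one rule: a pod appears in the endpoint list only if it matches the service selector AND reports Ready. A sketch of that derivation (the pod dicts are illustrative, not the platform's exact schema):

```python
# Sketch of endpoint derivation: keep only pods that match the service
# selector and currently pass their readiness checks.
def ready_endpoints(pods: list, selector: dict) -> list:
    return [
        p["ip"]
        for p in pods
        if all(p["labels"].get(k) == v for k, v in selector.items()) and p["ready"]
    ]
```

If this list is empty while pods exist, check the selector first, then readiness.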
Cause: Network Policy Blocking Traffic
- Use impact analysis: GET /api/v1/networking/network-policies/{namespace}/{name}/impact-analysis
- Test connectivity between pods: POST /api/v1/networking/network-policies/test-connectivity
- Solution: Update the network policy to allow traffic from the source pod's labels/namespace
Cause: Wrong Port Configuration
- Service port doesn't match container port
- Solution: Update the service's targetPort to match the container's listening port
Ingress Not Working
Symptoms: Cannot access application via ingress URL.
Diagnosis Steps:
- Verify the ingress exists: GET /api/v1/networking/ingresses/{namespace}/{name}
- Check ingress rules: GET /api/v1/networking/ingresses/{namespace}/{name}/rules
- Validate backend services: GET /api/v1/networking/ingresses/{namespace}/{name}/backend-services
- Check ingress status: GET /api/v1/networking/ingresses/{namespace}/{name}/status
- Verify the ingress class: GET /api/v1/networking/ingress-classes
Cause: DNS Not Configured
- Ingress has a host rule but DNS doesn't point to the load balancer
- Check ingress status for the load balancer IP/hostname
- Solution: Update DNS records to point to the ingress load balancer
Cause: TLS Certificate Issues
- HTTPS failing with certificate errors
- Check the TLS secret referenced in the ingress
- Solution: Update the TLS secret with a valid certificate (inspect the current secret via GET /api/v1/storage/secrets/{namespace}/{name})
Cause: Backend Service Has No Ready Pods
- Ingress is configured but the backend service has no healthy pods
- Use GET /api/v1/networking/ingresses/{namespace}/{name}/backend-services to check
- Solution: Fix the backend deployment/pods
Cause: Wrong Ingress Class
- Ingress specifies a class that doesn't match the installed ingress controller
- Solution: Update the ingress class annotation to match your controller
Storage Issues
PVC Stuck in Pending
Symptoms: PersistentVolumeClaim remains in Pending state.
Diagnosis:
- Check PVC details: GET /api/v1/storage/persistent-volume-claims/{namespace}/{name}
- Validate the PVC configuration: POST /api/v1/storage/persistent-volume-claims/validate
- Check available PVs: GET /api/v1/storage/persistent-volumes
- Check storage info: GET /api/v1/cluster/storage-info
Cause: No Matching PV or StorageClass
- The requested storage class doesn't exist or no PV matches
- Solution: Use an existing storage class (check GET /api/v1/cluster/storage-info), or create a PV with matching criteria
Cause: Zone Mismatch
- PV is in a different availability zone than the pod/node
- Solution: Use the WaitForFirstConsumer volume binding mode in the storage class
Cause: Capacity Exceeded
- Cluster storage quota or cloud provider limits reached
- Solution: Clean up unused PVCs, request quota increase
Volume Mount Failures
Symptoms: Pod stuck in ContainerCreating, events show mount errors.
Diagnosis:
- Check pod events for FailedMount messages
- Verify the PVC is Bound: GET /api/v1/storage/persistent-volume-claims/{namespace}/{name}
- Check PV status: GET /api/v1/storage/persistent-volumes/{name}
Cause: ReadWriteOnce Conflict
- Volume is ReadWriteOnce but a pod on another node is already using it
- Solution: Delete the other pod first, or switch to the ReadWriteMany access mode
Cause: Stale Volume Attachment
- Detaching the volume from its previous node didn't complete
- Solution: Wait for the automatic detach timeout, or manually remove the VolumeAttachment
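The ReadWriteOnce conflict above reduces to a single attachment rule; an illustrative check (not the platform's attach logic) shows why the second node is refused:

```python
# Illustrative access-mode rule: a ReadWriteOnce volume may be attached
# to at most one node, so a request from a second node must be refused.
def can_attach(access_mode: str, attached_nodes: set, node: str) -> bool:
    if access_mode == "ReadWriteMany":
        return True
    # ReadWriteOnce: allowed only if no *other* node already holds it
    return not attached_nodes or attached_nodes == {node}
```

A stale VolumeAttachment makes attached_nodes look non-empty even after the old pod is gone, which is the failure mode described above.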
PVC Expansion Failed
Diagnosis:
- Validate the expansion: POST /api/v1/storage/persistent-volume-claims/{namespace}/{name}/validate-expansion
- Check expansion history: GET /api/v1/storage/persistent-volume-claims/{namespace}/{name}/expansion-history
Cause: Storage Class Does Not Allow Expansion
- The allowVolumeExpansion flag is false on the storage class
- Solution: Use a storage class that supports expansion, or create a new PVC with more space
Node Issues
Node Not Ready
Symptoms: Node shows NotReady status in the node list.
Diagnosis:
- Check node conditions: GET /api/v1/cluster/nodes/{name}/conditions
- View the condition summary: GET /api/v1/cluster/nodes/{name}/conditions-summary
- Run a health check: GET /api/v1/cluster/nodes/{name}/health-check
- View node-level events: check cluster events filtered by node
Common Causes:
- kubelet is not responding (node crashed or kubelet service stopped)
- Network partition between node and control plane
- Node ran out of disk space (DiskPressure condition)
- Node ran out of memory (MemoryPressure condition)
- Too many processes (PIDPressure condition)
Solutions:
- If the node is temporarily unavailable, wait for automatic recovery
- For maintenance, cordon the node first: POST /api/v1/cluster/nodes/{name}/cordon
- Drain the node to move pods safely: POST /api/v1/cluster/nodes/{name}/drain
- If the node is unrecoverable, remove it: DELETE /api/v1/cluster/nodes/{name}
Node Under Pressure
Symptoms: Pods being evicted, node conditions show pressure.
Diagnosis:
- Check node metrics: GET /api/v1/cluster/nodes/{name}/metrics
- View pods on the node: GET /api/v1/cluster/nodes/{name}/pods
- Get cluster-wide conditions: GET /api/v1/cluster/nodes/conditions-overview
Solutions:
- Identify resource-heavy pods and scale them down or set lower limits
- Add more nodes to distribute the load
- Cordon the node to prevent new scheduling while resolving the issue
Configuration Issues
ConfigMap Changes Not Taking Effect
Symptoms: Application using old configuration after ConfigMap update.
Explanation: Pods do not automatically reload ConfigMaps. The behavior depends on how the ConfigMap is consumed:
- Environment variables from ConfigMap: Never update automatically. The pod must be restarted.
- Volume-mounted ConfigMap: Updates automatically after a delay (up to a few minutes), but the application must detect and reload the file.
Solutions:
- Restart the deployment: POST /api/v1/workloads/deployments/{namespace}/{name}/restart triggers a rolling restart
- Use volume mounts: Mount ConfigMaps as volumes and implement file-watching in the application
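A minimal file-watching sketch for a volume-mounted ConfigMap: poll the mounted file's modification time and reload when it changes. The path and reload action here are placeholders for your application's own config handling:

```python
from pathlib import Path

# Minimal config reloader: re-read the mounted file whenever its mtime
# changes. Call poll() periodically (e.g. from a timer or event loop).
class ConfigWatcher:
    def __init__(self, path: str):
        self.path = Path(path)
        self.mtime = self.path.stat().st_mtime
        self.value = self.path.read_text()

    def poll(self) -> bool:
        """Reload if the file changed; return True when a reload happened."""
        mtime = self.path.stat().st_mtime
        if mtime != self.mtime:
            self.mtime = mtime
            self.value = self.path.read_text()
            return True
        return False
```

Remember that even with a watcher in place, the volume update itself can take up to a few minutes to reach the pod.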
Secret Changes Not Taking Effect
Same behavior as ConfigMaps. Secrets mounted as volumes update automatically after a delay. Secrets injected as environment variables require a pod restart.
Verify the secret value: GET /api/v1/storage/secrets/{namespace}/{name}/data/{key}
Performance Issues
CPU Throttling
Symptoms: Application is slow despite metrics showing moderate CPU usage.
Diagnosis:
- Check pod metrics: GET /api/v1/pods/{namespace}/{name}/metrics
- Compare CPU requests vs. limits in the pod spec
- Check deployment metrics: GET /api/v1/workloads/deployments/{namespace}/{name}/metrics
- Review HPA status if autoscaling is configured
Solutions:
- Increase CPU limits to allow bursting
- Increase CPU requests to guarantee more baseline capacity
- Scale the deployment horizontally: POST /api/v1/workloads/deployments/{namespace}/{name}/scale
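Throttling explains the "slow despite moderate CPU" symptom: average usage can sit well below the limit while the container still hits its quota in many scheduler periods. On the node, the cgroup cpu.stat counters (nr_periods, nr_throttled) expose this directly; a sketch of the ratio worth alerting on:

```python
# Fraction of scheduler periods in which the container hit its CPU quota,
# computed from cgroup cpu.stat counters read on the node (not via the
# platform API). Anything persistently above ~0.25 usually means latency.
def throttle_ratio(nr_throttled: int, nr_periods: int) -> float:
    return nr_throttled / nr_periods if nr_periods else 0.0
```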
Slow Pod Startup
Symptoms: Pods take a long time to become Ready.
Common Causes:
- Large container image: Use smaller base images, multi-stage builds
- Slow application initialization: Increase the readiness probe's initialDelaySeconds
- Dependency waiting: Application waits for databases or other services at startup
- CPU throttling during startup: Increase CPU requests
Quick Troubleshooting Checklist
- Check pod events: GET /api/v1/pods/{namespace}/{name}/events -- events explain scheduling, image pull, and volume issues
- View pod logs: GET /api/v1/pods/{namespace}/{name}/logs?previous=true -- include previous container logs for crash analysis
- Check resource usage: GET /api/v1/pods/{namespace}/{name}/metrics -- verify CPU and memory are within limits
- Test connectivity: POST /api/v1/networking/services/{namespace}/{name}/test-endpoint -- verify service reachability
- Review recent events: GET /api/v1/events/recent -- see what changed recently in the cluster
- Check node health: GET /api/v1/cluster/nodes/{name}/health-check -- verify the underlying node is healthy
- Inspect environment: GET /api/v1/pods/{namespace}/{name}/environment -- verify environment variables
- View YAML: GET /api/v1/pods/{namespace}/{name}/yaml -- check the full pod specification
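The checklist is easy to script. A hypothetical helper that assembles the pod-level paths above in triage order (the networking and node checks take different parameters and are omitted here):

```python
# Hypothetical helper: build the ordered list of pod-level diagnostic
# endpoint paths from the checklist, ready to feed to an HTTP client.
def triage_endpoints(namespace: str, name: str) -> list:
    base = f"/api/v1/pods/{namespace}/{name}"
    return [
        f"{base}/events",
        f"{base}/logs?previous=true",
        f"{base}/metrics",
        f"{base}/environment",
        f"{base}/yaml",
    ]
```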
Using Global Search for Troubleshooting
The global search endpoint (GET /api/v1/global-search) allows searching across all resource types:
- Search by resource name to find related resources quickly
- Search by label to find all resources with a specific label
- Results include resource type, namespace, and current status
Using Resource Relationships
The relationships endpoint (GET /api/v1/relationships) helps trace resource dependencies:
- View parent-child relationships (Deployment > ReplicaSet > Pod)
- Identify which Service routes to which Pods
- Find orphaned resources that might be causing issues
- Understand the full resource chain for a failing application
When troubleshooting, work from the outside in: Start with the pod events to understand scheduling and infrastructure issues, then move to container logs for application-level problems. Use the relationships API to understand how resources connect, and use global search to find related resources quickly.