Troubleshooting
Common platform issues and how to resolve them using the platform's diagnostic tools.
Pod Issues
Pod Stuck in Pending
Symptoms: Pod remains in Pending state and never transitions to Running.
Diagnosis using the Platform UI:
- Navigate to Platform > Pods and find the pod
- Click the pod name and check the Events tab (GET /api/v1/pods/{namespace}/{name}/events)
- Look for specific event messages
Cause: Insufficient Resources
- Event message: FailedScheduling: Insufficient cpu or Insufficient memory
- Check node capacity: Platform > Nodes or GET /api/v1/cluster/nodes/metrics
- Check capacity analysis: GET /api/v1/capacity/analysis
- Solution: Scale down other workloads, reduce resource requests on over-provisioned pods, or add more nodes
Cause: Node Selector/Affinity Mismatch
- Event message: FailedScheduling: 0/N nodes are available: N node(s) didn't match node selector
- Check node labels: GET /api/v1/cluster/nodes/{name} and examine labels
- Solution: Update the pod's node selector to match existing node labels, or add labels to nodes via PATCH /api/v1/cluster/nodes/{name}/labels
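A node selector matches a node only when every key/value pair in the selector appears in the node's labels. An illustrative check (not a platform API) makes the Kubernetes-style semantics concrete:

```python
# Illustrative check, not a platform API: a nodeSelector matches a node
# only if every selector key/value pair appears in the node's labels.
def selector_matches(node_labels: dict, node_selector: dict) -> bool:
    return all(node_labels.get(k) == v for k, v in node_selector.items())
```

For example, a selector of disktype=ssd never matches a node labeled only zone=us-east-1a, which produces exactly the "didn't match node selector" event above.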
Cause: PVC Not Bound
- Event message: FailedScheduling: persistentvolumeclaim "xxx" not found, or the PVC shows as Pending
- Check PVC status: GET /api/v1/storage/persistent-volume-claims/{namespace}/{name}
- Solution: Create the required PVC, ensure the storage class exists, or check PV availability
Cause: Image Pull Secret Missing
- Event message: FailedScheduling followed by ErrImagePull after the pod is scheduled
- Solution: Create an image pull secret and add it to the pod's service account
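When triaging many Pending pods, the event messages above can be mapped to likely causes with a small classifier. This is a hypothetical helper, not part of the platform; the regex patterns mirror the event texts listed above and may need adjusting to your platform's exact wording:

```python
import re

# Hypothetical triage helper: map an event message (as returned by the
# events endpoint) to one of the Pending-pod causes described above.
def classify_pending_event(message: str) -> str:
    rules = [
        (r"Insufficient (cpu|memory)", "insufficient-resources"),
        (r"didn't match node selector", "node-selector-mismatch"),
        (r'persistentvolumeclaim ".*" not found', "pvc-not-bound"),
        (r"ErrImagePull|ImagePullBackOff", "image-pull-problem"),
    ]
    for pattern, cause in rules:
        if re.search(pattern, message):
            return cause
    return "unknown"
```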
Pod Stuck in CrashLoopBackOff
Symptoms: Pod starts, crashes, and restarts repeatedly with exponentially increasing backoff.
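For context on the timing: in Kubernetes, the restart delay starts at 10 seconds and doubles after each crash, capping at five minutes (it resets after the container runs cleanly for 10 minutes). A sketch of that schedule:

```python
# Approximate CrashLoopBackOff schedule: delay doubles per crash,
# starting at 10s and capped at 300s (5 minutes).
def backoff_delay(restart_count: int, base: int = 10, cap: int = 300) -> int:
    return min(base * 2 ** restart_count, cap)
```

So a pod on its fifth restart is already waiting the full five minutes between attempts, which is why CrashLoopBackOff pods can look "stuck" for long stretches.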
Diagnosis:
- View pod logs with the previous container: GET /api/v1/pods/{namespace}/{name}/logs?previous=true
- Check pod events: GET /api/v1/pods/{namespace}/{name}/events
- Inspect the pod status: GET /api/v1/pods/{namespace}/{name}/status
Cause: Application Error
- Logs show stack trace, exception, or error message
- Container exit code is non-zero
- Solution: Fix the application code based on the error in the logs
Cause: Missing Environment Variables or Config
- Logs show "config not found", "missing required setting", or connection string errors
- Check environment: GET /api/v1/pods/{namespace}/{name}/environment
- Solution: Add missing environment variables to the deployment, verify ConfigMap/Secret references
Cause: Health Check Failing Too Quickly
- Events show Unhealthy: Liveness probe failed shortly after the container starts
- The container doesn't have enough time to initialize before the liveness probe starts checking
- Solution: Increase initialDelaySeconds on the liveness probe, or add a startup probe
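Under common probe semantics (initialDelaySeconds before the first check, one check every periodSeconds, container restarted after failureThreshold consecutive failures; field names follow Kubernetes probes), the startup budget is easy to compute:

```python
# Roughly how long a container gets before its first liveness failure can
# restart it: the initial delay, plus one probe period per allowed failure.
def time_until_restart(initial_delay: int, period: int, failure_threshold: int) -> int:
    return initial_delay + period * failure_threshold
```

With initialDelaySeconds=0, periodSeconds=10, and failureThreshold=3, an application has only about 30 seconds to start answering its health endpoint, which is often too short for JVM or database-migration startup.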
Cause: Database or External Service Unavailable
- Logs show connection refused, timeout, or authentication failures
- Solution: Verify the dependent service is running, check network policies, verify credentials in secrets
ImagePullBackOff
Symptoms: Pod can't pull the container image.
Diagnosis: Check pod events for specific error messages.
Cause: Image Does Not Exist
- Event: Failed to pull image "xxx": manifest unknown
- Solution: Verify the image name and tag. Use GET /api/v1/pods/{namespace}/{name}/events to see the exact image reference that failed
Cause: Private Registry Authentication
- Event: Failed to pull image: unauthorized or no basic auth credentials
- Solution: Create a Docker registry secret and add it to the service account or pod spec
Cause: Docker Hub Rate Limiting
- Event: toomanyrequests: You have reached your pull rate limit
- Solution: Use authenticated Docker Hub pulls (add credentials as a secret) or mirror images to a private registry
Cause: Network Issues
- Event: Failed to pull image: dial tcp: lookup registry.example.com: no such host
- Solution: Check DNS resolution, network policies (GET /api/v1/networking/network-policies), and security group rules
OOMKilled (Out of Memory)
Symptoms: Pod is killed with OOMKilled reason, restart count increases.
Diagnosis:
- Check pod events for "OOMKilled" message
- Review pod metrics: GET /api/v1/pods/{namespace}/{name}/metrics
- Check container resource limits in the pod YAML: GET /api/v1/pods/{namespace}/{name}/yaml
Solutions:
- Increase memory limit: If the application legitimately needs more memory, update the deployment with a higher memory limit via PATCH /api/v1/workloads/deployments/{namespace}/{name}
- Fix memory leak: If memory grows unbounded, profile the application
- Optimize application: Reduce memory footprint (smaller batch sizes, connection pool limits, cache size limits)
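To catch OOM kills before they happen, the metrics and YAML endpoints can be combined into a simple headroom check. This is a hypothetical post-processing step, not a platform feature; both values are in bytes:

```python
# Hypothetical headroom check: how close a container's memory usage is
# to its configured limit, and whether it is at risk of an OOM kill.
def memory_headroom_pct(usage_bytes: int, limit_bytes: int) -> float:
    return 100.0 * (limit_bytes - usage_bytes) / limit_bytes

def likely_oom_risk(usage_bytes: int, limit_bytes: int, threshold_pct: float = 10.0) -> bool:
    return memory_headroom_pct(usage_bytes, limit_bytes) < threshold_pct
```

A container using 950 MiB of a 1 GiB limit has roughly 7% headroom and is a good candidate for a higher limit or a leak investigation.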
Networking Issues
Service Not Accessible
Symptoms: Cannot reach a service from other pods or externally.
Diagnosis Steps:
- Verify the service exists and has endpoints: GET /api/v1/networking/services/{namespace}/{name}/endpoints
- Test endpoint connectivity: POST /api/v1/networking/services/{namespace}/{name}/test-endpoint
- Test all endpoints: POST /api/v1/networking/services/{namespace}/{name}/test-all-endpoints
- Check related resources: GET /api/v1/networking/services/{namespace}/{name}/related
- Check network policies: GET /api/v1/networking/network-policies
Cause: No Endpoints
- Service has selector but no pods match it
- Check pod labels against service selector
- Solution: Fix the service selector or pod labels to match
Cause: Pods Not Ready
- Pods exist but are failing their readiness probes
- Endpoints only include pods that pass readiness checks
- Solution: Fix the readiness probe or the application's health endpoint
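The two causes above combine into one rule: a pod appears in the endpoint list only if it matches the service selector AND reports Ready. A sketch of that derivation (the pod dicts are illustrative, not the platform's exact schema):

```python
# Sketch of endpoint derivation: keep only pods that match the service
# selector and currently pass their readiness checks.
def ready_endpoints(pods: list, selector: dict) -> list:
    return [
        p["ip"]
        for p in pods
        if all(p["labels"].get(k) == v for k, v in selector.items()) and p["ready"]
    ]
```

If this list is empty while pods exist, check the selector first, then readiness.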
Cause: Network Policy Blocking Traffic
- Use impact analysis: GET /api/v1/networking/network-policies/{namespace}/{name}/impact-analysis
- Test connectivity between pods: POST /api/v1/networking/network-policies/test-connectivity
- Solution: Update the network policy to allow traffic from the source pod's labels/namespace
Cause: Wrong Port Configuration
- Service port doesn't match container port
- Solution: Update the service's targetPort to match the container's listening port
Ingress Not Working
Symptoms: Cannot access application via ingress URL.
Diagnosis Steps:
- Verify the ingress exists: GET /api/v1/networking/ingresses/{namespace}/{name}
- Check ingress rules: GET /api/v1/networking/ingresses/{namespace}/{name}/rules
- Validate backend services: GET /api/v1/networking/ingresses/{namespace}/{name}/backend-services
- Check ingress status: GET /api/v1/networking/ingresses/{namespace}/{name}/status
- Verify the ingress class: GET /api/v1/networking/ingress-classes
Cause: DNS Not Configured
- Ingress has a host rule but DNS doesn't point to the load balancer
- Check ingress status for the load balancer IP/hostname
- Solution: Update DNS records to point to the ingress load balancer
Cause: TLS Certificate Issues
- HTTPS failing with certificate errors
- Check the TLS secret referenced in the ingress
- Solution: Update the TLS secret with a valid certificate (inspect the current secret via GET /api/v1/storage/secrets/{namespace}/{name})
Cause: Backend Service Has No Ready Pods
- Ingress is configured but the backend service has no healthy pods
- Use GET /api/v1/networking/ingresses/{namespace}/{name}/backend-services to check
- Solution: Fix the backend deployment/pods
Cause: Wrong Ingress Class
- Ingress specifies a class that doesn't match the installed ingress controller
- Solution: Update the ingress class annotation to match your controller
Storage Issues
PVC Stuck in Pending
Symptoms: PersistentVolumeClaim remains in Pending state.
Diagnosis:
- Check PVC details: GET /api/v1/storage/persistent-volume-claims/{namespace}/{name}
- Validate the PVC configuration: POST /api/v1/storage/persistent-volume-claims/validate
- Check available PVs: GET /api/v1/storage/persistent-volumes
- Check storage info: GET /api/v1/cluster/storage-info
Cause: No Matching PV or StorageClass
- The requested storage class doesn't exist or no PV matches
- Solution: Use an existing storage class (check GET /api/v1/cluster/storage-info), or create a PV with matching criteria
Cause: Zone Mismatch
- PV is in a different availability zone than the pod/node
- Solution: Use the WaitForFirstConsumer volume binding mode in the storage class
Cause: Capacity Exceeded
- Cluster storage quota or cloud provider limits reached
- Solution: Clean up unused PVCs, request quota increase
Volume Mount Failures
Symptoms: Pod stuck in ContainerCreating, events show mount errors.
Diagnosis:
- Check pod events for FailedMount messages
- Verify the PVC is Bound: GET /api/v1/storage/persistent-volume-claims/{namespace}/{name}
- Check PV status: GET /api/v1/storage/persistent-volumes/{name}
Cause: ReadWriteOnce Conflict
- Volume is ReadWriteOnce but a pod on another node is already using it
- Solution: Delete the other pod first, or switch to the ReadWriteMany access mode
Cause: Stale Volume Attachment
- Detaching the volume from its previous node didn't complete
- Solution: Wait for the automatic detach timeout, or manually remove the VolumeAttachment
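The ReadWriteOnce conflict above reduces to a single attachment rule; an illustrative check (not the platform's attach logic) shows why the second node is refused:

```python
# Illustrative access-mode rule: a ReadWriteOnce volume may be attached
# to at most one node, so a request from a second node must be refused.
def can_attach(access_mode: str, attached_nodes: set, node: str) -> bool:
    if access_mode == "ReadWriteMany":
        return True
    # ReadWriteOnce: allowed only if no *other* node already holds it
    return not attached_nodes or attached_nodes == {node}
```

A stale VolumeAttachment makes attached_nodes look non-empty even after the old pod is gone, which is the failure mode described above.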
PVC Expansion Failed
Diagnosis:
- Validate the expansion: POST /api/v1/storage/persistent-volume-claims/{namespace}/{name}/validate-expansion
- Check expansion history: GET /api/v1/storage/persistent-volume-claims/{namespace}/{name}/expansion-history
Cause: Storage Class Does Not Allow Expansion
- The allowVolumeExpansion flag is false on the storage class
- Solution: Use a storage class that supports expansion, or create a new PVC with more space
Node Issues
Node Not Ready
Symptoms: Node shows NotReady status in the node list.
Diagnosis:
- Check node conditions: GET /api/v1/cluster/nodes/{name}/conditions
- View the condition summary: GET /api/v1/cluster/nodes/{name}/conditions-summary
- Run a health check: GET /api/v1/cluster/nodes/{name}/health-check
- View node-level events: check cluster events filtered by node
Common Causes:
- kubelet is not responding (node crashed or kubelet service stopped)
- Network partition between node and control plane
- Node ran out of disk space (DiskPressure condition)
- Node ran out of memory (MemoryPressure condition)
- Too many processes (PIDPressure condition)
Solutions:
- If the node is temporarily unavailable, wait for automatic recovery
- For maintenance, cordon the node first: POST /api/v1/cluster/nodes/{name}/cordon
- Drain the node to move pods safely: POST /api/v1/cluster/nodes/{name}/drain
- If the node is unrecoverable, remove it: DELETE /api/v1/cluster/nodes/{name}
Node Under Pressure
Symptoms: Pods being evicted, node conditions show pressure.
Diagnosis:
- Check node metrics: GET /api/v1/cluster/nodes/{name}/metrics
- View pods on the node: GET /api/v1/cluster/nodes/{name}/pods
- Get cluster-wide conditions: GET /api/v1/cluster/nodes/conditions-overview
Solutions:
- Identify resource-heavy pods and scale them down or set lower limits
- Add more nodes to distribute the load
- Cordon the node to prevent new scheduling while resolving the issue
Configuration Issues
ConfigMap Changes Not Taking Effect
Symptoms: Application using old configuration after ConfigMap update.
Explanation: Pods do not automatically reload ConfigMaps. The behavior depends on how the ConfigMap is consumed:
- Environment variables from ConfigMap: Never update automatically. The pod must be restarted.
- Volume-mounted ConfigMap: Updates automatically after a delay (up to a few minutes), but the application must detect and reload the file.
Solutions:
- Restart the deployment: POST /api/v1/workloads/deployments/{namespace}/{name}/restart triggers a rolling restart
- Use volume mounts: Mount ConfigMaps as volumes and implement file-watching in the application
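A minimal file-watching sketch for a volume-mounted ConfigMap: poll the mounted file's modification time and reload when it changes. The path and reload action here are placeholders for your application's own config handling:

```python
from pathlib import Path

# Minimal config reloader: re-read the mounted file whenever its mtime
# changes. Call poll() periodically (e.g. from a timer or event loop).
class ConfigWatcher:
    def __init__(self, path: str):
        self.path = Path(path)
        self.mtime = self.path.stat().st_mtime
        self.value = self.path.read_text()

    def poll(self) -> bool:
        """Reload if the file changed; return True when a reload happened."""
        mtime = self.path.stat().st_mtime
        if mtime != self.mtime:
            self.mtime = mtime
            self.value = self.path.read_text()
            return True
        return False
```

Remember that even with a watcher in place, the volume update itself can take up to a few minutes to reach the pod.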
Secret Changes Not Taking Effect
Same behavior as ConfigMaps. Secrets mounted as volumes update automatically after a delay. Secrets injected as environment variables require a pod restart.
Verify the secret value: GET /api/v1/storage/secrets/{namespace}/{name}/data/{key}
Performance Issues
CPU Throttling
Symptoms: Application is slow despite metrics showing moderate CPU usage.
Diagnosis:
- Check pod metrics: GET /api/v1/pods/{namespace}/{name}/metrics
- Compare CPU requests vs. limits in the pod spec
- Check deployment metrics: GET /api/v1/workloads/deployments/{namespace}/{name}/metrics
- Review HPA status if autoscaling is configured
Solutions:
- Increase CPU limits to allow bursting
- Increase CPU requests to guarantee more baseline capacity
- Scale the deployment horizontally: POST /api/v1/workloads/deployments/{namespace}/{name}/scale
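Throttling explains the "slow despite moderate CPU" symptom: average usage can sit well below the limit while the container still hits its quota in many scheduler periods. On the node, the cgroup cpu.stat counters (nr_periods, nr_throttled) expose this directly; a sketch of the ratio worth alerting on:

```python
# Fraction of scheduler periods in which the container hit its CPU quota,
# computed from cgroup cpu.stat counters read on the node (not via the
# platform API). Anything persistently above ~0.25 usually means latency.
def throttle_ratio(nr_throttled: int, nr_periods: int) -> float:
    return nr_throttled / nr_periods if nr_periods else 0.0
```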
Slow Pod Startup
Symptoms: Pods take a long time to become Ready.
Common Causes:
- Large container image: Use smaller base images, multi-stage builds
- Slow application initialization: Increase the readiness probe's initialDelaySeconds
- Dependency waiting: Application waits for databases or other services at startup
- CPU throttling during startup: Increase CPU requests
Quick Troubleshooting Checklist
- Check pod events: GET /api/v1/pods/{namespace}/{name}/events -- events explain scheduling, image pull, and volume issues
- View pod logs: GET /api/v1/pods/{namespace}/{name}/logs?previous=true -- include previous container logs for crash analysis
- Check resource usage: GET /api/v1/pods/{namespace}/{name}/metrics -- verify CPU and memory are within limits
- Test connectivity: POST /api/v1/networking/services/{namespace}/{name}/test-endpoint -- verify service reachability
- Review recent events: GET /api/v1/events/recent -- see what changed recently in the cluster
- Check node health: GET /api/v1/cluster/nodes/{name}/health-check -- verify the underlying node is healthy
- Inspect environment: GET /api/v1/pods/{namespace}/{name}/environment -- verify environment variables
- View YAML: GET /api/v1/pods/{namespace}/{name}/yaml -- check the full pod specification
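The checklist is easy to script. A hypothetical helper that assembles the pod-level paths above in triage order (the networking and node checks take different parameters and are omitted here):

```python
# Hypothetical helper: build the ordered list of pod-level diagnostic
# endpoint paths from the checklist, ready to feed to an HTTP client.
def triage_endpoints(namespace: str, name: str) -> list:
    base = f"/api/v1/pods/{namespace}/{name}"
    return [
        f"{base}/events",
        f"{base}/logs?previous=true",
        f"{base}/metrics",
        f"{base}/environment",
        f"{base}/yaml",
    ]
```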
Using Global Search for Troubleshooting
The global search endpoint (GET /api/v1/global-search) allows searching across all resource types:
- Search by resource name to find related resources quickly
- Search by label to find all resources with a specific label
- Results include resource type, namespace, and current status
Using Resource Relationships
The relationships endpoint (GET /api/v1/relationships) helps trace resource dependencies:
- View parent-child relationships (Deployment > ReplicaSet > Pod)
- Identify which Service routes to which Pods
- Find orphaned resources that might be causing issues
- Understand the full resource chain for a failing application
When troubleshooting, work from the outside in: Start with the pod events to understand scheduling and infrastructure issues, then move to container logs for application-level problems. Use the relationships API to understand how resources connect, and use global search to find related resources quickly.