Troubleshooting

Common platform issues and how to resolve them using the platform's diagnostic tools.

Pod Issues

Pod Stuck in Pending

Symptoms: Pod remains in Pending state and never transitions to Running.

Diagnosis using the Platform UI:

  1. Navigate to Platform > Pods and find the pod
  2. Click the pod name and check the Events tab (GET /api/v1/pods/{namespace}/{name}/events)
  3. Look for specific event messages:

Cause: Insufficient Resources

  • Event message: FailedScheduling: Insufficient cpu or Insufficient memory
  • Check node capacity: Platform > Nodes or GET /api/v1/cluster/nodes/metrics
  • Check capacity analysis: GET /api/v1/capacity/analysis
  • Solution: Scale down other workloads, reduce resource requests on over-provisioned pods, or add more nodes

Cause: Node Selector/Affinity Mismatch

  • Event message: FailedScheduling: 0/N nodes are available: N node(s) didn't match node selector
  • Check node labels: GET /api/v1/cluster/nodes/{name} and examine labels
  • Solution: Update the pod's node selector to match existing node labels, or add labels to nodes via PATCH /api/v1/cluster/nodes/{name}/labels
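Assuming workloads are defined as standard Kubernetes manifests (adapt if your platform wraps them differently), a matching selector looks like this; the `disktype=ssd` label is illustrative:

```yaml
# Pod spec fragment: schedule only onto nodes labeled disktype=ssd.
spec:
  nodeSelector:
    disktype: ssd
```

The pod stays Pending until at least one schedulable node carries that exact label.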

Cause: PVC Not Bound

  • Event message: FailedScheduling: persistentvolumeclaim "xxx" not found or PVC shows as Pending
  • Check PVC status: GET /api/v1/storage/persistent-volume-claims/{namespace}/{name}
  • Solution: Create the required PVC, ensure the storage class exists, or check PV availability
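A minimal PVC sketch, assuming standard Kubernetes manifests (the name, namespace, and storage class are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim             # must match the claim referenced by the pod
  namespace: my-app            # hypothetical namespace
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard   # must exist in the cluster
  resources:
    requests:
      storage: 10Gi
```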

Cause: Image Pull Secret Missing

  • Event message: the pod is scheduled successfully, but then shows ErrImagePull or ImagePullBackOff and remains in Pending
  • Solution: Create an image pull secret and add it to the pod's service account
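Assuming standard Kubernetes pod specs, referencing the secret directly looks like this (the `regcred` name is illustrative); attaching it to the service account instead applies it to every pod using that account:

```yaml
# Pod spec fragment: use a registry credential secret for image pulls.
spec:
  imagePullSecrets:
    - name: regcred   # must exist in the pod's namespace
```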

Pod Stuck in CrashLoopBackOff

Symptoms: Pod starts, crashes, and restarts repeatedly with exponentially increasing backoff.
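The backoff follows the Kubernetes schedule: the delay starts at 10 seconds, doubles after each crash, and is capped at five minutes. A small sketch of the delays to expect:

```python
def crashloop_backoff_delays(restarts, base=10.0, cap=300.0):
    """Return the approximate delay in seconds before each of the first
    `restarts` restart attempts: the delay doubles each time, capped at 5 min."""
    delays = []
    delay = base
    for _ in range(restarts):
        delays.append(min(delay, cap))
        delay *= 2
    return delays

# First six delays: 10, 20, 40, 80, 160, then capped at 300 seconds.
print(crashloop_backoff_delays(6))
```

This is why a crashing pod can sit idle for minutes between attempts even after the fix is deployed; the backoff resets once the container runs cleanly for ten minutes.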

Diagnosis:

  1. View pod logs with previous container: GET /api/v1/pods/{namespace}/{name}/logs?previous=true
  2. Check pod events: GET /api/v1/pods/{namespace}/{name}/events
  3. Inspect the pod status: GET /api/v1/pods/{namespace}/{name}/status

Cause: Application Error

  • Logs show stack trace, exception, or error message
  • Container exit code is non-zero
  • Solution: Fix the application code based on the error in the logs

Cause: Missing Environment Variables or Config

  • Logs show "config not found", "missing required setting", or connection string errors
  • Check environment: GET /api/v1/pods/{namespace}/{name}/environment
  • Solution: Add missing environment variables to the deployment, verify ConfigMap/Secret references
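Assuming standard Kubernetes manifests, a container fragment that sources its environment from a ConfigMap (names are illustrative); if the referenced ConfigMap is missing, the container fails to start:

```yaml
# Deployment container fragment: import all ConfigMap keys as env vars.
containers:
  - name: app
    envFrom:
      - configMapRef:
          name: app-config   # startup fails if this ConfigMap is absent
```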

Cause: Health Check Failing Too Quickly

  • Events show Unhealthy: Liveness probe failed shortly after container starts
  • The container doesn't have enough time to initialize before the liveness probe starts checking
  • Solution: Increase initialDelaySeconds on the liveness probe, or add a startup probe
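A sketch of the startup-probe fix in a standard Kubernetes container spec (endpoint and port are illustrative); while a startup probe is configured, the liveness probe does not begin until the startup probe succeeds:

```yaml
# Give the container up to 30 × 10 = 300 seconds to initialize.
startupProbe:
  httpGet:
    path: /healthz   # illustrative health endpoint
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
```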

Cause: Database or External Service Unavailable

  • Logs show connection refused, timeout, or authentication failures
  • Solution: Verify the dependent service is running, check network policies, verify credentials in secrets

ImagePullBackOff

Symptoms: Pod can't pull the container image.

Diagnosis: Check pod events for specific error messages.

Cause: Image Does Not Exist

  • Event: Failed to pull image "xxx": manifest unknown
  • Solution: Verify the image name and tag. Use GET /api/v1/pods/{namespace}/{name}/events to see the exact image reference that failed

Cause: Private Registry Authentication

  • Event: Failed to pull image: unauthorized or no basic auth credentials
  • Solution: Create a Docker registry secret and add it to the service account or pod spec

Cause: Docker Hub Rate Limiting

  • Event: toomanyrequests: You have reached your pull rate limit
  • Solution: Use authenticated Docker Hub pulls (add credentials as secret) or mirror images to a private registry

Cause: Network Issues

  • Event: Failed to pull image: dial tcp: lookup registry.example.com: no such host
  • Solution: Check DNS resolution, network policies (GET /api/v1/networking/network-policies), and security group rules

OOMKilled (Out of Memory)

Symptoms: Pod is killed with OOMKilled reason, restart count increases.

Diagnosis:

  1. Check pod events for "OOMKilled" message
  2. Review pod metrics: GET /api/v1/pods/{namespace}/{name}/metrics
  3. Check container resource limits in pod YAML: GET /api/v1/pods/{namespace}/{name}/yaml

Solutions:

  1. Increase memory limit: If the application legitimately needs more memory, update the deployment with a higher memory limit via PATCH /api/v1/workloads/deployments/{namespace}/{name}
  2. Fix memory leak: If memory grows unbounded, profile the application
  3. Optimize application: Reduce memory footprint (smaller batch sizes, connection pool limits, cache size limits)
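For solution 1, the relevant container fragment in standard Kubernetes resource syntax (values are illustrative):

```yaml
# Container resources fragment: the container is OOMKilled when its
# memory usage exceeds the limit, so raise the limit, not just the request.
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"
```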

Networking Issues

Service Not Accessible

Symptoms: Cannot reach a service from other pods or externally.

Diagnosis Steps:

  1. Verify service exists and has endpoints: GET /api/v1/networking/services/{namespace}/{name}/endpoints
  2. Test endpoint connectivity: POST /api/v1/networking/services/{namespace}/{name}/test-endpoint
  3. Test all endpoints: POST /api/v1/networking/services/{namespace}/{name}/test-all-endpoints
  4. Check related resources: GET /api/v1/networking/services/{namespace}/{name}/related
  5. Check network policies: GET /api/v1/networking/network-policies

Cause: No Endpoints

  • Service has selector but no pods match it
  • Check pod labels against service selector
  • Solution: Fix the service selector or pod labels to match
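The selector and the pod labels must match exactly, key for key. A sketch assuming standard Kubernetes manifests (the `app: web` label is illustrative):

```yaml
# Service spec fragment:
spec:
  selector:
    app: web   # endpoints are built from pods carrying this label
---
# Pod template fragment inside the deployment:
metadata:
  labels:
    app: web   # must match the service selector exactly
```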

Cause: Pods Not Ready

  • Pods exist but are failing their readiness probes
  • Endpoints only include pods that pass readiness checks
  • Solution: Fix the readiness probe or the application's health endpoint

Cause: Network Policy Blocking Traffic

  • Use impact analysis: GET /api/v1/networking/network-policies/{namespace}/{name}/impact-analysis
  • Test connectivity between pods: POST /api/v1/networking/network-policies/test-connectivity
  • Solution: Update network policy to allow traffic from the source pod's labels/namespace

Cause: Wrong Port Configuration

  • Service port doesn't match container port
  • Solution: Update the service's targetPort to match the container's listening port
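In standard Kubernetes service syntax (ports are illustrative), `port` is what clients connect to and `targetPort` must equal the port the container actually listens on:

```yaml
# Service ports fragment:
ports:
  - port: 80          # port exposed by the service
    targetPort: 8080  # port the container listens on
```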

Ingress Not Working

Symptoms: Cannot access application via ingress URL.

Diagnosis Steps:

  1. Verify ingress exists: GET /api/v1/networking/ingresses/{namespace}/{name}
  2. Check ingress rules: GET /api/v1/networking/ingresses/{namespace}/{name}/rules
  3. Validate backend services: GET /api/v1/networking/ingresses/{namespace}/{name}/backend-services
  4. Check ingress status: GET /api/v1/networking/ingresses/{namespace}/{name}/status
  5. Verify ingress class: GET /api/v1/networking/ingress-classes

Cause: DNS Not Configured

  • Ingress has a host rule but DNS doesn't point to the load balancer
  • Check ingress status for the load balancer IP/hostname
  • Solution: Update DNS records to point to the ingress load balancer

Cause: TLS Certificate Issues

  • HTTPS failing with certificate errors
  • Check the TLS secret referenced in the ingress
  • Solution: Replace the certificate in the TLS secret with a valid one (inspect the current value via GET /api/v1/storage/secrets/{namespace}/{name})
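An ingress TLS fragment, assuming standard Kubernetes manifests (host and secret name are illustrative); the secret must be of type `kubernetes.io/tls` with a valid `tls.crt`/`tls.key` pair:

```yaml
spec:
  tls:
    - hosts:
        - app.example.com   # must match the certificate's subject
      secretName: app-tls   # must exist in the ingress's namespace
```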

Cause: Backend Service Has No Ready Pods

  • Ingress is configured but the backend service has no healthy pods
  • Use GET /api/v1/networking/ingresses/{namespace}/{name}/backend-services to check
  • Solution: Fix the backend deployment/pods

Cause: Wrong Ingress Class

  • Ingress specifies a class that doesn't match the installed ingress controller
  • Solution: Update the ingress class annotation to match your controller
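In standard Kubernetes ingress syntax, the class is set either via `spec.ingressClassName` or the legacy `kubernetes.io/ingress.class` annotation (the `nginx` value is illustrative; list installed classes with GET /api/v1/networking/ingress-classes):

```yaml
spec:
  ingressClassName: nginx   # must name an installed ingress controller
```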

Storage Issues

PVC Stuck in Pending

Symptoms: PersistentVolumeClaim remains in Pending state.

Diagnosis:

  1. Check PVC details: GET /api/v1/storage/persistent-volume-claims/{namespace}/{name}
  2. Validate PVC configuration: POST /api/v1/storage/persistent-volume-claims/validate
  3. Check available PVs: GET /api/v1/storage/persistent-volumes
  4. Check storage info: GET /api/v1/cluster/storage-info

Cause: No Matching PV or StorageClass

  • The requested storage class doesn't exist or no PV matches
  • Solution: Use an existing storage class (check GET /api/v1/cluster/storage-info), or create a PV with matching criteria

Cause: Zone Mismatch

  • PV is in a different availability zone than the pod/node
  • Solution: Use WaitForFirstConsumer volume binding mode in the storage class
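A StorageClass sketch in standard Kubernetes syntax (name and provisioner are hypothetical); `WaitForFirstConsumer` delays provisioning until a pod is scheduled, so the volume is created in that pod's zone:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topology-aware                 # illustrative
provisioner: example.com/provisioner   # hypothetical provisioner
volumeBindingMode: WaitForFirstConsumer
```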

Cause: Capacity Exceeded

  • Cluster storage quota or cloud provider limits reached
  • Solution: Clean up unused PVCs, request quota increase

Volume Mount Failures

Symptoms: Pod stuck in ContainerCreating, events show mount errors.

Diagnosis:

  1. Check pod events for FailedMount messages
  2. Verify PVC is Bound: GET /api/v1/storage/persistent-volume-claims/{namespace}/{name}
  3. Check PV status: GET /api/v1/storage/persistent-volumes/{name}

Cause: ReadWriteOnce Conflict

  • Volume is ReadWriteOnce but a pod on another node is already using it
  • Solution: Delete the other pod first, or switch to ReadWriteMany access mode

Cause: Stale Volume Attachment

  • A volume detach from the previous node did not complete
  • Solution: Wait for the automatic detach timeout, or manually remove the VolumeAttachment

PVC Expansion Failed

Diagnosis:

  1. Validate expansion: POST /api/v1/storage/persistent-volume-claims/{namespace}/{name}/validate-expansion
  2. Check expansion history: GET /api/v1/storage/persistent-volume-claims/{namespace}/{name}/expansion-history

Cause: Storage Class Does Not Allow Expansion

  • The allowVolumeExpansion flag is false on the storage class
  • Solution: Use a storage class that supports expansion, or create a new PVC with more space
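In standard Kubernetes StorageClass syntax, the flag is a top-level field:

```yaml
# StorageClass fragment: expansion requests are rejected unless this is true.
allowVolumeExpansion: true
```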

Node Issues

Node Not Ready

Symptoms: Node shows NotReady status in the node list.

Diagnosis:

  1. Check node conditions: GET /api/v1/cluster/nodes/{name}/conditions
  2. View condition summary: GET /api/v1/cluster/nodes/{name}/conditions-summary
  3. Run health check: GET /api/v1/cluster/nodes/{name}/health-check
  4. View node-level events: check cluster events filtered by node

Common Causes:

  • kubelet is not responding (node crashed or kubelet service stopped)
  • Network partition between node and control plane
  • Node ran out of disk space (DiskPressure condition)
  • Node ran out of memory (MemoryPressure condition)
  • Too many processes (PIDPressure condition)

Solutions:

  • If the node is temporarily unavailable, wait for automatic recovery
  • For maintenance, cordon the node first: POST /api/v1/cluster/nodes/{name}/cordon
  • Drain the node to move pods safely: POST /api/v1/cluster/nodes/{name}/drain
  • If the node is unrecoverable, remove it: DELETE /api/v1/cluster/nodes/{name}

Node Under Pressure

Symptoms: Pods being evicted, node conditions show pressure.

Diagnosis:

  1. Check node metrics: GET /api/v1/cluster/nodes/{name}/metrics
  2. View pods on the node: GET /api/v1/cluster/nodes/{name}/pods
  3. Get cluster-wide conditions: GET /api/v1/cluster/nodes/conditions-overview

Solutions:

  • Identify resource-heavy pods and scale them down or set lower limits
  • Add more nodes to distribute the load
  • Cordon the node to prevent new scheduling while resolving the issue

Configuration Issues

ConfigMap Changes Not Taking Effect

Symptoms: Application using old configuration after ConfigMap update.

Explanation: Pods do not automatically reload ConfigMaps. The behavior depends on how the ConfigMap is consumed:

  • Environment variables from ConfigMap: Never update automatically. The pod must be restarted.
  • Volume-mounted ConfigMap: Updates automatically after a delay (up to a few minutes), but the application must detect and reload the file.

Solutions:

  1. Restart the deployment: POST /api/v1/workloads/deployments/{namespace}/{name}/restart triggers a rolling restart
  2. Use volume mounts: Mount ConfigMaps as volumes and implement file-watching in the application
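A deployment fragment for solution 2, assuming standard Kubernetes manifests (names and mount path are illustrative); the kubelet refreshes the mounted files after a ConfigMap update, but the application must re-read them:

```yaml
spec:
  containers:
    - name: app
      volumeMounts:
        - name: config
          mountPath: /etc/app   # the app should watch files under this path
  volumes:
    - name: config
      configMap:
        name: app-config
```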

Secret Changes Not Taking Effect

Same behavior as ConfigMaps. Secrets mounted as volumes update automatically after a delay. Secrets injected as environment variables require a pod restart.

Verify the secret value: GET /api/v1/storage/secrets/{namespace}/{name}/data/{key}

Performance Issues

CPU Throttling

Symptoms: Application is slow despite metrics showing moderate CPU usage.

Diagnosis:

  1. Check pod metrics: GET /api/v1/pods/{namespace}/{name}/metrics
  2. Compare CPU requests vs. limits in the pod spec
  3. Check deployment metrics: GET /api/v1/workloads/deployments/{namespace}/{name}/metrics
  4. Review HPA status if autoscaling is configured

Solutions:

  1. Increase CPU limits to allow bursting
  2. Increase CPU requests to guarantee more baseline capacity
  3. Scale the deployment horizontally: POST /api/v1/workloads/deployments/{namespace}/{name}/scale
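For solutions 1 and 2, the container fragment in standard Kubernetes resource syntax (values are illustrative); throttling kicks in when usage hits the limit, even if the node has spare CPU:

```yaml
resources:
  requests:
    cpu: "500m"   # guaranteed baseline
  limits:
    cpu: "2"      # throttled above this, regardless of node headroom
```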

Slow Pod Startup

Symptoms: Pods take a long time to become Ready.

Common Causes:

  1. Large container image: Use smaller base images, multi-stage builds
  2. Slow application initialization: Increase readiness probe initialDelaySeconds
  3. Dependency waiting: Application waits for databases or other services at startup
  4. CPU throttling during startup: Increase CPU requests

Quick Troubleshooting Checklist

  1. Check pod events: GET /api/v1/pods/{namespace}/{name}/events -- events explain scheduling, image pull, and volume issues
  2. View pod logs: GET /api/v1/pods/{namespace}/{name}/logs?previous=true -- include previous container logs for crash analysis
  3. Check resource usage: GET /api/v1/pods/{namespace}/{name}/metrics -- verify CPU and memory are within limits
  4. Test connectivity: POST /api/v1/networking/services/{namespace}/{name}/test-endpoint -- verify service reachability
  5. Review recent events: GET /api/v1/events/recent -- see what changed recently in the cluster
  6. Check node health: GET /api/v1/cluster/nodes/{name}/health-check -- verify the underlying node is healthy
  7. Inspect environment: GET /api/v1/pods/{namespace}/{name}/environment -- verify environment variables
  8. View YAML: GET /api/v1/pods/{namespace}/{name}/yaml -- check the full pod specification

Using Global Search for Troubleshooting

The global search endpoint (GET /api/v1/global-search) allows searching across all resource types:

  • Search by resource name to find related resources quickly
  • Search by label to find all resources with a specific label
  • Results include resource type, namespace, and current status

Using Resource Relationships

The relationships endpoint (GET /api/v1/relationships) helps trace resource dependencies:

  • View parent-child relationships (Deployment > ReplicaSet > Pod)
  • Identify which Service routes to which Pods
  • Find orphaned resources that might be causing issues
  • Understand the full resource chain for a failing application
Tip: When troubleshooting, work from the outside in: start with pod events to understand scheduling and infrastructure issues, then move to container logs for application-level problems. Use the relationships API to understand how resources connect, and use global search to find related resources quickly.