Troubleshooting

Common platform issues and how to resolve them.

Pod Issues

Pod Stuck in Pending

Symptoms: Pod remains in Pending state, never becomes Running

Common Causes:

  1. Insufficient Resources

    • Check node capacity on the Nodes page
    • Look for "Insufficient cpu" or "Insufficient memory" in pod events
    • Solution: Scale down other workloads or add more nodes
  2. No Nodes Match Selector

    • Pod has node selector but no nodes match labels
    • Solution: Remove node selector or label nodes appropriately
  3. Volume Mount Issues

    • PVC not bound or doesn't exist
    • Solution: Check PVC status, ensure PV available
  4. Image Pull Secrets Missing

    • Pod is scheduled but can't pull from a private registry (events show ErrImagePull or ImagePullBackOff)
    • Solution: Add an image pull secret to the pod's service account
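
For the node selector case, the matching rule can be sketched as below. This is a minimal illustration, not the scheduler's implementation: the function name and the sample nodes are hypothetical, and the input is assumed to be shaped like the items in the JSON output of kubectl get nodes -o json.

```python
def nodes_matching_selector(node_selector, nodes):
    """Return names of nodes whose labels contain every key/value in node_selector."""
    matches = []
    for node in nodes:
        labels = node.get("metadata", {}).get("labels") or {}
        if all(labels.get(key) == value for key, value in node_selector.items()):
            matches.append(node["metadata"]["name"])
    return matches


# Hypothetical sample data shaped like items from `kubectl get nodes -o json`.
nodes = [
    {"metadata": {"name": "node-a", "labels": {"disktype": "ssd", "zone": "z1"}}},
    {"metadata": {"name": "node-b", "labels": {"disktype": "hdd", "zone": "z1"}}},
]

print(nodes_matching_selector({"disktype": "ssd"}, nodes))   # ['node-a']
print(nodes_matching_selector({"disktype": "nvme"}, nodes))  # [] -> pod would stay Pending
```

An empty result means the selector rules out every node, so the pod cannot be scheduled until the selector or the node labels change.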

Pod Stuck in CrashLoopBackOff

Symptoms: Pod starts, crashes, restarts repeatedly

Diagnosis Steps:

  1. Check pod logs (including previous logs)
  2. Look for application errors or stack traces
  3. Check pod events for additional context

Common Causes:

  1. Application Error

    • Application exits with error code
    • Solution: Fix application code, check logs for errors
  2. Missing Environment Variables

    • Application expects required env vars
    • Solution: Add environment variables to deployment
  3. Failed Health Checks

    • Liveness probe starts failing before the application has finished starting
    • Solution: Increase initialDelaySeconds, add a startup probe, or relax the probe thresholds
  4. Database Connection Failure

    • Can't connect to database
    • Solution: Check database credentials, network policies
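
For the failed health check case, it can help to estimate how long a container gets before the liveness probe restarts it. The sketch below is a rough approximation only: it assumes the first probe fires right after initialDelaySeconds and ignores timeoutSeconds and kubelet scheduling jitter. The function names and the 45-second startup time are hypothetical.

```python
# Kubernetes defaults for probe fields left unset in the manifest.
PROBE_DEFAULTS = {"initialDelaySeconds": 0, "periodSeconds": 10, "failureThreshold": 3}


def seconds_until_first_restart(probe):
    """Estimate how long a never-healthy container runs before a liveness restart."""
    cfg = {**PROBE_DEFAULTS, **probe}
    # First probe runs after the initial delay; each further failure is one period later.
    return cfg["initialDelaySeconds"] + (cfg["failureThreshold"] - 1) * cfg["periodSeconds"]


def probe_allows_startup(probe, startup_seconds):
    """True if an app needing startup_seconds is healthy before the restart deadline."""
    return startup_seconds < seconds_until_first_restart(probe)


print(seconds_until_first_restart({}))                        # 20
print(probe_allows_startup({"initialDelaySeconds": 5}, 45))   # False (deadline is 25s)
print(probe_allows_startup({"initialDelaySeconds": 60}, 45))  # True (deadline is 80s)
```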

ImagePullBackOff

Symptoms: Pod can't pull container image

Common Causes:

  1. Image Doesn't Exist

    • Typo in image name or tag
    • Solution: Verify image exists in registry
  2. Authentication Required

    • Private registry needs credentials
    • Solution: Create image pull secret, add to service account
  3. Rate Limiting

    • Docker Hub rate limits exceeded
    • Solution: Use authenticated pulls or mirror images
  4. Network Issues

    • Can't reach registry from cluster
    • Solution: Check network policies, security groups
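
To see which container and image are affected, the waiting reason can be read from the pod status. The following is a minimal sketch, assuming a pod object shaped like the JSON output of kubectl get pod <name> -o json. The function name and sample data are hypothetical, and init containers are not inspected.

```python
PULL_FAILURE_REASONS = {"ErrImagePull", "ImagePullBackOff"}


def image_pull_failures(pod):
    """Return (container, image, message) for containers waiting on a failed image pull."""
    failures = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        waiting = status.get("state", {}).get("waiting")
        if waiting and waiting.get("reason") in PULL_FAILURE_REASONS:
            failures.append((status["name"], status.get("image", ""), waiting.get("message", "")))
    return failures


# Hypothetical pod status with one healthy and one failing container.
pod = {
    "status": {
        "containerStatuses": [
            {"name": "app", "image": "registry.example.com/app:v1", "state": {"running": {}}},
            {
                "name": "sidecar",
                "image": "registry.example.com/sidecar:lates",
                "state": {"waiting": {"reason": "ImagePullBackOff", "message": "Back-off pulling image"}},
            },
        ]
    }
}

for name, image, message in image_pull_failures(pod):
    print(f"{name}: {image} ({message})")
```

The reported image string is the first thing to compare against the registry, since a typo in the name or tag is the most common cause listed above.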

OOMKilled (Out of Memory)

Symptoms: Pod killed for using too much memory

Diagnosis:

  1. Check pod events and the container's last termination state for "OOMKilled" (usually exit code 137)
  2. Review memory usage in metrics
  3. Check application for memory leaks

Solutions:

  1. Increase Memory Limit

    • If application legitimately needs more memory
    • Update deployment with higher memory limit
  2. Fix Memory Leak

    • Profile application to find leak
    • Update application code
  3. Optimize Application

    • Reduce memory footprint
    • Use memory more efficiently
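
The first diagnosis step can be sketched in code. This example assumes a pod object shaped like the JSON output of kubectl get pod <name> -o json and looks for a terminated state with reason OOMKilled in either the current or the last container state. The function name and sample data are hypothetical.

```python
def oom_killed_containers(pod):
    """Return (container, restart_count) for containers whose state or last state is OOMKilled."""
    killed = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        for key in ("state", "lastState"):
            terminated = status.get(key, {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                killed.append((status["name"], status.get("restartCount", 0)))
                break  # count each container once
    return killed


# Hypothetical pod: the container is running again, but its last run was OOM-killed.
pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "worker",
                "restartCount": 4,
                "state": {"running": {}},
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            }
        ]
    }
}

print(oom_killed_containers(pod))  # [('worker', 4)]
```

A rising restart count together with repeated OOMKilled terminations suggests the limit is too low or the application is leaking memory.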

Networking Issues

Service Not Accessible

Symptoms: Can't reach service from other pods or externally

Diagnosis Steps:

  1. Check service exists and has endpoints
  2. Verify pod selector matches pod labels
  3. Test from within cluster (exec into pod, curl service)
  4. Check network policies

Common Causes:

  1. No Endpoints

    • Service selector doesn't match any pods
    • Solution: Fix selector or pod labels
  2. Pods Not Ready

    • Pods exist but failing readiness probe
    • Solution: Fix readiness probe or application
  3. Network Policy Blocking

    • Network policy denies traffic
    • Solution: Update network policy to allow traffic
  4. Wrong Port

    • Service targetPort doesn't match the port the container listens on
    • Solution: Update the service's targetPort (or the container's port) so they agree
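
The selector and port checks above can be sketched as follows, assuming service and pod objects shaped like kubectl -o json output. The function names and sample data are hypothetical. Note that declared containerPort entries are informational, so a missing match here is only a hint, not proof that the application isn't listening.

```python
def service_selects_pod(service, pod):
    """True if every selector key/value on the service appears in the pod's labels."""
    selector = service.get("spec", {}).get("selector") or {}
    labels = pod.get("metadata", {}).get("labels") or {}
    # A service without a selector gets no endpoints automatically.
    return bool(selector) and all(labels.get(k) == v for k, v in selector.items())


def unmatched_target_ports(service, pod):
    """Return service targetPorts that no container declares by number or name."""
    declared = set()
    for container in pod.get("spec", {}).get("containers", []):
        for port in container.get("ports", []):
            declared.add(port.get("containerPort"))
            if port.get("name"):
                declared.add(port["name"])
    missing = []
    for port in service.get("spec", {}).get("ports", []):
        target = port.get("targetPort", port.get("port"))  # targetPort defaults to port
        if target not in declared:
            missing.append(target)
    return missing


# Hypothetical objects: labels match, but the ports disagree.
service = {"spec": {"selector": {"app": "web"}, "ports": [{"port": 80, "targetPort": 8080}]}}
pod = {
    "metadata": {"labels": {"app": "web", "tier": "frontend"}},
    "spec": {"containers": [{"name": "web", "ports": [{"containerPort": 8000}]}]},
}

print(service_selects_pod(service, pod))     # True
print(unmatched_target_ports(service, pod))  # [8080]
```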

Ingress Not Working

Symptoms: Can't access application via ingress URL

Diagnosis Steps:

  1. Check ingress exists and has rules
  2. Verify backend service exists and has endpoints
  3. Check ingress controller logs
  4. Test service directly (bypass ingress)

Common Causes:

  1. DNS Not Configured

    • Domain doesn't point to load balancer
    • Solution: Update DNS records
  2. TLS Certificate Issues

    • Certificate expired or doesn't match domain
    • Solution: Update certificate secret
  3. Backend Service Down

    • Service has no ready pods
    • Solution: Fix backend service/pods
  4. Ingress Class Mismatch

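    • Ingress specifies an ingress class that no installed controller handles
    • Solution: Set spec.ingressClassName (or the legacy kubernetes.io/ingress.class annotation) to the controller's class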
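
An ingress class mismatch can be spotted by collecting every class the ingress declares and comparing it with the class the controller serves. This is a minimal sketch over an object shaped like kubectl get ingress <name> -o json. The function name, the sample ingress, and the controller class "traefik" are hypothetical. How a controller resolves a conflict between the field and the annotation varies, so the sketch only reports what is declared.

```python
LEGACY_CLASS_ANNOTATION = "kubernetes.io/ingress.class"


def declared_ingress_classes(ingress):
    """Return the set of classes declared via spec.ingressClassName or the legacy annotation."""
    declared = set()
    spec_class = ingress.get("spec", {}).get("ingressClassName")
    if spec_class:
        declared.add(spec_class)
    annotations = ingress.get("metadata", {}).get("annotations") or {}
    if annotations.get(LEGACY_CLASS_ANNOTATION):
        declared.add(annotations[LEGACY_CLASS_ANNOTATION])
    return declared


# Hypothetical ingress that targets "nginx" while the cluster runs a "traefik" controller.
ingress = {"metadata": {"name": "web"}, "spec": {"ingressClassName": "nginx"}}
controller_class = "traefik"

classes = declared_ingress_classes(ingress)
print(classes)                      # {'nginx'}
print(controller_class in classes)  # False -> the controller ignores this ingress
```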

Storage Issues

PVC Pending

Symptoms: PersistentVolumeClaim stuck in Pending state

Common Causes:

  1. No Matching PV

    • No available PV satisfies the requested size, access mode, and storage class
    • Solution: Create a matching PV, or recreate the PVC with a request an available PV can satisfy
  2. Storage Class Not Found

    • Specified storage class doesn't exist
    • Solution: Use existing storage class or create new one
  3. Zone Mismatch

    • PV is in a different zone than the node the pod can run on
    • Solution: Use a storage class with volumeBindingMode: WaitForFirstConsumer
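
For statically provisioned volumes, the "no matching PV" check can be approximated as below. This is a simplified sketch, not the binding controller's logic: it compares only capacity, storage class, and access modes, and ignores label selectors, volume mode, and whether the PV is already claimed. The quantity parser handles only plain numbers with the common decimal and binary suffixes. All names and sample objects are hypothetical.

```python
import re

# Multipliers for the common Kubernetes quantity suffixes (subset only).
_SUFFIXES = {
    "": 1,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}


def parse_quantity(text):
    """Convert a quantity such as '10Gi' or '500M' to a number of bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([A-Za-z]*)", text)
    if not match or match.group(2) not in _SUFFIXES:
        raise ValueError(f"unsupported quantity: {text!r}")
    return float(match.group(1)) * _SUFFIXES[match.group(2)]


def pv_can_satisfy(pv, pvc):
    """True if the PV is large enough, has the same class, and offers the requested modes."""
    pv_spec, pvc_spec = pv["spec"], pvc["spec"]
    requested = parse_quantity(pvc_spec["resources"]["requests"]["storage"])
    big_enough = parse_quantity(pv_spec["capacity"]["storage"]) >= requested
    same_class = pv_spec.get("storageClassName", "") == pvc_spec.get("storageClassName", "")
    modes_ok = set(pvc_spec.get("accessModes", [])) <= set(pv_spec.get("accessModes", []))
    return big_enough and same_class and modes_ok


def make_pvc(size):
    """Build a hypothetical PVC spec requesting the given size."""
    return {"spec": {"resources": {"requests": {"storage": size}},
                     "storageClassName": "standard", "accessModes": ["ReadWriteOnce"]}}


pv = {"spec": {"capacity": {"storage": "10Gi"}, "storageClassName": "standard",
               "accessModes": ["ReadWriteOnce"]}}

print(pv_can_satisfy(pv, make_pvc("5Gi")))   # True
print(pv_can_satisfy(pv, make_pvc("20Gi")))  # False -> PVC stays Pending
```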

Volume Mount Failures

Symptoms: Pod can't mount volume, stuck in ContainerCreating

Diagnosis:

  1. Check pod events for mount errors
  2. Verify PVC is Bound
  3. Check PV access mode matches pod requirements

Common Causes:

  1. Access Mode Conflict

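    • PV is ReadWriteOnce but the pod is scheduled on a different node than the one the volume is attached to
    • Solution: Move the pod to the same node, or use ReadWriteMany if the storage backend supports it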
  2. Volume Already Mounted

    • Another pod is still using the ReadWriteOnce volume from a different node
    • Solution: Delete the other pod so the volume can detach, or use ReadWriteMany

Resource Constraints

Node Out of Resources

Symptoms: Pods can't be scheduled, nodes under pressure

Diagnosis:

  1. Check node conditions (MemoryPressure, DiskPressure)
  2. Review resource usage vs. capacity
  3. Identify resource-hungry pods

Solutions:

  1. Scale Down Pods

    • Reduce replica count of non-critical services
  2. Increase Node Resources

    • Resize nodes (if cloud provider supports)
  3. Add More Nodes

    • Scale node group or add new nodes
  4. Evict Pods

    • If pressure persists, Kubernetes evicts pods automatically to free resources
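
Checking node conditions can be sketched as follows, assuming node objects shaped like the items in kubectl get nodes -o json. The function name and sample nodes are hypothetical. PIDPressure is included alongside the two conditions named above because it is reported the same way.

```python
PRESSURE_TYPES = {"MemoryPressure", "DiskPressure", "PIDPressure"}


def pressured_nodes(nodes):
    """Map node name to the sorted list of pressure conditions whose status is 'True'."""
    result = {}
    for node in nodes:
        active = sorted(
            cond["type"]
            for cond in node.get("status", {}).get("conditions", [])
            if cond.get("type") in PRESSURE_TYPES and cond.get("status") == "True"
        )
        if active:
            result[node["metadata"]["name"]] = active
    return result


# Hypothetical nodes: node-a is healthy, node-b is under memory pressure.
nodes = [
    {"metadata": {"name": "node-a"},
     "status": {"conditions": [{"type": "MemoryPressure", "status": "False"},
                               {"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "node-b"},
     "status": {"conditions": [{"type": "MemoryPressure", "status": "True"},
                               {"type": "DiskPressure", "status": "False"},
                               {"type": "Ready", "status": "True"}]}},
]

print(pressured_nodes(nodes))  # {'node-b': ['MemoryPressure']}
```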

CPU Throttling

Symptoms: Application slow despite low CPU usage shown

Diagnosis:

  1. Check throttled CPU time in metrics
  2. Compare CPU requests vs. limits
  3. Review application performance metrics

Solutions:

  1. Increase CPU Limit

    • Give application more CPU headroom
  2. Remove CPU Limit

    • If application has bursty CPU needs
  3. Optimize Application

    • Reduce CPU usage in code
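
Throttled CPU time is usually judged as a ratio rather than an absolute number. The sketch below assumes two samples of the cAdvisor counters container_cpu_cfs_periods_total and container_cpu_cfs_throttled_periods_total for one container. The dictionary keys, the function name, and the sample values are hypothetical, and no particular ratio is an official threshold.

```python
def throttle_ratio(before, after):
    """Fraction of CFS periods in which the container was throttled between two samples."""
    periods = after["periods"] - before["periods"]
    throttled = after["throttled"] - before["throttled"]
    if periods <= 0:
        return 0.0  # no scheduling periods elapsed, or the counters were reset
    return throttled / periods


# Hypothetical counter samples taken a minute apart.
before = {"periods": 1000, "throttled": 100}
after = {"periods": 1600, "throttled": 250}

print(throttle_ratio(before, after))  # 0.25
```

A container can show low average CPU usage and still have a high ratio, because short bursts hit the limit within individual scheduling periods.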

Configuration Issues

ConfigMap/Secret Not Updating

Symptoms: Pod using old configuration after updating ConfigMap

Explanation: Pods don't automatically reload ConfigMaps/Secrets consumed as environment variables; those values are fixed when the container starts

Solutions:

  1. Restart Deployment

    • Trigger rolling restart to pick up new config
    • kubectl rollout restart deployment/myapp
  2. Use ConfigMap Reloader

    • Deploy reloader that watches for config changes
    • Automatically restarts pods on changes
  3. Mount as Volume

    • ConfigMaps mounted as volumes are refreshed automatically after a short delay (subPath mounts are not)
    • Application must watch the file for changes and reload it
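
For the volume-mount option, the application side can be as simple as polling the mounted file and reloading when its content changes. The following is one possible sketch, not a prescribed mechanism: the class name is hypothetical, and it compares content hashes rather than relying on file notification events. The demo uses a temporary file in place of a real mount path.

```python
import hashlib
import os
import tempfile
from pathlib import Path


class ConfigFileWatcher:
    """Detect content changes in a config file by comparing SHA-256 digests."""

    def __init__(self, path):
        self.path = Path(path)
        self._digest = self._current_digest()

    def _current_digest(self):
        return hashlib.sha256(self.path.read_bytes()).hexdigest()

    def changed(self):
        """Return True once per content change since the last call."""
        digest = self._current_digest()
        if digest != self._digest:
            self._digest = digest
            return True
        return False


# Demo with a temporary file standing in for the mounted ConfigMap key.
with tempfile.TemporaryDirectory() as directory:
    config_path = os.path.join(directory, "app.conf")
    Path(config_path).write_text("level=info\n")
    watcher = ConfigFileWatcher(config_path)
    print(watcher.changed())  # False
    Path(config_path).write_text("level=debug\n")
    print(watcher.changed())  # True
    print(watcher.changed())  # False
```

In a real application, changed() would be called on a timer (for example every few seconds) and followed by whatever reload logic the application needs.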

Environment Variables Wrong

Symptoms: Application receives incorrect environment variables

Diagnosis:

  1. Exec into the pod and check env vars with the env command
  2. Verify deployment YAML has correct values
  3. Check if using wrong config/secret

Common Causes:

  1. Wrong ConfigMap Reference

    • Referencing an old or wrong ConfigMap
    • Solution: Update the ConfigMap name in configMapRef (envFrom) or configMapKeyRef (valueFrom)
  2. Typo in Key Name

    • Environment variable key doesn't match ConfigMap key
    • Solution: Fix key name
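
Both causes can be checked mechanically by comparing each configMapKeyRef in the container spec against the keys that actually exist. This is a minimal sketch over data shaped like kubectl -o json output. The function name, the ConfigMap name app-config, and the sample keys are hypothetical, and envFrom references are not covered.

```python
def missing_config_keys(container, configmaps):
    """Return (env_var, problem) for configMapKeyRef entries that cannot be resolved.

    configmaps maps ConfigMap name -> its data dictionary.
    """
    problems = []
    for env in container.get("env", []):
        ref = (env.get("valueFrom") or {}).get("configMapKeyRef")
        if not ref:
            continue  # literal value or a different source
        data = configmaps.get(ref["name"])
        if data is None:
            problems.append((env["name"], f"ConfigMap {ref['name']!r} not found"))
        elif ref["key"] not in data:
            problems.append((env["name"], f"key {ref['key']!r} not in ConfigMap {ref['name']!r}"))
    return problems


# Hypothetical container spec with one typo in a key name.
container = {
    "name": "app",
    "env": [
        {"name": "LOG_LEVEL", "valueFrom": {"configMapKeyRef": {"name": "app-config", "key": "log_level"}}},
        {"name": "DB_HOST", "valueFrom": {"configMapKeyRef": {"name": "app-config", "key": "db_hots"}}},
        {"name": "MODE", "value": "production"},
    ],
}
configmaps = {"app-config": {"log_level": "info", "db_host": "db.internal"}}

for name, problem in missing_config_keys(container, configmaps):
    print(f"{name}: {problem}")
```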

Performance Issues

Slow Pod Startup

Symptoms: Pods take a long time to become Ready

Common Causes:

  1. Large Container Image

    • Image download takes a long time
    • Solution: Use smaller images, cache images on nodes
  2. Slow Application Startup

    • Application initialization takes time
    • Solution: Increase readiness probe delay
  3. Resource Constraints

    • CPU throttling during startup
    • Solution: Increase the CPU limit (and the request, if the node is contended)

High Latency

Symptoms: API requests taking longer than expected

Diagnosis Steps:

  1. Check service mesh metrics (if a mesh is in use)
  2. Review application logs for slow queries
  3. Check pod CPU/memory usage
  4. Look for network issues

Common Causes:

  1. Database Performance

    • Slow queries or connection pool exhaustion
    • Solution: Optimize queries, increase connection pool
  2. External API Calls

    • Slow third-party APIs
    • Solution: Implement caching, timeouts
  3. Insufficient Resources

    • CPU or memory constraints
    • Solution: Increase resources

Quick Troubleshooting Checklist

  1. Check pod events: kubectl describe pod <name>
  2. View pod logs: Platform → Pods → Logs
  3. Check resource usage: Platform → Monitoring
  4. Test connectivity: Exec into pod, use curl/ping
  5. Review recent changes: What was deployed recently?