Troubleshooting
Common platform issues and how to resolve them.
Pod Issues
Pod Stuck in Pending
Symptoms: Pod remains in Pending state, never becomes Running
Common Causes:
- Insufficient Resources
  - Check node capacity in Nodes page
  - Look for "Insufficient cpu" or "Insufficient memory" in pod events
  - Solution: Scale down other workloads or add more nodes
- No Nodes Match Selector
  - Pod has a node selector but no node carries the matching labels
  - Solution: Remove node selector or label nodes appropriately
- Volume Mount Issues
  - PVC not bound or doesn't exist
  - Solution: Check PVC status, ensure a matching PV is available
- Image Pull Secrets Missing
  - Pod can't pull from private registry
  - Solution: Add image pull secret to service account
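The checks above map onto a handful of kubectl commands. This is a minimal sketch, assuming kubectl access to the cluster; the pod name myapp-pod and the namespace default are placeholders, not names taken from this platform.

```shell
POD=myapp-pod   # placeholder pod name
NS=default      # placeholder namespace

# Scheduler messages such as "Insufficient cpu" appear under Events
kubectl describe pod "$POD" -n "$NS"

# The same events on their own, sorted by time
kubectl get events -n "$NS" \
  --field-selector involvedObject.name="$POD" --sort-by=.lastTimestamp

# Compare the pod's nodeSelector with the labels nodes actually carry
kubectl get pod "$POD" -n "$NS" -o jsonpath='{.spec.nodeSelector}'
kubectl get nodes --show-labels

# Requested vs. allocatable resources on each node
kubectl describe nodes | grep -A 8 "Allocated resources"

# Every PVC the pod references should be Bound
kubectl get pvc -n "$NS"
```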
Pod Stuck in CrashLoopBackOff
Symptoms: Pod starts, crashes, restarts repeatedly
Diagnosis Steps:
- Check pod logs (including previous logs)
- Look for application errors or stack traces
- Check pod events for additional context
Common Causes:
- Application Error
  - Application exits with an error code
  - Solution: Check logs for the error, then fix the application code
- Missing Environment Variables
  - Application expects env vars that are not set
  - Solution: Add environment variables to deployment
- Failed Health Checks
  - Liveness probe starts failing before the application has finished starting
  - Solution: Increase initialDelaySeconds or adjust probe
- Database Connection Failure
  - Application can't connect to database
  - Solution: Check database credentials, network policies
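The diagnosis steps and two of the fixes can be sketched with kubectl as follows. The pod name, the variable DATABASE_URL and its value, and the 30-second delay are illustrative assumptions; the patch also assumes the first container in the deployment already defines a liveness probe.

```shell
POD=myapp-pod; NS=default   # placeholders

# Logs from the running container, then from the instance that crashed
kubectl logs "$POD" -n "$NS"
kubectl logs "$POD" -n "$NS" --previous

# Exit code and reason recorded for the last terminated container
kubectl get pod "$POD" -n "$NS" \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'

# Add a missing variable to the deployment (name and value are made up)
kubectl set env deployment/myapp -n "$NS" DATABASE_URL=postgres://db:5432/app

# Give the application more time before the first liveness check
kubectl patch deployment myapp -n "$NS" --type=json -p='[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds",
   "value": 30}
]'
```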
ImagePullBackOff
Symptoms: Pod can't pull container image
Common Causes:
- Image Doesn't Exist
  - Typo in image name or tag
  - Solution: Verify image exists in registry
- Authentication Required
  - Private registry needs credentials
  - Solution: Create image pull secret, add to service account
- Rate Limiting
  - Docker Hub rate limits exceeded
  - Solution: Use authenticated pulls or mirror images
- Network Issues
  - Can't reach registry from cluster
  - Solution: Check network policies, security groups
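Creating a pull secret and attaching it to a service account might look like the sketch below. The secret name regcred, the registry host, and the credentials are placeholders, and the example assumes the pods run under the default service account.

```shell
NS=default   # placeholder namespace

# Confirm the exact image reference the pod is trying to pull
kubectl get pod myapp-pod -n "$NS" -o jsonpath='{.spec.containers[*].image}'

# Create a registry credential (server, user, and password are placeholders)
kubectl create secret docker-registry regcred -n "$NS" \
  --docker-server=registry.example.com \
  --docker-username=USER \
  --docker-password=PASSWORD

# Attach the secret to the service account the pods run as
kubectl patch serviceaccount default -n "$NS" \
  -p '{"imagePullSecrets": [{"name": "regcred"}]}'
```

Pods created after the patch pick up the secret; existing pods have to be recreated.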
OOMKilled (Out of Memory)
Symptoms: Pod killed for using too much memory
Diagnosis:
- Check pod events for "OOMKilled" message
- Review memory usage in metrics
- Check application for memory leaks
Solutions:
- Increase Memory Limit
  - If application legitimately needs more memory
  - Update deployment with higher memory limit
- Fix Memory Leak
  - Profile application to find leak
  - Update application code
- Optimize Application
  - Reduce memory footprint
  - Use memory more efficiently
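A short sketch of confirming the OOM kill and raising the limit, assuming metrics-server is installed for kubectl top. The names and the 512Mi/1Gi values are examples only, not recommendations.

```shell
POD=myapp-pod; NS=default   # placeholders

# Reason for the last container termination; an OOM kill is reported as OOMKilled
kubectl get pod "$POD" -n "$NS" \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'

# Current memory usage per container (requires metrics-server)
kubectl top pod "$POD" -n "$NS" --containers

# Raise the memory request and limit on the deployment
kubectl set resources deployment/myapp -n "$NS" \
  --requests=memory=512Mi --limits=memory=1Gi
```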
Networking Issues
Service Not Accessible
Symptoms: Can't reach service from other pods or externally
Diagnosis Steps:
- Check service exists and has endpoints
- Verify pod selector matches pod labels
- Test from within cluster (exec into pod, curl service)
- Check network policies
Common Causes:
- No Endpoints
  - Service selector doesn't match any pods
  - Solution: Fix selector or pod labels
- Pods Not Ready
  - Pods exist but are failing their readiness probe
  - Solution: Fix readiness probe or application
- Network Policy Blocking
  - Network policy denies traffic
  - Solution: Update network policy to allow traffic
- Wrong Port
  - Service targetPort doesn't match the port the container listens on
  - Solution: Update service port or targetPort
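The diagnosis steps above can be run roughly as follows. The service name myapp, port 80, and the pod name are placeholders, and the in-cluster test assumes the container image ships curl.

```shell
SVC=myapp; NS=default   # placeholders

# An empty ENDPOINTS column means no ready pod matches the selector
kubectl get endpoints "$SVC" -n "$NS"

# Compare the service selector with the labels on the pods
kubectl get service "$SVC" -n "$NS" -o jsonpath='{.spec.selector}'
kubectl get pods -n "$NS" --show-labels

# Compare port and targetPort with the port the container listens on
kubectl get service "$SVC" -n "$NS" -o jsonpath='{.spec.ports}'

# Call the service from inside the cluster
kubectl exec -n "$NS" myapp-pod -- \
  curl -sS "http://$SVC.$NS.svc.cluster.local:80/"

# List network policies that could be dropping the traffic
kubectl get networkpolicy -n "$NS"
```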
Ingress Not Working
Symptoms: Can't access application via ingress URL
Diagnosis Steps:
- Check ingress exists and has rules
- Verify backend service exists and has endpoints
- Check ingress controller logs
- Test service directly (bypass ingress)
Common Causes:
- DNS Not Configured
  - Domain doesn't point to load balancer
  - Solution: Update DNS records
- TLS Certificate Issues
  - Certificate expired or doesn't match domain
  - Solution: Update certificate secret
- Backend Service Down
  - Service has no ready pods
  - Solution: Fix backend service/pods
- Ingress Class Mismatch
  - Ingress specifies an ingress class that no controller serves
  - Solution: Set spec.ingressClassName (or the legacy kubernetes.io/ingress.class annotation) to the class of the installed controller
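A sketch of the ingress checks. The ingress, service, and secret names are placeholders, and the controller log command assumes a standard ingress-nginx install (namespace ingress-nginx, deployment ingress-nginx-controller); other controllers use different names.

```shell
NS=default   # placeholder namespace

# Rules, backends, and recent events for the ingress
kubectl describe ingress myapp -n "$NS"

# Classes available in the cluster vs. the class the ingress asks for
kubectl get ingressclass
kubectl get ingress myapp -n "$NS" -o jsonpath='{.spec.ingressClassName}'

# Controller logs (names assume a default ingress-nginx install)
kubectl logs -n ingress-nginx deployment/ingress-nginx-controller --tail=100

# Expiry date and subject of the certificate stored in the TLS secret
kubectl get secret myapp-tls -n "$NS" -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -enddate -subject

# Bypass the ingress: forward local port 8080 to the service (runs until Ctrl-C)
kubectl port-forward -n "$NS" service/myapp 8080:80
```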
Storage Issues
PVC Pending
Symptoms: PersistentVolumeClaim stuck in Pending state
Common Causes:
- No Matching PV
  - No PV available with requested size/class
  - Solution: Create a matching PV, or recreate the PVC with a size and class that an available PV can satisfy
- Storage Class Not Found
  - Specified storage class doesn't exist
  - Solution: Use existing storage class or create new one
- Zone Mismatch
  - PV in different zone than pod
  - Solution: Use WaitForFirstConsumer binding mode
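These causes can usually be told apart with three commands. The PVC name data is a placeholder.

```shell
NS=default   # placeholder namespace

# Events at the bottom explain why the claim is still Pending
kubectl describe pvc data -n "$NS"

# Available volumes with their capacity, access modes, and storage class
kubectl get pv

# Storage classes and their binding mode (Immediate or WaitForFirstConsumer)
kubectl get storageclass
```

With WaitForFirstConsumer, a claim stays Pending until a pod that uses it is scheduled, so a Pending claim with no consumer is expected in that mode.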
Volume Mount Failures
Symptoms: Pod can't mount volume, stuck in ContainerCreating
Diagnosis:
- Check pod events for mount errors
- Verify PVC is Bound
- Check PV access mode matches pod requirements
Common Causes:
- Access Mode Conflict
  - PV is ReadWriteOnce but the pod is scheduled on a different node
  - Solution: Use ReadWriteMany (if the storage backend supports it) or move pod to same node
- Volume Already Mounted
  - Another pod is using the ReadWriteOnce volume
  - Solution: Delete other pod or use ReadWriteMany
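A minimal sketch for finding out which pod and node currently hold a volume. The pod and PVC names are placeholders; the "Used By" field is shown by recent kubectl versions and may be labeled differently in older ones.

```shell
POD=myapp-pod; NS=default   # placeholders

# Mount and attach errors show up in the pod's events
kubectl describe pod "$POD" -n "$NS"

# Claim phase and access modes
kubectl get pvc data -n "$NS" -o jsonpath='{.status.phase} {.spec.accessModes}'

# Recent kubectl versions list the consuming pods under "Used By"
kubectl describe pvc data -n "$NS"

# Which node each pod is running on
kubectl get pods -n "$NS" -o wide
```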
Resource Constraints
Node Out of Resources
Symptoms: Pods can't be scheduled, nodes under pressure
Diagnosis:
- Check node conditions (MemoryPressure, DiskPressure)
- Review resource usage vs. capacity
- Identify resource-hungry pods
Solutions:
- Scale Down Pods
  - Reduce replica count of non-critical services
- Increase Node Resources
  - Resize nodes (if cloud provider supports)
- Add More Nodes
  - Scale node group or add new nodes
- Evict Pods
  - Kubernetes evicts pods on its own when a node is under pressure; set resource requests and pod priorities so that critical pods are evicted last
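The diagnosis and the first solution can be sketched as below, assuming metrics-server is installed for kubectl top. The node and deployment names are placeholders.

```shell
NODE=node-1   # placeholder node name

# Conditions (MemoryPressure, DiskPressure) and allocated resources
kubectl describe node "$NODE"

# Usage vs. capacity across nodes (requires metrics-server)
kubectl top nodes

# Heaviest pods across all namespaces, by memory and by CPU
kubectl top pods -A --sort-by=memory
kubectl top pods -A --sort-by=cpu

# Reduce replicas of a non-critical service
kubectl scale deployment/myapp --replicas=1
```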
CPU Throttling
Symptoms: Application slow despite low CPU usage shown
Diagnosis:
- Check throttled CPU time in metrics
- Compare CPU requests vs. limits
- Review application performance metrics
Solutions:
- Increase CPU Limit
  - Give application more CPU headroom
- Remove CPU Limit
  - If application has bursty CPU needs
- Optimize Application
  - Reduce CPU usage in code
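If the monitoring stack doesn't expose throttling metrics, the kernel's counters can be read directly from the container. A sketch, assuming the image includes cat and the node uses cgroup v2; the names and the limit of 2 CPUs are examples.

```shell
POD=myapp-pod; NS=default   # placeholders

# cgroup v2 path; on cgroup v1 nodes the file is /sys/fs/cgroup/cpu/cpu.stat
kubectl exec -n "$NS" "$POD" -- cat /sys/fs/cgroup/cpu.stat
# nr_throttled and throttled_usec (throttled_time on v1) that keep growing
# between two runs indicate the container is being throttled

# Current CPU requests and limits
kubectl get pod "$POD" -n "$NS" -o jsonpath='{.spec.containers[*].resources}'

# Raise the CPU limit on the deployment
kubectl set resources deployment/myapp -n "$NS" --limits=cpu=2
```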
Configuration Issues
ConfigMap/Secret Not Updating
Symptoms: Pod using old configuration after updating ConfigMap
Explanation: Pods don't automatically reload ConfigMaps/Secrets consumed as environment variables; those values are read once when the container starts
Solutions:
- Restart Deployment
  - Trigger rolling restart to pick up new config:
    kubectl rollout restart deployment/myapp
- Use ConfigMap Reloader
  - Deploy a reloader that watches for config changes
  - It automatically restarts pods on changes
- Mount as Volume
  - ConfigMaps mounted as volumes update automatically after a short sync delay (not when mounted with subPath)
  - Application must watch the file for changes
Environment Variables Wrong
Symptoms: Application receives incorrect environment variables
Diagnosis:
- Exec into pod and check env vars with the env command
- Verify deployment YAML has correct values
- Check if using wrong config/secret
Common Causes:
- Wrong ConfigMap Reference
  - Referencing old ConfigMap
  - Solution: Update configMapRef
- Typo in Key Name
  - Environment variable key doesn't match ConfigMap key
  - Solution: Fix key name
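Comparing what the container actually sees with what the deployment and ConfigMap define might look like this. The ConfigMap name myapp-config is hypothetical, and the jsonpath expressions assume the variables are set on the first container.

```shell
POD=myapp-pod; NS=default   # placeholders

# Environment as seen inside the running container
kubectl exec -n "$NS" "$POD" -- env | sort

# Variables and ConfigMap/Secret references declared on the deployment
kubectl get deployment myapp -n "$NS" \
  -o jsonpath='{.spec.template.spec.containers[0].env}'
kubectl get deployment myapp -n "$NS" \
  -o jsonpath='{.spec.template.spec.containers[0].envFrom}'

# Keys and values in the referenced ConfigMap
kubectl get configmap myapp-config -n "$NS" -o yaml
```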
Performance Issues
Slow Pod Startup
Symptoms: Pods take long time to become Ready
Common Causes:
- Large Container Image
  - Image download takes long time
  - Solution: Use smaller images, cache images on nodes
- Slow Application Startup
  - Application initialization takes time
  - Solution: Increase readiness probe delay, or use a startup probe
- Resource Constraints
  - CPU throttling during startup
  - Solution: Increase CPU requests or limits
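To see which phase of startup is slow, the pod's event and condition timestamps are a reasonable starting point. A sketch with placeholder names:

```shell
POD=myapp-pod; NS=default   # placeholders

# Scheduled, Pulling, Pulled, Created, Started events in time order;
# a long gap between Pulling and Pulled points at image size
kubectl get events -n "$NS" \
  --field-selector involvedObject.name="$POD" --sort-by=.lastTimestamp

# When each pod condition (PodScheduled, Initialized, Ready, ...) last changed
kubectl get pod "$POD" -n "$NS" \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.lastTransitionTime}{"\n"}{end}'
```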
High Latency
Symptoms: API requests taking longer than expected
Diagnosis Steps:
- Check service mesh metrics (if using)
- Review application logs for slow queries
- Check pod CPU/memory usage
- Look for network issues
Common Causes:
- Database Performance
  - Slow queries or connection pool exhaustion
  - Solution: Optimize queries, increase connection pool
- External API Calls
  - Slow third-party APIs
  - Solution: Implement caching, timeouts
- Insufficient Resources
  - CPU or memory constraints
  - Solution: Increase resources
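A rough way to separate network time from application time is to measure a request from inside the cluster with curl's timing variables. The URL and names are placeholders, and the sketch assumes curl is available in the container image.

```shell
POD=myapp-pod; NS=default   # placeholders

# Break one request into DNS, connect, time-to-first-byte, and total time
kubectl exec -n "$NS" "$POD" -- curl -s -o /dev/null \
  -w 'dns=%{time_namelookup}s connect=%{time_connect}s first_byte=%{time_starttransfer}s total=%{time_total}s\n' \
  http://myapp.default.svc.cluster.local:80/

# CPU and memory per container while the slow requests are happening
kubectl top pod -n "$NS" --containers
```

A large gap between connect and first_byte suggests the time is spent in the application or its dependencies rather than on the network.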
General Debugging Checklist
- Check pod events: kubectl describe pod <name>
- View pod logs: Platform → Pods → Logs
- Check resource usage: Platform → Monitoring
- Test connectivity: Exec into pod, use curl/ping
- Review recent changes: What was deployed recently?
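For command-line use, the first checks in this list can be bundled into one shell function. The function name triage_pod is hypothetical and not part of the platform; it only wraps standard kubectl subcommands and assumes metrics-server for the last step.

```shell
# Hypothetical helper: collect the basics for one pod in a single pass.
# Usage: triage_pod <pod-name> [namespace]
triage_pod() {
  pod="$1"
  ns="${2:-default}"

  echo "== Events and status =="
  kubectl describe pod "$pod" -n "$ns"

  echo "== Current logs (last 50 lines) =="
  kubectl logs "$pod" -n "$ns" --tail=50

  echo "== Previous container logs (only present after a restart) =="
  kubectl logs "$pod" -n "$ns" --previous --tail=50 || true

  echo "== Resource usage (requires metrics-server) =="
  kubectl top pod "$pod" -n "$ns" || true
}

# Example call, with placeholder names:
# triage_pod myapp-pod production
```

The `|| true` guards keep the function going when a pod has never restarted or metrics are unavailable.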