Kubernetes Debugging Toolkit: From kubectl to Victory

Kubernetes does not tell you what is wrong - it tells you what state things are in. Your job is to interpret the state, trace the root cause, and fix it. This guide covers the debugging commands and patterns that platform engineers use daily, organized by the problems you actually encounter.

Essential Debugging Commands

Before diving into specific problems, master these five commands. They solve 80% of issues:

# 1. What is happening right now?
kubectl get pods -n my-namespace -o wide

# 2. Why is this pod unhappy?
kubectl describe pod my-pod-abc123 -n my-namespace

# 3. What is the application saying?
kubectl logs my-pod-abc123 -n my-namespace --tail=100

# 4. What happened recently?
kubectl get events -n my-namespace --sort-by='.lastTimestamp'

# 5. Get inside and look around
kubectl exec -it my-pod-abc123 -n my-namespace -- /bin/sh

Problem: Pod Stuck in CrashLoopBackOff

The pod starts, crashes, Kubernetes restarts it, it crashes again. The backoff delay grows exponentially (10s, 20s, 40s, up to 5 minutes).

# Step 1: Check why it crashed
kubectl logs my-pod --previous  # Logs from the LAST crashed container

# Step 2: Check the exit code
kubectl describe pod my-pod | grep -A 5 "Last State"
#   Last State:  Terminated
#     Reason:    Error
#     Exit Code: 137    ← OOMKilled (out of memory)
#     Exit Code: 1      ← Application error
#     Exit Code: 143    ← SIGTERM (graceful shutdown)

# Step 3: If OOMKilled (exit code 137), increase memory limits
# Check current usage:
kubectl top pod my-pod

# Fix: increase memory in deployment spec
# resources:
#   requests:
#     memory: "256Mi"
#   limits:
#     memory: "512Mi"     ← increase this

# Step 4: If exit code 1, check application logs
kubectl logs my-pod --previous | tail -50
# Common causes: missing env var, wrong DB connection string,
# missing config file, permission denied

Problem: Pod Stuck in Pending

The pod is created but never gets scheduled to a node.

# Step 1: Check events
kubectl describe pod my-pod | grep -A 10 "Events"
# Common messages:
#   "0/3 nodes are available: 3 Insufficient cpu"
#   "0/3 nodes are available: 3 Insufficient memory"
#   "no nodes match pod topology spread constraints"
#   "0/3 nodes are available: 3 node(s) had taint"

# Step 2: Check node capacity
kubectl describe nodes | grep -A 5 "Allocated resources"

# Step 3: Check if requests are too high
kubectl get pod my-pod -o yaml | grep -A 4 "resources"

# Fix for insufficient resources:
# - Reduce resource requests
# - Add more nodes (cluster autoscaler)
# - Evict low-priority pods

# Fix for taints:
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
# Add toleration to pod spec or remove taint from node

Problem: Service Not Routing Traffic

# Step 1: Verify the service exists and has endpoints
kubectl get svc my-service -n my-namespace
kubectl get endpoints my-service -n my-namespace
# If ENDPOINTS is empty (<none>), no pods match the selector!

# Step 2: Check service selector matches pod labels
kubectl get svc my-service -o yaml | grep -A 5 "selector"
#   selector:
#     app: my-app          ← Service looks for this label

kubectl get pods --show-labels | grep my-app
#   my-pod   1/1   Running   app=myapp    ← Notice: "myapp" not "my-app"!
# Label mismatch! Fix the selector or the pod labels.

# Step 3: Test connectivity from inside the cluster
kubectl run debug --rm -it --image=busybox -- /bin/sh
# Inside the debug pod:
wget -qO- http://my-service.my-namespace.svc.cluster.local:8080/health
nslookup my-service.my-namespace.svc.cluster.local

# Step 4: Check if the pod is listening on the right port
kubectl exec my-pod -- netstat -tlnp
# Verify the application listens on the port the service targets

Problem: Ingress Not Working

# Step 1: Check ingress resource
kubectl get ingress -n my-namespace
kubectl describe ingress my-ingress -n my-namespace

# Step 2: Check ingress controller logs
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --tail=50

# Step 3: Check if the backend service is healthy
kubectl get endpoints my-service
# Must have at least one endpoint IP

# Step 4: Check TLS certificate
kubectl describe ingress my-ingress | grep -A 3 "TLS"
kubectl get secret my-tls-secret -o yaml

# Step 5: Test from outside
curl -v -H "Host: my-app.example.com" http://INGRESS_IP/
# The -v flag shows headers, redirects, and SSL handshake details

Problem: Node Issues

# Check node status and conditions
kubectl get nodes
kubectl describe node worker-2 | grep -A 10 "Conditions"
#   MemoryPressure   True    ← Node is running low on memory
#   DiskPressure     True    ← Disk space critical
#   PIDPressure      True    ← Too many processes
#   Ready            False   ← Node is NOT accepting pods

# Check resource usage across all nodes
kubectl top nodes

# Find pods consuming the most resources on a node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=worker-2
kubectl top pods --all-namespaces --sort-by=memory | head -20

# Drain a problematic node (move all pods to other nodes)
kubectl drain worker-2 --ignore-daemonsets --delete-emptydir-data

# Cordon a node (prevent new pods, keep existing)
kubectl cordon worker-2

Problem: ConfigMap/Secret Not Loading

# Verify the ConfigMap exists
kubectl get configmap my-config -n my-namespace -o yaml

# Check if the pod mounts it correctly
kubectl describe pod my-pod | grep -A 10 "Mounts"
kubectl describe pod my-pod | grep -A 10 "Volumes"

# Common issue: pod was created BEFORE the ConfigMap
# ConfigMaps are loaded at pod start, not dynamically
# Fix: restart the deployment
kubectl rollout restart deployment my-app

# Check environment variables inside the pod
kubectl exec my-pod -- env | grep MY_VAR

# For mounted files:
kubectl exec my-pod -- cat /etc/config/my-setting

Problem: Persistent Volume Issues

# Check PV and PVC status
kubectl get pv
kubectl get pvc -n my-namespace

# PVC stuck in Pending:
kubectl describe pvc my-claim -n my-namespace
# Common causes:
#   "no persistent volumes available for this claim"
#   "storageclass not found"
#   "waiting for first consumer to be created"

# Check storage classes
kubectl get storageclass
kubectl describe storageclass standard

# Multi-attach error (ReadWriteOnce volume on multiple nodes):
# Pod cannot start because the PV is attached to another node
# Fix: ensure pods using RWO volumes are on the same node
# Or use ReadWriteMany (RWX) volumes (requires NFS or similar)

Advanced Debugging Tools

Ephemeral Debug Containers

# Attach a debug container to a running pod (without restarting it)
kubectl debug my-pod -it --image=busybox --target=my-container

# Debug a node directly
kubectl debug node/worker-2 -it --image=ubuntu

# Create a copy of a crashing pod with a different command
kubectl debug my-pod -it --copy-to=debug-pod --container=app -- /bin/sh

Network Debugging

# DNS resolution
kubectl run dns-test --rm -it --image=busybox -- nslookup kubernetes.default

# Check network policies blocking traffic
kubectl get networkpolicies -n my-namespace
kubectl describe networkpolicy my-policy -n my-namespace

# Port-forward for local testing
kubectl port-forward svc/my-service 8080:80 -n my-namespace
# Now access http://localhost:8080

Debugging Cheat Sheet

Symptom	First Command	Likely Cause
CrashLoopBackOff	kubectl logs --previous	App error, OOM, missing config
Pending pod	kubectl describe pod	Insufficient resources, taints
Service no endpoints	kubectl get endpoints	Label selector mismatch
ImagePullBackOff	kubectl describe pod	Wrong image name, missing pull secret
Node NotReady	kubectl describe node	Kubelet down, resource pressure
Permission denied	kubectl auth can-i	RBAC misconfiguration
DNS not resolving	kubectl logs -n kube-system coredns	CoreDNS crash, network policy
OOMKilled (137)	kubectl top pod	Memory limit too low

Key Takeaways

describe, logs, events - these three commands solve 80% of Kubernetes problems
Always check logs from the previous container with --previous for CrashLoopBackOff
Exit code 137 means OOMKilled - increase memory limits, not requests
Empty endpoints means label mismatch - the most common service routing issue
Use ephemeral debug containers to debug pods without restarting them
kubectl top shows real-time resource usage - compare against requests and limits
Port-forward is your friend for testing services locally without ingress

Kubernetes debugging is pattern recognition. Once you have seen CrashLoopBackOff with exit code 137 a few times, you instantly know it is an OOM kill. Build your mental library of symptoms-to-causes, and every production incident becomes a 5-minute fix instead of a 2-hour investigation.

Kubernetes Debugging Toolkit: From kubectl to Victory

Essential Debugging Commands

Problem: Pod Stuck in CrashLoopBackOff

Problem: Pod Stuck in Pending

Problem: Service Not Routing Traffic

Problem: Ingress Not Working

Problem: Node Issues

Problem: ConfigMap/Secret Not Loading

Problem: Persistent Volume Issues

Advanced Debugging Tools

Ephemeral Debug Containers

Network Debugging

Debugging Cheat Sheet

Key Takeaways

Stuck on implementation?

Related Production Resources

Free learning tracks

Interactive engineering labs

Production cheatsheets

Key terms

Discussion

Discussion is unavailable

Essential Debugging Commands

Problem: Pod Stuck in CrashLoopBackOff

Problem: Pod Stuck in Pending

Problem: Service Not Routing Traffic

Problem: Ingress Not Working

Problem: Node Issues

Problem: ConfigMap/Secret Not Loading

Problem: Persistent Volume Issues

Advanced Debugging Tools

Ephemeral Debug Containers

Network Debugging

Debugging Cheat Sheet

Key Takeaways

Stuck on implementation?

Related Production Resources

Free learning tracks

Interactive engineering labs

Production cheatsheets

Key terms

Discussion

Discussion is unavailable

Continue Reading

OIDC Workload Federation: Build Secretless Service Access

Modern Data Platforms Compared: Snowflake, Databricks, BigQuery, and e6data