Debugging Kubernetes Networking with kubectl and CNI Plugins
Overview and What You Will Learn
Kubernetes networking failures are among the hardest production issues to diagnose: a pod can be running perfectly but completely unable to reach another service due to a CNI misconfiguration, a missing NetworkPolicy rule, a DNS resolution failure, or a kube-proxy iptables issue. This lab walks you through a systematic debugging methodology for every layer of Kubernetes networking using real kubectl commands, the netshoot debug container, and CNI-specific diagnostic tools.
By the end of this guide you will be able to:
- Diagnose pod-to-pod, pod-to-service, and pod-to-external connectivity failures systematically
- Debug DNS resolution failures using CoreDNS logs and nslookup inside pods
- Identify CNI plugin misconfiguration causing pods to stay in ContainerCreating state
- Trace kube-proxy iptables rules to verify Service routing is correctly programmed
- Write and debug Kubernetes NetworkPolicies that restrict inter-pod traffic
Why This Matters in Production
At Zerodha, a silent network partition between the order matching engine and the risk management service caused trades to execute without risk checks for 4 minutes during a CNI upgrade. Application logs did not reveal it, because the service call was timing out silently and falling back to a cached value. The incident was only discovered through a networking-level trace.
Kubernetes networking has five distinct layers that can fail independently: the pod network interface, the CNI plugin, kube-proxy service rules, CoreDNS, and NetworkPolicies. Engineers who cannot debug each layer independently will waste hours on production incidents that a systematic approach resolves in minutes.
Core Principles
The five networking layers in Kubernetes and their failure modes:

Layer 1 - Pod Network Interface (veth pair + CNI assignment)
  Failure: Pod stuck in ContainerCreating, no IP assigned
  Tool: kubectl describe pod, ip addr inside pod

Layer 2 - CNI Plugin (Calico / Cilium / Flannel overlay)
  Failure: Pod has IP but cannot reach pods on OTHER nodes
  Tool: CNI logs, calicoctl, cilium status

Layer 3 - kube-proxy (iptables / IPVS Service rules)
  Failure: Pod can reach pod IPs directly but Service IP fails
  Tool: iptables -t nat -L -n, kubectl get endpoints

Layer 4 - CoreDNS (cluster DNS resolution)
  Failure: Service IPs work but DNS names fail
  Tool: kubectl logs coredns, nslookup inside pod

Layer 5 - NetworkPolicy (ingress/egress firewall rules)
  Failure: Connection refused or timeout with no obvious cause
  Tool: kubectl get networkpolicies, policy trace tools
Always debug from Layer 1 upward. Never assume the problem is DNS when the pod might not have a network interface at all.
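The bottom-up rule can be sketched as a small shell helper. This is only an illustration, not part of any Kubernetes tooling: the function name first_failing_layer and its five ok/fail arguments are assumptions, and you would fill them in by hand from the checks in the steps of this lab.

```shell
#!/bin/sh
# Hypothetical helper: report the LOWEST layer whose check failed.
# Arguments are "ok" or "fail", one per layer, in layer order:
#   $1 pod has an IP, $2 cross-node pod-to-pod works, $3 Service ClusterIP works,
#   $4 DNS name resolves, $5 traffic is allowed by NetworkPolicy.
first_failing_layer() {
  layer=1
  for name in "Pod Network Interface" "CNI Plugin" "kube-proxy" "CoreDNS" "NetworkPolicy"; do
    # $1 holds the result for the current layer; a missing argument counts as a failure.
    if [ "${1:-fail}" != "ok" ]; then
      echo "Layer $layer: $name"
      return 1
    fi
    shift
    layer=$((layer + 1))
  done
  echo "all layers ok"
}

# Example: both the Service IP and DNS look broken, but kube-proxy is the lower
# layer, so it is the one to investigate first.
first_failing_layer ok ok fail fail ok || true
```

The example call prints "Layer 3: kube-proxy", because a DNS symptom is not trustworthy until the layers beneath it are known to work.
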
Detailed Step-by-Step Practical Lab
Step 1 - Deploy the netshoot Debug Container
netshoot is a widely used Swiss Army knife for Kubernetes network debugging. It bundles curl, nslookup, dig, tcpdump, netstat, traceroute, and dozens of other tools:

    # Run netshoot as a temporary debug pod in the same namespace as the failing service
    kubectl run netshoot \
      --image=nicolaka/netshoot \
      -it --rm \
      -n production \
      -- bash

    # Attach netshoot to a running pod as an ephemeral container.
    # It shares that pod's network namespace (same interfaces and routes).
    # Note: --target takes the name of a CONTAINER inside the pod, not the pod name.
    kubectl debug -it \
      --image=nicolaka/netshoot \
      --target=<container-name> \
      order-service-7d9f8b-xkp2q \
      -n production \
      -- bash

Remember: kubectl debug adds an ephemeral container to the existing pod, so it shares the pod's network namespace: it sees the same IP, the same routes, and the same DNS config as the application. The --target=<container-name> flag additionally shares that container's process namespace. This is the most accurate way to reproduce networking issues exactly as the application experiences them.

Step 2 - Debug Layer 1: Verify Pod Has a Network Interface and IP

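For this step, the two signals that matter are an empty pod IP and a sandbox-creation event. A minimal sketch of filtering for both follows; the function names are hypothetical, and the kubectl invocations in the comments assume the pod and namespace names used in this lab.

```shell
#!/bin/sh
# Hypothetical helper: succeed only when the given string looks like an assigned pod IP.
# Until the CNI assigns an address, .status.podIP is empty (shown as "<none>" in -o wide).
has_pod_ip() {
  case "$1" in
    ""|"<none>") return 1 ;;
    *) return 0 ;;
  esac
}

# Hypothetical helper: keep only sandbox/CNI failure lines from
# `kubectl describe pod` output supplied on stdin.
sandbox_errors() {
  grep -i -E "failed to (create pod sandbox|set up sandbox container network)"
}

# Intended use against a live cluster (not run here):
#   ip="$(kubectl get pod order-service-7d9f8b-xkp2q -n production -o jsonpath='{.status.podIP}')"
#   has_pod_ip "$ip" || kubectl describe pod order-service-7d9f8b-xkp2q -n production | sandbox_errors
```

If has_pod_ip succeeds, Layer 1 is probably fine and the interface and route checks in this step are a quick confirmation.
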

    # Check if the pod has an IP assigned
    kubectl get pod order-service-7d9f8b-xkp2q -n production -o wide
    # NAME                         READY   STATUS              IP       NODE
    # order-service-7d9f8b-xkp2q   0/1     ContainerCreating   <none>   mumbai-worker-1

    # If IP is <none>, the CNI has not assigned an address. Describe the pod for the reason:
    kubectl describe pod order-service-7d9f8b-xkp2q -n production
    # Look for this in Events:
    #   Failed to create pod sandbox: rpc error: code = Unknown
    #   desc = failed to set up sandbox container network: plugin type="calico"
    #   failed (add): error getting ClusterInformation

    # Inside the pod (if it started): verify the network interface exists
    ip addr show eth0
    # 2: eth0@if45: <BROADCAST,MULTICAST,UP,LOWER_UP>
    #     inet 10.244.2.15/24 brd 10.244.2.255 scope global eth0

    # Check the routing table
    ip route show
    # default via 169.254.1.1 dev eth0
    # 10.244.0.0/16 via 10.244.2.1 dev eth0   <- cluster pod CIDR route

Step 3 - Debug Layer 2: Test Pod-to-Pod Connectivity Across Nodes

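A cross-node test only proves something if the two pods really sit on different nodes. The sketch below picks one pod IP per node from kubectl output. The function name and the three-column "NAME IP NODE" input format are assumptions made for this example; the format is produced with custom-columns rather than parsed from -o wide.

```shell
#!/bin/sh
# Hypothetical helper: read "NAME IP NODE" lines on stdin and print one
# "NODE IP" pair per node, keeping the first pod seen on each node.
# Pods that have no IP yet (printed as <none>) are skipped.
one_pod_per_node() {
  awk '$2 != "<none>" && !seen[$3]++ { print $3, $2 }'
}

# Intended use against a live cluster (not run here):
#   kubectl get pods -n production --no-headers \
#     -o custom-columns=NAME:.metadata.name,IP:.status.podIP,NODE:.spec.nodeName \
#     | one_pod_per_node
# Then, from inside netshoot, ping each printed IP that belongs to a node other than your own.
```
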

    # Get IPs of pods on DIFFERENT nodes
    kubectl get pods -n production -o wide | grep order-service
    # order-service-7d9f8b-xkp2q   10.244.2.15   mumbai-worker-1
    # order-service-7d9f8b-mn3lp   10.244.3.22   mumbai-worker-2   <- different node

    # From inside netshoot: ping the pod on the other node directly by IP
    ping 10.244.3.22
    # PING 10.244.3.22: 56 data bytes
    # 64 bytes from 10.244.3.22: icmp_seq=0 ttl=62 time=0.8ms   <- working
    # Request timeout for icmp_seq 0                            <- CNI overlay broken

    # If ping fails: check the CNI pods on both nodes
    kubectl get pods -n kube-system -o wide | grep -E "calico|cilium|flannel"

    # Check CNI logs on the node where the failing pod lives
    kubectl logs -n kube-system calico-node-xxxxx | tail -50

    # For Calico: check node BGP peering status
    # (assumes calicoctl is available inside the calico-node container; otherwise run
    # "sudo calicoctl node status" on the node itself)
    kubectl exec -n kube-system calico-node-xxxxx -- calicoctl node status

Step 4 - Debug Layer 3: Verify kube-proxy Service Rules

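A frequent root cause at this layer is a Service selector that does not match the pod labels, which leaves the Service without endpoints. The comparison can be sketched in plain shell. The function name and the comma-separated key=value input format are assumptions for illustration; the format matches the LABELS column printed by kubectl get pods --show-labels.

```shell
#!/bin/sh
# Hypothetical helper: succeed when every key=value pair of the selector ($1)
# is present in the pod's labels ($2). Both are comma-separated "key=value" lists.
selector_matches() {
  selector="$1"
  labels=",$2,"                 # wrap in commas so each label can be matched as ",k=v,"
  old_ifs="$IFS"
  IFS=","
  for pair in $selector; do
    case "$labels" in
      *",$pair,"*) ;;           # this selector term is satisfied
      *) IFS="$old_ifs"; echo "missing label: $pair"; return 1 ;;
    esac
  done
  IFS="$old_ifs"
  return 0
}

# Example mirroring the typo in this step: the Service selects app=order-service,
# but the pod carries app=orders-service.
selector_matches "app=order-service" "app=orders-service,pod-template-hash=7d9f8b" || true
```

The example prints "missing label: app=order-service", which is the same mismatch that kubectl get endpoints reports as <none>.
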

    # Test direct pod IP (bypasses kube-proxy completely)
    # Inside netshoot:
    curl http://10.244.2.15:8080/health
    # 200 OK   <- pod is healthy

    # Test Service ClusterIP (goes through kube-proxy iptables)
    kubectl get service order-service -n production
    # NAME            TYPE        CLUSTER-IP     PORT(S)
    # order-service   ClusterIP   10.96.45.200   8080/TCP

    curl http://10.96.45.200:8080/health
    # Connection refused   <- Service path is broken (no endpoints, or kube-proxy rules missing)

    # Verify the Service has endpoints (pods are registered as backends)
    kubectl get endpoints order-service -n production
    # NAME            ENDPOINTS                           AGE
    # order-service   10.244.2.15:8080,10.244.3.22:8080   5d   <- healthy
    # order-service   <none>                              5d   <- no pods matching selector

    # If endpoints show <none>, the Service selector doesn't match the pod labels
    kubectl get service order-service -n production -o jsonpath='{.spec.selector}'
    # {"app":"order-service"}

    kubectl get pods -n production --show-labels | grep order-service
    # order-service-7d9f8b-xkp2q   app=orders-service   <- typo! "orders" not "order"

    # SSH onto the node and verify iptables rules were programmed correctly
    ssh rahul@mumbai-worker-1

    # Check if kube-proxy created rules for the Service ClusterIP
    sudo iptables -t nat -L KUBE-SERVICES -n | grep 10.96.45.200
    # KUBE-SVC-XYZ  tcp  --  0.0.0.0/0  10.96.45.200  tcp dpt:8080

    # Check kube-proxy is running correctly
    kubectl get pods -n kube-system | grep kube-proxy
    kubectl logs -n kube-system kube-proxy-xxxxx | grep -i error

Tip: If the Service has correct endpoints but the ClusterIP still fails, the kube-proxy iptables rules may be stale. Deleting the kube-proxy pod on the affected node makes the DaemonSet recreate it, which forces a full iptables resync: kubectl delete pod kube-proxy-xxxxx -n kube-system.

Step 5 - Debug Layer 4: CoreDNS and DNS Resolution Failures

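Whether a short name resolves depends on the search list and the ndots value in the pod's /etc/resolv.conf. The sketch below lists the names a resolver would try, in order, following the usual glibc-style rule: a name with fewer dots than ndots goes through the search list before being tried as-is. The function name is hypothetical, and other resolver implementations (musl, for example) differ in some details.

```shell
#!/bin/sh
# Hypothetical helper: print the candidate names a glibc-style resolver tries, in order.
#   $1 = name to resolve, $2 = ndots value, remaining arguments = search domains.
dns_candidates() {
  name="$1"; ndots="$2"; shift 2
  case "$name" in
    *.) echo "$name"; return 0 ;;   # trailing dot: already absolute, search list is skipped
  esac
  # Count the dots in the name by deleting every other character.
  dots="$(printf '%s' "$name" | tr -cd '.' | wc -c | tr -d ' ')"
  [ "$dots" -ge "$ndots" ] && echo "$name."
  for domain in "$@"; do
    echo "$name.$domain."
  done
  [ "$dots" -lt "$ndots" ] && echo "$name."
  return 0
}

# Example with the resolv.conf values used in this step (ndots:5).
dns_candidates order-service 5 production.svc.cluster.local svc.cluster.local cluster.local
```

The first candidate printed is order-service.production.svc.cluster.local., so a short name only resolves when the querying pod's namespace matches the Service's namespace. The same rule explains why an external name such as google.com is first tried against every search domain.
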

    # Inside netshoot: test DNS resolution step by step
    # Test 1: Can we resolve the short service name?
    nslookup order-service
    # Server:   10.96.0.10   <- CoreDNS ClusterIP
    # Address:  10.96.0.10#53
    # ** server can't find order-service: NXDOMAIN   <- DNS failure

    # Test 2: Try the fully qualified name
    nslookup order-service.production.svc.cluster.local
    # 10.96.45.200   <- works with FQDN but not short name

    # Test 3: Check the pod's DNS search domains
    cat /etc/resolv.conf
    # nameserver 10.96.0.10
    # search production.svc.cluster.local svc.cluster.local cluster.local
    # options ndots:5

    # Test 4: Test external DNS (is it just internal names or everything?)
    nslookup google.com
    # ** server can't find google.com: SERVFAIL   <- CoreDNS cannot reach upstream

    # Check CoreDNS pod status
    kubectl get pods -n kube-system | grep coredns

    # Check CoreDNS logs for errors
    kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100

    # Common CoreDNS error signatures:
    # [ERROR] plugin/errors: 2 google.com. A: read udp   <- upstream timeout
    # [ERROR] plugin/errors: 2 SERVFAIL                  <- CoreDNS cannot reach upstream DNS

    # Check the CoreDNS ConfigMap for upstream DNS configuration
    kubectl get configmap coredns -n kube-system -o yaml

    # Fix: update the CoreDNS ConfigMap to use reliable upstream resolvers
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: coredns
      namespace: kube-system
    data:
      Corefile: |
        .:53 {
            errors
            health {
                lameduck 5s
            }
            ready
            kubernetes cluster.local in-addr.arpa ip6.arpa {
                pods insecure
                fallthrough in-addr.arpa ip6.arpa
                ttl 30
            }
            prometheus :9153
            # Use Google DNS as upstream
            forward . 8.8.8.8 8.8.4.4 {
                max_concurrent 1000
            }
            cache 30
            loop
            reload
            loadbalance
        }

    kubectl apply -f coredns-configmap.yaml

    # Restart CoreDNS to pick up the new config immediately
    # (the reload plugin would also pick it up on its own after a short delay)
    kubectl rollout restart deployment/coredns -n kube-system

Step 6 - Debug Layer 5: NetworkPolicy Blocking Traffic


    # Check if any NetworkPolicies exist in the namespace
    kubectl get networkpolicies -n production

    # If NetworkPolicies exist, describe them to see what they allow/deny
    kubectl describe networkpolicy payments-isolation -n production

    # Common symptom: curl to a pod IP works, curl to the same pod from
    # a different namespace times out: a NetworkPolicy is blocking cross-namespace traffic

    # networkpolicy-debug.yaml: allow order-service to call payments-service
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-order-to-payments
      namespace: payments-production
    spec:
      podSelector:
        matchLabels:
          app: payments-service        # This policy applies TO payments pods
      policyTypes:
        - Ingress
      ingress:
        - from:
            - namespaceSelector:
                matchLabels:
                  kubernetes.io/metadata.name: production   # Namespace where order-service runs in this lab
              podSelector:
                matchLabels:
                  app: order-service   # Only from order-service pods specifically
          ports:
            - protocol: TCP
              port: 4000

    kubectl apply -f networkpolicy-debug.yaml

    # For Cilium clusters: use the Hubble CLI for a real-time view of dropped flows
    # (needs access to Hubble Relay, for example via "cilium hubble port-forward")
    hubble observe \
      --namespace production \
      --type drop \
      --last 100
    # Shows recent dropped flows together with the drop reason (for example, policy denied)

    # For Calico clusters: inspect the policies Calico has programmed in the destination namespace
    calicoctl get networkpolicy -n payments-production -o wide

Security: A NetworkPolicy isolates only the pods its podSelector matches, and only in the directions listed in policyTypes. Once a pod is selected, all traffic in that direction that is not explicitly allowed is denied. Adding your first NetworkPolicy can therefore instantly break existing connectivity to the selected pods (or to every pod in the namespace, if the podSelector is empty) unless you also add allow rules for every legitimate traffic flow.
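Isolation applies to the pods a policy selects, so the broadest possible policy is one with an empty podSelector. As a sketch, the helper below prints a default-deny ingress policy for a namespace. The function name render_default_deny and the policy name default-deny-ingress are hypothetical; the manifest itself uses only standard networking.k8s.io/v1 fields.

```shell
#!/bin/sh
# Hypothetical helper: print a default-deny ingress NetworkPolicy for the namespace in $1.
render_default_deny() {
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: $1
spec:
  podSelector: {}      # empty selector: applies to every pod in the namespace
  policyTypes:
    - Ingress          # no ingress rules are listed, so all inbound traffic is denied
EOF
}

# Print the manifest for review; applying it is left to the reader.
render_default_deny payments-production
# To validate it against a cluster without changing anything (not run here):
#   render_default_deny payments-production | kubectl apply --dry-run=server -f -
```

With a policy like this in place, every legitimate flow into the namespace needs its own allow rule, such as the allow-order-to-payments policy in this step.
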
Step 7 - Run a Full Connectivity Matrix Test
For a comprehensive network health check across all services:

    # Deploy a test pod that checks connectivity to every service in the namespace.
    # Checks are separated by newlines rather than &&, so one failing check does not hide the rest.
    kubectl run connectivity-test \
      --image=nicolaka/netshoot \
      -n production \
      --restart=Never \
      -- sh -c "
        echo '=== Testing order-service ==='
        curl -s -o /dev/null --max-time 5 -w '%{http_code}\n' http://order-service:8080/health
        echo '=== Testing payments-service ==='
        curl -s -o /dev/null --max-time 5 -w '%{http_code}\n' http://payments-service.payments-production.svc.cluster.local:4000/health
        echo '=== Testing external DNS ==='
        nslookup google.com
        echo '=== All tests complete ==='
      "

    # View results (a status of 000 means curl could not connect at all)
    kubectl logs connectivity-test -n production

    # Clean up
    kubectl delete pod connectivity-test -n production

Production Best Practices & Common Pitfalls
- Always test cross-namespace connectivity after adding any NetworkPolicy. A policy in the target namespace affects all inbound traffic to the pods it selects, including traffic from other namespaces that previously worked without any policy.
- Use Cilium with Hubble in production. The real-time policy trace and drop visibility are worth the migration cost. Debugging NetworkPolicies without Hubble is guesswork.
- Select namespaces by the kubernetes.io/metadata.name label. Kubernetes 1.21+ applies this label automatically; on older clusters, label your namespaces explicitly, because a namespaceSelector can only match labels, not namespace names.
- Monitor CoreDNS with Prometheus and alert on coredns_dns_response_rcode_count_total{rcode="SERVFAIL"} (exposed as coredns_dns_responses_total in CoreDNS 1.7.0 and later). A spike indicates upstream DNS failure that will cause cascading service discovery failures across the entire cluster.
- Never run tcpdump directly on a node in production without approval. Packet capture on a financial services cluster is a compliance event that must be logged and justified.
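Before wiring up an alert, the SERVFAIL signal can be checked by hand against the CoreDNS metrics endpoint. The rough sketch below sums the response counter from Prometheus text-format output. The function name is hypothetical, and it assumes the counter carries an rcode label under either the older or the newer metric name.

```shell
#!/bin/sh
# Hypothetical helper: read Prometheus text-format metrics on stdin and print
# "<servfail> <total>", summed over all label sets of the CoreDNS response counter.
# Matches both coredns_dns_response_rcode_count_total and coredns_dns_responses_total.
servfail_counts() {
  awk '
    /^coredns_dns_response(s|_rcode_count)_total[{]/ {
      total += $NF                                    # the sample value is the last field
      if ($0 ~ /rcode="SERVFAIL"/) servfail += $NF
    }
    END { printf "%d %d\n", servfail, total }
  '
}

# Intended use against a live cluster (not run here), relying on the
# "prometheus :9153" line in the Corefile and a local port-forward:
#   kubectl port-forward -n kube-system deployment/coredns 9153:9153 &
#   curl -s http://localhost:9153/metrics | servfail_counts
```

These counters are cumulative since the CoreDNS process started, so a real alert should be based on the rate of change in Prometheus rather than on the raw totals.
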
Common Mistake: Relying on pod logs to diagnose networking issues. Application logs say "connection refused" or "timeout", but they cannot tell you whether the failure is at Layer 2 (CNI), Layer 3 (kube-proxy), Layer 4 (DNS), or Layer 5 (NetworkPolicy). Always use network-level tools like netshoot, not application logs, for network debugging.
Quick Reference & Troubleshooting Commands
| Command | Purpose |
|---|---|
| `kubectl run netshoot --image=nicolaka/netshoot -it --rm -n <ns> -- bash` | Launch network debug container |
| `kubectl debug -it --image=nicolaka/netshoot --target=<container> <pod> -n <ns> -- bash` | Debug sharing pod's network namespace |
| `kubectl get endpoints <service> -n <ns>` | Verify pods are registered as Service backends |
| `kubectl get networkpolicies -n <ns>` | List all NetworkPolicies in a namespace |
| `kubectl get pods -n kube-system -l k8s-app=kube-dns` | Check CoreDNS pod health |
| `kubectl logs -n kube-system -l k8s-app=kube-dns` | CoreDNS logs for DNS failure diagnosis |
| `nslookup <service>.<ns>.svc.cluster.local` | Test DNS resolution from inside a pod |
| `curl -v http://<clusterip>:<port>/path` | Test Service IP directly, bypassing DNS |
| `kubectl get pods -n <ns> -o wide` | Show pod IPs and which node they are on |
| `kubectl logs -n kube-system kube-proxy-<id>` | kube-proxy logs for Service routing issues |