Debugging Kubernetes Networking with kubectl and CNI Plugins

Overview and What You Will Learn

Kubernetes networking failures are among the hardest production issues to diagnose — a pod can be running perfectly but completely unable to reach another service due to a CNI misconfiguration, a missing NetworkPolicy rule, a DNS resolution failure, or a kube-proxy iptables issue. This lab walks you through a systematic debugging methodology for every layer of Kubernetes networking using real kubectl commands, the netshoot debug container, and CNI-specific diagnostic tools.

By the end of this guide you will be able to:

Diagnose pod-to-pod, pod-to-service, and pod-to-external connectivity failures systematically
Debug DNS resolution failures using CoreDNS logs and nslookup inside pods
Identify CNI plugin misconfiguration causing pods to stay in ContainerCreating state
Trace kube-proxy iptables rules to verify Service routing is correctly programmed
Write and debug Kubernetes NetworkPolicies that restrict inter-pod traffic

Why This Matters in Production

At Zerodha, a silent network partition between the order matching engine and the risk management service caused trades to execute without risk checks for 4 minutes during a CNI upgrade — not detected by application logs because the service call was timing out silently and falling back to a cached value. The incident was only discovered through a networking-level trace.

Kubernetes networking has five distinct layers that can fail independently — the pod network interface, CNI plugin, kube-proxy service rules, CoreDNS, and NetworkPolicies. Engineers who cannot debug each layer independently will waste hours on production incidents that a systematic approach resolves in minutes.

Core Principles

The five networking layers in Kubernetes and their failure modes:

Layer 1: Pod Network Interface (veth pair + CNI assignment)
  Failure: Pod stuck in ContainerCreating, no IP assigned
  Tool: kubectl describe pod, ip addr inside pod

Layer 2: CNI Plugin (Calico / Cilium / Flannel overlay)
  Failure: Pod has IP but cannot reach pods on OTHER nodes
  Tool: CNI logs, calicoctl, cilium status

Layer 3: kube-proxy (iptables / IPVS Service rules)
  Failure: Pod can reach pod IPs directly but Service IP fails
  Tool: iptables -t nat -L -n, kubectl get endpoints

Layer 4: CoreDNS (cluster DNS resolution)
  Failure: Service IPs work but DNS names fail
  Tool: CoreDNS logs (kubectl logs), nslookup inside pod

Layer 5: NetworkPolicy (ingress/egress firewall rules)
  Failure: Connection refused or timeout with no obvious cause
  Tool: kubectl get networkpolicies, policy trace tools

Always debug from Layer 1 upward — never assume the problem is DNS when the pod might not have a network interface at all.
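
The bottom-up rule can be sketched as a small shell helper. This is only an illustration: first_failing_layer is a hypothetical function name, and it assumes you have already run one check per layer and recorded each result as "ok" or "fail", in order from Layer 1 to Layer 5.

```shell
# Hypothetical helper: given one ok/fail result per layer (Layer 1 first),
# report the LOWEST failing layer, because that is where debugging should start.
first_failing_layer() {
  layer=1
  for result in "$@"; do
    if [ "$result" != "ok" ]; then
      case "$layer" in
        1) name="pod network interface" ;;
        2) name="CNI plugin" ;;
        3) name="kube-proxy Service rules" ;;
        4) name="CoreDNS" ;;
        5) name="NetworkPolicy" ;;
        *) name="unknown" ;;
      esac
      echo "Layer $layer: $name"
      return 1
    fi
    layer=$((layer + 1))
  done
  echo "all layers ok"
}

# Example: Service IP and DNS both fail, but pod IPs and cross-node traffic work.
# first_failing_layer ok ok fail fail ok
```

In that example the helper points at Layer 3 even though DNS is also failing, since a broken Service path would explain a DNS failure too (CoreDNS itself is reached through a ClusterIP).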

Detailed Step-by-Step Practical Lab

Step 1 — Deploy the netshoot Debug Container

netshoot is the standard Swiss Army knife for Kubernetes network debugging — it contains curl, nslookup, dig, tcpdump, netstat, traceroute, and dozens of other tools:

Bash

# Run netshoot as a temporary debug pod in the same namespace as the failing service
kubectl run netshoot \
  --image=nicolaka/netshoot \
  -it --rm \
  -n production \
  -- bash

# Attach netshoot to a running pod as an ephemeral container
# (sees the exact same network interfaces and routes as that pod)
# Note: --target takes a CONTAINER name inside the pod, not the pod name;
# "order-service" is assumed to be the application container's name here
kubectl debug -it \
  --image=nicolaka/netshoot \
  --target=order-service \
  order-service-7d9f8b-xkp2q \
  -n production \
  -- bash

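📌 Remember: kubectl debug adds an ephemeral container to the existing pod, so it shares the pod's network namespace: it sees the same IP, the same routes, and the same DNS config as the application pod. The --target flag names a container inside that pod (not the pod itself) and additionally shares that container's process namespace. This is the most accurate way to reproduce networking issues exactly as the application experiences them.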

Step 2 — Debug Layer 1: Verify Pod Has a Network Interface and IP

Bash

# Check if pod has an IP assigned
kubectl get pod order-service-7d9f8b-xkp2q -n production -o wide
# NAME                         READY   STATUS              IP       NODE
# order-service-7d9f8b-xkp2q   0/1     ContainerCreating   <none>   mumbai-worker-1

# If IP is <none>, CNI has not assigned an address. Describe for the reason:
kubectl describe pod order-service-7d9f8b-xkp2q -n production
# Look for in Events:
# Failed to create pod sandbox: rpc error: code = Unknown
# desc = failed to set up sandbox container network: plugin type="calico"
# failed (add): error getting ClusterInformation

# Inside the pod (if it started), verify network interface exists
ip addr show eth0
# 2: eth0@if45: <BROADCAST,MULTICAST,UP,LOWER_UP>
#     inet 10.244.2.15/24 brd 10.244.2.255 scope global eth0

# Check routing table
ip route show
# default via 169.254.1.1 dev eth0
# 10.244.0.0/16 via 10.244.2.1 dev eth0   ← cluster pod CIDR route
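
To run the Layer 1 check across a whole namespace rather than one pod at a time, a small filter can help. The sketch below assumes input lines of the form "NAME PHASE IP"; pods_without_ip is a hypothetical function name, not a kubectl feature.

```shell
# Reads "NAME PHASE IP" lines on stdin and prints every pod that has no IP yet.
# kubectl's custom-columns output prints <none> when a field is missing.
pods_without_ip() {
  awk '$3 == "<none>" { print $1 " (" $2 ")" }'
}

# Usage against a live cluster:
# kubectl get pods -n production --no-headers \
#   -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,IP:.status.podIP \
#   | pods_without_ip
```

A pod stuck in ContainerCreating reports the phase Pending, so any pod this filter prints is a candidate for the kubectl describe check above.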

Step 3 — Debug Layer 2: Test Pod-to-Pod Connectivity Across Nodes

Bash

# Get IPs of pods on DIFFERENT nodes
kubectl get pods -n production -o wide | grep order-service
# order-service-7d9f8b-xkp2q   10.244.2.15   mumbai-worker-1
# order-service-7d9f8b-mn3lp   10.244.3.22   mumbai-worker-2  ← different node

# From inside netshoot: ping the pod on the other node directly by IP
# (-c 3 sends three probes and exits instead of running forever)
ping -c 3 10.244.3.22
# PING 10.244.3.22: 56 data bytes
# 64 bytes from 10.244.3.22: icmp_seq=0 ttl=62 time=0.8ms  ← working
# Request timeout for icmp_seq 0                             ← CNI overlay broken

# If ping fails, check CNI pods on both nodes
kubectl get pods -n kube-system -o wide | grep -E "calico|cilium|flannel"

# Check CNI logs on the node where the failing pod lives
kubectl logs -n kube-system calico-node-xxxxx | tail -50

# For Calico: check node BGP peering status
# (calicoctl is usually not bundled in the calico-node image, so run this
# on the node itself, where calicoctl must be installed)
sudo calicoctl node status
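
Picking test targets by eye gets tedious once a Deployment has many replicas. As a rough sketch, the filter below keeps one pod IP per node, so any two lines of its output form a cross-node pair to ping. It assumes input lines of the form "NAME IP NODE"; one_pod_per_node is a hypothetical function name.

```shell
# Reads "NAME IP NODE" lines on stdin and prints "NODE IP" for the first
# pod seen on each node, skipping pods that have no IP yet.
one_pod_per_node() {
  awk '$2 != "<none>" && !seen[$3]++ { print $3, $2 }'
}

# Usage against a live cluster:
# kubectl get pods -n production --no-headers \
#   -o custom-columns=NAME:.metadata.name,IP:.status.podIP,NODE:.spec.nodeName \
#   | one_pod_per_node
```

If the output has only one line, every replica sits on the same node and a ping between them says nothing about the cross-node overlay.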

Step 4 — Debug Layer 3: Verify kube-proxy Service Rules

Bash

# Test direct pod IP (bypasses kube-proxy completely)
# Inside netshoot:
curl http://10.244.2.15:8080/health
# 200 OK  ← pod is healthy

# Test Service ClusterIP (goes through kube-proxy iptables)
kubectl get service order-service -n production
# NAME            TYPE        CLUSTER-IP      PORT(S)
# order-service   ClusterIP   10.96.45.200    8080/TCP

curl http://10.96.45.200:8080/health
# Connection refused  ← Service routing is failing; in iptables mode this
#                       usually means the Service has no endpoints

# Verify the Service has endpoints (pods are registered as backends)
kubectl get endpoints order-service -n production
# NAME            ENDPOINTS                                   AGE
# order-service   10.244.2.15:8080,10.244.3.22:8080           5d  ← healthy
# order-service   <none>                                      5d  ← no pods matching selector

# If endpoints show <none>, the Service selector doesn't match pod labels
kubectl get service order-service -n production -o jsonpath='{.spec.selector}'
# {"app":"order-service"}

kubectl get pods -n production --show-labels | grep order-service
# order-service-7d9f8b-xkp2q   app=orders-service  ← typo! "orders" not "order"
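
A one-character label typo like this is easy to miss by eye. The sketch below compares a selector against a pod's labels, both written as comma-separated key=value lists (the format --show-labels prints). selector_matches is a hypothetical helper and only models equality-based selectors, which is the only kind a Service selector supports.

```shell
# selector_matches SELECTOR LABELS
# Succeeds only if every key=value pair in SELECTOR also appears in LABELS.
selector_matches() {
  selector=$1
  labels=",$2,"          # pad with commas so each pair can be matched whole
  old_ifs=$IFS
  IFS=,
  for pair in $selector; do
    case "$labels" in
      *",$pair,"*) ;;    # this pair is present, keep checking
      *) IFS=$old_ifs
         echo "missing label: $pair"
         return 1 ;;
    esac
  done
  IFS=$old_ifs
  echo "selector matches"
}

# Example with the values from the output above:
# selector_matches "app=order-service" "app=orders-service,pod-template-hash=7d9f8b"
```

Against a live cluster, the quickest equivalent check is kubectl get pods -n production -l app=order-service: an empty result means no pod carries the labels the Service selects on.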

Bash

# SSH onto the node and verify iptables rules were programmed correctly
ssh rahul@mumbai-worker-1

# Check if kube-proxy created rules for the Service ClusterIP
sudo iptables -t nat -L KUBE-SERVICES -n | grep 10.96.45.200
# KUBE-SVC-XYZ  tcp -- 0.0.0.0/0  10.96.45.200  tcp dpt:8080
# (if kube-proxy runs in IPVS mode, inspect with: sudo ipvsadm -Ln)

# Back on your workstation: check kube-proxy is running correctly
kubectl get pods -n kube-system | grep kube-proxy
kubectl logs -n kube-system kube-proxy-xxxxx | grep -i error

💡 Tip: If the Service has correct endpoints but the ClusterIP still fails, the kube-proxy iptables rules may be stale. Restarting the kube-proxy DaemonSet pod on the affected node forces a full iptables resync: kubectl delete pod kube-proxy-xxxxx -n kube-system.

Step 5 — Debug Layer 4: CoreDNS and DNS Resolution Failures

Bash

# Inside netshoot: test DNS resolution step by step
# Test 1: Can we resolve the short service name?
nslookup order-service
# Server: 10.96.0.10  ← CoreDNS ClusterIP
# Address: 10.96.0.10#53
# ** server can't find order-service: NXDOMAIN  ← DNS failure

# Test 2: Try the fully qualified name
nslookup order-service.production.svc.cluster.local
# 10.96.45.200  ← works with FQDN but not short name

# Test 3: Check the pod's DNS search domains
cat /etc/resolv.conf
# nameserver 10.96.0.10
# search production.svc.cluster.local svc.cluster.local cluster.local
# options ndots:5

# Test 4: Test external DNS (is it just internal or everything?)
nslookup google.com
# ** server can't find google.com: SERVFAIL  ← CoreDNS cannot reach upstream
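
The search and ndots lines in /etc/resolv.conf decide which names the resolver actually sends to CoreDNS. The sketch below is a simplified model of the glibc-style search behaviour, not a real resolver: dns_candidates is a hypothetical helper that prints the candidate names in the order they would be tried.

```shell
# dns_candidates NAME NDOTS SEARCH_DOMAIN...
# Prints the fully qualified names a resolver would try, in order.
dns_candidates() {
  name=$1
  ndots=$2
  shift 2
  case "$name" in
    *.) echo "$name"; return 0 ;;   # trailing dot: absolute name, search list skipped
  esac
  dots_only=$(printf '%s' "$name" | tr -cd '.')   # keep only the dots
  dots=${#dots_only}                              # and count them
  # Enough dots: the name is tried as-is first, then with each search domain.
  if [ "$dots" -ge "$ndots" ]; then echo "$name."; fi
  for domain in "$@"; do
    echo "$name.$domain."
  done
  # Too few dots: the name is tried as-is only after the search list is exhausted.
  if [ "$dots" -lt "$ndots" ]; then echo "$name."; fi
}

# Example with the resolv.conf shown above:
# dns_candidates order-service 5 production.svc.cluster.local svc.cluster.local cluster.local
```

With ndots:5, a name such as google.com (one dot) goes through all three search domains before it is tried as-is, which is why a single external lookup can show up as several queries in the CoreDNS logs.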

Bash

# Check CoreDNS pod status
kubectl get pods -n kube-system | grep coredns

# Check CoreDNS logs for errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100

# Common CoreDNS error signatures:
# [ERROR] plugin/errors: 2 google.com. A: read udp → upstream timeout
# [ERROR] plugin/errors: 2 SERVFAIL → CoreDNS cannot reach upstream DNS

# Check CoreDNS ConfigMap for upstream DNS configuration
kubectl get configmap coredns -n kube-system -o yaml

YAML

# Fix: update CoreDNS ConfigMap to use reliable upstream resolvers
# (the stock Corefile usually forwards to the node's /etc/resolv.conf)
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        prometheus :9153
        # Use Google public DNS as upstream
        forward . 8.8.8.8 8.8.4.4 {
           max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }

Bash

kubectl apply -f coredns-configmap.yaml

# The reload plugin picks up the new Corefile on its own after a short delay;
# restart CoreDNS to apply it immediately
kubectl rollout restart deployment/coredns -n kube-system

Step 6 — Debug Layer 5: NetworkPolicy Blocking Traffic

Bash

# Check if any NetworkPolicies exist in the namespace
kubectl get networkpolicies -n production

# If NetworkPolicies exist, describe them to see what they allow/deny
kubectl describe networkpolicy payments-isolation -n production

# Common symptom: curl to a pod IP works, curl to the same pod from
# a different namespace times out → NetworkPolicy is blocking cross-namespace traffic

YAML

# networkpolicy-debug.yaml: allow order-service to call payments-service
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-order-to-payments
  namespace: payments-production
spec:
  podSelector:
    matchLabels:
      app: payments-service       # This policy applies TO payments pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: production   # Namespace where order-service runs in this lab
          podSelector:
            matchLabels:
              app: order-service  # Only from order-service pods specifically
      ports:
        - protocol: TCP
          port: 4000

Bash

kubectl apply -f networkpolicy-debug.yaml

# For Cilium clusters: use the Hubble CLI to inspect dropped flows
# (requires Hubble to be enabled and reachable, e.g. via `cilium hubble port-forward`)
hubble observe \
  --namespace production \
  --verdict DROPPED \
  --last 100
# Lists recently dropped flows with the drop reason (for example "Policy denied")

# For Calico clusters: list the policies in the destination namespace and
# compare their selectors with the source and destination pod labels
# (run where calicoctl is installed and configured)
calicoctl get networkpolicy -n payments-production -o wide

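⚠️ Security: Once any NetworkPolicy selects a pod, that pod becomes isolated in the direction the policy covers (ingress, egress, or both), and all traffic in that direction that is not explicitly allowed is denied. Pods that no policy selects stay fully open. A first policy with a broad podSelector (an empty selector matches every pod in the namespace) can therefore instantly break existing connectivity if you don't also add allow rules for every legitimate traffic flow.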

Step 7 — Run a Full Connectivity Matrix Test

For a comprehensive network health check across all services:

Bash

# Deploy a test pod that checks connectivity to every service in the namespace
# Checks are separated by ";" (not "&&") so one failure does not skip the rest
kubectl run connectivity-test \
  --image=nicolaka/netshoot \
  -n production \
  --restart=Never \
  -- sh -c "
    echo '=== Testing order-service ===';
    curl -s -o /dev/null --max-time 5 -w '%{http_code}\n' http://order-service:8080/health;
    echo '=== Testing payments-service ===';
    curl -s -o /dev/null --max-time 5 -w '%{http_code}\n' http://payments-service.payments-production.svc.cluster.local:4000/health;
    echo '=== Testing external DNS ===';
    nslookup google.com;
    echo '=== All tests complete ==='
  "

# View results
kubectl logs connectivity-test -n production

# Clean up
kubectl delete pod connectivity-test -n production
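
Once the matrix grows past a handful of services, a pass/fail summary is easier to scan than raw status codes. The sketch below assumes you have reshaped the results into "SERVICE HTTP_CODE" lines (the test pod above does not print that format as written); summarize_checks is a hypothetical helper.

```shell
# Reads "SERVICE HTTP_CODE" lines on stdin, prints PASS or FAIL for each,
# and returns non-zero if any check failed.
# curl reports 000 as the http_code when it received no HTTP response at all.
summarize_checks() {
  awk '
    {
      status = ($2 >= 200 && $2 < 400) ? "PASS" : "FAIL"
      if (status == "FAIL") failed++
      print status, $1, $2
    }
    END { exit (failed > 0) }
  '
}

# Example:
# printf 'order-service 200\npayments-service 000\n' | summarize_checks
```

The non-zero exit status makes the helper usable as a gate in a script or CI job that runs after a CNI upgrade or NetworkPolicy change.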

Production Best Practices & Common Pitfalls

Always test cross-namespace connectivity after adding any NetworkPolicy. A policy in the target namespace affects all inbound traffic to the pods it selects, including traffic from other namespaces that previously worked without any policy.
Consider Cilium with Hubble in production: the real-time flow and drop visibility is worth the migration cost, and it removes much of the guesswork from debugging NetworkPolicies.
Select namespaces by the kubernetes.io/metadata.name label. Kubernetes sets this label automatically on every namespace from version 1.21 onward, which makes it a reliable key for namespace-based NetworkPolicy selectors; on older clusters, add an equivalent label yourself.
Monitor CoreDNS with Prometheus and alert on coredns_dns_responses_total{rcode="SERVFAIL"} (named coredns_dns_response_rcode_count_total before CoreDNS 1.7). A spike indicates upstream DNS failure that will cause cascading service discovery failures across the entire cluster.
Never run tcpdump directly on a node in production without approval — packet capture on a financial services cluster is a compliance event that must be logged and justified.

🔴 Common Mistake: Checking pod logs to diagnose networking issues. Application logs say "connection refused" or "timeout" — they cannot tell you whether the failure is at Layer 2 (CNI), Layer 3 (kube-proxy), Layer 4 (DNS), or Layer 5 (NetworkPolicy). Always use network-level tools like netshoot, not application logs, for network debugging.

Quick Reference & Troubleshooting Commands

Command	Purpose
`kubectl run netshoot --image=nicolaka/netshoot -it --rm -n <ns> -- bash`	Launch network debug container
`kubectl debug -it --image=nicolaka/netshoot --target=<container> <pod> -n <ns> -- bash`	Debug sharing pod's network namespace
`kubectl get endpoints <service> -n <ns>`	Verify pods are registered as Service backends
`kubectl get networkpolicies -n <ns>`	List all NetworkPolicies in a namespace
`kubectl get pods -n kube-system | grep coredns`	Check CoreDNS pod health
`kubectl logs -n kube-system -l k8s-app=kube-dns`	CoreDNS logs for DNS failure diagnosis
`nslookup <service>.<ns>.svc.cluster.local`	Test DNS resolution from inside a pod
`curl -v http://<clusterip>:<port>/path`	Test Service IP directly bypassing DNS
`kubectl get pods -n <ns> -o wide`	Show pod IPs and which node they are on
`kubectl logs -n kube-system kube-proxy-<id>`	kube-proxy logs for Service routing issues
