Configuring Pod Disruption Budgets for Zero-Downtime Upgrades | DevOps Network


A Pod Disruption Budget (PDB) is a policy that tells Kubernetes how many pods of an application must stay available during voluntary disruptions such as node drains, cluster upgrades, and autoscaler scale-downs. It does not govern a Deployment's own rolling updates, which are controlled by the Deployment's rollout strategy. Without a PDB, drains can evict every pod of a service before any replacement is Ready, causing a complete outage.

Configuring Pod Disruption Budgets for Zero-Downtime Upgrades

The Problem PDB Solves

Imagine your payments API has 3 pods spread across 3 nodes. A cluster upgrade requires draining all nodes one by one. Without a PDB, Kubernetes can drain Node 1, evicting its pod. That is fine, you still have 2. But nothing makes the upgrade wait for the replacement pod to become Ready, so it can drain Node 2 right away. Now you have 1 pod serving all production traffic. Then Node 3. Zero pods. Complete outage.

A PDB prevents this by telling the cluster: "Never let availability drop below 2 pods while you drain nodes."

◈ DIAGRAM

WITHOUT PDB:                        WITH PDB (minAvailable: 2):
 
Node drain sequence:                Node drain sequence:
 
Node-1 drained → 2 pods running    Node-1 drained → 2 pods running ✓
Node-2 drained → 1 pod running     Node-2 drain attempt:
Node-3 drained → 0 pods ← OUTAGE     Kubernetes checks PDB
                                      2 pods available = minimum met
                                      WAIT — cannot proceed
                                      New pod scheduled first
                                      3 pods running again
                                      Node-2 drained → 2 pods ✓
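
The right-hand column of the diagram comes down to one comparison made for every eviction request. A minimal sketch of that comparison for a minAvailable budget follows; `can_evict` is a hypothetical helper that models the arithmetic only and does not talk to a cluster.

```shell
# Model of the check behind each eviction for a minAvailable budget.
# can_evict is a hypothetical helper; it does not contact a cluster.
# Usage: can_evict <healthy_pods> <min_available>
can_evict() {
  healthy=$1
  min_available=$2
  # Evicting one pod must still leave at least min_available healthy pods.
  if [ $((healthy - 1)) -ge "$min_available" ]; then
    echo "allowed"
  else
    echo "blocked"
  fi
}

can_evict 3 2   # all 3 pods healthy: the first drain may evict one pod
can_evict 2 2   # replacement not Ready yet: the second drain has to wait
```

The first call prints allowed and the second prints blocked. The blocked case is the WAIT step in the diagram: the drain keeps retrying until a replacement pod is Ready and the healthy count is back to 3.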

Voluntary vs Involuntary Disruptions

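PDBs only apply to voluntary disruptions, and more precisely to evictions requested through the Eviction API, which is what `kubectl drain`, managed cluster upgrades, and the cluster autoscaler use.

Type	Examples	PDB Applies?
Voluntary (eviction)	`kubectl drain`, cluster upgrade, autoscaler node scale-down	✅ Yes
Voluntary (direct delete)	Admin runs `kubectl delete pod` or deletes the Deployment	❌ No, bypasses the PDB
Involuntary	Node hardware failure, kernel panic, out-of-memory kill	❌ No

A PDB cannot protect you from a node dying unexpectedly, and it does not stop someone from deleting pods directly. It only governs evictions. Pods lost to involuntary disruptions still count against the budget, so a failure can leave fewer evictions allowed afterwards.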
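
To make the eviction path concrete, here is a sketch of the request that drain-style tooling sends for each pod. `eviction_body` is a hypothetical helper, the API server address in the comment is a placeholder, and authentication is left out.

```shell
# Print an Eviction object (policy/v1) for one pod.
# eviction_body is a hypothetical helper name.
# Usage: eviction_body <pod-name> <namespace>
eviction_body() {
  printf '{"apiVersion":"policy/v1","kind":"Eviction","metadata":{"name":"%s","namespace":"%s"}}\n' "$1" "$2"
}

# The body is POSTed to the pod's eviction subresource, for example:
#   eviction_body payments-api-7d9f8b-xk2p9 production > eviction.json
#   curl -H 'Content-Type: application/json' \
#     https://<api-server>/api/v1/namespaces/production/pods/payments-api-7d9f8b-xk2p9/eviction \
#     -d @eviction.json
# If the eviction would violate a PDB, the API server answers 429 Too Many Requests.
# A plain `kubectl delete pod` is a different API call and skips this check.
```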

Two Ways to Define a PDB

Option 1 — minAvailable: At least this many pods must be running at all times.

YAML

# pdb-payments-api.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api-pdb
  namespace: production
spec:
  minAvailable: 2          # At least 2 pods must be available during any disruption
  selector:
    matchLabels:
      app: payments-api    # Targets pods with this label

Option 2 — maxUnavailable: At most this many pods can be down at the same time.

YAML

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api-pdb
  namespace: production
spec:
  maxUnavailable: 1        # Only 1 pod can be unavailable at a time
  selector:
    matchLabels:
      app: payments-api

`minAvailable` vs `maxUnavailable` — Which to Use

◈ DIAGRAM

Deployment: 5 replicas
 
minAvailable: 3
→ Kubernetes can disrupt at most 2 pods at a time
→ Absolute number — stays fixed even if you scale the deployment
 
maxUnavailable: 1
→ Kubernetes can disrupt at most 1 pod at a time
→ Percentage option: maxUnavailable: "20%" adjusts as replicas scale
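
The difference between the two settings shows up once the Deployment scales. The sketch below models the arithmetic for both; the two helper names are hypothetical and nothing here queries a cluster.

```shell
# Disruptions allowed right now under each style of budget.
# Both helpers are hypothetical and only model the arithmetic.

# Usage: allowed_min_available <healthy> <min_available>
allowed_min_available() {
  n=$(( $1 - $2 ))
  if [ "$n" -lt 0 ]; then n=0; fi
  echo "$n"
}

# Usage: allowed_max_unavailable <healthy> <expected_replicas> <max_unavailable>
allowed_max_unavailable() {
  # The required healthy count is derived from the expected replica count.
  n=$(( $1 - ($2 - $3) ))
  if [ "$n" -lt 0 ]; then n=0; fi
  echo "$n"
}

allowed_min_available 5 3        # 5 replicas, minAvailable: 3
allowed_min_available 10 3       # scaled to 10: the floor stays at 3
allowed_max_unavailable 5 5 1    # 5 replicas, maxUnavailable: 1
allowed_max_unavailable 10 10 1  # scaled to 10: still one pod at a time
```

The four calls print 2, 7, 1 and 1. A fixed minAvailable lets far more pods be evicted at once after a scale-up, while maxUnavailable keeps the blast radius constant.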

Setting	Best For	Watch Out
`minAvailable: N`	Critical services where you know the exact floor (e.g. "always 2 payment pods")	If healthy pods fall to N or fewer for any reason, node drains block until more pods become Ready
`maxUnavailable: N`	Services where the budget should keep working as replicas scale up or down	Less intuitive for ops teams to reason about in an incident
`maxUnavailable: "10%"`	Large deployments (20+ replicas)	Rounds up: 10% of 5 pods = 0.5, which becomes 1, so one disruption takes out 20% of capacity

⚠️ Critical mistake: Setting minAvailable equal to your replica count. Example: 3 replicas with minAvailable: 3. Kubernetes can never evict one of these pods, so any node hosting one cannot be drained. Cluster upgrades stall until someone adds replicas, relaxes the budget, or deletes the PDB.

Using Percentages

YAML

spec:
  minAvailable: "60%"   # At least 60% of matched pods must be available

# With 10 replicas: at least 6 must be running → up to 4 can be disrupted
# With 5 replicas:  at least 3 must be running → up to 2 can be disrupted
# With 3 replicas:  at least 2 must be running → up to 1 can be disrupted

Percentages are useful for autoscaled deployments where replica count fluctuates — your PDB stays proportionally correct without manual updates.
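
Percentages rarely map to a whole number of pods, and Kubernetes rounds the result up. A small sketch of that conversion, using a hypothetical `ceil_percent` helper and integer arithmetic only:

```shell
# Convert a percentage budget into a pod count, rounding up.
# ceil_percent is a hypothetical helper.
# Usage: ceil_percent <replicas> <percent>
ceil_percent() {
  echo $(( ($1 * $2 + 99) / 100 ))
}

ceil_percent 10 60   # minAvailable: "60%" of 10 replicas
ceil_percent 5 60    # minAvailable: "60%" of 5 replicas
ceil_percent 3 60    # minAvailable: "60%" of 3 replicas (1.8 rounds up)
ceil_percent 5 10    # maxUnavailable: "10%" of 5 replicas (0.5 rounds up)
```

The calls print 6, 3, 2 and 1. Rounding up makes minAvailable stricter than the raw percentage, and makes maxUnavailable more permissive than it.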

Checking PDB Status

Bash

# List all PDBs in a namespace
kubectl get pdb -n production

# Output:
# NAME               MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
# payments-api-pdb   2               N/A               1                     5d
# auth-service-pdb   N/A             1                 1                     5d

# "ALLOWED DISRUPTIONS" = how many pods can currently be taken down
# If this is 0, node drains will block

# Describe for full details
kubectl describe pdb payments-api-pdb -n production
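
For scripts, the same numbers can be read from the PDB's status fields instead of parsing the table. A sketch, assuming kubectl access to the cluster; `pdb_status` is a hypothetical helper name, while the four status fields are part of the policy/v1 PodDisruptionBudget API.

```shell
# Print the status counters behind the ALLOWED DISRUPTIONS column, tab-separated:
# currentHealthy, desiredHealthy, expectedPods, disruptionsAllowed.
# pdb_status is a hypothetical helper; it needs kubectl access to the cluster.
# Usage: pdb_status <pdb-name> <namespace>
pdb_status() {
  kubectl get pdb "$1" -n "$2" \
    -o jsonpath='{.status.currentHealthy}{"\t"}{.status.desiredHealthy}{"\t"}{.status.expectedPods}{"\t"}{.status.disruptionsAllowed}{"\n"}'
}

# Example: pdb_status payments-api-pdb production
```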

Why a Node Drain Gets Stuck

A common scenario in large production clusters (think Razorpay or Hotstar scale): a cluster upgrade is running, and one node refuses to drain. The drain command hangs. The reason is almost always a PDB with ALLOWED DISRUPTIONS: 0.

Bash

# Drain a node during cluster upgrade
kubectl drain mumbai-worker-3 \
  --ignore-daemonsets \
  --delete-emptydir-data

# Output when PDB is blocking:
# error when evicting pods/"payments-api-7d9f8b-xk2p9" -n "production"
# (will retry after 5s): Cannot evict pod as it would violate
# the pod's disruption budget.

Bash

# Diagnose why ALLOWED DISRUPTIONS is 0
kubectl get pdb payments-api-pdb -n production

# Then check the actual pod count vs minAvailable
kubectl get pods -l app=payments-api -n production

# If only 2 pods are running and minAvailable is 2:
# No pod can be evicted, because evicting any one drops below the minimum
# Fix: Scale up the deployment to 3+ replicas first, then drain
kubectl scale deployment payments-api --replicas=4 -n production
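
After scaling up, the new pods still have to become Ready before the budget opens. The sketch below polls the PDB until at least one disruption is allowed; `wait_for_budget` is a hypothetical helper, and the default of 30 checks spaced 10 seconds apart is an arbitrary assumption.

```shell
# Poll a PDB until it allows at least one disruption, or give up.
# wait_for_budget is a hypothetical helper; it needs kubectl access.
# Usage: wait_for_budget <pdb-name> <namespace> [attempts] [sleep-seconds]
wait_for_budget() {
  attempts=${3:-30}
  pause=${4:-10}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    allowed=$(kubectl get pdb "$1" -n "$2" -o jsonpath='{.status.disruptionsAllowed}')
    if [ "${allowed:-0}" -gt 0 ]; then
      echo "budget open: $allowed disruption(s) allowed"
      return 0
    fi
    i=$((i + 1))
    sleep "$pause"
  done
  echo "budget still closed after $attempts checks" >&2
  return 1
}

# Example:
#   kubectl scale deployment payments-api --replicas=4 -n production
#   wait_for_budget payments-api-pdb production && \
#     kubectl drain mumbai-worker-3 --ignore-daemonsets --delete-emptydir-data
```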

Full Production Setup — Deployment + PDB Together

YAML

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1     # Rolling update can take down 1 pod at a time
      maxSurge: 1           # Can temporarily create 1 extra pod during rollout
  template:
    metadata:
      labels:
        app: payments-api   # ← Must match the PDB selector exactly
    spec:
      containers:
        - name: api
          image: registry.razorpay.in/payments-api:v2.5.1

# pdb.yaml: apply this alongside the Deployment
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api-pdb
  namespace: production
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: payments-api     # ← Must match the Deployment pod labels exactly

Bash

# Apply both together
kubectl apply -f deployment.yaml -f pdb.yaml -n production

# Verify the PDB is correctly targeting pods
kubectl get pdb payments-api-pdb -n production
# ALLOWED DISRUPTIONS should be 1 if 3 pods are running and maxUnavailable is 1

PDB for StatefulSets

StatefulSets (databases, Kafka, ZooKeeper) are particularly important to protect: each pod has its own identity and data, clients often address pods individually rather than through a load balancer, and the application may need a quorum of members to keep working.

YAML

# zookeeper-pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zookeeper-pdb
  namespace: production
spec:
  minAvailable: 2    # Zookeeper 3-node cluster: must keep 2 for quorum
  selector:
    matchLabels:
      app: zookeeper

A 3-node ZooKeeper ensemble needs a quorum of 2 to elect a leader. minAvailable: 2 ensures that evictions never take down 2 ZooKeeper pods at the same time, which would break quorum and leave the ensemble unable to process writes.
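
The quorum figure generalizes to other ensemble sizes: a majority of n members is floor(n/2) + 1, and that number is a natural minAvailable for the PDB. A sketch with a hypothetical `quorum_size` helper:

```shell
# Majority quorum for an ensemble of n members: floor(n/2) + 1.
# quorum_size is a hypothetical helper.
# Usage: quorum_size <members>
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

quorum_size 3   # 3-node ensemble
quorum_size 5   # 5-node ensemble
```

The calls print 2 and 3, so a 5-node ensemble tolerates two pods being down at once, while a 3-node ensemble tolerates only one.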

Checking PDB During Cluster Upgrade (Incident Workflow)

Bash

# 1. Before draining, check all PDB statuses across the cluster
kubectl get pdb --all-namespaces

# 2. Identify any PDB with ALLOWED DISRUPTIONS = 0
kubectl get pdb --all-namespaces | grep " 0 "

# 3. For each blocking PDB, check actual pod count
kubectl get pods -n <namespace> -l <label-from-pdb-selector>

# 4. If pods are fewer than minAvailable due to earlier failures:
kubectl scale deployment <name> --replicas=<higher-count> -n <namespace>

# 5. Then retry the drain
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
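
The `grep " 0 "` filter in step 2 matches any column that happens to contain 0, so it can report PDBs that are not actually blocking. A stricter variant reads the status field directly. This is a sketch that assumes kubectl and awk are available; `blocking_pdbs` is a hypothetical helper name.

```shell
# List PDBs that currently allow zero disruptions, as "namespace<TAB>name".
# blocking_pdbs is a hypothetical helper; it needs kubectl access and awk.
blocking_pdbs() {
  kubectl get pdb --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.disruptionsAllowed}{"\n"}{end}' |
    awk -F '\t' '$3 == 0 { print $1 "\t" $2 }'
}

# Example: blocking_pdbs
```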

🔴 Common Mistake: Creating a PDB with a selector that does not match any pods. The PDB exists but protects nothing. Always verify that kubectl get pdb shows a non-zero ALLOWED DISRUPTIONS value after creation. If it shows 0 and kubectl describe pdb reports Total: 0, your selector is wrong.
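
A selector check like this is easy to automate right after `kubectl apply`. The sketch below reads the number of pods the PDB counts and fails when it is zero; `check_pdb_selector` is a hypothetical helper and assumes kubectl access.

```shell
# Fail loudly when a PDB selects no pods.
# check_pdb_selector is a hypothetical helper; it needs kubectl access.
# Usage: check_pdb_selector <pdb-name> <namespace>
check_pdb_selector() {
  expected=$(kubectl get pdb "$1" -n "$2" -o jsonpath='{.status.expectedPods}')
  if [ "${expected:-0}" -eq 0 ]; then
    echo "PDB $1 matches no pods: check spec.selector against the pod labels" >&2
    return 1
  fi
  echo "PDB $1 covers $expected pod(s)"
}

# Example: check_pdb_selector payments-api-pdb production
```

Give the controller a few seconds after creating the PDB before running the check, since the status is filled in asynchronously.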

💡 Tip: In clusters running on EKS or GKE where the cloud provider performs node upgrades automatically, PDBs are your main defense against upgrade-caused outages. The managed upgrade process honors PDBs when draining nodes, though typically only up to a provider-defined timeout, so check your provider's limits. At Hotstar scale, every Deployment with more than 1 replica should have a PDB, with maxUnavailable: 1 as the safe default for stateless services.