A Pod Disruption Budget (PDB) is a policy that tells Kubernetes the minimum number of pods that must stay available during voluntary disruptions such as node drains, cluster upgrades, and autoscaler scale-downs. Without one, node drains can evict every pod of a service before replacements are ready, causing a complete outage.
+++
Configuring Pod Disruption Budgets for Zero-Downtime Upgrades
The Problem PDB Solves
Imagine your payments API has 3 pods spread across 3 nodes. A cluster upgrade requires draining all nodes one by one. Without a PDB, Kubernetes can drain Node 1, terminating its pod. That is fine, you still have 2. But it can immediately drain Node 2 next, before a replacement pod is ready. Now you have 1 pod serving all production traffic. Then Node 3. Zero pods. Complete outage.
A PDB prevents this by telling the cluster: "Never let availability drop below 2 pods while you drain nodes."
```text
WITHOUT PDB:
  Node drain sequence:
  Node-1 drained → 2 pods running
  Node-2 drained → 1 pod running
  Node-3 drained → 0 pods → OUTAGE

WITH PDB (minAvailable: 2):
  Node drain sequence:
  Node-1 drained → 2 pods running ✓
  Node-2 drain attempt:
    Kubernetes checks PDB
    2 pods available = minimum met
    WAIT → cannot proceed
    New pod scheduled first
    3 pods running again
  Node-2 drained → 2 pods ✓
```

Voluntary vs Involuntary Disruptions
PDBs only apply to voluntary disruptions: actions an administrator or the cluster itself initiates intentionally.
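| Type | Examples | PDB Applies? |
|---|---|---|
| Voluntary (through the Eviction API) | `kubectl drain`, cluster upgrade, node scaling down | ✅ Yes |
| Voluntary (direct deletion) | Admin runs `kubectl delete pod` or deletes the Deployment | ❌ No, direct deletion bypasses the PDB |
| Involuntary | Node hardware failure, kernel panic, out-of-memory kill | ❌ No |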
PDB cannot protect you from a node dying unexpectedly. It only governs intentional operations.
Two Ways to Define a PDB
Option 1: `minAvailable`. At least this many pods must be running at all times.

```yaml
# pdb-payments-api.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api-pdb
  namespace: production
spec:
  minAvailable: 2        # At least 2 pods must be available during any disruption
  selector:
    matchLabels:
      app: payments-api  # Targets pods with this label
```

Option 2: `maxUnavailable`. At most this many pods can be down at the same time.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api-pdb
  namespace: production
spec:
  maxUnavailable: 1      # Only 1 pod can be unavailable at a time
  selector:
    matchLabels:
      app: payments-api
```

`minAvailable` vs `maxUnavailable`: Which to Use
```text
Deployment: 5 replicas

minAvailable: 3
  → Kubernetes can disrupt at most 2 pods at a time
  → Absolute number: stays fixed even if you scale the deployment

maxUnavailable: 1
  → Kubernetes can disrupt at most 1 pod at a time
  → Percentage option: maxUnavailable: "20%" adjusts as replicas scale
```

| Setting | Best For | Watch Out |
|---|---|---|
| `minAvailable: N` | Critical services where you know the exact floor (e.g. "always 2 payment pods") | If healthy pods drop to N or fewer for any reason, node drains will block indefinitely |
| `maxUnavailable: N` | Services whose replica count changes, because the availability floor moves with it | Less intuitive for ops teams to reason about in an incident |
| `maxUnavailable: "10%"` | Large deployments (20+ replicas) | Kubernetes rounds the result up: 10% of 5 pods = 1 pod, so on small deployments the share actually disrupted (20% here) can exceed the percentage you set |
⚠️ Critical mistake: Setting `minAvailable` equal to your replica count. Example: 3 replicas with `minAvailable: 3`. Kubernetes can never drain a node that runs one of these pods, because evicting any pod would violate the budget. Cluster upgrades will stall until someone relaxes or deletes the PDB.
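The arithmetic behind this mistake can be sketched in a few lines of shell. This is a simplified model, not the controller's actual code: it assumes the number of allowed disruptions is the count of healthy pods minus the `minAvailable` floor, never going below zero. The function name `allowed_disruptions` is hypothetical.

```shell
#!/usr/bin/env bash
# allowed_disruptions HEALTHY MIN_AVAILABLE
# Simplified model (an assumption, not the real controller): the number of
# pods that may be evicted right now is healthy - minAvailable, floored at 0.
allowed_disruptions() {
  local healthy=$1 min_available=$2
  local allowed=$(( healthy - min_available ))
  if (( allowed < 0 )); then
    allowed=0
  fi
  echo "$allowed"
}

allowed_disruptions 3 2   # 3 healthy pods, minAvailable: 2
allowed_disruptions 3 3   # minAvailable equal to the replica count
allowed_disruptions 2 3   # one pod has already failed
```

Under this model the three calls print 1, 0, and 0: only the first configuration leaves room for a drain to evict a pod.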
Using Percentages
```yaml
spec:
  minAvailable: "60%"    # At least 60% of matched pods must be available

# With 10 replicas: at least 6 must be running → up to 4 can be disrupted
# With 5 replicas:  at least 3 must be running → up to 2 can be disrupted
# With 3 replicas:  at least 2 must be running → up to 1 can be disrupted
```

Percentages are useful for autoscaled deployments where replica count fluctuates: your PDB stays proportionally correct without manual updates.
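The figures in the comments above can be reproduced with integer arithmetic. The sketch below assumes a `minAvailable` percentage is rounded up to the next whole pod, which is what makes 60% of 3 replicas come out as 2. The helper name `pct_of_pods` is hypothetical.

```shell
#!/usr/bin/env bash
# pct_of_pods PERCENT REPLICAS
# Prints PERCENT of REPLICAS rounded up to a whole pod (ceiling division).
pct_of_pods() {
  local percent=$1 replicas=$2
  echo $(( (percent * replicas + 99) / 100 ))
}

# Reproduce the minAvailable: "60%" examples for three replica counts.
for replicas in 10 5 3; do
  floor=$(pct_of_pods 60 "$replicas")
  echo "replicas=$replicas minAvailable=$floor evictable=$(( replicas - floor ))"
done
```

For 10, 5, and 3 replicas this prints floors of 6, 3, and 2, leaving 4, 2, and 1 pods evictable.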
Checking PDB Status
```shell
# List all PDBs in a namespace
kubectl get pdb -n production

# Output:
# NAME               MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
# payments-api-pdb   2               N/A               1                     5d
# auth-service-pdb   N/A             1                 1                     5d

# "ALLOWED DISRUPTIONS" = how many pods can currently be taken down
# If this is 0, node drains will block

# Describe for full details
kubectl describe pdb payments-api-pdb -n production
```

Why a Node Drain Gets Stuck
The most common scenario at Razorpay or Hotstar: a cluster upgrade is running, and one node refuses to drain. The drain command hangs. The reason is almost always a PDB with ALLOWED DISRUPTIONS: 0.
```shell
# Drain a node during cluster upgrade
kubectl drain mumbai-worker-3 \
  --ignore-daemonsets \
  --delete-emptydir-data

# Output when PDB is blocking:
# error when evicting pods/"payments-api-7d9f8b-xk2p9" -n "production"
# (will retry after 5s): Cannot evict pod as it would violate
# the pod's disruption budget.
```

```shell
# Diagnose why ALLOWED DISRUPTIONS is 0
kubectl get pdb payments-api-pdb -n production

# Then check the actual pod count vs minAvailable
kubectl get pods -l app=payments-api -n production

# If only 2 pods are running and minAvailable is 2:
# No pod can be evicted: evicting any one drops below the minimum
# Fix: Scale up the deployment to 3+ replicas first, then drain
kubectl scale deployment payments-api --replicas=4 -n production
```

Full Production Setup: Deployment + PDB Together
```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1     # Rolling update can take down 1 pod at a time
      maxSurge: 1           # Can temporarily create 1 extra pod during rollout
  template:
    metadata:
      labels:
        app: payments-api   # ← Must match the PDB selector exactly
    spec:
      containers:
        - name: api
          image: registry.razorpay.in/payments-api:v2.5.1
---
# pdb.yaml: apply this alongside the Deployment
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api-pdb
  namespace: production
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: payments-api     # ← Must match the Deployment pod labels exactly
```

```shell
# Apply both together
kubectl apply -f deployment.yaml -f pdb.yaml -n production

# Verify the PDB is correctly targeting pods
kubectl get pdb payments-api-pdb -n production
# ALLOWED DISRUPTIONS should be 1 if 3 pods are running and maxUnavailable is 1
```

PDB for StatefulSets
StatefulSets (databases, Kafka, Zookeeper) are particularly important to protect: each pod is individually addressable rather than interchangeable behind a load balancer, and a quorum of members may be required for the system to keep working.
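For majority-quorum systems, the PDB floor follows directly from the member count. The sketch below assumes a simple majority rule (more than half of the members must be up); the function names `quorum_size` and `max_evictable` are hypothetical, and systems with different quorum rules need different numbers.

```shell
#!/usr/bin/env bash
# quorum_size MEMBERS
# Smallest majority of an ensemble: floor(MEMBERS / 2) + 1.
quorum_size() {
  local members=$1
  echo $(( members / 2 + 1 ))
}

# max_evictable MEMBERS
# How many members can be down while a majority still survives.
max_evictable() {
  local members=$1
  echo $(( members - (members / 2 + 1) ))
}

for members in 3 5 7; do
  echo "members=$members minAvailable=$(quorum_size "$members") maxUnavailable=$(max_evictable "$members")"
done
```

A 3-member ensemble therefore needs `minAvailable: 2`, and a 5-member ensemble needs `minAvailable: 3`.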
```yaml
# zookeeper-pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zookeeper-pdb
  namespace: production
spec:
  minAvailable: 2        # Zookeeper 3-node cluster: must keep 2 for quorum
  selector:
    matchLabels:
      app: zookeeper
```

A 3-node Zookeeper cluster requires a quorum of 2 to elect a leader. `minAvailable: 2` ensures Kubernetes never evicts 2 Zookeeper pods at the same time, which would break quorum and leave the cluster unable to accept writes.

Checking PDB During Cluster Upgrade (Incident Workflow)
```shell
# 1. Before draining, check all PDB statuses across the cluster
kubectl get pdb --all-namespaces

# 2. Identify any PDB with ALLOWED DISRUPTIONS = 0
kubectl get pdb --all-namespaces | grep " 0 "

# 3. For each blocking PDB, check actual pod count
kubectl get pods -n <namespace> -l <label-from-pdb-selector>

# 4. If pods are fewer than minAvailable due to earlier failures:
kubectl scale deployment <name> --replicas=<higher-count> -n <namespace>

# 5. Then retry the drain
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```

🔴 Common Mistake: Creating a PDB but using a selector that does not match any pods. The PDB exists but protects nothing. After creating a PDB, always verify that `kubectl get pdb` shows a non-zero ALLOWED DISRUPTIONS value. If it shows 0 while the workload is healthy, run `kubectl get pods -l <selector>` with the PDB's labels; if that returns no pods, your selector is wrong.
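Step 2 of the workflow and the selector check can be combined into one pass over the PDB status fields, which is less fragile than matching on column spacing with `grep`. The sketch below is one possible approach, not an official tool: it assumes the `policy/v1` status fields `expectedPods` and `disruptionsAllowed`, and the function names `list_pdb_status` and `classify_pdbs` are hypothetical.

```shell
#!/usr/bin/env bash
# list_pdb_status
# Prints one tab-separated row per PDB:
#   namespace, name, expectedPods, disruptionsAllowed
# Requires kubectl access to a cluster; it is only defined here, not called.
list_pdb_status() {
  kubectl get pdb --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.expectedPods}{"\t"}{.status.disruptionsAllowed}{"\n"}{end}'
}

# classify_pdbs
# Reads rows in the format above from stdin and reports problem PDBs:
#   - expectedPods == 0       -> the selector matches no pods
#   - disruptionsAllowed == 0 -> the PDB will block a drain
classify_pdbs() {
  awk -F'\t' '
    $3 == 0 { print $1 "/" $2 ": selector matches no pods"; next }
    $4 == 0 { print $1 "/" $2 ": blocking (0 allowed disruptions)" }
  '
}

# Usage against a live cluster:
#   list_pdb_status | classify_pdbs
```

Keeping the classification separate from the `kubectl` call means the filter can be checked against sample rows before it is run during an upgrade.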
💡 Tip: In clusters running on EKS or GKE where the cloud provider performs node upgrades automatically, PDBs are your last line of defense against upgrade-caused outages. The cloud upgrade process respects PDBs when draining nodes, although managed services typically honor them only up to a time limit, so check your provider's documentation. At Hotstar scale, every Deployment with more than 1 replica should have a PDB, with `maxUnavailable: 1` as the safe default for stateless services.