StatefulSet: Running Stateful Apps in Kubernetes
Why Deployments Break for Databases
A standard Deployment treats all pods as identical and disposable. Pod names are random (postgres-7d9f8c-xkqzp), storage is not guaranteed to follow a pod, and all pods can start or die simultaneously.
For a PostgreSQL primary-replica setup at Zerodha, this is catastrophic:
```
+--------------------------------------------+
| Deployment (WRONG for databases)           |
|                                            |
| postgres-7d9f8c-xkqzp <- random name       | <- If pod dies, new pod gets
| postgres-7d9f8c-ab1cd <- random name       |    a different random name.
|                                            |    Replicas lose their target.
| Shared PVC: postgres-data                  | <- Two pods writing to the same
|    ^ both pods mount this                  |    disk = data corruption
+--------------------------------------------+

+--------------------------------------------+
| StatefulSet (CORRECT for databases)        |
|                                            |
| postgres-0 <- stable, permanent name       | <- Conventionally the primary.
| postgres-1 <- stable, permanent name       |    Replicas always know where
| postgres-2 <- stable, permanent name       |    to replicate from.
|                                            |
| postgres-data-postgres-0 <- dedicated PVC  | <- Each pod owns its own disk.
| postgres-data-postgres-1 <- dedicated PVC  |    PVC survives pod restarts.
| postgres-data-postgres-2 <- dedicated PVC  |
+--------------------------------------------+
```

The Three Guarantees StatefulSet Provides
1. Stable Network Identity
Each pod gets a permanent DNS name that never changes, even after restarts:
```
+------------------------------------------------------+
| DNS record format:                                   |
|                                                      |
| <pod-name>.<headless-svc>.<ns>.svc.cluster.local     |
|                                                      |
| postgres-0.postgres-svc.production.svc.cluster.local |
| postgres-1.postgres-svc.production.svc.cluster.local |
| postgres-2.postgres-svc.production.svc.cluster.local |
+------------------------------------------------------+
```

2. Ordered Startup and Shutdown
```
+------------+     +------------+     +------------+
| postgres-0 |     | postgres-1 |     | postgres-2 |
|            |     |            |     |            |
| Starts 1st | --> | Starts 2nd | --> | Starts 3rd |
| (primary)  |     | (Running?) |     | (Running?) |
+------------+     +------------+     +------------+
```

Shutdown order is reversed: postgres-2 stops -> postgres-1 stops -> postgres-0 stops (primary last).

3. Persistent Volume Per Pod
Each pod gets its own PVC via volumeClaimTemplates. The PVC survives pod deletion: when the pod is recreated, it reattaches to the same volume automatically.
A Real StatefulSet Manifest: Redis Cluster
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
  namespace: production
spec:
  serviceName: redis-headless   # Must reference a Headless Service
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7.2-alpine
          ports:
            - containerPort: 6379
          volumeMounts:
            - name: redis-data
              mountPath: /data   # Each pod mounts its own /data volume
          resources:
            requests:
              memory: "256Mi"
              cpu: "100m"
            limits:
              memory: "512Mi"
              cpu: "500m"
  volumeClaimTemplates:          # This is what makes it a StatefulSet
    - metadata:
        name: redis-data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3-encrypted
        resources:
          requests:
            storage: 10Gi        # Each of 3 pods gets its own 10Gi PVC
                                 # Total: 30Gi provisioned automatically
```

Headless Service: The Required Companion
StatefulSets require a Headless Service (clusterIP: None) to assign stable DNS records to each individual pod:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: redis-headless
  namespace: production
spec:
  clusterIP: None   # "Headless": no virtual IP, just DNS per pod
  selector:
    app: redis
  ports:
    - port: 6379
      name: redis
```

```
+------------------------------------------+  +------------------------------------------+
| Regular Service (clusterIP: 10.96.0.5)   |  | Headless Service (clusterIP: None)       |
|                                          |  |                                          |
| DNS: redis-svc.production.svc...         |  | DNS: redis-0.redis-headless.production.. |
| Routes to ANY redis pod (random)         |  | DNS: redis-1.redis-headless.production.. |
| Good for stateless read balancing        |  | Routes to SPECIFIC pod (by ordinal)      |
|                                          |  | Required for primary-replica topology    |
+------------------------------------------+  +------------------------------------------+
```

Remember: Without the Headless Service, pods don't get their individual DNS records. The serviceName in your StatefulSet spec MUST exactly match the Headless Service metadata.name; a mismatch means pods start but DNS never resolves.
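To illustrate how a workload might consume these per-pod DNS names, here is a minimal sketch of a container env fragment that points a client at one specific Redis pod. The variable names REDIS_PRIMARY_HOST and REDIS_PORT are hypothetical, and treating redis-0 as the primary is an assumption of this example rather than something Kubernetes enforces.

```yaml
# Fragment of a client pod's container spec (illustrative only).
env:
  - name: REDIS_PRIMARY_HOST   # hypothetical variable name
    # Format: <pod-name>.<headless-svc>.<namespace>.svc.cluster.local
    value: redis-0.redis-headless.production.svc.cluster.local
  - name: REDIS_PORT           # hypothetical variable name
    value: "6379"
```

Because the name is tied to the pod's ordinal rather than its IP, the value stays valid after redis-0 is rescheduled.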
Managing StatefulSet Pods
```shell
# List all pods: notice the stable ordinal names
kubectl get pods -n production -l app=redis
# NAME      READY   STATUS    RESTARTS   AGE
# redis-0   1/1     Running   0          5d
# redis-1   1/1     Running   0          5d
# redis-2   1/1     Running   0          5d

# List the auto-created PVCs (one per pod)
kubectl get pvc -n production -l app=redis
# NAME                 STATUS   VOLUME             CAPACITY
# redis-data-redis-0   Bound    pvc-3a8f2c1d-...   10Gi
# redis-data-redis-1   Bound    pvc-5c2e4b1a-...   10Gi
# redis-data-redis-2   Bound    pvc-7d3f9c2b-...   10Gi

# Scale up: redis-3 will be created with its own PVC
kubectl scale statefulset redis --replicas=4 -n production

# Watch a rolling update (pods are updated in reverse order: 2, 1, 0)
kubectl rollout status statefulset/redis -n production
```

StatefulSet vs Deployment: Decision Guide
| Factor | Use Deployment | Use StatefulSet |
|---|---|---|
| App stores state to disk | No | Yes |
| Pods need unique identity | No | Yes |
| All pods are interchangeable | Yes | No |
| Startup order matters | No | Yes |
| Examples | API servers, web frontends | PostgreSQL, Kafka, Elasticsearch, Redis Cluster, ZooKeeper |
Troubleshooting Common StatefulSet Problems
| Problem | Symptom | Fix |
|---|---|---|
| Pod stuck in Pending | redis-1 never starts | redis-0 is not yet Running and Ready; fix the first pod before others proceed |
| PVC not binding | redis-data-redis-0 stuck in Pending | StorageClass missing or wrong AZ; check the events in kubectl describe pvc redis-data-redis-0 |
| Pod recreated but lost data | Data empty after restart | Pod reattached to wrong PVC; verify PVC name matches pod ordinal exactly |
| Headless Service DNS not resolving | nslookup redis-0.redis-headless fails | serviceName in StatefulSet spec doesn't match the Headless Service name |
| Scale-down left orphan PVCs | PVCs remain after replicas: 1 | By default a StatefulSet does not delete PVCs on scale-down; delete them manually after confirming data is safe |
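Keeping PVCs after a scale-down is the default behaviour. Newer Kubernetes releases also expose a persistentVolumeClaimRetentionPolicy field on the StatefulSet spec that can change it. The fragment below is a sketch, assuming your cluster version supports this field; the values shown are one possible combination, not a recommendation.

```yaml
# Fragment of the redis StatefulSet spec (assumes the cluster supports this field).
spec:
  persistentVolumeClaimRetentionPolicy:
    whenScaled: Delete    # delete a pod's PVC when that pod is scaled away (its data is lost)
    whenDeleted: Retain   # keep PVCs if the whole StatefulSet is deleted (the default)
```

Leaving both values at Retain preserves the manual-cleanup behaviour described in the table.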
Common Mistake: Running a database as a Deployment with a single shared PVC. When a replacement pod is scheduled onto a different node while the old pod still holds the volume, the ReadWriteOnce PVC cannot be attached to both nodes, and the new pod can stay stuck in ContainerCreating or Pending.
Security: Never run StatefulSet pods as root inside the container. A container breakout on a database StatefulSet running as root can corrupt data or exfiltrate the entire volume. Always set securityContext.runAsNonRoot: true and readOnlyRootFilesystem: true on StatefulSet containers.
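As a rough sketch of where those settings live in the pod template, the fragment below splits them between the pod-level and container-level securityContext. The UID and GID values are assumptions for illustration; check which user your image actually runs as before copying them.

```yaml
# Fragment of the StatefulSet pod template (spec.template.spec).
securityContext:              # pod-level settings
  runAsNonRoot: true          # kubelet refuses to start a container running as UID 0
  runAsUser: 999              # assumed UID of the redis user; verify for your image
  fsGroup: 999                # assumed GID; gives the process group access to /data
containers:
  - name: redis
    image: redis:7.2-alpine
    securityContext:          # container-level settings
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false
```

With a read-only root filesystem, only the mounted volume (here /data) remains writable. An application that also writes elsewhere, such as /tmp, would need an extra volume for that path.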
Tip: When upgrading StatefulSets (e.g., a Redis version bump), use updateStrategy: RollingUpdate with partition: 2 to canary-test the upgrade on only the last pod first. Once it is verified healthy, set partition: 0 to roll out to all pods.
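A minimal sketch of the canary step for the three-replica redis StatefulSet used in this article:

```yaml
# Fragment of the redis StatefulSet spec.
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2   # only pods with ordinal >= 2 (here redis-2) get the new template
```

Pods with an ordinal below the partition keep the old template, even if they are deleted and recreated. Lowering partition to 0 then lets the update proceed through redis-1 and redis-0.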