Implementing Liveness and Readiness Probes for Zero-Downtime Deploys
Overview and What You Will Learn
Without probes, Kubernetes has no way to know whether your application is actually healthy; it only knows if the container process is running. A Node.js API that is running but stuck in an infinite loop, a Spring Boot service that started but cannot connect to its database, or a pod that is ready to receive traffic before its cache is warmed: all of these appear healthy to Kubernetes without probes. This lab walks you through configuring all three probe types to achieve genuinely zero-downtime rolling deployments.
By the end of this guide you will be able to:
- Configure liveness probes to detect and automatically restart deadlocked or hung containers
- Configure readiness probes to gate traffic until the application is genuinely ready to serve requests
- Configure startup probes to protect slow-starting applications from premature liveness kills
- Design probe endpoints in your application that accurately reflect health state
- Tune probe timing parameters to balance responsiveness against false-positive restarts
Why This Matters in Production
During a Swiggy deployment at peak dinner hour, a new API version rolled out across 20 pods. The new version had a subtle problem: it started successfully and passed the liveness probe, but no readiness probe was configured. Kubernetes sent live traffic to pods that had not finished loading their restaurant menu cache, resulting in 8 seconds of 500 errors for every user whose request hit an unready pod before the cache warmed up.
A correctly configured readiness probe would have kept traffic on the old pods until each new pod's cache was fully loaded: zero errors, zero user impact. Probes are one of the most impactful configurations for achieving true zero-downtime deployments.
Core Principles
The three probe types and their distinct purposes:

| | Startup probe | Liveness probe | Readiness probe |
|---|---|---|---|
| Question answered | "Has the app finished starting up?" | "Is the app still alive and not hung?" | "Is the app ready to receive traffic?" |
| When it runs | At startup, repeatedly until it first succeeds | Continuously throughout pod life | Continuously throughout pod life |
| On failure | kubelet retries every periodSeconds; once failureThreshold is reached, the container is restarted | kubelet restarts the container | Pod is removed from Service endpoints (no restart) |
| Interaction with other probes | Holds back the liveness and readiness probes until it succeeds | Starts after the startup probe succeeds | Evaluated independently of the liveness probe |
The probe execution order for a new pod:

1. The pod starts.
2. startupProbe runs (liveness disabled during this phase).
   - Fails: the kubelet waits periodSeconds and retries, up to failureThreshold × periodSeconds in total.
   - Succeeds: startup is complete.
3. livenessProbe and readinessProbe both begin running in parallel.
   - livenessProbe failure: the container is restarted.
   - readinessProbe failure: the pod is removed from Service endpoints.
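The sequencing above can be sketched as a small simulation. This is a simplified model, not the kubelet's actual implementation: `simulateContainer`, `hitsFailureThreshold`, and `readyState` are hypothetical names, and each input array is assumed to hold one boolean per probe attempt (true means the probe succeeded).

```javascript
// Simplified model of probe sequencing for one container.
// Assumption: each results array holds one boolean per attempt (true = success).

// True once `failureThreshold` consecutive failures have been seen.
function hitsFailureThreshold(results, failureThreshold) {
  let streak = 0;
  for (const ok of results) {
    streak = ok ? 0 : streak + 1;
    if (streak >= failureThreshold) return true;
  }
  return false;
}

// Readiness flips only after enough consecutive identical results.
function readyState(results, { failureThreshold = 3, successThreshold = 1 }) {
  let ready = false; // a container starts out not ready
  let streak = 0;
  let last = null;
  for (const ok of results) {
    streak = ok === last ? streak + 1 : 1;
    last = ok;
    if (ok && streak >= successThreshold) ready = true;
    if (!ok && streak >= failureThreshold) ready = false;
  }
  return ready;
}

function simulateContainer({ startup, liveness, readiness, config }) {
  // Phase 1: only the startup probe runs; the other probes are held back.
  const firstSuccess = startup.indexOf(true);
  const failures = firstSuccess === -1 ? startup.length : firstSuccess;
  if (failures >= config.startup.failureThreshold) {
    return { phase: 'startup', restart: true, inEndpoints: false };
  }
  if (firstSuccess === -1) {
    return { phase: 'startup', restart: false, inEndpoints: false };
  }
  // Phase 2: liveness and readiness are evaluated side by side.
  const restart = hitsFailureThreshold(liveness, config.liveness.failureThreshold);
  const ready = readyState(readiness, config.readiness);
  return { phase: 'running', restart, inEndpoints: !restart && ready };
}

const result = simulateContainer({
  startup: [false, false, true],
  liveness: [true, true, true],
  readiness: [false, true, true],
  config: {
    startup: { failureThreshold: 30 },
    liveness: { failureThreshold: 3 },
    readiness: { failureThreshold: 3, successThreshold: 2 },
  },
});
console.log(result); // { phase: 'running', restart: false, inEndpoints: true }
```

Note that a failing readiness sequence never sets `restart`; only the startup and liveness paths do, which is the key distinction between the probe types.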
Detailed Step-by-Step Practical Lab
Step 1: Understand Probe Mechanisms
This lab covers three probe mechanisms (Kubernetes also offers a gRPC probe, which is not covered here); choose based on what your application exposes:
```yaml
# Three alternative fragments; a single probe uses exactly one mechanism.

# Mechanism 1: HTTP GET (most common for web services)
# Kubernetes makes an HTTP GET to the specified path and port
# Success: HTTP status 200-399
# Failure: HTTP status 400+ or connection refused or timeout
livenessProbe:
  httpGet:
    path: /healthz          # Your health check endpoint
    port: 8080
    httpHeaders:
    - name: Custom-Header
      value: kubernetes-probe
---
# Mechanism 2: TCP Socket (for non-HTTP services like databases, message queues)
# Kubernetes attempts to open a TCP connection to the port
# Success: connection established
# Failure: connection refused or timeout
livenessProbe:
  tcpSocket:
    port: 5432              # PostgreSQL port: if TCP accepts, the container counts as alive
---
# Mechanism 3: Exec Command (run a command inside the container)
# Success: command exits with code 0
# Failure: command exits with non-zero code or times out
livenessProbe:
  exec:
    command:
    - pg_isready            # PostgreSQL readiness check binary
    - -U
    - postgres
    - -d
    - zerodha_trading
```

Step 2: Configure a Complete Probe Suite for a Node.js API
```yaml
# deployment-api-probes.yaml - full probe configuration for Swiggy order API
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-api
  namespace: production
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2            # Allow 2 extra pods during update
      maxUnavailable: 0      # Never take a pod down before a new one is ready
                             # This is what makes rolling updates truly zero-downtime
  selector:
    matchLabels:
      app: order-api
  template:
    metadata:
      labels:
        app: order-api
    spec:
      containers:
      - name: order-api
        image: registry.swiggy.in/order-api:v6.2.1
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "250m"
            memory: "256Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"

        # STARTUP PROBE: protects slow-starting apps from liveness kills
        # Total startup budget: 30 attempts × 10s = 300s (5 minutes maximum)
        startupProbe:
          httpGet:
            path: /healthz
            port: 8080
          failureThreshold: 30     # Allow up to 5 minutes for startup
          periodSeconds: 10        # Check every 10 seconds
          successThreshold: 1      # One success is enough

        # LIVENESS PROBE: detects hung or deadlocked application
        # Only runs AFTER startupProbe succeeds
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 0   # startupProbe handles the delay, so set to 0
          periodSeconds: 15        # Check every 15 seconds
          timeoutSeconds: 5        # Fail if no response within 5 seconds
          failureThreshold: 3      # Restart after 3 consecutive failures (45s)
          successThreshold: 1

        # READINESS PROBE: gates traffic until app is truly ready
        readinessProbe:
          httpGet:
            path: /ready           # Different endpoint from liveness
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5         # Check every 5 seconds, faster than liveness
          timeoutSeconds: 3
          failureThreshold: 3      # Remove from endpoints after 3 failures (15s)
          successThreshold: 2      # Require 2 consecutive successes before adding back
```

Remember: Use `maxUnavailable: 0` combined with `maxSurge: 1` or higher for zero-downtime deployments. This forces Kubernetes to bring up new pods and wait for them to pass readiness before terminating old ones. Without this, Kubernetes may terminate old pods while new ones are still starting.
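The timing comments in the Deployment (300s, 45s, 15s) all come from the same arithmetic: failureThreshold × periodSeconds. The sketch below reproduces that calculation so you can try other values. `failureWindowSeconds` is a hypothetical helper, and the result is an approximation that ignores timeoutSeconds and scheduling jitter; the defaults in the function mirror the Kubernetes defaults (periodSeconds 10, failureThreshold 3).

```javascript
// Probe settings copied from the order-api Deployment (all values in seconds).
const probes = {
  startup: { periodSeconds: 10, failureThreshold: 30 },
  liveness: { periodSeconds: 15, failureThreshold: 3 },
  readiness: { periodSeconds: 5, failureThreshold: 3 },
};

// Approximate time a probe must keep failing before Kubernetes acts on it.
// Ignores timeoutSeconds and jitter, so treat the result as a rough budget.
function failureWindowSeconds({ periodSeconds = 10, failureThreshold = 3 }) {
  return periodSeconds * failureThreshold;
}

console.log(failureWindowSeconds(probes.startup));   // 300
console.log(failureWindowSeconds(probes.liveness));  // 45
console.log(failureWindowSeconds(probes.readiness)); // 15
```

Comparing the windows shows the intent of the configuration: an unhealthy pod stops receiving traffic after about 15 seconds, well before the 45-second liveness window triggers a restart.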
Step 3: Implement Correct Health Endpoints in Your Application
The probe endpoint must accurately reflect the application's actual health, not just return 200 unconditionally:
```javascript
// health.js - Node.js health endpoints for Swiggy order API
const express = require('express');
const v8 = require('v8');
// Assumption: db, redis, and menuCache are this app's own client objects.
// './clients' is a placeholder path; adjust it to your project layout.
const { db, redis, menuCache } = require('./clients');

const router = express.Router();

// LIVENESS endpoint - am I alive and not deadlocked?
// Should only check things that would require a restart to fix
router.get('/healthz', (req, res) => {
  try {
    // Answering at all shows the event loop is responsive (not deadlocked).
    // Compare used heap against V8's heap size limit: heapTotal grows on
    // demand, so heapUsed / heapTotal can approach 1.0 in a healthy process.
    const { used_heap_size, heap_size_limit } = v8.getHeapStatistics();
    const heapUsedPercent = used_heap_size / heap_size_limit;

    if (heapUsedPercent > 0.95) {
      // Heap is 95%+ full - restart before the heap is exhausted
      return res.status(503).json({
        reason: 'heap_pressure',
        heapUsedPercent
      });
    }

    res.status(200).json({ status: 'alive', uptime: process.uptime() });
  } catch (err) {
    // Log details server-side; return only a status string to the caller
    console.error('liveness check failed', err);
    res.status(503).json({ status: 'unhealthy' });
  }
});

// READINESS endpoint - am I ready to serve traffic?
// Check all dependencies the application needs to function
router.get('/ready', async (req, res) => {
  const checks = {};

  try {
    // Check database connectivity
    await db.query('SELECT 1');
    checks.database = 'ok';
  } catch (err) {
    checks.database = 'failed';
  }

  try {
    // Check Redis connectivity
    await redis.ping();
    checks.redis = 'ok';
  } catch (err) {
    checks.redis = 'failed';
  }

  try {
    // Check that restaurant menu cache is warmed (Swiggy-specific)
    const cacheReady = await menuCache.isWarm();
    checks.menuCache = cacheReady ? 'ok' : 'warming';
  } catch (err) {
    checks.menuCache = 'failed';
  }

  const allReady = Object.values(checks).every(v => v === 'ok');

  res.status(allReady ? 200 : 503).json({
    checks
  });
});

module.exports = router;
```

Security: Never expose sensitive information in health endpoints, such as database connection strings, internal IPs, or stack traces. Health endpoints are often public-facing. Return only status strings and boolean indicators, never raw error objects.
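One gap in a readiness handler that awaits each dependency in turn is that a single hung dependency can stall the response past the probe's timeoutSeconds. A minimal sketch of one way to bound each check follows, using only built-in promises and timers. `withTimeout` and `runReadinessChecks` are hypothetical helpers, and the check functions in the usage example are stand-ins for real calls such as `db.query('SELECT 1')`.

```javascript
// Runs one dependency check but gives up after timeoutMs, so a hung
// dependency cannot make the probe handler itself hang.
function withTimeout(check, timeoutMs) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve('timeout'), timeoutMs);
  });
  const attempt = Promise.resolve()
    .then(check)
    .then(() => 'ok', () => 'failed'); // map the outcome to a status string only
  return Promise.race([attempt, timeout]).finally(() => clearTimeout(timer));
}

// Runs all checks in parallel and reports status strings (no error details).
async function runReadinessChecks(checks, timeoutMs = 1000) {
  const names = Object.keys(checks);
  const statuses = await Promise.all(
    names.map((name) => withTimeout(checks[name], timeoutMs))
  );
  const report = Object.fromEntries(names.map((name, i) => [name, statuses[i]]));
  return { ready: statuses.every((s) => s === 'ok'), checks: report };
}

// Usage with stand-in checks (assumptions, not real clients):
runReadinessChecks({
  database: async () => {},                  // resolves: reported as 'ok'
  redis: () => new Promise(() => {}),        // never settles: reported as 'timeout'
  menuCache: async () => { throw new Error('cold'); }, // rejects: 'failed'
}, 50).then((report) => console.log(report));
```

Keeping the per-check timeout below the probe's timeoutSeconds means the handler always answers with a 200 or 503 itself, rather than leaving the kubelet to record a timeout.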
Step 4: Configure Probes for a Spring Boot Service
Spring Boot has built-in Actuator health endpoints that map perfectly to Kubernetes probes:
```yaml
# application.yaml - Spring Boot Actuator probe configuration
management:
  endpoint:
    health:
      probes:
        enabled: true          # Enable /actuator/health/liveness and /readiness
      show-details: never      # Never expose internals in health response
      group:
        readiness:
          # The readiness group contains only readinessState by default;
          # list db and redis explicitly so they affect readiness
          include: readinessState,db,redis
  endpoints:
    web:
      exposure:
        include: health,info,prometheus
  health:
    livenessstate:
      enabled: true
    readinessstate:
      enabled: true
    db:
      enabled: true            # Database connectivity indicator
    redis:
      enabled: true            # Redis connectivity indicator
```

```yaml
# deployment-springboot.yaml - probe config for Razorpay payments service (Spring Boot)
startupProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  failureThreshold: 60   # Spring Boot with heavy migrations needs up to 10 minutes
  periodSeconds: 10

livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
  successThreshold: 1
```

Tip: Spring Boot's `/actuator/health/liveness` only checks the application's internal state (not external dependencies). By default `/actuator/health/readiness` likewise reports only the internal readiness state; it checks the database and Redis only when their indicators are added to the readiness group, as in the configuration above. This separation maps well to the Kubernetes liveness/readiness model.
Step 5: Configure Probes for TCP Services (Databases, Kafka)
```yaml
# StatefulSet probes for PostgreSQL using exec mechanism
livenessProbe:
  exec:
    command:
    - pg_isready
    - -U
    - postgres
    - -h
    - localhost
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 6    # Allow 6 failures (60s) before restart, since DB restarts are slow

readinessProbe:
  exec:
    command:
    - pg_isready
    - -U
    - postgres
    - -h
    - localhost
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
```
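For contrast with the exec probes above, the sketch below shows roughly what a tcpSocket probe tests: whether a TCP connection to the port can be opened within a time limit. It says nothing about whether the server behind the port is accepting queries, which is why `pg_isready` is the stronger check for PostgreSQL. `tcpCheck` is a hypothetical helper built on Node's `net` module, not how the kubelet itself is implemented.

```javascript
const net = require('net');

// Resolves true if a TCP connection to host:port opens within timeoutMs,
// and false on refusal, any socket error, or timeout.
function tcpCheck(host, port, timeoutMs = 1000) {
  return new Promise((resolve) => {
    const socket = net.connect({ host, port });
    const finish = (open) => {
      socket.destroy(); // always release the socket
      resolve(open);
    };
    socket.setTimeout(timeoutMs);
    socket.once('connect', () => finish(true));
    socket.once('timeout', () => finish(false));
    socket.once('error', () => finish(false));
  });
}

// Usage: the result depends on whether something is listening locally.
tcpCheck('127.0.0.1', 5432).then((open) => {
  console.log(open ? 'port 5432 accepts connections' : 'port 5432 is not reachable');
});
```

A service such as Kafka that ships no readiness binary comparable to `pg_isready` is a typical case where the tcpSocket mechanism is a reasonable fallback.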