

Implementing Liveness and Readiness Probes for Zero-Downtime Deploys

Overview and What You Will Learn

Without probes, Kubernetes has no way to know whether your application is actually healthy — it only knows if the container process is running. A Node.js API that is running but stuck in an infinite loop, a Spring Boot service that started but cannot connect to its database, or a pod that is ready to receive traffic before its cache is warmed — all of these appear healthy to Kubernetes without probes. This lab walks you through configuring all three probe types to achieve genuinely zero-downtime rolling deployments.

By the end of this guide you will be able to:

Configure liveness probes to detect and automatically restart deadlocked or hung containers
Configure readiness probes to gate traffic until the application is genuinely ready to serve requests
Configure startup probes to protect slow-starting applications from premature liveness kills
Design probe endpoints in your application that accurately reflect health state
Tune probe timing parameters to balance responsiveness against false-positive restarts

Why This Matters in Production

During a Swiggy deployment at peak dinner hour, a new API version rolled out across 20 pods. The rollout had a subtle flaw: each new pod started successfully and passed its liveness probe, but no readiness probe was configured. Kubernetes therefore sent live traffic to pods that had not finished loading their restaurant menu cache, resulting in 8 seconds of 500 errors for every user whose request hit an unready pod before the cache warmed up.

A correctly configured readiness probe would have kept each new pod out of the Service endpoints until its own cache was fully loaded, so traffic would have stayed on ready pods throughout the rollout: zero errors, zero user impact. For rolling deployments, readiness probes are what let Kubernetes tell "process started" apart from "safe to serve traffic".

Core Principles

The three probe types and their distinct purposes:

STARTUP PROBE               LIVENESS PROBE              READINESS PROBE
─────────────               ──────────────              ───────────────
"Has the app finished       "Is the app still           "Is the app ready
starting up?"               alive and not hung?"        to receive traffic?"

Runs periodically at        Runs continuously           Runs continuously
startup until it            throughout pod life         throughout pod life
succeeds once

On failure →                On failure →                On failure →
kubelet retries every       kubelet RESTARTS            pod REMOVED from
periodSeconds; after        the container               Service endpoints
failureThreshold                                        (no restart)
failures the container
is killed and restarted

Disables liveness and       Triggers after              Independent of
readiness probes            startup probe               liveness probe
while running               succeeds

The probe execution order for a new pod:

Pod starts
    │
    ▼
startupProbe runs (liveness and readiness disabled during this phase)
    │
    ├─► Fails → kubelet waits periodSeconds → retries
    │           (up to failureThreshold × periodSeconds total;
    │            after that the container is killed and restarted)
    │
    └─► Succeeds → startup complete
            │
            ▼
    livenessProbe + readinessProbe both begin running in parallel
            │                          │
            ▼                          ▼
    Failure → restart          Failure → removed from
    container                  Service endpoints
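The startup phase of that flow can be traced with a small simulation. This is a simplified model written for illustration, not kubelet code: `simulateStartup` is a hypothetical helper, and it ignores `initialDelaySeconds`, timeouts, and scheduling jitter.

```javascript
// Simplified model of the startup-probe phase (illustrative, not kubelet code).
// results: array of booleans, one per probe attempt (true = probe passed).
function simulateStartup(results, { failureThreshold }) {
  let failures = 0;
  for (let i = 0; i < results.length; i++) {
    if (results[i]) {
      // First success ends the startup phase; liveness and readiness begin.
      return { outcome: 'started', attempts: i + 1 };
    }
    failures++;
    if (failures >= failureThreshold) {
      // Startup budget exhausted: the container is killed and restarted.
      return { outcome: 'restarted', attempts: i + 1 };
    }
  }
  // Ran out of recorded attempts without a success or hitting the threshold.
  return { outcome: 'still-starting', attempts: results.length };
}

// A pod that passes on its third attempt, well inside a threshold of 30.
console.log(simulateStartup([false, false, true], { failureThreshold: 30 }));
```

Multiplying `attempts` by `periodSeconds` gives a rough estimate of how long the pod spent in the startup phase.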

Detailed Step-by-Step Practical Lab

Step 1 — Understand Probe Mechanisms

Kubernetes supports three probe mechanisms — choose based on what your application exposes:

YAML

# The three blocks below are alternatives: a probe uses exactly ONE mechanism.

# Mechanism 1: HTTP GET (most common for web services)
# Kubernetes makes an HTTP GET to the specified path and port
# Success: HTTP status 200-399
# Failure: HTTP status 400+ or connection refused or timeout
livenessProbe:
  httpGet:
    path: /healthz          # Your health check endpoint
    port: 8080
    httpHeaders:
      - name: Custom-Header
        value: kubernetes-probe

# Mechanism 2: TCP Socket (for non-HTTP services like databases, message queues)
# Kubernetes attempts to open a TCP connection to the port
# Success: connection established
# Failure: connection refused or timeout
livenessProbe:
  tcpSocket:
    port: 5432              # PostgreSQL port; if TCP accepts, pod is alive

# Mechanism 3: Exec Command (run a command inside the container)
# Success: command exits with code 0
# Failure: command exits with non-zero code or times out
livenessProbe:
  exec:
    command:
      - pg_isready           # PostgreSQL readiness check binary
      - -U
      - postgres
      - -d
      - zerodha_trading
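To check an endpoint locally against the HTTP GET rule described above, a small Node.js script can apply the same 200-399 test. This is a sketch for smoke-testing only: `isHttpProbeSuccess` and `probeHttp` are hypothetical helper names, the URL in the usage note is a placeholder, and the `timeout` option here is a socket idle timeout, which only approximates the probe's `timeoutSeconds`.

```javascript
const http = require('node:http');

// Success rule for an HTTP probe: any status from 200 up to and including 399.
function isHttpProbeSuccess(statusCode) {
  return Number.isInteger(statusCode) && statusCode >= 200 && statusCode < 400;
}

// Issue one GET and resolve to true/false, mimicking a single probe attempt.
function probeHttp(url, timeoutMs = 1000) {
  return new Promise((resolve) => {
    const req = http.get(url, { timeout: timeoutMs }, (res) => {
      res.resume(); // discard the body; only the status code matters
      resolve(isHttpProbeSuccess(res.statusCode));
    });
    req.on('timeout', () => {
      req.destroy();  // no answer in time counts as a failed attempt
      resolve(false);
    });
    req.on('error', () => resolve(false)); // connection refused or reset
  });
}
```

For example, `probeHttp('http://localhost:8080/healthz').then(console.log)` prints `true` only when the endpoint answers in time with a 2xx or 3xx status.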

Step 2 — Configure a Complete Probe Suite for a Node.js API

YAML

# deployment-api-probes.yaml: full probe configuration for Swiggy order API
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-api
  namespace: production
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2           # Allow 2 extra pods during update
      maxUnavailable: 0     # Never take a pod down before a new one is ready
                            # This is what makes rolling updates truly zero-downtime
  selector:
    matchLabels:
      app: order-api
  template:
    metadata:
      labels:
        app: order-api
    spec:
      containers:
        - name: order-api
          image: registry.swiggy.in/order-api:v6.2.1
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"

          # STARTUP PROBE: protects slow-starting apps from liveness kills
          # Total startup budget: 30 attempts × 10s = 300s (5 minutes maximum)
          startupProbe:
            httpGet:
              path: /healthz
              port: 8080
            failureThreshold: 30      # Allow up to 5 minutes for startup
            periodSeconds: 10         # Check every 10 seconds
            successThreshold: 1       # One success is enough

          # LIVENESS PROBE: detects hung or deadlocked application
          # Only runs AFTER startupProbe succeeds
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 0    # startupProbe handles the delay, so set to 0
            periodSeconds: 15         # Check every 15 seconds
            timeoutSeconds: 5         # Fail if no response within 5 seconds
            failureThreshold: 3       # Restart after 3 consecutive failures (45s)
            successThreshold: 1

          # READINESS PROBE: gates traffic until app is truly ready
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080              # Different endpoint from liveness
            initialDelaySeconds: 5
            periodSeconds: 5          # Check every 5 seconds, faster than liveness
            timeoutSeconds: 3
            failureThreshold: 3       # Remove from endpoints after 3 failures (15s)
            successThreshold: 2       # Require 2 consecutive successes before adding back

📌 Remember: Use maxUnavailable: 0 combined with maxSurge: 1 or higher for zero-downtime deployments. This forces Kubernetes to bring up new pods and wait for them to pass readiness before terminating old ones. Without this, Kubernetes may terminate old pods while new ones are still starting.
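The timing comments in the manifest above follow from simple arithmetic, which the sketch below makes explicit. Both helpers are hypothetical names introduced for illustration; the calculation ignores `timeoutSeconds`, `initialDelaySeconds`, and kubelet jitter, and assumes `maxSurge` and `maxUnavailable` are given as absolute numbers rather than percentages.

```javascript
// Approximate seconds of continuous failure before a probe's action fires.
function failureWindowSeconds({ periodSeconds, failureThreshold }) {
  return periodSeconds * failureThreshold;
}

// Pod-count bounds Kubernetes keeps to during a rolling update.
function rolloutBounds({ replicas, maxSurge, maxUnavailable }) {
  return {
    maxTotalPods: replicas + maxSurge,           // old + new pods at the peak
    minAvailablePods: replicas - maxUnavailable, // pods that must stay ready
  };
}

// Values taken from the order-api manifest in this step.
const startupBudget = failureWindowSeconds({ periodSeconds: 10, failureThreshold: 30 });
const livenessWindow = failureWindowSeconds({ periodSeconds: 15, failureThreshold: 3 });
const readinessWindow = failureWindowSeconds({ periodSeconds: 5, failureThreshold: 3 });
const bounds = rolloutBounds({ replicas: 5, maxSurge: 2, maxUnavailable: 0 });

console.log({ startupBudget, livenessWindow, readinessWindow, bounds });
```

With these settings the startup budget is 300 seconds, a hung container is restarted after about 45 seconds, an unready pod leaves the endpoints after about 15 seconds, and the rollout never drops below 5 ready pods while peaking at 7 pods in total.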

Step 3 — Implement Correct Health Endpoints in Your Application

The probe endpoint must accurately reflect the application's actual health — not just return 200 unconditionally:

JAVASCRIPT

// health.js: Node.js health endpoints for Swiggy order API
const express = require('express');
const v8 = require('v8');
// Assumption: db, redis and menuCache are clients your application already
// creates; './clients' is a placeholder for wherever they are exported.
const { db, redis, menuCache } = require('./clients');

const router = express.Router();

// LIVENESS endpoint: am I alive and not deadlocked?
// Should only check things that would require a restart to fix.
// Answering at all shows the event loop is still responsive.
router.get('/healthz', (req, res) => {
  try {
    // Compare used heap against V8's heap size limit. heapTotal only reflects
    // what is currently allocated and grows on demand, so heapUsed / heapTotal
    // can sit near 1.0 in a healthy process and trigger needless restarts.
    const { used_heap_size, heap_size_limit } = v8.getHeapStatistics();
    const heapUsedPercent = used_heap_size / heap_size_limit;

    if (heapUsedPercent > 0.95) {
      // Heap is at 95%+ of its limit: restart before the process runs out of memory
      return res.status(503).json({
        reason: 'heap_pressure',
        heapUsedPercent
      });
    }

    res.status(200).json({ status: 'alive', uptime: process.uptime() });
  } catch (err) {
    // Return a status string only; never send the raw error to the caller
    res.status(503).json({ status: 'unhealthy' });
  }
});

// READINESS endpoint: am I ready to serve traffic?
// Check all dependencies the application needs to function
router.get('/ready', async (req, res) => {
  const checks = {};

  try {
    // Check database connectivity
    await db.query('SELECT 1');
    checks.database = 'ok';
  } catch (err) {
    checks.database = 'failed';
  }

  try {
    // Check Redis connectivity
    await redis.ping();
    checks.redis = 'ok';
  } catch (err) {
    checks.redis = 'failed';
  }

  try {
    // Check that restaurant menu cache is warmed (Swiggy-specific)
    const cacheReady = await menuCache.isWarm();
    checks.menuCache = cacheReady ? 'ok' : 'warming';
  } catch (err) {
    checks.menuCache = 'failed';
  }

  const allReady = Object.values(checks).every(v => v === 'ok');

  res.status(allReady ? 200 : 503).json({
    checks
  });
});

module.exports = router;

⚠️ Security: Never expose sensitive information in health endpoints — database connection strings, internal IPs, or stack traces. Health endpoints are often public-facing. Return only status strings and boolean indicators, never raw error objects.
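One practical gap in dependency checks like those in Step 3 is that a hung database or Redis call can outlast the probe's `timeoutSeconds`, so the probe fails on a timeout instead of a clean 503. The sketch below shows one way to bound each check and run them in parallel while still returning only status strings. `withTimeout`, `runCheck`, and `runChecks` are hypothetical helpers, and the 1000 ms budget in the usage note is an assumed value chosen to fit inside the 3-second readiness timeout.

```javascript
// Reject if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('timeout')), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Run one dependency check and reduce the result to a status string.
async function runCheck(checkFn, ms) {
  try {
    await withTimeout(checkFn(), ms);
    return 'ok';
  } catch {
    return 'failed'; // the raw error is deliberately not returned
  }
}

// Run all checks in parallel; returns e.g. { database: 'ok', redis: 'failed' }.
async function runChecks(checkFns, ms) {
  const names = Object.keys(checkFns);
  const statuses = await Promise.all(names.map((name) => runCheck(checkFns[name], ms)));
  return Object.fromEntries(names.map((name, i) => [name, statuses[i]]));
}
```

Inside the `/ready` handler this could be used as `const checks = await runChecks({ database: () => db.query('SELECT 1'), redis: () => redis.ping() }, 1000);`, keeping the worst-case response time close to the per-check budget rather than the sum of all checks.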

Step 4 — Configure Probes for a Spring Boot Service

Spring Boot has built-in Actuator health endpoints that map perfectly to Kubernetes probes:

YAML

# application.yaml: Spring Boot Actuator probe configuration
management:
  endpoint:
    health:
      probes:
        enabled: true           # Enable /actuator/health/liveness and /readiness
      show-details: never       # Never expose internals in health response
      group:
        readiness:
          include: readinessState,db,redis   # Indicators only affect readiness when listed here
  endpoints:
    web:
      exposure:
        include: health, info, prometheus
  health:
    livenessstate:
      enabled: true
    readinessstate:
      enabled: true
    db:
      enabled: true             # Enable the database health indicator
    redis:
      enabled: true             # Enable the Redis health indicator

YAML

# deployment-springboot.yaml: probe config for Razorpay payments service (Spring Boot)
startupProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  failureThreshold: 60          # Spring Boot with heavy migrations needs up to 10 minutes
  periodSeconds: 10

livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
  successThreshold: 1

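💡 Tip: Spring Boot's /actuator/health/liveness reports only the application's internal liveness state, not external dependencies. By default, /actuator/health/readiness likewise reports only the internal readiness state; indicators such as the database and Redis count toward readiness only when they are added to the readiness health group (management.endpoint.health.group.readiness.include). With that group configured, the separation maps directly onto the Kubernetes liveness/readiness model.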

Step 5 — Configure Probes for TCP Services (Databases, Kafka)

YAML

# StatefulSet probes for PostgreSQL using exec mechanism
livenessProbe:
  exec:
    command:
      - pg_isready
      - -U
      - postgres
      - -h
      - localhost
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 6           # Allow 6 failures (60s) before restart; DB restarts are slow

readinessProbe:
  exec:
    command:
      - pg_isready
      - -U
      - postgres
      - -h
      - localhost
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3