What is HPA? | DevOps Dictionary

HPA - Automatic Scaling Based on Real Traffic

What is HPA in Simple Terms?

HPA (Horizontal Pod Autoscaler) is your automatic traffic manager. During Swiggy's dinner rush (7pm–9pm), order volume spikes 10x. HPA detects that average CPU utilization has climbed to 80%, above its 70% target, and adds pods to handle the load. Once traffic drops it removes them again, saving infrastructure costs without any human intervention.

How HPA Decides to Scale

+--------------------------------------------------+
| HPA controller polls Metrics Server every 15s    | <- continuous monitoring
+--------------------------------------------------+
                        |
                        v
+--------------------------------------------------+
| Current avg CPU across pods: 85%                 |
| Target avg CPU configured: 70%                   |
| Pods needed = ceil(replicas * current / target)  |
+--------------------------------------------------+
                        |
            +-----------+-----------+
            |                       |
            v                       v
+----------------------+   +----------------------+
| CPU > target         |   | CPU < target         |
| -> Scale UP          |   | -> Scale DOWN        |
| Add pods (up to max) |   | Remove pods (to min) |
+----------------------+   +----------------------+

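The scale formula Kubernetes uses: desiredReplicas = ceil(currentReplicas * (currentMetric / targetMetric)). By default the controller skips scaling when the ratio currentMetric / targetMetric is within 10% of 1.0, so small fluctuations around the target do not change the replica count.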
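
To make the formula concrete, here is a minimal Python sketch of the calculation. It is an illustration, not the controller's actual code: the function name `desired_replicas` is hypothetical, the min/max defaults are borrowed from the example manifest on this page, the 0.1 tolerance is the documented Kubernetes default, and the starting count of 8 replicas is an assumed value.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=3, max_replicas=20, tolerance=0.1):
    """Apply the HPA scale formula, then clamp to the configured bounds."""
    ratio = current_metric / target_metric
    # Within tolerance of the target: keep the current replica count
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(8, 85, 70))    # 8 * 85/70 = 9.71..., rounded up to 10
print(desired_replicas(8, 68, 70))    # ratio 0.97 is within tolerance, stays at 8
print(desired_replicas(8, 35, 70))    # 8 * 0.5 = 4
print(desired_replicas(15, 140, 70))  # 30 is capped at max_replicas = 20
```

Rounding up with `ceil` means HPA errs on the side of one extra pod rather than running slightly under capacity.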

Example HPA Manifest

YAML

# hpa.yaml: autoscale the order-service based on CPU and memory
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 3         # Always keep at least 3 pods; never scale to zero
  maxReplicas: 20        # Hard ceiling; prevents runaway scaling
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # Scale when avg CPU across all pods exceeds 70% of requests
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80    # Also scale if memory pressure builds up

Deployment Must Have Resource Requests Defined

HPA cannot calculate utilization percentage without knowing what the pod requested. This is a mandatory prerequisite:

YAML

# deployment.yaml: resources.requests MUST be set for HPA to work
# (excerpt of the pod template, i.e. spec.template.spec in the Deployment)
spec:
  containers:
    - name: order-service
      image: registry.swiggy.in/order-service:v3.1.0
      resources:
        requests:
          cpu: "250m"        # HPA uses this as the baseline for % calculation
          memory: "256Mi"
        limits:
          cpu: "1000m"
          memory: "512Mi"
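
The dependency on requests is easier to see as arithmetic. The Python sketch below approximates how a utilization percentage is derived: total usage across pods divided by total requested CPU. The function name and the usage readings are made-up illustrations, and it assumes every pod has the same 250m request as in the deployment above.

```python
def cpu_utilization_percent(usage_millicores, request_millicores):
    """Utilization relative to requests: total usage over total requests, in percent."""
    total_usage = sum(usage_millicores)
    total_request = request_millicores * len(usage_millicores)
    return 100 * total_usage // total_request

# Three pods, each requesting 250m CPU; usage readings are sample values
print(cpu_utilization_percent([200, 175, 150], 250))  # 525 / 750 -> 70

# Utilization is measured against the request, not the limit,
# so a pod bursting to 500m reads as 200%
print(cpu_utilization_percent([500], 250))            # -> 200
```

Without a request there is no denominator, which is why HPA reports an unknown target for pods that omit `resources.requests`.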

Checking HPA Status

Bash

# See current replica count, targets, and scaling activity
kubectl get hpa -n production

# Example output:
# NAME                REFERENCE                  TARGETS   MINPODS   MAXPODS   REPLICAS
# order-service-hpa   Deployment/order-service   68%/70%   3         20        8

# Detailed status including last scale event
kubectl describe hpa order-service-hpa -n production

# Watch HPA react to live traffic in real time
kubectl get hpa order-service-hpa -n production -w

Scaling Behavior — Controlling Scale Up and Scale Down Speed

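By default, HPA is aggressive on scale-up and cautious on scale-down: it scales up with no stabilization window (adding up to 4 pods or 100% of the current replicas every 15 seconds, whichever is greater) and applies a 300-second stabilization window before scaling down. You can tune both directions: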

YAML

# Add a behavior block to the HPA spec to control scaling velocity
spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0      # Scale up immediately when needed
      policies:
        - type: Pods
          value: 4                        # Add at most 4 pods per 60-second period
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300    # Wait 5 minutes before scaling down
      policies:
        - type: Percent
          value: 10                       # Remove at most 10% of pods per minute
          periodSeconds: 60

The stabilizationWindowSeconds on scale-down prevents flapping — HPA won't remove pods immediately after a traffic spike drops, giving headroom for the next wave.
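
The rate limits in the behavior block above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the function names are hypothetical, no other scale events have occurred in the current period, and the percent limit is truncated to a whole pod count, which is an assumption about rounding rather than a guarantee.

```python
def limit_scale_up(current, desired, max_pods_added=4):
    """Pods policy: add at most `max_pods_added` replicas per period."""
    return min(desired, current + max_pods_added)

def limit_scale_down(current, desired, max_percent_removed=10):
    """Percent policy: remove at most `max_percent_removed` percent per period."""
    floor = current * (100 - max_percent_removed) // 100
    return max(desired, floor)

print(limit_scale_up(8, 20))    # capped at 8 + 4 = 12
print(limit_scale_up(8, 10))    # 10 is within the limit
print(limit_scale_down(20, 3))  # capped at 90% of 20 = 18
```

With these settings, going from 8 to 20 pods takes three one-minute periods (12, 16, 20), while draining from 20 back to 3 takes far longer, which is the intended asymmetry.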

Troubleshooting HPA

Symptom	Likely Cause	Fix
HPA shows `<unknown>/70%`	Metrics Server not installed	Install `metrics-server` in kube-system
HPA not scaling up	`resources.requests` not set on deployment	Add CPU/memory requests to pod spec
HPA stuck at `minReplicas`	Current CPU below threshold	Check actual usage with `kubectl top pods`
HPA scaling too aggressively	Stabilization window too short or scaling policies too permissive	Raise `behavior.scaleDown.stabilizationWindowSeconds` or add rate-limiting `policies`
Replicas hit `maxReplicas` and stop	Max ceiling reached	Raise `maxReplicas` or investigate pod performance

Bash

# Verify Metrics Server is working (prerequisite for HPA)
kubectl top pods -n production

# Check why HPA is not scaling; the Events section is key
kubectl describe hpa order-service-hpa -n production | grep -A 20 Events

# Generate load to test HPA behavior
# (assumes a Service named order-service listening on port 80 in the production namespace)
kubectl run load-gen --image=busybox -it --rm --restart=Never -- \
  /bin/sh -c "while sleep 0.01; do wget -q -O- http://order-service.production; done"

📌 Remember: HPA requires the Metrics Server to be installed in the cluster. Without it, HPA cannot read CPU or memory metrics and will report <unknown> targets — staying completely inactive. Run kubectl top pods to verify Metrics Server is working before creating any HPA.

🔴 Common Mistake: Setting minReplicas: 1 in production. If that single pod crashes, is evicted, or sits on a node that is being drained, your service has zero availability until a replacement pod starts and becomes ready. Set minReplicas: 3 or higher for any production workload at Razorpay or PhonePe scale.

💡 Tip: Keep scaleDown.stabilizationWindowSeconds at 300 (the Kubernetes default) or higher in the HPA behavior block, and set it explicitly so the intent is visible in the manifest. With a window of 0, HPA removes pods as soon as a traffic spike drops, only to add them back 2 minutes later when the next spike arrives. The stabilization window prevents this flapping and keeps your pod count stable during volatile traffic patterns like Hotstar's live streaming events.
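
A rough way to picture the scale-down stabilization window: the controller remembers its recent replica recommendations and acts on the highest one still inside the window. The Python sketch below models only that idea; the function name, the timestamps, and the recommendation history are invented for illustration.

```python
def stabilized_scale_down(recommendations, now, window_seconds=300):
    """Return the highest replica recommendation seen inside the window.

    `recommendations` is a list of (timestamp_seconds, replicas) tuples.
    """
    recent = [replicas for ts, replicas in recommendations
              if now - ts <= window_seconds]
    return max(recent)

# A spike needed 12 pods until t=120s, then load dropped and 5 would suffice
history = [(0, 12), (120, 12), (240, 5), (360, 5)]
print(stabilized_scale_down(history, now=360))  # 12 at t=120 is still in the window
print(stabilized_scale_down(history, now=480))  # only the 5s remain -> 5
```

The pod count stays at 12 until every high recommendation has aged out of the window, so a second spike arriving a few minutes after the first finds the capacity still in place.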

⚠️ Security: HPA with maxReplicas set too high can become a cost explosion vector. A traffic spike — or a DDoS — can trigger HPA to spin up hundreds of pods, consuming all cluster node capacity and triggering expensive cloud autoscaling. Always set a sensible maxReplicas ceiling and configure cluster-level resource quotas per namespace to cap total pod resource consumption.
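
When choosing a maxReplicas ceiling or sizing a namespace quota, it helps to compute the worst case first. The Python sketch below multiplies the per-pod requests and limits from the deployment example on this page by maxReplicas: 20. The function name is hypothetical and the calculation ignores sidecars and other workloads sharing the namespace.

```python
def worst_case_footprint(max_replicas, cpu_millicores, memory_mib):
    """Total CPU (cores) and memory (GiB) if HPA scales to max_replicas."""
    return (max_replicas * cpu_millicores / 1000,
            max_replicas * memory_mib / 1024)

print(worst_case_footprint(20, 250, 256))   # requests: (5.0, 5.0)
print(worst_case_footprint(20, 1000, 512))  # limits:   (20.0, 10.0)
```

At full scale this one service reserves 5 cores and 5 GiB, and may burst to 20 cores and 10 GiB, so a namespace quota set below those figures would stop HPA before it reaches maxReplicas.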
