Scaling Deployments with Horizontal Pod Autoscaler (HPA)
Overview and What You Will Learn
Manual scaling is reactive and slow: by the time a human notices CPU spiking at 90%, users are already experiencing timeouts. The Kubernetes Horizontal Pod Autoscaler (HPA) addresses this by automatically adjusting the number of pod replicas based on observed metrics, keeping your application responsive as traffic changes without over-provisioning infrastructure during quiet periods.
By the end of this guide you will be able to:
- Deploy and configure HPA using both kubectl autoscale and manifest-based approaches
- Scale on CPU, memory, and custom application metrics using the autoscaling/v2 API
- Install and verify the Metrics Server required for HPA to function
- Tune scale-up and scale-down stabilisation windows to prevent thrashing
- Combine HPA with PodDisruptionBudgets to maintain availability during scaling events
Why This Matters in Production
Swiggy's order volume peaks between 7pm and 9pm every evening, sometimes at 8-10x its 3am baseline. Keeping enough pods running to handle peak traffic 24 hours a day is enormously wasteful. HPA solves this by scaling from 5 pods at 3am to 40 pods at 8pm automatically, then scaling back down overnight. The same pattern applies to Hotstar during live cricket matches, Zerodha during market open and close, and any platform with predictable or unpredictable traffic spikes.
Without HPA, engineers either over-provision (expensive) or under-provision (outages). HPA is the correct production answer to this tradeoff.
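As a rough illustration of the 5-to-40 pod range described above, a minimal autoscaling/v2 manifest might look like the sketch below. The Deployment name order-service, the HPA name, and the 70% CPU target are assumptions made for this example only; they are not taken from any real Swiggy configuration.

```yaml
# Minimal sketch: keep between 5 and 40 replicas based on average CPU utilisation.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service-hpa        # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service          # hypothetical Deployment to scale
  minReplicas: 5                 # overnight baseline from the example above
  maxReplicas: 40                # evening peak from the example above
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # assumed target, tune per workload
```

Utilisation is measured against each container's CPU request, so the target Deployment needs resource requests set for a manifest like this to have any effect.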
Core Principles
How the HPA control loop works: