AKS horizontal pod autoscaler

AKS Horizontal Pod Autoscaler, or HPA, automatically changes how many copies of a pod run for a workload. If demand rises and the configured metric crosses the target, HPA increases replicas. If demand falls, it can reduce replicas. It is different from the cluster autoscaler: HPA scales pods, while the cluster autoscaler scales nodes. In AKS, HPA helps applications handle traffic changes without requiring someone to manually edit replica counts during busy or quiet periods.

Aliases: HPA on AKS, horizontal pod autoscaler, Kubernetes HPA, pod autoscaling
Difficulty: intermediate
CLI mappings: 3
Last verified: 2026-05-09

Microsoft Learn

The AKS Horizontal Pod Autoscaler is the Kubernetes HPA capability used in AKS to automatically change the number of pod replicas for a workload based on observed metrics such as CPU, memory, or custom metrics.

Microsoft Learn: Scaling options for applications in Azure Kubernetes Service (2026-05-09)

Technical context

Technically, HPA is a Kubernetes controller running in the AKS cluster. It watches a target such as a Deployment, StatefulSet, or ReplicaSet and calculates desired replicas from metrics. CPU and memory metrics usually come from Metrics Server, while custom metrics require additional adapters or monitoring integration. HPA respects configured minimum and maximum replicas, scaling behavior, readiness, resource requests, and stabilization windows. It sits in the workload scaling layer and often works together with cluster autoscaler capacity planning.
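The replica calculation mentioned above follows the formula given in the Kubernetes HPA documentation: desired replicas = ceil(current replicas × current metric value / target metric value), bounded by the configured minimum and maximum. The following is a minimal shell sketch of that arithmetic, assuming whole-number inputs. The function name `desired_replicas` is hypothetical, and the real controller also applies a tolerance band, readiness handling, and stabilization windows that this sketch leaves out.

```shell
# Sketch only: integer version of the documented HPA formula
#   desired = ceil(current_replicas * current_metric / target_metric)
# clamped to [min, max]. Tolerance, readiness, and stabilization are omitted.
# Usage: desired_replicas CURRENT_REPLICAS CURRENT_METRIC TARGET_METRIC MIN MAX
desired_replicas() {
  current=$1; metric=$2; target=$3; min=$4; max=$5
  # Integer ceiling division: (a + b - 1) / b
  desired=$(( (current * metric + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
}
```

For example, `desired_replicas 4 90 60 2 10` prints 6: four replicas averaging 90 percent CPU against a 60 percent target need six replicas to bring the average back to the target.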

Why it matters

HPA matters because traffic rarely arrives in a perfectly flat line. Without automated pod scaling, teams either overprovision replicas for rare peaks or risk slow responses during busy periods. HPA lets the workload respond to real demand while staying inside defined limits. It is especially useful for stateless APIs, ingress controllers, queue processors, and microservices that can run multiple independent replicas. The value is not just extra pods; it is controlled elasticity. Operators define the signal, minimum, maximum, and expected behavior, then verify whether scaling actually protects user experience. HPA is also easier to review when teams record the metric target, current metric value, replica bounds, scaling events, and pending-pod state, because reviewers can then compare the intended design with the running state.

Where you see it

Signals, screens, and Azure surfaces where this term usually becomes operational.

Signal 01

kubectl get hpa output, deployment manifests, Metrics API data, Container insights HPA metrics, and workload replica counts

Signal 02

Azure portal, CLI output, IaC templates, monitoring dashboards, and incident runbooks

When this becomes relevant

Specific situations where this term helps solve real Azure design, operations, migration, security, reliability, cost, or governance problems.

  • Scale stateless web APIs when CPU or request demand rises.
  • Increase ingress-controller replicas during regional or tenant traffic spikes.
  • Scale queue-processing workloads from custom metrics such as queue depth.
  • Reduce idle replica count outside business hours while preserving a safe minimum.

Real-world case studies

Different enterprise-style examples that show the term being used to hit measurable objectives.

Case study 01

AKS horizontal pod autoscaler in action

Scenario, objectives, solution, measured impact, and takeaway.

Scenario

FlashCart, a flash-sale retailer, used AKS for checkout APIs that spiked sharply during limited product drops. The team implemented AKS Horizontal Pod Autoscaler so checkout pods could expand quickly without keeping peak replicas online all day.

Business/Technical Objectives
  • Keep checkout p95 latency below 350 milliseconds during sale launches
  • Reduce idle replica spend outside promotions by at least 30 percent
  • Avoid manual replica changes during high-pressure release windows
  • Document evidence that autoscaling protected customer checkout flow

Solution Using AKS horizontal pod autoscaler

Engineers configured HPA on the checkout Deployment with CPU utilization targets, a safe minimum replica count, and a maximum sized against node-pool and database limits. Metrics Server health was verified before launch, and cluster autoscaler limits were checked so new pods could be scheduled when demand exceeded current capacity. The runbook included kubectl commands to watch HPA status, replica changes, events, and pending pods. Azure Monitor dashboards tracked latency, error rate, node utilization, and database saturation so the team could prove scaling helped instead of masking a dependency bottleneck.

Results & Business Impact
  • Checkout p95 latency stayed at 312 milliseconds during the largest product drop
  • Idle checkout replica hours fell by 38 percent between sale windows
  • No engineer manually changed replica counts during the three-hour launch period
  • Post-event evidence showed HPA scaled from 8 to 46 replicas with no pending-pod backlog

Key Takeaway for Glossary Readers

AKS Horizontal Pod Autoscaler is valuable because it turns traffic variation into controlled pod elasticity instead of manual emergency scaling.

Case study 02

AKS horizontal pod autoscaler in action

Scenario, objectives, solution, measured impact, and takeaway.

Scenario

MediLink Connect, a telehealth provider, ran appointment scheduling services on AKS and saw morning traffic surges when clinics opened. Fixed replica counts either wasted capacity overnight or produced slow responses during the first hour of business.

Business/Technical Objectives
  • Maintain scheduling API availability during daily morning peaks
  • Reduce overnight pod capacity without harming readiness at opening time
  • Give support engineers clear autoscaling evidence during incidents
  • Keep database connection growth within approved safety limits

Solution Using AKS horizontal pod autoscaler

The platform team configured HPA for the scheduling API with CPU and memory-aware targets, a conservative minimum for overnight readiness, and a maximum aligned to database connection pool limits. Readiness probes were tuned so newly started pods received traffic only after dependencies were ready. Engineers paired HPA with cluster autoscaler and reviewed node-pool maximums to prevent unschedulable pods. During go-live, operators watched HPA events, desired replicas, Metrics Server output, Application Insights latency, and database connection counts. A rollback procedure restored the prior replica setting if scaling increased errors.

Results & Business Impact
  • Morning API availability stayed above 99.98 percent for six consecutive weeks
  • Overnight pod replica count dropped by 42 percent without missed readiness checks
  • Support engineers reduced autoscaling triage time from 40 minutes to 11 minutes
  • Database connections remained 18 percent below the approved safety threshold during peaks

Key Takeaway for Glossary Readers

HPA works best when pod scaling is designed together with readiness, node capacity, and downstream dependency limits.

Case study 03

AKS horizontal pod autoscaler in action

Scenario, objectives, solution, measured impact, and takeaway.

Scenario

CivicMap Services, a public-sector mapping provider, hosted geospatial tile APIs on AKS for emergency response agencies. Demand jumped during storms, and the operations team needed predictable scaling without granting broad production edit rights.

Business/Technical Objectives
  • Absorb storm-related map traffic without service degradation
  • Limit production scaling permissions to approved platform automation
  • Reduce monthly compute waste from fixed high replica counts
  • Capture autoscaling evidence for after-action reviews

Solution Using AKS horizontal pod autoscaler

The team deployed HPA on the tile API and routing service Deployments with CPU targets and maximum replicas based on load-test results. Application teams could update manifests through GitOps, but production replica changes were controlled by HPA. Platform engineers verified resource requests, probes, node-pool headroom, and cluster autoscaler settings before storm season. Dashboards combined HPA state, Kubernetes events, pod restart counts, ingress latency, and node utilization. After each incident window, the team exported HPA and Azure Monitor data to compare traffic, replica growth, and user-facing response time.

Results & Business Impact
  • The platform handled a 5.4x traffic surge with p95 response time under 480 milliseconds
  • Fixed high replica capacity was reduced, cutting monthly compute waste by 29 percent
  • No direct production replica edits were needed during three storm events
  • After-action reports included replica and latency evidence within one business day

Key Takeaway for Glossary Readers

AKS Horizontal Pod Autoscaler gives public-sector workloads elastic response while preserving governance over who can change production.

Why use Azure CLI for this?

Azure CLI and kubectl are useful for HPA because operators need to connect Azure-side AKS capacity with Kubernetes-side scaling state and prove whether pods, nodes, and metrics agree.

CLI use cases

  • Show AKS cluster and node-pool capacity before raising HPA maximum replicas.
  • Inspect HPA current and desired replicas during a traffic spike.
  • Confirm Metrics Server or custom metrics are available before relying on autoscaling.
  • Compare pending pods with cluster autoscaler behavior when HPA wants more replicas.
  • Export HPA, workload, and event output as evidence after a performance incident.

Before you run CLI

  • Confirm kubeconfig points to the correct AKS cluster and namespace.
  • Verify you have permission to view workloads, events, and HPA resources.
  • Know the target Deployment or workload name before changing autoscaler settings.
  • Check node-pool limits and quotas before increasing maximum replicas.
  • Understand whether scaling could overload databases, queues, or external dependencies.
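One way to turn the last two checks into a number is to size the HPA maximum from a downstream limit. The sketch below assumes each pod opens a fixed-size database connection pool and that you have an agreed connection budget; both figures and the function name `max_replicas_for_connections` are hypothetical inputs for illustration, not values from any AKS or database API.

```shell
# Hypothetical sizing helper: largest replica count whose pooled connections
# stay within a connection budget. All inputs are assumptions you supply.
# Usage: max_replicas_for_connections CONNECTION_BUDGET CONNECTIONS_PER_POD
max_replicas_for_connections() {
  budget=$1; per_pod=$2
  if [ "$per_pod" -le 0 ]; then
    echo "connections per pod must be positive" >&2
    return 1
  fi
  # Integer division rounds down, so the budget is never exceeded
  echo $(( budget / per_pod ))
}
```

With a budget of 400 connections and 20 connections per pod, `max_replicas_for_connections 400 20` prints 20, which would be an upper bound to compare against the HPA maximum before raising it.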

What output tells you

  • Current and desired replica counts show whether HPA is actively scaling.
  • Metric values show whether the observed signal is above or below the target.
  • Minimum and maximum bounds explain why scaling stopped at a specific number.
  • Events reveal missing metrics, failed calculations, or repeated scale attempts.
  • Pending pods indicate the cluster may need more node capacity, not just more replicas.
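The point about bounds can be checked mechanically. The sketch below assumes HPA data has been reduced to six whitespace-separated columns (namespace, name, min, max, current, desired) and prints the HPAs whose desired replica count has reached the maximum. The function name `hpa_at_max` and the column layout are choices made for this example.

```shell
# Sketch: read lines of "NAMESPACE NAME MIN MAX CURRENT DESIRED" on stdin
# and print each HPA whose desired replicas have reached its maximum.
# Rows with non-numeric values (for example "<none>") are skipped.
hpa_at_max() {
  awk '$4 ~ /^[0-9]+$/ && $6 ~ /^[0-9]+$/ && $6 + 0 >= $4 + 0 {
    print $1 "/" $2 " at max (" $4 ")"
  }'
}
```

Assuming kubectl access to the cluster, input in that layout can be produced with `kubectl get hpa -A --no-headers -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,MIN:.spec.minReplicas,MAX:.spec.maxReplicas,CURRENT:.status.currentReplicas,DESIRED:.status.desiredReplicas` and piped into `hpa_at_max`.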

Mapped Azure CLI commands

Inspect and operate AKS horizontal pod autoscaler

diagnostic
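# List every HPA in the cluster with its targets, bounds, and current replicas
kubectl get hpa -A
# Show metrics, conditions, and recent scaling events for one HPA
kubectl describe hpa <hpa-name> -n <namespace>
# Create an HPA (this changes the cluster): 60% CPU target, 2 to 10 replicas
kubectl autoscale deployment <deployment> -n <namespace> --cpu-percent=60 --min=2 --max=10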
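Teams that keep HPA settings in source control may prefer a declarative manifest over the imperative autoscale command. The sketch below prints an `autoscaling/v2` HorizontalPodAutoscaler manifest targeting a Deployment by CPU utilization. The function name `render_hpa` is hypothetical, and naming the HPA after the Deployment is a convention assumed here rather than a requirement.

```shell
# Sketch: print an autoscaling/v2 HPA manifest for a Deployment.
# Usage: render_hpa DEPLOYMENT NAMESPACE MIN MAX CPU_PERCENT
render_hpa() {
  cat <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: $1
  namespace: $2
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: $1
  minReplicas: $3
  maxReplicas: $4
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: $5
EOF
}
```

As a usage example with placeholder names, `render_hpa checkout shop 2 10 60 | kubectl apply -f -` would create or update an HPA equivalent to the autoscale command above; reviewing the printed manifest first is a reasonable precaution.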

Security

Security for HPA is indirect but important. Scaling replicas can increase the number of pods holding identities, mounting secrets, reaching databases, or calling downstream services. If a workload is compromised, more replicas can mean a larger blast radius or more outbound traffic. Teams should pair HPA with least-privilege managed identities, Kubernetes RBAC, network policies, secret controls, and resource limits. The metrics source also matters: only trusted metrics should influence scaling decisions. A manipulated metric or noisy dependency could trigger unnecessary scaling and hide a real operational or security issue. The evidence to retain is the metric target, current metric value, replica bounds, scaling events, and pending-pod state, because those details show where the scaling boundary sits and whether the resulting exposure matches policy.

Cost

HPA controls pod count, so it directly affects AKS cost even though billing is per node rather than per pod. More replicas consume CPU and memory, which can force larger node pools or trigger the cluster autoscaler to add nodes. Too few replicas can lower cost but increase latency or error rates. Too many replicas can waste capacity and create extra monitoring, logging, and downstream service costs. FinOps reviews should compare HPA limits, observed utilization, node scaling history, and business traffic patterns so elasticity saves money instead of hiding inefficient sizing. They should also tie the metric target, current metric value, replica bounds, scaling events, and pending-pod state to an owner, environment, expected utilization, and review date so spend stays explainable.
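The link between replicas and nodes can be estimated with simple arithmetic. The sketch below assumes CPU requests are the only scheduling constraint, which is a simplification: memory requests, system pods, DaemonSets, and per-node pod limits are ignored. The function name `nodes_needed` and the figures in the example are hypothetical.

```shell
# Sketch: estimate nodes required for a replica count, by CPU requests only.
# Usage: nodes_needed REPLICAS CPU_REQUEST_MILLICORES NODE_ALLOCATABLE_MILLICORES
nodes_needed() {
  replicas=$1; request=$2; allocatable=$3
  if [ "$request" -le 0 ]; then
    echo "CPU request must be positive" >&2
    return 1
  fi
  # Whole pods that fit on one node
  per_node=$(( allocatable / request ))
  if [ "$per_node" -lt 1 ]; then
    echo "one pod does not fit on a node" >&2
    return 1
  fi
  # Ceiling division: nodes required to place every replica
  echo $(( (replicas + per_node - 1) / per_node ))
}
```

For instance, with an assumed 500m CPU request and 3860m allocatable CPU per node, `nodes_needed 46 500 3860` prints 7, while `nodes_needed 8 500 3860` prints 2. Comparing such estimates with the node-pool maximum gives a rough sense of what an HPA maximum could cost at peak.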

Reliability

HPA improves reliability when it prevents demand spikes from overwhelming a fixed replica count. It can also create reliability problems if the scaling signal is wrong, the maximum is too low, or the cluster lacks node capacity. Reliable HPA design starts with realistic resource requests, readiness probes, max replica limits, and testing under load. HPA should be paired with cluster autoscaler when pod growth can exceed current node capacity. Operators should watch pending pods, scaling events, throttling, and downstream dependency limits so extra replicas do not simply move the bottleneck elsewhere. During incidents, the metric target, current metric value, replica bounds, scaling events, and pending-pod state help responders decide whether the issue is workload behavior, platform capacity, or a misconfigured release.
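Pending pods are the quickest sign that HPA wants replicas the cluster cannot place. The sketch below counts them from the default table output of `kubectl get pods -A --no-headers`, assuming the usual column order (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE). The function name `count_pending` is hypothetical.

```shell
# Sketch: count Pending pods from `kubectl get pods -A --no-headers` output
# read on stdin. Assumes STATUS is the fourth column, as in the default
# all-namespaces table layout.
count_pending() {
  awk '$4 == "Pending" { n++ } END { print n + 0 }'
}
```

Assuming kubectl access, `kubectl get pods -A --no-headers | count_pending` prints the count. A value that stays above zero while HPA reports a higher desired replica count suggests checking node-pool limits and cluster autoscaler events.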

Performance

HPA affects performance by distributing work across more pod replicas when demand rises. For well-designed stateless services, this can reduce queueing, protect latency, and improve throughput. For services with slow startup, missing readiness probes, poor connection pooling, or constrained downstream dependencies, adding replicas may not help quickly. Performance tuning should include target metric selection, stabilization windows, pod startup time, load test evidence, and maximum replica limits. HPA should be measured from the user experience outward: lower latency and fewer errors matter more than simply seeing replica count increase. Teams should compare performance before and after changing HPA settings, using the metric target, current metric value, replica bounds, scaling events, and pending-pod state to separate real bottlenecks from configuration assumptions.

Operations

Operationally, HPA must be inspected as part of every AKS workload review. Operators check the target workload, current replicas, desired replicas, metric value, threshold, min and max bounds, and recent scaling events. They should confirm Metrics Server or custom metrics are healthy and that resource requests exist for CPU-based scaling. Release runbooks should explain when to change thresholds, when to raise maximum replicas, and when not to scale because a downstream service is saturated. Good operations treat HPA as a tuned control loop, not a checkbox. The runbook should capture metric target, current metric value, replica bounds, scaling events, and pending-pod state, assign an owner, and define when to roll back, escalate, or accept a documented exception.

Common mistakes

  • Using HPA without CPU requests, which breaks CPU utilization calculations.
  • Setting maximum replicas too low for real peak demand.
  • Expecting HPA to fix downstream database or queue bottlenecks.
  • Forgetting that HPA scales pods while cluster autoscaler scales nodes.
  • Changing thresholds in production without load-test evidence or rollback criteria.