A rolling upgrade changes an AKS node pool gradually instead of replacing every node at once. AKS adds surge capacity when configured, cordons old nodes, drains workloads, waits for optional soak time, reimages or upgrades nodes, and repeats until the pool is finished. The goal is to keep applications running while the cluster moves to a new Kubernetes or node image version. It works best when workloads have enough replicas, useful health probes, and Pod Disruption Budgets that allow safe movement.
AKS rolling upgrade, node pool rolling upgrade, rolling node upgrade, AKS node pool upgrade, Kubernetes rolling upgrade
Difficulty: intermediate
CLI mappings: 6
Last verified: 2026-05-22
Microsoft Learn
Microsoft Learn defines a rolling upgrade strategy for AKS node pools as upgrading one node, or a small group of nodes, at a time so workloads remain available. AKS uses surge nodes, cordon and drain, optional soak time, reimage steps, and persistent node-pool upgrade settings.
In Azure architecture, rolling upgrade is part of AKS lifecycle operations on node pools, not application code deployment by itself. It touches the managed cluster control plane, agent pools, VM scale sets or virtual-machine pools, subnet IP capacity, quotas, Pod Disruption Budgets, autoscaler behavior, and workload scheduling. Operators configure max surge, drain timeout, node soak duration, and sometimes upgrade strategy through Azure CLI, ARM, or portal. The outcome appears in node-pool provisioning state, Kubernetes node versions, events, Activity Log, and workload availability metrics.
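As a rough illustration of how max surge shapes an upgrade, the sketch below converts a max surge setting into a surge node count and an approximate number of drain rounds. It assumes that percentage values round up to a whole node and that each round replaces as many nodes as the surge allows. The function names are hypothetical, and real batch behavior may differ.

```python
def surge_nodes(node_count: int, max_surge: str) -> int:
    """Translate a max surge setting ("33%" or "2") into whole surge nodes."""
    value = max_surge.strip()
    if value.endswith("%"):
        percent = int(value[:-1])
        # Assumption: percentages round up to the next whole node.
        return (node_count * percent + 99) // 100
    return int(value)


def approx_batches(node_count: int, max_surge: str) -> int:
    """Estimate how many drain-and-replace rounds the pool needs."""
    per_batch = max(surge_nodes(node_count, max_surge), 1)
    return -(-node_count // per_batch)  # ceiling division


if __name__ == "__main__":
    for nodes in (3, 10, 40):
        print(nodes, surge_nodes(nodes, "33%"), approx_batches(nodes, "33%"))
```

A larger surge shortens the upgrade but asks for more temporary capacity, which is why the setting is usually chosen together with quota and subnet checks.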
Why it matters
Rolling upgrades matter because Kubernetes infrastructure changes are unavoidable, while outages during upgrades are not acceptable. Security patches, supported Kubernetes versions, OS image updates, and platform fixes all require node movement. A thoughtful rolling upgrade gives teams a controlled way to absorb that change while preserving capacity and rollback options. A careless upgrade exposes hidden weaknesses: single-replica workloads, strict disruption budgets, exhausted subnet IPs, insufficient quota, slow drains, and apps that cannot restart cleanly. The term helps operators see upgrades as reliability exercises, not button clicks. A planned upgrade also surfaces fragile workloads in a controlled window before customers find them, which is most valuable ahead of release windows.
Where you see it
Signals, screens, and Azure surfaces where this term usually becomes operational.
Signal 01
In Azure CLI, during maintenance planning, rolling-upgrade settings appear in az aks nodepool show output through maxSurge, drain timeout, node soak duration, version, and provisioningState.
Signal 02
In Kubernetes events and node listings, operators notice cordoned nodes, draining pods, new node versions, pending workloads, and readiness changes during the upgrade, with scheduling pressure visible.
Signal 03
In Activity Log and monitoring workbooks, the upgrade appears during the maintenance window as agent-pool write operations, duration, failures, latency changes, pod restarts, and availability signals.
When this becomes relevant
Specific situations where this term helps solve real Azure design, operations, migration, security, reliability, cost, or governance problems.
Apply AKS node image security fixes while keeping customer-facing workloads available during a maintenance window.
Upgrade a user node pool separately after the control plane reaches the target supported Kubernetes version.
Tune max surge and soak duration for a latency-sensitive service that needs capacity during node replacement.
Find workloads with bad disruption budgets or single replicas before an upgrade turns them into an outage.
Collect command and Activity Log evidence proving a regulated cluster upgrade followed the approved plan.
Real-world case studies
Different enterprise-style examples that show the term being used to hit measurable objectives.
Case study 01
Online education platform upgrades AKS before exam week
Scenario, objectives, solution, measured impact, and takeaway.
📌Scenario
An online education platform needed to patch an AKS node pool before national exam week. The previous upgrade caused thirty minutes of login failures because every authentication pod restarted together.
🎯Business/Technical Objectives
Upgrade the user node pool before the exam traffic freeze.
Keep login availability above 99.95 percent during maintenance.
Validate Pod Disruption Budgets before draining nodes.
Record command evidence for the change board.
✅Solution Using Rolling upgrade
The platform team planned a rolling upgrade rather than replacing the pool all at once. They reviewed available AKS versions, upgraded the control plane first, and used Azure CLI to set max surge to 33 percent with a short node soak duration. Authentication deployments were scaled to three replicas across zones, and Pod Disruption Budgets were adjusted to allow one pod disruption while preserving service. During the window, engineers monitored node-pool provisioning state, pod restarts, pending pods, login latency, and Activity Log entries. The final report included CLI output before and after the upgrade.
📈Results & Business Impact
The node pool upgraded in 52 minutes with no full login outage.
Peak login latency rose only 7 percent during pod movement.
Pending pods stayed below five for the entire maintenance window.
Change-board evidence preparation fell from six hours to forty minutes.
💡Key Takeaway for Glossary Readers
A rolling upgrade succeeds when infrastructure settings and workload disruption rules are planned together.
Case study 02
Food delivery marketplace patches GPU routing service
Scenario, objectives, solution, measured impact, and takeaway.
📌Scenario
A food delivery marketplace ran a GPU-backed routing optimizer on AKS. A required node image update risked interrupting driver-sensitive workloads during dinner peak.
🎯Business/Technical Objectives
Apply the node image update outside peak traffic.
Avoid exhausting scarce regional GPU quota with excessive surge.
Detect scheduling failures before orders missed delivery promises.
Keep a rollback decision point after the first batch.
✅Solution Using Rolling upgrade
The SRE team split the upgrade plan by node pool and focused first on the smaller GPU pool. They used Azure CLI to inspect node versions, verify quota, and set a conservative max surge that the region could satisfy. A soak period after the first upgraded node let the team test CUDA initialization, routing latency, and pod readiness before continuing. Cluster autoscaler settings were frozen for the window to avoid confusing capacity signals. Activity Log, node-pool state, and application metrics were reviewed at each checkpoint before the next nodes drained.
📈Results & Business Impact
The GPU pool completed the rolling upgrade with zero missed routing jobs.
Dinner-hour routing latency stayed within the 120 millisecond objective.
Quota use peaked at one extra GPU node instead of the originally proposed four.
The first-batch soak caught a driver warning before it affected all pods.
💡Key Takeaway for Glossary Readers
Rolling upgrades let teams pace risky node changes around scarce capacity and workload-specific health checks.
Case study 03
Scenario, objectives, solution, measured impact, and takeaway.
📌Scenario
An industrial IoT vendor operated AKS clusters for factory telemetry in eight regions. Customers asked for proof that platform patches happened without violating data-collection availability.
🎯Business/Technical Objectives
Create one repeatable node-pool upgrade runbook for all regions.
Keep telemetry ingestion loss under one percent during maintenance.
Identify clusters with unsafe disruption settings before upgrades.
✅Solution Using Rolling upgrade
The platform engineering group created a CLI-driven rolling upgrade workflow. Each run started by checking available upgrades, node-pool state, max surge, subnet IP capacity, and PDB health. The workflow exported node-pool versions and Activity Log details into the customer evidence package. Workloads with single replicas or blocking disruption budgets were remediated before their clusters entered the maintenance queue. Regional upgrades were staggered so support engineers could observe ingestion metrics and pause the next region if pending pods or error rates crossed thresholds. A dry-run checklist was completed for every region one week before its scheduled window.
📈Results & Business Impact
All eight regional node pools reached the target version in three maintenance nights.
Telemetry ingestion loss stayed below 0.4 percent in every region.
Prechecks found eleven unsafe disruption budgets before they caused drain failures.
Customer evidence packs were generated automatically within an hour of each window.
💡Key Takeaway for Glossary Readers
Rolling upgrades become safer at scale when prechecks, telemetry, and evidence capture are standardized.
Why use Azure CLI for this?
I use Azure CLI for rolling upgrades because production AKS upgrades demand exact, repeatable settings. The portal can start an upgrade, but CLI lets me check available versions, inspect every node pool, set max surge or soak settings, run the upgrade with --no-wait, and capture Activity Log evidence. After ten years of Azure operations, I want command output in the change record before and after the maintenance window. CLI also surfaces quota, version, and provisioning-state clues quickly, which helps me pause when a system pool, subnet, or disruption budget could turn a routine upgrade into an outage. The same commands can be reused across regions without reinterpreting portal screens.
CLI use cases
Check available Kubernetes and node image upgrades before scheduling the node-pool maintenance window.
Set max surge, drain timeout, and node soak duration consistently across comparable production node pools.
Start, monitor, and document a node-pool upgrade while comparing provisioning state, version, and Activity Log output.
Before you run CLI
Confirm tenant, subscription, resource group, cluster, node pool, region, Kubernetes version, control-plane version, and output format.
Check Microsoft.ContainerService permissions, provider registration, quota, subnet IP headroom, autoscaler settings, and maintenance window approval.
Review destructive risk to workloads, surge cost, Pod Disruption Budgets, identity behavior, zone placement, and rollback or support plan.
What output tells you
Upgrade-profile output shows which Kubernetes versions are available and whether the node pool can move to the target version.
Activity Log and metric output shows when the operation started, whether it failed, and whether user latency or pod availability changed.
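To make the output reading concrete, here is a minimal sketch that pulls the upgrade-relevant fields out of az aks nodepool show JSON. The sample document is hypothetical and trimmed, and the field names (upgradeSettings.maxSurge, drainTimeoutInMinutes, nodeSoakDurationInMinutes, currentOrchestratorVersion) are assumptions about the output shape that should be checked against real output for your CLI version.

```python
import json

# Hypothetical, trimmed sample of `az aks nodepool show` JSON output.
SAMPLE_OUTPUT = """
{
  "name": "userpool",
  "provisioningState": "Succeeded",
  "orchestratorVersion": "1.30",
  "currentOrchestratorVersion": "1.30.3",
  "upgradeSettings": {
    "maxSurge": "33%",
    "drainTimeoutInMinutes": 30,
    "nodeSoakDurationInMinutes": 5
  }
}
"""


def summarize_pool(raw_json: str) -> dict:
    """Extract the fields an operator compares before and after an upgrade."""
    pool = json.loads(raw_json)
    settings = pool.get("upgradeSettings") or {}
    return {
        "name": pool.get("name"),
        "state": pool.get("provisioningState"),
        # Prefer the version the nodes actually run over the requested one.
        "version": pool.get("currentOrchestratorVersion") or pool.get("orchestratorVersion"),
        "max_surge": settings.get("maxSurge"),
        "drain_timeout_min": settings.get("drainTimeoutInMinutes"),
        "soak_min": settings.get("nodeSoakDurationInMinutes"),
    }


if __name__ == "__main__":
    print(summarize_pool(SAMPLE_OUTPUT))
```

Saving one such summary before the window and one after gives a compact diff for the change record.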
Mapped Azure CLI commands
AKS rolling upgrade CLI commands
direct-management
az aks get-upgrades --name <cluster-name> --resource-group <resource-group>
az aks | discover | DevOps
az aks nodepool list --cluster-name <cluster-name> --resource-group <resource-group> --output table
az aks nodepool | discover | Containers
az aks nodepool show --cluster-name <cluster-name> --resource-group <resource-group> --name <node-pool>
az aks nodepool | discover | DevOps
az aks nodepool update --cluster-name <cluster-name> --resource-group <resource-group> --name <node-pool> --max-surge 33% --drain-timeout 30 --node-soak-duration 5
az aks nodepool | configure | DevOps
az aks nodepool upgrade --cluster-name <cluster-name> --resource-group <resource-group> --name <node-pool> --kubernetes-version <version> --no-wait
az aks nodepool | operate | DevOps
az monitor activity-log list --resource-group <resource-group> --start-time <utc-start> --end-time <utc-end>
az monitor activity-log | discover | DevOps
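The mapped commands can be assembled into one ordered runbook so every region runs the same sequence. The sketch below only builds argument lists and executes nothing. The function name and the sample argument values are hypothetical, while the commands and flags mirror the mappings listed above.

```python
import shlex


def build_runbook(cluster, resource_group, node_pool, version, start_utc, end_utc):
    """Return the mapped commands in run order as argv lists (nothing is executed)."""
    pool_scope = ["--cluster-name", cluster, "--resource-group", resource_group]
    one_pool = pool_scope + ["--name", node_pool]
    return [
        ["az", "aks", "get-upgrades", "--name", cluster, "--resource-group", resource_group],
        ["az", "aks", "nodepool", "list", *pool_scope, "--output", "table"],
        ["az", "aks", "nodepool", "show", *one_pool],
        ["az", "aks", "nodepool", "update", *one_pool,
         "--max-surge", "33%", "--drain-timeout", "30", "--node-soak-duration", "5"],
        ["az", "aks", "nodepool", "upgrade", *one_pool,
         "--kubernetes-version", version, "--no-wait"],
        ["az", "monitor", "activity-log", "list", "--resource-group", resource_group,
         "--start-time", start_utc, "--end-time", end_utc],
    ]


if __name__ == "__main__":
    # All argument values here are placeholders for illustration.
    for argv in build_runbook("demo-cluster", "demo-rg", "userpool", "1.30.3",
                              "2026-05-22T01:00:00Z", "2026-05-22T03:00:00Z"):
        print(shlex.join(argv))
```

Keeping the commands as lists rather than one shell string avoids quoting mistakes if they are later handed to a process runner.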
Architecture context
Architecturally, a rolling upgrade is where AKS platform maintenance meets application resilience. The cluster may be healthy, but the upgrade will prove whether workloads tolerate eviction, rescheduling, and node replacement. I design node pools so system workloads, user workloads, availability zones, autoscaling, and subnet IP capacity are known before upgrades. Production pools need max surge capacity and quotas; critical workloads need multiple replicas and Pod Disruption Budgets that are strict enough to protect users but not so strict that nodes cannot drain. The maintenance plan should define order, observation time, rollback options, and communication triggers. Upgrade design should be reviewed whenever workload criticality or node-pool topology changes.
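The balance between "strict enough" and "too strict" can be checked with simple arithmetic. This is a minimal sketch, assuming a minAvailable-style budget with an integer value and all replicas currently healthy. The function names and labels are hypothetical.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Voluntary evictions a minAvailable-style budget permits right now."""
    return max(healthy_pods - min_available, 0)


def drain_risk(replicas: int, min_available: int) -> str:
    """Classify a workload's budget before nodes start draining."""
    if allowed_disruptions(replicas, min_available) == 0:
        return "blocks drain"      # no pod may be evicted, so the node cannot empty
    if min_available == 0:
        return "no protection"     # every pod may be evicted at once
    return "ok"


if __name__ == "__main__":
    for replicas, min_avail in ((1, 1), (3, 2), (3, 0)):
        print(replicas, min_avail, drain_risk(replicas, min_avail))
```

A single replica with minAvailable of 1 is the classic case that stalls a drain until the timeout expires.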
Security
Security impact is indirect but important. Rolling upgrades are how many node image fixes, Kubernetes patch versions, and platform security updates reach running workloads. Delaying them can leave nodes on vulnerable or unsupported versions. The risk is that upgrade permissions are powerful: users with agentPools write access can affect production capacity, workload placement, and node configuration. Secure operations require least-privilege access, approved maintenance windows, Activity Log monitoring, and no hidden kubeconfig sharing. Also check that workloads rescheduled during upgrades keep using the expected managed identities, secrets, network policies, and admission controls. Upgrade evidence should be retained with the security exception or patch record.
Cost
Rolling upgrades can increase cost temporarily because max surge adds extra nodes during the operation. The bill impact is usually short-lived, but it matters for large node pools, GPU pools, premium disks, and constrained quotas. Longer soak times and slow drains extend the period where extra capacity exists. Failed upgrades also create operational cost through emergency labor and delayed security maintenance. FinOps review should include node count, VM size, max surge percentage, regional quota, reserved capacity assumptions, and whether unnecessary idle pools can be scaled down before maintenance. Do not reduce surge blindly if availability is business-critical. Budget owners should know the temporary surge profile before approving the window.
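A quick estimate of the temporary surge bill can support that conversation. The sketch below assumes the surge nodes exist for the whole window, which makes it an upper-bound style figure. The hourly price is a hypothetical input, not a published rate.

```python
def surge_cost(surge_node_count: int, hourly_node_price: float, window_minutes: float) -> float:
    """Estimate the extra compute spend while surge nodes are running."""
    return round(surge_node_count * hourly_node_price * window_minutes / 60, 2)


if __name__ == "__main__":
    # Four surge nodes at a hypothetical 0.50 per hour for a 90-minute window.
    print(surge_cost(4, 0.50, 90))
```

Running the same estimate with a longer window shows how slow drains and long soak times raise the total.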
Reliability
Reliability impact is direct. Rolling upgrades are designed to keep the node pool available by upgrading a few nodes at a time, but they only work when the surrounding architecture allows it. Problems appear when surge quota is missing, Azure CNI subnets lack IPs, Pod Disruption Budgets block drains, replicas are too low, or workloads take too long to become ready. Operators should validate control-plane version rules, node-pool state, max surge, drain timeout, soak duration, and recent cluster events. A rollback plan matters because new node versions can expose workload or driver incompatibilities. The first batch should be treated as a canary for the rest of the pool.
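Two of those failure modes, missing surge quota and exhausted subnet IPs, can be prechecked with arithmetic. This sketch assumes an Azure CNI configuration in which each node reserves one IP for itself plus one per potential pod; other networking modes allocate addresses differently. All parameter names are hypothetical.

```python
def surge_headroom_problems(surge_node_count, vcpus_per_node, free_vcpu_quota,
                            max_pods_per_node, free_subnet_ips):
    """Return a list of capacity shortfalls that would block surge nodes."""
    problems = []
    needed_vcpus = surge_node_count * vcpus_per_node
    if needed_vcpus > free_vcpu_quota:
        problems.append(f"vCPU quota short by {needed_vcpus - free_vcpu_quota}")
    # Assumption: each node reserves one IP for itself plus one per potential pod.
    needed_ips = surge_node_count * (max_pods_per_node + 1)
    if needed_ips > free_subnet_ips:
        problems.append(f"subnet IPs short by {needed_ips - free_subnet_ips}")
    return problems


if __name__ == "__main__":
    print(surge_headroom_problems(4, 8, 16, 30, 200))
```

An empty list means the surge should fit; any entry is a reason to lower max surge or free capacity before the window.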
Performance
Performance impact appears during workload movement. A good rolling upgrade preserves throughput by keeping enough nodes available while pods drain and restart on fresh nodes. A weak plan can create pending pods, cold starts, cache loss, throttling, or overloaded remaining nodes. Max surge improves capacity but needs quota and subnet IPs. Soak duration gives workloads time to stabilize before the next batch. Measure node readiness time, pod startup time, pending pods, CPU and memory pressure, request latency, error rate, and autoscaler behavior. Performance validation should continue after the last node reports upgraded. Do not close the window until caches, warm paths, and autoscaler behavior have settled.
Operations
Operators run rolling upgrades as planned maintenance with prechecks, command evidence, live monitoring, and postchecks. They inspect available upgrades, node-pool versions, provisioning state, max surge, subnet IP headroom, quota, autoscaler settings, and workload disruption rules. During the upgrade they watch node readiness, pod restarts, pending pods, error rates, and Activity Log. After completion they confirm all pools reached Succeeded, nodes run the expected version, and application SLOs stayed within target. Runbooks should include who may start upgrades, when to pause, how to communicate impact, and how rollback is handled. A good operator also records the decision points used to continue, pause, or roll back.
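Decision points are easier to record when the rule is written down in advance. The sketch below is one possible checkpoint rule, not a prescribed policy. The thresholds and return labels are hypothetical and should come from the team's own runbook.

```python
def checkpoint_decision(provisioning_state: str, pending_pods: int, error_rate: float,
                        max_pending: int = 5, max_error_rate: float = 0.01) -> str:
    """Suggest the next step at an upgrade checkpoint from a few live signals."""
    if provisioning_state == "Failed":
        return "roll back or escalate"
    if pending_pods > max_pending or error_rate > max_error_rate:
        return "pause"
    return "continue"


if __name__ == "__main__":
    print(checkpoint_decision("Upgrading", pending_pods=2, error_rate=0.002))
    print(checkpoint_decision("Upgrading", pending_pods=9, error_rate=0.002))
```

Logging the inputs alongside each returned decision gives the change record the evidence trail described above.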
Common mistakes
Starting a node-pool upgrade before upgrading the control plane or checking version-skew rules.
Forgetting that max surge needs extra compute quota and, with Azure CNI, available subnet IP addresses.
Ignoring Pod Disruption Budgets, single-replica apps, daemon sets, or slow readiness probes until nodes refuse to drain.