
Rolling upgrade

A rolling upgrade changes an AKS node pool gradually instead of replacing every node at once. AKS adds surge capacity when configured, cordons old nodes, drains workloads, waits for optional soak time, reimages or upgrades nodes, and repeats until the pool is finished. The goal is to keep applications running while the cluster moves to a new Kubernetes or node image version. It works best when workloads have enough replicas, useful health probes, and Pod Disruption Budgets that allow safe movement.
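The batch-by-batch behavior described above can be sketched as a small planning model. This is a simplified illustration, not how AKS schedules work internally. The function names are hypothetical, and the rounding rule assumes the documented AKS behavior that a percentage max surge is taken against the current node count and fractional nodes are rounded up.

```python
def surge_count(node_count: int, max_surge: str) -> int:
    """Translate a max surge setting such as "33%" or "2" into a node count."""
    if max_surge.endswith("%"):
        percent = int(max_surge[:-1])
        # Integer ceiling division: fractional nodes are rounded up.
        return (node_count * percent + 99) // 100
    return int(max_surge)


def plan_batches(nodes: list[str], max_surge: str) -> list[list[str]]:
    """Group nodes into the batches a rolling upgrade would walk through.

    Assumption: one batch equals the surge count. If the setting resolves
    to zero, this model falls back to one node at a time.
    """
    size = max(1, surge_count(len(nodes), max_surge))
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]


if __name__ == "__main__":
    pool = [f"node-{n}" for n in range(1, 6)]  # hypothetical node names
    for number, batch in enumerate(plan_batches(pool, "33%"), start=1):
        # Each batch would be cordoned, drained, upgraded, then soaked.
        print(f"batch {number}: {batch}")
```

With five nodes and 33% max surge, the model produces batches of two, two, and one node, which is a quick way to estimate how many cordon-and-drain rounds a window must cover.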


Aliases: AKS rolling upgrade, node pool rolling upgrade, rolling node upgrade, AKS node pool upgrade, Kubernetes rolling upgrade
Difficulty: intermediate
CLI mappings: 6
Last verified: 2026-05-22


Microsoft Learn

Microsoft Learn defines a rolling upgrade strategy for AKS node pools as upgrading one node, or a small group of nodes, at a time so workloads remain available. AKS uses surge nodes, cordon and drain, optional soak time, reimage steps, and persistent node-pool upgrade settings.

Microsoft Learn: Configure rolling upgrades for Azure Kubernetes Service node pools, 2026-05-22

Technical context

In Azure architecture, a rolling upgrade is part of AKS lifecycle operations on node pools; it is not an application code deployment. It touches the managed cluster control plane, agent pools, VM scale sets or virtual-machine pools, subnet IP capacity, quotas, Pod Disruption Budgets, autoscaler behavior, and workload scheduling. Operators configure max surge, drain timeout, node soak duration, and sometimes the upgrade strategy through Azure CLI, ARM, or the portal. The outcome appears in node-pool provisioning state, Kubernetes node versions, events, Activity Log, and workload availability metrics.

Why it matters

Rolling upgrades matter because Kubernetes infrastructure changes are unavoidable, but outages during upgrades are not acceptable. Security patches, supported Kubernetes versions, OS image updates, and platform fixes all require node movement. A thoughtful rolling upgrade gives teams a controlled way to absorb that change while preserving capacity and rollback options. A careless upgrade exposes hidden weaknesses: single-replica workloads, strict disruption budgets, exhausted subnet IPs, insufficient quota, slow drains, and apps that cannot restart cleanly. Treating upgrades as reliability exercises rather than button clicks lets operators find fragile workloads before customers do. That preparation matters most ahead of release windows.

Where you see it

Signals, screens, and Azure surfaces where this term usually becomes operational.

Signal 01

During maintenance planning, rolling-upgrade settings appear in az aks nodepool show output through maxSurge, drain timeout, node soak duration, version, and provisioningState.

Signal 02

In Kubernetes events and node listings, operators notice cordoned nodes, draining pods, new node versions, pending workloads, and readiness changes during the upgrade, all of which make scheduling pressure visible.

Signal 03

During the maintenance window, the upgrade appears in Activity Log and monitoring workbooks as agent-pool write operations, duration, failures, latency changes, pod restarts, and availability signals.

When this becomes relevant

Specific situations where this term helps solve real Azure design, operations, migration, security, reliability, cost, or governance problems.

Apply AKS node image security fixes while keeping customer-facing workloads available during a maintenance window.
Upgrade a user node pool separately after the control plane reaches the target supported Kubernetes version.
Tune max surge and soak duration for a latency-sensitive service that needs capacity during node replacement.
Find workloads with bad disruption budgets or single replicas before an upgrade turns them into an outage.
Collect command and Activity Log evidence proving a regulated cluster upgrade followed the approved plan.

Real-world case studies

Different enterprise-style examples that show the term being used to hit measurable objectives.

Case study 01

Online education platform upgrades AKS before exam week

Scenario, objectives, solution, measured impact, and takeaway.

Scenario

An online education platform needed to patch an AKS node pool before national exam week. The previous upgrade caused thirty minutes of login failures because every authentication pod restarted together.

Business/Technical Objectives

Upgrade the user node pool before the exam traffic freeze.
Keep login availability above 99.95 percent during maintenance.
Validate Pod Disruption Budgets before draining nodes.
Record command evidence for the change board.

Solution Using Rolling upgrade

The platform team planned a rolling upgrade rather than replacing the pool all at once. They reviewed available AKS versions, upgraded the control plane first, and used Azure CLI to set max surge to 33 percent with a short node soak duration. Authentication deployments were scaled to three replicas across zones, and Pod Disruption Budgets were adjusted to allow one pod disruption while preserving service. During the window, engineers monitored node-pool provisioning state, pod restarts, pending pods, login latency, and Activity Log entries. The final report included CLI output before and after the upgrade.

Results & Business Impact

The node pool upgraded in 52 minutes with no full login outage.
Peak login latency rose only 7 percent during pod movement.
Pending pods stayed below five for the entire maintenance window.
Change-board evidence preparation fell from six hours to forty minutes.

Key Takeaway for Glossary Readers

A rolling upgrade succeeds when infrastructure settings and workload disruption rules are planned together.

Case study 02

Food delivery marketplace patches GPU routing service


Scenario

A food delivery marketplace ran a GPU-backed routing optimizer on AKS. A required node image update risked interrupting driver-sensitive workloads during dinner peak.

Business/Technical Objectives

Apply the node image update outside peak traffic.
Avoid exhausting scarce regional GPU quota with excessive surge.
Detect scheduling failures before orders missed delivery promises.
Keep a rollback decision point after the first batch.

Solution Using Rolling upgrade

The SRE team split the upgrade plan by node pool and focused first on the smaller GPU pool. They used Azure CLI to inspect node versions, verify quota, and set a conservative max surge that the region could satisfy. A soak period after the first upgraded node let the team test CUDA initialization, routing latency, and pod readiness before continuing. Cluster autoscaler settings were frozen for the window to avoid confusing capacity signals. Activity Log, node-pool state, and application metrics were reviewed at each checkpoint before the next nodes drained.

Results & Business Impact

The GPU pool completed the rolling upgrade with zero missed routing jobs.
Dinner-hour routing latency stayed within the 120 millisecond objective.
Quota use peaked at one extra GPU node instead of the originally proposed four.
The first-batch soak caught a driver warning before it affected all pods.

Key Takeaway for Glossary Readers

Rolling upgrades let teams pace risky node changes around scarce capacity and workload-specific health checks.

Case study 03

Industrial IoT vendor standardizes cluster maintenance evidence


Scenario

An industrial IoT vendor operated AKS clusters for factory telemetry in eight regions. Customers asked for proof that platform patches happened without violating data-collection availability.

Business/Technical Objectives

Create one repeatable node-pool upgrade runbook for all regions.
Keep telemetry ingestion loss under one percent during maintenance.
Capture before-and-after node versions automatically.
Identify clusters with unsafe disruption settings before upgrades.

Solution Using Rolling upgrade

The platform engineering group created a CLI-driven rolling upgrade workflow. Each run started by checking available upgrades, node-pool state, max surge, subnet IP capacity, and PDB health. The workflow exported node-pool versions and Activity Log details into the customer evidence package. Workloads with single replicas or blocking disruption budgets were remediated before their clusters entered the maintenance queue. Regional upgrades were staggered so support engineers could observe ingestion metrics and pause the next region if pending pods or error rates crossed thresholds. A dry-run checklist was completed for every region one week before its scheduled window.

Results & Business Impact

All eight regional node pools reached the target version in three maintenance nights.
Telemetry ingestion loss stayed below 0.4 percent in every region.
Prechecks found eleven unsafe disruption budgets before they caused drain failures.
Customer evidence packs were generated automatically within an hour of each window.

Key Takeaway for Glossary Readers

Rolling upgrades become safer at scale when prechecks, telemetry, and evidence capture are standardized.

Why use Azure CLI for this?

I use Azure CLI for rolling upgrades because production AKS upgrades demand exact, repeatable settings. The portal can start an upgrade, but CLI lets me check available versions, inspect every node pool, set max surge or soak settings, run the upgrade with --no-wait, and capture Activity Log evidence. After ten years of Azure operations, I want command output in the change record before and after the maintenance window. CLI also exposes quota, version, and provisioning-state clues quickly, which helps me pause when a system pool, subnet, or disruption budget could turn a routine upgrade into an outage. The same commands can be reused across regions without reinterpreting portal screens.

CLI use cases

Check available Kubernetes and node image upgrades before scheduling the node-pool maintenance window.
Set max surge, drain timeout, and node soak duration consistently across comparable production node pools.
Start, monitor, and document a node-pool upgrade while comparing provisioning state, version, and Activity Log output.
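One way to keep settings consistent across comparable pools is to generate the command line from a single function. The sketch below only builds and prints argument lists; it does not call Azure. The helper name, resource group, cluster, and pool names are hypothetical, while the flags are the ones used in the mapped `az aks nodepool update` command on this page.

```python
import shlex


def nodepool_update_args(resource_group, cluster, pool,
                         max_surge="33%", drain_timeout=30, soak_minutes=5):
    """Build the argument list for one `az aks nodepool update` call."""
    return [
        "az", "aks", "nodepool", "update",
        "--resource-group", resource_group,
        "--cluster-name", cluster,
        "--name", pool,
        "--max-surge", max_surge,
        "--drain-timeout", str(drain_timeout),        # minutes
        "--node-soak-duration", str(soak_minutes),    # minutes
    ]


if __name__ == "__main__":
    # Hypothetical pools that should share the same upgrade settings.
    for pool in ["userpool1", "userpool2"]:
        args = nodepool_update_args("rg-prod", "aks-prod", pool)
        # Print for review; an operator could pass `args` to
        # subprocess.run(args, check=True) once the plan is approved.
        print(shlex.join(args))
```

Printing the commands first gives the change record an exact copy of what will run, and passing a list rather than a single string avoids shell-quoting surprises with values such as `33%`.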

Before you run CLI

Confirm tenant, subscription, resource group, cluster, node pool, region, Kubernetes version, control-plane version, and output format.
Check Microsoft.ContainerService permissions, provider registration, quota, subnet IP headroom, autoscaler settings, and maintenance window approval.
Review destructive risk to workloads, surge cost, Pod Disruption Budgets, identity behavior, zone placement, and rollback or support plan.

What output tells you

Upgrade-profile output shows which Kubernetes versions are available and whether the node pool can move to the target version.
Node-pool output shows count, vmSize, orchestratorVersion, nodeImageVersion, maxSurge, provisioningState, mode, zones, and upgrade settings.
Activity and metric output tells when the operation started, whether it failed, and whether user latency or pod availability changed.
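As a rough illustration of reading node-pool output, the sketch below checks a trimmed JSON document shaped like `az aks nodepool show` output. The sample values are invented, the helper name is hypothetical, and the `upgradeSettings` field names are assumed to follow the agent pool resource model, so confirm them against real output before relying on them.

```python
import json

# Invented, trimmed sample; real output contains many more fields.
SAMPLE = """
{
  "name": "userpool",
  "count": 6,
  "mode": "User",
  "orchestratorVersion": "1.29.4",
  "provisioningState": "Succeeded",
  "upgradeSettings": {
    "maxSurge": "33%",
    "drainTimeoutInMinutes": 30,
    "nodeSoakDurationInMinutes": 5
  }
}
"""


def check_pool(raw: str, target_version: str) -> list[str]:
    """Return a list of findings; an empty list means nothing was flagged."""
    pool = json.loads(raw)
    findings = []
    state = pool.get("provisioningState")
    if state != "Succeeded":
        findings.append(f"provisioningState is {state}")
    version = pool.get("orchestratorVersion")
    if version != target_version:
        findings.append(f"version is {version}, expected {target_version}")
    settings = pool.get("upgradeSettings") or {}
    if not settings.get("maxSurge"):
        findings.append("maxSurge is not set explicitly")
    return findings


if __name__ == "__main__":
    print(check_pool(SAMPLE, "1.29.4") or "no findings")
```

The same function can run before the window (to confirm settings) and after it (to confirm the pool reached the target version), which keeps the before-and-after evidence comparable.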

Mapped Azure CLI commands

AKS rolling upgrade CLI commands

direct-management

az aks get-upgrades --name <cluster-name> --resource-group <resource-group>

az aks | discover | DevOps

az aks nodepool list --cluster-name <cluster-name> --resource-group <resource-group> --output table

az aks nodepool | discover | Containers

az aks nodepool show --cluster-name <cluster-name> --resource-group <resource-group> --name <node-pool>

az aks nodepool | discover | DevOps

az aks nodepool update --cluster-name <cluster-name> --resource-group <resource-group> --name <node-pool> --max-surge 33% --drain-timeout 30 --node-soak-duration 5

az aks nodepool | configure | DevOps

az aks nodepool upgrade --cluster-name <cluster-name> --resource-group <resource-group> --name <node-pool> --kubernetes-version <version> --no-wait

az aks nodepool | operate | DevOps

az monitor activity-log list --resource-group <resource-group> --start-time <utc-start> --end-time <utc-end>

az monitor activity-log | discover | DevOps

Architecture context

Architecturally, a rolling upgrade is where AKS platform maintenance meets application resilience. The cluster may be healthy, but the upgrade will prove whether workloads tolerate eviction, rescheduling, and node replacement. I design node pools so system workloads, user workloads, availability zones, autoscaling, and subnet IP capacity are known before upgrades. Production pools need max surge capacity and quotas; critical workloads need multiple replicas and Pod Disruption Budgets that are strict enough to protect users but not so strict that nodes cannot drain. The maintenance plan should define order, observation time, rollback options, and communication triggers. Upgrade design should be reviewed whenever workload criticality or node-pool topology changes.

Security

Security impact is indirect but important. Rolling upgrades are how many node image fixes, Kubernetes patch versions, and platform security updates reach running workloads. Delaying them can leave nodes on vulnerable or unsupported versions. The risk is that upgrade permissions are powerful: users with agentPools write access can affect production capacity, workload placement, and node configuration. Secure operations require least-privilege access, approved maintenance windows, Activity Log monitoring, and no hidden kubeconfig sharing. Also check that workloads rescheduled during upgrades keep using the expected managed identities, secrets, network policies, and admission controls. Upgrade evidence should be retained with the security exception or patch record.

Cost

Rolling upgrades can increase cost temporarily because max surge adds extra nodes during the operation. The bill impact is usually short-lived, but it matters for large node pools, GPU pools, premium disks, and constrained quotas. Longer soak times and slow drains extend the period where extra capacity exists. Failed upgrades also create operational cost through emergency labor and delayed security maintenance. FinOps review should include node count, VM size, max surge percentage, regional quota, reserved capacity assumptions, and whether unnecessary idle pools can be scaled down before maintenance. Do not reduce surge blindly if availability is business-critical. Budget owners should know the temporary surge profile before approving the window.
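The relationship between surge, soak time, and temporary spend can be approximated with simple arithmetic. This is a back-of-the-envelope sketch. The function name is hypothetical, every input is an assumption supplied by the reader, and the hourly rate shown is a placeholder rather than a real Azure price.

```python
def surge_cost_estimate(surge_nodes: int, batches: int,
                        minutes_per_batch: float, soak_minutes: float,
                        hourly_rate: float) -> float:
    """Rough extra spend while surge nodes exist.

    Assumes the surge nodes run for the whole upgrade and that every
    batch takes the same time plus the same soak period.
    """
    upgrade_hours = batches * (minutes_per_batch + soak_minutes) / 60
    return round(surge_nodes * upgrade_hours * hourly_rate, 2)


if __name__ == "__main__":
    # 4 surge nodes, 3 batches, 15 minutes per batch, placeholder rate 0.50/hour.
    print(surge_cost_estimate(4, 3, 15, 5, 0.50))    # short soak
    print(surge_cost_estimate(4, 3, 15, 15, 0.50))   # longer soak
```

In this model, tripling the soak time from 5 to 15 minutes raises the estimate from 2.0 to 3.0, which shows why soak duration belongs in the FinOps review alongside the surge percentage.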

Reliability

Reliability impact is direct. Rolling upgrades are designed to keep the node pool available by upgrading a few nodes at a time, but they only work when the surrounding architecture allows it. Problems appear when surge quota is missing, Azure CNI subnets lack IPs, Pod Disruption Budgets block drains, replicas are too low, or workloads take too long to become ready. Operators should validate control-plane version rules, node-pool state, max surge, drain timeout, soak duration, and recent cluster events. A rollback plan matters because new node versions can expose workload or driver incompatibilities. The first batch should be treated as a canary for the rest of the pool.
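The subnet IP concern can be checked with a short calculation before the window. The sketch assumes the traditional Azure CNI model, in which each node takes one address for itself and pre-reserves one per potential pod; overlay or dynamic IP allocation modes behave differently. The function names and the example numbers are hypothetical.

```python
def surge_ip_demand(surge_nodes: int, max_pods: int) -> int:
    """Addresses the surge nodes would claim: one per node plus one per pod slot."""
    return surge_nodes * (max_pods + 1)


def has_ip_headroom(free_ips: int, surge_nodes: int, max_pods: int) -> bool:
    """True when the subnet can absorb the surge nodes without running out."""
    return free_ips >= surge_ip_demand(surge_nodes, max_pods)


if __name__ == "__main__":
    # Hypothetical pool: 4 surge nodes, 30 max pods per node, 100 free addresses.
    needed = surge_ip_demand(4, 30)
    print(f"surge needs {needed} addresses, headroom ok: {has_ip_headroom(100, 4, 30)}")
```

In this example the surge needs 124 addresses against 100 free, so the operator would either lower max surge or free subnet capacity before starting.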

Performance

Performance impact appears during workload movement. A good rolling upgrade preserves throughput by keeping enough nodes available while pods drain and restart on fresh nodes. A weak plan can create pending pods, cold starts, cache loss, throttling, or overloaded remaining nodes. Max surge improves capacity but needs quota and subnet IPs. Soak duration gives workloads time to stabilize before the next batch. Measure node readiness time, pod startup time, pending pods, CPU and memory pressure, request latency, error rate, and autoscaler behavior. Performance validation should continue after the last node reports upgraded. Do not close the window until caches, warm paths, and autoscaler behavior have settled.

Operations

Operators run rolling upgrades as planned maintenance with prechecks, command evidence, live monitoring, and postchecks. They inspect available upgrades, node-pool versions, provisioning state, max surge, subnet IP headroom, quota, autoscaler settings, and workload disruption rules. During the upgrade they watch node readiness, pod restarts, pending pods, error rates, and Activity Log. After completion they confirm all pools reached Succeeded, nodes run the expected version, and application SLOs stayed within target. Runbooks should include who may start upgrades, when to pause, how to communicate impact, and how rollback is handled. A good operator also records the decision points used to continue, pause, or roll back.

Common mistakes

Starting a node-pool upgrade before upgrading the control plane or checking version-skew rules.
Forgetting that max surge needs extra compute quota and, with Azure CNI, available subnet IP addresses.
Ignoring Pod Disruption Budgets, single-replica apps, daemon sets, or slow readiness probes until nodes refuse to drain.
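Whether a Pod Disruption Budget will block a drain comes down to how many voluntary disruptions it currently allows. The sketch below mirrors the basic Kubernetes calculation (healthy pods minus the number that must stay healthy) for integer values only. Percentage budgets and unhealthy-pod eviction policies are not modeled, and the function name is hypothetical.

```python
from typing import Optional


def allowed_disruptions(expected_pods: int, healthy_pods: int,
                        min_available: Optional[int] = None,
                        max_unavailable: Optional[int] = None) -> int:
    """Return how many pods could be evicted right now under the budget."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = min_available
    else:
        desired_healthy = expected_pods - max_unavailable
    # A negative result means the budget is already violated; report zero.
    return max(0, healthy_pods - desired_healthy)


if __name__ == "__main__":
    # Single replica with minAvailable 1: the node cannot drain this pod.
    print(allowed_disruptions(1, 1, min_available=1))
    # Three replicas with maxUnavailable 1: one pod can move at a time.
    print(allowed_disruptions(3, 3, max_unavailable=1))
```

A result of zero for any workload on the pool is the signal to add replicas or relax the budget before the upgrade starts, rather than discovering it when a node refuses to drain.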

Operator quick checks

Run upgrade discovery and confirm the target node-pool version is supported by the current control plane.
Check max surge, quota, subnet IP capacity, node count, PDBs, pending pods, and recent cluster events before starting.
After completion, verify node versions, node-pool Succeeded state, workload replicas, pod restarts, latency, and error rate.
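The first quick check can be partly automated. The sketch below covers only one rule, that a node pool cannot run a newer Kubernetes version than its control plane. AKS also limits how far behind a node pool may fall, which is not modeled here. The function names are hypothetical, and version strings are assumed to be plain `major.minor.patch` values.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn "1.29.4" into (1, 29, 4) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def target_not_ahead(control_plane: str, target: str) -> bool:
    """True when the node-pool target is not newer than the control plane."""
    return parse_version(target) <= parse_version(control_plane)


if __name__ == "__main__":
    # Hypothetical versions: control plane already upgraded, pool catching up.
    print(target_not_ahead("1.30.2", "1.30.2"))
    # Control plane still behind: the node-pool upgrade should wait.
    print(target_not_ahead("1.29.4", "1.30.0"))
```

Comparing tuples of integers avoids the string-comparison trap where "1.9.0" would sort after "1.10.0".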

Questions to ask

What workload boundary does this upgrade touch: system pool, user pool, zone, subnet, GPU pool, or critical application tier?
Which capacity signal proves the cluster can absorb drained nodes without starving pods or violating SLOs?
What rollback, pause, or escalation path exists if the new node version exposes a workload defect?
