Containers Azure Kubernetes Service premium

AKS cluster upgrade

An AKS cluster upgrade is the planned process of moving a cluster, node pools, or node images to newer supported versions. It is not only a version-number change. Upgrades can affect the Kubernetes API, workloads, add-ons, networking, node images, admission policies, and available capacity during node drain and replacement. A safe upgrade starts with checking supported versions, reviewing workload readiness, validating in staging, planning surge capacity, and capturing evidence before and after the change. Saved upgrade evidence prevents guesswork.

Back to glossary browser Open Microsoft Learn source

Aliases: AKS upgrade, Kubernetes version upgrade, AKS node upgrade, cluster version upgrade
Difficulty: intermediate
CLI mappings: 3
Last verified: 2026-05-09

Microsoft Learn

An AKS cluster upgrade moves an AKS cluster and its node pools to a newer supported Kubernetes version or updates node images after checking available versions, planning surge capacity, and validating workload readiness.

Microsoft Learn: Upgrade an Azure Kubernetes Service cluster2026-05-09

Technical context

Technically, AKS upgrades involve the managed control plane, one or more node pools, Kubernetes minor and patch versions, node images, surge behavior, and workload scheduling. Azure exposes available upgrades through AKS APIs and CLI, and node pools may be upgraded in sequence depending on strategy. During upgrades, nodes can be cordoned and drained, replacement or surge nodes may be created, and pods are rescheduled according to disruption budgets, readiness probes, taints, tolerations, and available capacity.

Why it matters

AKS cluster upgrades matter because unsupported or stale Kubernetes versions create security, supportability, and compatibility risk. At the same time, careless upgrades can break workloads that rely on deprecated APIs, fragile probes, missing disruption budgets, or tight IP capacity. The goal is to stay current without making the cluster the cause of an outage. Strong teams treat upgrades as repeatable operations: discover available versions, test in lower environments, verify add-ons and workloads, communicate windows, capture health evidence, and keep a rollback or mitigation plan for application issues discovered after the upgrade. Practically, AKS cluster upgrade becomes safer when teams save target version, node surge plan, cordon and drain status, workload health, and rollback evidence, because reviewers can compare the intended design with the running state.

Where you see it

Signals, screens, and Azure surfaces where this term usually becomes operational.

Signal 01

az aks get-upgrades output, AKS upgrade settings, node image upgrade settings, maintenance windows, release notes, and post-upgrade validation checks

Signal 02

Azure portal, CLI output, IaC templates, monitoring dashboards, and incident runbooks

When this becomes relevant

Specific situations where this term helps solve real Azure design, operations, migration, security, reliability, cost, or governance problems.

stay within supported Kubernetes versions
apply security and node image updates
standardize cluster versions across environments
perform staged production upgrades
capture upgrade evidence for audits

Real-world case studies

Different enterprise-style examples that show the term being used to hit measurable objectives.

Case study 01

AKS cluster upgrade in action

Scenario, objectives, solution, measured impact, and takeaway.

Scenario

Pioneer Health Exchange, a healthcare data exchange, had AKS clusters approaching the end of a supported Kubernetes version window. The platform team created a disciplined AKS cluster upgrade program instead of rushing production changes.

Business/Technical Objectives

upgrade four clusters within six weeks
avoid patient-portal downtime
validate add-ons and deprecated APIs before production
capture upgrade evidence for compliance

Solution Using AKS cluster upgrade

The team listed available upgrades with Azure CLI, upgraded development and staging clusters first, and scanned workloads for deprecated Kubernetes APIs. Critical services received corrected readiness probes and pod disruption budgets before production. During each window, engineers upgraded the control plane and node pools in a planned order while monitoring node drain, pod readiness, events, and application response time. The change record included az aks output, kubectl rollout status, version evidence, and post-upgrade health metrics. The release note also captured target version, node surge plan, cordon and drain status, workload health, and rollback evidence, the accountable owner, rollback trigger, and verification command so future reviews could reuse the same operating pattern.

Results & Business Impact

All four clusters moved to supported versions before the deadline.
Patient-portal availability stayed above 99.95 percent during upgrade windows.
Three deprecated API usages were fixed before production.
Compliance evidence preparation dropped from 6 hours to 90 minutes per cluster.

Key Takeaway for Glossary Readers

An AKS cluster upgrade is safest when version movement is treated as a tested operational workflow, not a portal button.

Case study 02

AKS cluster upgrade in action

Scenario, objectives, solution, measured impact, and takeaway.

Scenario

Orchid Travel, a travel booking platform, needed to upgrade AKS before launching summer-demand features but feared breaking ingress and autoscaling behavior. The reliability team used staged upgrades and measured rollout gates.

Business/Technical Objectives

complete upgrade before seasonal launch
keep booking latency under 300 milliseconds
verify ingress and autoscaler behavior
reduce rollback uncertainty

Solution Using AKS cluster upgrade

Engineers duplicated production workload patterns in staging and ran synthetic booking traffic before and after the AKS cluster upgrade. Azure CLI captured available versions, current node pool versions, and upgrade state. The team upgraded one noncritical user node pool first, watched pod rescheduling, then upgraded the remaining pools after health checks passed. Ingress controllers, policy add-ons, and autoscaler settings were validated explicitly. A rollback playbook focused on application mitigation, because the team understood which platform operations could and could not be reversed quickly. The release note also captured target version, node surge plan, cordon and drain status, workload health, and rollback evidence, the accountable owner, rollback trigger, and verification command so future reviews could reuse the same operating pattern.

Results & Business Impact

The upgrade completed two weeks before launch freeze.
Booking latency remained at 247 milliseconds p95 during validation.
Autoscaler scale-up timing changed by less than 8 percent.
No rollback was required, and release confidence scores improved in post-change review.

Key Takeaway for Glossary Readers

AKS cluster upgrades protect supportability and security, but the real win is learning exactly how workloads behave during platform change.

Case study 03

AKS cluster upgrade in action

Scenario, objectives, solution, measured impact, and takeaway.

Scenario

Canyon Grid Services, an energy grid software provider, had strict maintenance windows for grid-forecasting APIs running on AKS. The operations team redesigned upgrades around surge capacity and workload protection.

Business/Technical Objectives

finish production upgrades inside a 2-hour window
prevent single-replica workloads from blocking node drain
document node image status after upgrade
reduce emergency maintenance events

Solution Using AKS cluster upgrade

The team audited every deployment for replica count, disruption budgets, and readiness probes before the AKS cluster upgrade. Single-replica workloads were either scaled out or moved to a maintenance-safe schedule. Engineers confirmed subnet IP capacity for surge nodes and checked regional VM quota. During the upgrade, dashboards tracked node readiness, pod evictions, API health, and queue depth. Azure CLI exported node pool versions and node image state after completion. The upgrade checklist became mandatory for every cluster in the grid software platform. The release note also captured target version, node surge plan, cordon and drain status, workload health, and rollback evidence, the accountable owner, rollback trigger, and verification command so future reviews could reuse the same operating pattern.

Results & Business Impact

Upgrade duration fell from 3.5 hours to 88 minutes.
No critical pod remained stuck during node drain.
Emergency maintenance events tied to stale node images dropped by 50 percent.
Post-upgrade evidence was available within 20 minutes of completion.

Key Takeaway for Glossary Readers

An AKS cluster upgrade succeeds when node capacity, workload resilience, and evidence collection are planned together.

Why use Azure CLI for this?

Azure CLI is the normal way to discover available AKS upgrades, start control-plane or node-pool upgrades, and capture version evidence. CLI commands let teams automate preflight checks and avoid manual portal-only upgrade records that are hard to repeat or audit.

CLI use cases

List available Kubernetes upgrades for a cluster.
Upgrade the control plane or selected node pools during an approved window.
Check node pool versions and node image status before and after the change.
Export upgrade evidence for change management and compliance review.

Before you run CLI

Confirm the target version is supported and tested in lower environments.
Check subnet IP capacity, VM quota, surge settings, and maintenance window.
Review add-on compatibility, deprecated APIs, disruption budgets, and workload readiness.
Verify backup, rollback, and incident escalation paths before touching production.

What output tells you

Available-upgrade output shows supported target versions and whether patch or minor upgrades are possible.
Cluster and node pool output reveals version skew and upgrade completion state.
Provisioning state and events indicate whether Azure accepted and completed the operation.
Post-upgrade kubectl output confirms whether workloads recovered after node drain and rescheduling.

Mapped Azure CLI commands

Inspect and operate AKS cluster upgrade

diagnostic

az aks get-upgrades --resource-group <resource-group> --name <cluster>

az aksremoveContainers

az aks upgrade --resource-group <resource-group> --name <cluster> --kubernetes-version <version>

az aksremoveContainers

az aks nodepool upgrade --resource-group <resource-group> --cluster-name <cluster> --name <nodepool> --node-image-only

az aks nodepoolremoveContainers

Architecture context

Security

Security is a major driver for AKS upgrades. New Kubernetes and node image releases can include fixes for vulnerabilities, platform hardening, and component updates. Delaying upgrades increases exposure and can push clusters outside supported version windows. Security review should include API server access, add-ons, policy components, node image age, and workload identity behavior after the upgrade. Operators should also verify that deprecated APIs are not being used, because workloads that fail after an upgrade may tempt teams to weaken controls or rush insecure changes during recovery. The evidence to retain is target version, node surge plan, cordon and drain status, workload health, and rollback evidence, because those details show who can change the boundary and whether exposure matches policy.

Cost

Cost impact often appears during and after upgrades. Surge nodes can temporarily increase VM cost, and failed upgrades can extend maintenance windows or duplicate environments longer than planned. Testing upgrades in staging also consumes resources, but that cost is usually cheaper than a production outage. Node image or version changes may reveal different resource behavior, causing autoscaler adjustments or node pool resizing. FinOps review should include temporary surge capacity, blue-green upgrade strategy cost, idle staging environments, and whether old node pools were removed after migration. A FinOps review should connect target version, node surge plan, cordon and drain status, workload health, and rollback evidence to owner, environment, expected utilization, and review date so spend stays explainable.

Reliability

Reliability during an AKS cluster upgrade depends on workload design and capacity planning. Pods must tolerate node drain, replacement, and rescheduling. Pod disruption budgets, readiness probes, replica counts, topology spread, and surge capacity all matter. If a subnet lacks IP addresses or a node pool cannot create surge nodes, an upgrade may slow or fail. Reliable upgrade programs use staged environments, canary node pools, preflight checks, maintenance windows, and post-upgrade validation. The best outcome is not merely a successful version change; it is unchanged user experience during and after the process. During incidents, target version, node surge plan, cordon and drain status, workload health, and rollback evidence helps responders decide whether the issue is workload behavior, platform capacity, or a misconfigured release.

Performance

Performance can change after an AKS cluster upgrade because Kubernetes components, node images, networking behavior, container runtime details, and add-ons may change. Most upgrades should preserve performance, but fragile workloads can expose slower startup, readiness delays, DNS issues, or scheduling differences. Teams should benchmark key APIs, watch pod restart rates, measure node drain time, and compare latency before and after upgrade. If performance-sensitive workloads use custom ingress, storage, or service mesh components, they need explicit validation rather than assuming the platform upgrade is invisible. Teams should compare performance before and after changing AKS cluster upgrade, using target version, node surge plan, cordon and drain status, workload health, and rollback evidence to separate real bottlenecks from configuration assumptions.

Operations

Operationally, AKS upgrades require a checklist. Before the change, operators should list available versions, inspect current control-plane and node-pool versions, check add-on compatibility, scan for deprecated APIs, confirm quota and subnet capacity, and review disruption budgets. During the change, they should monitor node status, pod readiness, events, and application health. Afterward, they should capture version output, node image state, failed pod counts, and incident-free service metrics. Upgrade evidence should be saved with the change record so future teams can compare patterns and improve the process. The runbook should capture target version, node surge plan, cordon and drain status, workload health, and rollback evidence, assign an owner, and define when to roll back, escalate, or accept a documented exception.

Common mistakes

Upgrading production before testing workloads and add-ons in staging.
Ignoring subnet IP capacity required for surge nodes.
Assuming a control-plane upgrade also fixes every node pool and node image issue.
Failing to check deprecated Kubernetes APIs before a minor version upgrade.

Operator quick checks

What version is the control plane running today?
Which node pools are behind or using stale node images?
Do critical workloads have enough replicas and disruption budgets?
Can the subnet and quota handle surge capacity?

Questions to ask

What user-facing service would notice if node drain takes longer than expected?
Which add-on or controller is most likely to be version-sensitive?
What evidence proves the upgrade completed safely?
How will the team respond if a workload fails after the platform upgrade succeeds?

Related terms

No related terms mapped yet.

Graph connections

Graph edges are queued for this term.

Learning paths

Learn next

Use related terms, graph links, command groups, and comparison cards to keep moving through Azure without losing context.

Open relationship graph