AKS node image

An AKS node image is the Azure-managed operating system image used to create or refresh the virtual machines in an AKS node pool. It includes the OS baseline and node-level components that Microsoft tests for AKS. Updating the node image is different from upgrading the Kubernetes minor version, although both affect cluster maintenance. In everyday terms, it is how platform teams keep worker nodes patched without rebuilding the whole cluster. The upgrade still disrupts nodes, so maintenance windows, surge capacity, and pod disruption planning matter.

Aliases: AKS node image, aks-node-image, aks-cluster, aks-node, node-pool, aks-cluster-upgrade, aks-maintenance-window, kubernetes-version, aks-linux-node-pool, aks-windows-node-pool, node-image-upgrade, automatic-os-image-upgrade
Difficulty: Intermediate
CLI mappings: 5
Last verified: 2026-05-30

Microsoft Learn

Microsoft Learn explains that AKS regularly provides new node images for the operating system and components used by cluster nodes. Updating node images helps apply security fixes, hotfixes, and AKS component updates, and can be managed manually or through automatic upgrade channels.

Microsoft Learn: Node images in Azure Kubernetes Service (AKS), 2026-05-30

Technical context

Technically, an AKS node image is tied to node pools, OS type, OS SKU, region availability, Kubernetes support status, and upgrade settings. Azure publishes image versions and exposes the current value through the node pool property nodeImageVersion and the Kubernetes node label kubernetes.azure.com/node-image-version. Operators can upgrade node images with az aks upgrade or az aks nodepool upgrade and the --node-image-only flag, or configure a node OS auto-upgrade channel. The process drains nodes and then replaces or reimages them according to surge and maintenance rules, while Kubernetes reschedules the affected workloads.
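The auto-upgrade channel mentioned above can be set from the CLI. The following is a minimal sketch, not a definitive procedure: set_node_os_channel is a hypothetical helper name, the resource group and cluster names are placeholders, and the list of accepted channel values is an assumption based on the AKS documentation that should be confirmed with az aks update --help for the installed CLI version.

```shell
# Hypothetical helper: set the node OS auto-upgrade channel, then print the
# cluster's autoUpgradeProfile so the change can be recorded.
# Assumes Azure CLI is installed and logged in to the right subscription.
set_node_os_channel() {
  rg="$1"; cluster="$2"; channel="$3"

  # Channel names assumed from the AKS docs; reject anything else before
  # making a mutating call.
  case "$channel" in
    None|Unmanaged|SecurityPatch|NodeImage) ;;
    *) echo "unsupported channel: $channel" >&2; return 2 ;;
  esac

  az aks update --resource-group "$rg" --name "$cluster" \
    --node-os-upgrade-channel "$channel" --output none || return 1

  # Read the whole profile back rather than guessing a single property name.
  az aks show --resource-group "$rg" --name "$cluster" \
    --query autoUpgradeProfile --output json
}
```

A call such as set_node_os_channel <resource-group> <cluster-name> NodeImage changes cluster configuration, so it belongs behind the same approval as any other upgrade setting.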

Why it matters

AKS node images matter because stale worker nodes carry security, support, and reliability risk even when the Kubernetes control plane looks healthy. New images contain operating system fixes, container runtime updates, AKS component updates, and bug fixes that workloads depend on indirectly. Ignoring node images can produce vulnerability exceptions, failed scale operations, node readiness issues, or incompatibility with newer AKS features. Upgrading carelessly can also evict pods at the wrong time. The value is in a controlled patch rhythm: know the current image, know the latest approved image, test it, schedule the disruption, and verify every node pool afterward.

Where you see it

Signals, screens, and Azure surfaces where this term usually becomes operational.

Signal 01

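In az aks nodepool show output, nodeImageVersion identifies the image currently used by a node pool. Comparing it with latestNodeImageVersion from az aks nodepool get-upgrades shows whether the pool trails the latest available version, which is the starting point for patch planning.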

Signal 02

In AKS upgrade checks, available node image versions appear beside Kubernetes upgrade information, so operators can plan image-only maintenance ahead of customer-impacting windows and audits.

Signal 03

In kubectl node labels, kubernetes.azure.com/node-image-version shows which image each node actually runs after an upgrade, replacement, or autoscaler event, which makes it useful evidence in reviews.
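As a sketch of how the node label signal can be turned into a check: the hypothetical helper below lists every node that is either not on an expected image version or not Ready. It assumes kubectl is already pointed at the cluster; the function name and the expected-version argument are illustrative.

```shell
# Hypothetical helper: print nodes that are NOT on the expected image or NOT Ready.
# Output columns: node name, node image label, Ready condition status.
# Returns non-zero when at least one such node is found.
nodes_off_image() {
  expected="$1"
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.kubernetes\.azure\.com/node-image-version}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' |
    awk -F '\t' -v want="$expected" '
      $2 != want || $3 != "True" { print; bad = 1 }
      END { exit bad }
    '
}
```

An empty result with a zero exit status means every listed node reports the expected image and is Ready. Any printed row is a node to investigate before closing the change.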

When this becomes relevant

Specific situations where this term helps solve real Azure design, operations, migration, security, reliability, cost, or governance problems.

Patch AKS worker nodes for security fixes without changing the cluster's Kubernetes minor version.
Prove compliance by showing current and latest node image versions across production node pools.
Schedule node OS image maintenance with surge capacity so workload disruption stays within approved windows.
Recover from node-level bugs or scale issues caused by older images after validating the new image in nonproduction.
Coordinate Windows, Linux, system, user, GPU, or specialized pools that need different node image upgrade sequencing.

Real-world case studies

Different enterprise-style examples that show the term being used to hit measurable objectives.

Case study 01

Education SaaS closes node vulnerability exceptions

Scenario, objectives, solution, measured impact, and takeaway.

Scenario

An education SaaS platform ran AKS clusters for exam scheduling and grading APIs. Security scans showed production nodes using images more than ninety days old, creating recurring vulnerability exceptions before every audit.

Business/Technical Objectives

Bring all production node pools to an approved image version.
Avoid downtime during exam registration peaks.
Produce evidence for security and customer audits.
Create a monthly patch rhythm instead of emergency upgrades.

Solution Using AKS node image

The platform team used Azure CLI to list node pool versions, compare latestNodeImageVersion values, and identify pools blocked by quota or maintenance conflicts. They tested the image upgrade in staging with the same pod disruption budgets and autoscaler settings used in production. Production upgrades ran node pool by node pool during planned windows with max surge set to 33 percent. Operators watched node readiness, pending pods, application latency, and Kubernetes node image labels. After each pool completed, a script exported before-and-after nodeImageVersion values and attached them to the audit ticket. The team then enabled a controlled node OS auto-upgrade channel for noncritical clusters.

Results & Business Impact

Stale node image exceptions dropped from twenty-eight to two in one quarter.
Exam API availability stayed above 99.99 percent during upgrade windows.
Audit evidence collection fell from three days to under two hours.
Monthly patch planning replaced four emergency weekend maintenance events.

Key Takeaway for Glossary Readers

AKS node image management turns node patching from an audit emergency into a predictable fleet operation.

Case study 02

Robotics platform fixes scale-out failures on old images

Scenario

A warehouse robotics company used AKS to coordinate simulation workloads. During a seasonal planning run, new nodes joined slowly and some reported readiness problems after scale-out.

Business/Technical Objectives

Refresh stale node images without upgrading Kubernetes minor versions.
Restore predictable scale-out for simulation bursts.
Protect long-running simulation jobs during node replacement.
Document capacity and image checks for future events.

Solution Using AKS node image

Engineers discovered that user node pools had drifted several image releases behind the current AKS-tested images. They used az aks nodepool get-upgrades to identify latestNodeImageVersion per pool and scheduled node-image-only upgrades. Before changing production, they increased max surge, verified subnet IP capacity, and added pod disruption budgets for simulation coordinators. Long-running jobs were drained in batches by node pool, while checkpointed workers rescheduled to fresh nodes. Kubectl checks confirmed the new node image labels and Ready state after each batch. The operations runbook now checks image age, quota, pending pods, and autoscaler health before every high-volume simulation window.

Results & Business Impact

Scale-out time for 300 simulation pods improved from twenty-two minutes to nine minutes.
Node readiness failures during burst tests fell from 14 percent to below 1 percent.
No long-running simulation jobs were lost during the upgrade.
Pre-event readiness checks now finish in twenty minutes instead of two hours.

Key Takeaway for Glossary Readers

Old node images can become a scaling problem, not just a patching problem, when bursts depend on fresh, healthy nodes.

Case study 03

Public safety portal coordinates Windows and Linux pool patching

Scenario

A public safety portal ran Linux APIs and a Windows-based reporting component in the same AKS cluster. The two node pools had different image schedules and maintenance constraints.

Business/Technical Objectives

Patch Linux and Windows node images without mixing risk profiles.
Keep emergency reporting available during maintenance.
Expose image status clearly to operations leadership.
Avoid surprise capacity spikes during business hours.

Solution Using AKS node image

The platform team split the node image plan by pool. Linux API pools used a routine monthly image-only upgrade with surge capacity and standard pod disruption budgets. The Windows reporting pool used a separate evening window because startup time and image cadence differed. Azure CLI captured current nodeImageVersion and latestNodeImageVersion for each pool, then triggered upgrades only after quota and subnet IP checks passed. Application owners received a dashboard showing node readiness, pool image version, pending pods, and reporting latency. After each window, kubectl node labels confirmed that every node ran the expected image. Leadership received a concise patch report with remaining exceptions and the next scheduled window.

Results & Business Impact

Emergency reporting availability stayed at 100 percent across three patch windows.
Unplanned surge capacity during business hours dropped to zero.
Patch status reporting moved from manual spreadsheets to automated CLI evidence.
Windows pool exceptions were reduced from six weeks overdue to within policy.

Key Takeaway for Glossary Readers

AKS node image upgrades should be planned by node pool because operating systems, workloads, and disruption tolerance are rarely identical.

Why use Azure CLI for this?

As an Azure engineer, I use Azure CLI for node images because node image currency is fleet work, not a single-cluster curiosity. CLI can show current nodeImageVersion values, list available upgrades, apply a node-image-only upgrade, and capture the before-and-after evidence required by security teams. It also pairs well with kubectl checks that show node labels and readiness while workloads reschedule. The portal can start an upgrade, but scripts let me coordinate maintenance windows, surge settings, exclusions, and status checks across many clusters. That repeatability is what keeps patching across many subscriptions from becoming a quarterly fire drill.

CLI use cases

Show every node pool's current nodeImageVersion and compare it with the latestNodeImageVersion reported by upgrade checks.
Run a node-image-only upgrade for one node pool after validating capacity, disruption budgets, and maintenance approval.
Configure max surge or maintenance settings so node replacement happens at a controlled speed.
Use kubectl to confirm each node's image label and Ready state after the Azure CLI upgrade completes.
Export before-and-after node image evidence for security audits, vulnerability exceptions, and fleet patch reports.
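The evidence export in the last use case could look something like the sketch below. snapshot_node_images is a hypothetical helper, the CSV layout (cluster, phase, pool, image) is an assumption chosen for illustration, and Azure CLI is assumed to be logged in.

```shell
# Hypothetical helper: emit one CSV row per node pool with its current image.
# The "phase" argument is a free-text tag such as "before" or "after".
snapshot_node_images() {
  rg="$1"; cluster="$2"; phase="$3"

  # A list projection ([name, nodeImageVersion]) keeps the TSV column order fixed.
  az aks nodepool list --resource-group "$rg" --cluster-name "$cluster" \
    --query "[].[name, nodeImageVersion]" --output tsv |
    awk -F '\t' -v c="$cluster" -v p="$phase" '
      BEGIN { OFS = "," }
      { print c, p, $1, $2 }
    '
}
```

Running it once before the change and once after, with the output appended to the same file, gives a simple before-and-after record to attach to a ticket.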

Before you run CLI

Confirm tenant, subscription, resource group, cluster name, node pool name, and whether you are targeting system or user pools.
Check AKS version support, OS SKU support, node pool health, autoscaler behavior, maintenance windows, and current workload disruption budgets.
Review permissions for AKS cluster and node pool updates, because image upgrades are mutating operations with availability impact.
Verify surge capacity and quota before starting; an upgrade may need extra cores, IPs, or temporary nodes.
Use read-only get-upgrades and show commands first, then record output for the change ticket before running upgrade commands.
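For the surge capacity check, a rough estimate of how many temporary nodes a percentage-based max surge implies can help size the quota and IP review. The sketch below assumes that AKS rounds fractional surge nodes up, as described in the AKS surge upgrade documentation; surge_nodes is a hypothetical helper, and the result is only an estimate of extra nodes, not of cores or IP addresses.

```shell
# Hypothetical helper: estimate temporary surge nodes for a percentage max surge.
# Assumes fractional nodes are rounded up (ceiling of nodes * percent / 100).
surge_nodes() {
  nodes="$1"; percent="$2"
  echo $(( (nodes * percent + 99) / 100 ))
}

surge_nodes 10 33   # prints 4
surge_nodes 3 33    # prints 1
```

The inputs can be read from the pool itself, for example with az aks nodepool show and a query such as "{count:count, maxSurge:upgradeSettings.maxSurge}". Multiply the estimate by the VM size's core count and the per-node IP reservation to compare against quota and subnet headroom.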

What output tells you

nodeImageVersion shows the image currently associated with a node pool, while latestNodeImageVersion shows the update target available for that pool.
Provisioning state and power or upgrade status fields indicate whether Azure is still changing the node pool.
Node labels from kubectl prove which image individual nodes actually run after replacement or reimage operations finish.
Max surge and upgrade settings explain how much temporary capacity the pool may use during the image rollout.
Errors often reveal quota, unsupported version, maintenance configuration, pod disruption, or node pool health problems that block safe upgrades.
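The first point, reading nodeImageVersion against latestNodeImageVersion, can be sketched as a small read-only helper. nodepool_image_status is a hypothetical name, the arguments are placeholders, and the "current" or "behind" wording is an illustrative convention rather than anything Azure reports.

```shell
# Hypothetical helper: print "current<TAB>latest<TAB>status" for one node pool.
# Read-only: uses only show and get-upgrades. Assumes Azure CLI is logged in.
nodepool_image_status() {
  rg="$1"; cluster="$2"; pool="$3"

  current="$(az aks nodepool show \
    --resource-group "$rg" --cluster-name "$cluster" --name "$pool" \
    --query nodeImageVersion --output tsv)" || return 1

  latest="$(az aks nodepool get-upgrades \
    --resource-group "$rg" --cluster-name "$cluster" --nodepool-name "$pool" \
    --query latestNodeImageVersion --output tsv)" || return 1

  if [ "$current" = "$latest" ]; then
    status="current"
  else
    status="behind"
  fi
  printf '%s\t%s\t%s\n' "$current" "$latest" "$status"
}
```

Because it makes no changes, this kind of check is safe to run ahead of a change ticket and to repeat afterward.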

Mapped Azure CLI commands

AKS node image CLI commands

direct-or-adjacent

az aks nodepool get-upgrades --resource-group <resource-group> --cluster-name <cluster-name> --nodepool-name <nodepool-name>

az aks nodepool | discover | Containers

az aks nodepool show --resource-group <resource-group> --cluster-name <cluster-name> --name <nodepool-name> --query nodeImageVersion

az aks nodepool | discover | Containers

az aks nodepool upgrade --resource-group <resource-group> --cluster-name <cluster-name> --name <nodepool-name> --node-image-only

az aks nodepool | operate | Containers

az aks upgrade --resource-group <resource-group> --name <cluster-name> --node-image-only --yes

az aks | operate | Containers

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.kubernetes\.azure\.com/node-image-version}{"\n"}{end}'
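One way to chain the mapped commands for a single pool is sketched below. upgrade_pool_image and the DRY_RUN variable are hypothetical conventions, not Azure CLI features; the helper defaults to printing the upgrade command instead of running it, because the real call drains and replaces nodes.

```shell
# Hypothetical helper: node-image-only upgrade for one pool, dry run by default.
# Set DRY_RUN=0 to actually run the upgrade. Assumes Azure CLI is logged in.
upgrade_pool_image() {
  rg="$1"; cluster="$2"; pool="$3"

  # Build the mutating command once so the dry run shows exactly what would run.
  set -- az aks nodepool upgrade --resource-group "$rg" --cluster-name "$cluster" \
    --name "$pool" --node-image-only

  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "DRY RUN: $*"
    return 0
  fi

  "$@" || return 1

  # After the operation returns, report the pool state and image for the record.
  az aks nodepool show --resource-group "$rg" --cluster-name "$cluster" --name "$pool" \
    --query "[provisioningState, nodeImageVersion]" --output tsv
}
```

The dry-run default is a design choice for runbooks: the printed command can be pasted into the change ticket for approval before anyone sets DRY_RUN=0.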

Architecture context

Architecturally, node images are part of the AKS lifecycle strategy. I plan them with Kubernetes version upgrades, node pool separation, operating system SKU choices, workload disruption budgets, cluster autoscaler behavior, and maintenance windows. System node pools, user node pools, GPU pools, Windows pools, and specialized workloads may need different sequencing. Automatic channels reduce human delay, but production still needs observability and exception handling. Blue-green or surge-based patterns may be justified for strict availability workloads. A strong design treats node image upgrades as routine capacity turnover, with enough spare nodes and pod policies to keep service-level objectives intact during patching.

Security

Security impact is direct because node images carry operating system patches, kernel fixes, container runtime updates, and AKS-managed component updates. A cluster can have good RBAC and network policy but still run nodes with known vulnerabilities if images are stale. Review who can change auto-upgrade channels, start node pool upgrades, or pause maintenance. Track image age, supported OS SKU status, and exceptions for workloads that cannot tolerate disruption. Combine node image management with Defender for Cloud, vulnerability scanning, policy, and incident response. Do not rely on unattended Linux updates alone; node reboots or image upgrades may still be needed to complete fixes.

Cost

Node image upgrades usually have no separate image charge, but the operational cost is real. Surge settings can temporarily add nodes, blue-green strategies may double capacity, and failed upgrades consume engineer time. Delayed patching creates audit exceptions and emergency maintenance, which is expensive in a different way. Automatic channels reduce manual effort, but they still need planning around maintenance windows and workload readiness. FinOps teams should understand temporary capacity spikes during upgrades and ensure clusters are not oversized permanently just to make patching safe. Good node image hygiene balances security deadlines, spare capacity, and predictable maintenance labor over repeated cycles.

Reliability

Reliability depends on applying node images without removing too much capacity at once. Node image upgrades drain nodes and reschedule pods, so pod disruption budgets, replica counts, readiness probes, persistent volume behavior, and max surge settings are important. If a node pool is too small or workloads are single-replica, an otherwise routine image update can cause downtime. Stale images also create reliability risk through node readiness problems and unsupported configurations. Plan upgrades by node pool, test in nonproduction, respect maintenance windows, and verify all nodes return Ready with the expected image label. Reliable patching is controlled replacement, not surprise rebooting.

Performance

Performance impact can be positive or negative. New node images may include kernel, runtime, or AKS component improvements that help scheduling, networking, or container startup. During the upgrade, performance can dip if nodes drain, pods reschedule, caches warm, or capacity is temporarily constrained. Workloads without enough replicas may experience visible latency even when the upgrade succeeds technically. Measure application SLOs, pending pods, node pressure, and readiness timing during test upgrades. Keep node pools current enough to avoid old-image bugs, but schedule production upgrades for maintenance windows when surge capacity and workload behavior can absorb the churn.

Operations

Operators inspect AKS node images during patch cycles, security exceptions, scale failures, and upgrade planning. Runbooks should capture cluster name, node pool, OS type, OS SKU, Kubernetes version, current nodeImageVersion, latest available image, max surge, maintenance window, and workload disruption constraints. Before upgrading, check node pool health, pod disruption budgets, autoscaler settings, and capacity headroom. During the upgrade, watch node readiness, drain events, pending pods, and application alerts. Afterward, confirm every node reports the expected image version. Keep evidence, because security teams often need proof during audits that the fleet moved from old images to approved images in every environment.
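Two of the signals to watch during an upgrade, pending pods and node readiness, can be reduced to counts for a runbook or alert. The helpers below are a sketch with hypothetical names; they assume kubectl is pointed at the cluster and rely on the default kubectl get nodes column layout, where the second column is STATUS.

```shell
# Hypothetical helper: number of pods currently in the Pending phase, cluster-wide.
pending_pod_count() {
  kubectl get pods --all-namespaces --field-selector=status.phase=Pending \
    --no-headers 2>/dev/null | wc -l | tr -d ' '
}

# Hypothetical helper: number of nodes whose STATUS is not exactly "Ready".
# Cordoned nodes (for example "Ready,SchedulingDisabled") are counted too,
# which is expected while an upgrade is draining them.
not_ready_node_count() {
  kubectl get nodes --no-headers 2>/dev/null |
    awk '$2 != "Ready" { n++ } END { print n + 0 }'
}
```

Sampling both counts at a fixed interval during the window gives a simple trend: a pending-pod count that keeps rising while nodes are being replaced is a hint that surge capacity or disruption budgets need another look.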

Common mistakes

Confusing Kubernetes version upgrades with node image upgrades and assuming one always replaces the need for the other.
Starting an image upgrade on a tiny node pool with single-replica workloads and no pod disruption budget review.
Forgetting to check quota or IP capacity before setting surge values that require temporary nodes.
Treating the Azure operation as complete without verifying Kubernetes node labels and Ready status afterward.
Ignoring OS SKU retirement or support windows until scale-out or image availability fails during an urgent change.

Operator quick checks

Run az aks nodepool get-upgrades and note latestNodeImageVersion before proposing the maintenance window.
Show current nodeImageVersion for each node pool and identify pools with different OS types or special workloads.
Check pod disruption budgets, replica counts, node pressure, and pending pods before starting the upgrade.
Confirm subscription quota, subnet IP capacity, and max surge settings can support temporary replacement nodes.
After the upgrade, list node image labels and verify every expected node is Ready on the new image.

Questions to ask

Which node pools are security-critical, customer-facing, or disruption-sensitive, and should they be upgraded first or last?
Who approves node-image-only upgrades, auto-upgrade channels, max surge settings, and maintenance window exceptions?
What workload breaks if a node drains, and do pod disruption budgets and replica counts protect it?
How will the team prove that every node pool reached the approved image version after the change?
What is the rollback or mitigation plan if a new image causes node readiness or application behavior problems?
