A scale set upgrade policy tells Azure how to apply changes to the virtual machines inside a Virtual Machine Scale Set. When the model changes, such as a new image, extension, data disk, or VM size, each instance may need an update. Manual policy makes operators trigger the upgrade. Automatic policy pushes changes without waiting for each manual instance action. Rolling policy updates batches so the whole fleet is not disrupted at once. The policy is a release-safety setting for VM fleets, not a general patching slogan.
VMSS upgrade policy, Virtual Machine Scale Set upgrade policy, upgradePolicy.mode, rolling upgrade policy, manual upgrade policy, automatic upgrade policy
Difficulty
intermediate
CLI mappings
4
Last verified
2026-05-23
Microsoft Learn
Microsoft Learn explains that a Virtual Machine Scale Set upgrade policy controls how model changes reach individual VM instances. Manual, automatic, and rolling modes affect when instances receive image, SKU, disk, extension, or zone changes, and rolling policies help production fleets update in batches while preserving serving capacity.
In Azure architecture, scale set upgrade policy is a control-plane setting on the VMSS Compute resource. It governs how the scale set model is reconciled with existing instances after changes to image reference, SKU, disks, extensions, or other model properties. Rolling upgrades depend on batch size, health signals, pause time, and availability assumptions. The policy intersects with application health extension, load balancer probes, automatic OS image upgrades, instance repairs, orchestration mode, and deployment automation. Operators inspect it before changing images or platform settings on production fleets.
Why it matters
Scale set upgrade policy matters because VM fleet changes can become outages when every instance changes at once or when old instances never receive a required model update. A safe policy lets teams balance speed, control, and availability. Manual mode is useful when every batch needs inspection, but it can leave drift if operators forget to upgrade instances. Automatic mode is simple but risky for sensitive workloads. Rolling mode gives production teams a practical middle ground by replacing or updating controlled batches while keeping enough capacity online. The policy also creates clear expectations for incident response, compliance patching, and release rollback.
Where you see it
Signals, screens, and Azure surfaces where this term usually becomes operational.
Signal 01
In the VMSS Management or Updates settings, upgrade policy mode appears as Manual, Automatic, or Rolling, shaping how model changes reach instances during releases and patch windows.
Signal 02
In az vmss show output, upgradePolicy.mode and rollingUpgradePolicy fields reveal whether operators must trigger upgrades, pause batches, or let Azure advance automatically during release review.
Signal 03
In deployment templates, upgradePolicy declares rollout behavior, letting reviewers catch risky automatic upgrades during change control, before a production image, extension, or disk change ships.
Signal 04
In Activity Log and Azure Monitor, rolling upgrade events expose batch timing, failed instance health, and whether a model change stalled before completing or rolling back.
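As a minimal sketch of reading the fields named in Signal 02, the snippet below parses a trimmed JSON document shaped like `az vmss show` output and pulls out the upgrade policy. The sample values and the `summarize_upgrade_policy` helper are illustrative assumptions, not captured output or part of any Azure SDK.

```python
import json

# Trimmed sample shaped like `az vmss show` output; the values are illustrative only.
SAMPLE_SHOW_OUTPUT = """
{
  "name": "example-vmss",
  "upgradePolicy": {
    "mode": "Rolling",
    "rollingUpgradePolicy": {
      "maxBatchInstancePercent": 20,
      "maxUnhealthyInstancePercent": 20,
      "maxUnhealthyUpgradedInstancePercent": 20,
      "pauseTimeBetweenBatches": "PT30S"
    }
  }
}
"""


def summarize_upgrade_policy(show_output: dict) -> dict:
    """Return the fields a reviewer usually checks first (hypothetical helper)."""
    policy = show_output.get("upgradePolicy") or {}
    rolling = policy.get("rollingUpgradePolicy") or {}
    mode = policy.get("mode", "Unknown")
    return {
        "mode": mode,
        # Manual mode means someone has to trigger instance upgrades.
        "operator_must_trigger": mode == "Manual",
        "max_batch_percent": rolling.get("maxBatchInstancePercent"),
        "pause_between_batches": rolling.get("pauseTimeBetweenBatches"),
    }


summary = summarize_upgrade_policy(json.loads(SAMPLE_SHOW_OUTPUT))
print(summary)
```

In practice the input would come from saving the real command output to a file and loading it with `json.load`.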
When this becomes relevant
Specific situations where this term helps solve real Azure design, operations, migration, security, reliability, cost, or governance problems.
Roll out a new VMSS image in batches so a bad build cannot remove the entire application tier at once.
Keep a GPU or batch worker scale set on manual upgrade policy until running jobs drain safely.
Adopt automatic OS image upgrades for hardened base images while preserving health checks and rollback evidence.
Compare staging and production upgrade policies before a release to catch accidental automatic updates in production.
Investigate mixed VMSS instance versions after a model change that left some machines serving old code or extensions.
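The staging-versus-production comparison in the list above can be sketched as a small diff over two `upgradePolicy` objects. Both sample policies and the helper names `flatten` and `diff_policies` are hypothetical; the only assumption about Azure is that each object was exported from `az vmss show` for its environment.

```python
def flatten(policy: dict, prefix: str = "") -> dict:
    """Flatten nested policy settings into dotted paths such as 'a.b'."""
    flat = {}
    for key, value in policy.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_policies(staging: dict, production: dict) -> dict:
    """Map each differing setting to a (staging, production) pair."""
    a, b = flatten(staging), flatten(production)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Illustrative policies, not real exports.
staging_policy = {
    "mode": "Rolling",
    "rollingUpgradePolicy": {"maxBatchInstancePercent": 20, "pauseTimeBetweenBatches": "PT30S"},
}
production_policy = {
    "mode": "Automatic",
    "rollingUpgradePolicy": {"maxBatchInstancePercent": 20, "pauseTimeBetweenBatches": "PT0S"},
}

policy_diff = diff_policies(staging_policy, production_policy)
for setting, (stage_value, prod_value) in policy_diff.items():
    print(f"{setting}: staging={stage_value} production={prod_value}")
```

An empty result means the two environments declare the same rollout behavior, and any entry is something to resolve before the release.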
Real-world case studies
Different enterprise-style examples that show the term being used to hit measurable objectives.
Case study 01
Gaming platform prevents a bad image from draining match servers
Scenario, objectives, solution, measured impact, and takeaway.
📌Scenario
A multiplayer gaming platform hosted regional match coordinators on VM scale sets. A previous image release pushed an extension change to every instance and caused thousands of players to wait in lobbies.
🎯Business/Technical Objectives
Apply a new coordinator image without dropping more than 2 percent of active matches.
Detect bad batches within five minutes using health and matchmaking metrics.
Keep rollback simple if the extension failed again.
Align staging and production rollout behavior before launch weekend.
✅Solution Using Scale set upgrade policy
The engineering team changed the production scale set upgrade policy from automatic to rolling. They defined conservative batch sizing, confirmed load-balancer probes matched the coordinator readiness endpoint, and kept spare capacity above normal demand. Azure CLI exported upgradePolicy settings from staging and production before the release so reviewers could compare them. During rollout, each batch was monitored for failed probes, lobby wait time, and coordinator error rate. The release pipeline paused when any batch breached health thresholds and kept the previous gallery image version available for rollback.
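A pipeline gate of the kind described above might look roughly like the following sketch. The metric names and limits are hypothetical placeholders, not values from the case study, and the check is an external pipeline step rather than a built-in Azure feature. It fails closed: a metric that is missing counts as a breach.

```python
def breached_limits(metrics: dict, limits: dict) -> list:
    """Return the names of limits that a batch exceeded or did not report."""
    return sorted(
        name
        for name, limit in limits.items()
        if name not in metrics or metrics[name] > limit
    )


# Hypothetical thresholds chosen only for illustration.
LIMITS = {"failed_probe_percent": 5.0, "error_rate_percent": 1.0}

healthy_batch = {"failed_probe_percent": 0.0, "error_rate_percent": 0.2}
failing_batch = {"failed_probe_percent": 12.5, "error_rate_percent": 0.4}

for label, batch in (("healthy", healthy_batch), ("failing", failing_batch)):
    breaches = breached_limits(batch, LIMITS)
    action = "pause rollout" if breaches else "continue"
    print(f"{label}: {action} {breaches}")
```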
📈Results & Business Impact
The new image reached all regions with no batch exceeding 1.3 percent match disruption.
A malformed extension setting was caught in staging and never reached production.
Rollback decision time dropped from 38 minutes to 7 minutes because the policy and image references were exported.
Launch-weekend lobby wait time stayed below the 20-second SLO.
💡Key Takeaway for Glossary Readers
A scale set upgrade policy turns VMSS image rollout into a controlled release process instead of a fleet-wide gamble.
Case study 02
Research institute protects long-running GPU simulations
Scenario, objectives, solution, measured impact, and takeaway.
📌Scenario
A climate research institute ran GPU simulations that lasted up to nine days on a VMSS. Automatic model updates had interrupted jobs when a driver extension changed without enough notice.
🎯Business/Technical Objectives
Prevent running simulations from being restarted by unplanned model updates.
Apply approved driver updates during scheduled research breaks.
Keep security teams informed about instances still awaiting upgrade.
Reduce wasted GPU hours caused by forced restarts.
✅Solution Using Scale set upgrade policy
The platform group moved the GPU scale set to manual upgrade policy and built a weekly review runbook. Azure CLI listed model differences, instance status, driver extension versions, and job tags. Researchers marked nodes ready after simulations completed, and operators triggered instance upgrades in small groups. The team documented accepted security exceptions in the change ticket when a critical run needed to finish before the driver update. They kept Azure Monitor alerts for unhealthy instances separate from upgrade state so a broken GPU host was still repaired quickly.
📈Results & Business Impact
Unplanned simulation restarts fell from eleven per quarter to one approved exception.
Wasted GPU runtime dropped by about 420 hours in the first quarter.
Security review time fell from two days to four hours because pending upgrade evidence was exported.
Researchers received predictable maintenance slots instead of surprise driver changes.
💡Key Takeaway for Glossary Readers
Manual upgrade policy is not outdated when the workload needs deliberate human control over when each expensive instance changes.
Case study 03
Scenario, objectives, solution, measured impact, and takeaway.
📌Scenario
A logistics IoT provider collected telemetry from refrigerated trailers through VMSS gateway nodes. A critical OpenSSL fix required a new image, but downtime risk was high during overnight shipping windows.
🎯Business/Technical Objectives
Deploy the patched image to all gateway nodes within forty-eight hours.
Keep telemetry ingestion loss below 1 percent during upgrade batches.
Prove each region used health-based rollout controls.
Avoid hand-editing upgrade settings across twenty-three scale sets.
✅Solution Using Scale set upgrade policy
The cloud team standardized rolling upgrade policy in Bicep and applied it through the release pipeline. Each scale set used a load-balancer probe tied to the gateway ingestion endpoint and maintained temporary spare capacity before batches began. Azure CLI compared upgradePolicy.mode, image reference, and rolling settings across regions before approval. During the rollout, operators watched telemetry gap rate, unhealthy instance count, and batch completion time. Any region that exceeded the failure threshold paused automatically while the team inspected gateway logs and extension status.
📈Results & Business Impact
All twenty-three scale sets reached the patched image in thirty-one hours.
Telemetry loss during upgrade windows stayed at 0.4 percent.
Manual policy drift was eliminated because Bicep enforced the same rolling settings everywhere.
Security signoff arrived one day earlier than planned with region-by-region evidence attached.
💡Key Takeaway for Glossary Readers
Rolling upgrade policy helps security patches move quickly without treating every VM instance as disposable at the same moment.
Why use Azure CLI for this?
After ten years of Azure engineering, I use Azure CLI for scale set upgrade policy because the risky part is rarely the first click; it is proving what the fleet will do after the model changes. CLI lets me inspect upgradePolicy.mode, rolling settings, image references, instance health, and pending instance updates before I approve a release. It also lets automation set the same mode across environments, trigger manual upgrades when needed, and export evidence after each batch. The portal can explain the setting, but CLI gives change history, repeatability, and safer comparisons between staging and production scale sets.
CLI use cases
Inspect upgradePolicy.mode, rollingUpgradePolicy, image reference, and instance health before approving a model update.
Create or update a VMSS with a reviewed upgrade policy mode instead of relying on portal defaults.
Trigger or monitor manual instance upgrades when policy mode requires operators to advance the fleet deliberately.
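The use cases above can be scripted by building the Azure CLI argument lists in code, so that the same reviewed command runs in every environment. This sketch assumes the `az vmss update --set upgradePolicy.mode=...` and `az vmss update-instances --instance-ids ...` forms. The resource names are placeholders, and the helpers only build and print commands without executing them.

```python
import shlex

VALID_MODES = {"Manual", "Automatic", "Rolling"}


def set_mode_command(resource_group: str, scale_set: str, mode: str) -> list:
    """Build the argument list that changes upgradePolicy.mode on a scale set."""
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported upgrade policy mode: {mode}")
    return [
        "az", "vmss", "update",
        "--resource-group", resource_group,
        "--name", scale_set,
        "--set", f"upgradePolicy.mode={mode}",
    ]


def update_instances_command(resource_group: str, scale_set: str, instance_ids: list) -> list:
    """Build the argument list that brings selected instances to the latest model."""
    if not instance_ids:
        raise ValueError("at least one instance ID is required")
    return [
        "az", "vmss", "update-instances",
        "--resource-group", resource_group,
        "--name", scale_set,
        "--instance-ids", *instance_ids,
    ]


# Placeholder names; print the commands for review instead of running them.
print(shlex.join(set_mode_command("rg-demo", "vmss-demo", "Manual")))
print(shlex.join(update_instances_command("rg-demo", "vmss-demo", ["0", "3"])))
```

Keeping the commands as lists makes them safe to pass to `subprocess.run` later without shell quoting problems.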
Before you run CLI
Confirm tenant, subscription, resource group, scale set, region, orchestration mode, permissions, and output format before changing policy.
Check spare capacity, health probes, application health extension, image version, extension changes, and rollback plan before a rollout.
Review destructive and cost risk because a policy change can restart instances, create surge capacity, or leave workloads under-provisioned.
What output tells you
upgradePolicy.mode tells whether Azure waits for manual action, updates automatically, or proceeds through rolling batches.
rollingUpgradePolicy fields describe batch size, pause behavior, unhealthy thresholds, and whether capacity can surge during upgrade.
Instance view output shows which machines are updated, unhealthy, or still running an older model after the policy is applied.
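To illustrate the last point, the sketch below filters a list shaped like `az vmss list-instances` output for machines that have not yet received the latest model. It assumes each entry carries `instanceId` and `latestModelApplied`; the sample data is invented, and output for Flexible orchestration scale sets may be shaped differently.

```python
# Illustrative entries shaped like `az vmss list-instances` output.
SAMPLE_INSTANCES = [
    {"instanceId": "0", "latestModelApplied": True},
    {"instanceId": "1", "latestModelApplied": False},
    {"instanceId": "4", "latestModelApplied": False},
    {"instanceId": "7", "latestModelApplied": True},
]


def pending_instances(instances: list) -> list:
    """Return IDs of instances still running an older scale set model."""
    return [
        item["instanceId"]
        for item in instances
        # A missing flag is treated as "not updated" so nothing is silently skipped.
        if not item.get("latestModelApplied", False)
    ]


pending = pending_instances(SAMPLE_INSTANCES)
print(f"{len(pending)} of {len(SAMPLE_INSTANCES)} instances pending: {pending}")
```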
Mapped Azure CLI commands
VMSS operations
adjacent
az vmss list --resource-group <resource-group>
az vmss · discover · Compute
az vmss show --name <scale-set-name> --resource-group <resource-group>
az vmss · discover · Compute
az vmss create --name <scale-set-name> --resource-group <resource-group> --image <image>
az vmss · provision · Compute
az vmss scale --name <scale-set-name> --resource-group <resource-group> --new-capacity <count>
az vmss · operate · Compute
Architecture context
Architecturally, the upgrade policy is a deployment-control choice for compute fleets. I decide it after looking at workload tolerance, load-balancer health checks, zone distribution, spare capacity, image source, extension behavior, and rollback mechanics. A stateless web tier can usually tolerate rolling upgrades with strong health probes. A batch worker, GPU pool, or stateful service may need manual control and queue draining. The policy also has to match IaC and image governance. If Bicep declares one mode, a release pipeline changes another, and operations trigger manual repairs separately, you get drift that is difficult to explain during an outage review meeting.
Security
Security impact is direct when upgrade policy controls how quickly hardened images, extension fixes, and OS updates reach running instances. Manual mode can protect fragile workloads, but it also lets vulnerable instances remain unchanged if the upgrade action is forgotten. Automatic or rolling modes can reduce exposure, but only when health signals and rollback plans are credible. Operators should verify who can change the policy, whether images come from trusted galleries, whether extensions are signed and necessary, and whether managed identities and network controls remain consistent after each batch. A poorly chosen policy can quickly turn patch management into long-lived drift.
Cost
Upgrade policy has an indirect but meaningful cost path. Rolling upgrades often require spare capacity or max surge behavior, which can create temporary compute and disk cost. Manual upgrades may reduce immediate risk, but they increase operator time and can leave older, less efficient images running. A failed automatic upgrade can waste hours in incident response, retries, and customer support. Teams should include upgrade policy in FinOps discussions when fleets are large, images are frequent, or maintenance windows are expensive. The cheapest setting is not the one with fewer instances; it is the one that prevents unnecessary downtime and rework.
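A rough way to size the temporary surge cost is extra instances multiplied by hourly rate and rollout duration. The sketch below assumes surge capacity is a percentage of fleet size rounded up, which is a simplification rather than Azure's documented calculation. The hourly rate is a made-up figure, not a published price.

```python
def surge_instance_count(capacity: int, surge_percent: int) -> int:
    """Extra instances for a surge of surge_percent, rounded up (assumed rounding)."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-capacity * surge_percent // 100)


def surge_cost(capacity: int, surge_percent: int, hourly_rate: float, hours: float) -> float:
    """Temporary compute cost of running the surge instances for the rollout."""
    return surge_instance_count(capacity, surge_percent) * hourly_rate * hours


# Hypothetical fleet: 40 instances, 20 percent surge, 0.50 per hour, 1.5 hour rollout.
extra = surge_instance_count(40, 20)
cost = surge_cost(40, 20, hourly_rate=0.50, hours=1.5)
print(f"{extra} surge instances, estimated temporary cost {cost:.2f}")
```

Disk, licensing, and egress charges are left out, so the result is a lower bound for discussion rather than a forecast.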
Reliability
Reliability impact is direct because the policy decides how much of the fleet can change at the same time. Updating too many instances can drain capacity, fail health probes, or overload remaining workers. Updating too slowly can leave mixed behavior across old and new instances. Rolling policies improve reliability when they use realistic batch sizes, health checks, pause times, and spare capacity. Manual policies improve control but require disciplined runbooks. Operators should test policy behavior in staging, validate actual instance health during upgrades, and know how to pause, repair, or roll back if a new model breaks production traffic unexpectedly.
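To make the batch-size reasoning concrete, here is a simplified planner that splits instance IDs into batches from a maximum batch percentage. It is only a sketch: the rounding rule (round down, minimum one instance) is an assumption, and Azure's own batching also considers zones, fault or update domains, and instance health, none of which is modeled here.

```python
def plan_batches(instance_ids: list, max_batch_percent: int) -> list:
    """Split instances into consecutive batches no larger than the given percentage."""
    if not 1 <= max_batch_percent <= 100:
        raise ValueError("max_batch_percent must be between 1 and 100")
    # Round down, but always move at least one instance per batch.
    size = max(1, len(instance_ids) * max_batch_percent // 100)
    return [instance_ids[i:i + size] for i in range(0, len(instance_ids), size)]


fleet = [str(n) for n in range(10)]  # ten hypothetical instance IDs
batches = plan_batches(fleet, max_batch_percent=20)
for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {batch}")
```

Running the same planner with the staging and production fleet sizes is a quick way to see how many pause intervals a rollout will include.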
Performance
Performance impact is indirect during normal operation and direct during rollout. A policy that updates too many instances at once can reduce serving capacity, increase queue depth, and raise p95 latency even if the final image performs better. A policy that leaves mixed versions can create inconsistent response times or cache behavior. Rolling upgrades protect performance when health probes, batch percentages, and pause times reflect real warm-up periods. Operators should watch request latency, CPU, memory, load-balancer distribution, and downstream errors while batches proceed. The goal is a stable fleet transition, not merely a model update that is recorded as complete.
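The capacity effect can be estimated with simple arithmetic: if a batch takes some instances out of service and traffic is spread evenly, each remaining instance carries total divided by remaining times its usual load. The sketch below assumes even load distribution and no surge capacity; the fleet numbers are illustrative.

```python
def load_factor_during_batch(total_instances: int, batch_size: int) -> float:
    """Per-instance load multiplier while batch_size instances are out of service."""
    serving = total_instances - batch_size
    if serving <= 0:
        raise ValueError("batch would remove every serving instance")
    return total_instances / serving


# Ten instances with a batch of two: the other eight absorb the traffic.
factor = load_factor_during_batch(total_instances=10, batch_size=2)
print(f"each remaining instance handles {factor:.2f}x its normal load")
```

If normal CPU already sits near 80 percent, a 1.25x multiplier would push the remaining instances to saturation, which argues for smaller batches or added capacity.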
Operations
Operators manage scale set upgrade policy by inspecting the current mode, reviewing model differences, validating health signals, and deciding whether upgrades should be manual, automatic, or rolling. Before a change, they export the scale set model, image version, extension list, rolling settings, and instance status. During rollout, they watch batch progress, health probe failures, application errors, and capacity. After rollout, they confirm all instances match the latest model or document intentional exceptions. Good operations practice also prevents emergency portal edits from drifting away from Bicep, Terraform, or release-pipeline definitions, so the declared policy still holds up in later governance reviews and audits.
Common mistakes
Choosing automatic mode for a fragile application without validating health probes, warm-up time, and rollback behavior.
Leaving manual mode in place after a migration, then discovering months later that instances never received the approved image.
Assuming rolling mode is safe while capacity is too low for a batch to be removed from service.