
Update domain

An update domain is Azure’s way of avoiding a planned maintenance event that restarts every virtual machine in an application at the same time. When VMs are placed in an availability set, Azure assigns them to update domains. During platform maintenance, Azure works through those domains one group at a time. If the application has enough healthy instances across domains, users should keep receiving service while one group is patched, rebooted, or temporarily unavailable.

Aliases: platform update domain, VM update domain, availability set update domain, planned maintenance domain
Difficulty: intermediate
CLI mappings: 5
Last verified: 2026-05-28

Microsoft Learn

An update domain is a logical group of virtual machines that Azure can reboot or update together during planned platform maintenance. In an availability set, Azure spreads VMs across update domains so only part of the application tier is affected at one time.

Microsoft Learn: Availability options for Azure virtual machines (2026-05-28)

Technical context

In Azure architecture, update domains belong to the compute placement model for availability sets and related virtual machine patterns. They sit below the application and operating system but above the physical host maintenance process. A VM records availability-set membership, and that set defines the platform update domain count. Architects combine update domains with fault domains, load balancers, health probes, zones, and application retry logic so planned platform maintenance does not become a full application outage.
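The relationship between VM count and update domain count can be sketched in a few lines of Python. This is a minimal illustration that assumes simple round-robin placement in creation order; that placement rule, the function name `assign_update_domains`, and the VM names are assumptions made for the example, not a statement of how the platform schedules VMs.

```python
from collections import defaultdict


def assign_update_domains(vm_names, update_domain_count):
    """Spread VM names across update domains round-robin (illustrative assumption)."""
    if update_domain_count < 1:
        raise ValueError("update_domain_count must be at least 1")
    placement = defaultdict(list)
    for index, name in enumerate(vm_names):
        # The sixth VM wraps back to domain 0 when there are five domains.
        placement[index % update_domain_count].append(name)
    return dict(placement)


# Hypothetical tier of seven VMs in a set with five update domains.
vms = [f"web-{i}" for i in range(7)]
placement = assign_update_domains(vms, 5)
for domain, members in sorted(placement.items()):
    print(f"UD {domain}: {members}")
```

Under this assumption, once a tier has more VMs than update domains, at least one domain holds more than one VM, so a single maintenance group can take several replicas offline together.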

Why it matters

Update domains matter because planned platform maintenance is normal, not exceptional. Without deliberate placement, a small application can lose all replicas during one maintenance wave or require manual outage windows for routine host updates. Update domains let engineers design for partial loss and test whether the service survives one group being unavailable. They also shape runbooks: operators can drain traffic, watch health probes, and confirm quorum before the next group is affected. For learners, the term explains why high availability is not only about backups or regions. It gives teams a concrete way to discuss maintenance risk before users feel the impact.

Where you see it

Signals, screens, and Azure surfaces where this term usually becomes operational.

Signal 01

Azure CLI availability-set output shows platformUpdateDomainCount beside fault-domain values, making the maintenance-wave configuration visible for a specific resource group, region, and deployment review record.

Signal 02

ARM or Bicep templates define platformUpdateDomainCount on an availability set, and VM resources reference that set through availabilitySet IDs in repeatable infrastructure code reviews.

Signal 03

Incident timelines show VMs rebooting at different maintenance times, while load balancer health probes reveal whether remaining update domains carried real user traffic safely.
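A template review of the kind described in Signal 02 can be partly automated. The sketch below walks a small ARM-style template fragment and reports which availability sets declare platformUpdateDomainCount explicitly and which set each VM references. The template content, resource names, and the `review_template` helper are hypothetical and exist only for illustration.

```python
import json

# Hypothetical ARM template fragment; resource names are invented for illustration.
TEMPLATE = json.loads("""
{
  "resources": [
    {
      "type": "Microsoft.Compute/availabilitySets",
      "name": "web-avset",
      "properties": {"platformUpdateDomainCount": 5, "platformFaultDomainCount": 2}
    },
    {
      "type": "Microsoft.Compute/availabilitySets",
      "name": "api-avset",
      "properties": {"platformFaultDomainCount": 2}
    },
    {
      "type": "Microsoft.Compute/virtualMachines",
      "name": "web-01",
      "properties": {
        "availabilitySet": {
          "id": "[resourceId('Microsoft.Compute/availabilitySets', 'web-avset')]"
        }
      }
    }
  ]
}
""")


def review_template(template):
    """Return (explicit update domain counts, sets missing the count, VM references)."""
    explicit, missing, vm_refs = {}, [], {}
    for resource in template.get("resources", []):
        props = resource.get("properties", {})
        if resource["type"] == "Microsoft.Compute/availabilitySets":
            count = props.get("platformUpdateDomainCount")
            if count is None:
                missing.append(resource["name"])  # count left to a default
            else:
                explicit[resource["name"]] = count
        elif resource["type"] == "Microsoft.Compute/virtualMachines":
            vm_refs[resource["name"]] = props.get("availabilitySet", {}).get("id")
    return explicit, missing, vm_refs


explicit, missing, vm_refs = review_template(TEMPLATE)
print("explicit counts:", explicit)
print("count not declared:", missing)
print("VM references:", vm_refs)
```

Flagging sets that omit the count is a design choice for this sketch: it makes the maintenance-group configuration a reviewed value rather than an inherited default.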

When this becomes relevant

Specific situations where this term helps solve real Azure design, operations, migration, security, reliability, cost, or governance problems.

  • Design a two-tier VM application so planned Azure maintenance drains only one replica group at a time.
  • Audit legacy availability sets before a migration to decide whether zones, scale sets, or current placement best fits the workload.
  • Validate that every manually created VM joined the intended availability set instead of silently reducing maintenance resilience.
  • Plan rolling application patching that aligns with Azure placement boundaries and load balancer health probes.
  • Explain why a maintenance notice affected some VM instances while sibling instances in other domains stayed online.
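The membership validation in the list above can be sketched as a small check over the JSON form of an `az vm list` query that returns each VM's name and availability set ID. The sample data, subscription ID, and the helper names `availability_set_name` and `find_misplaced` are hypothetical. Azure resource IDs are case-insensitive, so the comparison lowercases both sides.

```python
import json

# Hypothetical JSON output of an `az vm list` query returning name and availability set ID.
SAMPLE = """
[
  {"name": "web-01", "availabilitySet": "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-pay/providers/Microsoft.Compute/availabilitySets/WEB-AVSET"},
  {"name": "web-02", "availabilitySet": "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-pay/providers/Microsoft.Compute/availabilitySets/web-avset"},
  {"name": "web-03", "availabilitySet": null},
  {"name": "web-04", "availabilitySet": "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-pay/providers/Microsoft.Compute/availabilitySets/other-avset"}
]
"""


def availability_set_name(resource_id):
    """Return the lowercased final segment of an availability set ID, or None."""
    if not resource_id:
        return None
    return resource_id.rstrip("/").split("/")[-1].lower()


def find_misplaced(vms, expected_set):
    """List VMs that are outside the expected availability set or in no set at all."""
    expected = expected_set.lower()
    return [
        vm["name"]
        for vm in vms
        if availability_set_name(vm.get("availabilitySet")) != expected
    ]


misplaced = find_misplaced(json.loads(SAMPLE), "web-avset")
print("VMs outside web-avset:", misplaced)
```

A check like this fits in a rebuild pipeline as a preflight step, since a VM cannot be attached to an availability set after it has been created.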

Real-world case studies

Different enterprise-style examples that show the term being used to hit measurable objectives.

Case study 01

Municipal payment system survives host maintenance

Scenario

A city tax agency ran its payment portal on four Windows VMs in one region. During annual tax week, a planned platform update had previously restarted both web servers within minutes, forcing clerks to take phone payments manually.

Business/Technical Objectives
  • Keep online payments available during Azure planned maintenance.
  • Document which web and API nodes could restart together.
  • Reduce emergency maintenance calls during peak filing weeks.
  • Give auditors evidence that availability design matched the runbook.
Solution Using Update domain

The infrastructure team rebuilt the portal into two availability sets, one for web nodes and one for API nodes, each with five update domains. They placed every VM behind Azure Load Balancer with health probes and used CLI inventory to prove no replica was deployed outside its set. Application patching windows were also aligned to the same grouping, so the team could drain one domain, wait for healthy probes, and continue to the next. Monitoring workbooks tracked backend health, CPU headroom, and restart timestamps during platform notices.

Results & Business Impact
  • Unplanned payment interruptions during maintenance dropped from three events per year to zero in the next filing cycle.
  • Maintenance readiness review time fell from four hours to forty minutes because placement evidence came from repeatable CLI exports.
  • The portal handled one web domain offline with latency remaining under the 300 ms internal target.
Key Takeaway for Glossary Readers

Update domains turn platform maintenance from a surprise restart into a placement-aware operating model that teams can test and explain.

Case study 02

Game matchmaking tier keeps players connected

Scenario

A multiplayer game studio hosted lobby and matchmaking services on Linux VMs. A maintenance wave once restarted enough lobby nodes to disconnect thousands of players during a weekend tournament.

Business/Technical Objectives
  • Avoid simultaneous restarts of active matchmaking replicas.
  • Keep tournament lobby reconnects below the support escalation threshold.
  • Verify VM placement after auto-rebuild scripts replaced unhealthy nodes.
Solution Using Update domain

Engineers mapped the service into an availability set with explicit update domain count and added a preflight check to the VM rebuild pipeline. Every node joined the correct load balancer backend pool, and session state moved to a managed cache so any single update domain could restart without losing the full lobby. Azure CLI commands listed availability set membership after each image rollout. A dashboard compared domain placement, active sessions, and reconnect rate so the on-call team knew whether a degradation was a platform wave or an application release. The revised drill recorded owner approval and rollback evidence.

Results & Business Impact
  • Tournament disconnect spikes fell by 82 percent compared with the previous maintenance period.
  • Bad rebuilds were caught in deployment validation instead of after players reported missing lobbies.
  • The operations team reduced weekend bridge time by roughly two hours per release window.
Key Takeaway for Glossary Readers

Update domains are most useful when application state, load balancing, and rebuild automation are designed to respect them.

Case study 03

Factory control gateway avoids line stoppage

Scenario

A manufacturer used Azure-hosted gateway VMs to relay telemetry between plants and a central analytics system. When both gateways restarted together, production engineers lost near-real-time visibility into quality defects.

Business/Technical Objectives
  • Maintain at least one gateway path during Azure maintenance.
  • Preserve telemetry buffering without requiring plant firewall changes.
  • Cut incident triage time for gateway restarts by half.
  • Make VM placement visible to the operational technology team.
Solution Using Update domain

The cloud team placed gateway VMs in an availability set and documented their update domain distribution in the plant operations runbook. Each gateway used managed disks, a shared message buffer, and a load-balanced endpoint with health probes. CLI checks were added to the monthly review pack, showing availability set membership, VM power state, and recent restarts. The design did not claim regional disaster recovery, but it ensured Azure planned maintenance would not remove both gateway paths at once. The runbook also named a rollback owner and alert channel.

Results & Business Impact
  • Telemetry blind spots during planned maintenance dropped from twenty minutes to under two minutes.
  • The plant support team could identify platform restarts without waiting for cloud escalation.
  • Quality analysts avoided an estimated 12 hours of manual reconciliation per quarter.
Key Takeaway for Glossary Readers

Update domains help bridge cloud maintenance behavior and real-world operational continuity for VM-based systems.

Why use Azure CLI for this?

Azure CLI is valuable for update domains because the portal hides too much placement detail when you are under pressure. As an Azure engineer, I use CLI to inventory which VMs belong to an availability set, confirm the platform update domain count, pull instance-view data, and compare that design with load balancer membership. The command line also makes evidence repeatable: export JSON before a maintenance window, check drift after a rebuild, and prove whether a VM was accidentally deployed outside the intended availability boundary. That repeatable evidence is what keeps readiness reviews honest.

CLI use cases

  • Inventory every availability set and confirm update domain counts before a planned maintenance window.
  • Find VMs missing availability set membership after manual rebuilds, image migrations, or emergency deployments.
  • Create a correctly shaped availability set from automation instead of relying on inconsistent portal defaults.
  • Export placement evidence for architecture review, incident analysis, or compliance change records.

Before you run CLI

  • Confirm the tenant, subscription, resource group, and availability set name because similarly named VM groups often exist across environments.
  • Use read-only commands first; creating a new availability set does not move existing VMs and can create misleading empty resources.
  • Check regional limits and deployment constraints before changing architecture, because update domains are tied to placement decisions made at VM creation.
  • Choose JSON output for automation and table output for human review during maintenance calls.

What output tells you

  • platformUpdateDomainCount shows how many planned-maintenance groups the availability set can spread instances across.
  • The VM list query shows whether each VM is actually attached to an availability set or running outside the intended design.
  • Instance view data helps correlate VM health, power state, and restart evidence with the maintenance timeline.
  • Resource IDs confirm that automation is checking the correct availability set in the correct subscription and region.
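Reading placement out of instance-view data can be sketched as follows. The sample is a trimmed, hypothetical document shaped like the JSON that `az vm get-instance-view` returns for a VM in an availability set; the VM name and values are invented, and `placement_summary` is a helper written for this example.

```python
import json

# Trimmed, hypothetical instance-view document for one VM in an availability set.
SAMPLE = """
{
  "name": "web-01",
  "instanceView": {
    "platformUpdateDomain": 1,
    "platformFaultDomain": 0,
    "statuses": [
      {"code": "ProvisioningState/succeeded", "displayStatus": "Provisioning succeeded"},
      {"code": "PowerState/running", "displayStatus": "VM running"}
    ]
  }
}
"""


def placement_summary(vm):
    """Extract update domain, fault domain, and power state from an instance view."""
    view = vm.get("instanceView") or {}
    power_state = "unknown"
    for status in view.get("statuses", []):
        code = status.get("code", "")
        if code.startswith("PowerState/"):
            power_state = code.split("/", 1)[1]
    return {
        "name": vm.get("name"),
        "updateDomain": view.get("platformUpdateDomain"),
        "faultDomain": view.get("platformFaultDomain"),
        "powerState": power_state,
    }


summary = placement_summary(json.loads(SAMPLE))
print(summary)
```

Collecting one summary per VM before and after a maintenance window gives a compact record to compare against the incident timeline.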

Mapped Azure CLI commands

Update domain CLI commands

adjacent
az vm availability-set show --resource-group <resource-group> --name <availability-set>
    az vm availability-set | discover | Compute
az vm availability-set list --resource-group <resource-group> --output table
    az vm availability-set | discover | Compute
az vm list --resource-group <resource-group> --query "[].{name:name,availabilitySet:availabilitySet.id}" --output table
    az vm | discover | Compute
az vm get-instance-view --resource-group <resource-group> --name <vm-name>
    az vm | discover | Compute
az vm availability-set create --resource-group <resource-group> --name <availability-set> --platform-update-domain-count 5 --platform-fault-domain-count 2
    az vm availability-set | provision | Compute

Architecture context

Architecturally, update domains are a local availability construct, not a disaster recovery strategy. They reduce the blast radius of Azure-initiated maintenance inside one placement model. I usually explain them beside fault domains: fault domains reduce simultaneous hardware-failure risk, while update domains reduce simultaneous planned-maintenance risk. A mature design still needs load-balanced instances, stateless or replicated application tiers, database quorum planning, monitoring, and deployment automation. For new builds, availability zones often deserve first consideration, but update domains remain important for workloads that still use availability sets or legacy VM patterns. Put the degraded-state capacity target and rollback owner in the architecture review.

Security

Security impact is indirect because an update domain does not grant access, encrypt data, or open network paths. Risk appears when maintenance reduces capacity and teams bypass controls to restore service quickly. A weak availability design can push operators toward emergency local administrator access, unmanaged snapshots, or temporary firewall changes during a degraded window. Keep RBAC, Just-in-Time VM access, disk encryption, managed identities, and change approvals intact. Maintenance-aware design should reduce panic, not become an excuse for weakening identity or network boundaries. Access to change placement belongs with tightly governed compute administrators, and response plans should preserve controls while capacity is reduced.

Cost

Update domains do not create a separate line item, but they influence cost through the number of instances required for resilience. A workload that must stay available during planned maintenance needs extra VM capacity, load balancing, monitoring, and sometimes licensed software on each replica. The alternative is cheaper on paper and expensive during downtime. FinOps reviews should separate wasted idle replicas from intentional availability capacity. Rightsizing still matters, but removing a node without understanding update domain coverage can convert planned maintenance into lost revenue. Compare the resilience spend with downtime cost, service level commitments, and modernization options such as zones or managed services.

Reliability

Reliability is the core reason update domains exist. They reduce the chance that planned maintenance removes every instance of a tier at once. The pattern works only when the application has enough replicas, the load balancer can detect unhealthy nodes, and state is externalized or replicated safely. Two VMs in one availability set are better than two uncoordinated VMs, but they still need probes, retries, quorum awareness, and tested restart behavior. Treat update domains as one layer in a broader continuity design. Run reduced-capacity drills so the platform does not become the first real test. Document expected degraded capacity before every window.
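The replica arithmetic behind this can be sketched with a worst-case calculation. It assumes instances are spread as evenly as possible across update domains and that exactly one domain is offline at a time; the function name and the quorum figures are illustrative assumptions.

```python
import math


def worst_case_survivors(instance_count, update_domain_count):
    """Instances left when the fullest update domain is offline (even spread assumed)."""
    if instance_count < 1 or update_domain_count < 1:
        raise ValueError("counts must be at least 1")
    domains_in_use = min(instance_count, update_domain_count)
    largest_domain = math.ceil(instance_count / domains_in_use)
    return instance_count - largest_domain


# Hypothetical three-node cluster that needs two nodes for quorum.
quorum = 2
for domains in (5, 2):
    survivors = worst_case_survivors(3, domains)
    verdict = "quorum holds" if survivors >= quorum else "quorum lost"
    print(f"3 nodes, {domains} update domains: {survivors} survive, {verdict}")
```

With three nodes in five domains, two nodes remain and quorum holds. With the same three nodes squeezed into two domains, the fuller domain takes two nodes with it and quorum is lost.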

Performance

Runtime performance is affected indirectly. When one update domain is unavailable, the remaining instances must absorb traffic without breaching latency or queue limits. If normal utilization is already high, planned maintenance exposes the bottleneck immediately. Performance testing should model one domain drained or rebooted, not only steady-state load across every VM. Watch CPU, memory, connection counts, backend health, request latency, retries, and queue depth during rolling maintenance. A healthy design has enough headroom for the reduced-capacity period and predictable warmup behavior when the drained group returns. Capacity tests should include cache warmup, dependency latency, retry storms, and operator response time during drills.
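The headroom question can be framed as simple arithmetic. The sketch below assumes traffic redistributes evenly across the remaining instances and that utilization scales linearly with load. Both are simplifications, and the 60 percent starting point and 75 percent ceiling are invented numbers for illustration.

```python
def drained_utilization(steady_utilization, instance_count, instances_offline):
    """Estimate per-instance utilization after some instances go offline."""
    remaining = instance_count - instances_offline
    if remaining <= 0:
        raise ValueError("at least one instance must remain online")
    # The same total load is shared by fewer instances.
    return steady_utilization * instance_count / remaining


# Hypothetical tier: four instances at 60 percent, one update domain (one VM) drained.
estimate = drained_utilization(0.60, 4, 1)
ceiling = 0.75  # hypothetical utilization ceiling for acceptable latency
print(f"estimated utilization with one domain drained: {estimate:.0%}")
print("within headroom" if estimate <= ceiling else "exceeds headroom")
```

A tier that looks comfortable in steady state can exceed its ceiling once one domain is drained, which is why the reduced-capacity case belongs in the load test plan.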

Operations

Operators use update domain information before maintenance windows, platform notices, VM rebuilds, and availability reviews. The practical work is to list VMs in each availability set, confirm domain counts, verify every instance is behind the correct load balancer, and document which application roles can be down together. During troubleshooting, compare observed outages with placement data. If multiple failed nodes share a domain, the design may be working; if all active nodes sit in one domain, deployment automation needs correction. Include owners, diagrams, maintenance history, and verification commands in the runbook so evidence survives team turnover. That prevents readiness knowledge from disappearing.
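The troubleshooting comparison described here can be sketched as a grouping step followed by two checks. The record fields `updateDomain`, `role`, and `healthy` are hypothetical names for data an operator might assemble from instance views and health probes; they are not fields returned by a single Azure command.

```python
from collections import defaultdict


def group_by_update_domain(vms):
    """Map each update domain number to the names of the VMs placed in it."""
    groups = defaultdict(list)
    for vm in vms:
        groups[vm["updateDomain"]].append(vm["name"])
    return dict(groups)


def diagnose(vms):
    """Return findings that compare observed failures with placement data."""
    active = [vm for vm in vms if vm["role"] == "active"]
    failed = [vm for vm in vms if not vm["healthy"]]
    findings = []
    if len(active) > 1 and len({vm["updateDomain"] for vm in active}) == 1:
        findings.append("all active nodes share one update domain; review deployment automation")
    if len(failed) > 1 and len({vm["updateDomain"] for vm in failed}) == 1:
        findings.append("failed nodes share one update domain; consistent with a maintenance wave")
    return findings


# Hypothetical inventory assembled during an incident.
inventory = [
    {"name": "lobby-0", "updateDomain": 0, "role": "active", "healthy": False},
    {"name": "lobby-1", "updateDomain": 0, "role": "active", "healthy": False},
    {"name": "lobby-2", "updateDomain": 1, "role": "passive", "healthy": True},
]
groups = group_by_update_domain(inventory)
findings = diagnose(inventory)
print(groups)
for finding in findings:
    print("-", finding)
```

In this sample both findings fire: the failures line up with one domain, which suggests the platform behaved as designed, while the active path was concentrated in that same domain, which points at placement automation.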

Common mistakes

  • Assuming update domains protect against regional outages; they only reduce planned maintenance impact within the placement model.
  • Creating the availability set after the VMs already exist and expecting Azure to move them automatically.
  • Placing active and passive nodes so that the entire active path is concentrated in one update domain.
  • Ignoring load balancer health probes, so traffic still reaches a VM while its domain is being updated.