Spot eviction

Spot eviction is what happens when Azure takes back a Spot virtual machine because the spare capacity is needed elsewhere or the current Spot price exceeds the maximum price you set. Spot VMs can be much cheaper than regular VMs, but they are not guaranteed to stay available. Depending on the eviction policy, the VM is either stopped and deallocated or deleted. This makes Spot useful for flexible jobs, test environments, batch processing, and scale-out workers, but risky for systems that cannot tolerate sudden interruption.

Aliases: Spot eviction, spot eviction, spot-eviction
Difficulty: intermediate
CLI mappings: 5
Last verified: 2026-05-24

Microsoft Learn

Microsoft Learn explains that Azure Spot Virtual Machines can be stopped when Azure needs capacity for pay-as-you-go workloads or when the Spot price exceeds the maximum price you set. The eviction policy determines whether the VM is deallocated or deleted.
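
The two triggers and the policy outcome described above can be sketched as a small decision function. This is an illustrative model, not Azure's implementation; the function name and parameters are hypothetical. It assumes the documented convention that a maximum price of -1 means the VM is not evicted for price reasons.

```python
from typing import Optional


def eviction_action(spot_price: float, max_price: float,
                    capacity_reclaimed: bool, eviction_policy: str) -> Optional[str]:
    """Return the action applied to a Spot VM, or None if it keeps running."""
    # A max price of -1 is treated as "never evict on price", so the VM
    # is only exposed to capacity-based eviction.
    price_exceeded = max_price != -1 and spot_price > max_price
    if not (capacity_reclaimed or price_exceeded):
        return None
    if eviction_policy not in ("Deallocate", "Delete"):
        raise ValueError(f"unknown eviction policy: {eviction_policy}")
    # The policy decides what is left behind: a deallocated VM keeps its
    # disks, a deleted VM does not.
    return "deallocate" if eviction_policy == "Deallocate" else "delete"
```

For example, a VM with a maximum price of 0.04 is evicted when the Spot price rises to 0.05, while the same VM with a maximum price of -1 keeps running until capacity is reclaimed.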

Microsoft Learn: About Azure Spot Virtual Machines (2026-05-24)

Technical context

In Azure architecture, spot eviction belongs to the compute capacity and scheduling model for virtual machines and virtual machine scale sets. The control plane stores priority, eviction policy, maximum price, image, disk, network, identity, and scale-set configuration. The operational impact appears in the workload plane when an instance disappears, changes power state, or loses local progress. Teams usually connect Spot workloads to queues, autoscale rules, load balancers, managed disks, images, batch schedulers, checkpoint storage, and observability so interrupted work can be retried safely.

Why it matters

Spot eviction matters because the savings are real only when the workload is built for interruption. A training job that checkpoints every few minutes may tolerate eviction well. A database, stateful app, or customer-facing single instance may fail badly. The term forces engineers to ask whether the application can drain, retry, rebuild, or replace work after capacity disappears. It also affects cost because deallocated Spot VMs may still leave disks behind, while delete policies can remove disks and data. Good Spot design turns unused Azure capacity into cheaper compute without pretending it has standard VM availability. That distinction protects production commitments and belongs in every architecture review.

Where you see it

Signals, screens, and Azure surfaces where this term usually becomes operational.

Signal 01

The VM or scale set configuration shows Spot priority, eviction policy, maximum price, size, zone, image, disk settings, ownership tags, and whether the instance is deallocated.

Signal 02

Azure Activity Log and instance view output show eviction-related state changes, failed reallocations, power transitions, and cleanup actions tied to the affected compute resource, along with alerts raised during eviction drills.

Signal 03

Queue depth, batch job history, autoscale metrics, and cost reports reveal whether Spot eviction is delaying work, wasting retries, or leaving disks behind after scheduled runs and evictions.

When this becomes relevant

Specific situations where this term helps solve real Azure design, operations, migration, security, reliability, cost, or governance problems.

Run interruptible batch workers that process queue messages and checkpoint progress before an eviction can discard work.
Lower cost for CI runners, image builds, test labs, or ephemeral development environments that can be recreated quickly.
Add burst capacity to a baseline fleet of standard VMs when demand spikes but strict availability is not required.
Compare eviction rates by VM size, zone, and region before selecting a Spot pool for machine learning or rendering jobs.
Choose delete or deallocate policy deliberately so interrupted instances do not create surprise disk cost or data loss.
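
The checkpointing pattern in the first scenario can be sketched as follows. This is a minimal illustration under stated assumptions: a local JSON file stands in for durable checkpoint storage, and `process_range` and its `evict_at` test hook are hypothetical names used only to simulate an interruption.

```python
import json
import os
import tempfile


def process_range(start, end, checkpoint_path, evict_at=None):
    """Process items start..end-1, resuming from the last checkpoint.

    evict_at is a hypothetical test hook that simulates an eviction
    just before the given item is processed.
    """
    next_item = start
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            next_item = json.load(f)["next_item"]

    done = []
    for item in range(next_item, end):
        if evict_at is not None and item == evict_at:
            return done, False  # interrupted before finishing the range
        done.append(item)  # stand-in for the real unit of work
        # Write to a temp file and rename it, so an interruption never
        # leaves a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"next_item": item + 1}, f)
        os.replace(tmp_path, checkpoint_path)
    return done, True


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
    print(process_range(0, 10, path, evict_at=4))  # first worker is evicted
    print(process_range(0, 10, path))              # replacement resumes
```

The replacement worker repeats no completed items because the checkpoint records the next item to process, not the last one started.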

Real-world case studies

Different enterprise-style examples that show the term being used to hit measurable objectives.

Case study 01

Animation studio finishes render bursts without owning peak capacity

Scenario, objectives, solution, measured impact, and takeaway.

Scenario

A boutique animation studio had weekly render bursts before client reviews. Standard VMs finished the work, but peak capacity sat idle for most of the month.

Business/Technical Objectives

Cut render compute cost without delaying Friday client deliveries.
Keep completed frames safe if a worker was evicted mid-scene.
Avoid storage charges from abandoned deallocated workers.
Measure completed frames per dollar instead of VM hourly price alone.

Solution Using Spot eviction

The pipeline kept a small standard VM scale set for baseline rendering and added Spot workers for burst scenes. Each worker pulled one frame range from a queue, wrote completed frames to Azure Storage, and checkpointed partial progress every few minutes. The team chose delete eviction policy because no required output lived on the VM after checkpointing. Azure CLI scripts listed Spot instances, confirmed priority and eviction policy, and checked for leftover disks after each delivery window. Metrics combined queue backlog, eviction events, completed frames, and storage cost.

Results & Business Impact

Peak render cost fell 47 percent across six production cycles.
Client review packages stayed on schedule for 11 consecutive Fridays.
Lost work after eviction averaged under four minutes per affected worker.
Unattached disk cleanup dropped from 19 manual findings to zero.

Key Takeaway for Glossary Readers

Spot eviction is manageable when every unit of work is checkpointed, replaceable, and separated from durable output storage.

Case study 02

Genomics lab controls sequencing analysis spend

Scenario

A university genomics lab ran alignment and variant-calling jobs that could pause safely but consumed expensive compute during grant-funded research peaks. Researchers needed lower cost without hiding failed work.

Business/Technical Objectives

Reduce compute cost for interruptible analysis stages by at least 35 percent.
Preserve intermediate files and audit logs when workers disappeared.
Keep urgent clinical validation jobs off volatile capacity.
Show grant managers clear usage by project and VM priority.

Solution Using Spot eviction

The platform team split the workflow into priority tiers. Clinical validation used standard VMs, while exploratory batch stages used Spot instances behind a scheduler that retried from durable checkpoints in Data Lake Storage. Deallocate policy was used for selected debugging pools where disks had to be inspected, while routine pools used delete policy. CLI reports exported VM priority, owner tags, eviction policy, and stopped-deallocated counts every morning. When eviction frequency rose for one VM size, the team switched the Spot pool to a more available size and limited concurrent retries.

Results & Business Impact

Exploratory analysis compute spend fell 41 percent in the first semester.
Clinical validation jobs maintained the same two-hour service-level target.
Grant cost reports separated standard and Spot usage with 96 percent tag coverage.
Debug disk retention was limited to approved pools, reducing orphaned disk cost by 68 percent.

Key Takeaway for Glossary Readers

Spot eviction supports research economics when workload priority, checkpointing, disk policy, and reporting are designed together.

Case study 03

SaaS vendor stops CI runners from becoming surprise infrastructure

Scenario

A SaaS vendor used Spot VMs for continuous integration runners. Builds were cheap, but evicted and deallocated runners left disks and stale registrations that confused deployment teams.

Business/Technical Objectives

Keep CI cost savings while removing abandoned runners automatically.
Prevent stale agents from appearing available in the build system.
Protect release builds from Spot volatility during scheduled launches.
Reduce manual cleanup performed by platform engineers each Monday.

Solution Using Spot eviction

Engineers moved normal feature-branch builds to a Spot scale set with delete eviction policy and short-lived runner registration tokens. Release builds used standard priority capacity during launch windows. A cleanup script used Azure CLI to list Spot instances, compare power state, remove stale runner records, and verify that deleted workers did not leave managed disks. Build definitions checkpointed cache artifacts to storage rather than the VM. Dashboards tracked eviction count, build retry count, average queue time, and storage leftovers.

Results & Business Impact

Feature-branch build compute cost fell 52 percent month over month.
Stale runner records dropped from 73 per month to fewer than 5.
Release build queue time stayed under 10 minutes during three product launches.
Manual cleanup fell from four engineer-hours per week to 20 minutes of review.

Key Takeaway for Glossary Readers

Spot eviction is a good fit for CI only when runner identity, cache storage, cleanup, and release exceptions are automated.

Why use Azure CLI for this?

After years of operating Azure compute fleets, I use Azure CLI for spot eviction checks because the important settings are easy to overlook in the portal. CLI can show priority, eviction policy, maximum price, scale-set configuration, instance state, zones, disk retention, and tags across many resource groups. It is also the fastest way to compare production and test definitions, detect drift from infrastructure code, and gather evidence during an incident where workers vanished. For Spot, automation matters because replacement, cleanup, and capacity checks need to happen faster than a manual portal investigation. It also supports automated shutdown and cleanup reviews. That repeatability matters when capacity disappears during releases.

CLI use cases

List VMs or scale-set instances using Spot priority and export their eviction policy, size, zone, and tags.
Show instance view after a job interruption to confirm whether the VM was evicted, deallocated, stopped, or deleted.
Compare maximum price and eviction policy between infrastructure code and the currently deployed fleet.
Find deallocated Spot VMs and unattached disks that are still creating storage cost after eviction.
Update scale-set capacity or move workers to standard priority during a high-priority processing window.

Before you run CLI

Confirm subscription, resource group, VM or scale set name, and whether the workload is production, test, or batch-only.
Use read permission (for example, the Reader role) for inventory, but require contributor-level compute rights (for example, Virtual Machine Contributor) for delete, deallocate, capacity, or policy changes.
Understand the eviction policy before cleanup because deleting a Spot VM can remove disks that the design expects to persist.
Check queue backlog, checkpoint location, and baseline standard capacity before changing a Spot fleet during active processing.
Use JSON output when scripting cleanup, because table output can hide disk IDs, priority values, and provisioning details.

What output tells you

Priority shows whether the VM is Spot or regular capacity, which determines whether eviction risk applies.
Eviction policy tells you whether Azure will deallocate the VM or delete it when capacity or price conditions trigger eviction.
Maximum price, size, region, and zone help explain why some workers are interrupted more often than others.
Power state and provisioning state show whether an instance is running, stopped-deallocated, failed, or waiting for allocation.
Disk and network resource IDs show what may remain billable or require cleanup after an eviction event.
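
As a rough sketch of how these fields might be read in a script, the function below filters JSON output such as that from `az vm list --show-details --output json`. The `priority`, `evictionPolicy`, and `powerState` field names follow the queries mapped on this page; the `billingProfile.maxPrice` path, the `"VM deallocated"` power-state string, and the `spot_inventory` helper itself are assumptions to verify against your own output.

```python
import json


def spot_inventory(az_vm_list_json: str):
    """Summarize Spot VMs from JSON text shaped like `az vm list -d` output."""
    rows = []
    for vm in json.loads(az_vm_list_json):
        if vm.get("priority") != "Spot":
            continue  # regular-priority VMs carry no eviction risk
        power_state = vm.get("powerState")
        rows.append({
            "name": vm["name"],
            "evictionPolicy": vm.get("evictionPolicy"),
            # billingProfile may be missing or null when no max price is set
            "maxPrice": (vm.get("billingProfile") or {}).get("maxPrice"),
            "powerState": power_state,
            # Deallocated Spot VMs keep their disks, so flag them for review
            "needsCleanupReview": power_state == "VM deallocated",
        })
    return rows
```

Feeding it saved JSON output keeps the check read-only, which fits inventory work done with read permission.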

Mapped Azure CLI commands

Spot VM eviction and cleanup checks

az vm show --name <vm-name> --resource-group <resource-group> --query "{priority:priority,evictionPolicy:evictionPolicy,provisioningState:provisioningState}"

az vm get-instance-view --name <vm-name> --resource-group <resource-group>

az vm list --resource-group <resource-group> --query "[?priority=='Spot'].[name,powerState,evictionPolicy]" --show-details

az vmss show --name <scale-set> --resource-group <resource-group>

az disk list --resource-group <resource-group> --query "[?managedBy==null].[name,diskSizeGb,sku.name,tags]" --output table
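
The same unattached-disk filter can be applied in a script to saved JSON output, which is useful when a cleanup report needs totals rather than a table. This sketch assumes the `managedBy` and `diskSizeGb` field names used in the query above; the helper names are hypothetical.

```python
def unattached_disks(disks):
    """Return disk records with no owning VM (managedBy is empty or null)."""
    return [d for d in disks if not d.get("managedBy")]


def total_unattached_gb(disks):
    """Sum the provisioned size of unattached disks, treating null sizes as 0."""
    return sum(d.get("diskSizeGb") or 0 for d in unattached_disks(disks))
```

An unattached disk is only a cleanup candidate; confirm that it is not an intentionally retained debugging disk before deleting it.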

Architecture context

Architecturally, Spot eviction should be designed around disposable capacity. I would place Spot instances behind a queue, batch service, scale set, or orchestrator that understands retry and replacement. Persistent data belongs on durable storage, not on the temporary disk or in a process that cannot checkpoint. Load balancers and health probes should remove interrupted instances quickly. For mixed fleets, regular VMs provide the baseline and Spot adds burst capacity. The design should document eviction policy, maximum price behavior, cleanup steps, quota, zone availability, and what happens when Azure cannot immediately allocate replacement capacity. Replacement capacity should be considered probabilistic, not guaranteed. This avoids confusing savings with resilience goals.

Security

Security impact is indirect but still important. Eviction itself is a capacity event, not an access-control feature, but the response can create risk. A delete policy may remove disks that contain evidence needed for investigation. A deallocate policy may leave disks, public IPs, managed identities, or attached resources behind longer than expected. Replacement automation can accidentally create instances with weaker network rules, old images, or broad managed identities. Teams should use least-privilege identities, hardened images, disk encryption, tag-based cleanup, and policy controls so interrupted Spot capacity does not become unmanaged infrastructure. Cleanup automation should preserve evidence required by security teams. Cleanup is part of secure recovery after interruption too.

Cost

Cost impact is the main reason teams use Spot, but eviction policy changes the economics. Spot compute can be cheaper while it runs, yet deallocated instances can still leave managed disks and snapshots billable. Delete policy can reduce leftover storage cost but may remove data if the workload was not designed correctly. Frequent eviction can also increase cost by causing repeated startup, lost work, longer job windows, and operator time. FinOps reviews should compare actual completed work per dollar, not only VM hourly price, and should tag Spot capacity separately from standard baseline compute. Storage leftovers often erase expected savings. Measure finished work, not raw allocation.
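
The "completed work per dollar" comparison can be made concrete with a short calculation. The prices, hours, and unit counts below are hypothetical illustration values, not Azure pricing, and `cost_per_completed_unit` is a name introduced for this sketch.

```python
def cost_per_completed_unit(hourly_price, vm_hours, completed_units,
                            leftover_storage_cost=0.0):
    """Total spend (compute plus leftover storage) divided by finished work."""
    if completed_units <= 0:
        raise ValueError("completed_units must be positive")
    return (hourly_price * vm_hours + leftover_storage_cost) / completed_units


if __name__ == "__main__":
    # Hypothetical job: 1000 units of work.
    standard = cost_per_completed_unit(0.40, 100, 1000)
    # Spot is cheaper per hour but needs extra hours for retries and
    # leaves some billable disks behind.
    spot = cost_per_completed_unit(0.10, 130, 1000, leftover_storage_cost=5.0)
    print(f"standard: {standard:.3f} per unit, spot: {spot:.3f} per unit")
```

In this made-up case Spot still wins, but the margin is smaller than the hourly price gap suggests because retries and leftover storage are counted.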

Reliability

Reliability impact is direct because eviction can happen with limited warning and no guarantee that the same capacity will return. Workloads must tolerate lost instances, incomplete jobs, and changing scale. Reliable Spot use depends on checkpointing, idempotent processing, queue visibility timeouts, retries, graceful shutdown handling, and enough non-Spot baseline capacity for critical work. Scale sets should define health probes and replacement behavior, while operators monitor eviction frequency by size and region. Disaster recovery plans should not depend on Spot capacity being available during a regional stress event because that is exactly when spare capacity may shrink. Test evictions before trusting the savings and before wider production use.
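
Graceful shutdown handling usually starts with detecting the eviction notice. One common approach, sketched here, is to poll the Azure Scheduled Events endpoint of the instance metadata service from inside the VM (`http://169.254.169.254/metadata/scheduledevents` with the `Metadata: true` header) and look for `Preempt` events. The function below only parses a response document and makes no network call; `pending_preemptions` is a hypothetical helper name, and the sample document in the usage note is invented.

```python
import json


def pending_preemptions(scheduled_events_json: str, vm_name: str):
    """Return Preempt events in a Scheduled Events document that target vm_name."""
    doc = json.loads(scheduled_events_json)
    return [
        event for event in doc.get("Events", [])
        if event.get("EventType") == "Preempt"
        and vm_name in event.get("Resources", [])
    ]
```

A worker that finds a pending `Preempt` event for its own name would stop taking new queue messages, flush its checkpoint, and exit, since the notice window is short.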

Performance

Performance impact is not only raw VM speed. Eviction interrupts throughput, increases queue backlog, delays batch completion, and can make apparent performance unstable when capacity churns. A workload may run quickly on Spot until several workers disappear, after which retries and replacement delay dominate the timeline. Larger or rarer VM sizes can have different eviction patterns than common sizes. Operators should measure completed jobs per hour, checkpoint overhead, warm-up time, and backlog recovery after eviction. Mixed fleets often perform better than all-Spot designs because baseline capacity keeps progress moving during capacity pressure. Recovery time after churn should be measured explicitly. Throughput reports should include retries and restarts too.

Operations

Operators inspect Spot eviction through VM priority settings, eviction policy, instance view, activity logs, scale-set instance states, metrics, and workload queues. Day-two work includes checking which jobs were interrupted, cleaning up deallocated instances or orphaned disks, adjusting maximum price, and deciding whether a size or region has become too volatile. Incident response should look at eviction events beside application retries, queue backlog, and autoscale actions. Runbooks should define when to pause low-priority jobs, move to standard VMs, change the scale mix, or reduce worker count to protect the budget. Operators should review eviction trends before expanding Spot coverage. Review cleanup queues weekly. This prevents cost waste and incident confusion after capacity loss events.

Common mistakes

Putting stateful databases or single-instance customer services on Spot VMs and treating eviction as a rare edge case.
Choosing deallocate policy without cleaning up stopped Spot instances and their billable managed disks.
Using delete policy before confirming that checkpoints, logs, and required outputs are stored away from the VM.
Scaling all workers as Spot with no standard baseline, causing the entire job to stall during capacity pressure.
Measuring savings by hourly VM price while ignoring repeated retries, lost work, longer run windows, and operator cleanup time.

Operator quick checks

List Spot VMs and scale-set instances, then verify priority, eviction policy, maximum price, size, zone, and owner tags.
Check for stopped-deallocated Spot instances and unattached disks after recent eviction events.
Review queue backlog and checkpoint success before changing scale or deleting interrupted workers.
Confirm the workload can retry idempotently and does not depend on temporary disk data after eviction.
Compare recent eviction frequency by VM size and region before expanding the Spot pool.
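
The last check can be approximated with a small aggregation. The record shape here (`size`, `region`, `evicted`) is a hypothetical format a team might assemble from Activity Log exports or scheduler history; it is not a format Azure emits directly.

```python
from collections import Counter


def eviction_rate_by_pool(instances):
    """Map (size, region) to the fraction of launched instances that were evicted."""
    launched = Counter()
    evicted = Counter()
    for inst in instances:
        key = (inst["size"], inst["region"])
        launched[key] += 1
        if inst["evicted"]:
            evicted[key] += 1
    # Every key in the result has at least one launch, so no division by zero.
    return {key: evicted[key] / launched[key] for key in launched}
```

Pools with only a handful of launches produce noisy rates, so it helps to look at the launch counts alongside the ratio before switching sizes or regions.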

Questions to ask

What work is lost if this instance disappears during the busiest five minutes of the job?
Which resources remain billable after eviction, and who is responsible for cleanup?
What baseline capacity continues processing if every Spot worker in the region is evicted?
How quickly can the system checkpoint, drain, retry, or replace interrupted work?
What monitoring proves the savings still outweigh lost work, retries, and operational effort?
