Provisioned throughput for AI

Provisioned throughput for AI is the Azure model-serving option you choose when a generative AI workload needs reserved capacity instead of only shared capacity. You assign provisioned throughput units, often called PTUs, to a supported deployment so the application has a steadier amount of processing power available. It is useful for customer-facing assistants, agents, and batch workloads with known demand. It also creates a billing commitment, so teams must size it from traffic data rather than optimism.

Aliases: No aliases mapped yet
Difficulty: advanced
CLI mappings: 6
Last verified: 2026-05-20

Microsoft Learn

Provisioned throughput for AI in Microsoft Foundry allocates provisioned throughput units to supported model deployments. PTUs give reserved processing capacity for predictable latency and throughput, with deployment choices such as regional, data-zone, or global provisioned capacity and separate reservation-based billing options.

Microsoft Learn: What is provisioned throughput for Foundry Models? (2026-05-20)

Technical context

In Azure architecture, provisioned throughput for AI sits on the model deployment inside Microsoft Foundry or Azure OpenAI. The deployment uses a provisioned SKU, capacity value, supported model, and region or data-zone choice. Application code still sends inference requests to the deployment endpoint, but platform teams manage PTU quota, utilization, latency, reservations, monitoring, and change control. The concept connects AI engineering with cloud capacity management: prompt length, output length, concurrency, and retry behavior all affect how much provisioned capacity the workload actually consumes.
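The link between prompt shape and consumed capacity can be made concrete with a little arithmetic. The sketch below is illustrative only: `WorkloadProfile` and `tokens_per_minute` are hypothetical names, and the traffic figures are made-up inputs rather than measurements from any real deployment.

```python
from dataclasses import dataclass


@dataclass
class WorkloadProfile:
    """Hypothetical description of a workload at peak."""
    requests_per_minute: float   # peak request rate, not user count
    avg_prompt_tokens: float     # system prompt plus retrieved context
    avg_output_tokens: float     # generated tokens per response
    retry_fraction: float = 0.0  # share of requests that are re-sent, e.g. 0.05


def tokens_per_minute(profile: WorkloadProfile) -> dict:
    """Return input, output, and total token demand per minute, retries included."""
    effective_rpm = profile.requests_per_minute * (1.0 + profile.retry_fraction)
    input_tpm = effective_rpm * profile.avg_prompt_tokens
    output_tpm = effective_rpm * profile.avg_output_tokens
    return {
        "effective_rpm": effective_rpm,
        "input_tpm": input_tpm,
        "output_tpm": output_tpm,
        "total_tpm": input_tpm + output_tpm,
    }


# Same request rate, longer prompt: demand rises without a single extra user.
baseline = tokens_per_minute(WorkloadProfile(120, 1500, 300))
longer_prompt = tokens_per_minute(WorkloadProfile(120, 2500, 300))
print(baseline["total_tpm"], longer_prompt["total_tpm"])
```

Comparing the two results shows why a prompt-template change deserves the same capacity review as a traffic increase.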

Why it matters

Provisioned throughput for AI matters because serious AI applications often move from experimentation to committed user experiences. Once an assistant supports contact-center agents, claims reviewers, developers, or field technicians, unpredictable latency and throttling become business problems. PTUs help teams buy predictable serving capacity for supported models, but they also expose poor planning. Too few units cause bottlenecks; too many units create idle spend. The feature forces a mature conversation across engineering, SRE, FinOps, and product owners: what traffic do we expect, what latency do we promise, what model do we need, and how will we prove the capacity is being used well?

Where you see it

Signals, screens, and Azure surfaces where this term usually becomes operational.

Signal 01

Microsoft Foundry or Azure OpenAI deployment settings show a provisioned SKU, model name, version, capacity value, region, quota context, and provisioning state for the endpoint.

Signal 02

Azure CLI output for a cognitive services deployment includes SKU, capacity, model, version, and resource ID fields that operators export for quota, reservation, and drift reviews.

Signal 03

FinOps reports show PTU-related hourly charges, reservation coverage, idle capacity, utilization trends, model choices, environment tags, and owners tied to specific AI deployments and teams.

When this becomes relevant

Specific situations where this term helps solve real Azure design, operations, migration, security, reliability, cost, or governance problems.

  • Guarantee steadier latency for customer-facing AI assistants with predictable daily traffic patterns.
  • Support scheduled high-volume batch inference where completion windows matter more than burst flexibility.
  • Use reservation economics for mature AI workloads after measured utilization supports the commitment.
  • Separate production PTU deployments from experimentation so prototypes do not steal critical capacity.
  • Plan regional, data-zone, or global provisioned capacity for compliance, latency, and availability tradeoffs.

Real-world case studies

Different enterprise-style examples that show the term being used to hit measurable objectives.

Case study 01

Contact center assistant stabilizes agent answers

Scenario, objectives, solution, measured impact, and takeaway.

Scenario

A national appliance-repair company gave contact-center agents an AI assistant for warranty rules, troubleshooting steps, and part recommendations. Morning call volume was predictable, but standard model latency varied during peak queue times.

Business/Technical Objectives

  • Keep AI guidance under a three-second response target for agents.
  • Avoid overbuying capacity for overnight low-volume hours.
  • Tie PTU spend to the contact-center cost center.
  • Validate that private networking and diagnostics stayed intact after cutover.

Solution Using Provisioned throughput for AI

The platform team benchmarked realistic agent questions, including long warranty prompts and retrieved repair procedures, then deployed provisioned throughput for the production Azure OpenAI endpoint. The app routed only agent-assist traffic to the PTU deployment; experimentation stayed on standard capacity. Azure CLI captured SKU, capacity, model version, endpoint, tags, and diagnostic settings before and after cutover. Dashboards tracked PTU utilization, token rate, P95 latency, and abandoned-agent-assist requests. After two weeks, FinOps adjusted capacity based on measured utilization and tagged the deployment to the service desk budget.

Results & Business Impact

  • P95 AI guidance latency improved from 5.1 seconds to 2.7 seconds during morning peaks.
  • Agent-assist abandonment fell by 19% after latency stabilized.
  • Capacity was reduced by 12% after real token metrics replaced launch estimates.
  • Monthly PTU cost was assigned cleanly to the contact-center cost center.

Key Takeaway for Glossary Readers

PTUs are strongest when measured traffic, ownership, and latency targets are defined before the deployment is created.

Case study 02

Energy inspections finish image-report batches overnight

Scenario

An energy services firm processed turbine inspection notes and images into engineer-ready summaries. The work had to finish overnight so field crews could review exceptions before driving to remote sites.

Business/Technical Objectives

  • Complete nightly inference batches before the 6 a.m. dispatch meeting.
  • Avoid competing with daytime engineering-chat traffic.
  • Control capacity spend after seasonal inspection peaks.
  • Provide audit evidence for model, region, and deployment settings.

Solution Using Provisioned throughput for AI

Architects created a dedicated provisioned throughput deployment for the overnight batch workflow and left interactive engineering chat on a separate endpoint. The batch scheduler submitted work in controlled waves based on PTU utilization and paused when retries increased. Azure CLI exports captured deployment SKU, capacity, model version, region, and tags for compliance review. Application Insights correlated inference duration with document size and image count, while FinOps reviewed idle capacity after each inspection cycle. The team kept a smaller standard deployment as a fallback for nonurgent summaries in case provisioned capacity saturated.
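A wave-based scheduler of this kind can be reduced to one decision per cycle: grow, shrink, or pause. The following is a minimal sketch under assumed thresholds; `next_wave_size` and every default value are hypothetical and would need tuning against real utilization and retry metrics.

```python
def next_wave_size(current_size: int, utilization: float, retry_rate: float,
                   max_size: int = 200, target_utilization: float = 0.85,
                   retry_pause_threshold: float = 0.10) -> int:
    """Decide how many documents to submit in the next wave.

    utilization and retry_rate are fractions between 0 and 1.
    Returns 0 to pause when retries signal saturation.
    """
    if retry_rate >= retry_pause_threshold:
        return 0  # pause: retries are climbing, let in-flight work drain
    if utilization >= target_utilization:
        return max(1, current_size // 2)  # back off while capacity is busy
    # Capacity has headroom: grow by roughly a quarter, up to the ceiling.
    return min(max_size, current_size + max(1, current_size // 4))


print(next_wave_size(40, utilization=0.60, retry_rate=0.01))  # grows
print(next_wave_size(40, utilization=0.90, retry_rate=0.01))  # shrinks
print(next_wave_size(40, utilization=0.60, retry_rate=0.20))  # pauses
```

Halving on pressure and growing slowly on headroom is a conservative choice: it favors finishing the batch steadily over chasing peak throughput.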

Results & Business Impact

  • Nightly report completion improved from 82% to 98% before the dispatch meeting.
  • Daytime engineering-chat latency no longer spiked during leftover batch processing.
  • Seasonal capacity was removed after the inspection window, avoiding about $21,000 in idle spend.
  • Compliance reviewers received repeatable deployment evidence for every inspection batch.

Key Takeaway for Glossary Readers

Provisioned throughput for AI turns batch completion windows into capacity plans that can be measured and retired.

Case study 03

Developer copilot controls spend during enterprise rollout

Scenario

A software company rolled out an internal developer copilot to six engineering divisions. Early pilots used standard capacity, but the production launch needed steadier latency and clearer chargeback.

Business/Technical Objectives

  • Give developers predictable code-assist latency during core working hours.
  • Prevent pilot teams from consuming production PTU capacity.
  • Track ownership, reservation coverage, and idle capacity by division.
  • Maintain rollback to standard deployments during model-version changes.

Solution Using Provisioned throughput for AI

The platform group created provisioned AI deployments for production divisions with measured adoption, while leaving early experiments on standard endpoints. Routing rules selected the correct deployment by division and feature flag. Azure CLI inventory ran nightly to export deployment SKU, capacity, model version, tags, and provisioning state into a FinOps workbook. The team compared PTU utilization with reservation coverage and flagged endpoints below the agreed utilization threshold. During a model-version update, traffic shifted one division at a time, with standard deployments retained as rollback targets until latency and quality checks passed.
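Routing by division and feature flag, with a standard deployment kept as the rollback target, might look like the sketch below. The division names, deployment names, and the `select_deployment` helper are all hypothetical placeholders, not details from the rollout described here.

```python
# Hypothetical mapping of onboarded divisions to provisioned deployments.
PROVISIONED_BY_DIVISION = {
    "payments": "copilot-ptu-payments",
    "platform": "copilot-ptu-platform",
}
STANDARD_DEPLOYMENT = "copilot-standard"  # pilots and rollback target


def select_deployment(division: str, ptu_flag_enabled: bool,
                      rollback_active: bool = False) -> str:
    """Pick the deployment name a request should be sent to."""
    if rollback_active or not ptu_flag_enabled:
        # Pilots and rolled-back divisions never touch provisioned capacity.
        return STANDARD_DEPLOYMENT
    # Divisions without a provisioned deployment fall through to standard.
    return PROVISIONED_BY_DIVISION.get(division, STANDARD_DEPLOYMENT)


print(select_deployment("payments", ptu_flag_enabled=True))
print(select_deployment("payments", ptu_flag_enabled=True, rollback_active=True))
print(select_deployment("research", ptu_flag_enabled=True))
```

Defaulting unknown divisions to the standard deployment keeps an unmapped team from silently consuming production PTU capacity.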

Results & Business Impact

  • Median code-assist latency stayed under 1.9 seconds for onboarded divisions.
  • Two idle pilot PTU deployments were removed before month-end billing.
  • Chargeback reports matched deployment tags for all six engineering divisions.
  • The staged model update completed with no division-wide outage or capacity overrun.

Key Takeaway for Glossary Readers

Provisioned AI capacity needs routing discipline and ownership metadata as much as it needs model capacity.

Why use Azure CLI for this?

As an Azure engineer with ten years managing capacity and incidents, I use Azure CLI for provisioned throughput for AI because PTU mistakes are expensive and visible. CLI lets me inspect deployment SKU, capacity, model version, region, provisioning state, and resource ID without relying on anyone's memory of the portal. It is also the fastest way to compare production, staging, and benchmark environments before a change. During an incident, CLI evidence helps determine whether latency is caused by capacity exhaustion, a deployment change, wrong endpoint routing, or application retry storms. For FinOps, CLI exports create a clean inventory of who owns each paid AI capacity decision.

CLI use cases

  • List AI deployments and identify which endpoints use provisioned SKUs and PTU capacity.
  • Show a specific deployment before changing capacity, model version, or traffic routing.
  • Create a provisioned deployment with approved SKU name, capacity value, model, and region in automation.
  • Export PTU deployment inventory for reservation planning, chargeback, and idle-capacity reviews.
  • Validate deployment state after cutover so applications call the intended provisioned endpoint.

Before you run CLI

  • Confirm tenant, subscription, resource group, AI account, deployment name, model, region, and intended provisioned SKU.
  • Check quota, model support, and regional or data-zone availability before creating or resizing PTU capacity.
  • Treat deployment create, update, and delete as cost-impacting changes that may also affect production traffic.
  • Verify RBAC, provider registration, network access, private endpoints, managed identities, and diagnostic settings before testing.
  • Use JSON output and record capacity, SKU, model version, provisioning state, tags, and resource ID for rollback evidence.

What output tells you

  • SKU name and capacity identify the provisioned deployment type and how many PTUs or capacity units are assigned.
  • Model name, model format, and version show whether the endpoint matches the application and benchmark assumptions.
  • Provisioning state confirms whether capacity creation, update, or deletion has finished, which should be checked before any traffic cutover.
  • Resource ID, location, and tags connect the deployment to billing, ownership, network policy, and quota context.
  • Deployment lists reveal stale PTU endpoints, duplicate release candidates, or nonproduction capacity that should be reviewed.
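The fields above can be pulled out of an exported deployment list with a few lines of Python. This is a sketch, not a reference parser: the JSON layout (`sku.name`, `sku.capacity`, `properties.model`, `properties.provisioningState`) is an assumption about what `az cognitiveservices account deployment list --output json` returns, and the sample records, model names, and `summarize_deployments` helper are invented for illustration.

```python
import json

# Hypothetical export; check the field layout against your own CLI output.
SAMPLE_EXPORT = """
[
  {"name": "agent-assist-prod",
   "sku": {"name": "GlobalProvisionedManaged", "capacity": 100},
   "properties": {"model": {"format": "OpenAI", "name": "example-model",
                            "version": "example-version"},
                  "provisioningState": "Succeeded"}},
  {"name": "sandbox",
   "sku": {"name": "Standard", "capacity": 10},
   "properties": {"model": {"format": "OpenAI", "name": "example-model",
                            "version": "example-version"},
                  "provisioningState": "Succeeded"}}
]
"""


def summarize_deployments(deployments: list) -> list:
    """Flatten each deployment into the fields reviewers usually need."""
    rows = []
    for item in deployments:
        sku = item.get("sku") or {}
        props = item.get("properties") or {}
        model = props.get("model") or {}
        sku_name = sku.get("name", "")
        rows.append({
            "deployment": item.get("name"),
            "sku": sku_name,
            "capacity": sku.get("capacity"),
            "model": model.get("name"),
            "version": model.get("version"),
            "state": props.get("provisioningState"),
            # Heuristic: provisioned SKU names contain the word "Provisioned".
            "provisioned": "provisioned" in sku_name.lower(),
        })
    return rows


rows = summarize_deployments(json.loads(SAMPLE_EXPORT))
for row in rows:
    print(row["deployment"], row["sku"], row["capacity"], row["provisioned"])
```

Matching on the substring "Provisioned" is a convenience heuristic; an explicit allow-list of approved SKU names is stricter if naming ever changes.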

Mapped Azure CLI commands

Provisioned model deployment operations

direct
az cognitiveservices account list --resource-group <resource-group>
az cognitiveservices account · discover · AI and Machine Learning
az cognitiveservices account show --name <account-name> --resource-group <resource-group>
az cognitiveservices account · discover · AI and Machine Learning
az cognitiveservices account deployment list --name <account-name> --resource-group <resource-group>
az cognitiveservices account deployment · discover · AI and Machine Learning
az cognitiveservices account deployment show --name <account-name> --resource-group <resource-group> --deployment-name <deployment>
az cognitiveservices account deployment · discover · AI and Machine Learning
az cognitiveservices account deployment create --name <account-name> --resource-group <resource-group> --deployment-name <deployment> --model-name <model> --model-version <version> --model-format OpenAI --sku-name GlobalProvisionedManaged --sku-capacity <capacity>
az cognitiveservices account deployment · provision · AI and Machine Learning
az cognitiveservices account deployment delete --name <account-name> --resource-group <resource-group> --deployment-name <deployment>
az cognitiveservices account deployment · remove · AI and Machine Learning
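In automation, the create command is usually assembled from reviewed parameters rather than typed by hand. The sketch below builds the argument list using only the flags shown in the mapping above; `build_create_command` is a hypothetical helper, and the example values are placeholders.

```python
def build_create_command(account: str, resource_group: str, deployment: str,
                         model: str, version: str, capacity: int,
                         sku_name: str = "GlobalProvisionedManaged",
                         model_format: str = "OpenAI") -> list:
    """Return the argv list for a provisioned deployment create call."""
    if capacity <= 0:
        # A zero or negative capacity is almost certainly a templating bug.
        raise ValueError("capacity must be a positive number of units")
    return [
        "az", "cognitiveservices", "account", "deployment", "create",
        "--name", account,
        "--resource-group", resource_group,
        "--deployment-name", deployment,
        "--model-name", model,
        "--model-version", version,
        "--model-format", model_format,
        "--sku-name", sku_name,
        "--sku-capacity", str(capacity),
    ]


command = build_create_command("my-account", "my-rg", "agent-assist-prod",
                               "example-model", "example-version", 50)
print(" ".join(command))
```

Passing the list to `subprocess.run(command, check=True)` would execute it; keeping construction separate from execution lets a release gate log or review the exact command first.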

Architecture context

As an Azure architect, I use provisioned throughput for AI when model serving has a measurable SLO and enough steady demand to justify reserved capacity. The design starts with workload profiling: requests per minute, prompt tokens, output tokens, concurrency, model family, regional requirements, and failure behavior. I then choose deployment type, capacity, monitoring, and reservation strategy. I also keep standard or smaller fallback deployments in mind for noncritical traffic. Provisioned AI capacity should be protected by RBAC and release gates because a casual deployment edit can create a billing or availability incident. The architecture is capacity engineering plus AI behavior, not just a bigger SKU.
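Turning a workload profile into a capacity request is mostly rounding with headroom. The sketch below assumes the per-PTU throughput comes from your own benchmark or the official capacity tooling; `required_ptus`, the minimum, and the increment are hypothetical placeholders, since real minimums and increments vary by model and deployment type.

```python
import math


def required_ptus(peak_tokens_per_minute: float, tokens_per_minute_per_ptu: float,
                  target_utilization: float = 0.8,
                  minimum: int = 15, increment: int = 5) -> int:
    """Estimate PTUs so peak demand lands at the target utilization."""
    if tokens_per_minute_per_ptu <= 0:
        raise ValueError("per-PTU throughput must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    raw = peak_tokens_per_minute / (tokens_per_minute_per_ptu * target_utilization)
    # Round up from the minimum in whole purchase increments.
    steps = math.ceil(max(0.0, raw - minimum) / increment)
    return minimum + steps * increment


# 216,000 tokens/min at a benchmarked 2,500 tokens/min per PTU (made-up figures).
print(required_ptus(216_000, 2_500))
```

The target utilization is the design lever: a lower target buys latency headroom at the price of more idle capacity.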

Security

Security impact is indirect but important. PTUs do not change who can call the model, how data is encrypted, or whether prompts are safe. The risk is that provisioned AI deployments become valuable production endpoints with predictable capacity and cost exposure. Unauthorized clients, leaked keys, or broad network access can consume reserved capacity, cause denial of service, or generate sensitive outputs at scale. Teams should use managed identities where supported, key rotation, private endpoints or network restrictions when appropriate, diagnostic logging, and least-privilege RBAC for deployment changes. Prompt shielding, content filtering, and data-access controls still belong around the application workflow.

Cost

Cost impact is direct and often significant. Provisioned throughput for AI is billed hourly for the PTUs you deploy, whether or not they serve traffic, and reservation discounts are purchased separately from the deployment itself. A deployment can therefore keep costing money while idle, and a reservation can sit underused if capacity planning is sloppy. FinOps owners should track deployed PTUs, utilization, reservation coverage, model choice, region or data-zone selection, idle hours, and token growth. Benchmarking matters before purchase because one model, prompt shape, or concurrency pattern can require different capacity than another. The best teams set capacity-review dates, tag owners, and remove nonproduction PTUs after tests or benchmarks finish.
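Idle spend and reservation coverage are simple ratios once the inputs are known. The sketch below is arithmetic only: the hourly rate is a made-up placeholder (take real rates from the pricing page), and `idle_spend` and `reservation_coverage` are hypothetical helpers rather than billing rules.

```python
def idle_spend(deployed_ptus: int, hourly_rate_per_ptu: float,
               hours: float, avg_utilization: float) -> dict:
    """Split the cost of deployed PTUs into used and idle portions."""
    total = deployed_ptus * hourly_rate_per_ptu * hours
    idle = total * (1.0 - avg_utilization)
    return {"total": round(total, 2), "idle": round(idle, 2)}


def reservation_coverage(reserved_ptus: int, deployed_ptus: int) -> dict:
    """Compare reserved units with deployed units."""
    if deployed_ptus <= 0:
        raise ValueError("deployed_ptus must be positive")
    covered = min(reserved_ptus, deployed_ptus)
    return {
        "coverage": covered / deployed_ptus,                          # share of deployed PTUs under reservation
        "uncovered_ptus": max(0, deployed_ptus - reserved_ptus),      # deployed but not reserved
        "unused_reserved_ptus": max(0, reserved_ptus - deployed_ptus),  # reserved but not deployed
    }


# 100 PTUs for a 730-hour month at a placeholder rate of 2.0 per PTU-hour.
print(idle_spend(100, 2.0, 730, avg_utilization=0.6))
print(reservation_coverage(reserved_ptus=80, deployed_ptus=100))
```

Both numbers belong in the same review: low utilization points to oversized deployments, while unused reserved units point to a commitment that outlived its workload.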

Reliability

Reliability impact is direct for high-demand AI applications. Properly sized PTUs can deliver steadier latency and reduce dependence on shared capacity. They do not guarantee the whole application is reliable. Retrieval services, tools, client retries, network paths, and regional dependencies can still fail. Undersized PTUs can throttle during a launch, while deleted or misrouted deployments can break production immediately. Operators should monitor utilization, throttling, latency percentiles, failed requests, and capacity-change events. They should also test fallback routing, simpler prompts, queue-based smoothing, and rollback before peak events. Provisioned throughput improves one layer of reliability: model-serving capacity under measured and monitored load.
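Aggressive client retries are one of the easier failure modes to tame. The sketch below computes a capped exponential backoff schedule with optional jitter; `backoff_delays` and its defaults are hypothetical, and a real client would apply these delays only after a throttled or failed request.

```python
import random


def backoff_delays(max_attempts: int, base_seconds: float = 0.5,
                   cap_seconds: float = 8.0, jitter: bool = False,
                   rng: random.Random = None) -> list:
    """Return the wait before each retry: doubling, capped, optionally jittered."""
    source = rng or random
    delays = []
    for attempt in range(max_attempts):
        delay = min(cap_seconds, base_seconds * (2 ** attempt))
        if jitter:
            # Full jitter spreads clients out so they do not retry in lockstep.
            delay = source.uniform(0, delay)
        delays.append(delay)
    return delays


print(backoff_delays(6))
print(backoff_delays(6, jitter=True, rng=random.Random(7)))
```

Capping the delay and bounding the attempt count keeps retries from turning a brief capacity shortfall into a self-inflicted storm.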

Performance

Performance impact is direct because PTUs are bought to provide predictable processing capacity for supported AI models. The benefit appears as steadier latency and throughput under expected load. Performance can still degrade if prompt length grows, output limits rise, retrieval becomes slow, clients retry aggressively, or traffic exceeds planned utilization. Teams should benchmark with realistic prompts, measure token rate, track P50 and P95 latency, and observe throttling. Provisioned capacity also changes release testing: a prompt-template change can reduce effective throughput even without more users. Operators should compare capacity metrics before and after model upgrades, prompt edits, and traffic-routing changes in production.
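Tracking P50 and P95 needs nothing more than sorted latency samples. The sketch below uses the nearest-rank method, one of several common percentile definitions; the sample values are invented for illustration and `percentile` is a hypothetical helper.

```python
import math


def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up response times in seconds from one benchmark run.
latencies = [1.2, 1.4, 1.5, 1.7, 1.9, 2.0, 2.1, 2.4, 2.7, 5.1]
print("P50:", percentile(latencies, 50))
print("P95:", percentile(latencies, 95))
```

Running the same calculation on samples taken before and after a prompt or model change gives a like-for-like comparison, provided both runs use the same percentile definition.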

Operations

Operators manage provisioned throughput for AI by treating PTUs like a production capacity asset. They inventory deployments, confirm SKU names, track capacity, watch utilization, compare latency against SLOs, and coordinate quota or reservation changes. Azure CLI helps export deployment configuration, verify model versions, detect drift, and support incident review. Operations teams also need runbooks for capacity exhaustion, unexpected idle spend, model-version migration, endpoint cutover, and safe deletion. Prompt changes must be reviewed because longer prompts can consume more capacity even when request count stays stable. Mature operations pair PTU metrics with application traces, release records, owner tags, and FinOps reporting.
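Drift detection can start as a plain comparison between an approved baseline and the latest export. In the sketch below, each snapshot is assumed to be a dictionary keyed by deployment name; `detect_drift`, the tracked field names, and the sample data are hypothetical.

```python
TRACKED_FIELDS = ("sku", "capacity", "model", "version", "state")


def detect_drift(baseline: dict, current: dict) -> list:
    """Compare two snapshots (deployment name -> field dict) and list differences."""
    findings = []
    for name in sorted(set(baseline) | set(current)):
        if name not in current:
            findings.append(f"{name}: missing from current export")
        elif name not in baseline:
            findings.append(f"{name}: not in baseline (new deployment)")
        else:
            for field in TRACKED_FIELDS:
                old = baseline[name].get(field)
                new = current[name].get(field)
                if old != new:
                    findings.append(f"{name}: {field} changed from {old!r} to {new!r}")
    return findings


approved = {"agent-assist-prod": {"sku": "GlobalProvisionedManaged", "capacity": 100,
                                  "model": "example-model", "version": "v1",
                                  "state": "Succeeded"}}
latest = {"agent-assist-prod": {"sku": "GlobalProvisionedManaged", "capacity": 150,
                                "model": "example-model", "version": "v1",
                                "state": "Succeeded"},
          "benchmark-tmp": {"sku": "GlobalProvisionedManaged", "capacity": 50}}

for finding in detect_drift(approved, latest):
    print(finding)
```

An unexpected capacity change or a leftover benchmark deployment both show up as findings, which is usually enough to start the conversation with the owning team.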

Common mistakes

  • Sizing PTUs from user count alone instead of request rate, prompt tokens, output tokens, and concurrency.
  • Leaving benchmark PTU deployments active after testing and discovering idle charges during month-end review.
  • Routing production traffic to a standard deployment while paying for provisioned capacity that sits unused.
  • Assuming a reservation automatically creates capacity or that deleting a deployment automatically cancels every billing commitment.
  • Changing prompt templates or model versions without rechecking whether the existing PTU allocation still supports the workload.