
Requests per minute

Requests per minute is the request-count side of an Azure OpenAI rate limit. It says how many calls a deployment or quota scope can receive in roughly a minute, regardless of how many tokens those calls contain. A small chat request and a large prompt can both count as one request, while token limits measure text volume. When RPM is exhausted, callers usually see throttling. Engineers use RPM to design client concurrency, retries, batching, regional capacity, and tenant isolation for production AI applications.
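The difference between the two limits can be sketched as follows. This is a minimal illustration that assumes a fixed one-minute window and made-up limits; the service's real accounting may evaluate shorter intervals. `MinuteBudget` and its fields are hypothetical names, not part of any Azure SDK.

```python
from dataclasses import dataclass


@dataclass
class MinuteBudget:
    """Hypothetical per-minute budget for one deployment (illustrative only)."""
    rpm_limit: int          # requests allowed in the window
    tpm_limit: int          # tokens allowed in the window
    requests_used: int = 0
    tokens_used: int = 0

    def try_admit(self, tokens: int) -> str:
        """Return 'ok', or the name of the limit that would be exceeded."""
        if self.requests_used + 1 > self.rpm_limit:
            return "rpm_exhausted"
        if self.tokens_used + tokens > self.tpm_limit:
            return "tpm_exhausted"
        self.requests_used += 1
        self.tokens_used += tokens
        return "ok"


# Many tiny calls exhaust the request budget long before the token budget.
budget = MinuteBudget(rpm_limit=3, tpm_limit=10_000)
results = [budget.try_admit(tokens=20) for _ in range(4)]
print(results)             # the fourth call is rejected on request count
print(budget.tokens_used)  # only a small share of the token budget was used
```

In this sketch the fourth small call is rejected even though almost all of the token budget is still free, which is the situation RPM planning is meant to catch.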

Aliases: RPM, Azure OpenAI RPM, OpenAI request rate limit, requests-per-minute quota, model deployment request limit
Difficulty: fundamentals
CLI mappings: 6
Last verified: 2026-05-22T05:55:00Z

Microsoft Learn

Requests per minute, or RPM, in Azure OpenAI is a quota and rate-limit measure for how many API requests a subscription, region, model, or deployment type can accept in a minute. It works alongside tokens per minute to protect capacity from request-count bursts. It also informs capacity planning.

Microsoft Learn: Azure OpenAI in Microsoft Foundry Models quotas and limits (2026-05-22T05:55:00Z)

Technical context

In Azure architecture, requests per minute sits between the Azure OpenAI data plane and the quota model managed through Azure AI Foundry and Cognitive Services resources. It is scoped by subscription, region, model family, deployment type, and sometimes the deployment capacity assigned by the team. RPM works with tokens per minute, content filtering, authentication, private networking, and Azure Monitor telemetry. Azure CLI does not change every RPM limit directly, but it inventories accounts, deployments, SKUs, regions, metrics, and alerts used to govern request pressure.

Why it matters

Requests per minute matters because AI applications often fail from request bursts before they run out of raw token volume. A support bot, agent workflow, or RAG pipeline can issue many small calls when users refresh pages, retry failed steps, or fan out across tools. Without RPM planning, teams mistake throttling for model slowness, overscale the wrong layer, or let one tenant consume capacity meant for everyone. Clear RPM design protects user experience, supports quota increase requests, and forces architects to make concrete decisions about batching, backoff, regional deployment, priority traffic, and graceful degradation under load. Measured request counts and 429 rates keep scaling conversations grounded in observed customer traffic.

Where you see it

Signals, screens, and Azure surfaces where this term usually becomes operational.

Signal 01

In Azure AI Foundry or the OpenAI resource, deployment details and quota views show the model, region, capacity, tokens-per-minute, request-limit context, and quota tier needed for launch planning.

Signal 02

In API responses and application logs, HTTP 429 errors, retry-after values, deployment names, client IDs, and timestamps show when request volume exceeded the available RPM budget during peak traffic.

Signal 03

In Azure Monitor metric exports, minute-level request counts, latency, token usage, and throttling signals help operators distinguish request bursts from model latency or network issues.
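As a sketch of how an application might turn the 429 signal into a wait time, the helper below reads a `Retry-After` header given in seconds. The function name, the default delay, and the plain dictionary of headers are assumptions for illustration; real clients and SDKs may expose this differently.

```python
from typing import Mapping, Optional


def retry_delay_seconds(status: int, headers: Mapping[str, str],
                        default: float = 1.0) -> Optional[float]:
    """Return how long to wait before retrying, or None if not throttled."""
    if status != 429:
        return None
    # Header names are case-insensitive, so normalise before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    raw = lowered.get("retry-after")
    if raw is None:
        return default
    try:
        return max(0.0, float(raw))
    except ValueError:
        # Retry-After can also be an HTTP date; this sketch falls back
        # to the default instead of parsing it.
        return default


print(retry_delay_seconds(429, {"Retry-After": "7"}))
print(retry_delay_seconds(200, {}))
```

Logging the returned delay next to the deployment name and timestamp gives the per-minute evidence described in the signals above.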

When this becomes relevant

Specific situations where this term helps solve real Azure design, operations, migration, security, reliability, cost, or governance problems.

  • Protect a customer-facing chatbot from retry storms by capping concurrent calls before RPM throttling becomes a visible outage.
  • Prepare quota-increase evidence by exporting per-deployment request counts, 429 rates, regions, and business impact for the overloaded model.
  • Separate premium tenants or critical workflows onto dedicated deployments so one noisy workload cannot consume the shared RPM budget.
  • Throttle background summarization or embedding jobs during business hours so interactive chat traffic keeps predictable response times.
  • Choose between regional deployments, provisioned throughput, caching, or batching after measuring request-count pressure instead of guessing.
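Capping concurrent calls, as in the first scenario above, can be sketched with an `asyncio.Semaphore`. The `call_model` coroutine is a stand-in that only sleeps; the cap of four and the call count are arbitrary assumptions. A concurrency cap limits in-flight calls rather than calls per minute, so it complements a rate limit instead of replacing one.

```python
import asyncio


async def call_model(i: int, gate: asyncio.Semaphore, state: dict) -> int:
    """Stand-in for one API call; tracks how many calls are in flight."""
    async with gate:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # placeholder for the real request
        state["active"] -= 1
        return i


async def run_capped(n_calls: int, max_concurrent: int) -> dict:
    gate = asyncio.Semaphore(max_concurrent)
    state = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(call_model(i, gate, state) for i in range(n_calls))
    )
    return {"peak": state["peak"], "completed": len(results)}


summary = asyncio.run(run_capped(n_calls=20, max_concurrent=4))
print(summary)
```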

Real-world case studies

Different enterprise-style examples that show the term being used to hit measurable objectives.

Case study 01

Legal assistant stops morning throttling before partner reviews

Scenario, objectives, solution, measured impact, and takeaway.

Scenario

LexBridge built an Azure OpenAI assistant that summarized matter updates for partners at 8 a.m. every weekday. The application looked healthy, but partners saw waves of 429 errors and assumed the model service was unstable.

Business/Technical Objectives
  • Keep partner-facing summaries below two seconds for accepted requests.
  • Reduce 429 errors during the morning burst by at least 80 percent.
  • Preserve data residency by keeping production traffic in the approved region.
  • Create evidence for a quota request if tuning was not enough.
Solution Using Requests per minute

Engineers treated requests per minute as the primary bottleneck instead of blaming model latency. They used CLI to inventory the OpenAI account, deployment, region, and model capacity, then exported one-minute request metrics and correlated them with application logs. The team added client-side concurrency limits, exponential backoff with jitter, and a queue that spread nonurgent matter refreshes across 20 minutes. Premium partner requests stayed on the original deployment, while background document refreshes moved to a second deployment in the same region. API Management added a per-client ceiling so one practice group could not consume the whole RPM budget.
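The idea of spreading nonurgent refreshes across a window can be sketched as an evenly spaced schedule. The job names, the job count of 600, and the helper `spread_schedule` are hypothetical; the case study does not describe how LexBridge implemented its queue.

```python
from collections import Counter


def spread_schedule(job_ids, window_seconds: float) -> dict:
    """Assign each job an evenly spaced start offset inside the window."""
    jobs = list(job_ids)
    if not jobs:
        return {}
    step = window_seconds / len(jobs)
    return {job: round(index * step, 3) for index, job in enumerate(jobs)}


# 600 hypothetical refresh jobs spread over a 20-minute window.
schedule = spread_schedule((f"matter-{n}" for n in range(600)), 20 * 60)

# Count how many jobs start in each one-minute bucket.
per_minute = Counter(int(offset // 60) for offset in schedule.values())
print(max(per_minute.values()))
```

With these numbers the burst of 600 requests becomes a steady 30 starts per minute, which is easier to fit under a request budget.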

Results & Business Impact
  • Peak 429 errors fell from 18 percent of calls to 2.6 percent during the first week.
  • Median accepted-request latency stayed at 1.4 seconds instead of swinging above six seconds under retries.
  • The quota request included deployment IDs, minute-level request charts, and documented business impact.
  • Partners received summaries before 8:10 a.m. on 96 percent of business days, up from 71 percent.
Key Takeaway for Glossary Readers

Requests per minute turns AI reliability from vague throttling complaints into capacity, concurrency, and traffic-shaping decisions.

Case study 02

Sports media platform protects live highlight generation


Scenario

ClipArena used Azure OpenAI to generate short highlight captions during live matches. Every goal or controversial call triggered thousands of mobile refreshes, and caption generation began failing exactly when editors needed it most.

Business/Technical Objectives
  • Protect live caption generation during game-changing events.
  • Separate editorial workflows from fan-triggered refresh traffic.
  • Avoid buying unnecessary provisioned throughput before measuring the true RPM gap.
  • Give incident commanders a clear throttle and recovery playbook.
Solution Using Requests per minute

The platform team modeled requests per minute across three traffic classes: editor requests, fan-facing caption refreshes, and background recap jobs. CLI exports identified deployment capacity, region, and request metrics, while application telemetry identified fan refresh bursts after scoring events. Engineers moved editor tools to a dedicated deployment, cached captions for repeated story fragments, and placed fan-triggered requests behind a queue with a strict per-match concurrency budget. Background recap jobs paused automatically when request metrics crossed the alert threshold. The runbook included CLI commands for collecting deployment and metric evidence during each match.
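Shedding lower-priority traffic classes as the request budget fills up can be sketched like this. The class names mirror the three classes above, but the thresholds and the function `admitted_classes` are assumptions for illustration, not ClipArena's actual configuration.

```python
# Hypothetical shed thresholds: the fraction of the RPM budget in use at
# which each traffic class stops being admitted. Editor traffic is shed
# only when the budget is fully used.
SHED_AT = {"background_recap": 0.60, "fan_refresh": 0.85, "editor": 1.00}


def admitted_classes(requests_last_minute: int, rpm_budget: int) -> list:
    """Return the traffic classes still admitted at the current utilisation."""
    if rpm_budget <= 0:
        raise ValueError("rpm_budget must be positive")
    utilisation = requests_last_minute / rpm_budget
    return sorted(name for name, limit in SHED_AT.items() if utilisation < limit)


print(admitted_classes(300, 1000))
print(admitted_classes(700, 1000))
print(admitted_classes(900, 1000))
```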

Results & Business Impact
  • Caption failure rate during peak match moments dropped from 14 percent to 1.8 percent.
  • Editors kept sub-three-second caption generation during 11 consecutive high-traffic matches.
  • The team avoided a rushed provisioned-throughput purchase and deferred the decision until playoff traffic data was available.
  • Incident reviews shrank from two hours of guesswork to a 20-minute metric comparison.
Key Takeaway for Glossary Readers

RPM planning protects the moments when users care most by reserving request capacity for the workflows that cannot wait.

Case study 03

B2B SaaS support bot isolates noisy tenants


Scenario

CasePilot offered an Azure OpenAI support assistant to hundreds of enterprise tenants. One customer imported a huge knowledge base and accidentally triggered many small classification calls that throttled unrelated tenants.

Business/Technical Objectives
  • Stop one tenant from exhausting the shared deployment request budget.
  • Keep standard support chat available while bulk imports continue.
  • Give customer success teams tenant-level evidence during escalations.
  • Reduce retry waste without removing the new knowledge-import feature.
Solution Using Requests per minute

Engineers used requests per minute as the boundary for tenant isolation. They tagged every OpenAI call with tenant and feature metadata, exported deployment metrics with CLI, and matched 429 spikes to the import classifier. The team created a dedicated background-processing deployment with its own queue and lowered concurrency for imports. Interactive support chats kept the primary deployment, protected by per-tenant ceilings enforced at the API layer. The import job changed from immediate fan-out to paced batches with retry-after handling. Dashboards showed request count, token use, 429s, and queue age per tenant.
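A per-tenant ceiling of the kind described above can be sketched as a sliding-window counter kept per tenant. `PerTenantLimiter` is a hypothetical in-memory class; the case study enforced its ceilings at the API layer, and a production version would typically need shared state across instances.

```python
from collections import defaultdict, deque


class PerTenantLimiter:
    """Sliding-window request ceiling tracked separately for each tenant."""

    def __init__(self, per_tenant_rpm: int, window_seconds: float = 60.0):
        self.per_tenant_rpm = per_tenant_rpm
        self.window_seconds = window_seconds
        self._stamps = defaultdict(deque)  # tenant -> timestamps of admitted calls

    def allow(self, tenant: str, now: float) -> bool:
        stamps = self._stamps[tenant]
        # Drop timestamps that have aged out of the window.
        while stamps and now - stamps[0] >= self.window_seconds:
            stamps.popleft()
        if len(stamps) >= self.per_tenant_rpm:
            return False
        stamps.append(now)
        return True


limiter = PerTenantLimiter(per_tenant_rpm=2)
decisions = [
    limiter.allow("tenant-a", 0.0),
    limiter.allow("tenant-a", 1.0),
    limiter.allow("tenant-a", 2.0),   # over tenant-a's ceiling
    limiter.allow("tenant-b", 2.0),   # tenant-b is unaffected
]
print(decisions)
```

Passing the clock in as `now` keeps the sketch deterministic; a real caller would supply a monotonic time source.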

Results & Business Impact
  • Cross-tenant throttling incidents fell from nine in one month to zero in the next release cycle.
  • Bulk import completion time increased by only 11 percent while support-chat availability returned to 99.9 percent.
  • Retry-generated duplicate calls dropped by 61 percent after jitter and queue pacing were added.
  • Customer success could identify the consuming tenant and feature within five minutes.
Key Takeaway for Glossary Readers

A shared AI deployment needs request-budget ownership, or one enthusiastic tenant can create an outage for everyone.

Why use Azure CLI for this?

After ten years of Azure engineering work, I use Azure CLI for RPM work because rate-limit problems are rarely solved from one portal screen. I need repeatable evidence: which subscription is active, which OpenAI accounts exist, which region hosts each deployment, what deployment capacity is assigned, what metrics show during the incident, and whether alerts fired. CLI lets me export that evidence, compare dev, test, and production, and feed it into change records or quota requests. Even when a quota increase is requested in the portal, CLI gives the operational proof behind the request. That repeatability matters when escalation evidence must survive handoffs between teams.

CLI use cases

  • Inventory Azure OpenAI accounts and deployment locations before modeling which resources share the same quota pressure.
  • List deployments with model names, versions, SKU capacity, and rate-limit fields for incident evidence or quota review.
  • Export Azure Monitor metric definitions and request metrics at one-minute intervals during a throttling investigation.
  • Create a metric alert that pages operators when 429 responses or request counts approach the safe operating envelope.
  • Compare dev, test, and production deployment capacity so load tests do not rely on a different limit profile.
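Once minute-level request metrics are exported as JSON, finding the minutes that crossed a threshold can be sketched as below. The sample payload is invented and only imitates the nesting of `value`, `timeseries`, and `data` that `az monitor metrics list` returns; the field holding the number follows the aggregation requested (for example `total` or `count`), which is treated here as an assumption to verify against real output.

```python
import json

# Invented sample shaped like a metrics export; values are illustrative.
SAMPLE = """
{"value": [{"timeseries": [{"data": [
    {"timeStamp": "2030-01-01T08:00:00Z", "total": 120.0},
    {"timeStamp": "2030-01-01T08:01:00Z", "total": 410.0},
    {"timeStamp": "2030-01-01T08:02:00Z"},
    {"timeStamp": "2030-01-01T08:03:00Z", "total": 95.0}
]}]}]}
"""


def minutes_over(payload: dict, threshold: float, field: str = "total") -> list:
    """Return (timestamp, value) pairs whose value exceeds the threshold."""
    hits = []
    for metric in payload.get("value", []):
        for series in metric.get("timeseries", []):
            for point in series.get("data", []):
                value = point.get(field)
                # Minutes with no data points carry no aggregation field.
                if value is not None and value > threshold:
                    hits.append((point["timeStamp"], value))
    return hits


print(minutes_over(json.loads(SAMPLE), threshold=300))
```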

Before you run CLI

  • Confirm the tenant, subscription, resource group, OpenAI account name, deployment name, region, and whether you are inspecting Azure OpenAI or another Cognitive Services kind.
  • Use Reader or Monitoring Reader for discovery, and require Cognitive Services Contributor or equivalent only for deployment or alert changes.
  • Check whether commands expose keys, endpoint names, tenant IDs, prompt-related telemetry, or customer identifiers before saving output to tickets.
  • Understand cost and reliability risk before scaling capacity, adding deployments, changing alert thresholds, or shifting traffic between regions.

What output tells you

  • Account output confirms the resource ID, kind, endpoint, region, SKU, and subscription boundary used for monitoring and quota conversations.
  • Deployment output shows model, version, SKU capacity, provisioning state, and any rate-limit properties returned by the management API.
  • Metric definitions tell you which request, token, latency, and throttling signals are available for the resource in that region.
  • Metric time series reveal whether failures align with bursty request count, token-heavy prompts, deployment changes, or regional traffic shifts.
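Deployment output that includes rate-limit properties can be normalised to per-minute figures for comparison across environments. The sample below is invented, and the shape of each rate-limit entry (`key`, `renewalPeriod` in seconds, `count`) is an assumption to check against what the management API actually returns for a given deployment. Scaling a short renewal period up to a minute is only an approximation of the enforced limit.

```python
import json

# Invented sample; deployment names, model, and numbers are illustrative.
SAMPLE = """
[
  {"name": "chat-prod", "model": "example-model", "capacity": 30,
   "rateLimits": [
     {"key": "request", "renewalPeriod": 10, "count": 30},
     {"key": "token", "renewalPeriod": 60, "count": 30000}
   ]},
  {"name": "batch-bg", "model": "example-model", "capacity": 10,
   "rateLimits": null}
]
"""


def per_minute_limits(deployment: dict) -> dict:
    """Scale each rate-limit entry to an approximate per-minute figure."""
    limits = {}
    for rule in deployment.get("rateLimits") or []:
        limits[rule["key"]] = rule["count"] * 60 / rule["renewalPeriod"]
    return limits


for item in json.loads(SAMPLE):
    print(item["name"], per_minute_limits(item))
```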

Mapped Azure CLI commands

Requests per minute CLI Commands

az cognitiveservices account show --name <account-name> --resource-group <resource-group> --query "{id:id,kind:kind,location:location,sku:sku.name,endpoint:properties.endpoint}" --output json
az cognitiveservices account deployment list --name <account-name> --resource-group <resource-group> --query "[].{name:name,model:properties.model.name,version:properties.model.version,sku:sku.name,capacity:sku.capacity,rateLimits:properties.rateLimits}" --output json
az monitor metrics list-definitions --resource <openai-resource-id> --output table
az monitor metrics list --resource <openai-resource-id> --metric "AzureOpenAIRequests" --interval PT1M --aggregation Total --output json
az monitor metrics alert create --name <alert-name> --resource-group <resource-group> --scopes <openai-resource-id> --condition "total AzureOpenAIRequests > <threshold>" --window-size 5m --evaluation-frequency 1m --action <action-group-id>
az cognitiveservices account list --resource-group <resource-group> --query "[].{name:name,kind:kind,location:location,sku:sku.name}" --output table

Architecture context

A seasoned Azure architect treats requests per minute as a concurrency boundary for the AI platform, not just a number in a quota table. The design starts with workload shape: chat turns, agent tool calls, embeddings, batch jobs, and retry behavior. Then the team maps those calls to model deployments, regions, subscriptions, and tenant priority. High-value workloads may need separate deployments or provisioned throughput, while background jobs should be paced or queued. RPM also affects API Management policies, client SDK retry settings, circuit breakers, and observability. Good architecture avoids one shared deployment where a noisy feature can throttle every user-facing AI experience.
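Pacing background jobs against a request budget is often done with a token bucket. The sketch below is one common approach, not something the platform prescribes; `TokenBucket`, its parameters, and the chosen rate are hypothetical, and the clock is passed in explicitly so the behaviour is easy to follow.

```python
class TokenBucket:
    """Client-side pacing: refill at rpm/60 tokens per second up to a burst cap."""

    def __init__(self, rpm: float, burst: int, now: float = 0.0):
        self.rate = rpm / 60.0        # tokens added per second
        self.capacity = float(burst)  # largest burst allowed
        self.tokens = float(burst)
        self.updated = now

    def try_acquire(self, now: float) -> bool:
        """Consume one token if available; return whether the call may proceed."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rpm=60, burst=2)
outcomes = [bucket.try_acquire(t) for t in (0.0, 0.0, 0.0, 0.5, 1.0)]
print(outcomes)
```

Giving each workload class its own bucket is one way to keep a noisy feature from spending the budget that interactive traffic depends on.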

Security

Security impact is indirect because RPM does not authenticate callers or encrypt prompts. Microsoft Entra ID, keys, managed identities, private endpoints, network ACLs, and content safety controls handle those duties. The security risk appears when weak access boundaries allow abusive or compromised clients to burn request capacity, generate denial-of-service symptoms, or hide prompt abuse in normal traffic. Engineers should protect keys, prefer managed identity where supported, separate tenant or workload deployments when appropriate, and alert on unusual request spikes by resource, deployment, client, or region. Rate-limit evidence is also useful during abuse investigations and incident reviews, where it helps distinguish abuse from normal launch pressure.

Cost

RPM is not usually a separate line item, but it drives cost decisions. If a workload repeatedly hits RPM, teams may add deployments, request more quota, move to provisioned throughput, or split traffic across regions. Those choices can increase spend, operational complexity, and monitoring effort. Poor RPM design also causes wasted retries, duplicate prompts, and support escalations. FinOps owners should connect request volume to features, tenants, model choices, cache hit rates, and retry rates. Sometimes the cheapest fix is not more quota; it is batching, prompt caching, rate shaping, or eliminating accidental fan-out in an agent workflow. Tracking retry rates also exposes where avoidable retries create hidden token spend.
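One way to see how caching reduces request count is an application-level response cache keyed on the deployment and prompt. This is a minimal sketch with a fake model function; `PromptCache` is a hypothetical helper and is distinct from any caching feature the service itself may offer. It is only appropriate where identical prompts may safely share an answer.

```python
import hashlib


class PromptCache:
    """Reuse answers for identical (deployment, prompt) pairs."""

    def __init__(self, call_model):
        self._call_model = call_model  # function that performs the real request
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, deployment: str, prompt: str) -> str:
        key = hashlib.sha256(f"{deployment}\n{prompt}".encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        self._store[key] = self._call_model(prompt)
        return self._store[key]


calls = []


def fake_model(prompt: str) -> str:
    """Stand-in for a model call; records each request that is really sent."""
    calls.append(prompt)
    return prompt.upper()


cache = PromptCache(fake_model)
answers = [cache.get("chat-prod", p) for p in ["goal!", "goal!", "red card", "goal!"]]
print(len(calls), cache.hits, cache.misses)
```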

Reliability

Reliability impact is direct. When RPM is exhausted, Azure OpenAI callers can receive 429 throttling even if the application servers, network, and model endpoint look healthy. Reliable systems pace calls, use exponential backoff with jitter, cap parallelism, queue nonurgent work, and fail gracefully when limits are reached. Architects should test burst behavior, not only average traffic, because agentic workflows can multiply calls per user action. Multi-region or multi-deployment patterns reduce blast radius, but they must respect data residency and quota availability. Alerting should connect request volume, throttling, latency, retries, and user-facing errors. Runbooks should state who owns traffic shedding when the deployment is healthy but capacity is exhausted, and teams should rehearse that path before planned demand spikes.
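Exponential backoff with jitter can be sketched as follows. This version uses the "full jitter" variant and adds the jitter on top of any server-provided retry-after value so that clients do not all retry at the same instant; the base, cap, and function name are assumptions, and an SDK's built-in retry policy may behave differently.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, retry_after: Optional[float] = None,
                  base: float = 0.5, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Delay before retry number `attempt` (0-based), using full jitter."""
    rng = rng or random.Random()
    # The jitter ceiling doubles with each attempt, up to the cap.
    ceiling = min(cap, base * (2 ** attempt))
    jitter = rng.uniform(0.0, ceiling)
    # Wait at least as long as the service asked, then add jitter.
    return (retry_after or 0.0) + jitter


rng = random.Random(7)  # seeded only to make the demonstration repeatable
for attempt in range(5):
    print(attempt, round(backoff_delay(attempt, rng=rng), 3))
```

Pairing this with a maximum attempt count keeps a long throttle from turning into an unbounded retry loop.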

Performance

Performance impact is direct because RPM controls how many calls can enter the model service over time. When request volume exceeds the limit, callers wait, retry, or receive throttling, which users experience as slow or failed AI features. Higher RPM capacity can improve concurrency, but it does not reduce model latency for each accepted call. Performance tuning should examine request burstiness, token size, client parallelism, connection reuse, retries, and cache strategy. Small requests can still overwhelm RPM, while fewer large prompts may pressure token limits instead. Good dashboards separate request-count throttling from latency and token saturation. That separation shows which bottleneck deserves tuning first and prevents teams from over-tuning the wrong layer during load tests.
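Separating request-count pressure from token pressure can be sketched by comparing utilisation of each limit. The limits, the 80 percent warning level, and the function `binding_limit` are illustrative assumptions rather than values from any quota table.

```python
def binding_limit(requests_per_min: float, tokens_per_min: float,
                  rpm_limit: float, tpm_limit: float,
                  warn_at: float = 0.8) -> str:
    """Name the limit under the most pressure, or 'headroom' if neither is close."""
    rpm_util = requests_per_min / rpm_limit
    tpm_util = tokens_per_min / tpm_limit
    if max(rpm_util, tpm_util) < warn_at:
        return "headroom"
    return "request-bound" if rpm_util >= tpm_util else "token-bound"


# Many small calls: request count is the bottleneck.
print(binding_limit(950, 40_000, rpm_limit=1_000, tpm_limit=200_000))
# Few large prompts: token volume is the bottleneck.
print(binding_limit(100, 190_000, rpm_limit=1_000, tpm_limit=200_000))
```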

Operations

Operators inspect RPM through deployment configuration, quota screens, Azure Monitor metrics, diagnostic logs, SDK error telemetry, and incident timelines. They identify whether throttling comes from request count, token volume, retry storms, or shared capacity. CLI helps list OpenAI resources and deployments, retrieve resource IDs, collect metric definitions, query minute-level request counts, and create alerts. Runbooks should specify who can request quota, when traffic can be shifted, how to lower background concurrency, and what evidence must be captured after a 429 incident. Good operations also document expected RPM per feature and tenant. Those records make quota reviews factual, repeatable, and defensible, and they prevent guesswork during time-sensitive production reviews.

Common mistakes

  • Treating tokens per minute as the only limit and ignoring many small requests that can still exhaust RPM.
  • Letting client retries run without jitter, which turns a temporary throttle into a self-amplifying request storm.
  • Pooling unrelated tenants, agents, and batch jobs on one deployment without priority controls or traffic shaping.
  • Requesting more quota before proving whether caching, batching, regional routing, or retry policy would solve the problem.