AI and Machine Learning Azure OpenAI

Tokens per minute

Tokens per minute is the speed limit for how much model text an Azure OpenAI deployment can handle every minute. Tokens are the model-sized chunks created from prompts, retrieved context, tool results, and responses. When a workload burns tokens faster than its TPM allocation, requests can throttle even if the app servers look healthy. Teams use TPM to decide model choice, deployment count, region placement, batching, retry behavior, and whether provisioned throughput is worth paying for. That makes capacity planning concrete for product teams.

Aliases
TPM, token throughput, model token quota, tokens per minute quota, Azure OpenAI TPM
Difficulty
intermediate
CLI mappings
5
Last verified
2026-05-28

Microsoft Learn

Tokens per minute is the throughput quota that limits how many input and output tokens an Azure OpenAI deployment can process in a minute. It is assigned by region, subscription, model, and deployment type, and it works with request-per-minute limits and capacity choices.

Microsoft Learn: Azure OpenAI in Microsoft Foundry Models quotas and limits, 2026-05-28

Technical context

In Azure AI architecture, tokens per minute sits between quota planning, model deployment configuration, and application retry behavior. The limit is scoped by subscription, region, model family, and deployment type, then applied to deployments that serve application traffic. It works with requests per minute, context-window limits, prompt caching, provisioned throughput, telemetry, and client-side throttling. Operators see it in Foundry quota views, deployment settings, usage exports, 429 responses, and application latency dashboards. Document the owning team, scope, adjacent services, and expected review cadence before the design reaches production.

Why it matters

Tokens per minute matters because AI capacity failures show up as slow responses, 429 errors, failed automations, or product features that quietly degrade under load. A team can choose the right model and still fail launch day if TPM is assigned to the wrong region or deployment. It also shapes architecture: large prompts, retrieved documents, tool results, and verbose completions consume quota quickly. TPM becomes the practical planning unit for concurrency, retry policy, backoff, quota requests, batching, provisioned throughput, and launch-readiness reviews. Clear evidence keeps design reviews practical, accountable, and tied to business risk instead of guesswork.

Where you see it

Signals, screens, and Azure surfaces where this term usually becomes operational.

Signal 01

In Azure AI Foundry usage and quota pages, TPM appears by model, region, deployment type, subscription, approved limit, remaining capacity, quota request workflow, and launch evidence.

Signal 02

In CLI deployment and usage output, account names, deployment SKUs, capacity values, locations, model names, and provisioning states reveal where TPM is allocated during incidents.

Signal 03

In application telemetry, 429 responses, retry spikes, prompt token counts, completion token counts, throttled requests, and p95 latency show traffic pressing against TPM limits during review.

When this becomes relevant

Specific situations where this term helps solve real Azure design, operations, migration, security, reliability, cost, or governance problems.

  • Size an Azure OpenAI production launch by estimating peak tokens per user journey instead of guessing from request count alone.
  • Separate chat, embeddings, batch summarization, and agent workflows so one token-heavy feature cannot consume all model capacity.
  • Prepare quota-increase or provisioned-throughput decisions with measured prompt, completion, retry, and throttling evidence.
  • Tune RAG chunk size, conversation memory, and answer length when latency or 429 errors rise during busy periods.
  • Design failover regions that have enough model capacity, not merely deployed endpoints with empty traffic profiles.
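
The first scenario above, sizing from tokens per user journey rather than request count, can be sketched as a small calculation. This is a minimal illustration, not a Microsoft formula: the journey names, token counts, traffic figures, retry amplification factor, and headroom margin are all hypothetical placeholders to replace with measured values.

```python
from dataclasses import dataclass
import math


@dataclass
class JourneyProfile:
    """Token footprint of one user journey (all values are illustrative)."""
    name: str
    prompt_tokens: int              # system message + user input + retrieved context
    completion_tokens: int          # expected model output per call
    calls_per_journey: int          # model calls that one journey triggers
    peak_journeys_per_minute: float


def required_tpm(profiles, retry_amplification=1.0, headroom=0.0):
    """Estimate peak tokens per minute across all journeys.

    retry_amplification above 1.0 models extra traffic caused by retries;
    headroom adds a safety margin on top of the estimate.
    """
    base = sum(
        (p.prompt_tokens + p.completion_tokens)
        * p.calls_per_journey
        * p.peak_journeys_per_minute
        for p in profiles
    )
    return math.ceil(base * retry_amplification * (1.0 + headroom))


# Hypothetical workload mix used only to exercise the function.
profiles = [
    JourneyProfile("chat_question", 1800, 400, 1, 120),
    JourneyProfile("agent_task", 3000, 600, 4, 10),
]
baseline = required_tpm(profiles)
with_margin = required_tpm(profiles, retry_amplification=1.1, headroom=0.2)
print(baseline, with_margin)
```

Comparing the baseline with the padded estimate shows how much of a quota request comes from real demand and how much from retry and safety assumptions, which are the numbers a capacity review is likely to challenge.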

Real-world case studies

Different enterprise-style examples that show the term being used to hit measurable objectives.

Case study 01

Sports media service protects live-final chat capacity

Scenario, objectives, solution, measured impact, and takeaway.

Scenario

A sports media platform used Azure OpenAI to summarize live match questions, but championship traffic caused sudden 429 responses during the final ten minutes of games.

Business/Technical Objectives
  • Keep fan-facing chat latency under three seconds during peak match windows.
  • Reduce throttled model calls without removing citation quality from answers.
  • Separate internal editorial summarization from public user traffic.
  • Produce quota evidence strong enough for the platform capacity review.
Solution Using Tokens per minute

The team measured tokens per user question, retrieved stat packet, system message, and streamed answer. They discovered that editorial batch summaries consumed the same deployment budget as fan chat. Engineers split workloads into separate deployments, shortened retrieved context for repeat questions, capped answer length during surge mode, and added gateway limits tied to per-user token budgets. Azure CLI inventory commands exported deployment names, regions, and model versions, while Monitor metric queries captured token usage and throttling at one-minute intervals. The runbook routed nonurgent editorial jobs to a queue whenever public traffic crossed the surge threshold.

Results & Business Impact
  • Peak-match 429 responses dropped from 11.8% of model calls to 0.7%.
  • P95 answer latency improved from 8.4 seconds to 2.6 seconds during the final quarter.
  • Editorial batch summaries were delayed by four minutes during surges with no public outage.
  • Quota-review evidence was accepted without a second round of data collection.
Key Takeaway for Glossary Readers

TPM gives teams a practical capacity boundary for protecting the user experience when token-heavy features compete for the same model deployment.

Case study 02

Insurance claims assistant avoids renewal-season throttling

Scenario, objectives, solution, measured impact, and takeaway.

Scenario

An insurance carrier launched a claims assistant that reviewed adjuster notes, policy clauses, and uploaded photos, but renewal season doubled request volume and saturated its model deployment.

Business/Technical Objectives
  • Support twice the normal adjuster volume without model throttling.
  • Keep long claim summaries accurate while reducing wasted prompt tokens.
  • Give operations a repeatable signal for when to request more capacity.
  • Avoid moving the workload to expensive dedicated throughput prematurely.
Solution Using Tokens per minute

The AI engineering group built a token budget for each claim workflow: note cleanup, policy retrieval, photo-caption review, and final summary generation. They replaced full policy excerpts with reranked clause snippets, summarized old conversation turns, and blocked attachments that had already been processed. Azure CLI checks confirmed that staging and production used different model versions, which explained inconsistent load-test results. The team then created dashboards for prompt tokens, completion tokens, retries, and 429 responses by deployment. Instead of buying dedicated throughput immediately, they used queueing for nonurgent document backfills and reserved the main deployment for adjuster interactions.

Results & Business Impact
  • Average prompt size fell 38% while claim-summary acceptance scores stayed within review tolerance.
  • Throttled requests decreased from 6.3% to 0.4% during renewal week.
  • The carrier deferred a provisioned-throughput purchase for one quarter, saving an estimated $72,000.
  • Operations gained a documented quota trigger based on sustained token usage and retry rate.
Key Takeaway for Glossary Readers

A clear TPM model helps teams decide whether to buy capacity, reshape prompts, or separate workloads before customers feel the bottleneck.

Case study 03

Online learning platform smooths exam-week tutor demand

Scenario, objectives, solution, measured impact, and takeaway.

Scenario

An online education provider used an AI tutor for calculus and chemistry, but exam week produced long chats with large worked examples that overwhelmed its TPM allocation.

Business/Technical Objectives
  • Keep tutor sessions responsive during evening exam-prep peaks.
  • Prevent one subject with long prompts from starving all other courses.
  • Maintain explanation quality while limiting unnecessary context.
  • Give support staff clear incident evidence during student complaints.
Solution Using Tokens per minute

The platform team analyzed token usage by course, problem type, conversation depth, and answer style. Chemistry lab questions carried large safety context, while calculus questions produced lengthy step-by-step responses. Engineers separated high-token subjects into dedicated deployments, added conversation summarization after six turns, and introduced a shorter answer mode when TPM consumption crossed a warning threshold. CLI deployment exports verified the correct model versions for each subject deployment, and minute-level metric queries were attached to incident tickets. The user interface also displayed a friendly delay message when noncritical hints moved to a queue.

Results & Business Impact
  • Exam-week chat completion failures fell from 9.1% to 0.6%.
  • Median tutor response time dropped from 5.7 seconds to 2.2 seconds at peak.
  • Course-level deployment separation prevented chemistry surges from affecting calculus sessions.
  • Support ticket triage time fell 44% because agents could see deployment-specific token pressure.
Key Takeaway for Glossary Readers

TPM is the capacity language that turns vague AI slowness into measurable deployment ownership and practical traffic controls.

Why use Azure CLI for this?

Azure CLI is useful for TPM work because capacity problems need repeatable evidence, not screenshots taken after the incident. As a ten-year Azure engineer, I use the CLI to confirm the exact Azure OpenAI resource, region, deployment name, model version, SKU, tags, and diagnostic settings before blaming the prompt or the model. The CLI also exports deployment inventory and metrics queries for change records, load-test reviews, and quota-increase requests. The portal is fine for inspection, but scripts let teams compare production, staging, and disaster-recovery regions in minutes and prove which deployment is actually being throttled. That keeps capacity debates grounded in current Azure state, which matters most during launch reviews, and it prevents teams from chasing capacity issues in the wrong environment.

CLI use cases

  • Confirm which resource, region, SKU, and tags own the model deployment under review.
  • List deployments to separate chat, embedding, batch, and agent traffic during capacity analysis.
  • Show a deployment before load testing so model version, capacity, and deployment name match the test plan.
  • Discover available Azure Monitor metric names before building token, throttle, and latency dashboards.
  • Export minute-level token and throttling evidence for quota requests, incident reviews, and FinOps meetings.

Before you run CLI

  • Select the correct tenant and subscription; TPM is scoped by region, subscription, model, and deployment type, so the wrong context produces misleading evidence.
  • Confirm you have reader access to the Azure OpenAI resource and monitoring data before collecting deployment or metric output.
  • Use UTC time windows, consistent output formats, and the exact deployment name from application configuration when comparing telemetry.
  • Treat create, update, and delete deployment commands as cost or availability-impacting; use read-only commands during diagnosis.
  • Check whether the workload uses multiple regions or multiple deployments before concluding that a single TPM value explains all failures.

What output tells you

  • The account output identifies the resource, kind, location, SKU, tags, and resource ID that anchor the capacity conversation.
  • Deployment output shows model name, model version, deployment name, and capacity-related settings that must match application configuration.
  • Metric definitions tell you which token, request, throttle, and latency signals are available for the resource in that region.
  • Metric values show whether token usage and throttling rose during the same time window as user complaints or load-test failures.
  • Tag and resource group fields help route ownership to the application team, platform team, or FinOps reviewer responsible for quota decisions.
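
Correlating token usage with throttling in the same time window can be done offline once the metric output is saved as JSON. The sketch below assumes the output of `az monitor metrics list ... --output json` nests data points under value, timeseries, and data, with timeStamp and total fields; verify that layout against real output before relying on it. The metric names and numbers are invented placeholders, not real Azure metric names.

```python
# Illustrative stand-in for saved metric output. The nesting is an assumption
# to verify; the metric names and values are made up for this example.
SAMPLE_OUTPUT = {
    "interval": "PT1M",
    "value": [
        {
            "name": {"value": "example-token-metric"},
            "timeseries": [{"data": [
                {"timeStamp": "2026-01-01T19:00:00Z", "total": 41000.0},
                {"timeStamp": "2026-01-01T19:01:00Z", "total": 97000.0},
                {"timeStamp": "2026-01-01T19:02:00Z", "total": 99500.0},
            ]}],
        },
        {
            "name": {"value": "example-throttle-metric"},
            "timeseries": [{"data": [
                {"timeStamp": "2026-01-01T19:00:00Z", "total": 0.0},
                {"timeStamp": "2026-01-01T19:01:00Z", "total": 12.0},
                {"timeStamp": "2026-01-01T19:02:00Z", "total": 31.0},
            ]}],
        },
    ],
}


def minute_totals(metrics_json, field="total"):
    """Flatten metric output into {metric_name: {timestamp: value}}."""
    result = {}
    for metric in metrics_json.get("value", []):
        name = metric.get("name", {}).get("value", "unknown")
        points = result.setdefault(name, {})
        for series in metric.get("timeseries", []):
            for point in series.get("data", []):
                value = point.get(field)
                if value is not None:  # skip empty samples
                    ts = point["timeStamp"]
                    points[ts] = points.get(ts, 0) + value
    return result


def pressured_minutes(series, token_metric, throttle_metric, tpm_limit, ratio=0.9):
    """Minutes where token usage neared the limit and throttling also occurred."""
    tokens = series.get(token_metric, {})
    throttles = series.get(throttle_metric, {})
    return sorted(
        ts for ts, used in tokens.items()
        if used >= ratio * tpm_limit and throttles.get(ts, 0) > 0
    )


series = minute_totals(SAMPLE_OUTPUT)
hot = pressured_minutes(series, "example-token-metric",
                        "example-throttle-metric", tpm_limit=100000)
print(hot)
```

The returned timestamps can then be lined up against user complaints or load-test failures. The 90% ratio is an arbitrary starting point, not a documented threshold.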

Mapped Azure CLI commands

Tokens per minute CLI commands

az cognitiveservices account show --name <account> --resource-group <resource-group> --query "{name:name,kind:kind,location:location,sku:sku.name,tags:tags}" --output json
az cognitiveservices account deployment list --name <account> --resource-group <resource-group> --output table
az cognitiveservices account deployment show --name <account> --resource-group <resource-group> --deployment-name <deployment> --output json
az monitor metrics list-definitions --resource <ai-resource-id> --output table
az monitor metrics list --resource <ai-resource-id> --metric <token-or-throttle-metric> --interval PT1M --output table

Architecture context

Architecturally, TPM belongs in the traffic-management and capacity layer of an AI system. I plan it before the first public pilot by estimating tokens per user action, peak concurrent users, retry amplification, background evaluations, and the answer budget reserved for each request. High-volume systems often separate interactive chat, embedding generation, batch summarization, and agent tool workflows into different deployments or regions so one workload cannot starve another. TPM also influences RAG chunking, API gateway throttles, queue back-pressure, and whether provisioned throughput should be reserved. A mature design makes token consumption visible beside latency, error rate, and cost before launch reviews. Capacity ownership must be explicit, and region allocations should be validated. I also document burst assumptions before approvals so that scaling choices stay explicit for reviewers.
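
Workload separation with queue back-pressure might look like the following sketch. It is one possible arrangement under stated assumptions, not a prescribed Azure pattern: the deployment names, the set of deferrable workloads, and the 0.85 surge ratio are hypothetical.

```python
from collections import deque

# Hypothetical workload-to-deployment mapping; the names are placeholders.
DEPLOYMENTS = {
    "chat": "chat-prod",
    "embeddings": "embed-prod",
    "batch": "batch-prod",
}
DEFERRABLE = {"batch"}  # workloads that may wait in a queue during surges


class WorkloadRouter:
    """Send each workload to its own deployment and defer nonurgent work."""

    def __init__(self, surge_ratio=0.85):
        self.surge_ratio = surge_ratio
        self.deferred = deque()

    def route(self, workload, job, utilization):
        """Return the target deployment, or None when the job was queued.

        utilization maps deployment name to the fraction of its TPM in use.
        """
        deployment = DEPLOYMENTS[workload]
        busy = utilization.get(deployment, 0.0) >= self.surge_ratio
        if workload in DEFERRABLE and busy:
            self.deferred.append((workload, job))
            return None
        return deployment


router = WorkloadRouter()
usage = {"chat-prod": 0.95, "batch-prod": 0.90}
chat_target = router.route("chat", "question-1", usage)
batch_target = router.route("batch", "summary-7", usage)
print(chat_target, batch_target, len(router.deferred))
```

Interactive chat is still routed while its deployment is busy because it is not marked deferrable. Only batch work backs off into the queue, which is what stops one workload from starving another.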

Security

Security impact is indirect but real. TPM does not grant access or encrypt data, yet exhausted quota can become an availability attack path. A leaked key, exposed endpoint, or overbroad managed identity can let one caller burn shared model capacity for everyone. Teams should combine Microsoft Entra authentication where available, key rotation, private networking, content filters, per-tenant limits, API gateway controls, and alerting for abnormal callers. Quota increases should require approval because more throughput expands blast radius, data-processing volume, and response workload. Keep ownership, audit trails, approval paths, and emergency access documented before production changes.
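
A per-tenant limit of the kind mentioned above could be enforced in an application gateway before any request reaches the model. The sketch below uses a simple fixed one-minute window per caller. The class name, the budget value, and the choice of a fixed window over a sliding window or token bucket are illustrative assumptions, and it does not reproduce how Azure itself enforces TPM.

```python
import time
from collections import defaultdict


class CallerTokenBudget:
    """Fixed one-minute token budget per caller (illustrative sketch)."""

    def __init__(self, tokens_per_minute, clock=time.monotonic):
        self.limit = tokens_per_minute
        self.clock = clock  # injectable so tests can control time
        # caller_id -> (window_index, tokens_used_in_that_window)
        self._windows = defaultdict(lambda: (None, 0))

    def try_consume(self, caller_id, tokens):
        """Reserve tokens for a caller; return False when over budget."""
        window = int(self.clock() // 60)
        current, used = self._windows[caller_id]
        if current != window:
            used = 0  # a new minute started, so reset the caller's usage
        if used + tokens > self.limit:
            self._windows[caller_id] = (window, used)
            return False
        self._windows[caller_id] = (window, used + tokens)
        return True


now = [0.0]
budget = CallerTokenBudget(1000, clock=lambda: now[0])
first = budget.try_consume("tenant-a", 600)
second = budget.try_consume("tenant-a", 500)   # would exceed 1000 this minute
other = budget.try_consume("tenant-b", 500)    # separate caller, separate budget
now[0] = 61.0
after_reset = budget.try_consume("tenant-a", 500)
print(first, second, other, after_reset)
```

Rejecting the request locally means a leaked key or runaway client exhausts only its own budget instead of the shared deployment quota.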

Cost

Cost impact is direct because token volume commonly drives model consumption charges, while provisioned options create capacity commitments. Higher TPM enables more throughput, but it also lets poorly tuned prompts spend faster. Oversized RAG chunks, repeated retries, duplicate model calls, and verbose completions raise cost without improving value. FinOps reviews should connect TPM allocation to workload ownership, tenant demand, forecasted peak, actual token usage, and quota requests. Budget alerts should follow the same workload boundaries so approval and accountability stay clear. Review ownership, idle usage, scale assumptions, chargeback signals, and retention behavior before expanding capacity.

Reliability

Reliability depends on staying inside TPM during normal peaks and degrading cleanly when demand spikes. If an application treats every 429 response as a reason to retry immediately, it can multiply token pressure and extend an outage. Reliable designs reserve capacity for critical workflows, cap prompt and retrieved-context size, use bounded exponential backoff, queue nonurgent work, and monitor token usage by deployment. Multi-region designs should test whether failover capacity has enough TPM, not just whether the secondary endpoint exists. Operators should also rehearse quota exhaustion because the fix may involve traffic shaping or model changes, not a simple restart. Capacity alarms should fire early, before users see extended throttling. Preproduction load tests should include worst-case prompts and expected concurrent user bursts.
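
Bounded exponential backoff can be sketched as follows. The `send` callable is a hypothetical stand-in for whatever client function the application uses. It is assumed to return a status code, an optional retry hint in seconds, and a body. The base delay, cap, and retry count are placeholder values to tune.

```python
import random
import time


def call_with_backoff(send, max_retries=5, base=1.0, cap=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call send() and retry only on 429, with capped exponential backoff.

    send is a caller-supplied function (hypothetical here) returning
    (status_code, retry_after_seconds_or_None, body).
    """
    status, retry_after, body = send()
    for attempt in range(max_retries):
        if status != 429:
            break
        ceiling = min(cap, base * (2 ** attempt))  # bounded exponential growth
        delay = rng() * ceiling                    # jitter spreads out retries
        if retry_after is not None:
            # Wait at least as long as the server hint, but never past the cap.
            delay = min(cap, max(delay, retry_after))
        sleep(delay)
        status, retry_after, body = send()
    return status, body
```

A caller would pass its own request function, for example `call_with_backoff(lambda: my_client_call(prompt))`, where `my_client_call` is whatever wrapper the application already has. After `max_retries` attempts the last 429 is returned to the caller so the request can be queued or failed, which keeps retries from piling more pressure onto a saturated deployment.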

Performance

Performance impact is direct because TPM throttling raises response time once requests exceed available token capacity. Queues, retries, and backoff can make a chat feature feel unreliable before users understand the cause. Prompt design matters: fewer retrieved chunks, concise tool schemas, shorter system messages, and controlled output length reduce token pressure. Teams should monitor first-token latency, total duration, tokens per request, 429 rates, and retry delay. Load tests should include realistic prompt sizes, streaming behavior, and worst-case output lengths. Benchmark realistic load, dependency latency, concurrency limits, diagnostic query speed, retry behavior, and rollback impact before declaring the design ready.
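
Limiting retrieved context to a token budget is one way to apply the prompt-design advice above. This sketch assumes the chunks arrive ranked best-first. It counts tokens with a rough four-characters-per-token heuristic, which is only an approximation for English text; a real tokenizer for the deployed model should replace it. The budget value is hypothetical.

```python
def approx_tokens(text):
    """Rough heuristic of about four characters per token (an assumption).

    Swap in the tokenizer that matches the deployed model for real counts.
    """
    return max(1, len(text) // 4)


def fit_chunks(chunks, budget_tokens, count_tokens=approx_tokens):
    """Keep ranked chunks, in order, that fit within the token budget.

    chunks is assumed to be sorted best-first by the retriever or reranker.
    """
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            continue  # skip this chunk; a smaller later chunk may still fit
        kept.append(chunk)
        used += cost
    return kept, used


ranked = ["a" * 400, "b" * 800, "c" * 200]  # stand-ins for retrieved passages
kept, used = fit_chunks(ranked, budget_tokens=160)
print(len(kept), used)
```

Passing a different `count_tokens` function keeps the selection logic unchanged while making the counts accurate for a specific model.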

Operations

Operators manage TPM by inventorying deployments, checking regional quotas, watching usage trends, and correlating throttling with application releases. They inspect deployment names, model versions, SKU choices, assigned capacity, retry settings, token telemetry, and diagnostic settings. During incidents, they separate TPM pressure from RPM limits, context overflow, service health, and application bugs. Runbooks should include before-and-after usage exports, expected peak load, rollback settings, owner contacts, and alert thresholds. Operators also document who owns each deployment capacity pool and approval path. Store repeatable command output, owner contacts, rollback evidence, normal examples, and post-change verification steps with the operational runbook.
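
An alert threshold for a runbook can be written as a rule over minute-level token totals. The sketch below flags sustained pressure when usage stays at or above a share of the TPM limit for several consecutive minutes. The 80% threshold and five-minute window are hypothetical defaults, not recommended values.

```python
def sustained_pressure(minute_tokens, tpm_limit, threshold=0.8, minutes=5):
    """Return True when usage stays at or above threshold * tpm_limit
    for at least `minutes` consecutive one-minute samples."""
    run = 0
    for used in minute_tokens:
        if used >= threshold * tpm_limit:
            run += 1
            if run >= minutes:
                return True
        else:
            run = 0  # a quieter minute resets the streak
    return False


# Hypothetical one-minute token totals for a deployment with a 100,000 TPM limit.
steady_climb = [50000, 85000, 90000, 81000, 70000]
spiky = [85000, 70000, 85000, 70000, 85000]
print(sustained_pressure(steady_climb, 100000, minutes=3))
print(sustained_pressure(spiky, 100000, minutes=2))
```

Requiring consecutive minutes keeps brief spikes from triggering a quota request, while a steady climb is flagged before throttling becomes visible to users.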

Common mistakes

  • Estimating capacity from requests per minute while ignoring prompt length, retrieved context, tool output, and completion size.
  • Retrying 429 responses without bounded backoff, which increases token pressure and makes recovery slower.
  • Testing with short prompts but launching with long conversation memory and production RAG chunks.
  • Assuming a secondary region can absorb failover traffic without checking its approved model capacity.
  • Buying provisioned throughput before checking whether prompt compression, smaller models, caching, or queueing would solve the constraint.