Pod disruption budget

A pod disruption budget, often called a PDB, tells Kubernetes how much voluntary disruption an application can tolerate. For example, it can require at least two replicas to stay available while nodes are drained for an AKS upgrade. It does not stop every outage, crash, or forced eviction. It mainly protects against planned operations that would remove too many healthy pods at once. A useful PDB matches the right pods, reflects real replica counts, and fits the application rollout and maintenance strategy.

Aliases
PDB, Kubernetes pod disruption budget, AKS pod disruption budget
Difficulty
intermediate
CLI mappings
5
Last verified
2026-05-19

Microsoft Learn

A pod disruption budget is a Kubernetes policy that limits how many matching pods can be voluntarily disrupted at one time. In AKS, it helps preserve availability during node drains, upgrades, and planned maintenance by enforcing minAvailable or maxUnavailable rules.

Microsoft Learn: Reliability in Azure Kubernetes Service (AKS), 2026-05-19

Technical context

In Azure architecture, a pod disruption budget is a Kubernetes policy object used inside AKS clusters. It selects pods by label and defines either minAvailable or maxUnavailable. The Kubernetes eviction API checks that budget during voluntary disruptions such as node drain, cluster upgrade, maintenance, or administrator-initiated eviction. PDBs interact with Deployments, StatefulSets, replica counts, node pools, topology spread, readiness probes, and rollout strategy. Azure CLI helps operators identify the AKS cluster and maintenance context, while kubectl inspects the PDB and matching pods.
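
As a rough illustration of the label selector and the minAvailable rule described above, the sketch below prints a policy/v1 PodDisruptionBudget manifest from a small shell function. The PDB name, the prod namespace, and the app: booking-api label are hypothetical placeholders, not values from a real cluster, and the function only writes text to stdout.

```shell
#!/usr/bin/env bash
# Print a PodDisruptionBudget manifest to stdout.
# All default values below are hypothetical placeholders.
print_pdb_manifest() {
  local name="${1:-booking-api-pdb}"
  local namespace="${2:-prod}"
  local app_label="${3:-booking-api}"
  local min_available="${4:-2}"
  cat <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ${name}
  namespace: ${namespace}
spec:
  minAvailable: ${min_available}   # set maxUnavailable instead if preferred, never both
  selector:
    matchLabels:
      app: ${app_label}
EOF
}

# Example (not run here): validate client-side without creating anything.
# print_pdb_manifest | kubectl apply --dry-run=client -f -
```

Keeping the manifest in a function makes it easy to review the selector and the threshold with the application owner before anything reaches the cluster.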

Why it matters

A pod disruption budget matters because planned maintenance can accidentally become an outage when too many replicas leave service at once. AKS upgrades, node image updates, node drains, and scale-down events all need room to move pods safely. A PDB gives Kubernetes a rule for preserving availability during those voluntary disruptions. It also forces teams to think honestly about replica count and readiness. A PDB cannot save a single-replica workload from downtime, and a too-strict PDB can block upgrades. The value comes from matching the budget to real application tolerance and operational procedures. Operators should review it before every maintenance window, not after eviction failures.

Where you see it

Signals, screens, and Azure surfaces where this term usually becomes operational.

Signal 01

In kubectl get pdb output, operators see the configured minAvailable or maxUnavailable value and the current allowed disruptions. kubectl describe pdb, or kubectl get pdb -o yaml, adds the label selector and the status counts for current healthy, desired healthy, and expected pods.

Signal 02

In AKS upgrade or node-drain troubleshooting, blocked evictions often reference a pod disruption budget that currently allows zero voluntary disruptions for matching pods during maintenance.

Signal 03

In workload manifests or Helm charts, PDB objects use label selectors that must match pods controlled by Deployments, StatefulSets, or other workload controllers during rollout planning.
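
The signals above map onto fields of the PDB object itself. The following sketch, assuming kubectl is already pointed at the right cluster, reads the selector from spec and the four status counters from the policy/v1 status. The function name pdb_snapshot and its arguments are illustrative.

```shell
#!/usr/bin/env bash
# Print the selector and status counters of one PDB.
# Usage: pdb_snapshot <pdb-name> <namespace>
pdb_snapshot() {
  local pdb="$1" ns="$2"
  local field
  # Labels the PDB uses to pick its pods.
  printf 'selector: %s\n' \
    "$(kubectl get pdb "$pdb" -n "$ns" -o jsonpath='{.spec.selector.matchLabels}')"
  # Counters maintained by the disruption controller.
  for field in currentHealthy desiredHealthy expectedPods disruptionsAllowed; do
    printf '%s: %s\n' "$field" \
      "$(kubectl get pdb "$pdb" -n "$ns" -o jsonpath="{.status.${field}}")"
  done
}

# Example (not run here):
# pdb_snapshot <pdb-name> <namespace>
```

An expectedPods value of zero is a quick hint that the selector matches nothing.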

When this becomes relevant

Specific situations where this term helps solve real Azure design, operations, migration, security, reliability, cost, or governance problems.

  • Protect critical AKS replicas during node drains, node-image upgrades, cluster upgrades, and planned maintenance.
  • Prevent maintenance from evicting too many healthy pods when a workload needs minimum availability for user traffic.
  • Troubleshoot blocked AKS upgrades by inspecting allowed disruptions, matching labels, current healthy pods, and replica counts.
  • Balance reliability and maintenance speed by setting realistic minAvailable or maxUnavailable values for each workload.
  • Document application-owner approval before loosening a PDB, scaling replicas, or forcing evictions during a maintenance window.

Real-world case studies

Different enterprise-style examples that show the term being used to hit measurable objectives.

Case study 01

Airline booking upgrade protection

Scenario, objectives, solution, measured impact, and takeaway.

Scenario

SkyRoute operated a booking API on AKS with three replicas during normal demand. A previous node image upgrade evicted two replicas at once, causing checkout timeouts during a fare sale.

Business/Technical Objectives
  • Keep at least two booking API replicas available during planned node maintenance.
  • Prevent node upgrades from causing checkout latency above the service target.
  • Document when operators may scale replicas before a maintenance window.
  • Avoid permanently increasing cluster size for rare upgrade events.
Solution Using Pod disruption budget

The platform team added a PDB with minAvailable set to two for the booking API labels and verified that the selector matched only checkout pods. Before upgrades, Azure CLI confirmed the AKS cluster, node pool, and maintenance window, while kubectl showed current healthy, desired healthy, and allowed disruptions. The release runbook scaled replicas from three to four during high-demand maintenance windows, then returned them to normal afterward. Alerts watched P95 latency, pod readiness, and failed checkout requests while node drains proceeded one pod at a time.

Results & Business Impact
  • The next node image upgrade completed without checkout downtime.
  • P95 checkout latency stayed under the target during planned maintenance.
  • Temporary replica scaling added capacity only for the maintenance window, limiting extra cost.
  • Operators gained a clear pre-upgrade check for selector, readiness, and allowed disruptions.
Key Takeaway for Glossary Readers

A pod disruption budget turns planned AKS maintenance into a controlled availability decision instead of a surprise outage.

Case study 02

Payments risk service drain control

Scenario

ClearPay ran a fraud-scoring service on AKS that every payment authorization called before approval. Maintenance windows sometimes overlapped traffic spikes, and node drains risked leaving too few scoring pods ready.

Business/Technical Objectives
  • Maintain fraud-scoring availability during voluntary node drains.
  • Keep authorization latency stable during maintenance windows.
  • Give security and platform teams shared evidence before upgrades.
  • Avoid bypassing fraud checks just to complete infrastructure work.
Solution Using Pod disruption budget

Engineers created a PDB with maxUnavailable set to one for fraud-scoring pods and ensured the Deployment maintained enough replicas across zones. They used Azure CLI to review the target AKS cluster and node pools, then ran kubectl describe on the PDB before each maintenance event. If allowed disruptions was zero, operators checked pod readiness, recent rollouts, and node capacity instead of forcing the drain. The service dashboard correlated allowed disruptions, ready replicas, authorization latency, and error rate. Security approved the process because maintenance could no longer silently reduce fraud-scoring capacity below the agreed minimum.

Results & Business Impact
  • Maintenance-related scoring capacity drops were eliminated over two upgrade cycles.
  • Authorization P95 latency stayed within five percent of normal during node drains.
  • Security review time for planned maintenance fell by 40 percent because evidence was standardized.
  • No emergency fraud-check bypasses were needed during the following quarter.
Key Takeaway for Glossary Readers

PDBs help regulated services keep essential replicas available while infrastructure teams perform necessary maintenance.

Case study 03

Emergency dispatch maintenance window

Scenario

SignalDesk processed emergency dispatch messages for municipalities on AKS. The message router used multiple replicas, but a planned drain once removed enough pods to create a visible backlog.

Business/Technical Objectives
  • Keep dispatch message backlog below the alert threshold during maintenance.
  • Ensure voluntary disruptions never remove more router capacity than the workload can absorb.
  • Make blocked drains understandable to application and platform operators.
  • Coordinate PDB settings with autoscaling and readiness probes.
Solution Using Pod disruption budget

The team defined a PDB requiring at least three router pods to remain available and adjusted the Deployment to run four replicas during maintenance. Labels were reviewed so the PDB matched router pods but not unrelated workers. Azure CLI confirmed the cluster and maintenance schedule, while kubectl get pdb showed allowed disruptions before each drain. If allowed disruptions dropped to zero, operators checked readiness probes, pod placement, and autoscaler headroom. The incident workbook displayed queue depth, ready pods, allowed disruptions, and drain progress on one screen for dispatch coordinators.

Results & Business Impact
  • Dispatch backlog stayed below alert threshold during the next scheduled maintenance window.
  • Blocked drains were resolved by fixing one unready pod, not by deleting the PDB.
  • Maintenance status calls shortened because platform and application teams shared the same PDB evidence.
  • Router availability improved without permanently adding more nodes to the cluster.
Key Takeaway for Glossary Readers

A PDB is most valuable when operators pair it with readiness, replica planning, and clear maintenance evidence.

Why use Azure CLI for this?

As an Azure engineer, I use Azure CLI with kubectl for pod disruption budgets because upgrade and drain problems need fast evidence, not guesses. Azure CLI confirms the AKS cluster, resource group, node pools, version, and upgrade context. Then kubectl shows the PDB selector, desired healthy pods, current healthy pods, allowed disruptions, and the pods matched by labels. That view tells me whether maintenance is blocked by a too-strict budget, an under-replicated workload, a readiness failure, or a wrong selector. The commands are also useful before a maintenance window, because I can export the state, share it with application owners, and avoid deleting protection just to force an upgrade.

CLI use cases

  • Get AKS credentials and confirm the correct cluster before reviewing disruption budgets.
  • List all PDBs across namespaces before a node image upgrade or maintenance window.
  • Describe a PDB to inspect selectors, allowed disruptions, and healthy pod counts.
  • Compare deployment replicas, pod readiness, and PDB settings when node drains or upgrades are blocked.

Before you run CLI

  • Confirm tenant, subscription, resource group, AKS cluster, namespace, node pool, and kubeconfig context before maintenance checks.
  • Use Kubernetes RBAC carefully; changing PDBs, labels, replicas, or node drain behavior can affect production availability.
  • Check region, zone, upgrade plan, maintenance window, and autoscaler capacity before interpreting allowed disruptions.
  • Avoid deleting a PDB to unblock maintenance until the application owner approves the risk and rollback path.
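
The context checks in this list can be gathered in one pass. The sketch below assumes an authenticated Azure CLI session and an existing kubeconfig. The function name is hypothetical, and the cluster and resource group are passed in by the caller.

```shell
#!/usr/bin/env bash
# Show the account, kubeconfig context, and node pools in use.
# Usage: show_maintenance_context <cluster-name> <resource-group>
show_maintenance_context() {
  local cluster="$1" rg="$2"
  # Which tenant and subscription the CLI is signed in to.
  az account show --query "{tenant:tenantId, subscription:name}" --output table
  # Which cluster kubectl will talk to.
  kubectl config current-context
  # Node pools, their size, and their Kubernetes version.
  az aks nodepool list --cluster-name "$cluster" --resource-group "$rg" \
    --query "[].{name:name, mode:mode, count:count, version:orchestratorVersion}" \
    --output table
}

# Example (not run here):
# show_maintenance_context <cluster-name> <resource-group>
```

The commands are read-only, so the output can be attached to a change record before the maintenance window opens.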

What output tells you

  • Allowed disruptions shows how many matching pods can be voluntarily evicted without violating the budget right now.
  • Current healthy and desired healthy counts show whether the workload has enough ready replicas for planned maintenance.
  • Selectors identify which pods are covered, and mismatched labels explain why a budget appears valid but protects nothing.
  • Deployment replicas, pod readiness, and node placement explain whether the budget blocks drains because capacity or health is insufficient.
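
The relationship between these counts is simple arithmetic for integer budgets. The sketch below mirrors it with two illustrative helper functions: allowed disruptions is current healthy minus desired healthy, floored at zero, and for an integer maxUnavailable the desired healthy count is expected pods minus that value. Percentage budgets add rounding and are left out here.

```shell
#!/usr/bin/env bash
# Desired healthy pods for an integer maxUnavailable budget.
# (For an integer minAvailable budget, desired healthy is that value itself.)
desired_healthy_from_max_unavailable() {
  local expected="$1" max_unavailable="$2"
  local desired=$(( expected - max_unavailable ))
  if (( desired < 0 )); then desired=0; fi
  echo "$desired"
}

# Allowed disruptions given the current and desired healthy counts.
allowed_disruptions() {
  local current="$1" desired="$2"
  local allowed=$(( current - desired ))
  if (( allowed < 0 )); then allowed=0; fi
  echo "$allowed"
}

# Examples:
# allowed_disruptions 3 2                      -> 1 (three healthy, minAvailable 2)
# allowed_disruptions 2 2                      -> 0 (one of three pods is unready)
# desired_healthy_from_max_unavailable 4 1     -> 3
```

The second example shows why a single unready pod can block a drain even when the budget itself looks reasonable.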

Mapped Azure CLI commands

AKS pod disruption budget checks

direct
az aks get-credentials --name <cluster-name> --resource-group <resource-group>
az aks show --name <cluster-name> --resource-group <resource-group> --query "{kubernetesVersion:kubernetesVersion, powerState:powerState, nodeResourceGroup:nodeResourceGroup}"
kubectl get pdb -A
kubectl describe pdb <pdb-name> -n <namespace>
kubectl get pods -n <namespace> -l <selector> -o wide
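
Before a drain, it can help to list only the budgets that currently allow zero disruptions. The sketch below is one way to do that, assuming the three-column custom-columns layout shown in the comment. The filter function name is illustrative, and it reads the table from stdin, so it can be tried without a cluster.

```shell
#!/usr/bin/env bash
# Read a "NAMESPACE NAME ALLOWED" table on stdin and print
# namespace/name for every PDB whose allowed disruptions is 0.
filter_blocked_pdbs() {
  awk 'NR > 1 && $3 == 0 { print $1 "/" $2 }'
}

# Example (not run here):
# kubectl get pdb -A \
#   -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed' \
#   | filter_blocked_pdbs
```

Any PDB printed by the filter is worth a kubectl describe before maintenance starts.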

Architecture context

A pod disruption budget is a Kubernetes availability policy used inside AKS to protect a selected set of pods during voluntary disruptions. It works through label selectors and minAvailable or maxUnavailable rules, and the eviction API checks it during node drains, upgrades, maintenance, and administrator-initiated evictions. I treat a PDB as part of rollout architecture, not just a YAML accessory. It must align with replica counts, topology spread, node pool design, readiness probes, and application dependency behavior. A PDB cannot save a single-replica workload or prevent involuntary failures, but it can stop routine operations from taking too many healthy replicas offline at once. Operators should test it before cluster upgrades.

Security

Security impact is indirect because a PDB does not grant access, encrypt data, or define network policy. Risk appears through operations around it. Users who can create or modify PDBs can block maintenance, delay security patches, or weaken availability guarantees. Users who can drain nodes or evict pods exercise those budgets directly, and users who can delete pods outright bypass them, because a PDB is enforced only through the eviction API. Operators should control Kubernetes RBAC for PDB changes, node drain actions, and workload label changes. Labels are especially important because a broad selector can protect the wrong pods, while a selector that matches no pods can leave sensitive workloads exposed to planned disruption. Change records should show when availability protections were loosened for patching or upgrades.

Cost

Cost impact is indirect. A PDB does not create an Azure billing item, but it often requires enough replicas and node capacity to keep workloads available during maintenance. If minAvailable requires two healthy pods, the cluster must have enough capacity to run at least that many pods while nodes drain. Overly strict budgets can delay scale-down and keep nodes running longer than necessary. Under-designed budgets can lead to outages that cost far more than capacity. FinOps reviews should compare PDB requirements with replica counts, cluster autoscaler behavior, maintenance windows, and node pool utilization. That review keeps availability protection from becoming unexplained idle node capacity.

Reliability

Reliability impact is direct because a PDB controls how much planned disruption a workload can absorb. It helps keep enough replicas ready during node upgrades, maintenance, scale-down, and administrator drains. However, reliability can get worse if the PDB is unrealistic. A minAvailable value equal to the replica count allows zero disruptions, so evictions stay blocked until the budget or the replica count changes, while having no PDB may allow too many pods to disappear during maintenance. Reliable designs combine PDBs with multiple replicas, readiness probes, topology spread, anti-affinity, capacity headroom, and tested upgrade procedures. Operators should verify allowed disruptions before maintenance begins. This check prevents maintenance windows from turning into avoidable application incidents.

Performance

Performance impact is indirect but visible during change events. A PDB does not improve normal request latency, but it protects performance during voluntary disruption by keeping enough pods serving traffic. If too few pods remain, surviving replicas may become overloaded, increasing latency and error rate. If the PDB blocks maintenance, old nodes or images may remain longer than planned. Operators should examine allowed disruptions, replica readiness, autoscaler headroom, and P95 latency before and after drains. PDBs work best when paired with realistic resource requests and service-level performance targets. That pairing helps preserve user experience while infrastructure work continues safely, and it should be validated under load.

Operations

Operators inspect PDBs before AKS upgrades, node drains, planned maintenance, and incident response. Common checks include selector labels, desired healthy pods, current healthy pods, allowed disruptions, matching deployments, and replica count. Azure CLI identifies the cluster, node pools, and upgrade context; kubectl shows the PDB and pods affected by it. During blocked maintenance, operators should not delete the PDB casually. They should confirm whether the workload is under-replicated, not ready, or mislabeled, or whether the budget is too strict, then coordinate with the application owner before changing budgets or scaling replicas. The safest fix is often scaling replicas or correcting readiness, not removing protection.

Common mistakes

  • Setting minAvailable equal to the replica count and then wondering why AKS upgrades or node drains are blocked.
  • Creating a PDB selector that does not match the workload labels, leaving important pods unprotected.
  • Using a PDB for a single-replica workload and expecting it to prevent downtime during maintenance.
  • Deleting the PDB to force an upgrade without scaling replicas or confirming application-owner approval.
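
The first and third mistakes in this list can be caught before a manifest is applied. The sketch below is a hypothetical pre-flight check for integer minAvailable budgets only. The function name and the three result labels are illustrative, not part of any Kubernetes or Azure tool.

```shell
#!/usr/bin/env bash
# Classify a planned replica count and integer minAvailable value.
# Usage: classify_pdb <replicas> <min-available>
classify_pdb() {
  local replicas="$1" min_available="$2"
  if (( replicas <= 1 )); then
    # A PDB can only block the drain or permit downtime here.
    echo "single-replica"
  elif (( min_available >= replicas )); then
    # Allowed disruptions stays at 0 even when every pod is healthy.
    echo "blocks-evictions"
  else
    echo "ok"
  fi
}

# Examples:
# classify_pdb 1 1   -> single-replica
# classify_pdb 3 3   -> blocks-evictions
# classify_pdb 3 2   -> ok
```

A check like this fits naturally into a chart lint step or a pre-upgrade runbook, where the result can prompt a conversation with the application owner instead of a blocked drain.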