Kubernetes Pod Disruption Budgets (PDBs) & Topology Spread Constraints
Achieving true high availability and zero-downtime architecture in Kubernetes requires defending your applications against two completely different types of threats: planned maintenance and unplanned disasters. While standard Deployments and ReplicaSets ensure your desired number of pods are running under normal circumstances, they are not enough on their own.
A Pod Disruption Budget (PDB) acts as your primary safety net for planned, voluntary disruptions (like EKS node upgrades or cluster scale-downs), ensuring a strict minimum number of pods stays active to serve customer traffic. Topology Spread Constraints, on the other hand, are your shield against unplanned, involuntary disruptions (like a complete AWS Availability Zone outage), mathematically distributing your pods across multiple physical failure domains. Together, these two mechanisms form the core of enterprise-grade Kubernetes reliability.
The PDB Analogy (Voluntary Protection): Think of a busy bank with 5 cashiers. The bank manager wants to upgrade all the chairs at the counters (planned maintenance). If the manager takes away all 5 chairs at once, work stops, and customers get angry. But if there is a strict PDB rule saying, “At least 3 cashiers must be available at all times,” the maintenance team will only upgrade up to 2 chairs at a time. The bank keeps running smoothly.
- Limits the number of pods that can go down simultaneously due to voluntary disruptions.
- Key Fields (Pick ONE):
  - `minAvailable`: The strict floor. Minimum pods that must remain up (e.g., `2` or `80%`). Great for quorum/stateful apps.
  - `maxUnavailable`: The flexible ceiling. Maximum pods safely taken down (e.g., `1` or `20%`). Best for HPA-scaled web APIs.
- Command: `kubectl get pdb`
The Topology Spread Analogy (Involuntary Protection): Imagine you are carrying 6 precious glass bottles (your Pods) to a party. If you put all 6 bottles into one single carry bag (a single Node or single Availability Zone) and the bag tears, all 6 bottles break! Your party is ruined. Instead, using Topology Spread Constraints, you smartly divide the 6 bottles equally: 2 in Bag A, 2 in Bag B, and 2 in Bag C. If one bag tears completely, you still have 4 bottles left to keep the party going!
- Controls the physical distribution of pods across defined infrastructure boundaries to prevent single points of failure.
- The Four Pillars:
  - `topologyKey`: The physical boundary (`kubernetes.io/hostname` for nodes, `topology.kubernetes.io/zone` for AZs).
  - `maxSkew`: Maximum allowed difference in pod count between domains (e.g., `1`).
  - `whenUnsatisfiable`: The fallback (`DoNotSchedule` for strict limits, `ScheduleAnyway` for soft, flexible scaling).
  - `labelSelector`: Identifies which pods to count in the math.
Pod Disruption Budgets (PDBs)
While Deployments and ReplicaSets ensure your applications are running, a Pod Disruption Budget (PDB) acts as an essential safety net during planned maintenance activities, like node upgrades or scaling down the cluster. A PDB tells Kubernetes the minimum number of pods that must stay running or the maximum number of pods that can be taken down simultaneously. This ensures that voluntary disruptions do not accidentally cause application downtime.
The “Voluntary” vs. “Involuntary” Rule
Let us understand the foundation clearly. In Kubernetes, disruptions happen in two ways:
- Involuntary Disruptions: These are accidents. A hardware failure, a node running out of memory (OOM), or a network crash. PDBs cannot prevent these. K8s will simply try to restart the pods on a healthy node.
- Voluntary Disruptions: These are intentional actions taken by the cluster admin or an automated script. Examples include draining a node for patches (`kubectl drain`), scaling down the cluster to save costs, or deleting a pod manually.
Keep in mind that when you run `kubectl drain <node-name>`, Kubernetes looks at the pods on that node. Before it deletes a pod to move it elsewhere, the Eviction API checks whether a PDB exists. If terminating that pod would violate the PDB rule (e.g., drop the running pods below the `minAvailable` limit), the API blocks the drain until another pod is successfully spun up on a different node.
| Type | Examples | PDB Protection? |
| --- | --- | --- |
| Voluntary | `kubectl drain`, Cluster Autoscaler scaling down, EKS Managed Node updates. | Yes |
| Involuntary | Hardware failure, kernel panic, AWS Spot Instance interruption, network partition. | No |
The Two Ways to Define a PDB
You can define a PDB using either Integer values or Percentages.
Key Differences at a Glance
| Feature | minAvailable | maxUnavailable |
| --- | --- | --- |
| Primary Focus | Guaranteeing a strict floor of availability. | Allowing a ceiling of acceptable disruption. |
| Best For | StatefulSets, Databases, Quorum requirements. | Deployments, Web APIs, Autoscaling workloads. |
| Percentage Math | Rounds UP (increases your safety margin). | Rounds UP (increases the disruption allowance). |
| Single Replica Risk | `minAvailable: 1` on a 1-replica app blocks all evictions forever. | `maxUnavailable: 1` on a 1-replica app allows that pod to be safely evicted. |
Option A: minAvailable
minAvailable dictates the absolute minimum number (or percentage) of pods that must remain Running and Ready during a voluntary disruption.
When to use it:
- Quorum-based Applications: If you are running a database or a StatefulSet (like ZooKeeper, Elasticsearch, or Consul), the cluster needs a strict quorum to avoid a split-brain scenario. If quorum requires 3 nodes to function, you set `minAvailable: 3`.
- Strict Capacity Floors: When you know your application will completely crash or drop traffic if it falls below a specific threshold of compute power.
```yaml
---
# The API version for PDBs changed from policy/v1beta1 to policy/v1 in Kubernetes 1.21.
# Always use policy/v1 for modern EKS clusters.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  # The name of your PDB. Best practice: name it after the Deployment/StatefulSet it protects.
  name: my-api-pdb
  # PDBs are strictly namespace-scoped. This MUST match the namespace of your pods.
  namespace: production
  # Optional: Labels for the PDB itself (useful for cluster management tools).
  labels:
    managed-by: platform-team
    environment: production
spec:
  # =========================================================================
  # 1. AVAILABILITY REQUIREMENT
  # Rule: You must define EXACTLY ONE of either `minAvailable` OR `maxUnavailable`.
  # You cannot define both. They accept either absolute integers (e.g., 2) or percentages (e.g., 20%).
  # =========================================================================
  # OPTION A: minAvailable
  # Guarantees that at least this many pods are "Ready" during a voluntary disruption (like a node drain).
  # If you have 3 replicas, setting this to 2 means Kubernetes will only evict 1 pod at a time.
  minAvailable: 2

  # OPTION B: maxUnavailable (commented out for this example)
  # Guarantees that no more than this number/percentage of pods are "Not Ready" at the same time.
  # Often preferred for Deployments hooked to a Horizontal Pod Autoscaler (HPA), where absolute numbers fluctuate.
  # maxUnavailable: 25%

  # =========================================================================
  # 2. SELECTOR (Targeting the Pods)
  # Rule: This tells the PDB which pods it is responsible for protecting.
  # =========================================================================
  selector:
    # `matchLabels` is the most common method.
    # CRITICAL: These labels must EXACTLY match the `spec.template.metadata.labels`
    # in your Deployment or StatefulSet.
    matchLabels:
      app: my-api
    # `matchExpressions` allows for more complex, set-based targeting (optional).
    # You can use both matchLabels and matchExpressions together (they act as an AND condition).
    # matchExpressions:
    #   - key: tier
    #     operator: In
    #     values:
    #       - backend
    #       - api

  # =========================================================================
  # 3. ADVANCED BEHAVIOR (beta in 1.27, GA in 1.31)
  # =========================================================================
  # unhealthyPodEvictionPolicy dictates how the PDB handles pods that are ALREADY
  # failing/crashing when a node drain is initiated.
  #
  # - AlwaysAllow (recommended for upgrades): Allows the eviction of unhealthy pods
  #   even if it violates the `minAvailable` rule. This prevents a single crashing
  #   pod from permanently blocking a node upgrade.
  # - IfHealthyBudget (the default): Only allows eviction of unhealthy pods if the total
  #   number of healthy pods meets the minAvailable threshold.
  unhealthyPodEvictionPolicy: AlwaysAllow
```
Option B: maxUnavailable
maxUnavailable dictates the maximum number (or percentage) of pods that are allowed to be taken down simultaneously.
When to use it:
- Stateless Applications (Deployments): This is the gold standard for web APIs, microservices, and frontend services.
- Autoscaling Workloads (HPA): If your workload scales dynamically between 2 and 50 pods via a Horizontal Pod Autoscaler, `minAvailable` becomes risky. (If you scale down to 3 pods, but your `minAvailable` was hardcoded to 4, the PDB becomes impossible to satisfy, and evictions block entirely.) `maxUnavailable` adapts smoothly to whatever the current desired replica count is.
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-api-dynamic-pdb
  namespace: production
  labels:
    managed-by: platform-team
    workload-type: stateless
spec:
  # =========================================================================
  # 1. DISRUPTION TOLERANCE (The Core Rule)
  # Rule: You must define EXACTLY ONE of either `minAvailable` OR `maxUnavailable`.
  # =========================================================================
  # OPTION B: maxUnavailable
  # This dictates the maximum number of pods that can be temporarily taken down
  # during voluntary disruptions (like an EKS managed node group upgrade).

  # Example 1: Absolute integer (uncomment to use)
  # maxUnavailable: 1
  # Meaning: "No matter how many pods are running, only let 1 go down at a time."
  # Warning: If you have 50 pods, draining a node with 5 pods on it will take a long time
  # because Kubernetes will evict them strictly one by one.

  # Example 2: Percentage (recommended for HPA workloads)
  maxUnavailable: 25%
  # Meaning: "Always ensure roughly 75% of the DESIRED replicas are running."
  # - If HPA scales to 4 pods: maxUnavailable is 1.
  # - If HPA scales to 40 pods: maxUnavailable is 10 (node drains happen much faster).
  # Note on math: Kubernetes rounds UP for maxUnavailable percentages.
  # E.g., 25% of 5 pods = 1.25 -> rounded up to 2 pods can be evicted.
  selector:
    matchLabels:
      app: my-api
  unhealthyPodEvictionPolicy: AlwaysAllow
```
You cannot define both minAvailable and maxUnavailable in the same PDB manifest.
The Math (Rounding Rules): When you use a percentage for minAvailable, Kubernetes rounds up to the nearest integer to ensure safety.
- Example: You have 7 pods and set `minAvailable: 50%`.
- Calculation: 50% of 7 is 3.5. Kubernetes rounds this up to 4.
- Result: At least 4 pods must always remain available.
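The rounding rule above can be sketched in a few lines of Python (a simplified model of how the disruption controller computes the floor, not the actual Kubernetes source):

```python
import math

def required_healthy(min_available_percent: int, replicas: int) -> int:
    """Pods that must stay Ready: minAvailable percentages round UP."""
    return math.ceil(replicas * min_available_percent / 100)

def allowed_disruptions(min_available_percent: int, replicas: int) -> int:
    """How many pods the Eviction API may take down right now."""
    return replicas - required_healthy(min_available_percent, replicas)

print(required_healthy(50, 7))     # 3.5 rounds up to 4
print(allowed_disruptions(50, 7))  # 7 - 4 = 3 pods may be evicted
```

Note how rounding up always works in favor of availability: the budget keeps one extra pod rather than one fewer.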
The Danger Zone (“The 100% Trap”): If you set `minAvailable: 100%` (or set it to an integer equal to your total replica count), you are telling Kubernetes: “Never evict any pods.” This will completely block `kubectl drain` commands and prevent node upgrades or Autoscaler scale-downs.
Integration with Readiness Probes
A PDB is useless if your Readiness Probes are poorly configured. Kubernetes considers a pod “Available” based on its Readiness check. If your probe returns 200 OK before the app is actually ready to handle traffic, you will still experience downtime.
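For reference, here is a minimal probe sketch; the endpoint path, port, and timings are illustrative assumptions, not values from this article:

```yaml
containers:
  - name: my-api-container
    image: my-registry/my-api:v1.0.0
    readinessProbe:
      httpGet:
        path: /healthz   # hypothetical endpoint; it should return 200 only once the app can truly serve traffic
        port: 8080
      initialDelaySeconds: 5   # give the app time to boot before the first check
      periodSeconds: 10
      failureThreshold: 3      # 3 consecutive failures flip the pod to "Not Ready"
```

A pod that fails this probe no longer counts toward `minAvailable`, which is exactly why a lying probe silently breaks your PDB math.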
DevSecOps Level
The “Drain” Workflow (Behind the Scenes)
The Anatomy of a kubectl drain
When you trigger a drain, you aren’t just “deleting” pods. You are initiating a coordinated handoff between the Eviction API and the Control Plane.
The Eviction Lifecycle
- Cordoning: The node is marked as `SchedulingDisabled`. No new pods will be placed there.
- Eviction Request: Instead of a standard `DELETE` call, `kubectl` sends an Eviction request to the API server.
- The PDB Check: The API server looks for any PDB whose label selector matches the pod.
- The Decision:
  - Allowed: If the disruption budget is not exceeded, the pod is deleted.
  - Denied: If the eviction would violate the PDB, the API returns a `429 Too Many Requests`.
- The Wait Loop: `kubectl` waits (usually with a timeout) and retries the eviction. It expects the controller (like a Deployment) to spin up a replacement on a different node.
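The decision step boils down to simple arithmetic. A toy model (not the real API server code) makes the 429 behavior easy to reason about:

```python
def eviction_allowed(healthy_pods: int, min_available: int) -> bool:
    """The Eviction API approves a request only if, after removing one pod,
    the number of healthy pods still meets the PDB floor."""
    return healthy_pods - 1 >= min_available

# 3 healthy pods with minAvailable: 2 -> one eviction is allowed
print(eviction_allowed(3, 2))  # True
# After that eviction (2 healthy), the next request is denied (HTTP 429)
print(eviction_allowed(2, 2))  # False
```

The drain loop keeps retrying the denied request until a replacement pod becomes Ready and raises `healthy_pods` again.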
The “Stuck Drain” Scenarios
Beyond the classic single-replica (`replicas: 1`) scenario covered earlier, here are three other ways drains get stuck:
- The Quorum Trap (PDB + Misconfiguration): If you have a 3-node ZooKeeper or etcd cluster with `minAvailable: 2`, and one node is already down for maintenance, trying to drain a second node will fail. Kubernetes doesn’t care why the first pod is gone; it only sees that evicting another would drop you to 1, breaking the PDB.
- The Pending Replacement: If your cluster is at capacity (no CPU/RAM left), the new pod created by the Deployment will sit in `Pending`. Since the new pod never becomes `Ready`, the PDB never “unlocks” the old pod for eviction. You are now in a deadlock.
- The “Broken” Readiness Probe: If a new pod starts but fails its `readinessProbe`, it never counts toward the `minAvailable` total. The drain will loop indefinitely while the new pod restarts.
Best Practices for “Drain-Safe” Clusters
To prevent your automated upgrades from hanging at 3 AM, follow these rules:
- Replica Count Rule: Always ensure `replicas > minAvailable`. For `minAvailable: 1`, you need at least 2 replicas.
- The “Zero” Exception: Setting `maxUnavailable: 0` is a death sentence for node maintenance. It literally means “never allow a disruption.”
- PDB for Critical Only: Don’t put PDBs on everything. Dev/Test environments or non-critical workers often don’t need them.
- Timeout Handling: Use the `--timeout` and `--force` flags in CI/CD pipelines to prevent a single stuck pod from blocking an entire cluster rollout.
PDBs introduce complex operational dynamics
PDBs introduce complex operational dynamics, especially concerning cluster scaling and stateful workloads:
- Unhealthy Pod Handling: In K8s 1.26+, there are features to handle unhealthy pods better in PDB calculations (`unhealthyPodEvictionPolicy`), allowing you to evict crashing pods even if the PDB is theoretically violated, preventing node-drain deadlocks.
- The `minAvailable: 100%` Trap: If an architect mistakenly sets `minAvailable: 100%` or `maxUnavailable: 0` for a Deployment, the pods can never be voluntarily evicted. This completely blocks node drains and prevents Cluster Autoscaler from scaling down empty nodes.
- Rounding Logic: When using percentages, K8s rounds up. If you have 3 replicas and set `minAvailable: 50%`, 50% of 3 is 1.5, which rounds up to 2. K8s will ensure 2 pods are always up.
- StatefulSets: PDBs are critical for quorum-based databases (like ZooKeeper or Elasticsearch) managed by StatefulSets. If you need a quorum of 3 to maintain split-brain protection, your `minAvailable` must be strictly set to the quorum size.
Production-grade DevSecOps environment
PDBs are a mandatory reliability standard. Without them, automated CI/CD pipelines and infrastructure scaling can cause self-inflicted outages.
GitOps: Ensure PDBs are bundled tightly with your application Helm charts or Kustomize manifests managed via ArgoCD.
Policy Enforcement: Use Kyverno or OPA Gatekeeper to enforce a mutating or validating webhook. For example, write a rule: “Any Deployment in the prod namespace with >1 replica MUST have an associated PDB.”
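As a rough illustration of policy-driven PDBs, a Kyverno `generate` rule can even create a default PDB automatically for every new Deployment. The policy name, namespace, and the `maxUnavailable: 1` default below are assumptions for this sketch; validate the schema against the Kyverno policy library before use:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: generate-default-pdb   # hypothetical policy name
spec:
  rules:
    - name: create-pdb-for-prod-deployments
      match:
        any:
          - resources:
              kinds: ["Deployment"]
              namespaces: ["production"]
      generate:
        apiVersion: policy/v1
        kind: PodDisruptionBudget
        name: "{{request.object.metadata.name}}-pdb"
        namespace: "{{request.object.metadata.namespace}}"
        synchronize: true
        data:
          spec:
            maxUnavailable: 1   # assumed safe default for stateless workloads
            selector: "{{request.object.spec.selector}}"
```

The advantage of generation over pure validation is that developers cannot forget the PDB; the platform supplies it.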
Monitoring: DevSecOps must monitor PDB health. Use Prometheus to alert on `kube_poddisruptionbudget_status_pod_disruptions_allowed == 0` (from kube-state-metrics). If this stays at 0 for too long, it means your application is degraded or your nodes are blocked from draining.
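With the Prometheus Operator, that alert might look like the following sketch (alert name, duration, and labels are illustrative; the metric itself comes from kube-state-metrics):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pdb-alerts
  namespace: monitoring
spec:
  groups:
    - name: pdb.rules
      rules:
        - alert: PDBZeroDisruptionsAllowed
          expr: kube_poddisruptionbudget_status_pod_disruptions_allowed == 0
          for: 15m   # assumed tolerance window before paging
          labels:
            severity: warning
          annotations:
            summary: "PDB in {{ $labels.namespace }} allows zero disruptions"
            description: "Node drains are blocked, or the workload is degraded."
```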
Additional Details
- What happens if a pod is naturally unhealthy? If a pod is crash-looping, it does not count as “Available”. If you have 3 replicas, `minAvailable: 2`, and 2 pods are in `CrashLoopBackOff`, your `disruptionsAllowed` status is 0. You cannot drain the node holding the 1 healthy pod.
- Difference between ReplicaSet and PDB: A ReplicaSet ensures “I will recreate a pod if it dies.” A PDB ensures “I will stop you from killing this pod if it risks my availability.”
- Overlapping PDBs: If multiple PDBs match the same pod (via messy label selectors), it creates conflicting constraints. Always keep label selectors strict and isolated to specific deployments.
Topology Spread Constraints: The Ultimate Defense Against Involuntary Disruption
If Pod Disruption Budgets (PDBs) are your safety net for planned maintenance, then Topology Spread Constraints are your primary shield against unplanned disasters, like an entire AWS Availability Zone (AZ) going down. By default, nothing stops the Kubernetes scheduler from “bin-packing”: cramming many replicas of the same app onto a single node to save money. This is great for cost, but terrible for high availability. Topology Spread Constraints give you the mathematical control to spread your mission-critical pods evenly across nodes, zones, and regions, ensuring true zero-downtime architecture.
Imagine you are carrying 6 precious glass bottles (your application Pods) to a party.
- The Default Scheduler approach: You put all 6 bottles into one single carry bag (a single Node/AZ). If the bag tears, all 6 bottles break, and your party is ruined (Downtime!).
- The podAntiAffinity approach: You decide strictly that every bottle must have its own separate bag. But if you only have 3 bags, you have to leave 3 bottles at home (Pods stuck in Pending).
- The Topology Spread Constraint approach: You smartly divide the 6 bottles equally: 2 in Bag A, 2 in Bag B, and 2 in Bag C. If one bag tears, you still have 4 bottles left to keep the party going!
Here is how to take control of your pod distribution.
The Trap: Why Not Just Use podAntiAffinity?
Historically, engineers relied on podAntiAffinity to enforce a simple rule: “Do not put this pod on a node that already has a pod with the same label.”
- The Problem: It is a rigid, binary constraint. If you have 3 nodes and request 4 replicas with strict anti-affinity, that 4th pod will stay `Pending` forever because there are no empty nodes left. It cannot scale gracefully.
- The Solution: Topology Spread Constraints operate mathematically. They ensure pods are distributed evenly across defined domains (like nodes or zones), but remain flexible enough to allow multiple pods in the same domain as you scale up.
The Four Pillars of Pod Topology Spread Constraints
To write effective Pod Topology Spread Constraints and guarantee true high availability, you need to master these four required fields in your pod specification. Think of them as the blueprints for your cluster’s resilience:
1. topologyKey (The Failure Domain)
This field tells the Kubernetes scheduler how your underlying infrastructure is partitioned and what “boundaries” it should use to spread your pods.
- To spread across individual Nodes: Use `kubernetes.io/hostname` (protects against single-server crashes).
- To spread across Cloud Availability Zones (AZs): Use `topology.kubernetes.io/zone` (protects against data center outages in AWS, GCP, Azure, etc.).
2. maxSkew (The Balance Beam)
This integer (>0) defines the maximum allowed difference in the number of matching pods between any two topology domains. It’s the strictness of your balancing act.
- The Math: If `maxSkew: 1`, and you have 3 AZs currently holding `[2, 2, 1]` pods, the scheduler must place the next pod in the third AZ to balance the spread to `[2, 2, 2]`.
- The Rule: A lower number means tighter balance, while a higher number allows for more uneven distribution.
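The scheduler’s check can be modeled in a few lines of Python. This is a simplification of the real algorithm (which also filters eligible domains and handles node-level policies), but it captures the skew arithmetic:

```python
def placement_allowed(zone_counts: dict, zone: str, max_skew: int) -> bool:
    """A pod may land in `zone` only if, after placement, that zone's count
    minus the global minimum count stays within maxSkew."""
    counts = dict(zone_counts)        # copy so we can simulate the placement
    counts[zone] += 1
    return counts[zone] - min(counts.values()) <= max_skew

counts = {"az-a": 2, "az-b": 2, "az-c": 1}
print(placement_allowed(counts, "az-a", 1))  # False: would create [3, 2, 1], skew 2
print(placement_allowed(counts, "az-c", 1))  # True: balances the spread to [2, 2, 2]
```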
3. whenUnsatisfiable (The Fallback Strategy)
This dictates the scheduler’s behavior when it hits a wall and cannot honor your maxSkew rule (e.g., an AZ is completely out of CPU/Memory).
- `DoNotSchedule` (Strict): The pod remains in a `Pending` state.
  - Best for: Strict compliance requirements, specialized stateful workloads, or batch jobs where placement is non-negotiable.
  - Danger: Can prevent emergency auto-scaling if resources run dry.
- `ScheduleAnyway` (Soft): The scheduler will try to balance the pods, but if it fails, it will prioritize availability and place the pod wherever it fits.
  - Best for: Production stateless APIs and web applications where keeping the service online is more important than perfect symmetry.
4. labelSelector (The Target)
Similar to a PodDisruptionBudget (PDB) or a Service, the scheduler needs to know exactly which pods to count when calculating the current skew.
- The Catch: This selector must perfectly match your pod’s labels. If your labels don’t match, the scheduler assumes there are 0 existing pods and your constraint won’t work as intended.
Production-Ready YAML (Multi-AZ EKS)
Here is a Deployment configured with a “Defense in Depth” strategy: it prioritizes spreading across AWS Availability Zones first, and individual nodes second.
```yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api
  namespace: production
  labels:
    app: my-api
spec:
  # The total number of pods we want running.
  replicas: 6
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api
    spec:
      # =======================================================================
      # TOPOLOGY SPREAD CONSTRAINTS
      # This controls how the kube-scheduler distributes pods across failure domains.
      # It calculates the distribution by counting pods that match the labelSelector.
      # =======================================================================
      topologySpreadConstraints:
        # ---------------------------------------------------------------------
        # RULE 1: AVAILABILITY ZONE SPREAD (Primary Defense)
        # Goal: Survive an entire AWS Availability Zone going offline.
        # ---------------------------------------------------------------------
        - maxSkew: 1
          # maxSkew specifies the maximum allowed difference in pod count between any two zones.
          # If maxSkew is 1, you can have 2 pods in Zone A and 3 in Zone B, but NOT 1 in Zone A and 3 in Zone B.
          topologyKey: topology.kubernetes.io/zone
          # topologyKey is the node label the scheduler uses to identify the failure domain.
          # EKS automatically applies this label to all worker nodes (e.g., ap-south-1a).
          whenUnsatisfiable: ScheduleAnyway
          # ScheduleAnyway (Soft Rule): If an AZ is out of capacity, it places the pod in another AZ.
          # -> Pros: Your app scales successfully even if AWS has capacity issues in one zone.
          # -> Cons: The spread might become slightly uneven temporarily.
          # DoNotSchedule (Hard Rule): The pod remains 'Pending' until the spread constraint can be met.
          labelSelector:
            matchLabels:
              app: my-api
          # labelSelector tells the constraint which pods to count when calculating the skew.
          # This MUST match the pod's labels.
          # ADVANCED (Kubernetes 1.27+ / EKS 1.27+):
          matchLabelKeys:
            - pod-template-hash
          # During a rolling update, Kubernetes creates a new ReplicaSet with a unique 'pod-template-hash'.
          # By adding this key, the scheduler calculates the spread ONLY for the new pods being rolled out,
          # rather than mixing the old and new pod counts. This ensures perfect balance after an upgrade.
        # ---------------------------------------------------------------------
        # RULE 2: NODE SPREAD (Secondary Defense)
        # Goal: Survive a single EC2 instance (Worker Node) crashing.
        # ---------------------------------------------------------------------
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          # kubernetes.io/hostname is the default label containing the node's unique hostname.
          # This rule prevents the scheduler from stacking multiple pods on the exact same server.
          whenUnsatisfiable: ScheduleAnyway
          # Again, using ScheduleAnyway ensures that if you have more replicas than you have nodes,
          # Kubernetes will just start placing multiple pods per node instead of blocking the deployment.
          labelSelector:
            matchLabels:
              app: my-api
          matchLabelKeys:
            - pod-template-hash
      containers:
        - name: my-api-container
          image: my-registry/my-api:v1.0.0
          # Standard configurations like ports, readinessProbes, and resources go here.
          ports:
            - containerPort: 8080
```

Even perfectly written constraints can behave unexpectedly when they collide with other Kubernetes systems. Keep an eye out for these three common edge cases in production:
1. Autoscaler Conflicts (Karpenter vs. CAS)
Strict topology constraints can create friction with your cluster autoscaler during rapid scale-up events.
- The CAS Struggle: If you use the standard Kubernetes Cluster Autoscaler (CAS), aggressive `topologySpreadConstraints` can sometimes confuse its simulation logic, leading to delayed node provisioning or scale-up deadlocks.
- The Karpenter Advantage: If you are using Karpenter (specifically on EKS), it handles these constraints natively. Karpenter understands the spread rules and will proactively provision the exact right nodes in the correct AZs to satisfy your `maxSkew` out of the gate.
2. The Rollout Imbalance (The “Old + New” Trap)
During a standard rolling update, the scheduler looks at your `labelSelector` and counts both the old terminating pods and the new spinning-up pods together. This temporary doubling of pods can heavily warp your skew calculation, causing your new pods to get stuck in a `Pending` state.
💡 Pro-Tip: Use `matchLabelKeys: [pod-template-hash]` (available in Kubernetes v1.27+). Adding this field instructs the scheduler to group pods by their specific ReplicaSet hash. This forces the scheduler to calculate the spread exclusively for the new revision, ignoring the old pods entirely.
3. Node Taints and “Phantom” Skew
By default, the scheduler calculates skew based on the pod’s intended topology domain. If a pod is assigned to a zone but gets stuck in a `Pending` state because it lacks a toleration for that zone’s node taints, the scheduler still counts it toward the `maxSkew` of that zone.
- The Impact: This “phantom pod” effectively blocks other healthy pods from scheduling in that zone, creating a bottleneck. Always double-check how your taints and tolerations overlap with your spread constraints.
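Since Kubernetes 1.26 (beta, enabled by default) the optional `nodeTaintsPolicy` field lets the scheduler respect taints when computing skew, excluding domains the pod cannot actually land in. A minimal sketch, reusing the `app: my-api` labels from the earlier example:

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    nodeTaintsPolicy: Honor   # skip nodes whose taints this pod does not tolerate when calculating skew
    labelSelector:
      matchLabels:
        app: my-api
```

The default is `Ignore`, which is exactly what produces the “phantom skew” behavior described above.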
Production-Grade DevSecOps Toolchain
- Provisioning (Karpenter): The standard Cluster Autoscaler (CAS) gets confused by aggressive spread constraints. Karpenter (on AWS) evaluates TSCs natively and proactively provisions EC2 instances in the exact missing AZs to satisfy your `maxSkew`.
- GitOps (ArgoCD / Flux): Ensure PDBs and TSCs are strictly bundled with your application Helm charts to prevent manual `kubectl` overrides.
- Policy as Code (Kyverno / OPA Gatekeeper): Enforce resilience programmatically. Create a cluster rule: “Any Deployment in the `production` namespace with `replicas > 1` MUST have an associated PDB and a zone-level Topology Spread Constraint.”
- Continuous Rebalancing (Kubernetes Descheduler): TSCs are evaluated at schedule time, not runtime. If an AZ goes down and recovers, existing pods won’t magically move. Run the Descheduler as a CronJob using the `RemovePodsViolatingTopologySpreadConstraint` strategy to rebalance the cluster over time.
- Observability (Prometheus): Alert on these critical metrics:
  - `kube_poddisruptionbudget_status_pod_disruptions_allowed == 0` (alerts when a node drain is blocked by a strict PDB).
  - `kube_deployment_status_replicas_unavailable > 0` (alerts when TSCs are forcing pods into a Pending state).
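The Descheduler strategy mentioned above can be enabled with a small policy file. This sketch uses the v1alpha2 profile syntax; verify the exact schema and plugin arguments against the Descheduler documentation for your version:

```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: rebalance-topology   # hypothetical profile name
    pluginConfig:
      - name: "RemovePodsViolatingTopologySpreadConstraint"
        args:
          constraints:
            - DoNotSchedule   # assumed: only evict pods violating hard constraints
    plugins:
      balance:
        enabled:
          - "RemovePodsViolatingTopologySpreadConstraint"
```

Pair this with a PDB: the Descheduler uses the Eviction API, so your disruption budgets also protect you from an over-eager rebalancing run.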