Kube-Scheduler
Kube-Scheduler: The “Decision Maker” & Cluster Planner
Imagine you are checking into a massive hotel (the Cluster) with your family (the Pod). You go to the front desk. The receptionist (Kube-Scheduler) doesn’t carry your bags or make your bed (that’s the Kubelet’s job). Instead, they look at your requirements: “We need a room with two beds, a sea view, and it must be non-smoking.”
The receptionist checks the computer system (etcd), filters out the rooms that are occupied or too small, scores the remaining rooms based on the best view, and finally hands you the key for Room 304 (assigns the Node). If no room matches your needs, you stay in the lobby (Pending state) until one opens up.
In technical terms, the Scheduler watches for new Pods that have no assigned node and selects the best node for them to run on.
- The Matchmaker: It marries “homeless” Pods to the most suitable Node.
- The Observer: It strictly watches for Pods where spec.nodeName is empty (see the command below).
- Two-Step Logic: Always remember: Filter first (Can it fit?), Score second (Is it the best fit?).
- Hands-Off Leader: It assigns the node (updates the database) but never touches the container itself.
- The Brain, Not the Muscle: It makes decisions; the Kubelet executes them.
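Because the Scheduler keys off an empty spec.nodeName, you can list the Pods it still has to place. A minimal sketch using standard kubectl field selectors (adjust the namespace flag to your setup):

```bash
# Pods with no assigned node that are still Pending --
# i.e., the Pods sitting in the scheduling queue.
kubectl get pods --all-namespaces \
  --field-selector=spec.nodeName=,status.phase=Pending
```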
| Feature | Description | Simple Analogy |
| --- | --- | --- |
| Filtering (Predicates) | Eliminating unsuitable nodes immediately. | “This shirt is size Small, I need Large. Discard it.” |
| Scoring (Priorities) | Ranking the remaining “survivor” nodes. | “These 3 shirts fit, but the red one looks the best. Pick the red one.” |
| Taints & Tolerations | Repelling pods from specific nodes. | “This seat is ‘Reserved’ for VIPs. You can’t sit here unless you have a VIP ticket.” |
| Node Affinity | Attracting pods to specific nodes. | “I prefer to sit near the window (US-East zone).” |
| Pod Affinity | Grouping pods together. | “I want to sit next to my friend (Database Pod).” |
| Pod Anti-Affinity | Keeping pods apart (for safety). | “I don’t want to sit next to my ex (Same App Instance).” |
The Kube-Scheduler is a control plane component that runs on the control plane (master) node. Its entire life purpose is to watch for unbound Pods (Pods with no assigned node).
The Scheduling Loop (The Lifecycle):
- Queueing: When you run kubectl run nginx, the API server saves the Pod to etcd. The Scheduler notices this Pod is sitting in the “Scheduling Queue.”
- Filtering (Hard Constraints): The scheduler looks at all available nodes (e.g., 100 nodes). It runs checks like:
  - NodeResourcesFit: Does the node have enough free CPU/Memory? (See the example manifest after this list.)
  - NodeUnschedulable: Is the node cordoned (closed for maintenance)?
  - TaintToleration: Does the pod tolerate the node’s taints?
  - Result: Maybe only 20 nodes pass this filter. The rest are dropped for this specific pod.
- Scoring (Soft Constraints): Now it looks at those 20 nodes to find the “best” one. It calculates a score (0-100) for each.
  - ImageLocality: Does the node already have the nginx image downloaded? (Saves time = Higher score.)
  - LeastRequested: Which node is emptiest? (Spreading the load = Higher score.)
- Binding: The node with the highest score wins. The Scheduler sends a Binding object to the API Server saying, “Assign Pod X to Node Y.”
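To make the NodeResourcesFit step concrete, here is a minimal sketch of the resource requests the filter compares against free node capacity (the Pod name and numbers are illustrative, not from the article):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-with-requests   # illustrative name
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        cpu: "500m"      # NodeResourcesFit drops any node with less than this free
        memory: "256Mi"
```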
DevSecOps Architect
At an architect level, you must understand that the Scheduler is not just a binary “fit/no-fit” machine; it is a Pluggable Scheduling Framework.
The Scheduling Framework (Extension Points): Modern Kubernetes (v1.19+) uses a framework approach where you can inject custom logic at different points (Plugins). You don’t just replace the scheduler; you extend it.
- QueueSort: Decides which pending pod goes first (PriorityClasses).
- PreFilter / Filter: Checks constraints (like GPU availability).
- PreScore / Score: Runs ranking algorithms.
- Reserve: Reserves resources internally in the scheduler cache (to prevent race conditions).
- Permit: Lets custom logic “wait” or “allow” a pod (used in gang scheduling for AI/ML workloads).
- Bind: The actual assignment logic (a configuration sketch follows below).
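As a rough sketch of how this looks in practice, plugins are enabled, disabled, and weighted per profile in a KubeSchedulerConfiguration file. The specific plugin choices and weight below are illustrative assumptions, not recommendations:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      disabled:
      - name: ImageLocality                    # switch off a built-in scoring plugin
      enabled:
      - name: NodeResourcesBalancedAllocation
        weight: 2                              # give this plugin more say in the final score
```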
Multi-Scheduler Setup: Did you know you can run multiple schedulers in one cluster?
- You can write a custom scheduler for specific batch jobs (like Big Data) and leave the default scheduler for web apps.
- In the Pod spec, you simply define schedulerName: my-custom-scheduler (see the snippet below).
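A minimal sketch (the Pod name and image are placeholders; if no scheduler called my-custom-scheduler is actually running, the Pod simply stays Pending):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-job-pod                  # illustrative name
spec:
  schedulerName: my-custom-scheduler   # only this scheduler will bind the Pod
  containers:
  - name: worker
    image: busybox
    command: ["sleep", "3600"]
```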
Descheduler (The “Corrector”): The Kube-Scheduler only makes its placement decision once, when the pod is created. If the cluster becomes unbalanced later (e.g., a big node is added but stays empty), the Scheduler won’t move old pods.
- Solution: Use the Descheduler. It evicts pods based on policies, forcing them to go back to the Kube-Scheduler to find a better home.
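For reference, a sketch of what a Descheduler policy can look like, based on the descheduler project’s v1alpha1 policy format (the threshold values are illustrative; check the project’s documentation for the exact schema of your version):

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:         # nodes below these values count as underutilized
          cpu: 20
          memory: 20
          pods: 20
        targetThresholds:   # evict from nodes above these values
          cpu: 50
          memory: 50
          pods: 50
```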
Taints and Tolerations (The “Repellent”)
- This is a critical concept.
- Taint: applied to a Node (e.g., “This node is for GPU tasks only”).
- Toleration: applied to a Pod (e.g., “I am a GPU task, I can tolerate that taint”).
- Analogy: A Taint is like a “Bad Smell” on the node. Only pods that “Tolerate” the smell will land there. Everyone else stays away.
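A minimal sketch of the pair in practice, assuming a node named gpu-node-1 and a gpu=true taint key (both names are illustrative):

```bash
# Repel everything that does not tolerate gpu=true from this node.
kubectl taint nodes gpu-node-1 gpu=true:NoSchedule
```

```yaml
# Pod spec snippet: this Pod "tolerates the smell" and may land on gpu-node-1.
spec:
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```

Note that a toleration only allows the Pod onto the tainted node; pairing it with Node Affinity (next section) is what actually pulls it there.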
Affinity and Anti-Affinity (The “Magnet”)
- Node Affinity: “I want to run on a node that is in the ‘US-East’ zone.” (Attraction).
- Pod Affinity: “I want to run on the same node as the Database Pod.” (Togetherness).
- Pod Anti-Affinity: “I do not want to run on the same node as another Web Server.” (Separation – useful for High Availability so one server crash doesn’t kill both apps).
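A minimal Pod spec snippet combining the two ideas (the zone value and the app label are illustrative assumptions):

```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["us-east-1a"]      # "sit near the window"
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: web                    # keep replicas of the same app apart
        topologyKey: kubernetes.io/hostname
```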
Use Case
- Dedicated Infrastructure: Ensuring heavy AI workloads only land on expensive GPU nodes using Taints.
- Cost Optimization: Packing non-critical dev pods onto cheaper “Spot Instances” using Affinity.
- High Availability: Using Anti-Affinity to ensure that if one node crashes, your entire application doesn’t go down because the replicas were spread out.
Benefits
- Resource Efficiency: Ensures hardware is utilized optimally (Bin-packing).
- Stability: Prevents “Noisy Neighbors” (one app eating all CPU) by respecting resource limits during filtering.
- Automation: Eliminates the need for manual placement of containers.
Limitations
- Static Decision: Once a pod is scheduled, the scheduler forgets about it. It does not re-balance purely based on runtime metrics (unless you use Descheduler).
- Complex Rules Conflict: It is easy to write rules that contradict each other (e.g., “Must be on Node A” vs “Tainted against Node A”), causing the pod to be stuck in Pending forever.
Common Issues, Problems, and Solutions
| Issue | Problem Analysis | Solution |
| --- | --- | --- |
| Pod Stuck in “Pending” | Scheduler cannot find a node that satisfies all Filters (CPU, Taints, Affinity). | Check events: kubectl describe pod <pod-name>. Look for “FailedScheduling”. |
| “Insufficient CPU” | The sum of Pod requests is higher than available Node capacity. | Reduce Pod requests or add more Nodes (Cluster Autoscaler). |
| Uneven Distribution | All pods landed on one node because it was empty at that specific second. | Use PodTopologySpreadConstraints to force even spreading. |
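For the last row in the table, here is a minimal sketch of topologySpreadConstraints in a Pod spec (the app label is an illustrative assumption):

```yaml
spec:
  topologySpreadConstraints:
  - maxSkew: 1                           # allow at most 1 pod of imbalance between nodes
    topologyKey: kubernetes.io/hostname  # spread across individual nodes
    whenUnsatisfiable: DoNotSchedule     # treat this as a hard constraint
    labelSelector:
      matchLabels:
        app: web
```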
- Main Docs: Kubernetes Scheduler
- Taints & Tolerations: Taints and Tolerations Docs
- Assigning Pods: Assigning Pods to Nodes
Labs
Scenario: Force a Pod to run on a specific node using nodeSelector.
Step 1: Label a Node
First, we give a “sticker” (label) to one of our worker nodes.

```bash
kubectl label nodes worker-node-1 disktype=ssd
```

Step 2: Create the Pod Manifest
Create a file named pod-ssd.yaml.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-ssd
spec:
  containers:
  - name: nginx
    image: nginx
  nodeSelector:
    disktype: ssd # This forces the Scheduler to look for the label
```

Step 3: Apply and Verify
```bash
kubectl apply -f pod-ssd.yaml
kubectl get pod nginx-ssd -o wide
# You should see it running specifically on worker-node-1
```
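Optionally, double-check the label and clean up afterwards. This is a small assumed addition to the lab, using standard kubectl commands:

```bash
# Confirm which nodes carry the label the Scheduler matched against.
kubectl get nodes -l disktype=ssd

# Clean up: delete the Pod and remove the label (a trailing '-' removes a label).
kubectl delete pod nginx-ssd
kubectl label nodes worker-node-1 disktype-
```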