
Kube-Controller Manager

Imagine a Smart Air Conditioner (AC). You set the remote to 24°C (this is your Desired State). The AC constantly checks the room temperature (the Current State). If the room gets hot (26°C), the AC turns on to cool it down. If it gets too cold, it stops. It works non-stop to ensure the room is exactly how you want it.

In Kubernetes, you declare, “I want 3 copies of my application running.” The Kube-Controller Manager is the brain that constantly checks: “Do we have 3 copies?” If one crashes and only 2 are left, it notices the gap and immediately orders a new one to be created. It enforces the rules.

“The Kube-Controller Manager is the brain that fixes the difference between what you want and what you have.”

Quick Reference
  • State Watcher: Continuously monitors the cluster state via the API Server.
  • Self-Healing: Automatically corrects failures (like restarting crashed pods).
  • Single Binary: Even though it does many jobs (Node control, Job control), it runs as one process to keep things simple and reduce overhead.

Feature      | Description                               | Real-World Analogy
Primary Role | Regulates the state of the cluster.       | A Cruise Control system in a car.
Input        | Desired State (YAML configuration).       | Setting speed to 80 km/h.
Action       | Compares Current State vs. Desired State. | Checking speedometer vs. setting.
Output       | Corrective action (Create/Delete Pods).   | Accelerating or braking.
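The compare-and-correct cycle above can be sketched as a tiny reconciliation function. This is a toy in-memory model for illustration only; the real controller manager watches the API Server rather than a Python list, and the pod names here are made up.

```python
# Minimal sketch of a Kubernetes-style reconciliation step (illustrative only;
# the real controller watches the API Server, not an in-memory list).

def reconcile(desired_replicas: int, current_pods: list[str]) -> list[str]:
    """Compare Desired State vs. Current State and return corrective actions."""
    actions = []
    diff = desired_replicas - len(current_pods)
    if diff > 0:
        # Too few pods running: order new ones to be created.
        actions = [f"create pod-{i}" for i in range(diff)]
    elif diff < 0:
        # Too many pods running: terminate the surplus.
        actions = [f"delete {name}" for name in current_pods[diff:]]
    return actions

# One pod crashed: 3 desired, only 2 running -> one create action is issued.
print(reconcile(3, ["pod-a", "pod-b"]))
```

Running this loop forever against live cluster state is, in essence, what every controller inside the Kube-Controller Manager does.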

DevSecOps Architect Level

Now that we understand the basics, let’s look at how a DevSecOps Architect leverages the Controller Manager for high availability and cluster stability.

The Kubernetes Controller Manager is a daemon that embeds the core control loops shipped with Kubernetes. In robotics and automation, a “control loop” is a non-terminating loop that regulates the state of a system.

For a DevSecOps Architect, the Kube-Controller Manager is critical for High Availability (HA) and Cluster Stability. Controllers do not constantly poll the API Server (which would overload etcd). Instead, they use a SharedInformer and DeltaFIFO queue pattern. This “edge-driven, level-triggered” approach means they react instantly to event streams but look at the entire cluster state to decide the next action.
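The "edge-driven, level-triggered" pattern can be sketched as follows: events (edges) only enqueue a key, and the worker then decides by looking at the full cached state (the level), so duplicate or missed events are harmless. The `cache`, `on_event`, and `sync` names below are illustrative stand-ins, not real client-go APIs.

```python
from collections import deque

# Toy model of the SharedInformer + workqueue pattern: events enqueue a key
# (edge-driven), the handler reconciles from full state (level-triggered).

cache = {"web": {"desired": 3, "running": 2}}  # stand-in for the informer cache
queue = deque()                                # stand-in for the DeltaFIFO/workqueue

def on_event(key: str) -> None:
    if key not in queue:          # dedupe, as a real workqueue does
        queue.append(key)

def sync(key: str) -> str:
    state = cache[key]            # decide from the whole state, not the event
    if state["running"] < state["desired"]:
        return f"scale {key} up to {state['desired']}"
    return f"{key} in sync"

on_event("web"); on_event("web")  # duplicate edges collapse into one work item
assert len(queue) == 1
print(sync(queue.popleft()))
```

Because the handler re-reads the full state, replaying the same event twice produces the same decision, which is why this pattern avoids hammering etcd with polling yet stays correct.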

  • Leader Election: In a multi-master production cluster, multiple instances of the Controller Manager run for redundancy. Only one can be active at a time to prevent conflicts. They use a Leader Election mechanism (Leases API) to decide the active “Enforcer.”
  • Performance Tuning: You can tune flags like --concurrent-deployment-syncs. Increasing this allows faster reconciling in large clusters but puts more load on the API Server.
  • Security Context: The Controller Manager requires significant privileges. It must run with its own Service Account and restricted RBAC permissions.
  • Metrics & Monitoring: It exposes a /metrics endpoint for Prometheus. Alert on metrics like workqueue_depth (is the controller falling behind?) and rest_client_requests_total (is it spamming the API server?).
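Leader election in the list above can be modeled as a lease record that replicas try to acquire or renew. This is a hedged in-memory toy, not the real coordination.k8s.io Lease API; the 15-second duration mirrors the `--leader-elect-lease-duration` default.

```python
# Toy model of Lease-based leader election between controller-manager replicas.
# Only the current lease holder runs the control loops; others stand by.

LEASE_DURATION = 15  # seconds, mirrors --leader-elect-lease-duration default

lease = {"holder": None, "renew_time": 0.0}

def try_acquire(candidate: str, now: float) -> bool:
    expired = now - lease["renew_time"] > LEASE_DURATION
    if lease["holder"] in (None, candidate) or expired:
        lease["holder"], lease["renew_time"] = candidate, now
        return True
    return False  # another instance holds a fresh lease: stay on standby

assert try_acquire("kcm-a", now=100.0) is True   # first instance becomes leader
assert try_acquire("kcm-b", now=105.0) is False  # standby while lease is fresh
assert try_acquire("kcm-b", now=120.0) is True   # takes over after lease expiry
```

The key property: a standby replica can only become the active "Enforcer" after the old leader stops renewing, which prevents two instances from issuing conflicting corrective actions.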

Kubernetes Node Controller: The Heartbeat Monitor of Your Cluster

The Node Controller is the Heartbeat Monitor of your cluster. It runs inside the kube-controller-manager on the Control Plane and manages the lifecycle of nodes (Registration, Monitoring, Deletion).

It relies on Heartbeats sent by the Kubelet every 10 seconds. If a node stops sending heartbeats, the controller waits for a grace period before evicting pods.

Feature                   | Default Value | Description
Monitor Period            | 5 seconds     | How often the controller checks the status of each node.
Node Monitor Grace Period | 40 seconds    | How long the controller waits before marking a node as Unhealthy / Unknown.
Pod Eviction Timeout      | 5 minutes     | How long the controller waits after a node is "Unknown" before deleting its Pods.
Eviction Rate             | 0.1/sec       | Speed at which pods are deleted from a down node (prevents mass-deletion panic).
Zone Awareness            | Enabled       | Handles failures differently if a whole Availability Zone goes down vs. just one node.

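The grace-period and eviction-timeout defaults from the table combine into a simple timeline for a silent node. The sketch below is a simplified model (the real controller tracks Lease renewals and per-pod tolerations, not a single timestamp).

```python
# Simplified timeline of a node that stops sending heartbeats, using the
# defaults from the table above (real logic uses Leases and taints).

NODE_MONITOR_GRACE_PERIOD = 40   # seconds
POD_EVICTION_TIMEOUT = 300       # seconds (5 minutes)

def node_status(last_heartbeat: float, now: float) -> str:
    silence = now - last_heartbeat
    if silence <= NODE_MONITOR_GRACE_PERIOD:
        return "Ready"
    if silence <= NODE_MONITOR_GRACE_PERIOD + POD_EVICTION_TIMEOUT:
        return "Unknown"             # marked unhealthy, pods not yet evicted
    return "Unknown, pods evicting"  # grace period + eviction timeout exceeded

assert node_status(last_heartbeat=0, now=10) == "Ready"
assert node_status(last_heartbeat=0, now=60) == "Unknown"
assert node_status(last_heartbeat=0, now=400) == "Unknown, pods evicting"
```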
Architect Tuning: Failure Handling & State Consistency

For an architect, tuning the Node Controller dictates how gracefully a cluster handles catastrophic failures.

The “Unknown” State Logic:

  1. The controller marks the Node's Ready status condition as Unknown.
  2. It adds a taint node.kubernetes.io/unreachable with a NoExecute effect.
  3. Pods tolerate that taint only for a limited time (tolerationSeconds, 300 seconds by default, injected by the DefaultTolerationSeconds admission plugin). If the node doesn't return before the countdown expires, the pods are evicted and rescheduled elsewhere.
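Step 3 can be sketched as a per-pod decision: evict once the pod's toleration for the unreachable taint runs out. A toy model, assuming the standard 300-second default toleration:

```python
# Toy model of taint-based eviction: a pod with a toleration for the
# unreachable taint survives until tolerationSeconds elapses; a pod
# without one is evicted immediately once the NoExecute taint lands.

TAINT = "node.kubernetes.io/unreachable"

def should_evict(pod_tolerations: dict, taint_age_seconds: float) -> bool:
    seconds = pod_tolerations.get(TAINT)
    if seconds is None:
        return True                        # no toleration: evicted at once
    return taint_age_seconds > seconds     # evicted when the toleration expires

assert should_evict({}, taint_age_seconds=1) is True
assert should_evict({TAINT: 300}, taint_age_seconds=120) is False
assert should_evict({TAINT: 300}, taint_age_seconds=301) is True
```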

Cloud Provider Architecture (KCM vs CCM): In modern Kubernetes (v1.29+), the kube-controller-manager (KCM) no longer talks directly to cloud APIs like AWS, Azure, or GCP. All cloud-specific logic has been decoupled into the Cloud Controller Manager (CCM).

  • If a node goes down, the CCM asks the Cloud API: “Does this VM instance still exist?”
  • If the Cloud API says “No,” the CCM immediately deletes the Node object from Kubernetes, skipping the standard 5-minute wait.
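The CCM fast path above amounts to one short-circuit: if the cloud says the VM is gone, delete the Node object now instead of waiting out the eviction countdown. A hedged sketch, where `cloud_instance_exists` stands in for a real cloud API lookup:

```python
# Sketch of the CCM decision for an unreachable node: a confirmed-dead VM
# skips the standard ~5-minute eviction wait entirely.

def handle_unreachable_node(node: str, cloud_instance_exists: bool) -> str:
    if not cloud_instance_exists:
        return f"delete Node {node} immediately"  # VM is gone: no point waiting
    return f"keep Node {node}, start standard eviction countdown"

assert handle_unreachable_node("node-1", cloud_instance_exists=False) == \
    "delete Node node-1 immediately"
```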

Common Issues and Solutions

Problem 1: Node Flapping (Ready <-> NotReady)

  • Cause: Kubelet is under high load and cannot send heartbeats in time.
  • Solution: Increase --node-monitor-grace-period in the controller manager or give more CPU to the Kubelet.

Problem 2: Pods Stuck in “Terminating” on Dead Node

  • Cause: The Node is down, so the Kubelet cannot confirm the pod is deleted. The Controller waits.
  • Solution: Force delete the pod: kubectl delete pod <pod-name> --grace-period=0 --force.

Problem 3: “Thundering Herd” on Recovery

  • Cause: Massive eviction causes Scheduler to overload.
  • Solution: Configure --node-eviction-rate carefully (default is 0.1 per second).
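The effect of `--node-eviction-rate=0.1` is easy to see with a little arithmetic: at 0.1 evictions per second, nodes are drained at most one every 10 seconds. A quick sketch of that pacing:

```python
# Pacing sketch: at --node-eviction-rate=0.1/sec, each additional down node
# waits another 10 seconds before its pods start being deleted, so a zone
# outage drains gradually instead of flooding the scheduler at once.

def eviction_schedule(nodes: list[str], rate_per_sec: float) -> dict[str, float]:
    """Return the earliest time (seconds from now) each node may be drained."""
    interval = 1.0 / rate_per_sec
    return {node: i * interval for i, node in enumerate(nodes)}

schedule = eviction_schedule(["n1", "n2", "n3"], rate_per_sec=0.1)
assert schedule == {"n1": 0.0, "n2": 10.0, "n3": 20.0}
```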


Deployment & ReplicaSet Controllers

While ReplicationControllers were the legacy standard (apiVersion: v1), modern Kubernetes relies entirely on Deployments and ReplicaSets.

  • The ReplicaSet Controller: Ensures a specific number of Pod replicas are running at any given time. It uses Set-Based selectors (e.g., env in (prod, stage)).
  • The Deployment Controller: Orchestrates the ReplicaSets. It provides declarative updates, enabling zero-downtime rolling updates and easy version rollbacks.

The Goal: Guarantee the Desired State (e.g., 3 Pods running v2 of an app) always matches the Actual State.
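The Set-Based selector mentioned above (e.g., env in (prod, stage)) can be sketched as simple set membership: the ReplicaSet controller counts only pods whose labels satisfy every expression.

```python
# Sketch of set-based label selection: a selector maps each key to the set
# of allowed values ("in" expressions), and a pod must satisfy all of them.

def matches(labels: dict, selector: dict) -> bool:
    return all(labels.get(key) in allowed for key, allowed in selector.items())

selector = {"env": {"prod", "stage"}}
pods = [{"env": "prod"}, {"env": "dev"}, {"env": "stage"}]
owned = [p for p in pods if matches(p, selector)]
assert owned == [{"env": "prod"}, {"env": "stage"}]
```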

Architect Tuning: Garbage Collection & Reconciliation

  • OwnerReferences and Garbage Collection: A Deployment does not own Pods directly. The Deployment creates a ReplicaSet (injecting an ownerReference), and the ReplicaSet creates the Pods (injecting its own ownerReference). When you delete a Deployment, the Garbage Collector Controller reads these references to cascade the deletion down to the Pods safely.
  • Reconciliation & Label Selectors: The ReplicaSet owns pods strictly by Labels. If it calculates CurrentReplicas < DesiredReplicas, it creates new pods. If CurrentReplicas > DesiredReplicas, it deletes the surplus, preferring pods that are unscheduled, not yet Ready, or most recently created.
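The ownerReference chain described above (Deployment → ReplicaSet → Pods) is what makes cascading deletion safe. A toy model of how the Garbage Collector walks those references:

```python
# Toy model of cascading deletion via ownerReferences: deleting the root
# Deployment dooms everything whose owner chain leads back to it, while
# unrelated objects (pod/other) are untouched.

owners = {                        # object -> its ownerReference (None = root)
    "deploy/web": None,
    "rs/web-abc": "deploy/web",
    "pod/web-abc-1": "rs/web-abc",
    "pod/web-abc-2": "rs/web-abc",
    "pod/other": None,
}

def cascade_delete(root: str) -> set[str]:
    doomed = {root}
    changed = True
    while changed:                # sweep until no new dependents are found
        changed = False
        for obj, owner in owners.items():
            if owner in doomed and obj not in doomed:
                doomed.add(obj)
                changed = True
    return doomed

assert cascade_delete("deploy/web") == {
    "deploy/web", "rs/web-abc", "pod/web-abc-1", "pod/web-abc-2"}
```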

Endpoint Controller & EndpointSlices

The Endpoint Controller is the receptionist who has a list of all employees. It checks who belongs to the “Billing Department” (Label Selector), verifies if they are at their desk (Readiness Probe), and updates the internal phone directory (The Endpoints Object) so calls are routed correctly.

The Rule: If a Service has a selector, the Endpoint Controller creates an Endpoints object to bridge the gap between the Service’s stable IP and the Pods’ ephemeral IPs.

Feature   | Details
Trigger   | Pod creation/deletion, Label changes, Readiness state change.
Condition | Only "Ready" pods get into the Endpoints list (unless publishNotReadyAddresses is true).
Output    | Endpoints (and EndpointSlices).
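The rule in the table can be sketched in a few lines: match the Service selector, then keep only Ready pods unless publishNotReadyAddresses overrides the readiness check. The pod records below are illustrative.

```python
# Sketch of the Endpoint Controller's filtering: selector match first,
# then readiness (unless publishNotReadyAddresses is set on the Service).

def build_endpoints(pods: list[dict], selector: dict,
                    publish_not_ready: bool = False) -> list[str]:
    return [
        p["ip"] for p in pods
        if all(p["labels"].get(k) == v for k, v in selector.items())
        and (p["ready"] or publish_not_ready)
    ]

pods = [
    {"ip": "10.0.0.1", "labels": {"app": "billing"}, "ready": True},
    {"ip": "10.0.0.2", "labels": {"app": "billing"}, "ready": False},
    {"ip": "10.0.0.3", "labels": {"app": "web"}, "ready": True},
]
assert build_endpoints(pods, {"app": "billing"}) == ["10.0.0.1"]
assert build_endpoints(pods, {"app": "billing"}, publish_not_ready=True) == \
    ["10.0.0.1", "10.0.0.2"]
```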

Architect Tuning: The “Thundering Herd” Problem

The “Thundering Herd” Problem & EndpointSlices: Historically, the Endpoints object contained all IPs for a service in one massive list. If you had 5,000 pods and one changed status, a massive 1.5 MB object had to be sent to every Node’s kube-proxy, causing severe network congestion.

Kubernetes solved this by introducing the EndpointSlice Controller. It splits that massive list into smaller “slices” (chunks of 100 IPs). kube-proxy now defaults to watching EndpointSlices instead of Endpoints, drastically reducing API server load and network traffic.

  • Architect Tuning: You can adjust --concurrent-endpoint-syncs and --concurrent-service-endpoint-syncs to handle high pod churn rates in massive clusters.
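The slicing behavior itself is just chunking: one huge address list becomes slices of at most 100 endpoints, so a single pod change only touches one small object. A minimal sketch:

```python
# Sketch of EndpointSlice chunking: the default cap is 100 endpoints per
# slice, so updating one pod rewrites a small slice instead of a
# multi-megabyte monolithic Endpoints object.

def slice_endpoints(ips: list[str], max_per_slice: int = 100) -> list[list[str]]:
    return [ips[i:i + max_per_slice] for i in range(0, len(ips), max_per_slice)]

ips = [f"10.0.{i // 256}.{i % 256}" for i in range(250)]
slices = slice_endpoints(ips)
assert [len(s) for s in slices] == [100, 100, 50]
```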
