
Kube-Controller Manager

Imagine a Smart Air Conditioner (AC). You set the remote to 24°C (this is your Desired State). The AC constantly checks the room temperature (the Current State). If the room gets hot (26°C), the AC turns on to cool it down. If it gets too cold, it stops. It works non-stop to ensure the room is exactly how you want it.

In Kubernetes, you declare, “I want 3 copies of my application running.” The Kube-Controller Manager is the brain that constantly checks: “Do we have 3 copies?” If one crashes and only 2 are left, it notices the gap and immediately orders a new one to be created. It enforces the rules.

“The Kube-Controller Manager is the brain that fixes the difference between what you want and what you have.”

Quick Reference
  • State Watcher: Continuously monitors the cluster state via the API Server.
  • Self-Healing: Automatically corrects failures (like restarting crashed pods).
  • Single Binary: Even though it does many jobs (Node control, Job control), it runs as one process to keep things simple and reduce overhead.

Feature      | Description                               | Real-World Analogy
Primary Role | Regulates the state of the cluster.       | A Cruise Control system in a car.
Input        | Desired State (YAML configuration).       | Setting speed to 80 km/h.
Action       | Compares Current State vs. Desired State. | Checking speedometer vs. setting.
Output       | Corrective action (Create/Delete Pods).   | Accelerating or braking.
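The compare-and-correct cycle above can be sketched as a tiny reconciliation function. This is a toy in-memory model for illustration only; the real controller manager watches the API Server rather than a Python list, and the pod names here are made up.

```python
# Minimal sketch of a Kubernetes-style reconciliation step (illustrative only;
# the real controller watches the API Server, not an in-memory list).

def reconcile(desired_replicas: int, current_pods: list[str]) -> list[str]:
    """Compare Desired State vs. Current State and return corrective actions."""
    actions = []
    diff = desired_replicas - len(current_pods)
    if diff > 0:
        # Too few pods running: order new ones to be created.
        actions = [f"create pod-{i}" for i in range(diff)]
    elif diff < 0:
        # Too many pods running: terminate the surplus.
        actions = [f"delete {name}" for name in current_pods[diff:]]
    return actions

# One pod crashed: 3 desired, only 2 running -> one create action is issued.
print(reconcile(3, ["pod-a", "pod-b"]))
```

Running this loop forever against live cluster state is, in essence, what every controller inside the Kube-Controller Manager does.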

DevSecOps Architect Level

Now that we understand the basics, let’s look at how a DevSecOps Architect leverages the Controller Manager for high availability and cluster stability.

The Kubernetes Controller Manager is a daemon that embeds the core control loops shipped with Kubernetes. In robotics and automation, a “control loop” is a non-terminating loop that regulates the state of a system.

For a DevSecOps Architect, the Kube-Controller Manager is critical for High Availability (HA) and Cluster Stability. Controllers do not constantly poll the API Server (which would overload etcd). Instead, they use a SharedInformer and DeltaFIFO queue pattern. This “edge-driven, level-triggered” approach means they react instantly to event streams but look at the entire cluster state to decide the next action.
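The "edge-driven, level-triggered" pattern can be sketched as follows: events (edges) only enqueue a key, and the worker then decides by looking at the full cached state (the level), so duplicate or missed events are harmless. The `cache`, `on_event`, and `sync` names below are illustrative stand-ins, not real client-go APIs.

```python
from collections import deque

# Toy model of the SharedInformer + workqueue pattern: events enqueue a key
# (edge-driven), the handler reconciles from full state (level-triggered).

cache = {"web": {"desired": 3, "running": 2}}  # stand-in for the informer cache
queue = deque()                                # stand-in for the DeltaFIFO/workqueue

def on_event(key: str) -> None:
    if key not in queue:          # dedupe, as a real workqueue does
        queue.append(key)

def sync(key: str) -> str:
    state = cache[key]            # decide from the whole state, not the event
    if state["running"] < state["desired"]:
        return f"scale {key} up to {state['desired']}"
    return f"{key} in sync"

on_event("web"); on_event("web")  # duplicate edges collapse into one work item
assert len(queue) == 1
print(sync(queue.popleft()))
```

Because the handler re-reads the full state, replaying the same event twice produces the same decision, which is why this pattern avoids hammering etcd with polling yet stays correct.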

  • Leader Election: In a multi-master production cluster, multiple instances of the Controller Manager run for redundancy. Only one can be active at a time to prevent conflicts. They use a Leader Election mechanism (Leases API) to decide the active “Enforcer.”
  • Performance Tuning: You can tune flags like --concurrent-deployment-syncs. Increasing this allows faster reconciling in large clusters but puts more load on the API Server.
  • Security Context: The Controller Manager requires significant privileges. It must run with its own Service Account and restricted RBAC permissions.
  • Metrics & Monitoring: It exposes a /metrics endpoint for Prometheus. Alert on metrics like workqueue_depth (is the controller falling behind?) and rest_client_requests_total (is it spamming the API server?).
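Leader election in the list above can be modeled as a lease record that replicas try to acquire or renew. This is a hedged in-memory toy, not the real coordination.k8s.io Lease API; the 15-second duration mirrors the `--leader-elect-lease-duration` default.

```python
# Toy model of Lease-based leader election between controller-manager replicas.
# Only the current lease holder runs the control loops; others stand by.

LEASE_DURATION = 15  # seconds, mirrors --leader-elect-lease-duration default

lease = {"holder": None, "renew_time": 0.0}

def try_acquire(candidate: str, now: float) -> bool:
    expired = now - lease["renew_time"] > LEASE_DURATION
    if lease["holder"] in (None, candidate) or expired:
        lease["holder"], lease["renew_time"] = candidate, now
        return True
    return False  # another instance holds a fresh lease: stay on standby

assert try_acquire("kcm-a", now=100.0) is True   # first instance becomes leader
assert try_acquire("kcm-b", now=105.0) is False  # standby while lease is fresh
assert try_acquire("kcm-b", now=120.0) is True   # takes over after lease expiry
```

The key property: a standby replica can only become the active "Enforcer" after the old leader stops renewing, which prevents two instances from issuing conflicting corrective actions.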

Kubernetes Node Controller: The Heartbeat Monitor of Your Cluster

The Node Controller is the Heartbeat Monitor of your cluster. It runs inside the kube-controller-manager on the Control Plane and manages the lifecycle of nodes (Registration, Monitoring, Deletion).

It relies on Heartbeats sent by the Kubelet every 10 seconds. If a node stops sending heartbeats, the controller waits for a grace period before evicting pods.

Feature                   | Default Value | Description
Monitor Period            | 5 seconds     | How often the controller checks the status of each node.
Node Monitor Grace Period | 40 seconds    | How long the controller waits before marking a node as Unhealthy / Unknown.
Pod Eviction Timeout      | 5 minutes     | How long the controller waits after a node is "Unknown" before deleting its Pods.
Eviction Rate             | 0.1/sec       | Speed at which pods are deleted from a down node (prevents mass-deletion panic).
Zone Awareness            | Enabled       | Handles failures differently if a whole Availability Zone goes down vs. just one node.

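The grace-period and eviction-timeout defaults from the table combine into a simple timeline for a silent node. The sketch below is a simplified model (the real controller tracks Lease renewals and per-pod tolerations, not a single timestamp).

```python
# Simplified timeline of a node that stops sending heartbeats, using the
# defaults from the table above (real logic uses Leases and taints).

NODE_MONITOR_GRACE_PERIOD = 40   # seconds
POD_EVICTION_TIMEOUT = 300       # seconds (5 minutes)

def node_status(last_heartbeat: float, now: float) -> str:
    silence = now - last_heartbeat
    if silence <= NODE_MONITOR_GRACE_PERIOD:
        return "Ready"
    if silence <= NODE_MONITOR_GRACE_PERIOD + POD_EVICTION_TIMEOUT:
        return "Unknown"             # marked unhealthy, pods not yet evicted
    return "Unknown, pods evicting"  # grace period + eviction timeout exceeded

assert node_status(last_heartbeat=0, now=10) == "Ready"
assert node_status(last_heartbeat=0, now=60) == "Unknown"
assert node_status(last_heartbeat=0, now=400) == "Unknown, pods evicting"
```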
Architect Tuning: Failure Handling & State Consistency

For an architect, tuning the Node Controller dictates how gracefully a cluster handles catastrophic failures.

The “Unknown” State Logic:

  1. The controller marks the Node's Ready status condition as Unknown.
  2. It adds a taint node.kubernetes.io/unreachable with a NoExecute effect.
  3. Pods tolerate that taint only for a limited time (tolerationSeconds, 300 seconds by default, injected by the DefaultTolerationSeconds admission plugin). If the node doesn't return before the countdown expires, the pods are evicted and rescheduled elsewhere.
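Step 3 can be sketched as a per-pod decision: evict once the pod's toleration for the unreachable taint runs out. A toy model, assuming the standard 300-second default toleration:

```python
# Toy model of taint-based eviction: a pod with a toleration for the
# unreachable taint survives until tolerationSeconds elapses; a pod
# without one is evicted immediately once the NoExecute taint lands.

TAINT = "node.kubernetes.io/unreachable"

def should_evict(pod_tolerations: dict, taint_age_seconds: float) -> bool:
    seconds = pod_tolerations.get(TAINT)
    if seconds is None:
        return True                        # no toleration: evicted at once
    return taint_age_seconds > seconds     # evicted when the toleration expires

assert should_evict({}, taint_age_seconds=1) is True
assert should_evict({TAINT: 300}, taint_age_seconds=120) is False
assert should_evict({TAINT: 300}, taint_age_seconds=301) is True
```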

Cloud Provider Architecture (KCM vs CCM): In modern Kubernetes (v1.29+), the kube-controller-manager (KCM) no longer talks directly to cloud APIs like AWS, Azure, or GCP. All cloud-specific logic has been decoupled into the Cloud Controller Manager (CCM).

  • If a node goes down, the CCM asks the Cloud API: “Does this VM instance still exist?”
  • If the Cloud API says “No,” the CCM immediately deletes the Node object from Kubernetes, skipping the standard 5-minute wait.
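The CCM fast path above amounts to one short-circuit: if the cloud says the VM is gone, delete the Node object now instead of waiting out the eviction countdown. A hedged sketch, where `cloud_instance_exists` stands in for a real cloud API lookup:

```python
# Sketch of the CCM decision for an unreachable node: a confirmed-dead VM
# skips the standard ~5-minute eviction wait entirely.

def handle_unreachable_node(node: str, cloud_instance_exists: bool) -> str:
    if not cloud_instance_exists:
        return f"delete Node {node} immediately"  # VM is gone: no point waiting
    return f"keep Node {node}, start standard eviction countdown"

assert handle_unreachable_node("node-1", cloud_instance_exists=False) == \
    "delete Node node-1 immediately"
```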

Common Issues and Solutions

Problem 1: Node Flapping (Ready <-> NotReady)

  • Cause: Kubelet is under high load and cannot send heartbeats in time.
  • Solution: Increase --node-monitor-grace-period in the controller manager or give more CPU to the Kubelet.

Problem 2: Pods Stuck in “Terminating” on Dead Node

  • Cause: The Node is down, so the Kubelet cannot confirm the pod is deleted. The Controller waits.
  • Solution: Force delete the pod: kubectl delete pod <pod-name> --grace-period=0 --force.

Problem 3: “Thundering Herd” on Recovery

  • Cause: Massive eviction causes Scheduler to overload.
  • Solution: Configure --node-eviction-rate carefully (default is 0.1 per second).
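The effect of `--node-eviction-rate=0.1` is easy to see with a little arithmetic: at 0.1 evictions per second, nodes are drained at most one every 10 seconds. A quick sketch of that pacing:

```python
# Pacing sketch: at --node-eviction-rate=0.1/sec, each additional down node
# waits another 10 seconds before its pods start being deleted, so a zone
# outage drains gradually instead of flooding the scheduler at once.

def eviction_schedule(nodes: list[str], rate_per_sec: float) -> dict[str, float]:
    """Return the earliest time (seconds from now) each node may be drained."""
    interval = 1.0 / rate_per_sec
    return {node: i * interval for i, node in enumerate(nodes)}

schedule = eviction_schedule(["n1", "n2", "n3"], rate_per_sec=0.1)
assert schedule == {"n1": 0.0, "n2": 10.0, "n3": 20.0}
```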


Deployment & ReplicaSet Controllers

While ReplicationControllers were the legacy standard (apiVersion: v1), modern Kubernetes relies entirely on Deployments and ReplicaSets.

  • The ReplicaSet Controller: Ensures a specific number of Pod replicas are running at any given time. It uses Set-Based selectors (e.g., env in (prod, stage)).
  • The Deployment Controller: Orchestrates the ReplicaSets. It provides declarative updates, enabling zero-downtime rolling updates and easy version rollbacks.

The Goal: Guarantee the Desired State (e.g., 3 Pods running v2 of an app) always matches the Actual State.
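The Set-Based selector mentioned above (e.g., env in (prod, stage)) can be sketched as simple set membership: the ReplicaSet controller counts only pods whose labels satisfy every expression.

```python
# Sketch of set-based label selection: a selector maps each key to the set
# of allowed values ("in" expressions), and a pod must satisfy all of them.

def matches(labels: dict, selector: dict) -> bool:
    return all(labels.get(key) in allowed for key, allowed in selector.items())

selector = {"env": {"prod", "stage"}}
pods = [{"env": "prod"}, {"env": "dev"}, {"env": "stage"}]
owned = [p for p in pods if matches(p, selector)]
assert owned == [{"env": "prod"}, {"env": "stage"}]
```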

Architect Tuning: Garbage Collection & Reconciliation

  • OwnerReferences and Garbage Collection: A Deployment does not own Pods directly. The Deployment creates a ReplicaSet (injecting an ownerReference), and the ReplicaSet creates the Pods (injecting its own ownerReference). When you delete a Deployment, the Garbage Collector Controller reads these references to cascade the deletion down to the Pods safely.
  • Reconciliation & Label Selectors: The ReplicaSet owns pods strictly by Labels. If it calculates CurrentReplicas < DesiredReplicas, it creates new pods. If CurrentReplicas > DesiredReplicas, it deletes the surplus, preferring pods that are unscheduled, not yet Ready, or most recently created.
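The ownerReference chain described above (Deployment → ReplicaSet → Pods) is what makes cascading deletion safe. A toy model of how the Garbage Collector walks those references:

```python
# Toy model of cascading deletion via ownerReferences: deleting the root
# Deployment dooms everything whose owner chain leads back to it, while
# unrelated objects (pod/other) are untouched.

owners = {                        # object -> its ownerReference (None = root)
    "deploy/web": None,
    "rs/web-abc": "deploy/web",
    "pod/web-abc-1": "rs/web-abc",
    "pod/web-abc-2": "rs/web-abc",
    "pod/other": None,
}

def cascade_delete(root: str) -> set[str]:
    doomed = {root}
    changed = True
    while changed:                # sweep until no new dependents are found
        changed = False
        for obj, owner in owners.items():
            if owner in doomed and obj not in doomed:
                doomed.add(obj)
                changed = True
    return doomed

assert cascade_delete("deploy/web") == {
    "deploy/web", "rs/web-abc", "pod/web-abc-1", "pod/web-abc-2"}
```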

Endpoint Controller & EndpointSlices

The Endpoint Controller is the receptionist who has a list of all employees. It checks who belongs to the “Billing Department” (Label Selector), verifies if they are at their desk (Readiness Probe), and updates the internal phone directory (The Endpoints Object) so calls are routed correctly.

The Rule: If a Service has a selector, the Endpoint Controller creates an Endpoints object to bridge the gap between the Service’s stable IP and the Pods’ ephemeral IPs.

Feature   | Details
Trigger   | Pod creation/deletion, Label changes, Readiness state change.
Condition | Only "Ready" pods get into the Endpoints list (unless publishNotReadyAddresses is true).
Output    | Endpoints (and EndpointSlices).
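The rule in the table can be sketched in a few lines: match the Service selector, then keep only Ready pods unless publishNotReadyAddresses overrides the readiness check. The pod records below are illustrative.

```python
# Sketch of the Endpoint Controller's filtering: selector match first,
# then readiness (unless publishNotReadyAddresses is set on the Service).

def build_endpoints(pods: list[dict], selector: dict,
                    publish_not_ready: bool = False) -> list[str]:
    return [
        p["ip"] for p in pods
        if all(p["labels"].get(k) == v for k, v in selector.items())
        and (p["ready"] or publish_not_ready)
    ]

pods = [
    {"ip": "10.0.0.1", "labels": {"app": "billing"}, "ready": True},
    {"ip": "10.0.0.2", "labels": {"app": "billing"}, "ready": False},
    {"ip": "10.0.0.3", "labels": {"app": "web"}, "ready": True},
]
assert build_endpoints(pods, {"app": "billing"}) == ["10.0.0.1"]
assert build_endpoints(pods, {"app": "billing"}, publish_not_ready=True) == \
    ["10.0.0.1", "10.0.0.2"]
```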

Architect Tuning: The “Thundering Herd” Problem

The “Thundering Herd” Problem & EndpointSlices: Historically, the Endpoints object contained all IPs for a service in one massive list. If you had 5,000 pods and one changed status, a massive 1.5 MB object had to be sent to every Node’s kube-proxy, causing severe network congestion.

Kubernetes solved this by introducing the EndpointSlice Controller. It splits that massive list into smaller “slices” (chunks of 100 IPs). kube-proxy now defaults to watching EndpointSlices instead of Endpoints, drastically reducing API server load and network traffic.

  • Architect Tuning: You can adjust --concurrent-endpoint-syncs and --concurrent-service-endpoint-syncs to handle high pod churn rates in massive clusters.
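The slicing behavior itself is just chunking: one huge address list becomes slices of at most 100 endpoints, so a single pod change only touches one small object. A minimal sketch:

```python
# Sketch of EndpointSlice chunking: the default cap is 100 endpoints per
# slice, so updating one pod rewrites a small slice instead of a
# multi-megabyte monolithic Endpoints object.

def slice_endpoints(ips: list[str], max_per_slice: int = 100) -> list[list[str]]:
    return [ips[i:i + max_per_slice] for i in range(0, len(ips), max_per_slice)]

ips = [f"10.0.{i // 256}.{i % 256}" for i in range(250)]
slices = slice_endpoints(ips)
assert [len(s) for s in slices] == [100, 100, 50]
```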
