Cloud Controller Manager
Historically, Kubernetes included specific code for every cloud provider (AWS, GCE, Azure) directly inside the main Kubernetes source code. This was known as “In-Tree” code.
- The Problem: If AWS updated their load balancer API, you had to wait for a full Kubernetes release to get the fix.
- The Solution: The Cloud Controller Manager extracts this cloud-specific logic into a separate binary. It allows cloud providers to release updates independently of the Kubernetes release cycle. This is the “Out-of-Tree” architecture.
If you run a cluster with --cloud-provider=external, you are telling Kubernetes: “Do not use your built-in cloud logic. Wait for the CCM to handle it.”
The CCM runs in the Control Plane (usually as a Deployment or DaemonSet). It interacts with two sides:
- Kubernetes API Server: To watch for changes in Nodes and Services.
- Cloud Provider API: To provision actual infrastructure (VMs, LBs, Routes).
The CCM does not manage Pods or Deployments. It runs only three control loops, and mastering the CCM means understanding exactly what each one does:
1. Node Controller (The “Inventory” Manager)
This is responsible for initializing the link between a Kubernetes Node object and the actual Cloud VM.
- Initialization: When a new Node joins, the CCM initializes it with cloud-specific labels (e.g., failure-domain.beta.kubernetes.io/zone, since deprecated in favor of topology.kubernetes.io/zone) and addresses (Public/Private IPs).
- Taint Management: When a node registers with --cloud-provider=external set, the Kubelet adds the taint node.cloudprovider.kubernetes.io/uninitialized with the NoSchedule effect. The CCM detects this, configures the node, and then removes the taint, allowing Pods to schedule.
- Health Checks: If a Node stops sending heartbeats, the CCM queries the Cloud API to see if the VM still exists. If the VM has been deleted in the cloud, the CCM deletes the Node object in Kubernetes.
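The initialization and taint-removal steps can be sketched in a few lines of Go. This is a simplified illustration, not the CCM's actual code: the Node, Taint, and CloudInfo types and the InitializeNode function are hypothetical stand-ins for the real Kubernetes API types and controller logic.

```go
package main

import "fmt"

// Taint key and zone label as quoted in the text. Newer clusters use
// topology.kubernetes.io/zone for the zone label.
const (
	UninitializedTaint = "node.cloudprovider.kubernetes.io/uninitialized"
	ZoneLabel          = "failure-domain.beta.kubernetes.io/zone"
)

// Taint and Node are simplified, hypothetical stand-ins for the real API types.
type Taint struct {
	Key    string
	Effect string
}

type Node struct {
	Name      string
	Labels    map[string]string
	Addresses []string
	Taints    []Taint
}

// CloudInfo is a hypothetical bundle of what the cloud API reports for a VM.
type CloudInfo struct {
	Zone      string
	Addresses []string
}

// InitializeNode copies cloud metadata onto the node and removes the
// uninitialized taint. It reports whether the node was changed.
func InitializeNode(n *Node, info CloudInfo) bool {
	kept := make([]Taint, 0, len(n.Taints))
	found := false
	for _, t := range n.Taints {
		if t.Key == UninitializedTaint {
			found = true
			continue // drop this taint so Pods can schedule
		}
		kept = append(kept, t)
	}
	if !found {
		return false // already initialized; nothing to do
	}
	if n.Labels == nil {
		n.Labels = map[string]string{}
	}
	n.Labels[ZoneLabel] = info.Zone
	n.Addresses = info.Addresses
	n.Taints = kept
	return true
}

func main() {
	n := &Node{
		Name:   "node-a",
		Taints: []Taint{{Key: UninitializedTaint, Effect: "NoSchedule"}},
	}
	changed := InitializeNode(n, CloudInfo{Zone: "zone-1", Addresses: []string{"10.0.0.5"}})
	fmt.Println(changed, n.Labels[ZoneLabel], len(n.Taints))
}
```

Returning false when the taint is absent keeps the function idempotent, which matters for a control loop that may see the same Node many times.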
2. Route Controller (The “Networking” Manager)
- Purpose: Sets up networking so Pods on different nodes can talk to each other.
- Action: It watches for node creation and configures the cloud’s underlying route table (e.g., AWS VPC Route Table) to route traffic for that Node’s PodCIDR to the Node’s VM instance.
- Note: If you use a CNI plugin that uses overlays (like VXLAN in Flannel/Calico) or direct VPC routing (like AWS VPC CNI), this controller might be disabled or unused.
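The route loop is essentially a diff between the routes that should exist (one per Node PodCIDR) and the routes the cloud currently has. A minimal sketch of that diff, assuming a hypothetical Route type and ReconcileRoutes function rather than any real provider API:

```go
package main

import (
	"fmt"
	"sort"
)

// Route is a simplified, hypothetical cloud route: traffic for CIDR goes to Target.
type Route struct {
	CIDR   string
	Target string // name of the VM instance
}

// ReconcileRoutes compares the desired state (node name -> PodCIDR) with the
// routes that currently exist and returns what to create and what to delete.
func ReconcileRoutes(podCIDRs map[string]string, existing []Route) (create, remove []Route) {
	have := make(map[Route]bool, len(existing))
	for _, r := range existing {
		have[r] = true
	}
	want := make(map[Route]bool, len(podCIDRs))
	for node, cidr := range podCIDRs {
		r := Route{CIDR: cidr, Target: node}
		want[r] = true
		if !have[r] {
			create = append(create, r)
		}
	}
	for _, r := range existing {
		if !want[r] {
			remove = append(remove, r) // node is gone or its CIDR changed
		}
	}
	// Sort for stable output; Go map iteration order is not deterministic.
	sort.Slice(create, func(i, j int) bool { return create[i].CIDR < create[j].CIDR })
	return create, remove
}

func main() {
	nodes := map[string]string{"node-a": "10.244.0.0/24", "node-b": "10.244.1.0/24"}
	existing := []Route{
		{CIDR: "10.244.0.0/24", Target: "node-a"},
		{CIDR: "10.244.9.0/24", Target: "node-old"},
	}
	create, remove := ReconcileRoutes(nodes, existing)
	fmt.Println("create:", create)
	fmt.Println("remove:", remove)
}
```

The remove list is the part that needs delete permissions on the cloud route table; if those calls fail, stale routes for deleted nodes accumulate.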
3. Service Controller (The “Load Balancer” Manager)
- Purpose: Manages ingress and load balancing.
- Trigger: When you create a Service of type: LoadBalancer.
- Action: The CCM calls the Cloud API to provision a Load Balancer (ELB, ALB, SLB), configures the listeners/health checks, and updates the Service status with the external IP/DNS.
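The trigger/action flow can be illustrated with a small Go sketch. The Service type, the LoadBalancerAPI interface, and the fakeCloud implementation below are all hypothetical and heavily simplified; the method name EnsureLoadBalancer is only loosely modeled on the real provider interface, whose signature takes more arguments.

```go
package main

import (
	"errors"
	"fmt"
)

// Service is a simplified, hypothetical stand-in for the Kubernetes Service object.
type Service struct {
	Name    string
	Type    string // e.g. "ClusterIP", "NodePort", "LoadBalancer"
	Ingress string // external IP or DNS name; empty while pending
}

// LoadBalancerAPI is a hypothetical abstraction over the cloud provider's API.
type LoadBalancerAPI interface {
	EnsureLoadBalancer(serviceName string) (address string, err error)
}

// fakeCloud is an in-memory implementation used only for this example.
type fakeCloud struct{ fail bool }

func (c fakeCloud) EnsureLoadBalancer(name string) (string, error) {
	if c.fail {
		return "", errors.New("cloud API unreachable")
	}
	return name + ".lb.example.com", nil
}

// SyncService provisions a load balancer for LoadBalancer-type Services and
// records its address in the Service status. Other Service types are ignored.
func SyncService(svc *Service, cloud LoadBalancerAPI) error {
	if svc.Type != "LoadBalancer" {
		return nil
	}
	addr, err := cloud.EnsureLoadBalancer(svc.Name)
	if err != nil {
		// Status is left empty, so the Service keeps showing as pending.
		return fmt.Errorf("sync %s: %w", svc.Name, err)
	}
	svc.Ingress = addr
	return nil
}

func main() {
	svc := &Service{Name: "web", Type: "LoadBalancer"}
	if err := SyncService(svc, fakeCloud{}); err != nil {
		fmt.Println("error:", err)
	}
	fmt.Println(svc.Ingress)
}
```

Note how a failing cloud call leaves the status untouched: that is the shape of the "stuck in Pending" symptom when credentials or cloud connectivity are broken.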
Cloud providers plug into the CCM by implementing a Go interface. The CCM is essentially a wrapper around the cloud.go interface. Key interfaces include:
- Instances(): Lists and manages cloud VMs.
- LoadBalancer(): Creates, deletes, and updates Load Balancers.
- Routes(): Configures network routes.
- Zones(): Determines which region/zone a node is in.
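To make the shape of this interface concrete, here is a reduced sketch. It is not the real definition: the sub-interfaces are cut down to one hypothetical method each, and demoCloud and EnabledLoops are invented for illustration. The one detail borrowed from the real k8s.io/cloud-provider package is that each accessor returns a second boolean indicating whether the provider supports that feature.

```go
package main

import "fmt"

// Simplified, hypothetical sub-interfaces with one method each.
type Instances interface {
	InstanceExists(nodeName string) (bool, error)
}

type Routes interface {
	ListRoutes() ([]string, error)
}

// Provider is a reduced mirror of the provider interface: each accessor
// returns the sub-interface plus a bool saying whether it is supported.
type Provider interface {
	ProviderName() string
	Instances() (Instances, bool)
	Routes() (Routes, bool)
}

// demoCloud is a made-up provider that supports instances but not routes.
type demoCloud struct{ vms map[string]bool }

func (d demoCloud) ProviderName() string                     { return "demo" }
func (d demoCloud) Instances() (Instances, bool)             { return d, true }
func (d demoCloud) Routes() (Routes, bool)                   { return nil, false }
func (d demoCloud) InstanceExists(name string) (bool, error) { return d.vms[name], nil }

// EnabledLoops reports which control loops could run against a provider.
func EnabledLoops(p Provider) []string {
	var loops []string
	if _, ok := p.Instances(); ok {
		loops = append(loops, "node")
	}
	if _, ok := p.Routes(); ok {
		loops = append(loops, "route")
	}
	return loops
}

func main() {
	p := demoCloud{vms: map[string]bool{"node-a": true}}
	fmt.Println(p.ProviderName(), EnabledLoops(p))
}
```

The boolean return is what lets a provider opt out of a loop entirely, for example skipping route management when a CNI plugin handles Pod networking.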
Configuration: The CCM typically reads a cloud-config file (passed via --cloud-config), which contains authentication credentials (IAM roles, Tenant IDs) and global settings (VPC ID, Subnet ID).
High Availability (HA)
The CCM is a critical control plane component.
- Leader Election: You should run multiple replicas (e.g., 3) for redundancy.
- Mechanism: Like the Scheduler, the CCM uses a lease lock in the API Server. Only the active leader executes the control loops; the others wait on standby.
Troubleshooting & Common Pitfalls
Scenario 1: Nodes are stuck in NotReady or unschedulable.
- Cause: The Kubelet added the uninitialized taint, but the CCM is not running or is crashing, so the taint is never removed.
- Fix: Check the CCM logs. Ensure the CCM has the RBAC permissions it needs to patch Node objects.
Scenario 2: Service LoadBalancer stays in <Pending> forever.
- Cause: The Service Controller in the CCM cannot talk to the Cloud API.
- Fix: Check cloud credentials (IAM roles/Service Principals). Ensure the cluster ID tags match what the cloud expects.
Scenario 3: Infinite Route Growth.
- Cause: The Route Controller is failing to clean up old routes for deleted nodes.
- Fix: Verify that the CCM has delete permissions on the cloud route table.