
Kubernetes Open Standards

Kubernetes, an open-source system for container orchestration under the Cloud Native Computing Foundation (CNCF), serves as the de facto standard for managing containerized applications. Its success is built upon a set of key open standards and interfaces that ensure interoperability, vendor neutrality, and a rich ecosystem of third-party tools.

Imagine everyone in the world decided to use a different shape for electrical plugs. You would need a different adapter for every single house you visit! That would be a mess, right?

In the container world, before the Open Container Initiative (OCI) and related standards like CRI, CNI, and CSI, things were a bit like that: different tools couldn’t easily talk to each other.

Basically, OCI and these related standards are the “agreements” or “rules” that ensure all container tools (like Docker, Kubernetes, Podman) speak the same language. They make sure that if you package an application (put it in a container), it will run exactly the same way on your laptop, on a server, or in the cloud.

Acronym | Full Name | Primary Purpose | Popular Tools/Implementations
OCI | Open Container Initiative | Governance for container standards. | runC, crun
CRI | Container Runtime Interface | Plug-in interface for K8s runtimes. | containerd, CRI-O
CNI | Container Network Interface | Networking for Pods. | Calico, Flannel, Cilium
CSI | Container Storage Interface | Storage volumes for Pods. | AWS EBS, Ceph, Portworx
SMI | Service Mesh Interface | Standard for Service Meshes. | Istio, Linkerd

Open Container Initiative (OCI)

Buy a USB drive from any brand (SanDisk, Sony, HP) and it fits perfectly into any computer (Dell, Apple, Lenovo). Why? Because of a standard.

The Open Container Initiative (OCI) is the “USB standard” for containers. Before OCI, if you built a container with Docker, you had to use Docker to run it. It was like buying a lightbulb that only works with one specific brand of lamp. The OCI changed this. It created a set of open rules so that any tool can build a container, and any tool can run it.

It mainly looks after three things:

  1. Image Spec: How the container looks (the files).
  2. Runtime Spec: How the container runs (the process).
  3. Distribution Spec: How the container is shared (the download/upload).
  • “Build Once, Run Anywhere”: The core promise of OCI.
  • Decoupling: OCI separates the “building” of images from the “running” of containers.
  • Standardization: It prevents vendor lock-in (you are not stuck with Docker Inc.).
  • Content Addressability: Everything is identified by a “Digest” (a unique SHA256 hash), not just a file name.

Specification | Primary Job | What it Defines | Key File/Component
Runtime Spec | Execution | Lifecycle (create, start, kill, delete) | config.json
Image Spec | Packaging | File format & metadata | manifest.json, layers
Distribution Spec | Sharing | API protocols (push/pull) | HTTP API V2

The OCI is an open governance structure that creates specifications for container formats and runtimes.

1. The OCI Runtime Specification (runtime-spec)

This defines the behavior of the “backend” program that actually runs the container. It details how to unpack a “Filesystem Bundle” and run it. The standard ensures that whether you use Linux namespaces (like runc) or a micro-VM (like Kata), the command to “start” the container is identical.

When an image is “pulled” and ready to run, it is unpacked into an OCI Bundle.

  • config.json: This is the holy grail. It contains:
    • Root: Path to the root filesystem (rootfs).
    • Mounts: Definitions for /proc, /sys, and bind mounts.
    • Process: The args, env vars (PATH, app=v1), and user (UID/GID).
    • Hooks: Pre-start, Post-start, and Post-stop hooks. (Crucial for injecting secrets or setting up networking before the app starts).
  • Lifecycle: The runtime MUST support these states: creating -> created -> running -> stopped.
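
To make this concrete, here is a heavily trimmed config.json sketch. The field names come from the Runtime Spec; the command, user IDs, and hook path are invented placeholders for illustration (a real file, e.g. one generated by runc spec, contains many more fields):

  {
    "ociVersion": "1.0.2",
    "process": {
      "user": { "uid": 1000, "gid": 1000 },
      "args": ["/bin/sh"],
      "env": ["PATH=/usr/bin", "app=v1"],
      "cwd": "/"
    },
    "root": { "path": "rootfs", "readonly": false },
    "mounts": [
      { "destination": "/proc", "type": "proc", "source": "proc" }
    ],
    "hooks": {
      "prestart": [ { "path": "/usr/local/bin/setup-net" } ]
    }
  }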

2. The OCI Image Specification (image-spec)

This defines the on-disk format of the container. It explains how to serialize the filesystem into “layers” (tarballs) and how to write the configuration (JSON) so that any tool can read it.

An OCI Image is not one single file; it is a Directed Acyclic Graph (DAG) of content.

  • Manifest (application/vnd.oci.image.manifest.v1+json): This is the “packing list.” It lists the config blob and the layer blobs.
  • Config Blob: Contains the metadata (author, creation date, architecture). Changing this changes the Image ID.
  • Layers: These are usually gzip-compressed tarballs (.tar.gz). Each layer is identified by its SHA256 digest. If two images share the same base layer (like Ubuntu), they reference the exact same SHA256 hash, saving disk space.
  • Image Index: This allows one image tag (e.g., my-app:latest) to point to multiple manifests (one for amd64, one for arm64). This is how multi-arch images work!
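
Tying the pieces together, a minimal image manifest looks roughly like this (a sketch: the digests and sizes are invented placeholders, while the media types are the real ones defined by the Image Spec):

  {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    "config": {
      "mediaType": "application/vnd.oci.image.config.v1+json",
      "digest": "sha256:aaaa...",
      "size": 1470
    },
    "layers": [
      {
        "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
        "digest": "sha256:bbbb...",
        "size": 3370628
      }
    ]
  }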

3. The OCI Distribution Specification (distribution-spec)

This is the newest of the three. It standardizes the API for Container Registries. It ensures that when you type docker pull or podman pull, the tool speaks the exact same HTTP language to the server (Docker Hub, GitHub Container Registry, etc.).

This defines the HTTP interaction.

  • Discovery: GET /v2/ (Check if registry supports V2).
  • Pulling:
    1. GET /v2/<name>/manifests/<reference> (Get the manifest).
    2. GET /v2/<name>/blobs/<digest> (Download the layers in parallel).
  • Pushing:
    1. POST /v2/<name>/blobs/uploads/ (Start an upload).
    2. PUT (Upload the actual data chunks).
    3. PUT /v2/<name>/manifests/<reference> (Upload the manifest last to “seal” the image).
Additional Details
  • OCI Artifacts: This is a huge trend. Since OCI registries are so good at storing “content-addressable data,” people started storing non-container things in them.
    • Helm Charts: Now stored in OCI registries.
    • WASM Modules: Stored as OCI artifacts.
    • SBOMs (Software Bill of Materials): You can attach a security scan result directly to the image in the registry using the OCI “Referrers API”.
  • Supply Chain Security: Tools like Cosign (Sigstore) rely entirely on OCI standards. They sign the Digest (hash) of the image and store the signature as a separate OCI artifact in the same registry.
  • Rootless Containers: The OCI Runtime Spec was updated to heavily support “Rootless” mode, allowing containers to run without root privileges on the host (using User Namespaces).
Anatomy of an OCI Image (using skopeo)

Goal: See the actual JSON files defined by the Image Spec without downloading the heavy image.

  1. Install Skopeo (if not installed).
  2. Inspect the Raw Manifest:
    • skopeo inspect --raw docker://docker.io/library/alpine:latest | jq
  3. Observation: Look at the mediaType field and the layers array. Those layer digests are the SHA256 hashes the Image Spec describes. (For a multi-arch tag like alpine:latest, the registry may return an image index / manifest list first, pointing to one manifest per architecture.)

Container Runtime Interface (CRI)

The Universal Translator for Kubernetes

In the early days of Kubernetes, it only knew how to talk to one guy: Docker. It was hardcoded. If you wanted to use a different tool to run your containers, you couldn’t!

Imagine you have a universal remote control (Kubernetes) that only works with Sony TVs (Docker). That is very limiting, right? What if you buy a Samsung or LG TV?

CRI is like a Universal Adapter. It allows Kubernetes (the remote) to talk to any brand of TV (Container Runtime) like containerd, CRI-O, or the Mirantis Container Runtime. As long as the TV follows the rules of the adapter, Kubernetes can turn it on, change the channel, and turn it off without knowing exactly how the TV’s internal circuits work.

  • Decoupling: CRI separates the Kubernetes logic from the container running logic.
  • gRPC: It uses a high-performance communication protocol called gRPC to talk.
  • No More Dockershim: CRI is why Kubernetes v1.24+ could drop the “Dockershim” code: Docker is no longer a special case, because every runtime plugs in through the same interface.
  • Pod Sandbox: CRI introduces the concept of a “Sandbox” (an environment) that must be created before the actual container starts.

Feature | Pre-CRI (Old K8s) | With CRI (Modern K8s)
Integration | Hardcoded into Kubelet source code. | Decoupled via a plugin interface.
Flexibility | Locked to Docker Engine. | Can use containerd, CRI-O, gVisor, Kata.
Maintenance | K8s team had to fix Docker bugs. | Runtime teams fix their own bugs.
Communication | Direct function calls. | gRPC over Unix Socket.

  • The “Dockershim” Story: You might hear old tutorials saying “Kubernetes deprecated Docker.” Don’t panic! They just removed the hardcoded support (Dockershim). You can still build images with Docker, but Kubernetes will likely use containerd to run them.
  • Socket File: On your Kubernetes node, there is a special file (a Unix Socket). The Kubelet talks to this file.
    • For containerd: /run/containerd/containerd.sock
    • For CRI-O: /var/run/crio/crio.sock
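
The crictl debugging tool (covered further below) is pointed at one of these sockets through a small YAML file. A minimal sketch, assuming a containerd node (the paths mirror the ones listed above):

  # /etc/crictl.yaml: configuration for the crictl CLI
  runtime-endpoint: unix:///run/containerd/containerd.sock   # CRI RuntimeService socket
  image-endpoint: unix:///run/containerd/containerd.sock     # CRI ImageService socket
  timeout: 10
  debug: false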

DevSecOps Architect Level

As an Architect, you need to understand the internal services of CRI and how they affect security and debugging.

The Protocol (gRPC & Protobuf)

CRI is defined using Protocol Buffers. The API has two main services:

  1. ImageService: Responsible for pulling images, listing images, and removing images.
    • RPC calls: PullImage, ListImages, ImageStatus.
  2. RuntimeService: Responsible for the lifecycle of the Pod and Container.
    • RPC calls: RunPodSandbox, CreateContainer, StartContainer, StopPodSandbox.

The “Pod Sandbox” Concept

This is critical. In Docker, there is no real “Pod” object. But in CRI, there is.

  • When Kubelet starts a Pod, it first calls RunPodSandbox.
  • The runtime creates a “pause container” (or a VM sandbox). This holds the Network Namespace (IP address) and IPC Namespace.
  • Then, the actual application containers are created and joined to this Sandbox.
  • Why this matters: If your application crashes and restarts, it keeps the same IP address because the Sandbox (pause container) stays alive!

CNI Integration

Wait, where does networking fit in?

  • The Runtime (CRI impl) is responsible for calling CNI.
  • Kubelet tells CRI: “Make a sandbox.”
  • CRI (e.g., CRI-O) creates the pause container -> calls CNI to get an IP -> assigns IP to pause container -> returns “Ready” to Kubelet.
Additional Details
  • crictl (CLI Tool): Since you are likely not using the Docker daemon on your nodes anymore, the docker ps command won’t work! You must use crictl. It is a command-line tool specifically built to talk to the CRI socket. It is the “new docker” for debugging Kubernetes nodes.
  • Log location: CRI standardized where logs go. Usually, /var/log/pods/. The Kubelet reads these files to show you kubectl logs.
  • Streaming API: How does kubectl exec work? The Kubelet talks to the CRI, which opens a streaming connection to the container runtime, acting as a proxy for your terminal data.

Container Network Interface (CNI)

The Nervous System of Kubernetes

In Kubernetes, every time a Pod is created, it needs a network connection. CNI is the language Kubernetes uses to tell a plugin (like Calico or Flannel): “Hey, I just built a new Pod. Please connect it to the network and give it an IP address right now.”

Without CNI, your Pods would be isolated islands, unable to talk to each other or the outside world.

Think of CNI as the Standard Power Socket in your house.

  • Kubernetes: The house owner who wants to plug in an appliance.
  • CNI Spec: The rule that says “The socket must have 3 pins and provide 240V.”
  • CNI Plugin (Calico/Flannel): The actual wiring behind the wall. Some wiring is simple (Flannel), some is heavy-duty industrial (Cilium), but they all end in the same 3-pin socket so the appliance (Pod) doesn’t care.

Concept | Description | Popular Examples
CNI Plugin | The binary that configures the network interface. | Flannel, Calico, Weave, Cilium
IPAM | IP Address Management (assigning IPs). | host-local, dhcp
Overlay | Creating a “virtual network” on top of physical servers. | VXLAN, IP-in-IP
Underlay | Using the physical network directly (faster). | BGP, Direct Routing

When the Kubelet decides to start a Pod:

  1. It calls the CRI (Container Runtime) to create the container sandbox.
  2. The Runtime then looks at the CNI configuration files (in /etc/cni/net.d/).
  3. The Runtime executes the CNI Plugin binary (like /opt/cni/bin/calico).
  4. The Plugin creates a network interface (usually a veth pair), attaches one end to the container and the other to the host, and assigns an IP address.
  5. Once the Plugin says “Success!”, the Pod is marked as Running.

If this process fails (e.g., the node ran out of IP addresses), the Pod stays stuck in the ContainerCreating state.

  • Where are the files? If you SSH into a Kubernetes node, go to /etc/cni/net.d/. You will see a file like 10-calico.conflist. That file tells the container runtime which plugin to use.
  • Where are the programs? Go to /opt/cni/bin/. You will see small executable files like bridge, dhcp, flannel, loopback. These are the actual workers.
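
For orientation, here is what a simple conflist could look like. This sketch uses the reference bridge and host-local plugins rather than Calico, and the network name and subnet are made-up values:

  {
    "cniVersion": "0.4.0",
    "name": "examplenet",
    "plugins": [
      {
        "type": "bridge",
        "bridge": "cni0",
        "ipam": {
          "type": "host-local",
          "subnet": "10.244.0.0/24"
        }
      },
      {
        "type": "portmap",
        "capabilities": { "portMappings": true }
      }
    ]
  }
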
DevSecOps Architect Level

As an Architect, you are not just “installing” CNI; you are choosing the networking model that dictates performance and security.

CNI Modes: Overlay vs. Underlay

  • Overlay (Encapsulation):
    • Examples: Flannel (VXLAN), Calico (IPIP).
    • How it works: Packets from Pod A are wrapped (encapsulated) inside a packet from Node A to Node B.
    • Pros: Easy setup. Works on any cloud/network (even if you don’t control the router).
    • Cons: Performance penalty due to encapsulation overhead (CPU usage). MTU issues are common.
  • Underlay (Direct Routing):
    • Examples: Calico (BGP), AWS VPC CNI, Azure CNI.
    • How it works: Pod IP addresses are routable on the physical network. No wrapping.
    • Pros: Best performance (near bare-metal speed). Easier to debug with traditional tools.
    • Cons: Requires control over physical routers (BGP peering) or drains IP addresses from the cloud VPC (AWS ENI limits).

Security: Network Policies

  • Standard Kubernetes Network Policies (firewalls for Pods) are implemented by the CNI plugin.
  • Note: Flannel does NOT support Network Policies. If you use Flannel, your network is wide open.
  • Architect Choice: Use Calico or Cilium if you need security (Micro-segmentation).
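
If your CNI supports policies, micro-segmentation is plain Kubernetes YAML. A minimal sketch (the labels, namespace, and port are hypothetical):

  apiVersion: networking.k8s.io/v1
  kind: NetworkPolicy
  metadata:
    name: allow-frontend-to-backend
    namespace: prod            # hypothetical namespace
  spec:
    podSelector:
      matchLabels:
        app: backend           # the Pods being protected
    policyTypes:
      - Ingress
    ingress:
      - from:
          - podSelector:
              matchLabels:
                app: frontend  # only the frontend may connect
        ports:
          - protocol: TCP
            port: 8080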

The Rise of eBPF (Cilium)

  • Legacy plugins use iptables (Linux firewall rules) to route traffic. As you scale to thousands of services, iptables gets slow.
  • Cilium uses eBPF (extended Berkeley Packet Filter). It runs logic directly in the Linux kernel, bypassing iptables. It is incredibly fast and provides deep observability (Layer 7 visibility).
Best Practices
  • Don’t Mix Plugins: Unless you know exactly what you are doing (e.g., migration), sticking to one main CNI is safer.
  • Monitor IP Usage: If a node runs out of IPs (IPAM exhaustion), it cannot schedule new Pods. Monitor the IP pool.
  • Use CNI with Policy Support: Even if you don’t use Network Policies today, you will need them tomorrow for compliance. Avoid plugins that don’t support them.
Common Issues
  • “CrashLoopBackOff” due to CNI: If the CNI plugin fails to install or connect, the CoreDNS pods will crash. The cluster will look “Ready,” but DNS won’t work.
  • Slow Pod Startup: If IPAM is slow or the API server is overloaded, adding the network interface can time out.
  • Leaked Interfaces: Sometimes, when a Pod is force-deleted, the veth pair on the host isn’t cleaned up. Over time, you see thousands of unused network interfaces on the node.
Solutions
  • Check Logs: Look at /var/log/syslog or journalctl -u kubelet for CNI errors.
  • Restart CNI Pods: Most plugins run as a DaemonSet (e.g., calico-node). Deleting these pods to let them restart often fixes transient issues.

Container Storage Interface (CSI)

The Universal Storage Adapter

In the old days of Kubernetes, the code to talk to AWS EBS, Google Persistent Disk, or NFS was actually inside the main Kubernetes code. This was called “in-tree.”

CSI (Container Storage Interface) fixes this. It is like the “USB Driver” for storage. It allows storage vendors (like AWS, Azure, NetApp, Portworx) to write their own drivers. Kubernetes just talks to the CSI driver, and the driver handles the rest. Now, you can add new storage types without ever touching the core Kubernetes code.

Key Characteristics to Remember
  • “Out-of-Tree”: CSI drivers live outside the Kubernetes source code.
  • Dynamic Provisioning: CSI allows K8s to create a disk automatically when you ask for it (via a PVC).
  • Sidecars: CSI drivers usually run as a set of Pods: a “Controller” (talks to the cloud API) and a “Node” agent (mounts the disk on the server).
  • RPCs: It uses gRPC (Remote Procedure Calls) to communicate, just like CRI.

Feature | In-Tree (Old) | CSI (New Standard)
Location | Inside K8s source code. | Separate Pods/Containers.
Updates | Requires K8s upgrade. | Vendor can update anytime.
Flexibility | Limited to built-in clouds. | Supports any storage system.
Security | Hard to secure secrets. | Secrets passed via CSI calls.

When you create a PersistentVolumeClaim (PVC):

  1. Kubernetes sees the request and looks at the StorageClass.
  2. The StorageClass points to a specific CSI Driver (e.g., ebs.csi.aws.com).
  3. The CSI Provisioner (a sidecar container) calls the AWS API to create the disk.
  4. The CSI Attacher attaches the disk to the Worker Node.
  5. The CSI Node Driver (running on the node) mounts the disk into the Pod’s folder so the application can use it.
  • PVC vs. PV:
    • PVC (Claim): “I want 10GB of storage.” (The user’s ticket).
    • PV (Volume): “Here is the actual 10GB disk.” (The fulfilled ticket).
  • StorageClass: This is the menu. It tells K8s which CSI driver to use.
    • Example: standard might use AWS EBS gp2. fast might use AWS EBS io1.
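
A minimal StorageClass-plus-PVC sketch for the flow above (the class name, gp3 volume type, and size are assumptions; ebs.csi.aws.com is the driver mentioned earlier):

  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    name: fast
  provisioner: ebs.csi.aws.com        # the CSI driver that makes the AWS API calls
  parameters:
    type: gp3                         # assumed EBS volume type
  volumeBindingMode: WaitForFirstConsumer
  ---
  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: data
  spec:
    accessModes: ["ReadWriteOnce"]
    storageClassName: fast
    resources:
      requests:
        storage: 10Gi                 # “I want 10GB of storage”
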
DevSecOps Architect Level

The Sidecar Containers (The Helpers)

These are standard containers provided by the Kubernetes team that “help” the vendor’s driver talk to K8s.

  1. external-provisioner: Watches for new PVCs and tells the driver to “Create Volume.”
  2. external-attacher: Watches for VolumeAttachment objects and tells the driver to “Attach Volume” to a node.
  3. external-resizer: Watches for PVC edits (e.g., expanding 10GB to 20GB) and calls the driver’s expand function.
  4. external-snapshotter: Watches for VolumeSnapshot objects to take backups.
  5. node-driver-registrar: Runs on every node (DaemonSet). It registers the driver with the Kubelet so the node knows “I can handle this storage.”

The Driver Components (The Vendor’s Code)

The vendor (e.g., AWS) provides a binary that implements the CSI gRPC services:

  • Identity Service: “Who am I?” (Name, version).
  • Controller Service: “Create/Delete Volume,” “Attach/Detach Volume.” (Runs as a Deployment, usually single replica).
  • Node Service: “Mount/Unmount Volume,” “Format Drive.” (Runs as a DaemonSet on every node).

Topology Awareness

  • The Problem: You cannot attach an AWS EBS volume in us-east-1a to a Node in us-east-1b.
  • The CSI Solution: CSI drivers understand “Topology.” When a Pod is scheduled, the CSI driver waits to see which zone the Pod lands in, creating the volume in that same zone. This is called Volume Binding Mode: WaitForFirstConsumer.
  • Access Modes: CSI enforces these strictly.
    • RWO (ReadWriteOnce): Block storage (EBS, Azure Disk). Can attach to only one node.
    • RWX (ReadWriteMany): File storage (EFS, NFS). Can attach to many nodes.
Benefits
  • Snapshots: You can take a snapshot of a database volume and restore it to a new PVC using standard Kubernetes YAML (VolumeSnapshot).
  • Expansion: You can resize a volume live (Online Expansion) without stopping the Pod.
  • Cloning: Create a new PVC filled with data from an existing PVC immediately.
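
For example, a snapshot is requested with ordinary YAML like this (a sketch: the snapshot class name depends on which VolumeSnapshotClass your driver installs, and “data” is the PVC from the earlier example):

  apiVersion: snapshot.storage.k8s.io/v1
  kind: VolumeSnapshot
  metadata:
    name: db-backup
  spec:
    volumeSnapshotClassName: csi-snapclass   # assumed VolumeSnapshotClass name
    source:
      persistentVolumeClaimName: data        # the PVC to snapshot
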
Best Practices
  • Always use WaitForFirstConsumer: For cloud block storage, set volumeBindingMode: WaitForFirstConsumer in your StorageClass. This prevents creating a volume in the wrong availability zone.
  • Set Default StorageClass: Ensure one class is marked as default so users don’t have to specify it every time.
  • Monitor “Stuck” Volumes: Use tools to watch for VolumeAttachment objects that are stuck in “Attaching” state.
Common Issues
  • “Multi-Attach Error”: A Pod moves to Node B, but the volume is still stuck attached to Node A. K8s can’t attach it to B.
    • Solution: This usually fixes itself after a timeout (6-12 mins). If not, force delete the old Pod.
  • “Stuck in Terminating”: You delete a PVC, but it hangs.
    • Reason: The kubernetes.io/pvc-protection finalizer prevents deletion because a Pod is still using it. Delete the Pod first!
  • Driver Crashes: If the CSI Node Driver (DaemonSet) crashes on a node, that node cannot mount any new volumes.

Service Mesh Interface (SMI)

The “Standard API” for Service Meshes

There are many “Service Meshes” out there (Istio, Linkerd, Consul, Open Service Mesh). Each one has its own unique way of doing things. If you write code for Istio, it won’t work on Linkerd.

SMI was created to fix this. It is a standard set of “rules” (APIs) that allows you to define things like “send 10% of traffic to the new version” or “allow Service A to talk to Service B” in a way that works on any service mesh.

Think of SMI like SQL (Standard Query Language) but for microservices networking.

  • Database World: You write SELECT * FROM users. It works on MySQL, PostgreSQL, and Oracle. You don’t need to learn a new language for every database.
  • Service Mesh World: You write an SMI TrafficSplit. It works on Linkerd, Istio (with adapter), and Consul. You don’t need to rewrite your YAML files if you switch mesh providers.
Key Characteristics to Remember
  • Lowest Common Denominator: SMI defines the basic features common to all meshes (splitting traffic, checking metrics, allowing access). It does not cover advanced, vendor-specific features.
  • Kubernetes Native: SMI is implemented as standard Kubernetes Custom Resource Definitions (CRDs).
  • Tooling Friendly: Tools like Flagger (for Canary deployments) love SMI because they can write one automation script that works on every mesh.
  • Gateway API Evolution: Important Note: The industry is currently moving from SMI towards the Kubernetes Gateway API (GAMMA initiative) as the new standard.

Feature | SMI Resource Name | Purpose | Example Use Case
Traffic Split | TrafficSplit | Weighting traffic between services. | “Send 5% of traffic to v2 (canary).”
Access Control | TrafficTarget | Defining who can talk to whom. | “Only the Frontend can talk to the Backend.”
Metrics | TrafficMetrics | Standard format for HTTP stats. | “Show me the error rate of the payment service.”
Specs | HTTPRouteGroup | Defining specific routes. | “Apply rules only to /api/v1/login.”

  • It’s Just YAML: To use SMI, you simply install the “CRDs” (Custom Resource Definitions) into your cluster. Then you can apply YAML files with kind: TrafficSplit.
  • You still need a Mesh: SMI is just a piece of paper (a spec). It doesn’t do anything by itself. You must have an actual Service Mesh (like Linkerd or Open Service Mesh) installed to read and enact the SMI rules.
DevSecOps Architect Level

As an Architect, you need to know the current state of this technology.

The Core APIs

  1. Traffic Access Control (TrafficTarget):
    • Used for mTLS policies. It essentially defines “Service A is allowed to GET from Service B on path /data”.
    • Architect Note: This replaces the need for vendor-specific AuthorizationPolicies.
  2. Traffic Specs (HTTPRouteGroup & TCPRoute):
    • Used to define what traffic looks like. You define a “Group” of routes (e.g., “all admin routes”) and then apply policies to that group.
  3. Traffic Split (TrafficSplit):
    • The most popular API. Used for Blue/Green and Canary releases.
    • Mechanism: It sits in front of the Kubernetes Service. When a request hits the service IP, the mesh sidecar looks at the TrafficSplit weight and routes the packet accordingly.
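
A TrafficSplit for a 90/10 canary might look like this (a sketch: the service names are hypothetical, and the exact apiVersion depends on which SMI version your mesh supports):

  apiVersion: split.smi-spec.io/v1alpha2
  kind: TrafficSplit
  metadata:
    name: backend-split
  spec:
    service: backend           # the root Service that clients call
    backends:
      - service: backend-v1
        weight: 90             # 90% of traffic stays on the current version
      - service: backend-v2
        weight: 10             # 10% goes to the canary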

The “GAMMA” Initiative (Crucial Update)

  • Status Check: While SMI was the first attempt at standardization, the Kubernetes community is coalescing around the Gateway API for future service mesh configuration.
  • GAMMA (Gateway API for Mesh Management and Administration): This is the “successor” to SMI principles.
  • Architect Decision: If you are building new automation today, look at Gateway API. If you are using existing tools like Flagger, they rely heavily on SMI TrafficSplit, so SMI is still very relevant for now.

Ecosystem Adoption

  • Linkerd: Implements SMI natively. No adapters needed.
  • Istio: Requires an adapter. It does not support SMI out of the box.
  • Open Service Mesh (OSM): Built entirely on SMI (Managed by Microsoft/Azure).
  • Consul Connect: Supports SMI via controller.
  • Progressive Delivery: The “Killer App” for SMI is Progressive Delivery. This is the ability to automate a release: “Deploy v2 -> Send 1% traffic -> Check Metrics (Success Rate) -> If good, increase to 10% -> If bad, rollback.” Tools like Flagger automate this entire loop using SMI metrics and traffic splitting.
Benefits
  • Interoperability: Switch meshes without breaking your CD pipelines.
  • Simplicity: SMI APIs are much simpler than Istio’s complex VirtualService/DestinationRule setup.
  • Ecosystem: Unlocks access to tools like Flagger, Okteto, and Layer5 Meshery.
Best Practices
  • Use for Canary: If you want to do Canary deployments, use SMI TrafficSplit. It is the industry standard way to describe traffic weighting.
  • Combine with GitOps: Store your SMI YAMLs in Git. Let ArgoCD sync them. Let Flagger update the weights automatically.
  • Don’t Mix: Don’t try to use native mesh config (e.g., Istio VirtualService) and SMI TrafficSplit on the same service. They might conflict. Choose one method.
Common Issues
  • Lowest Common Denominator: SMI only supports features that everyone can do. It doesn’t support advanced features like “Circuit Breaking,” “Retries,” or “Mirroring” (Shadowing). For those, you still need vendor-specific config.
  • Adapter Lag: Sometimes the “SMI Adapter” for a mesh (like Istio) lags behind the actual mesh version, causing bugs.
  • Conflict: If you manually edit the TrafficSplit while an automated tool (Flagger) is also editing it, you will have a “fight” (race condition).

Cloud Provider Interface (CPI)

Cloud Controller Manager

Imagine you are building a house (Kubernetes). You need electricity (Cloud Resources like Load Balancers and Disk Drives). In the past, the blueprints for the house included specific wiring diagrams for every single power company in the world.

  • If you lived in an area with “AWS Power,” you used page 50.
  • If you lived in an area with “Azure Power,” you used page 100.
  • If “Google Power” changed their voltage, you had to tear down the whole house and rebuild it just to update the wiring!

This was the old way (“In-Tree”). It made the Kubernetes software huge and hard to update.

Cloud Provider Interface (CPI) is the new way. The house now just has a standard plug on the outside.

  • If you use AWS, you plug in the “AWS Adapter.”
  • If you use Azure, you plug in the “Azure Adapter.”
  • If AWS changes something, they just send you a new adapter. You don’t touch the house.
Key Characteristics to Remember
  • “Out-of-Tree”: This is the key phrase. Cloud code is moved out of the core Kubernetes binary.
  • cloud-controller-manager (CCM): The actual binary (daemon) that runs the cloud-specific loops.
  • Smaller Binaries: Because K8s doesn’t carry AWS/Azure/GCP code inside it anymore, the core K8s download is smaller.
  • Decoupled Releases: AWS can release a bug fix for their Load Balancers today without waiting for the next Kubernetes version to be released.

Feature | In-Tree (Old Way) | Out-of-Tree (CPI / New Way)
Code Location | Inside the k8s.io/kubernetes repo. | Separate repo (e.g., kubernetes/cloud-provider-aws).
Binary | kube-controller-manager (KCM) did everything. | cloud-controller-manager (CCM) does the cloud work.
Flag | --cloud-provider=aws | --cloud-provider=external
Updates | Tied to K8s releases (3x/year). | Anytime the vendor wants.

The Cloud Provider Interface (CPI) acts as the bridge between Kubernetes and the underlying cloud infrastructure. Historically, Kubernetes included code for every major cloud provider directly in its source code. This “monolithic” approach became unmanageable.

The industry has shifted to the Cloud Controller Manager (CCM) pattern. In this model, the core Kubernetes control plane (kube-apiserver, kube-scheduler, kube-controller-manager) focuses solely on container orchestration. It doesn’t know what a “Load Balancer” is in AWS terms. Instead, it talks to the CCM, a separate controller, typically deployed as a Deployment or DaemonSet in the control plane.

  • When you create a Service of type LoadBalancer, the KCM sees it but does nothing.
  • The CCM sees it, recognizes it needs an AWS load balancer, and makes the API call to AWS to create it.
  • The Flag: If you are setting up a cluster manually (like with kubeadm), you will see a flag --cloud-provider.
    • If set to external, K8s expects you to install a CCM (Cloud Controller Manager).
    • If you forget to install the CCM, your Nodes will be stuck in NotReady state because they are waiting for the cloud provider to confirm they exist!
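
The trigger for all of this is an ordinary Service manifest. A minimal sketch (the name, label, and ports are placeholders):

  apiVersion: v1
  kind: Service
  metadata:
    name: web
  spec:
    type: LoadBalancer     # the CCM’s service controller reacts to this type
    selector:
      app: web             # hypothetical Pod label
    ports:
      - port: 80
        targetPort: 8080
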
DevSecOps Architect Level

The Route Controller

  • Role: Configures routes in the underlying cloud networking (VPC/VNet) so that containers on different nodes can talk to each other.
  • Action: When a new Node joins, this controller allocates a CIDR (e.g., 10.244.0.0/24) and updates the Cloud Routing Table to send traffic for that CIDR to that Node instance.
  • Note: If you use CNI plugins with overlay networks (like VXLAN), you might disable this controller.

The Service Controller

  • Role: Manages Load Balancers.
  • Action: Watches for Services of type: LoadBalancer.
    • Create: Calls Cloud API (e.g., elb:CreateLoadBalancer).
    • Update: Adds/Removes Nodes from the Load Balancer Target Group as Pods move around.
    • Delete: Cleans up the cloud resource when the K8s Service is deleted.

The Node Lifecycle Controller

  • Role: Determines if a Node is actually dead or just disconnected.
  • Action: If a Node stops sending heartbeats, K8s doesn’t know if it crashed or if the network is down. This controller asks the Cloud API: “Is instance i-12345 still running?”
    • If Cloud says “Terminated,” the controller deletes the Node object from K8s immediately.
    • If Cloud says “Running,” K8s keeps the Node object but marks it Unreachable.
Additional Details
  • Initialization Taints: When a node first joins an “external” cloud provider cluster, it has a taint: node.cloudprovider.kubernetes.io/uninitialized. The Scheduler will not put pods on this node until the CCM starts up, validates the node exists in the cloud, and removes the taint. This prevents “phantom nodes” from accepting workloads.
  • Virtual Nodes: This architecture allows for things like Virtual Kubelet. You can have a “Node” that isn’t a VM at all, but rather an interface to a serverless platform (like AWS Fargate or Azure Container Instances), bridged via a custom Cloud Provider implementation.
Benefits
  • Security: If a vulnerability is found in the Azure Load Balancer logic, Microsoft can release a patched CCM image instantly. You update just that Deployment. You don’t have to upgrade your entire Kubernetes cluster.
  • Performance: The core K8s binaries are leaner and start faster.
  • Vendor Neutrality: Smaller clouds (DigitalOcean, Linode, Hetzner) are first-class citizens. They just write a CCM; they don’t need to beg Google/RedHat to merge their code into K8s core.
Best Practices
  • Use Managed Kubernetes: EKS, AKS, and GKE handle this for you invisibly. You usually don’t see the CCM pods.
  • Self-Managed: If you run kops or kubespray on AWS/Azure, ensure you are using the “external” cloud provider mode. The “in-tree” providers are deprecated and are being removed from current Kubernetes releases.
  • GitOps: Manage your CCM installation (usually a Helm chart) via GitOps if you are on bare metal or a custom cloud.
Common Issues
  • “Stuck” Nodes: If the CCM crashes or fails to authenticate with the Cloud API (e.g., wrong IAM permissions), new Nodes will join but stay NotReady forever with the uninitialized taint.
  • IAM Permissions: The CCM needs powerful permissions (Create LoadBalancer, Modify Routes, Describe Instances). If these are too tight, Services won’t get external IPs.
  • Migration Pain: Moving an existing cluster from “in-tree” to “external” is complex and risky. It usually involves a dedicated migration tool or a cluster rebuild.

Kubernetes Service Discovery (KSD)

The Internal GPS of Kubernetes

Kubernetes Service Discovery is that internal GPS. It allows your Frontend application to simply say, “Connect to the Backend,” without ever worrying about which specific IP address the Backend is using at that moment.

Think of KSD as the Contacts App on your phone.

  • Without KSD: You have to memorize your friend’s phone number (IP Address). If they change their number, you lose contact.
  • With KSD: You just tap “Mom” (Service Name). The phone automatically dials whatever number is currently linked to “Mom.” You don’t care about the digits; you just care about the name.
Key Characteristics to Remember
  • DNS is King: 99% of service discovery happens via standard DNS names (e.g., my-service.default.svc.cluster.local).
  • CoreDNS: This is the specific software running inside Kubernetes that answers these “Who is where?” questions.
  • Environment Variables: An older, simpler way where K8s injects IP addresses directly into the container’s environment (e.g., MY_SERVICE_HOST=10.0.0.5).
  • Stable IP: A “Service” gets a Virtual IP (ClusterIP) that never changes, even if the Pods behind it die and respawn.

Mechanism | Speed | Reliability | Best For
DNS (CoreDNS) | Fast | High | Standard communication between microservices.
Environment Vars | Instant | Low (requires restart) | Legacy apps or simple configuration injection.
Headless Service | Fast | High | Databases, StatefulSets, or custom load balancing.
Kubernetes API | Slow | High | Operators or advanced controllers querying the state.

Kubernetes Service Discovery (KSD) is the mechanism that solves the problem created by the “ephemeral” (temporary) nature of Pods. Since Pods are designed to die and be replaced, their IP addresses are unreliable.

KSD solves this by introducing the Service object.

  1. Abstraction: A Service creates a single, constant entry point (a Virtual IP) for a group of Pods.
  2. Selection: It uses Labels and Selectors (e.g., app: backend) to know which Pods belong to it.
  3. Naming: It assigns a DNS name to that Virtual IP.

When Pod A wants to talk to Pod B:

  1. Pod A asks CoreDNS: “Where is backend-service?”
  2. CoreDNS replies: “It is at 10.96.0.50.”
  3. Pod A sends traffic to 10.96.0.50.
  4. kube-proxy intercepts that traffic and forwards it to one of the actual healthy Pods (e.g., 10.244.1.5).
  • The “Magic” Name: Inside a namespace, you can just use the short name. If your service is named db, your app can just connect to http://db.
  • Across Namespaces: If you need to talk to a service in a different namespace, you need the full name: service-name.namespace.svc.cluster.local.
  • Don’t rely on IPs: Never hardcode a Pod’s IP address in your code. It will break. Always use the Service Name.
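
A minimal Service sketch that ties these pieces together (the names and ports are placeholders, chosen to match the examples above):

  apiVersion: v1
  kind: Service
  metadata:
    name: backend-service
    namespace: default
  spec:
    selector:
      app: backend         # matches the Pods via their labels
    ports:
      - port: 80
        targetPort: 8080
  # In-cluster DNS name: backend-service (same namespace),
  # or backend-service.default.svc.cluster.local from any namespace.
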
DevSecOps Architect Level

CoreDNS & The ndots Problem

  • How it works: CoreDNS runs as a Deployment. It watches the Kubernetes API for new Services and updates its internal DNS records.
  • The ndots:5 Issue: By default, K8s DNS search paths are deep (default.svc.cluster.local, svc.cluster.local, etc.). If your app tries to resolve google.com, it might first try google.com.default.svc.cluster.local, then google.com.svc.cluster.local… failing several lookups before it finally tries the plain name.
  • Architect Tip: To reduce latency, always use the Fully Qualified Domain Name (FQDN) in your connection strings (ending with a dot .) or tune the ndots config in your Pod spec.
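
If you prefer tuning ndots, it is a per-Pod setting. A minimal sketch (the Pod name and image are placeholders):

  apiVersion: v1
  kind: Pod
  metadata:
    name: web
  spec:
    containers:
      - name: app
        image: nginx:1.25      # placeholder image
    dnsConfig:
      options:
        - name: ndots
          value: "1"           # external names resolve on the first attempt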

kube-proxy Modes (The “Glue”)

Service Discovery gives you an IP, but how does traffic get there?

  • iptables mode (Standard): Randomly selects a backend pod using Linux iptables rules. It is reliable but can get slow if you have 10,000+ services.
  • IPVS mode (Performance): Uses the Linux IPVS kernel module (hash tables). Much faster for large clusters.
  • eBPF (Cilium): Replaces kube-proxy entirely. It does service discovery and load balancing directly in the kernel without iptables.

Headless Services

Sometimes you don’t want a load balancer; you want to talk to a specific Pod (e.g., Database Master vs. Replica).

  • Configuration: Set clusterIP: None in the YAML.
  • Result: DNS returns a list of the actual Pod IPs instead of one virtual Service IP. Your app (like Kafka or Cassandra) handles the load balancing itself.
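
A headless Service is just a normal Service with one extra line (the workload name and port are placeholders):

  apiVersion: v1
  kind: Service
  metadata:
    name: cassandra          # e.g., backing a StatefulSet
  spec:
    clusterIP: None          # headless: DNS returns the individual Pod IPs
    selector:
      app: cassandra
    ports:
      - port: 9042
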
Additional Details
  • ExternalName Services: You can map a local Kubernetes service name to an external DNS name (like my-db.aws.com). This is great for migrations: your app keeps talking to db-local, while DNS resolves that name to the external AWS RDS endpoint.
  • EndpointSlices: In massive clusters with thousands of Pods, the old Endpoints object became too big and slow. K8s now uses EndpointSlices to group endpoints into smaller chunks for better scalability.
Benefits
  • Decoupling: Frontend devs don’t need to talk to Backend devs when IPs change.
  • Load Balancing: Traffic is automatically spread across all healthy Pods.
  • High Availability: If a Pod dies, KSD stops sending traffic to it immediately.
Best Practices
  • Use Namespaces: Don’t dump everything in default. Use the service-name.namespace DNS convention to organize traffic.
  • Readiness Probes: KSD works with Readiness Probes. If your Pod isn’t ready, it is removed from Service Discovery automatically so no user hits an error.
  • Short TTL: Keep DNS Time-To-Live (TTL) low so that changes propagate quickly.
Common Issues
  • DNS Latency: If CoreDNS is overloaded (too many requests), your apps will feel slow.
    • Solution: Use NodeLocal DNSCache to run a tiny DNS cache on every node.
  • 5-Second Delay: A known Linux conntrack race condition can cause 5-second DNS timeouts.
  • Environment Variable Clutter: If you have 50 services, every Pod starts with 100s of environment variables injected. It’s messy. Stick to DNS.

Container Image Library (CIL) & OCI Distribution

(Note: In the industry, this is formally known as the OCI Distribution Specification and implemented by Container Registries.)

The Container Image Library (CIL), technically called a Container Registry, is the central server where you store your container images (your applications).

  • Docker Hub, Quay.io, Harbor: These are the “Libraries.”
  • The Standard (CIL/OCI): This is the rule that ensures every library organizes the books in the same way, so your computer knows how to find them.

Think of CIL like the Maven Central (for Java) or npm (for Node.js), but for Containers.

  • The Warehouse: The Registry (e.g., Docker Hub) is a warehouse.
  • The Box: The Container Image is a sealed box.
  • The Manifest: The packing slip on the box describing what’s inside.
  • The Protocol: The standard language (API) the truck drivers use to check in and check out boxes.
Key Characteristics to Remember
  • The “Registry” is the Server: The software that stores the images (e.g., Harbor, ECR).
  • The “Repository” is the Folder: A specific place inside the registry for one app (e.g., my-app).
  • The “Tag” is the Version: The label on the image (e.g., v1.0, latest).
  • Push & Pull: The two main commands. You push to upload, pull to download.

Component | Definition | Example
Registry | The service hosting the images. | docker.io, quay.io, 10.5.4.3:5000
Repository | A collection of related images. | library/ubuntu, my-company/backend
Tag | A mutable alias for a specific version. | latest, v1.2.0, stable
Digest | An immutable ID (SHA256 hash). | sha256:8b0a...

The concept described here as the Container Image Library (CIL) is technically standardized by the OCI Distribution Specification.

It solves the “Distribution” problem. You build an image on your laptop, but Kubernetes runs on a server in the cloud. How do you get the bits from A to B efficiently?

The CIL/Registry allows:

  1. Centralization: A single source of truth for your software artifacts.
  2. Versioning: Keeping history of every build (v1, v2, v3).
  3. Access Control: Deciding who can read (pull) or write (push) images.
  4. Efficiency: It uses “Layers.” If you update your app but keep the same OS, the registry only uploads the small app layer, not the whole OS again.

Public vs. Private:

  • Public: Docker Hub. Anyone can download. Great for open source (nginx, python).
  • Private: AWS ECR, Azure ACR, Harbor. Only people with a password can access. Used for your company’s proprietary code.
  • Naming Convention:
    • registry-url / project-name / image-name : tag
    • Example: quay.io / my-team / web-server : v1
DevSecOps Architect Level

The OCI Distribution API (The Protocol)

  • The standard defines a REST API that all tools (Docker, Podman, K8s) speak.
  • Layer Deduplication: If Team A pushes Ubuntu + App A and Team B pushes Ubuntu + App B, the registry is smart enough to store the Ubuntu layer only once physically (saving TBs of storage).
  • Manifest Lists (Multi-Arch): A single tag (e.g., python:3.9) can point to a list of manifests. If a Raspberry Pi pulls it, it gets the ARM64 version. If a Server pulls it, it gets the AMD64 version. The Registry handles this negotiation.

OCI Artifacts (Beyond Images)

  • Modern registries are not just for container images. You can store anything that follows the OCI format.
  • Helm Charts: helm push my-chart-1.0.0.tgz oci://my-registry/charts
  • Signatures: Cosign stores .sig files in the registry next to the image.
  • SBOMs: Software Bill of Materials can be attached to the image in the registry.

Replication & Caching

  • Pull-Through Cache: To save bandwidth and money (NAT Gateway costs), you run a local registry (like Harbor) that caches images from Docker Hub.
  • Geo-Replication: Enterprise registries (Harbor, Artifactory) can automatically sync images between US-East and EU-West regions so servers pull from the closest source.
Additional Details
  • The “Latest” Tag Trap: Never use the :latest tag in production. It is a moving target. If you deploy :latest today, and the registry updates tomorrow, your auto-scaling nodes might pull a different version than your existing nodes, causing a “Split Brain” crash. Always use specific version tags or SHA digests.
  • Garbage Collection: Registries fill up fast. You need a policy to delete old, unused images (e.g., “Delete development images older than 30 days”).
Key Components
  1. Blob Storage: The backend storage (S3, GCS, Local Disk) where the actual binary layers sit.
  2. Database: Stores metadata (tags, access logs, user permissions).
  3. Scanner (Optional): Tools like Trivy or Clair run inside the registry to check for CVEs.
Benefits
  • Interoperability: You can build with Docker, push to AWS ECR, and pull with Kubernetes (containerd).
  • Security: A central place to enforce scanning and signing policies.
Best Practices
  • Immutable Tags: Configure your registry to prevent overwriting tags. If v1.0 is pushed, nobody should be allowed to push a different code as v1.0.
  • Scan on Push: Configure the registry to reject any image with “Critical” vulnerabilities immediately.
  • Role-Based Access (RBAC): Developers can Push. Production Servers can only Pull.
Common Issues
  • Rate Limiting: Docker Hub limits anonymous pulls (e.g., 100 pulls/6 hours). This breaks CI/CD pipelines often.
    • Solution: Use a paid account or run a Pull-Through Cache.
  • Storage Cost: Registries grow indefinitely.
    • Solution: Apply aggressive Garbage Collection rules.
  • Slow Pulls: Large images (GBs) take time to download.
    • Solution: Use “Distroless” images (small) or lazy-loading technologies (like eStargz).
