
Kernel Namespaces & Cgroups

The Foundation of Containerization: Under the Hood of Linux

Welcome to the very foundation of containerization! Before jumping into Docker or Kubernetes, it is crucial to understand what is happening “under the hood” of the Linux operating system. Containers can look like magic, but there is no single “container” object in the kernel: a container is simply an isolated execution environment created by combining standard Linux kernel features in a smart way.

Think of it like an apartment building:

  • Kernel Namespaces: Think of these as the “Walls” of a room in an apartment. They provide isolation. Even though you are in the same building (Server), you cannot see what is happening in the neighbor’s room (Container) because of the walls.
  • Cgroups (Control Groups): Think of these as the “Electricity Meters”. They provide limits. They ensure one room doesn’t use up all the electricity (CPU/RAM) of the entire building.
  • Linux Bridge: Think of this as a “Virtual Network Switch” connecting all the rooms to the outside world.

The Golden Rule of Containers: Namespaces decide what a process can see. Cgroups decide how much a process can use.

Quick Reference

| Feature    | Role                               | Linux Command (Try it!) | Complexity |
|------------|------------------------------------|-------------------------|------------|
| Namespaces | Isolation (Visibility)             | unshare, lsns           | Medium     |
| Cgroups    | Resource Limiting (Usage)          | systemd-cgtop           | High       |
| Bridge     | Layer 2 Switching                  | brctl, ip link          | Low        |
| Iptables   | Firewall & Routing rules           | iptables -L             | High       |
| IPVS       | High-performance Load Balancing    | ipvsadm                 | Very High  |
| Systemd    | Host Init System (Service Manager) | systemctl               | Medium     |

Linux Namespaces: The Illusion of Isolation

Namespaces provide isolation. They restrict what a process can see. By wrapping a process in a namespace, the Linux kernel makes the process believe it has its own isolated instance of the global system resources.

When you spin up a pod or a container, the runtime uses the following types of namespaces to build the isolation boundary:

Types of Namespaces

  • PID (Process ID): Isolates the process ID number space. A process in a new PID namespace can be PID 1 (the init process) inside its container, while having a completely different PID on the host machine.
  • NET (Network): Isolates the network stack. This gives the container its own network interfaces (like eth0), IP addresses, routing tables, and firewall rules. This is why every Kubernetes pod can have its own IP.
  • MNT (Mount): Isolates mount points. The process sees a distinct filesystem hierarchy. When a container mounts its root filesystem, it doesn’t affect the host’s filesystem.
  • UTS (UNIX Timesharing System): Isolates the hostname and NIS domain name. This allows each container to have its own custom hostname.
  • IPC (Inter-Process Communication): Isolates System V IPC objects and POSIX message queues. It prevents processes in one container from communicating directly with processes in another via shared memory.
  • USER: Isolates user and group IDs. A process can run as the root user (UID 0) inside the container but map to a non-privileged user on the host. This is a critical feature for DevSecOps and minimizing blast radius.
  • CGROUP: Isolates the cgroup root directory, preventing a container from seeing the host’s cgroup configuration.
  • TIME: (Introduced in Linux 5.6) Lets containers see different offsets for the boot-time and monotonic clocks than the host (wall-clock time is not virtualized).
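All of these memberships are visible from userspace: every process has one symlink per namespace type under /proc/&lt;pid&gt;/ns, and two processes share a namespace exactly when their symlinks point at the same inode. A minimal sketch (no root needed for your own processes):

```shell
# List this shell's namespaces: pid, net, mnt, uts, ipc, user,
# cgroup (and time on kernels 5.6+).
ls -l /proc/$$/ns

# The symlink target encodes the namespace type and inode number,
# e.g. "pid:[4026531836]". Identical targets = shared namespace.
readlink /proc/$$/ns/pid
readlink /proc/$$/ns/net
```

The `lsns` tool (from util-linux) summarizes the same information for all processes visible to you, e.g. `lsns -t net`.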

Cgroups (Control Groups): Resource Management

While namespaces hide processes from each other, they do not stop one process from consuming all the RAM or CPU. Control Groups (cgroups) add that missing piece: resource limiting, prioritization, and usage accounting for groups of processes.

When you define requests and limits in a Kubernetes Pod manifest, the Kubelet translates those directives into Cgroup rules on the worker node.
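As a sketch of that translation (the values are hypothetical; cgroup v2 file names are shown in comments, and the memory.low mapping only applies when the MemoryQoS feature gate is enabled):

```yaml
resources:
  requests:
    cpu: "250m"      # -> cpu.weight (relative share under contention)
    memory: "256Mi"  # -> informs scheduling; memory.low with MemoryQoS
  limits:
    cpu: "500m"      # -> cpu.max "50000 100000" (quota / period, in µs)
    memory: "512Mi"  # -> memory.max 536870912 (bytes); exceeding it => OOMKilled
```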

  • Resource Limiting: You can set a hard limit on how much memory (e.g., 512MB) or how many CPU shares a container can use.
  • Prioritization: You can guarantee that critical containers get more CPU time than background tasks.
  • Accounting: Cgroups track exactly how much resource usage a group of processes has consumed, which is essential for billing and monitoring.

Core Cgroup Subsystems (Controllers)

  • cpu: Guarantees a minimum number of “CPU shares” or limits the maximum CPU bandwidth a group of processes can consume.
  • memory: Sets limits on RAM usage. If a container exceeds its memory limit, the Linux Out-Of-Memory (OOM) killer will terminate it (often seen as OOMKilled in K8s).
  • blkio (Block I/O): Sets limits on reads and writes to block devices (like disk drives).
  • pids: Limits the maximum number of processes that can be created within the cgroup, preventing fork bombs from exhausting host resources.
  • devices: Controls which devices the processes can read, write, or create.
  • cpuset: Binds a cgroup to specific CPU cores and NUMA nodes (highly useful for performance-critical workloads).
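You can inspect which controllers are available on your own host without root. A sketch that handles both cgroup layouts (note that in cgroup v2 the blkio controller became io):

```shell
if [ -f /sys/fs/cgroup/cgroup.controllers ]; then
    # cgroup v2: a single unified tree; enabled controllers in one file
    echo "v2 controllers:"
    cat /sys/fs/cgroup/cgroup.controllers   # e.g. cpuset cpu io memory pids
else
    # cgroup v1: one mount point per controller
    echo "v1 controller mounts:"
    ls /sys/fs/cgroup
fi

# Which cgroup(s) does the current shell belong to?
cat /proc/self/cgroup
```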

Cgroups v1 vs. Cgroups v2

  • v1: Multiple hierarchies. Each controller (CPU, memory) is mounted as its own separate tree, so a single process can occupy different positions in different trees, which makes management complex.
  • v2: Introduced a unified hierarchy. A process belongs to exactly one cgroup, and all controllers apply to that single group. Modern distributions and runtimes default to v2 for better consistency and features (like improved memory tracking).
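A quick way to check which version a given host runs is the filesystem type mounted at /sys/fs/cgroup:

```shell
# Prints "cgroup2fs" on a v2 (unified) host, or "tmpfs" on a v1 host
# where the tmpfs holds one mount per controller.
stat -fc %T /sys/fs/cgroup
```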

Use Cases & Benefits:

  • Multi-tenancy: Run apps for different customers on the same server without them accessing each other’s data (Namespaces).
  • Performance Protection: Ensure a background backup job doesn’t slow down the main web server (Cgroups).
  • Cost Efficiency: Squeeze more applications onto fewer servers safely.
  • Security Depth: If an app is hacked, the damage is contained within the Namespace “walls.”

Limitations:

  • Kernel Dependency: Unlike Virtual Machines (VMs) which have their own Kernel, containers share the Host Kernel. If the Host Kernel crashes (Kernel Panic), all containers die.
  • Security Boundaries: Namespaces are not as strictly secure as the hardware virtualization used in VMs. There are known “container escape” vulnerabilities.

Putting It All Together: The Anatomy of a Container

To summarize how this applies to modern infrastructure:

  1. A container runtime (like containerd, which delegates to the low-level runtime runc) is invoked.
  2. It uses clone() or unshare() to create a new set of Namespaces (PID, NET, MNT, etc.), effectively giving the process its own isolated “room.”
  3. It creates a new directory in the /sys/fs/cgroup filesystem, defining the Cgroup limits (e.g., max 512MB RAM, 0.5 CPU cores).
  4. It moves the new isolated process into this Cgroup.
  5. The process begins executing (e.g., running a Python app, a Node server, or a database) within these strict boundaries.
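You can replay steps 2 and 5 by hand with unshare(1). A minimal sketch, assuming a kernel that allows unprivileged user namespaces (otherwise run it with sudo and drop the --user/--map-root-user flags):

```shell
# --fork makes the child the first process of the new PID namespace
# (so it becomes PID 1); --mount-proc remounts /proc inside the new
# mount namespace so `ps` only sees processes in the new "room".
unshare --user --map-root-user --pid --fork --mount-proc \
    sh -c 'echo "inside: I am PID $$"; ps -o pid,comm'
```

From another terminal on the host, the same process shows up under an ordinary high PID: two views of one process, courtesy of the PID namespace.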

Understanding these primitives is crucial for debugging complex deployments, optimizing performance on EKS, and hardening your infrastructure against security threats.

DevSecOps Architect Level

Now that we understand the basics, let’s look at how these concepts apply at a massive scale in production environments.

1. The “PID 1” Problem (Systemd vs Docker Init)

In a standard Linux server, systemd is PID 1. It initializes the system, starts services, and crucially, it “reaps” zombie processes (cleans up dead child processes).

  • The Issue: In a container, your application (e.g., Java or Python) becomes PID 1. Most apps are not written for that role: they do not reap zombies, and the kernel gives PID 1 no default signal handling, so signals like SIGTERM are ignored unless the app installs its own handlers.
  • The Consequence: If your app crashes or spawns child processes that die, they become “zombies” and fill up the process table, eventually killing the container.
  • The Solution: Use a lightweight init system like Tini (built into Docker with the --init flag) or ensure your entrypoint script handles signals correctly.
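A common pattern is to make tini PID 1 explicitly in the image. A minimal sketch (the base image and app.py are hypothetical placeholders for your own app):

```dockerfile
FROM python:3.12-slim
# tini becomes PID 1: it forwards signals to the app and reaps zombies
RUN apt-get update && apt-get install -y --no-install-recommends tini \
    && rm -rf /var/lib/apt/lists/*
ENTRYPOINT ["/usr/bin/tini", "--"]
CMD ["python", "app.py"]
```

Alternatively, `docker run --init` injects the same shim without changing the image.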
2. Networking at Scale: Iptables vs. IPVS
  • Iptables: The traditional way Kubernetes/Docker handles service networking. It is a long sequential list of rules. If you have 5,000 services, the Kernel has to process a massive list of rules for every packet. This is slow (O(n) complexity).
  • IPVS (IP Virtual Server): Built into the Linux Kernel for Layer 4 load balancing. It uses a hash table structure, so even with 10,000 services the lookup time stays essentially constant (O(1) complexity).
  • Architect Advice: For large-scale production clusters, run kube-proxy in IPVS mode instead of iptables mode for better performance (this is a kube-proxy setting, independent of your CNI plugin).
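In Kubernetes the switch is a one-line change in the kube-proxy configuration; a sketch of the relevant KubeProxyConfiguration fragment:

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "rr"   # round-robin; alternatives include lc (least connection)
```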
3. Hardening: Seccomp & AppArmor

While namespaces provide isolation, a compromised container can still make dangerous system calls to the shared host kernel. DevSecOps architects use Seccomp (Secure Computing Mode) to filter and block unnecessary system calls, and AppArmor/SELinux to enforce mandatory access controls, preventing the container from touching sensitive host files even if it manages to break out of its namespace.
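As a sketch, here is a tiny custom Docker seccomp profile that allows everything except mount-related syscalls (a serious hardening profile would default-deny and whitelist instead):

```json
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["mount", "umount2", "pivot_root"],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}
```

It is applied with `docker run --security-opt seccomp=deny-mount.json`; Kubernetes exposes the same mechanism through the Pod's securityContext.seccompProfile field.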


Troubleshooting Common Issues

| Issue             | Problem Description                                  | Solution                                                                                              |
|-------------------|------------------------------------------------------|-------------------------------------------------------------------------------------------------------|
| Zombie Apocalypse | Container process table fills up with defunct processes. | Use the --init flag in Docker or use a base image with tini installed.                                |
| OOM Killed        | Container suddenly dies with an “OOMKilled” error.   | The cgroup memory limit was reached. Increase the limit or fix memory leaks in the app code.          |
| Port Conflict     | “Address already in use” error.                      | You are trying to bind a port on the host that is already taken. Use Docker port mapping to map to a different host port. |
| Slow Networking   | High latency in service discovery at scale.          | Switch from iptables mode to IPVS mode in your Kubernetes/Docker config.                              |
