Kernel Namespaces & Cgroups
Welcome to the very foundation of containerization! Before jumping into Docker or Kubernetes, it is crucial to understand what is happening “under the hood” of the Linux Operating System. Many people think containers are magic, but they are just standard Linux features put together in a smart way.
Think of it this way:
- Kernel Namespaces: Think of these as the “Walls” of a room in an apartment. They provide isolation. Even though you are in the same building (Server), you cannot see what is happening in the neighbor’s room (Container) because of the walls.
- Cgroups (Control Groups): Think of these as the “Electricity Meters”. They provide limits. They ensure one room doesn’t use up all the electricity (CPU/RAM) of the entire building.
- Linux Bridge: Think of this as a “Virtual Network Switch” connecting all the rooms to the outside world.
Here are the core concepts in an easy-to-remember form:
- Namespaces decide what a process can see.
- Cgroups decide how much a process can use.
- Bridges allow containers to talk to each other on a single host.
- IPVS is the heavy-lifter for load balancing, much faster than iptables at scale.
- PID 1 is the boss process; if it dies, the container dies.
Key Characteristics to Remember
- Isolation: Achieved via Namespaces (PID, NET, MNT, UTS, IPC, USER).
- Resource Management: Achieved via Cgroups (CPU, Memory, PIDs).
- Packet Filtering and Load Balancing: iptables filters and rewrites (NATs) packets; IPVS load-balances connections across backends.
- Process Lifecycle: Containers usually do not run systemd; they run the application directly as PID 1.
| Feature | Role | Linux Command (Try it!) | Complexity |
| --- | --- | --- | --- |
| Namespaces | Isolation (Visibility) | unshare, lsns | Medium |
| Cgroups | Resource Limiting (Usage) | systemd-cgtop | High |
| Bridge | Layer 2 Switching | brctl, ip link | Low |
| Iptables | Firewall & NAT rules | iptables -L | High |
| IPVS | High-performance Load Balancing | ipvsadm | Very High |
| Systemd | Host Init System (Service Manager) | systemctl | Medium |
Containers are not “real” physical objects; they are isolated execution environments created by combining two specific Linux kernel features: Namespaces and Control Groups (Cgroups).
Kernel Namespaces (Isolation)
Namespaces provide isolation. They trick a process into thinking it has its own dedicated system resources, separate from other processes. When a process is wrapped in a namespace, it cannot see or affect the resources of other processes outside that namespace.
Key namespaces include:
- PID (Process ID): The process appears as PID 1 inside the container, even if it is PID 12345 on the host.
- NET (Network): Gives the container its own network stack (IP, localhost, routing table) separate from the host.
- MNT (Mount): Allows the process to have its own root filesystem (/) different from the host.
- UTS (UNIX Time-Sharing): Allows the container to have its own hostname.
- IPC (Inter-Process Communication): Isolates shared memory and semaphores.
- USER: Maps user IDs inside the container to different IDs on the host (e.g., root inside, non-root outside).
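To make the list above concrete, here is a minimal sketch that reads the namespace links the kernel exposes under /proc/<pid>/ns. Two processes share a namespace when the link targets match. The function names and the `proc_root` parameter are illustrative assumptions (the parameter only exists so the code can be pointed at a test directory), not part of any library.

```python
import os


def list_namespaces(pid="self", proc_root="/proc"):
    """Return {namespace type: identifier}, e.g. {"pid": "pid:[4026531836]"}."""
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    namespaces = {}
    for name in sorted(os.listdir(ns_dir)):
        # Each entry is a symlink whose target names the namespace type and inode.
        namespaces[name] = os.readlink(os.path.join(ns_dir, name))
    return namespaces


def share_namespace(ns_a, ns_b, kind):
    """True when both processes are in the same namespace of the given kind."""
    return kind in ns_a and ns_a[kind] == ns_b.get(kind)
```

On a Linux host, comparing `list_namespaces()` with `list_namespaces(some_container_pid)` (reading another user's process usually needs root) should show different identifiers for the namespaces the container does not share with the host.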
Cgroups (Resource Control)
While namespaces hide processes from each other, they do not stop one process from consuming all the RAM or CPU. Control Groups (Cgroups) provide resource limitation and accounting.
- Resource Limiting: You can set a hard limit on how much memory a container can use (e.g., 512 MB) and cap its CPU time with a quota.
- Prioritization: You can guarantee that critical containers get more CPU time than background tasks.
- Accounting: Cgroups track exactly how much resource usage a group of processes has consumed (essential for billing and monitoring).
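The limiting and accounting described above can be observed directly in the cgroup filesystem. The sketch below assumes cgroup v2 (the unified hierarchy, typically mounted at /sys/fs/cgroup), where each group directory contains `memory.current` (bytes in use) and `memory.max` (the limit, or the literal `max` for unlimited). The helper names and the directory you pass in are assumptions for illustration.

```python
import os


def read_limit(path):
    """Parse a cgroup v2 value file; the literal "max" means unlimited (None)."""
    with open(path) as f:
        value = f.read().strip()
    return None if value == "max" else int(value)


def memory_usage(cgroup_dir):
    """Return (current_bytes, limit_bytes_or_None, fraction_used_or_None)."""
    current = read_limit(os.path.join(cgroup_dir, "memory.current"))
    limit = read_limit(os.path.join(cgroup_dir, "memory.max"))
    # Without a limit there is nothing to compute a fraction against.
    fraction = current / limit if limit else None
    return current, limit, fraction
```

Pointing `memory_usage` at a container's cgroup directory gives the same kind of numbers a monitoring agent would collect; on cgroup v1 hosts the file names differ, so this sketch would not apply as written.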
Namespaces decide what a process can see. Cgroups decide what a process can use.
–
DevSecOps Architect Level
The “PID 1” Problem (Systemd vs Docker Init)
In a standard Linux server, Systemd is PID 1. It initializes the system, starts services, and crucially, it “reaps” zombie processes (cleans up dead child processes).
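- The Issue: In a container, your application (e.g., Java or Python) becomes PID 1. Most apps are not written to reap zombies, and the kernel does not apply default signal actions to PID 1, so a signal such as SIGTERM is ignored unless the app installs a handler for it.
- The Consequence: Child processes that exit without being reaped, including orphans re-parented to PID 1, linger as “zombies.” They can fill up the process table (or hit the container's PID limit) until no new processes can be started, and an ignored SIGTERM means the container is only stopped by a forced kill once the stop timeout expires.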
- The Solution: Use a lightweight init system like Tini (built into Docker with --init) or ensure your entrypoint script handles signals correctly.
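To show what an init process actually does for you, here is a minimal sketch of the two duties described above: forwarding signals to the application and reaping every dead child. It is a simplified illustration in Python, not how Tini itself is implemented, and `run_as_init` is a hypothetical name.

```python
import os
import signal


def run_as_init(argv, forward=(signal.SIGTERM, signal.SIGINT)):
    """Run argv as a child, forward signals to it, and reap every dead child."""
    child_pid = os.spawnvp(os.P_NOWAIT, argv[0], argv)

    def relay(signum, _frame):
        # Pass the signal on so the application can shut down cleanly.
        try:
            os.kill(child_pid, signum)
        except ProcessLookupError:
            pass  # the child has already exited

    previous = {sig: signal.signal(sig, relay) for sig in forward}
    try:
        while True:
            try:
                # Blocks until any child exits; this also reaps adopted orphans.
                pid, status = os.wait()
            except ChildProcessError:
                return 1  # no children left and the main child was never seen
            if pid == child_pid:
                return os.waitstatus_to_exitcode(status)
    finally:
        # Restore whatever handlers were installed before we started.
        for sig, handler in previous.items():
            signal.signal(sig, handler)
```

An entrypoint could end with something like `sys.exit(run_as_init(["my-app", "--serve"]))`, where `my-app` is a placeholder for your real command. In practice, `docker run --init` gives you the same behavior without custom code.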
Iptables vs. IPVS (Scaling Networking)
- Iptables: The traditional way Kubernetes/Docker handles Service networking. It is a long list of rules. If you have 5,000 services, the Kernel has to process a massive list of rules for every packet. This is slow (O(n) complexity).
- IPVS (IP Virtual Server): Built into the Linux Kernel for Layer 4 load balancing. It uses a hash table structure. Even with 10,000 services, a lookup takes roughly constant time (O(1) complexity on average).
- Architect Advice: For large-scale production clusters, evaluate running kube-proxy in IPVS mode instead of iptables mode for better performance. Note that this is a kube-proxy setting, not a CNI (Container Network Interface) plugin setting.
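The O(n) versus O(1) difference above can be illustrated with a toy model. This is not how the kernel implements either subsystem; it only contrasts scanning an ordered rule list with probing a hash table. The service addresses and backend names are made up for the example.

```python
def match_linear(rules, destination):
    """Scan an ordered rule list (iptables-style); return (backend, rules examined)."""
    for examined, (rule_destination, backend) in enumerate(rules, start=1):
        if rule_destination == destination:
            return backend, examined
    return None, len(rules)


def match_hashed(table, destination):
    """Probe a hash table keyed by destination (IPVS-style); one lookup per packet."""
    return table.get(destination), 1


# Hypothetical cluster: 5,000 virtual IPs on port 80, each mapped to one backend.
rules = [
    (("10.96.%d.%d" % (i // 250, i % 250), 80), "backend-%d" % i)
    for i in range(5000)
]
table = dict(rules)
last_service = rules[-1][0]
```

Calling `match_linear(rules, last_service)` has to walk all 5,000 entries before it finds the match, while `match_hashed(table, last_service)` finds the same backend in a single probe. The gap grows with every service you add.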
–
Key Characteristics
- Granularity: Limits are fine-grained; a memory limit, for example, is specified in bytes.
- Transparency: The application inside is not told it is being limited; it simply runs into the limit (for example, it is throttled on CPU or OOM-killed on memory).
- Ephemerality: Network interfaces (veth pairs) are created and destroyed dynamically as containers spin up and down.
Use Case
- Multi-tenancy: Running apps for different customers on the same server without them accessing each other’s data (Namespaces).
- Performance Protection: Ensuring a background backup job doesn’t slow down the main web server (Cgroups).
Benefits
- Cost Efficiency: Squeeze more applications onto fewer servers safely.
- Security Depth: If an app is hacked, the damage is contained within the Namespace “walls.”
Limitations
- Kernel Dependency: Unlike Virtual Machines (VMs) which have their own Kernel, containers share the Host Kernel. If the Host Kernel crashes (Kernel Panic), all containers die.
- Security Boundaries: Namespaces are not as secure as the hardware virtualization used in VMs. There are “escape” vulnerabilities.
Common Issues, Problems, and Solutions
| Issue | Problem Description | Solution |
| --- | --- | --- |
| Zombie Apocalypse | Container process table fills up with defunct processes. | Use --init flag in Docker or use a base image with tini installed. |
| OOM Killed | Container suddenly dies with “OOMKilled” error. | The Cgroup memory limit was reached. Increase the limit or fix memory leaks in the app code. |
| Port Conflict | “Address already in use” error. | You are trying to bind a port on the Host that is already taken. Use Docker port mapping to map to a different host port. |
| Slow Networking | High latency when routing traffic to Services at scale. | Switch kube-proxy from iptables mode to IPVS mode in your Kubernetes configuration. |
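When you suspect the “OOM Killed” case from the table, the cgroup itself keeps a counter you can check. The sketch below assumes cgroup v2, where each group has a `memory.events` file made of `key value` lines, including `oom_kill`. The function names are hypothetical, and the sample text is made up for illustration.

```python
def parse_memory_events(text):
    """Parse the "key value" lines of a cgroup v2 memory.events file into a dict."""
    events = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            events[key] = int(value)
    return events


def oom_kill_count(text):
    """Number of processes in the group killed by the OOM killer (0 if absent)."""
    return parse_memory_events(text).get("oom_kill", 0)


# Made-up sample of what the file might contain after one OOM kill.
sample = "low 0\nhigh 0\nmax 12\noom 1\noom_kill 1\n"
```

Reading the real file from the container's cgroup directory and passing its contents to `oom_kill_count` tells you whether the memory limit, rather than an application bug, ended the process.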