Kubernetes Pod Restart Policy
In Kubernetes, the `restartPolicy` is like a set of instructions you give to Kubernetes on how to handle the containers in your Pods when they stop running or crash. It decides whether to restart them or leave them be.
Think of the RestartPolicy like the “Auto-Start” feature in your car or bike.
- Always (the default, and the only value a Deployment accepts): Like a car with a “Start-Stop” system that always turns the engine back on, no matter why it stopped. Even if the program finished cleanly with exit code 0, the container is started again.
- OnFailure: Like a safety mechanism that only restarts the engine if it stalled unexpectedly. If you turn the key to “Off” (completed the trip), it stays off.
- Never: Like a manual engine. Once it stops (either you stopped it or it stalled), it stays stopped until a human intervenes. It never restarts automatically.
| Policy Name | Behavior on Success (Exit Code 0) | Behavior on Failure (Non-zero Exit Code) | Best Used For |
| --- | --- | --- | --- |
| Always | Restart | Restart | Long-running apps (Web servers, APIs, Databases) |
| OnFailure | Do Not Restart | Restart | Batch Jobs, Data processing, Backups |
| Never | Do Not Restart | Do Not Restart | One-time testing, Debugging |
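The table above boils down to a small decision function. Here is a minimal Python sketch of that logic. It is an illustration only, not the kubelet's actual code, and the name `should_restart` is made up for this example.

```python
def should_restart(policy: str, exit_code: int) -> bool:
    """Return True if a container that exited with exit_code would be restarted."""
    if policy == "Always":
        return True                # restart on success and on failure
    if policy == "OnFailure":
        return exit_code != 0      # restart only after a non-zero exit
    if policy == "Never":
        return False               # stay stopped either way
    raise ValueError(f"unknown restart policy: {policy!r}")


# Print one row per policy: (restart on exit 0, restart on exit 1).
for policy in ("Always", "OnFailure", "Never"):
    print(policy, should_restart(policy, 0), should_restart(policy, 1))
```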
Deployments manage stateless applications (like a React frontend or Node.js backend).
- The Rule: Deployments must use `Always`. In fact, if you try to use `OnFailure` or `Never` in a Deployment, the API server rejects the manifest, because a Deployment’s goal is to keep the app running forever.
- The Job: If you are running a Job (like a script to send daily emails), you must use `OnFailure` or `Never`. Kubernetes rejects `Always` for a Job, and for good reason: the email script would finish sending emails, stop, and then be started again, sending the emails twice, then thrice… a disaster!
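To make the Deployment and Job rules concrete, here is a rough Python sketch that builds a minimal Job manifest and refuses a restart policy a Job cannot use. The helper `build_job_manifest`, the Job name, and the image `example.com/daily-email:1.0` are hypothetical placeholders. The manifest fields (`apiVersion`, `kind`, `spec.template.spec.restartPolicy`) are the standard ones.

```python
import json

# Restart policies accepted for each workload kind, as described above.
ALLOWED_POLICIES = {
    "Deployment": {"Always"},
    "Job": {"OnFailure", "Never"},
}


def build_job_manifest(name: str, image: str, restart_policy: str = "OnFailure") -> dict:
    """Build a minimal Job manifest (hypothetical helper, for illustration)."""
    if restart_policy not in ALLOWED_POLICIES["Job"]:
        raise ValueError(f"a Job cannot use restartPolicy {restart_policy!r}")
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    # The policy lives in the Pod template, not on the Job itself.
                    "restartPolicy": restart_policy,
                    "containers": [{"name": name, "image": image}],
                }
            }
        },
    }


# Placeholder name and image; replace with your own.
manifest = build_job_manifest("daily-email", "example.com/daily-email:1.0")
print(json.dumps(manifest, indent=2))
```

Since `kubectl` accepts JSON as well as YAML, the printed output could be piped into `kubectl apply -f -`.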
At an architect level, you need to understand the Exponential Back-off Delay. When a container keeps crashing, Kubelet doesn’t hammer the CPU by restarting it every millisecond.
- Delay Calculation: It starts at 10 seconds, then 20s, 40s, 80s… up to a maximum of 5 minutes (300s).
- Reset: The delay is reset if the container runs successfully for 10 minutes.
- Sidecar Containers (v1.28+): Kubernetes recently introduced native sidecar support. Now, init containers can also have a restart policy (set to `Always`), allowing them to survive alongside the main application. This is huge for service mesh proxies like Istio or logging agents.
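The delay calculation described above (start at 10 seconds, double each time, cap at 300 seconds) can be modeled in a few lines of Python. This is a sketch of the sequence only: the function name and parameters are invented for the example, and the real kubelet's timing may differ in detail.

```python
def backoff_delay(restart_count: int, base: int = 10, cap: int = 300) -> int:
    """Seconds to wait before restart number restart_count (0 = first restart)."""
    if restart_count < 0:
        raise ValueError("restart_count must be non-negative")
    # Double the base delay for every previous restart, but never exceed the cap.
    return min(base * 2 ** restart_count, cap)


delays = [backoff_delay(n) for n in range(7)]
print(delays)  # [10, 20, 40, 80, 160, 300, 300]
```

Note that the sixth value would be 320 seconds without the cap, which is why the sequence flattens at 300.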
Key Characteristics
- Defined at Pod Spec Level: You cannot set this for individual containers inside a regular Pod (except for the new Sidecar feature). It applies to every container in the Pod.
- Kubelet Managed: The logic is handled by the Kubelet on the specific node, not by the Control Plane (API Server).
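To show where the field sits, here is a small Python sketch of a Pod manifest written as a dictionary. All names and images are placeholders. The Pod-level `restartPolicy` covers the regular containers, and the only container-level `restartPolicy` is the one on the sidecar-style init container.

```python
import json

# Placeholder names and images, for illustration only.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "worker-with-log-agent"},
    "spec": {
        # Pod-level policy: applies to the regular containers below.
        "restartPolicy": "OnFailure",
        "initContainers": [
            {
                "name": "log-agent",
                "image": "example.com/log-agent:1.0",
                # Container-level policy: this marks the init container as a sidecar.
                "restartPolicy": "Always",
            }
        ],
        # Regular containers carry no restartPolicy of their own.
        "containers": [{"name": "worker", "image": "example.com/worker:1.0"}],
    },
}

print(json.dumps(pod, indent=2))
```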
Use Case
- Web Server (Nginx/Apache): Use `Always`. If Nginx crashes, we need it back ASAP.
- Database Migration Script: Use `OnFailure`. If the migration works, stop. If it fails due to a network glitch, retry.
- Video Processing Worker: Use `OnFailure`. Process the video, then shut down to save money (if using spot instances or auto-scaling).
Common Issues, Problems, and Solutions
- Problem: CrashLoopBackOff.
  - Scenario: You deployed an app, but the config is wrong. It starts, crashes, restarts, crashes…
  - Solution: Check logs using `kubectl logs <pod-name> --previous`. The `--previous` flag is a lifesaver here because it shows the logs of the dead container, not the currently restarting one.
- Problem: Job marked as “Completed” but actually failed.
  - Scenario: Your code returns exit code 0 even when it catches an exception.
  - Solution: Fix your code! Ensure that if an error occurs, the program exits with `sys.exit(1)` (in Python) or similar, so `OnFailure` knows to kick in.
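For the exit-code problem above, one possible pattern in Python looks like this. It is a sketch under simple assumptions: `run_job`, `entrypoint`, and `flaky_task` are made-up names, and `flaky_task` only simulates a failure.

```python
import sys


def run_job(task) -> int:
    """Run task and translate its outcome into a process exit code."""
    try:
        task()
    except Exception as exc:
        # Report the error, then signal failure instead of swallowing it.
        print(f"job failed: {exc}", file=sys.stderr)
        return 1
    return 0


def flaky_task() -> None:
    """Hypothetical stand-in for real work that hits a network glitch."""
    raise ConnectionError("mail server unreachable")


def entrypoint(task) -> None:
    """Exit the process with the job's status so Kubernetes can see it."""
    sys.exit(run_job(task))


# Demonstration only: show the exit codes without terminating the interpreter.
print(run_job(flaky_task))    # 1
print(run_job(lambda: None))  # 0
```

In a real container, the script would end with a call such as `entrypoint(flaky_task)`, so that a caught exception still produces a non-zero exit code.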
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy