Kubernetes Pod Health Probes

Posted December 26, 2021

Updated February 15, 2026

Author - Rajkumar Aute

In distributed systems, knowing that your application is running is not enough; you must also know that it is functioning correctly and can handle traffic. Kubernetes uses Probes, which are essentially periodic health checks, to monitor the state of containers. By configuring these probes, you automate failure detection and service recovery, which supports high availability without manual intervention.

Probe Type	The Role	The Question	When does it run?	Restart on Fail?	Traffic Status	Best Analogy
1. Startup	The Bodyguard (The Initializer)	“Has the application actually started?”	FIRST. Runs exclusively; holds back the other probes.	YES (Restarts the container)	BLOCKED (Implicitly)	The BIOS Check: When you turn on a PC, nothing works until the BIOS loads. If it hangs, you hard-reset the machine.
2. Readiness	The Traffic Cop (The Gatekeeper)	“Can you accept a customer request right now?”	CONTINUOUS. Starts only after Startup passes.	NO ❌ (Safe! It just isolates the Pod)	PAUSED (Removes the Pod’s IP from the Service endpoints)	“Closed for Lunch”: The shop is open (Alive) and staff is there, but they flip the sign to “Closed” temporarily to restock.
3. Liveness	The Doctor (The Watchdog)	“Are you deadlocked, frozen, or crashed?”	CONTINUOUS. Starts only after Startup passes.	YES (Restarts the container)	STOPS (The container is restarted, so traffic stops)	Defibrillator: The heart monitor sees the patient has flatlined (frozen). It shocks them (restart) to force a rhythm.

1. Startup Probe (The “Protector”)

Think of a Startup Probe as a “construction barrier” or a “bodyguard” for a slow-starting application. When a complex application (like a legacy Java system) is first turning on, it’s fragile and slow. If the Liveness probe checks it too early, Kubernetes may kill it before it has finished starting. The Startup Probe holds back the other checks (Liveness and Readiness) and says, “Wait! Let it finish waking up first.” It only steps aside once the application is fully up and running.

Startup Probe runs first: It disables Liveness and Readiness checks until it succeeds.
Exclusive: While the Startup Probe is running, no other probes run.
One-time success: Once it succeeds one time, it stops running forever for that container lifecycle.
Saves slow apps: It prevents Kubernetes from killing an app that is just slow to start, not broken.

A common mistake is confusing initialDelaySeconds with Startup Probes.

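initialDelaySeconds: A “dumb” timer. “Wait 60 seconds, then check.” If the app is ready in 10 seconds, you waste 50 seconds. If it takes 70 seconds, the checks begin against an app that is still starting, and repeated Liveness failures get it killed.
Startup Probe: A “smart” poller. “Check every 5 seconds up to a maximum of 60 seconds” (for example, periodSeconds: 5 with failureThreshold: 12). If the app is ready in 10 seconds, the Readiness and Liveness probes take over right away, so traffic can start as soon as Readiness passes.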
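The difference between the two approaches comes down to simple arithmetic. The sketch below is a simplified model, not kubelet code: the function names are hypothetical, and it assumes checks fire at exact multiples of periodSeconds with no jitter or probe timeout.

```python
from typing import Optional


def wasted_seconds_fixed_delay(app_ready_at_s: int, initial_delay_s: int) -> int:
    """Idle time between the app being ready and the first check of a fixed timer."""
    return max(0, initial_delay_s - app_ready_at_s)


def startup_probe_detects_at(
    app_ready_at_s: int, period_s: int, failure_threshold: int
) -> Optional[int]:
    """Time (in seconds) of the first passing startup check.

    Returns None when the budget of failure_threshold * period_s seconds
    runs out first, which is the point where the container is restarted.
    """
    for attempt in range(1, failure_threshold + 1):
        check_time = attempt * period_s
        if check_time >= app_ready_at_s:
            return check_time
    return None


# Fixed 60s delay, app ready after 10s: 50 seconds are wasted.
print(wasted_seconds_fixed_delay(app_ready_at_s=10, initial_delay_s=60))

# Polling every 5s, at most 12 times (a 60s budget): detected at 10s.
print(startup_probe_detects_at(app_ready_at_s=10, period_s=5, failure_threshold=12))

# App needs 70s but the budget is only 60s: None (the container is restarted).
print(startup_probe_detects_at(app_ready_at_s=70, period_s=5, failure_threshold=12))
```

For a slow application, the usual fix is to raise failureThreshold so that the budget covers the worst-case startup time.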

2. Readiness Probe (The “Traffic Controller”)

Readiness Probe is the way your application tells the system: “I am up, but please wait! I am not ready to take customer orders (traffic) yet.”

Think of a Readiness Probe like the Traffic Signal at a toll gate.

Green Light (Success): The vehicle (traffic/request) is allowed to pass through to the destination (Pod).
Red Light (Failure): The gate stays closed. No vehicles are let in, but the toll booth (Pod) is not destroyed; it just sits there waiting until it can turn the light green.

Key Characteristics to Remember

Non-Destructive: Unlike Liveness probes, a failed Readiness probe never kills or restarts the Pod. It only stops traffic.
Continuous: It runs periodically throughout the Pod’s life, not just at the start. If your app gets overloaded later, it can fail the probe to stop new traffic temporarily.
Service-Linked: It directly controls the Service’s endpoints (the Endpoints/EndpointSlice objects). If the probe fails, the Pod’s IP is removed from the ready addresses of every Service that selects it.

Parameter	Default	Description	Recommended Setting (General)
`initialDelaySeconds`	0s	Wait time before the first check.	Set to your average startup time (e.g., 5s).
`periodSeconds`	10s	How often to check (gap between checks).	10s is standard; lower for high-performance apps.
`timeoutSeconds`	1s	How long to wait for a reply before counting a failure.	1s-3s (Don’t make this too high).
`successThreshold`	1	How many “Passes” needed to open the gate.	1 is usually enough.
`failureThreshold`	3	How many “Fails” needed to close the gate.	3 (Gives a buffer for temporary blips).
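To see how successThreshold and failureThreshold interact, it can help to replay a sequence of probe results by hand. The following is a minimal model of the consecutive-count behaviour described in the table, assuming the Pod starts as not ready; readiness_timeline is a hypothetical helper, not a Kubernetes API.

```python
from typing import Iterable, List


def readiness_timeline(
    results: Iterable[bool],
    success_threshold: int = 1,
    failure_threshold: int = 3,
) -> List[bool]:
    """Replay probe results and return the ready/not-ready state after each one."""
    ready = False
    successes = 0  # consecutive passes
    failures = 0   # consecutive fails
    timeline = []
    for passed in results:
        if passed:
            successes += 1
            failures = 0
            if successes >= success_threshold:
                ready = True
        else:
            failures += 1
            successes = 0
            if failures >= failure_threshold:
                ready = False
        timeline.append(ready)
    return timeline


probe_results = [True, False, False, True, False, False, False, True]
print(readiness_timeline(probe_results))
# [True, True, True, True, True, True, False, True]
```

Two isolated failures do not close the gate because the pass in between resets the counter; only the run of three consecutive failures does. With the default periodSeconds of 10, that means a broken Pod can keep receiving traffic for roughly 30 seconds before it is removed.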

At its core, a Readiness probe uses one of four mechanisms to “knock” on the container’s door:

HTTP GET: Most common for web apps. Kubelet sends a GET request to a path like /healthz or /ready; any status code from 200 to 399 counts as success.
TCP Socket: Checks if a specific port is open (useful for databases or non-HTTP apps).
Exec: Runs a specific command (e.g., cat /tmp/ready) inside the container. If it returns exit code 0, it’s healthy.
gRPC: A newer method for gRPC-native services; it uses the standard gRPC health checking protocol.
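The first three mechanisms are easy to imitate from a laptop when debugging a failing probe. The functions below are a rough approximation using only the Python standard library; the names http_check, tcp_check and exec_check are made up for this sketch, and the real kubelet differs in details (for example, Exec runs inside the container, while this runs the command locally). The gRPC mechanism is left out because it needs a gRPC client library.

```python
import socket
import subprocess
import urllib.request
from typing import Sequence


def http_check(url: str, timeout_s: float = 1.0) -> bool:
    """HTTP GET: success when the final status code is between 200 and 399."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return 200 <= response.status < 400
    except OSError:
        # Covers connection errors, timeouts and HTTP 4xx/5xx responses
        # (urllib raises those as URLError/HTTPError, both OSError subclasses).
        return False


def tcp_check(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """TCP Socket: success when a connection to the port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def exec_check(command: Sequence[str], timeout_s: float = 1.0) -> bool:
    """Exec: success when the command finishes in time with exit code 0."""
    try:
        completed = subprocess.run(
            command,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return completed.returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        return False
```

For example, exec_check(["cat", "/tmp/ready"]) mirrors the Exec example above: it passes only while the file exists.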

Use Cases

Monolithic App Startup: A Java app needs 60 seconds to load Spring Context. Readiness probe waits until it’s done.
Data Loading: An AI Service needs to download a 5GB model file from S3 before it can answer queries.
Overload Protection: If an app is processing too many requests, it can intentionally fail its own readiness check to stop receiving new requests while it finishes the current work (Backpressure).
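The data-loading and overload cases can share a single /ready endpoint. Here is one possible shape for it, assuming a Python service built on the standard library; AppState, make_handler and the limit of in-flight requests are illustrative choices, not something Kubernetes prescribes.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class AppState:
    """Tracks warm-up and load so the readiness endpoint can answer quickly."""

    def __init__(self, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._warmed_up = False
        self._lock = threading.Lock()

    def mark_warmed_up(self) -> None:
        """Call once the model is downloaded or the app context is loaded."""
        with self._lock:
            self._warmed_up = True

    def begin_request(self) -> None:
        with self._lock:
            self._in_flight += 1

    def end_request(self) -> None:
        with self._lock:
            self._in_flight -= 1

    def is_ready(self) -> bool:
        with self._lock:
            return self._warmed_up and self._in_flight < self.max_in_flight


def make_handler(state: AppState):
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            if self.path == "/ready":
                # 503 tells the kubelet "alive, but send no traffic for now".
                code = 200 if state.is_ready() else 503
            else:
                code = 404
            self.send_response(code)
            self.send_header("Content-Length", "0")
            self.end_headers()

        def log_message(self, *args) -> None:
            pass  # keep probe requests out of the log

    return Handler


def serve(state: AppState, port: int = 8080) -> None:
    """Blocks forever; binds to all interfaces so the kubelet can reach it."""
    ThreadingHTTPServer(("0.0.0.0", port), make_handler(state)).serve_forever()
```

The application would call mark_warmed_up() after its slow initialization and wrap real request handling in begin_request() / end_request(). The check itself only reads two fields under a lock, so it stays cheap even when the app is busy.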

Benefits

Zero Downtime: During rolling updates, new Pods receive traffic only after they report ready, so users do not hit a “dead” backend.
Self-Healing Traffic: Routes around broken Pods automatically without human intervention.
Smart Load Balancing: Ensures traffic is distributed only to healthy workers.

Common Issues and Solutions

Problem: Pod enters “CrashLoopBackOff” but Readiness Probe is passing.
- Solution: These two states cannot hold at the same time. CrashLoopBackOff means the container keeps exiting or being killed (an application crash or a failing Liveness probe) and Kubernetes is backing off between restarts. Readiness only matters while the container is actually running.
Problem: Readiness Probe fails with “Connection Refused”.
- Solution: Your app is probably not listening on 0.0.0.0 (all interfaces). It might be listening on 127.0.0.1 (localhost), which the Kubelet cannot reach because it connects to the Pod’s IP from outside the container. Change your app config to listen on 0.0.0.0, and check that the probe port matches the port the app listens on.
Problem: Probe fails due to timeout.
- Solution: Your app is too slow to respond. Increase timeoutSeconds or optimize your /healthz endpoint code to do less work (don’t run a heavy DB query in the health check!).

https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes

3. Liveness Probe (The “Restarter”)

In Kubernetes, a container can get into a “zombie” state: the process is running (PID 1 is active), but the application is stuck in a deadlock or infinite loop. The Liveness Probe is the automated finger that presses the restart button when this happens.

Goal: Recovers the app from broken states (deadlocks, freezes).
Action: If the probe fails, Kubernetes kills the container and restarts it.
Default Behavior: Without this probe, Kubernetes only restarts if the application crashes completely (process stops). It misses “stuck” apps.
Golden Rule: Never make a Liveness Probe depend on external systems (like a Database). If the DB goes down, you don’t want your App to restart endlessly!
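One way to follow the Golden Rule is to keep two separate checks: Liveness looks only at the process itself, while Readiness is allowed to look at dependencies. A small sketch of that split, with hypothetical function names and plain booleans standing in for the real checks:

```python
def liveness_status(process_healthy: bool) -> int:
    """Liveness looks only at the process itself, never at dependencies."""
    return 200 if process_healthy else 500


def readiness_status(process_healthy: bool, database_reachable: bool) -> int:
    """Readiness may include dependencies: failing it only pauses traffic."""
    return 200 if (process_healthy and database_reachable) else 503


# Database outage: the Pod is taken out of rotation but is NOT restarted.
print(liveness_status(process_healthy=True))                             # 200
print(readiness_status(process_healthy=True, database_reachable=False))  # 503
```

When the database comes back, Readiness passes again and traffic resumes without a single restart.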

When you define a Liveness Probe in your YAML file, you are telling the kubelet (the node agent) how to check your app. There are three main ways to check:

HTTP Get: Kubelet pings an endpoint (like /healthz). If it gets a status code from 200 to 399 (typically 200 OK), you are good.
Exec Command: Kubelet runs a command inside the container (like cat /tmp/healthy). If it returns exit code 0, you are good.
TCP Socket: Kubelet tries to open a TCP connection to your port. If it connects, you are good.

Use Cases

Java Applications: Detecting OutOfMemory errors where the JVM is still running but can’t allocate heap.
Go/C++ Applications: Detecting deadlocks where threads are waiting on each other forever.
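A stuck worker is invisible to a health endpoint that always answers 200, so the endpoint needs some evidence that real work is still progressing. A common pattern is a heartbeat: the worker loop records a timestamp on every iteration, and the Liveness check fails once that timestamp is too old. The sketch below shows the idea in Python; the Heartbeat class and its 30-second limit are assumptions for illustration.

```python
import time
from typing import Callable


class Heartbeat:
    """Records when the worker loop last made progress."""

    def __init__(self, max_age_s: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.max_age_s = max_age_s
        self._clock = clock
        self._last_beat = clock()

    def beat(self) -> None:
        """Call from the worker loop after each unit of work."""
        self._last_beat = self._clock()

    def is_alive(self) -> bool:
        """False once the worker has been silent for longer than max_age_s."""
        return self._clock() - self._last_beat <= self.max_age_s


heartbeat = Heartbeat(max_age_s=30)
heartbeat.beat()             # the worker loop would call this on every iteration
print(heartbeat.is_alive())  # True right after a beat
```

The /healthz handler would return 200 while is_alive() is true and 500 otherwise. Pick max_age_s well above the longest legitimate unit of work, otherwise a slow but healthy request looks like a deadlock.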

Benefits

Self-Healing: The system automatically recovers from bugs without you waking up at 3 AM.
High Availability: Ensures that traffic is not routed to “zombie” containers (eventually, as the restart clears them out).

Limitations

It cannot fix the root cause. If your code has a memory leak, Liveness Probe will just restart it every few hours. It’s a band-aid, not a cure.
It destroys local state. Any data in memory is lost when the container restarts.

Common Issues, Problems, and Solutions

Problem: Restart Loop (CrashLoopBackOff). The probe checks too early.
- Solution: Increase initialDelaySeconds or use a startupProbe.
Problem: False Positives. The app is busy processing a big request and answers the probe too slowly.
- Solution: Increase timeoutSeconds or optimize the health check code to be lightweight.
