
Kubernetes Pod Health Probes

In distributed systems, ensuring that your application is running is not enough; you must also ensure it is functioning correctly and capable of handling traffic. Kubernetes uses probes, which are essentially periodic health checks, to monitor the state of containers. By configuring these probes, you automate failure detection and service recovery, which supports high availability without manual intervention.

| Probe Type | The Role | The Question | When does it run? | Restart on Fail? | Traffic Status | Best Analogy |
|---|---|---|---|---|---|---|
| 1. Startup | The Bodyguard (The Initializer) | “Has the application actually started?” | FIRST. Runs exclusively. Pauses all other probes. | YES (Restarts the container) | BLOCKED (Implicitly) | The BIOS Check. When you turn on a PC, nothing works until the BIOS loads. If it hangs, you hard-reset the machine. |
| 2. Readiness | The Traffic Cop (The Gatekeeper) | “Can you accept a customer request right now?” | CONTINUOUS. Starts only after Startup passes. | NO (Safe! It just isolates the Pod) | PAUSED (Removes the Pod IP from the Service endpoints) | “Closed for Lunch”. The shop is open (Alive) and staff is there, but they flip the sign to “Closed” temporarily to restock. |
| 3. Liveness | The Doctor (The Watchdog) | “Are you deadlocked, frozen, or crashed?” | CONTINUOUS. Starts only after Startup passes. | YES (Restarts the container) | STOPS (The container is restarted, so traffic stops) | Defibrillator. The heart monitor sees the patient has flatlined (frozen). It shocks them (restart) to force a rhythm. |

1. Startup Probe (The “Protector”)

Think of a Startup Probe as a “construction barrier” or a “bodyguard” for a slow-starting application. When a complex application (like a legacy Java system) is first turning on, it can take a long time before it responds to health checks. If the Liveness Probe checks it too early, Kubernetes may kill it before it has finished starting. The Startup Probe holds back the other checks (Liveness and Readiness) and says, “Wait! Let it finish waking up first.” It only steps aside once the application is fully up and running.

  • Runs first: It disables Liveness and Readiness checks until it succeeds.
  • Exclusive: While the Startup Probe is running, no other probes run.
  • One-time success: Once it succeeds one time, it stops running forever for that container lifecycle.
  • Saves slow apps: It prevents Kubernetes from killing an app that is just slow to start, not broken.

A common mistake is confusing initialDelaySeconds with Startup Probes.

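  • initialDelaySeconds: A “dumb” timer. “Wait 60 seconds, then check.” If the app is ready in 10 seconds, you waste 50 seconds. If it needs 70 seconds, the first checks fail and the app risks being killed before it finishes starting.
  • Startup Probe: A “smart” poller. “Check every 5 seconds, up to a maximum of 60 seconds” (periodSeconds: 5 with failureThreshold: 12). If the app is ready in 10 seconds, the Liveness and Readiness probes take over right away, and traffic starts as soon as Readiness passes.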
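
As a rough illustration of the “smart poller” idea, the fragment below sketches a startup probe with a 60-second budget next to a liveness probe. The /healthz path and port 8080 are assumptions for the example, not values from any particular application.

```yaml
# Container spec fragment (sketch). Path and port are assumed values.
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5       # poll every 5 seconds
  failureThreshold: 12   # give up after 12 failures: 12 x 5s = 60s startup budget
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10      # only begins once the startup probe has succeeded
```

The startup budget is failureThreshold multiplied by periodSeconds, so a slower application can be given more time by raising failureThreshold without delaying fast starts.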

2. Readiness Probe (The “Traffic Controller”)

A Readiness Probe is how your application tells the system: “I am up, but please wait! I am not ready to take customer orders (traffic) yet.”

Think of a Readiness Probe like the Traffic Signal at a toll gate.

  • Green Light (Success): The vehicle (traffic/request) is allowed to pass through to the destination (Pod).
  • Red Light (Failure): The gate stays closed. No vehicles are let in, but the toll booth (Pod) is not destroyed; it just sits there waiting until it can turn the light green.
Key Characteristics to Remember
  • Non-Destructive: Unlike Liveness probes, a failed Readiness probe never kills or restarts the Pod. It only stops traffic.
  • Continuous: It runs periodically throughout the Pod’s life, not just at the start. If your app gets overloaded later, it can fail the probe to stop new traffic temporarily.
  • Service-Linked: It directly controls the Service's endpoints. If the probe fails, the Pod’s IP is removed from the endpoints of every Kubernetes Service that selects it, and it is added back once the probe passes again.
| Parameter | Default | Description | Recommended Setting (General) |
|---|---|---|---|
| initialDelaySeconds | 0s | Wait time before the first check. | Set to your average startup time (e.g., 5s). |
| periodSeconds | 10s | How often to check (gap between checks). | 10s is standard; lower for high-performance apps. |
| timeoutSeconds | 1s | How long to wait for a reply before counting a failure. | 1s-3s (Don’t make this too high). |
| successThreshold | 1 | How many “Passes” needed to open the gate. | 1 is usually enough. |
| failureThreshold | 3 | How many “Fails” needed to close the gate. | 3 (Gives a buffer for temporary blips). |
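
To show where these parameters live, here is a minimal sketch of a readiness probe that sets all five explicitly. The /ready path and port 8080 are assumed values chosen for illustration.

```yaml
# Container spec fragment (sketch). Path and port are assumed values.
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5   # wait 5s before the first check
  periodSeconds: 10        # check every 10s
  timeoutSeconds: 2        # a reply slower than 2s counts as a failure
  successThreshold: 1      # one pass opens the gate
  failureThreshold: 3      # three consecutive fails close the gate
```

With these values, a Pod that stops responding is taken out of rotation after three consecutive failed checks, which is roughly 20 to 30 seconds.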

At its core, a Readiness probe uses one of four mechanisms to “knock” on the container’s door:

  • HTTP GET: Most common for web apps. Kubelet sends a GET request to a path like /healthz or /ready.
  • TCP Socket: Checks if a specific port is open (useful for databases or non-HTTP apps).
  • Exec: Runs a specific command (e.g., cat /tmp/ready) inside the container. If it returns exit code 0, it’s healthy.
  • gRPC: A newer mechanism for gRPC-native services. The app must implement the gRPC Health Checking Protocol.
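
The four mechanisms can be sketched side by side in one Pod, with one container per mechanism. The Pod name, container names, images, paths, and ports below are all placeholders for illustration, and the gRPC variant assumes a reasonably recent cluster version.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: readiness-mechanisms        # hypothetical name
spec:
  containers:
    - name: web
      image: example.com/web:1.0    # placeholder image
      readinessProbe:
        httpGet:                    # any status from 200 to 399 counts as success
          path: /ready
          port: 8080
    - name: cache
      image: example.com/cache:1.0  # placeholder image
      readinessProbe:
        tcpSocket:                  # succeeds if the port accepts a connection
          port: 6379
    - name: worker
      image: example.com/worker:1.0 # placeholder image
      readinessProbe:
        exec:                       # succeeds if the command exits with code 0
          command: ["cat", "/tmp/ready"]
    - name: rpc
      image: example.com/rpc:1.0    # placeholder image
      readinessProbe:
        grpc:                       # app must implement the gRPC Health Checking Protocol
          port: 9090
```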
Use Cases
  • Monolithic App Startup: A Java app needs 60 seconds to load Spring Context. Readiness probe waits until it’s done.
  • Data Loading: An AI Service needs to download a 5GB model file from S3 before it can answer queries.
  • Overload Protection: If an app is processing too many requests, it can intentionally fail its own readiness check to stop receiving new requests while it finishes the current work (Backpressure).
Benefits
  • Zero Downtime: During rolling updates, new Pods receive traffic only once they report ready, so users are far less likely to hit a “dead” backend.
  • Self-Healing Traffic: Routes around broken Pods automatically without human intervention.
  • Smart Load Balancing: Ensures traffic is distributed only to healthy workers.
Common Issues and Solutions
  • Problem: Pod enters “CrashLoopBackOff” but Readiness Probe is passing.
    • Solution: These two cannot be true at the same moment. CrashLoopBackOff means the container keeps exiting or being killed (the app crashes, or a Liveness or Startup probe fails), and Kubernetes is waiting before the next restart. Readiness only matters while the container is actually running, so investigate why the container is dying first.
  • Problem: Readiness Probe fails with “Connection Refused”.
    • Solution: A common cause is that your app is not listening on 0.0.0.0 (all interfaces). It might be listening on 127.0.0.1 (localhost), which the Kubelet cannot reach through the Pod IP. Change your app config to listen on 0.0.0.0, and confirm that the probe port matches the port the app listens on.
  • Problem: Probe fails due to timeout.
    • Solution: Your app is too slow to respond. Increase timeoutSeconds or optimize your /healthz endpoint code to do less work (don’t run a heavy DB query in the health check!).

https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes

3. Liveness Probe (The “Restarter”)

In Kubernetes, a container can get into a “zombie” state: the process is running (PID 1 is active), but the application is stuck in a deadlock or infinite loop. The Liveness Probe is the automated finger that presses the restart button when this happens.

  • Goal: Recovers the app from broken states (deadlocks, freezes).
  • Action: If the probe fails, Kubernetes kills the container and restarts it.
  • Default Behavior: Without this probe, Kubernetes only restarts if the application crashes completely (process stops). It misses “stuck” apps.
  • Golden Rule: Never make a Liveness Probe depend on external systems (like a Database). If the DB goes down, you don’t want your App to restart endlessly!
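
One way to follow the Golden Rule is to give liveness and readiness separate endpoints, sketched below. The /livez and /ready paths and port 8080 are assumptions; the idea is that the liveness endpoint only reports on the process itself, while any dependency checks (if you want them at all) stay on the readiness side, where a failure pauses traffic instead of restarting the container.

```yaml
# Container spec fragment (sketch). Endpoint names and port are assumed values.
livenessProbe:
  httpGet:
    path: /livez     # assumed endpoint: checks only the app process, no database calls
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready     # assumed endpoint: may report "not ready" while a dependency is down
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
```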

When you define a Liveness Probe in your YAML file, you are telling the kubelet (the node agent) how to check your app. There are three main ways to check:

  1. HTTP Get: Kubelet sends a GET request to an endpoint (like /healthz). If it gets a status code from 200 to 399, you are good.
  2. Exec Command: Kubelet runs a command inside the container (like cat /tmp/healthy). If it returns exit code 0, you are good.
  3. TCP Socket: Kubelet tries to open a TCP connection to your port. If it connects, you are good.
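
As a small sketch of the Exec variant, the fragment below uses the cat /tmp/healthy command mentioned above. It assumes the application creates /tmp/healthy while it is healthy and removes it when it detects that it is stuck; that file convention is an assumption of this example, not something Kubernetes provides.

```yaml
# Container spec fragment (sketch). The /tmp/healthy file is an app-side convention.
livenessProbe:
  exec:
    command: ["cat", "/tmp/healthy"]   # exit code 0 means alive; non-zero counts as a failure
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3                  # restart after three consecutive failures
```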
Use Cases
  • Java Applications: Detecting OutOfMemory errors where the JVM is still running but can’t allocate heap.
  • Go/C++ Applications: Detecting deadlocks where threads are waiting on each other forever.
Benefits
  • Self-Healing: The system automatically recovers from bugs without you waking up at 3 AM.
  • High Availability: Ensures that traffic is not routed to “zombie” containers (eventually, as the restart clears them out).
Limitations
  • It cannot fix the root cause. If your code has a memory leak, Liveness Probe will just restart it every few hours. It’s a band-aid, not a cure.
  • It destroys local state. Any data in memory is lost when the container restarts.
Common Issues, Problems, and Solutions
  • Problem: Restart Loop (CrashLoopBackOff). The probe checks too early and kills the app before it has finished starting.
    • Solution: Increase initialDelaySeconds or, preferably, use a startupProbe.
  • Problem: False Positives. The app is busy processing a big request and answers the probe too slowly.
    • Solution: Increase timeoutSeconds or optimize the health check code to be lightweight.
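
Putting the two solutions together, the manifest below is a minimal sketch of a Pod that uses all three probes: a startup probe to avoid the restart loop and a slightly longer timeout to reduce false positives. The Pod name, image, endpoint paths, port, and the specific numbers are all assumptions to be tuned for a real application.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo                 # hypothetical name
spec:
  containers:
    - name: app
      image: example.com/app:1.0   # placeholder image
      ports:
        - containerPort: 8080
      startupProbe:                # guards slow starts instead of a long initialDelaySeconds
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 5
        failureThreshold: 24       # up to 24 x 5s = 120s to start
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        timeoutSeconds: 3          # raised from the 1s default to tolerate busy moments
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 10
        timeoutSeconds: 3
        failureThreshold: 3
```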
