Kubernetes TTL Controller
Imagine you are baking cookies. You use a timer. When the timer goes off (the job is done), you don’t leave the dirty trays in the oven forever; you clean them up.
In Kubernetes, Jobs create Pods to do work (like a backup or a calculation). Once the work is finished (Completed or Failed), these Pod and Job objects normally stay around forever, cluttering your cluster like dirty dishes. The TTL Controller is the automatic dishwasher: you set a timer (ttlSecondsAfterFinished), and once the Job is done, the controller waits that long and then automatically deletes the Job and its Pods.
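In manifest form, that single field looks like this. A minimal sketch, closely following the example in the Kubernetes docs; the pi-calculating container is just a stand-in workload:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pi-with-ttl
spec:
  ttlSecondsAfterFinished: 100   # delete this Job (and its Pods) 100 seconds after it finishes
  template:
    spec:
      containers:
        - name: pi
          image: perl:5.34
          command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never       # Job Pods must use Never or OnFailure
```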
Key Characteristics
- “If your completed Jobs are piling up and cluttering kubectl get jobs, use ttlSecondsAfterFinished.”
- “Setting ttlSecondsAfterFinished: 0 triggers immediate deletion after completion.”
- “This controller does NOT terminate running Pods; it only cleans up dead ones.”
| Feature | Description |
| --- | --- |
| Primary Goal | Automatically clean up finished Jobs (and their Pods) to save resources. |
| Key Field | .spec.ttlSecondsAfterFinished |
| Trigger | Fires only when the resource reaches a terminal state (Complete or Failed). |
| Scope | Jobs (stable since Kubernetes v1.23); a Job's Pods are removed with it via cascading deletion. |
| Component | Runs inside the kube-controller-manager. |
The TTL (Time-To-Live) Controller is a control loop within the kube-controller-manager that manages the lifecycle of resource objects that have finished execution. Its primary function is to enforce a TTL policy on Jobs, ensuring that a Job (and, through cascading deletion, its Pods) is garbage collected a user-defined duration after it reaches a terminal state (Completed or Failed).
This mechanism solves the “resource leak” problem where thousands of old, completed Job objects remain in the API server, consuming etcd storage and slowing down API responses. Unlike the standard Garbage Collector (which handles owner-dependent relationships), the TTL controller handles time-based cleanup for finished resources.
- The “Zombie” Job Problem: By default, a Kubernetes Job stays in the API server until you delete it. If you create a Job every minute, after 24 hours you have up to 1,440 dead Job objects. (CronJobs do prune their own history via successfulJobsHistoryLimit and failedJobsHistoryLimit, but standalone Jobs are never cleaned up; see the CronJob sketch after this list.)
- The Fix: You add one line to your YAML: ttlSecondsAfterFinished: 100.
- The Result: 100 seconds after the Job says “I’m done!”, it vanishes.
- Supported Resources: Jobs. (The controller does not act on bare Pods; a Job’s Pods are deleted together with the Job via cascading deletion.)
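When the Jobs come from a CronJob, the field belongs inside jobTemplate.spec so every spawned Job inherits it. A sketch; the name, schedule, and busybox command are illustrative:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: minutely-report            # illustrative name
spec:
  schedule: "* * * * *"            # runs every minute
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 100 # each spawned Job is deleted 100 seconds after it finishes
      template:
        spec:
          containers:
            - name: report
              image: busybox:1.36
              command: ["sh", "-c", "echo generating report"]
          restartPolicy: Never
```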
Use Cases
- CI/CD Runners: Ephemeral build agents spawned as Jobs.
- Machine Learning Training: Massive batch jobs where keeping metadata for 10,000 completed runs slows dashboards and kubectl to a crawl.
- Database Migrations: One-off tasks that run on deploy.
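For the migration case, a sketch of a one-off Job that keeps itself around for an hour so you can read its logs before it disappears; the image and command are placeholders:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate                  # illustrative name, e.g. stamped per release
spec:
  ttlSecondsAfterFinished: 3600     # keep the finished Job for 1 hour so logs stay inspectable
  backoffLimit: 0                   # a failed migration should not be retried blindly
  template:
    spec:
      containers:
        - name: migrate
          image: my-registry/myapp-migrations:latest   # placeholder image
          command: ["./migrate", "up"]                  # placeholder command
      restartPolicy: Never
```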
Best Practices
- Logging: If you rely on kubectl logs to debug failures, do not set TTL to 0. Give yourself a buffer (e.g., 3600 seconds / 1 hour) to inspect logs before they are deleted.
- Centralized Logging: If you ship logs to Elastic/Splunk/Datadog, you can safely set a low TTL (e.g., 60 seconds) since you don’t need the Pod for logs.
- Default Policy: Use a Mutating Admission Webhook (like Kyverno) to inject a default ttlSecondsAfterFinished: 86400 (24 h) into all Jobs to prevent clutter (sketched below).
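One way to implement that default policy, assuming Kyverno is installed in the cluster, is a mutate rule using Kyverno's +() “add if not present” anchor, so Jobs that already set their own TTL are left alone; the policy and rule names here are illustrative:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-default-job-ttl        # illustrative policy name
spec:
  rules:
    - name: set-ttl-if-missing
      match:
        any:
          - resources:
              kinds:
                - Job
      mutate:
        patchStrategicMerge:
          spec:
            # +() anchor: only add the field when the Job does not set it itself
            +(ttlSecondsAfterFinished): 86400
```

Jobs created by CronJobs also pass through admission, so a policy like this catches those as well.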
https://kubernetes.io/docs/concepts/workloads/controllers/job/#clean-up-finished-jobs-automatically
https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates