Kubernetes Jobs and CronJobs
In Kubernetes, we usually want our applications to run forever, like a web server or a database. But sometimes we have tasks that just need to run once and then stop, such as performing a database backup, processing a batch of files, or sending out emails.
This is where Jobs and CronJobs come in.
- Job: Use this when you have a specific task to do now. Once the task finishes successfully, the Job is considered complete.
- CronJob: Use this when you have a task that needs to happen repeatedly on a schedule (like “every day at 5 PM” or “every Monday”).
| Feature | Job | CronJob |
| --- | --- | --- |
| Primary Goal | Run a task once until completion. | Run a task periodically on a schedule. |
| Trigger | Manual (kubectl apply) or external trigger. | Time-based (Unix cron format). |
| Pod Lifecycle | Pods terminate (exit 0) after success. | Creates a Job object, which then creates Pods. |
| Restart Policy | OnFailure or Never (cannot be Always). | OnFailure or Never. |
| Key Parameter | completions (how many times to succeed). | schedule (when to run). |
| Failure Handling | Retries based on backoffLimit. | Retries via the Job it creates. |
When we talk about Workload Controllers like Deployments or StatefulSets, we are talking about Long-Running Processes. However, Jobs and CronJobs handle Batch Processes.
The Job Controller:
When you create a Job, the Job Controller starts a Pod. It watches that Pod closely. If the Pod crashes (exit code non-zero), the controller starts a new one to replace it. It keeps doing this until the Pod finishes successfully (exit code 0).
The CronJob Controller:
The CronJob Controller is actually a manager of Jobs. It does not touch Pods directly. Every time the schedule strikes (e.g., midnight), the CronJob creates a new Job object. That new Job object then goes ahead and creates the Pods. This separation is important for stability.
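You can see this two-level chain on a live cluster once any CronJob has fired at least once. The commands below are plain kubectl; the name nightly-backup-secure is borrowed from Lab 2 later in this post, so substitute your own CronJob if you are exploring a different cluster.
# The CronJob object itself only holds the schedule
kubectl get cronjob nightly-backup-secure
# Each tick creates a brand-new Job, named <cronjob-name>-<scheduled-timestamp>
kubectl get jobs
# ...and each of those Jobs creates the Pod(s) that do the actual work
kubectl get pods -l job-name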
- Exit Codes Matter: In a Job, your container must exit with code 0 to tell Kubernetes "I finished successfully." If your script crashes or returns exit code 1, Kubernetes thinks it failed and will retry it.
- Restart Policy: You cannot use restartPolicy: Always for a Job. Why? Because Always means "if it stops, start it again," but a Job wants to stop when it is done. So we use OnFailure (restart the container if it crashes) or Never (create a totally new Pod if it fails). You can see both rules in action with the small test Job sketched below.
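Here is a minimal throwaway Job you could use for that experiment (the name always-fails and the echo text are purely illustrative). Its container exits with code 1 on purpose, and because restartPolicy is Never, every retry shows up as a separate failed Pod until backoffLimit is exhausted.
apiVersion: batch/v1
kind: Job
metadata:
  name: always-fails              # hypothetical test Job, safe to delete afterwards
spec:
  backoffLimit: 2                 # stop retrying after a couple of failures
  template:
    spec:
      containers:
      - name: fail
        image: busybox
        # A non-zero exit code means "I failed", so the Job controller retries
        command: ["sh", "-c", "echo 'doing some work...'; exit 1"]
      restartPolicy: Never        # each retry becomes a brand-new Pod
After applying it, kubectl get pods shows a small pile of Error pods, and kubectl describe job always-fails eventually reports BackoffLimitExceeded.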
DevSecOps Architect Level
For a production-grade DevSecOps environment, simply running a Job isn’t enough. You must handle resources, security, and cleanup.
1. Automatic Cleanup (TTL Controller)
One common problem is that completed Jobs stay in your cluster forever, cluttering up your kubectl get jobs list.
- Solution: Use .spec.ttlSecondsAfterFinished.
- Architect Note: Set this to, e.g., 100 seconds. This automatically deletes the Job and its Pods after they finish, as in the minimal sketch below.
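A minimal placement sketch for that field, using a hypothetical throwaway Job named ttl-demo (the 100-second value is just the example from the note above):
apiVersion: batch/v1
kind: Job
metadata:
  name: ttl-demo                   # hypothetical example Job
spec:
  ttlSecondsAfterFinished: 100     # delete the Job and its Pods 100s after it finishes
  template:
    spec:
      containers:
      - name: hello
        image: busybox
        command: ["echo", "one-off task done"]
      restartPolicy: Never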
2. Handling “Sidecars” in Jobs
- The Problem: If you use a service mesh (like Istio or Linkerd) or a log shipper sidecar, the main application container might finish, but the sidecar keeps running. Because one container is still running, the Pod never “completes,” and the Job hangs forever.
- The Solution: You often need a script wrapper to kill the sidecar once the main app is done, or use native Kubernetes sidecar support (the SidecarContainers feature gate in newer K8s versions), sketched below.
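A rough sketch of the native-sidecar approach, assuming a hypothetical log-shipper image: the sidecar is declared as an init container with restartPolicy: Always (this is what the SidecarContainers feature gate enables, on by default since Kubernetes 1.29), so the Job can complete as soon as the main container exits.
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-with-sidecar        # hypothetical example
spec:
  template:
    spec:
      initContainers:
      - name: log-shipper         # native sidecar: starts first, keeps running alongside the main container
        image: fluent/fluent-bit  # illustrative log-shipper image
        restartPolicy: Always     # this is what marks an init container as a sidecar
      containers:
      - name: main-task
        image: busybox
        command: ["sh", "-c", "echo 'processing batch...'; sleep 5"]
      restartPolicy: Never
When main-task exits, the kubelet shuts the sidecar down and the Job completes; without native sidecars, you are back to the wrapper-script workaround.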
3. Concurrency Policy in CronJobs
This is critical for data integrity.
- Allow (default): If the 1:00 PM backup is slow and takes 2 hours, and the 2:00 PM backup starts, both run at the same time. This might crash your database!
- Forbid: If the 1:00 PM backup is still running, the 2:00 PM backup is skipped entirely. This is usually the safest for heavy ops.
- Replace: The 1:00 PM backup is killed, and the 2:00 PM one starts.
Lab 1: The Robust “Pi” Calculator Job
This version includes resource limits, retry logic, and cleanup strategies suitable for a shared cluster.
1: Create file pi-job-robust.yaml
apiVersion: batch/v1 # The API version for Batch workloads (Jobs/CronJobs)
kind: Job # The type of resource we are creating
metadata:
name: pi-calculator-robust
labels:
app: math-processing
owner: devsecops-team
spec:
# --- RETRY STRATEGY ---
# If the Pod fails, how many times should K8s try again?
# Default is 6. We set it to 4 to save resources if code is broken.
backoffLimit: 4
# --- CLEANUP STRATEGY (Cost Saving) ---
# Critical Feature: Automatically delete this Job (and its Pods)
# 60 seconds after it finishes successfully.
# This prevents thousands of "Completed" pods from clogging your cluster.
ttlSecondsAfterFinished: 60
# --- DEADLINE (Safety Valve) ---
# If the job takes longer than 5 minutes (300s), kill it.
  # This prevents a "zombie" job from getting stuck and running forever.
activeDeadlineSeconds: 300
template:
metadata:
name: pi-calculator
spec:
containers:
- name: pi
image: perl:5.34.0
command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
# --- RESOURCE LIMITS (Best Practice) ---
# Always set these so one job doesn't eat all cluster memory.
resources:
requests:
memory: "64Mi"
cpu: "250m"
limits:
memory: "128Mi"
cpu: "500m"
# --- RESTART POLICY ---
# 'OnFailure': If the pod crashes, restart the container on the same node.
# 'Never': If it crashes, create a totally NEW pod (good for debugging).
# Note: 'Always' is NOT allowed for Jobs.
restartPolicy: OnFailure
2: Run Command:
kubectl apply -f pi-job-robust.yaml
3: Verify Job Creation & Status
First, check if the Job has been accepted and is currently running.
kubectl get jobs -l owner=devsecops-team
- Expected Output: You should see pi-calculator-robust with COMPLETIONS as 0/1 (running) or 1/1 (finished).
4: View the Output (The Value of Pi)
kubectl logs job/pi-calculator-robust
- Expected Output: A long string of numbers starting with 3.14....
5: Verify the Pod Details
Check the Pod to see if the Resource Limits and Restart Policy were applied correctly.
# List pods associated with this specific job
kubectl get pods -l job-name=pi-calculator-robust
# Describe the pod to see Events and Resource Limits
kubectl describe pod -l job-name=pi-calculator-robust
6: What to look for:
- Under Limits, verify cpu: 500m and memory: 128Mi.
- Under Events, ensure there are no OOMKilled (Out of Memory) errors, which would mean our limits were too tight.
7: Test the “Cleanup Strategy” (TTL)
Your YAML included ttlSecondsAfterFinished: 60. This is a critical feature to test.
- Ensure the job shows COMPLETIONS: 1/1.
- Wait for 60 seconds.
- Run the get commands again:
kubectl get jobs pi-calculator-robust
kubectl get pods -l job-name=pi-calculator-robust
- Expected Result: Kubernetes returns Error from server (NotFound) for the Job (the Pod listing simply comes back empty), confirming that the Job and its Pods were automatically garbage collected to save cluster space.
8: Troubleshooting (If it fails)
If the job fails or gets stuck (perhaps due to activeDeadlineSeconds), look at the events:
kubectl describe job pi-calculator-robust
Look for: DeadlineExceeded (if it took > 300s) or BackoffLimitExceeded (if the code crashed more than 4 times).
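If you would like to see a DeadlineExceeded failure on purpose before it surprises you in production, a throwaway Job like the sketch below (hypothetical name deadline-demo) sleeps well past a very short activeDeadlineSeconds, so the controller kills it after roughly 10 seconds:
apiVersion: batch/v1
kind: Job
metadata:
  name: deadline-demo              # hypothetical test Job
spec:
  activeDeadlineSeconds: 10        # far shorter than the 60-second sleep below
  backoffLimit: 1
  template:
    spec:
      containers:
      - name: sleeper
        image: busybox
        command: ["sh", "-c", "sleep 60"]
      restartPolicy: Never
kubectl describe job deadline-demo then shows the Failed condition with reason DeadlineExceeded; clean up with kubectl delete job deadline-demo.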
Lab 2: The Production-Grade Nightly Backup CronJob
This version adds history limits, starting deadlines, and concurrency controls to ensure your backups are reliable and don’t crash the server.
1: Create file backup-cron-robust.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: nightly-backup-secure
spec:
# --- SCHEDULING ---
# Run at 00:00 (Midnight) every day.
# Syntax: Minute | Hour | Day of Month | Month | Day of Week
schedule: "0 0 * * *"
# --- TIMEZONE (New in K8s 1.27+) ---
# Optional: Ensures the job runs at midnight YOUR time, not UTC.
# timeZone: "Asia/Kolkata"
# --- MISSED SCHEDULE HANDLING ---
# If the cluster is down at midnight and comes up at 00:30,
# should it run the missed job?
# If the delay is > 200s, skip it. Prevents old jobs from piling up.
startingDeadlineSeconds: 200
# --- CONCURRENCY (Data Safety) ---
# 'Forbid': If the previous backup is still running, SKIP this new one.
# 'Allow': Run both (Dangerous for backups!).
# 'Replace': Kill the old one, start new.
concurrencyPolicy: Forbid
# --- HISTORY (Log Management) ---
# Keep the last 3 successful jobs so we can check logs if needed.
successfulJobsHistoryLimit: 3
# Keep only 1 failed job so we can debug, but don't clutter the list.
failedJobsHistoryLimit: 1
jobTemplate:
spec:
template:
spec:
# --- SECURITY CONTEXT (DevSecOps) ---
# Run as non-root user for security.
securityContext:
runAsUser: 1000
runAsGroup: 3000
fsGroup: 2000
containers:
- name: backup-tool
image: busybox
# Simulate a backup process
args:
- /bin/sh
- -c
- "echo 'Starting secure backup...'; sleep 10; echo 'Backup Complete'"
# --- RESOURCES ---
resources:
requests:
memory: "100Mi"
cpu: "100m"
limits:
memory: "200Mi"
cpu: "200m"
restartPolicy: OnFailure
2: Run command:
kubectl apply -f backup-cron-robust.yaml
3: Verify the CronJob is Active
First, confirm the scheduler has registered your CronJob.
kubectl get cronjob nightly-backup-secure
- Expected Output: You should see SCHEDULE: 0 0 * * * and SUSPEND: False. The LAST SCHEDULE column will likely be <none> since it hasn't run yet.
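The SUSPEND column in that output is also your pause button. If you ever need to stop the schedule temporarily, for example during a maintenance window, you can patch the suspend field with standard kubectl; a small sketch:
# Pause the schedule (no new Jobs are created while suspended)
kubectl patch cronjob nightly-backup-secure -p '{"spec":{"suspend":true}}'
# Resume it later
kubectl patch cronjob nightly-backup-secure -p '{"spec":{"suspend":false}}'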
4: Manually Trigger a Job (The “Test Run”)
Instead of waiting for midnight, we can manually create a Job from the CronJob template. This tests if the permissions, image, and commands work.
# Create a manual job named 'manual-test-1' from the CronJob
kubectl create job --from=cronjob/nightly-backup-secure manual-test-1
- Why do this? This validates your jobTemplate logic without changing the actual CronJob schedule.
5: Verify Execution & Logs
Now watch the manual job execute.
# Watch the pod status until it shows 'Completed'
kubectl get pods -w
# Once completed (or running), check the logs
kubectl logs -l job-name=manual-test-1
- Expected Output:
Starting secure backup...
(10 second pause)
Backup Complete
6: Verify History Limits
Your YAML has successfulJobsHistoryLimit: 3. Keep in mind that this limit only prunes Jobs the CronJob controller itself creates on schedule; manually triggered jobs like ours are not owned by the CronJob, so they are never auto-pruned (which is why step 8 cleans them up by hand). You can still trigger the job a few more times to see how the job list grows.
# Trigger multiple manual jobs quickly
kubectl create job --from=cronjob/nightly-backup-secure manual-test-2
kubectl create job --from=cronjob/nightly-backup-secure manual-test-3
kubectl create job --from=cronjob/nightly-backup-secure manual-test-4
kubectl create job --from=cronjob/nightly-backup-secure manual-test-5
# Check the list of jobs
kubectl get jobs
- Note: All of the manual-test jobs will still be listed here, because the history limits only garbage-collect Jobs created by the schedule itself. To watch successfulJobsHistoryLimit actually prune old runs, you can temporarily put the CronJob on a fast schedule, as sketched below.
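A sketch of that experiment, assuming you are comfortable briefly changing the live object and restoring it afterwards:
# Temporarily run every minute so the controller creates real scheduled Jobs
kubectl patch cronjob nightly-backup-secure -p '{"spec":{"schedule":"*/1 * * * *"}}'
# After 5-6 minutes, only the 3 most recent successful scheduled Jobs should remain
kubectl get jobs
# Restore the original nightly schedule
kubectl patch cronjob nightly-backup-secure -p '{"spec":{"schedule":"0 0 * * *"}}'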
7: Verify Security Context
Ensure the Pod is actually running as the non-root user (User ID 1000) as specified in your YAML.
# Check the UID of the running process inside the pod
kubectl exec -it job/manual-test-1 -- id
- Expected Output: uid=1000 gid=3000 groups=2000,3000
- If you see uid=0 (root), the securityContext was ignored or configured incorrectly.
- Note: kubectl exec only works while the Pod is still running, so run this within the ~10-second backup window (trigger a fresh manual job first if manual-test-1 has already completed).
8: Clean Up Manual Tests
Since manual jobs created via kubectl create job are not managed by the CronJob’s history limit, you should delete them manually to keep the cluster clean.
kubectl delete job manual-test-1 manual-test-2 manual-test-3 manual-test-4 manual-test-5
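If you created more manual jobs than you want to list by hand, a small shell pipeline works too (assuming all of your test jobs share the manual-test prefix):
# Delete every job whose name contains "manual-test"
kubectl get jobs -o name | grep manual-test | xargs kubectl delete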