Guide to Upgrading AWS EKS with Terraform
Upgrading a production Kubernetes cluster can feel like performing open-heart surgery while the patient is running a marathon. However, by treating your infrastructure as code (IaC) with Terraform, you turn a terrifying operational risk into a predictable, repeatable process.
This guide covers how to safely perform a “Day-2” EKS upgrade using Terraform, managing everything from the Control Plane to worker nodes and critical add-ons.
Phase 0: Pre-Flight Checks
Before touching your Terraform code, you must ensure your workloads are ready for the new Kubernetes version.
- Check the EKS Release Notes: Always read the AWS EKS release notes for the version you are upgrading to. Note any changes in default behaviors or required IAM permissions.
- Scan for Deprecated APIs: Kubernetes regularly removes older API versions (e.g., moving `v1beta1` to `v1`). If you have Helm charts or YAML manifests using removed APIs, they will fail after the upgrade.
  - Pro-Tip: Use open-source tools like Pluto or Kube-No-Trouble (kubent) to scan your cluster for deprecated APIs before upgrading.
- Check Controller Compatibility: Ensure your critical cluster controllers (AWS Load Balancer Controller, Cluster Autoscaler, Metrics Server) have a version that supports your target Kubernetes version.
- Crucial Note: Cluster Autoscaler versions are tightly coupled to K8s versions. If you upgrade EKS to `1.30`, your Autoscaler chart must be upgraded to `9.37.x` (the chart series that corresponds to Kubernetes 1.30).
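To make the pre-flight scan concrete, here is a minimal sketch that runs kubent and Pluto against the live cluster, assuming both tools are installed and your kubeconfig points at the cluster to be upgraded. The target version value is an example.

```shell
# Pre-flight sketch: scan for APIs removed in the target Kubernetes version.
# Assumes kubent and pluto are installed and kubeconfig targets the cluster.
set -euo pipefail

TARGET="1.30"   # the Kubernetes version you are upgrading to (example value)

if command -v kubent >/dev/null 2>&1; then
  # kubent inspects live cluster objects for deprecated/removed APIs
  kubent --target-version "${TARGET}"
else
  echo "kubent not found; see https://github.com/doitintl/kube-no-trouble"
fi

if command -v pluto >/dev/null 2>&1; then
  # pluto can scan the manifests stored in deployed Helm releases
  pluto detect-helm --target-versions "k8s=v${TARGET}"
else
  echo "pluto not found; see https://github.com/FairwindsOps/pluto"
fi
```

Run this before every minor-version upgrade; a clean report here means the API-removal class of failures is off the table.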
Phase 1: Update Your Terraform Variables
Setting Up the Lab
Before following the upgrade steps, first create an EKS cluster on the old Kubernetes version:
git clone https://github.com/Rajkumar-Aute/eks-cluster-with-terraform.git
cd eks-cluster-with-terraform/
terraform init
terraform plan -var-file=environment/learning.tfvars
terraform apply -var-file=environment/learning.tfvars
EKS cluster creation may take 15 to 30 minutes.
Open your learning.tfvars file. Comment out the old versions and uncomment the new versions:
# --- [ OLD VERSIONS ] ---
# cluster_version = "1.29"
# alb_controller_version = "1.7.1"
# cluster_autoscaler_version = "9.34.0"
# metrics_server_version = "3.12.0"
# --- [ NEW VERSIONS ] ---
cluster_version = "1.30"
alb_controller_version = "1.8.1"
cluster_autoscaler_version = "9.37.0"
metrics_server_version = "3.12.1"
Phase 2: Execute the Upgrade
Once your variables are updated, it is time to let Terraform orchestrate the upgrade.
1. Run the Plan
Always run a plan first to see exactly what Terraform intends to do.
terraform plan -var-file=environment/learning.tfvars
Look closely at the output. You should see Terraform planning to update:
- The `aws_eks_cluster` version.
- The `aws_eks_node_group` versions.
- The `helm_release` versions for your controllers.
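A version bump should produce only in-place updates. As an extra guard, you can grep the plan output for anything Terraform intends to destroy or replace; the `plan_has_destroys` helper below is a hypothetical name, and the grep patterns are the standard phrases Terraform prints in plan output.

```shell
# Sketch: fail fast if the plan would destroy or replace anything.
# An EKS version bump should be pure in-place updates; any destroy or
# replacement in the plan deserves a closer look before applying.
set -euo pipefail

plan_has_destroys() {
  # Reads `terraform plan` output on stdin.
  grep -Eq 'must be replaced|will be destroyed'
}

# Real usage:
#   terraform plan -var-file=environment/learning.tfvars | tee plan.txt
#   if plan_has_destroys < plan.txt; then echo "Review before applying!"; fi
# Sample plan line that should trip the check:
if echo '  # aws_eks_node_group.workers must be replaced' | plan_has_destroys; then
  echo "Review before applying!"
fi
```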
2. Apply the Changes
Execute the upgrade:
terraform apply -var-file=environment/learning.tfvars
How Terraform & AWS Handle the Order of Operations:
- Control Plane First: AWS EKS upgrades the highly available control plane. This takes about 10–15 minutes. Your applications remain online during this time, though you might experience brief connection drops if you are actively running `kubectl` commands.
- Managed Add-ons: Terraform upgrades AWS managed add-ons (like VPC CNI, CoreDNS, and `kube-proxy`) to match the new control plane version.
- Worker Nodes: AWS triggers a rolling update of your Managed Node Groups. It spins up a new node with the updated AMI, waits for it to join the cluster, cordons and drains an old node, and then terminates it.
- Helm Controllers: Finally, Terraform’s Helm provider updates the deployments for your Load Balancer Controller, Autoscaler, and Metrics Server.
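While the apply runs, it helps to watch the node rotation from a second terminal. A small sketch, assuming the standard `kubectl get nodes` column layout (VERSION is the fifth column); the `count_by_version` helper and the sample data are illustrative, not part of any tool:

```shell
# Sketch: tally worker nodes by kubelet version so you can watch the
# old-AMI nodes drain away during the rolling update.
set -euo pipefail

count_by_version() {
  # Reads `kubectl get nodes` output on stdin; VERSION is the 5th column.
  awk 'NR > 1 { counts[$5]++ } END { for (v in counts) print v, counts[v] }'
}

# During a real upgrade you would pipe live output:
#   kubectl get nodes | count_by_version
# Sample input mid-rotation (two old nodes left, one new node joined):
count_by_version <<'EOF'
NAME                         STATUS   ROLES    AGE   VERSION
ip-10-0-1-10.ec2.internal    Ready    <none>   40d   v1.29.3-eks-ae9a62a
ip-10-0-2-11.ec2.internal    Ready    <none>   40d   v1.29.3-eks-ae9a62a
ip-10-0-3-12.ec2.internal    Ready    <none>   5m    v1.30.0-eks-036c24b
EOF
```

When the old version's count reaches zero, the node group rotation is done.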
Phase 3: Post-Upgrade Verification
Once Terraform completes successfully, do not just assume everything is perfect. Verify the cluster health manually.
1. Verify the Control Plane Version:
kubectl version   # note: the old --short flag was removed in kubectl 1.28+
# Look for the Server Version matching your target (e.g., v1.30.x-eks)
2. Verify the Worker Nodes: Check that all nodes have rotated to the new version and are in a Ready state.
kubectl get nodes
3. Verify Critical System Pods: Ensure that CoreDNS, the VPC CNI, and your controllers are running and haven’t entered a CrashLoopBackOff state.
kubectl get pods -n kube-system
4. Check Application Health: Finally, check your ingress controllers, services, and application pods to ensure traffic is flowing normally.
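Step 3 can be scripted so unhealthy system pods stand out instead of being eyeballed. A minimal sketch, assuming the standard `kubectl get pods` column layout (STATUS is the third column); the `unhealthy_pods` helper name and sample data are illustrative:

```shell
# Sketch: flag any kube-system pod that is not Running/Completed after the
# upgrade, e.g. a CoreDNS replica stuck in CrashLoopBackOff.
set -euo pipefail

unhealthy_pods() {
  # Reads `kubectl get pods -n kube-system` output on stdin; STATUS is column 3.
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Real usage: kubectl get pods -n kube-system | unhealthy_pods
# Sample with one broken CoreDNS replica:
unhealthy_pods <<'EOF'
NAME                       READY   STATUS             RESTARTS   AGE
aws-node-7xk2p             2/2     Running            0          10m
coredns-5d78c9869d-abcde   0/1     CrashLoopBackOff   4          10m
kube-proxy-9zqwl           1/1     Running            0          10m
EOF
# Prints: coredns-5d78c9869d-abcde CrashLoopBackOff
```

Empty output means every kube-system pod is healthy.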
Best Practices
- No Skipping Versions: You cannot skip minor versions in EKS. If you are on `1.28` and want to reach `1.30`, you must upgrade to `1.29` first, let it stabilize, and then upgrade to `1.30`.
- Pod Disruption Budgets (PDBs): If your applications have overly strict PDBs (e.g., requiring 100% of pods to be available at all times), the node rolling upgrade will get stuck because AWS cannot legally drain the old nodes. Ensure your PDBs allow for at least 1 unavailable pod during rollouts.
- Spot Instances: Upgrading Spot node groups can sometimes be faster because AWS simply terminates the spot instances and replaces them, rather than doing a graceful drain. Be prepared for application churn if your spot nodes are heavily loaded.
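The PDB point above is worth illustrating. A minimal sketch of a PDB that keeps node drains unblocked: at most one pod of a hypothetical `my-app` Deployment may be unavailable at a time, so evictions proceed one pod per node drain.

```shell
# Sketch: write a PodDisruptionBudget that tolerates rolling node drains.
# `my-app` is a placeholder label for your own Deployment's pods.
cat <<'EOF' > my-app-pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app
EOF
# Apply with: kubectl apply -f my-app-pdb.yaml
```

Prefer `maxUnavailable` over `minAvailable: 100%` style budgets; the latter makes every drain illegal and stalls the upgrade.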