
Guide to Upgrading AWS EKS with Terraform

Upgrading a production Kubernetes cluster can feel like performing open-heart surgery while the patient is running a marathon. However, by treating your infrastructure as code (IaC) with Terraform, you turn a terrifying operational risk into a predictable, repeatable process.

This guide covers how to safely perform a “Day-2” EKS upgrade using Terraform, managing everything from the Control Plane to worker nodes and critical add-ons.

Phase 0: Pre-Flight Checks

Before touching your Terraform code, you must ensure your workloads are ready for the new Kubernetes version.

  1. Check the EKS Release Notes: Always read the AWS EKS release notes for the version you are upgrading to. Note any changes in default behaviors or required IAM permissions.
  2. Scan for Deprecated APIs: Kubernetes regularly removes older API versions (e.g., moving v1beta1 to v1). If you have Helm charts or YAML manifests using removed APIs, they will fail after the upgrade.
    • Pro-Tip: Use open-source tools like Pluto or Kube-No-Trouble (kubent) to scan your cluster for deprecated APIs before upgrading.
  3. Check Controller Compatibility: Ensure your critical cluster controllers (AWS Load Balancer Controller, Cluster Autoscaler, Metrics Server) have a version that supports your target Kubernetes version.
  • Crucial Note: Cluster Autoscaler releases are tightly coupled to Kubernetes versions. If you upgrade EKS to 1.30, you must move to an Autoscaler chart that ships a 1.30-compatible autoscaler (the 9.37.x chart line).
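Before reaching for a scanner, you can also do a rough offline pass over your own manifests. The sketch below is illustrative only: the `check_deprecated_apis` name, the short API list, and the directory argument are assumptions, and kubent or Pluto will do this far more thoroughly against a live cluster.

```shell
# Rough offline check: grep a manifest directory for a few API versions
# removed in recent Kubernetes releases. Not exhaustive -- use kubent or
# Pluto for a real audit against the cluster.
check_deprecated_apis() {
  dir="${1:-.}"
  for api in policy/v1beta1 batch/v1beta1 autoscaling/v2beta1 autoscaling/v2beta2; do
    # -r: recurse, -l: print only the file names that still use the old API
    grep -rl "apiVersion: ${api}" "$dir" 2>/dev/null
  done
  return 0
}

# Example: check_deprecated_apis ./manifests
```
Any file this prints should be migrated to the current API version before the upgrade, not after.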

Phase 1: Update Your Terraform Variables

First, complete the initial setup (see "Kubernetes: Setting Up the Lab") and create an EKS cluster running the old Kubernetes version:

Bash
git clone https://github.com/Rajkumar-Aute/eks-cluster-with-terraform.git
cd eks-cluster-with-terraform/
terraform init
terraform plan -var-file=environment/learning.tfvars 
terraform apply -var-file=environment/learning.tfvars

EKS cluster creation may take 15 to 30 minutes.

Open your environment/learning.tfvars file. Comment out the old versions and uncomment the new ones:

HCL
# --- [ OLD VERSIONS ] ---
# cluster_version            = "1.29"
# alb_controller_version     = "1.7.1"
# cluster_autoscaler_version = "9.34.0"
# metrics_server_version     = "3.12.0"

# --- [ NEW VERSIONS ] ---
cluster_version            = "1.30"
alb_controller_version     = "1.8.1"
cluster_autoscaler_version = "9.37.0"
metrics_server_version     = "3.12.1"

Phase 2: Execute the Upgrade

Once your variables are updated, it is time to let Terraform orchestrate the upgrade.

1. Run the Plan

Always run a plan first to see exactly what Terraform intends to do.

Bash
terraform plan -var-file=environment/learning.tfvars

Look closely at the output. You should see Terraform planning to update:

  • The aws_eks_cluster version.
  • The aws_eks_node_group versions.
  • The helm_release versions for your controllers.
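To double-check this, you can render the plan as JSON and list exactly which resources will change. A minimal sketch, assuming `jq` is installed (the `upgrade.tfplan` file name is arbitrary):

```shell
# Save the plan, then list every resource Terraform intends to change.
# Anything beyond the cluster, node groups, and helm_releases deserves a
# second look -- a "delete,create" on a node group means replacement, not update.
terraform plan -var-file=environment/learning.tfvars -out=upgrade.tfplan
terraform show -json upgrade.tfplan \
  | jq -r '.resource_changes[]
           | select(.change.actions != ["no-op"])
           | "\(.change.actions | join(","))\t\(.address)"'
```
Saving the plan with -out also guarantees that the apply in the next step executes exactly what you reviewed.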

2. Apply the Changes

Execute the upgrade:

Bash
terraform apply -var-file=environment/learning.tfvars

How Terraform & AWS Handle the Order of Operations:

  1. Control Plane First: AWS EKS upgrades the highly available control plane. This takes about 10–15 minutes. Your applications remain online during this time, though you might experience brief connection drops if you are actively running kubectl commands.
  2. Managed Add-ons: Terraform upgrades AWS managed add-ons (like VPC CNI, CoreDNS, and kube-proxy) to match the new control plane version.
  3. Worker Nodes: AWS triggers a rolling update of your Managed Node Groups. It spins up a new node with the updated AMI, waits for it to join the cluster, cordons and drains an old node, and then terminates it.
  4. Helm Controllers: Finally, Terraform’s Helm provider updates the deployments for your Load Balancer Controller, Autoscaler, and Metrics Server.
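You can watch step 3 happen from the outside. A small sketch for tracking the node rotation, assuming kubectl points at the cluster (the `pending_nodes` helper and the v1.30 target are illustrative):

```shell
# Count how many nodes still run the previous kubelet version.
target="v1.30"
pending_nodes() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.status.nodeInfo.kubeletVersion}{"\n"}{end}' \
    2>/dev/null | grep -c -v "^${target}"
  return 0
}

# Example polling loop (run against a live cluster):
# until [ "$(pending_nodes)" -eq 0 ]; do sleep 30; done
```
When the count reaches zero, the rolling replacement of the Managed Node Groups is complete.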

Phase 3: Post-Upgrade Verification

Once Terraform completes successfully, do not just assume everything is perfect. Verify the cluster health manually.

1. Verify the Control Plane Version:

Bash
kubectl version
# Look for the Server Version matching your target (e.g., v1.30.x-eks)

2. Verify the Worker Nodes: Check that all nodes have rotated to the new version and are in a Ready state.

Bash
kubectl get nodes

3. Verify Critical System Pods: Ensure that CoreDNS, the VPC CNI, and your controllers are running and haven’t entered a CrashLoopBackOff state.

Bash
kubectl get pods -n kube-system
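To filter that output down to problems only, here is a small sketch (the `unhealthy_pods` helper is illustrative and relies on STATUS being the third column of kubectl's default output):

```shell
# Print only the kube-system pods that are not Running or Completed
# (e.g., CrashLoopBackOff, ImagePullBackOff, Pending).
unhealthy_pods() {
  kubectl get pods -n kube-system --no-headers 2>/dev/null \
    | awk '$3 != "Running" && $3 != "Completed"'
  return 0
}

# Example: unhealthy_pods   (empty output means kube-system looks healthy)
```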

4. Check Application Health: Finally, check your ingress controllers, services, and application pods to ensure traffic is flowing normally.


Best Practices

  • No Skipping Versions: You cannot skip minor versions in EKS. If you are on 1.28 and want to reach 1.30, you must upgrade to 1.29 first, let it stabilize, and then upgrade to 1.30.
  • Pod Disruption Budgets (PDBs): If your applications have overly strict PDBs (e.g., requiring 100% of pods to be available at all times), the node rolling upgrade will get stuck, because AWS cannot drain the old nodes without violating the budget. Ensure your PDBs allow at least 1 unavailable pod during rollouts.
  • Spot Instances: Upgrading Spot node groups can sometimes be faster because AWS simply terminates the spot instances and replaces them, rather than doing a graceful drain. Be prepared for application churn if your spot nodes are heavily loaded.
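The no-skipping rule above can be sketched as a tiny helper that prints the required upgrade path (the `upgrade_path` function is purely illustrative):

```shell
# Print each minor version you must pass through, since EKS control-plane
# upgrades move one minor version at a time.
upgrade_path() {
  minor="${1#1.}"; to_minor="${2#1.}"
  while [ "$minor" -lt "$to_minor" ]; do
    minor=$((minor + 1))
    echo "1.${minor}"
  done
}

# Example: upgrade_path 1.28 1.30   # prints 1.29, then 1.30
```
Run the full Phase 1-3 cycle, including stabilization and verification, once per printed version.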
