
EKS Cluster Upgrades & Reliability

The Zero-Downtime EKS Kubernetes Upgrade

To achieve a zero-downtime Amazon EKS version upgrade, you must follow a structured approach emphasizing thorough preparation, a strict upgrade sequence, and rigorous post-upgrade validation. The key to avoiding disruptions is leveraging Kubernetes’ native rolling update capabilities alongside robust application configuration.

Phase 1: Preparation & Planning (The “Safety First” Phase)

The most common causes of downtime are not the upgrade itself, but deprecated APIs, resource exhaustion, or misconfigured add-ons.

  • Review Release Notes and Scan for Deprecated APIs: Check the Kubernetes release notes for the target version, then use tools such as pluto or kubent (Kube No Trouble) to identify and remediate workloads that rely on APIs deprecated or removed in that version. This is the single most common source of post-upgrade breakage.
  • Audit Pod Disruption Budgets (PDBs): Ensure every critical service has a PDB (e.g., minAvailable: 2). This is the “secret sauce” for zero downtime, preventing the node drain process from terminating all replicas of a service simultaneously.
  • Configure Application Resiliency: Verify your applications run with multiple replicas distributed evenly across different Availability Zones.
  • Verify Subnet Capacity: The control plane upgrade requires free IP addresses in your cluster subnets (AWS documents needing up to five) to provision new elastic network interfaces (ENIs) during the transition.
  • Check Add-on Compatibility: Consult the EKS Add-on Compatibility Matrix to ensure your target Kubernetes version supports your current VPC CNI, CoreDNS, and kube-proxy versions.
  • Update CLI Tools: Upgrade kubectl, aws, and eksctl on your local machine or CI/CD runner to versions compatible with the new Kubernetes release.
  • Test in Staging: Always perform the full upgrade process in a staging environment that mirrors your production setup.
  • Backup: Back up your cluster configuration and critical state data as a final precaution.
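Several of the checks above lend themselves to a small pre-flight script. A minimal sketch, assuming kubectl and pluto are installed and your kubeconfig points at the cluster; TARGET_VERSION is a placeholder you must set, and pdb_allows_eviction is an illustrative helper, not a kubectl feature:

```shell
#!/usr/bin/env bash
# Pre-upgrade sanity checks (sketch). Assumes kubectl and pluto are on PATH
# and your kubeconfig points at the cluster to be upgraded.
set -euo pipefail

TARGET_VERSION="${TARGET_VERSION:-}"   # e.g. v1.32.0 -- set before running

# Pure helper: does a PDB's minAvailable leave at least one pod evictable,
# so a node drain can actually make progress?
# Usage: pdb_allows_eviction <replicas> <minAvailable>
pdb_allows_eviction() {
  local replicas=$1 min_available=$2
  [ "$((replicas - min_available))" -ge 1 ]
}

if [ -n "$TARGET_VERSION" ]; then
  # Flag in-cluster resources using APIs removed in the target version.
  pluto detect-all-in-cluster --target-versions "k8s=${TARGET_VERSION}" || true

  # Eyeball the PDB coverage: every critical service should appear here.
  kubectl get pdb --all-namespaces
fi
```

A PDB with `minAvailable` equal to the replica count deadlocks a drain, which is why the helper insists on at least one evictable pod.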

Phase 2: Execution (The Three-Step Rollout)

You must execute the upgrade in this specific, sequential order to maintain cluster stability and avoid routing failures.

Step 1: Upgrade the EKS Control Plane

The EKS control plane is highly available and managed by AWS. While the API server may briefly pause during the version flip, your running workloads remain untouched.

Initiate the upgrade via Terraform, the AWS Console, or the AWS CLI:

Bash
aws eks update-cluster-version \
  --name <cluster_name> \
  --kubernetes-version <target_version>

Wait until the cluster status returns to ACTIVE before proceeding to the next step.
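Rather than watching the console, the wait can be scripted against `aws eks describe-cluster`. A sketch; CLUSTER_NAME is a placeholder you must set:

```shell
#!/usr/bin/env bash
# Poll until the control plane returns to ACTIVE after the version update.
# CLUSTER_NAME is a placeholder -- set it before running.
set -euo pipefail

cluster_status() {
  aws eks describe-cluster --name "$1" \
    --query 'cluster.status' --output text
}

wait_for_active() {
  local name=$1 status
  while true; do
    status=$(cluster_status "$name")
    echo "cluster ${name}: ${status}"
    [ "$status" = "ACTIVE" ] && break
    sleep 30
  done
}

if [ -n "${CLUSTER_NAME:-}" ]; then
  wait_for_active "$CLUSTER_NAME"
fi
```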

Step 2: Upgrade Worker Nodes (The Data Plane)

This step carries the highest risk of downtime if pods are not handled gracefully. Your strategy depends on how you manage your nodes:

  • For Managed Node Groups: AWS EKS supports a rolling update feature. It automatically provisions new instances with the updated AMI, safely cordons and drains the old nodes, and terminates them only after pods have migrated.
Bash
aws eks update-nodegroup-version \
  --cluster-name <cluster_name> \
  --nodegroup-name <node_group_name> \
  --kubernetes-version <target_version>
  • For Karpenter: If Drift Detection is enabled, Karpenter will automatically detect the version mismatch once the control plane is upgraded and begin a graceful rolling replacement of your nodes.
  • For Self-Managed Nodes (Manual Rolling Update):
    1. Create a new node group running the target Kubernetes version.
    2. Cordon old nodes to prevent new pod scheduling: kubectl cordon <node_name>
    3. Drain old nodes to gracefully evict existing pods: kubectl drain <node_name> --ignore-daemonsets --delete-emptydir-data
    4. Terminate the old node group only after verifying all pods are running safely on the new nodes.
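The cordon and drain steps above can be looped over every node in the outgoing group. A sketch, assuming the old nodes carry a selectable label; NODE_LABEL is a placeholder, and the exact label depends on how the group was created:

```shell
#!/usr/bin/env bash
# Cordon, then drain, every node in the outgoing group (sketch).
# NODE_LABEL is an assumption -- adjust to however your old nodes are labeled.
set -euo pipefail

NODE_LABEL="${NODE_LABEL:-}"   # e.g. eks.amazonaws.com/nodegroup=old-group

drain_node() {
  local node=$1
  kubectl cordon "$node"
  # --ignore-daemonsets: DaemonSet pods are recreated on the new nodes anyway.
  # --delete-emptydir-data: acknowledges that emptyDir contents are ephemeral.
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s
}

if [ -n "$NODE_LABEL" ]; then
  for node in $(kubectl get nodes -l "$NODE_LABEL" -o name); do
    drain_node "${node#node/}"   # "-o name" prefixes each entry with "node/"
  done
fi
```

Draining one node at a time, combined with the PDBs from Phase 1, is what keeps at least the configured minimum of replicas serving traffic throughout.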

Step 3: Update EKS Add-ons

Immediately after the control plane and nodes are updated, upgrade the “Big Three” add-ons to ensure networking and service discovery remain stable.

  • VPC CNI: Essential for pod networking and IP allocation.
  • CoreDNS: Crucial for internal service discovery.
  • Kube-Proxy: Manages network routing rules on individual nodes.

Update via the AWS CLI:

Bash
aws eks update-addon \
  --cluster-name <cluster_name> \
  --addon-name <addon_name> \
  --addon-version <new_version>
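The three updates can be batched by first resolving each add-on's default version for the new cluster version via `aws eks describe-addon-versions`. A sketch; the JMESPath query and the CLUSTER_NAME/K8S_VERSION placeholders are assumptions to adapt to your setup:

```shell
#!/usr/bin/env bash
# Update the three core add-ons in one pass (sketch).
# CLUSTER_NAME and K8S_VERSION are placeholders -- set them before running.
set -euo pipefail

CLUSTER_NAME="${CLUSTER_NAME:-}"
K8S_VERSION="${K8S_VERSION:-}"    # e.g. 1.32

# Resolve the default (recommended) add-on version for the target k8s version.
default_addon_version() {
  aws eks describe-addon-versions \
    --addon-name "$1" --kubernetes-version "$2" \
    --query 'addons[0].addonVersions[?compatibilities[0].defaultVersion==`true`].addonVersion | [0]' \
    --output text
}

if [ -n "$CLUSTER_NAME" ] && [ -n "$K8S_VERSION" ]; then
  for addon in vpc-cni coredns kube-proxy; do
    version=$(default_addon_version "$addon" "$K8S_VERSION")
    echo "updating ${addon} -> ${version}"
    aws eks update-addon \
      --cluster-name "$CLUSTER_NAME" \
      --addon-name "$addon" \
      --addon-version "$version" \
      --resolve-conflicts PRESERVE   # keep any custom add-on configuration
  done
fi
```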

Phase 3: Validation and Monitoring

Once the infrastructure is upgraded, immediately verify the health of the environment.

  • Verify Node Status: Run kubectl get nodes to confirm all nodes report a Ready status and reflect the new Kubernetes version.
  • Monitor Application Health: Check critical workloads (kubectl get pods -A) to ensure pods are running correctly without crash loops.
  • Test Functionality: Conduct functional and performance testing to guarantee ingress controllers, load balancers, and application traffic are routing as expected.
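The first two checks can be wrapped in a short script. A sketch; the node_on_target helper is an illustrative convenience, not a kubectl feature, and TARGET_VERSION is a placeholder:

```shell
#!/usr/bin/env bash
# Post-upgrade validation (sketch): nodes Ready on the new version,
# and no pods stuck outside Running/Succeeded.
set -euo pipefail

TARGET_VERSION="${TARGET_VERSION:-}"   # e.g. v1.32 -- set before running

# Pure helper: does a kubelet version string match the target minor release?
# Usage: node_on_target v1.32.3-eks-abc123 v1.32
node_on_target() {
  case "$1" in
    "$2".*|"$2"-*) return 0 ;;
    *) return 1 ;;
  esac
}

if [ -n "$TARGET_VERSION" ]; then
  kubectl get nodes -o wide   # expect STATUS=Ready and the new VERSION column
  kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
fi
```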

EKS Upgrade Strategies: Choosing Your Path

The decision between In-Place and Blue/Green usually comes down to your risk tolerance and how many versions you are “skipping.”

| Feature | In-Place Upgrade | Blue/Green Migration |
| --- | --- | --- |
| Risk Level | Medium (one-way control plane) | Low (instant rollback available) |
| Cost | Baseline (standard usage) | Double (running two clusters during transition) |
| Complexity | Low (AWS handles the “Brain”) | High (DNS, OIDC, and data migration) |
| Best For | Single-version bumps (e.g., 1.31 → 1.32) | Multi-version jumps or critical apps |
| Rollback | Complex (manual node rollback only) | Simple (point DNS back to “Blue”) |

The “Version Skew” Hack

Since Kubernetes 1.28, EKS supports a 3-minor-version skew between the control plane and worker nodes. This means if you upgrade your control plane to 1.32, your worker nodes can safely stay on 1.29 for a short period while you test, giving you more breathing room than in previous years.
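The skew rule can be checked mechanically before you touch the nodes. A pure-shell sketch with no cluster access required; minor_of and skew_ok are illustrative helpers:

```shell
#!/usr/bin/env bash
# Check that worker-node minor versions stay within the supported skew
# (up to 3 minors behind the control plane since Kubernetes 1.28).
set -euo pipefail

minor_of() {               # v1.32.1-eks-xyz -> 32
  local v=${1#v}           # strip the leading "v"
  v=${v#*.}                # drop the major version ("1.")
  echo "${v%%[.-]*}"       # keep digits up to the next "." or "-"
}

skew_ok() {                # skew_ok <control_plane_version> <node_version>
  local cp node
  cp=$(minor_of "$1"); node=$(minor_of "$2")
  [ "$((cp - node))" -le 3 ] && [ "$((cp - node))" -ge 0 ]
}

skew_ok v1.32.0 v1.29.8 && echo "within supported skew"
```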

EKS Cluster Insights: Your Pre-Flight Check

Before you touch the “Upgrade” button, you must clear any ERROR flags in EKS Upgrade Insights.

What to Look For in 1.32+:

  • Flow Control APIs: The v1beta3 versions of FlowSchema and PriorityLevelConfiguration are removed in Kubernetes 1.32. You must migrate these resources to flowcontrol.apiserver.k8s.io/v1.
  • AL2 Deprecation: If your nodes use Amazon Linux 2 (AL2), Cluster Insights will warn you. EKS is phasing out AL2 in favor of AL2023 or Bottlerocket.
  • Add-on Pinning: If you have pinned your vpc-cni or coredns to a specific old version, Cluster Insights will mark this as an Error. You must update the add-on version to one compatible with 1.32 first.

Tip: AWS recently added an On-Demand Refresh for Insights. You no longer have to wait 24 hours to see if your fix worked; you can trigger a scan immediately via the CLI or Console.
