EKS Disaster Recovery
Disaster Recovery: The Velero “Time Machine”
For EKS, a “backup” isn’t just a copy; it’s a recovery plan. Velero is the gold standard because it handles both the metadata (Kubernetes YAML objects) and the persistent data (EBS volumes).
The Cross-Region “Golden Rule”
To survive a full region outage (e.g., us-east-1 going dark), follow this architecture:
- Backup Location: Set your Velero `BackupStorageLocation` to an S3 bucket in a different region.
- Read-Only Mode: In your DR region, install Velero and point it to that same S3 bucket, but set the access mode to `ReadOnly`. This prevents the DR cluster from accidentally deleting your production backups.
- Storage Class Mapping: Since EBS volumes are region-specific, you must use a Velero ConfigMap to map your `gp3` storage class in Region A to the `gp3` class in Region B during restore.
Critical 2026 Checklist:
- Audit logs: Ensure they are enabled. Cluster Insights relies on them.
- Velero Plugins: Use the latest `velero-plugin-for-aws` to ensure compatibility with CSI snapshots, which are now the default for EBS.
- EKS Pod Identity: If you use the new Pod Identity (instead of IRSA), ensure your Velero IAM roles are mapped correctly in the new region.
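As a reference point, a Velero installation that enables the CSI snapshot path and relies on a pre-configured identity (Pod Identity or IRSA) might look like the sketch below. The bucket name, region, and plugin version are placeholders; check the Velero/plugin compatibility matrix for your release.

```shell
# Sketch only: bucket, region, and plugin version are placeholders.
# --no-secret assumes credentials come from Pod Identity or IRSA,
# so no static AWS keys are mounted into the Velero pod.
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.10.0 \
  --bucket my-eks-backups \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --features=EnableCSI \
  --service-account-name velero \
  --no-secret
```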
Designing a multi-region disaster recovery (DR) strategy on AWS requires balancing two critical metrics: RPO (Recovery Point Objective), the maximum acceptable data loss, and RTO (Recovery Time Objective), the maximum acceptable downtime.
1. Core Disaster Recovery Patterns
AWS categorizes DR into four distinct patterns, ranging from low-cost/high-downtime to high-cost/zero-downtime.
| Pattern | RPO | RTO | Cost | Strategy |
| --- | --- | --- | --- | --- |
| Backup & Restore | Hours | 24h+ | $ | Periodic backups to S3; restore on failure. |
| Pilot Light | Minutes | Minutes/Hours | $$ | Live data replication; core “quiet” infra ready. |
| Warm Standby | Seconds | Minutes | $$$ | Scaled-down, functional version always running. |
| Active-Active | Near Zero | Near Zero | $$$$ | Traffic served from both regions simultaneously. |
1. Backup & Restore (EKS)
In this pattern, you maintain the infrastructure as code (IaC) but do not run an EKS cluster in the DR region until a disaster occurs.
- Cluster State & Workloads: Use a backup tool like Velero to snapshot your Kubernetes objects (Deployments, Services, ConfigMaps, Secrets). Velero acts as a workload controller that zips up the API server state and ships it to an Amazon S3 bucket.
- Persistent Storage: Velero interacts with the EBS Container Storage Interface (CSI) driver to take volume snapshots of your Persistent Volumes (PVs) and moves them to S3. Enable S3 Cross-Region Replication (CRR) to ensure the backups reach the DR region.
- External Data: Rely on automated, scheduled snapshots for Amazon RDS or Amazon EFS, copied to the secondary region.
- Failover Process: Upon disaster, trigger your IaC (Terraform/eksctl) to build the EKS control plane and worker nodes from scratch. Once the cluster is up, point Velero to the replicated S3 bucket and initiate a restore of the manifests and storage.
- Security Posture: Ensure your IAM Roles for Service Accounts (IRSA) map correctly in the DR region, as AWS account IDs or KMS keys might differ.
- Lab eks-cluster-DR-Setup-Backup-Restore-Patterns-terraform
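To make the backup side of this pattern hands-off, Velero supports scheduled backups. A minimal sketch; the schedule name, cron expression, and retention are illustrative:

```shell
# Nightly backup at 02:00 UTC, retained for 30 days (720h),
# including cluster-scoped resources such as CRDs and ClusterRoles.
velero schedule create nightly-dr \
  --schedule "0 2 * * *" \
  --ttl 720h \
  --include-cluster-resources=true
```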
2. Pilot Light (EKS)
The “Pilot Light” for EKS means the control plane is ready and waiting, but the compute resources (worker nodes) are scaled to zero to save costs.
- Cluster State & Workloads: Instead of relying solely on Velero, transition to a GitOps model (using Argo CD or Flux). Git becomes the single source of truth for your cluster’s desired state. The GitOps controller runs in the DR cluster but has nothing to schedule yet.
- Compute: Provision an EKS cluster in the DR region, but configure the Managed Node Groups with an Auto Scaling Group (ASG) where `min-size=0` and `desired-size=0`.
- External Data: Maintain live, asynchronous data replication using Amazon RDS Read Replicas in the DR region.
- Failover Process: Update the ASG to scale up the worker nodes. As nodes register, the GitOps controller automatically pulls your workload manifests and schedules the pods. Promote the RDS Read Replica to a standalone primary database. Update Route 53 DNS records to point to the newly spun-up Ingress controller.
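The compute and database steps of this failover can be sketched as two CLI calls; the cluster, node group, replica identifier, and region are placeholders:

```shell
# 1. Scale the dormant node group up from zero; the GitOps
#    controller will schedule workloads as nodes register.
eksctl scale nodegroup \
  --cluster dr-cluster \
  --name dr-nodegroup \
  --nodes 3 --nodes-min 3 --nodes-max 6

# 2. Promote the DR read replica to a standalone, writable primary.
aws rds promote-read-replica \
  --db-instance-identifier app-db-replica \
  --region us-west-2
```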
3. Warm Standby (EKS)
This pattern maintains a fully functional, but scaled-down EKS environment in the DR region.
- Cluster State & Workloads: GitOps ensures that both the primary and DR EKS clusters have identical configurations. However, the workload controllers (Deployments/StatefulSets) in the DR region are configured with a minimal replica count (e.g., `replicas: 1`).
- Compute: A small set of worker nodes is constantly running. Essential Kubernetes networking components like CoreDNS, the VPC CNI, and your Ingress controllers are active and healthy.
- External Data: Similar to Pilot Light, database read replicas are running and synchronizing continuously.
- Failover Process: Failover is extremely fast. You update the Horizontal Pod Autoscaler (HPA) or manually scale the replica counts in Git. Auto Scaling groups quickly add more worker nodes to handle the incoming pods. Amazon Route 53 uses a Failover Routing Policy (backed by health checks) to automatically shift traffic to the DR region’s Ingress endpoint, within the limits of the health-check interval and DNS TTL.
- DevSecOps Advantage: Because the cluster is always running, security auditing tools, image vulnerability scanners, and network policies are continuously enforced and validated in the DR environment, preventing configuration drift.
- Lab eks-cluster-DR-Setup-Warm-Standby-Patterns-terraform
4. Active-Active (Multi-Region EKS)
This is the most complex and expensive setup, providing near-zero downtime. Two identical EKS clusters in two different AWS Regions serve production traffic simultaneously.
- Cluster State & Workloads: GitOps is mandatory here to ensure deployment parity. Any code merged to the main branch is deployed to both EKS clusters concurrently.
- Compute: Fully scaled EKS clusters run in both regions, independently autoscaling based on their respective localized traffic demands.
- Networking: Use AWS Global Accelerator or Amazon Route 53 with Latency-based or Weighted routing. Global Accelerator provides two static Anycast IP addresses that act as a fixed entry point to your application, routing user traffic to the closest healthy EKS Ingress.
- Persistent Storage (The Hardest Part): Kubernetes Persistent Volumes (like EBS) are strictly zonal/regional. You cannot share an EBS volume across regions. Therefore, stateful workloads must be re-architected to externalize their state. You must rely on global datastores like Amazon DynamoDB Global Tables or Amazon Aurora Global Database, which handle active-active, multi-region synchronous/asynchronous writes under the hood.
- Lab eks-cluster-DR-Setup-Active-Active-Patterns-terraform
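For the data tier of an Active-Active design, DynamoDB Global Tables can be enabled on an existing table by adding a replica region. A sketch under the assumption that the table already exists and uses the current Global Tables version; the table name and regions are placeholders:

```shell
# Add a replica of the "app-state" table in the DR region.
# DynamoDB then handles multi-active replication and conflict
# resolution between us-east-1 and us-west-2 automatically.
aws dynamodb update-table \
  --table-name app-state \
  --region us-east-1 \
  --replica-updates '[{"Create": {"RegionName": "us-west-2"}}]'
```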
2. Recommended Services for Minimal Downtime
To achieve minimal downtime (RTO in minutes/seconds), you should focus on Warm Standby or Active-Active patterns using these specific services:
Global Traffic Management
- Amazon Route 53 Application Recovery Controller (ARC): Provides “Routing Controls” that act as on/off switches for entire regions. It ensures you can failover without relying on the AWS Management Console during a regional event.
- AWS Global Accelerator: Uses Anycast IPs to route traffic over the AWS private backbone. It can failover between regional endpoints in seconds, bypassing DNS caching delays.
Database & Data Tier
- Amazon Aurora Global Database: Replicates data across regions with latency typically under 1 second. In a disaster, you can promote a secondary region to full read/write status in less than 1 minute.
- Amazon DynamoDB Global Tables: A multi-active NoSQL solution that allows local reads/writes in every region with automatic conflict resolution. This is the gold standard for Active-Active architectures.
- Amazon S3 Cross-Region Replication (CRR): Ensures your objects and static assets are asynchronously synced to a bucket in the DR region.
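Promoting a secondary Aurora region can be scripted rather than clicked through. A sketch only: the identifiers are placeholders, the target must be referenced by its cluster ARN, and the exact semantics of planned switchover versus unplanned failover should be checked against the Aurora Global Database documentation:

```shell
# Fail over the global cluster so the DR region becomes the
# read/write primary (replace identifiers with your own).
aws rds failover-global-cluster \
  --global-cluster-identifier app-global \
  --target-db-cluster-identifier arn:aws:rds:us-west-2:123456789012:cluster:app-db-dr
```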
Compute & Orchestration
- AWS CloudFormation / Terraform: You must use Infrastructure as Code (IaC) to ensure the DR region’s environment is an exact, version-controlled replica of the primary.
- AWS Backup: Provides a centralized way to automate and copy snapshots across regions for services like EBS, RDS, and EFS.
3. Implementation Steps for a “Warm Standby” Strategy
This is the most common “high-tier” DR choice because it balances cost with rapid recovery.
- Replicate Data: Set up Aurora Global Database or RDS Read Replicas in the secondary region.
- Deploy Core Infra: Deploy your VPC, Load Balancers, and at least one small instance of your application (e.g., a single t3.medium) in the secondary region.
- Health Checks: Configure Route 53 Health Checks to monitor your primary endpoint.
- Automation: Create an AWS Lambda or use Route 53 ARC to trigger a “Scale Up” event (via Auto Scaling Groups) in the secondary region when a failover is detected.
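The scale-up action in the last step can be as simple as a single Auto Scaling call, whether issued by a Lambda function or a runbook; the group name, sizes, and region below are placeholders:

```shell
# Grow the standby region's ASG from its warm-standby footprint
# to production capacity when failover is detected.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name dr-app-asg \
  --min-size 3 \
  --desired-capacity 3 \
  --max-size 10 \
  --region us-west-2
```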
Critical Tip: A DR plan is only as good as its last test. Use AWS Fault Injection Simulator (FIS) to simulate a regional outage and verify that your failover triggers as expected without manual intervention.
To implement a high-availability disaster recovery (DR) plan for EKS in 2026, you need to configure Velero to bridge the gap between your primary and secondary regions.
Since EBS snapshots are region-bound, a standard restore will fail in a new region unless you’ve replicated the data or used File System Backup (FSB).
1. Storage Class Mapping (The “Translation” Layer)
If your primary region uses a specific storage class (e.g., gp3-encrypted) and your DR region uses another (e.g., gp3-standard), you must map them during the restore.
You can do this via the CLI or by creating a ConfigMap in the velero namespace of your DR cluster before running the restore.
Option A: The ConfigMap (Recommended for Automation)
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: change-storage-class-config
  namespace: velero
  labels:
    velero.io/plugin-config: ""
    velero.io/change-storage-class: RestoreItemAction
data:
  gp3-primary: gp3-dr # "OldClass": "NewClass"
  ebs-sc: gp3         # Standardize different SC names
```
Option B: The CLI Flag (One-off Restores)
```shell
velero restore create dr-restore-2026 \
  --from-backup primary-backup-0303 \
  --storage-class-mappings gp3-primary:gp3-dr
```
2. Cross-Region Restore Strategy
To ensure Velero can actually “see” and “use” the backups in a second region, follow this configuration:
Step 1: The “Passive” Backup Storage Location (DR Region)
In your DR cluster, configure the BackupStorageLocation to point to your primary S3 bucket but set it to Read-Only.
```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: primary-s3-sync
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: my-eks-backups-2026 # Same bucket used in Region A
    prefix: backups
  config:
    region: us-east-1 # Region where the S3 bucket lives
  accessMode: ReadOnly # CRITICAL: Prevents DR cluster from deleting Prod backups
```
Step 2: Handling the Data (EBS vs. File System)
- EBS Snapshots: Do not work across regions natively. You must copy snapshots to the DR region, either with scheduled cross-region snapshot copies (e.g., via Amazon Data Lifecycle Manager) or with AWS Backup copy jobs.
- File System Backup (Kopia/Restic): If you use the `--default-volumes-to-fs-backup` flag, Velero uploads the data directly to S3. This is much easier for DR because S3 is globally accessible, allowing you to restore data in any region without worrying about snapshot replication.
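Opting a backup into File System Backup is a single flag at backup time. A sketch; the backup and namespace names are placeholders:

```shell
# Copy pod volume data to S3 via Kopia/Restic instead of
# taking region-bound EBS snapshots.
velero backup create dr-fsb-backup \
  --include-namespaces production \
  --default-volumes-to-fs-backup
```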
3. Final DR Checklist for 2026
- [ ] OIDC Providers: Ensure your IAM roles for Service Accounts (IRSA) or Pod Identity in the DR region have permissions to read from the Primary S3 bucket.
- [ ] Load Balancers: Remember that ALBs are region-specific. Your restored Ingress objects will create new ALBs in the DR region. You will need to update Route 53 to point to the new DNS names.
- [ ] Image Registry: If you use a private ECR, ensure your DR region has a replica of the images or can reach the primary region’s ECR.
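For the ECR item, replication can be configured once per registry so every subsequent push is mirrored to the DR region; a sketch, with the account ID and regions as placeholders:

```shell
# Mirror all repositories in this registry to us-west-2.
# Existing images are not copied retroactively; re-push or
# script a one-time copy for images pushed before this change.
aws ecr put-replication-configuration \
  --replication-configuration \
  '{"rules":[{"destinations":[{"region":"us-west-2","registryId":"123456789012"}]}]}'
```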