EKS Disaster Recovery
Disaster Recovery: The Velero “Time Machine”
For EKS, a “backup” isn’t just a copy; it’s a recovery plan. Velero is a common choice because it backs up both the cluster metadata (the Kubernetes resource manifests) and the persistent data (EBS volumes).
The Cross-Region “Golden Rule”
To survive a full region outage (e.g., us-east-1 going dark), follow this architecture:
- Backup Location: Set your Velero BackupStorageLocation to an S3 bucket in a different region.
- Read-Only Mode: In your DR region, install Velero and point it to that same S3 bucket, but set the access mode to ReadOnly. This prevents the DR cluster from accidentally deleting your production backups.
- Storage Class Mapping: EBS volumes and snapshots do not span regions, so the restore provisions new volumes in Region B. If the storage class names differ between the two clusters, use a Velero ConfigMap to map the gp3 class in Region A to the corresponding class in Region B during restore.
Critical 2026 Checklist:
- Audit logs: Ensure they are enabled. Cluster Insights relies on them.
- Velero Plugins: Use the latest velero-plugin-for-aws to ensure compatibility with CSI snapshots, which are now the default for EBS.
- EKS Pod Identity: If you use the new Pod Identity (instead of IRSA), ensure your Velero IAM roles are mapped correctly in the new region.
Designing a multi-region disaster recovery (DR) strategy on AWS requires balancing two critical metrics: RPO (Recovery Point Objective), the maximum acceptable data loss, and RTO (Recovery Time Objective), the maximum acceptable downtime.
1. Core Disaster Recovery Patterns
AWS categorizes DR into four distinct patterns, ranging from low-cost/high-downtime to high-cost/zero-downtime.
| Pattern | RPO | RTO | Cost | Strategy |
| Backup & Restore | Hours | 24h+ | $ | Periodic backups to S3; restore on failure. |
| Pilot Light | Minutes | Minutes/Hours | $$ | Live data replication; core “quiet” infra ready. |
| Warm Standby | Seconds | Minutes | $$$ | Scaled-down, functional version always running. |
| Active-Active | Near Zero | Near Zero | $$$$ | Traffic served from both regions simultaneously. |
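To make the trade-off concrete, here is a minimal sketch that picks the cheapest pattern for a given RPO/RTO requirement. The table only gives qualitative values ("Hours", "Minutes", "Near Zero"), so every number below is an assumption chosen to preserve the table's ordering, not an AWS figure; the names Pattern and cheapest_pattern are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    name: str
    rpo_seconds: float  # assumed worst-case data loss the pattern delivers
    rto_seconds: float  # assumed worst-case downtime the pattern delivers
    cost_rank: int      # 1 = cheapest, mirrors the $ column of the table


# Illustrative ceilings only: assumptions, not published AWS numbers.
PATTERNS = [
    Pattern("Backup & Restore", rpo_seconds=4 * 3600, rto_seconds=24 * 3600, cost_rank=1),
    Pattern("Pilot Light", rpo_seconds=10 * 60, rto_seconds=2 * 3600, cost_rank=2),
    Pattern("Warm Standby", rpo_seconds=30, rto_seconds=15 * 60, cost_rank=3),
    Pattern("Active-Active", rpo_seconds=1, rto_seconds=1, cost_rank=4),
]


def cheapest_pattern(max_rpo_seconds: float, max_rto_seconds: float) -> Pattern:
    """Return the lowest-cost pattern whose assumed RPO and RTO both fit."""
    candidates = [
        p for p in PATTERNS
        if p.rpo_seconds <= max_rpo_seconds and p.rto_seconds <= max_rto_seconds
    ]
    if not candidates:
        raise ValueError("no pattern meets the stated RPO/RTO")
    return min(candidates, key=lambda p: p.cost_rank)


if __name__ == "__main__":
    # A workload that tolerates 5 minutes of data loss and 30 minutes of downtime.
    print(cheapest_pattern(max_rpo_seconds=300, max_rto_seconds=1800).name)
```

With these assumed ceilings, a 5-minute RPO rules out Pilot Light, so the helper lands on Warm Standby. Replace the numbers with figures you have measured in your own DR tests before relying on the result.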
2. Recommended Services for Minimal Downtime
To achieve minimal downtime (RTO in minutes/seconds), you should focus on Warm Standby or Active-Active patterns using these specific services:
Global Traffic Management
- Amazon Route 53 Application Recovery Controller (ARC): Provides “Routing Controls” that act as on/off switches for entire regions. It ensures you can fail over without relying on the AWS Management Console during a regional event.
- AWS Global Accelerator: Uses Anycast IPs to route traffic over the AWS private backbone. It can fail over between regional endpoints in seconds, bypassing DNS caching delays.
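The “on/off switch” idea is easier to reason about with a toy model. The sketch below is not the ARC API; it only illustrates the kind of guard that ARC safety rules provide, namely refusing a change that would leave no region serving traffic. The function name and the min_regions_on parameter are hypothetical.

```python
from typing import Dict


def set_routing_control(states: Dict[str, bool], region: str, on: bool,
                        min_regions_on: int = 1) -> Dict[str, bool]:
    """Return new routing-control states, rejecting unsafe changes.

    states maps a region name to True (receiving traffic) or False.
    The input dict is not modified.
    """
    if region not in states:
        raise KeyError(region)
    proposed = {**states, region: on}
    # Guard: never allow fewer than min_regions_on regions to stay on.
    if sum(proposed.values()) < min_regions_on:
        raise ValueError(
            f"change rejected: fewer than {min_regions_on} region(s) would serve traffic"
        )
    return proposed


if __name__ == "__main__":
    states = {"us-east-1": True, "us-west-2": False}
    # Safe order: switch the DR region on first, then switch the primary off.
    states = set_routing_control(states, "us-west-2", True)
    states = set_routing_control(states, "us-east-1", False)
    print(states)
```

The ordering in the usage example matters: turning the primary off before the DR region is on would be rejected by the guard, which is the behavior you want from a failover runbook.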
Database & Data Tier
- Amazon Aurora Global Database: Replicates data across regions with latency typically under 1 second. In a disaster, you can promote a secondary region to full read/write status in less than 1 minute.
- Amazon DynamoDB Global Tables: A multi-active NoSQL solution that allows local reads/writes in every region with automatic conflict resolution (last writer wins). This is the gold standard for Active-Active architectures.
- Amazon S3 Cross-Region Replication (CRR): Ensures your objects and static assets are asynchronously synced to a bucket in the DR region.
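“Automatic conflict resolution” for Global Tables means last writer wins, and it helps to see what that implies for your data. The sketch below is a conceptual illustration only, not how DynamoDB is implemented; the Write type, the timestamp field, and the region-name tie-break are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    region: str
    value: str
    timestamp_ms: int  # hypothetical write timestamp used for ordering


def resolve(a: Write, b: Write) -> Write:
    """Last writer wins: keep the write with the later timestamp.

    Ties are broken by region name purely so this sketch is deterministic.
    """
    return max(a, b, key=lambda w: (w.timestamp_ms, w.region))


if __name__ == "__main__":
    east = Write(region="us-east-1", value="status=shipped", timestamp_ms=1000)
    west = Write(region="us-west-2", value="status=cancelled", timestamp_ms=1005)
    # The earlier write is discarded, not merged.
    print(resolve(east, west).value)
```

The practical consequence is that two regions updating the same item at nearly the same time will silently lose one of the updates, so Active-Active designs usually try to route writes for a given item to a single region.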
Compute & Orchestration
- AWS CloudFormation / Terraform: You must use Infrastructure as Code (IaC) to ensure the DR region’s environment is an exact, version-controlled replica of the primary.
- AWS Backup: Provides a centralized way to automate and copy snapshots across regions for services like EBS, RDS, and EFS.
3. Implementation Steps for a “Warm Standby” Strategy
This is the most common “high-tier” DR choice because it balances cost with rapid recovery.
- Replicate Data: Set up Aurora Global Database or RDS Read Replicas in the secondary region.
- Deploy Core Infra: Deploy your VPC, Load Balancers, and at least one small instance of your application (e.g., a single t3.medium) in the secondary region.
- Health Checks: Configure Route 53 Health Checks to monitor your primary endpoint.
- Automation: Create an AWS Lambda function or use Route 53 ARC to trigger a “Scale Up” event (via Auto Scaling Groups) in the secondary region when a failover is detected.
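The health-check and automation steps boil down to two decisions: when to declare the primary unhealthy, and how large the standby should become. Here is a minimal sketch of that logic, assuming a consecutive-failure threshold of 3 and hypothetical standby and full capacities; it contains no AWS calls and is meant as a starting point for the Lambda handler, not a finished one.

```python
from typing import Iterable


def should_fail_over(results: Iterable[bool], failure_threshold: int = 3) -> bool:
    """Return True once failure_threshold consecutive checks have failed.

    results is an ordered series of health-check outcomes (True = healthy).
    """
    if failure_threshold < 1:
        raise ValueError("failure_threshold must be at least 1")
    consecutive_failures = 0
    for healthy in results:
        # A single healthy result resets the counter, which avoids flapping.
        consecutive_failures = 0 if healthy else consecutive_failures + 1
        if consecutive_failures >= failure_threshold:
            return True
    return False


def standby_capacity(failed_over: bool, standby_size: int = 1, full_size: int = 6) -> int:
    """Desired instance count for the secondary region's Auto Scaling group."""
    return full_size if failed_over else standby_size


if __name__ == "__main__":
    checks = [True, False, False, False]
    failed_over = should_fail_over(checks)
    print(failed_over, standby_capacity(failed_over))
```

Requiring consecutive failures trades a slightly longer detection time for fewer false failovers; tune the threshold against the RTO you committed to.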
Critical Tip: A DR plan is only as good as its last test. Use AWS Fault Injection Service (FIS, formerly Fault Injection Simulator) to simulate a regional outage and verify that your failover triggers as expected without manual intervention.
To implement a high-availability disaster recovery (DR) plan for EKS in 2026, you need to configure Velero to bridge the gap between your primary and secondary regions.
Since EBS snapshots are region-bound, a standard restore will fail in a new region unless you’ve replicated the data or used File System Backup (FSB).
1. Storage Class Mapping (The “Translation” Layer)
If your primary region uses a specific storage class (e.g., gp3-encrypted) and your DR region uses another (e.g., gp3-standard), you must map them during the restore.
You can do this via the CLI or by creating a ConfigMap in the velero namespace of your DR cluster before running the restore.
Option A: The ConfigMap (Recommended for Automation)
apiVersion: v1
kind: ConfigMap
metadata:
  name: change-storage-class-config
  namespace: velero
  labels:
    velero.io/plugin-config: ""
    velero.io/change-storage-class: RestoreItemAction
data:
  gp3-primary: gp3-dr # "OldClass": "NewClass"
  ebs-sc: gp3 # Standardize different SC names
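Before running a restore, it is worth checking that every mapped class actually exists in the DR cluster. The sketch below is a pre-flight check, not part of Velero: it applies the same old-to-new mapping to a list of PVCs and reports any that would end up with a storage class the DR cluster does not have. It assumes unmapped classes are left unchanged, and the function and PVC names are hypothetical.

```python
from typing import Dict, List, Tuple


def preview_storage_class_mapping(
    pvc_classes: Dict[str, str],
    mapping: Dict[str, str],
    dr_classes: List[str],
) -> Tuple[Dict[str, str], List[str]]:
    """Apply an old->new storage class mapping and flag PVCs that would not bind.

    pvc_classes: PVC name -> storage class recorded in the backup.
    mapping:     the data section of the ConfigMap.
    dr_classes:  storage classes that exist in the DR cluster.
    """
    # Classes without a mapping entry are assumed to be restored unchanged.
    restored = {pvc: mapping.get(sc, sc) for pvc, sc in pvc_classes.items()}
    missing = sorted(pvc for pvc, sc in restored.items() if sc not in dr_classes)
    return restored, missing


if __name__ == "__main__":
    restored, missing = preview_storage_class_mapping(
        pvc_classes={"db-data": "gp3-primary", "cache": "ebs-sc", "logs": "io2"},
        mapping={"gp3-primary": "gp3-dr", "ebs-sc": "gp3"},
        dr_classes=["gp3-dr", "gp3"],
    )
    print(restored)
    print("PVCs with no usable class in DR:", missing)
```

In the usage example the logs PVC is flagged because io2 has no mapping entry and no matching class in the DR cluster, which is the sort of gap that otherwise shows up as a Pending PVC mid-restore.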
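Option B: The CLI (One-off Restores)
velero restore create has a --namespace-mappings flag but no documented storage-class equivalent, so a one-off restore still relies on the ConfigMap. Create and label it with kubectl, then start the restore:
kubectl -n velero create configmap change-storage-class-config \
  --from-literal=gp3-primary=gp3-dr
kubectl -n velero label configmap change-storage-class-config \
  velero.io/plugin-config="" velero.io/change-storage-class=RestoreItemAction
velero restore create dr-restore-2026 \
  --from-backup primary-backup-0303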
2. Cross-Region Restore Strategy
To ensure Velero can actually “see” and “use” the backups in a second region, follow this configuration:
Step 1: The “Passive” Backup Storage Location (DR Region)
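In your DR cluster, configure the BackupStorageLocation to point to the S3 bucket that holds your primary cluster’s backups, and set it to read-only. Keep that bucket, or a Cross-Region Replication copy of it, outside the region you are protecting against; a bucket in the failed region is unreachable exactly when you need it.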
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: primary-s3-sync
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: my-eks-backups-2026 # Same bucket used in Region A
    prefix: backups
  config:
    region: us-east-1 # Region where the S3 bucket lives
  accessMode: ReadOnly # CRITICAL: Prevents DR cluster from deleting Prod backups
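The primary and DR locations should differ only in name and access mode, so it can help to generate both from one function instead of maintaining two hand-edited files. This is a minimal sketch under that assumption; the function name and the "default" location name for the primary cluster are hypothetical. Note that accessMode sits directly under spec, while the bucket region goes under spec.config.

```python
import json


def backup_storage_location(name: str, bucket: str, bucket_region: str,
                            read_only: bool, prefix: str = "backups") -> dict:
    """Build a Velero BackupStorageLocation manifest as a plain dict."""
    return {
        "apiVersion": "velero.io/v1",
        "kind": "BackupStorageLocation",
        "metadata": {"name": name, "namespace": "velero"},
        "spec": {
            "provider": "aws",
            "objectStorage": {"bucket": bucket, "prefix": prefix},
            "config": {"region": bucket_region},
            # ReadOnly stops this cluster from writing or deleting backups.
            "accessMode": "ReadOnly" if read_only else "ReadWrite",
        },
    }


if __name__ == "__main__":
    primary = backup_storage_location(
        "default", "my-eks-backups-2026", "us-east-1", read_only=False)
    dr = backup_storage_location(
        "primary-s3-sync", "my-eks-backups-2026", "us-east-1", read_only=True)
    # kubectl accepts JSON as well as YAML, e.g. piped into: kubectl apply -f -
    print(json.dumps(dr, indent=2))
```

Generating both manifests from the same inputs removes one common failure mode: a DR location that quietly points at a different bucket or prefix than production.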
Step 2: Handling the Data (EBS vs. File System)
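- EBS Snapshots: Do not work across regions natively, because a snapshot can only be restored in the region where it is stored. You must copy your snapshots to the DR region, for example with Amazon Data Lifecycle Manager cross-region copy rules or with AWS Backup.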
- File System Backup (Kopia/Restic): If you use the --default-volumes-to-fs-backup flag, Velero uploads the volume data directly to S3. This is much easier for DR because the bucket can be read from any region, allowing you to restore data in the DR cluster without worrying about snapshot replication.
3. Final DR Checklist for 2026
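- [ ] OIDC Providers: Ensure the IAM role Velero uses in the DR cluster, whether through IAM Roles for Service Accounts (IRSA) or Pod Identity, has permission to read from the primary S3 bucket. With IRSA, the role’s trust policy must also reference the DR cluster’s own OIDC provider, since each cluster has a different one.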
- [ ] Load Balancers: Remember that ALBs are region-specific. Your restored Ingress objects will create new ALBs in the DR region. You will need to update Route 53 to point to the new DNS names.
- [ ] Image Registry: If you use a private ECR, ensure your DR region has a replica of the images or can reach the primary region’s ECR.
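If you replicate images to an ECR registry in the DR region, the restored workloads still reference the primary region in their image fields. Here is a minimal sketch of the rewrite, assuming replication keeps the same account ID and repository names; the function name is hypothetical and the regex only covers the standard amazonaws.com registry hostname.

```python
import re

# <account>.dkr.ecr.<region>.amazonaws.com/<repository>[:tag or @digest]
ECR_IMAGE = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/(?P<rest>.+)$"
)


def retarget_ecr_image(image: str, dr_region: str) -> str:
    """Point a private ECR image reference at the same repository in dr_region.

    Images that are not private ECR references are returned unchanged.
    """
    match = ECR_IMAGE.match(image)
    if match is None:
        return image
    return f"{match['account']}.dkr.ecr.{dr_region}.amazonaws.com/{match['rest']}"


if __name__ == "__main__":
    print(retarget_ecr_image(
        "123456789012.dkr.ecr.us-east-1.amazonaws.com/shop/api:1.4.2", "us-west-2"))
    print(retarget_ecr_image("nginx:1.27", "us-west-2"))
```

You would apply a rewrite like this to the manifests in your deployment pipeline or after the restore; confirm first that the tags you need have actually been replicated.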