EKS Disaster Recovery

Posted December 26, 2021

Updated March 3, 2026

Author: Rajkumar Aute

Disaster Recovery: The Velero “Time Machine”

For EKS, a “backup” isn’t just a copy; it’s a recovery plan. Velero is a common choice because it captures both the cluster metadata (the Kubernetes resource manifests) and the persistent data (EBS volumes).

The Cross-Region “Golden Rule”

To survive a full region outage (e.g., us-east-1 going dark), follow this architecture:

Backup Location: Set your Velero BackupStorageLocation to an S3 bucket in a different region.
Read-Only Mode: In your DR region, install Velero and point it to that same S3 bucket, but set the access mode to ReadOnly. This prevents the DR cluster from accidentally deleting your production backups.
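Storage Class Mapping: Storage classes are defined per cluster, and the EBS volumes behind them are tied to a region. If the class names differ between the two clusters, use a Velero ConfigMap to map the storage class in Region A (e.g., gp3) to its counterpart in Region B during restore.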

Critical 2026 Checklist:

Audit logs: Ensure they are enabled. Cluster Insights relies on them.
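Velero Plugins: Use a velero-plugin-for-aws release that matches your Velero version. The plugin provides the S3 object store and native EBS snapshots; CSI snapshot support, which the EBS CSI driver relies on, ships with Velero itself since v1.14.
EKS Pod Identity: If you use Pod Identity (instead of IRSA), ensure the Velero IAM role is associated with the Velero service account in the DR cluster. Pod Identity associations are created per cluster, so they do not carry over from the primary.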

Designing a multi-region disaster recovery (DR) strategy on AWS requires balancing two critical metrics: RPO (Recovery Point Objective), the maximum acceptable data loss, and RTO (Recovery Time Objective), the maximum acceptable downtime.

1. Core Disaster Recovery Patterns

AWS categorizes DR into four distinct patterns, ranging from low-cost/high-downtime to high-cost/zero-downtime.

Pattern	RPO	RTO	Cost	Strategy
Backup & Restore	Hours	24h+	$	Periodic backups to S3; restore on failure.
Pilot Light	Minutes	Minutes/Hours	$$	Live data replication; core “quiet” infra ready.
Warm Standby	Seconds	Minutes	$$$	Scaled-down, functional version always running.
Active-Active	Near Zero	Near Zero	$$$$	Traffic served from both regions simultaneously.
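
To make the trade-off in the table concrete, here is a minimal Python sketch that picks the cheapest pattern meeting a given RPO and RTO target. The numeric bounds are illustrative assumptions chosen to match the qualitative table entries (hours, minutes, seconds); they are not AWS figures, and the names DrPattern and cheapest_pattern are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrPattern:
    name: str
    rpo_seconds: float  # assumed worst-case data loss
    rto_seconds: float  # assumed worst-case downtime
    cost_rank: int      # 1 = cheapest ($), 4 = most expensive ($$$$)


# Illustrative numbers only: replace them with values measured in your own DR tests.
PATTERNS = [
    DrPattern("Backup & Restore", rpo_seconds=12 * 3600, rto_seconds=24 * 3600, cost_rank=1),
    DrPattern("Pilot Light", rpo_seconds=15 * 60, rto_seconds=2 * 3600, cost_rank=2),
    DrPattern("Warm Standby", rpo_seconds=30, rto_seconds=15 * 60, cost_rank=3),
    DrPattern("Active-Active", rpo_seconds=1, rto_seconds=1, cost_rank=4),
]


def cheapest_pattern(max_rpo_seconds, max_rto_seconds):
    """Return the lowest-cost pattern that meets both targets, or None if none does."""
    candidates = [
        p for p in PATTERNS
        if p.rpo_seconds <= max_rpo_seconds and p.rto_seconds <= max_rto_seconds
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda p: p.cost_rank)


if __name__ == "__main__":
    # One hour of acceptable data loss, four hours of acceptable downtime.
    choice = cheapest_pattern(max_rpo_seconds=3600, max_rto_seconds=4 * 3600)
    print(choice.name)  # prints "Pilot Light"
```

The point of the sketch is the direction of the decision: start from the business targets and move up the cost ladder only as far as those targets require.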

2. Recommended Services for Minimal Downtime

To achieve minimal downtime (RTO in minutes/seconds), you should focus on Warm Standby or Active-Active patterns using these specific services:

Global Traffic Management

Amazon Route 53 Application Recovery Controller (ARC): Provides “Routing Controls” that act as on/off switches for entire regions. It ensures you can fail over without relying on the AWS Management Console during a regional event.
AWS Global Accelerator: Uses Anycast IPs to route traffic over the AWS private backbone. It can fail over between regional endpoints in seconds, bypassing DNS caching delays.

Database & Data Tier

Amazon Aurora Global Database: Replicates data across regions with latency typically under 1 second. In a disaster, you can promote a secondary region to full read/write status, typically in less than 1 minute.
Amazon DynamoDB Global Tables: A multi-active NoSQL solution that allows local reads/writes in every region and resolves concurrent updates to the same item with a last-writer-wins rule. This makes it a natural fit for Active-Active architectures.
Amazon S3 Cross-Region Replication (CRR): Ensures your objects and static assets are asynchronously synced to a bucket in the DR region.

Compute & Orchestration

AWS CloudFormation / Terraform: You must use Infrastructure as Code (IaC) to ensure the DR region’s environment is an exact, version-controlled replica of the primary.
AWS Backup: Provides a centralized way to automate and copy snapshots across regions for services like EBS, RDS, and EFS.

3. Implementation Steps for a “Warm Standby” Strategy

This is the most common “high-tier” DR choice because it balances cost with rapid recovery.

Replicate Data: Set up Aurora Global Database or RDS Read Replicas in the secondary region.
Deploy Core Infra: Deploy your VPC, Load Balancers, and at least one small instance of your application (e.g., a single t3.medium) in the secondary region.
Health Checks: Configure Route 53 Health Checks to monitor your primary endpoint.
Automation: Create an AWS Lambda function, or use Route 53 ARC, to trigger a “Scale Up” event (via Auto Scaling Groups) in the secondary region when a failover is detected.
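
The health check and automation steps above can be sketched as a small piece of decision logic. This is a simplified Python model, not Route 53 or Auto Scaling code: the class FailoverMonitor, the threshold of three consecutive failures, and the capacity numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Declares a failover after N consecutive failed health checks (assumed policy)."""
    failure_threshold: int = 3
    consecutive_failures: int = 0
    failed_over: bool = False

    def observe(self, healthy: bool) -> bool:
        """Record one health check result; return True only on the check that triggers failover."""
        if self.failed_over:
            return False  # already failed over, nothing more to trigger
        if healthy:
            self.consecutive_failures = 0  # a healthy check resets the streak
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.failed_over = True
            return True
        return False


def standby_capacity(failed_over: bool, standby_size: int = 1, full_size: int = 6) -> int:
    """Desired instance count in the secondary region (sizes are hypothetical)."""
    return full_size if failed_over else standby_size


if __name__ == "__main__":
    monitor = FailoverMonitor()
    for healthy in [True, False, False, True, False, False, False]:
        if monitor.observe(healthy):
            print("failover triggered, scaling standby to", standby_capacity(True))
```

Requiring several consecutive failures is a deliberate choice here: a single failed check resets on the next healthy one, which avoids failing over on a transient blip.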

Critical Tip: A DR plan is only as good as its last test. Use AWS Fault Injection Service (FIS, formerly Fault Injection Simulator) to simulate a regional outage and verify that your failover triggers as expected without manual intervention.

To implement a high-availability disaster recovery (DR) plan for EKS in 2026, you need to configure Velero to bridge the gap between your primary and secondary regions.

Since EBS snapshots are region-bound, a standard restore will fail in a new region unless you’ve replicated the data or used File System Backup (FSB).

1. Storage Class Mapping (The “Translation” Layer)

If your primary region uses a specific storage class (e.g., gp3-encrypted) and your DR region uses another (e.g., gp3-standard), you must map them during the restore.

The mapping lives in a ConfigMap in the velero namespace of your DR cluster, and it must exist before you run the restore.

Option A: The ConfigMap (Recommended for Automation)

apiVersion: v1
kind: ConfigMap
metadata:
  name: change-storage-class-config
  namespace: velero
  labels:
    velero.io/plugin-config: ""
    velero.io/change-storage-class: RestoreItemAction
data:
  gp3-primary: gp3-dr  # "OldClass": "NewClass"
  ebs-sc: gp3          # Standardize different SC names

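Option B: The CLI (One-off Restores)

The Velero documentation describes the ConfigMap as the mechanism for storage class mapping, and velero restore create does not list a --storage-class-mappings flag; confirm with velero restore create --help on your version. For a one-off restore, apply the ConfigMap first (the file name below is a placeholder), then create the restore:

kubectl apply -f change-storage-class-config.yaml
velero restore create dr-restore-2026 \
  --from-backup primary-backup-0303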
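
To make the mapping semantics concrete, the following Python sketch models what a storage class mapping does to a restored PersistentVolumeClaim. It is a rough illustration of the effect, not Velero’s actual implementation; the function name remap_storage_class and the sample PVC are hypothetical.

```python
import copy


def remap_storage_class(pvc: dict, mappings: dict) -> dict:
    """Return a copy of the PVC with storageClassName swapped if a mapping exists.

    Classes without a mapping are left untouched, mirroring the
    "OldClass": "NewClass" layout of the ConfigMap data section.
    """
    restored = copy.deepcopy(pvc)  # never mutate the backed-up object
    spec = restored.get("spec", {})
    old_class = spec.get("storageClassName")
    if old_class in mappings:
        spec["storageClassName"] = mappings[old_class]
    return restored


if __name__ == "__main__":
    mappings = {"gp3-primary": "gp3-dr", "ebs-sc": "gp3"}
    pvc = {
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": "orders-db-data"},  # hypothetical claim name
        "spec": {"storageClassName": "gp3-primary"},
    }
    print(remap_storage_class(pvc, mappings)["spec"]["storageClassName"])  # prints "gp3-dr"
```

A useful pre-restore check follows from this model: every storage class name that appears in the backup should either exist in the DR cluster or have an entry in the mapping.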

2. Cross-Region Restore Strategy

To ensure Velero can actually “see” and “use” the backups in a second region, follow this configuration:

Step 1: The “Passive” Backup Storage Location (DR Region)

In your DR cluster, configure the BackupStorageLocation to point to the same S3 bucket the primary cluster writes to, but set accessMode to ReadOnly.

apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: primary-s3-sync
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: my-eks-backups-2026  # Same bucket used in Region A
    prefix: backups
  config:
    region: us-west-2            # Region where the S3 bucket lives; keep it outside the primary cluster's region
  accessMode: ReadOnly           # CRITICAL: Prevents DR cluster from deleting Prod backups
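
As a sanity check before relying on a DR cluster, a manifest like the one above can be validated against the two rules from this article: the DR cluster must be read-only, and the bucket must not live in the primary region. The sketch below is a plain Python check over a dictionary that mirrors the manifest; the function name check_dr_storage_location is hypothetical, and loading the YAML from the cluster is left out.

```python
def check_dr_storage_location(bsl: dict, primary_region: str) -> list:
    """Return a list of problems found in a DR BackupStorageLocation (empty if none)."""
    problems = []
    spec = bsl.get("spec", {})

    # Velero treats a missing accessMode as ReadWrite, so it must be set explicitly.
    if spec.get("accessMode", "ReadWrite") != "ReadOnly":
        problems.append("accessMode should be ReadOnly on the DR cluster")

    bucket_region = spec.get("config", {}).get("region")
    if bucket_region is None:
        problems.append("config.region is not set")
    elif bucket_region == primary_region:
        problems.append("bucket lives in the primary region and would be lost with it")

    return problems


if __name__ == "__main__":
    dr_bsl = {
        "spec": {
            "provider": "aws",
            "objectStorage": {"bucket": "my-eks-backups-2026", "prefix": "backups"},
            "config": {"region": "us-west-2"},
            "accessMode": "ReadOnly",
        }
    }
    print(check_dr_storage_location(dr_bsl, primary_region="us-east-1"))  # prints []
```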

Step 2: Handling the Data (EBS vs. File System)

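EBS Snapshots: Are region-bound and cannot be restored in another region as they are. You must copy them to the DR region, for example with Amazon Data Lifecycle Manager cross-Region copy rules or AWS Backup. A copied snapshot gets a new snapshot ID, so the IDs recorded in the Velero backup will not match the copies without extra handling.
File System Backup (Kopia/Restic): If you use the --default-volumes-to-fs-backup flag (with the Velero node agent installed), Velero uploads the volume data itself to the S3 bucket. This is much easier for DR because the bucket can be read from any region, so you can restore data in the DR cluster without worrying about snapshot replication.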
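
The difference between the two approaches boils down to a simple rule, sketched here in Python. The method labels "ebs-snapshot" and "fs-backup" and the function name can_restore_volume are made up for this example, and the model assumes the S3 bucket itself is still reachable from the DR region.

```python
def can_restore_volume(method: str, backup_region: str, dr_region: str,
                       snapshot_copied_to_dr: bool = False) -> bool:
    """Decide whether a volume backup is usable in the DR region (simplified model)."""
    if method == "fs-backup":
        # Data sits in S3 alongside the backup metadata, readable from any region.
        return True
    if method == "ebs-snapshot":
        # Snapshots only help if they already exist in the DR region.
        return backup_region == dr_region or snapshot_copied_to_dr
    raise ValueError(f"unknown backup method: {method}")


if __name__ == "__main__":
    print(can_restore_volume("ebs-snapshot", "us-east-1", "us-west-2"))  # prints False
    print(can_restore_volume("fs-backup", "us-east-1", "us-west-2"))     # prints True
```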

3. Final DR Checklist for 2026

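[ ] OIDC Providers: Ensure the IAM role Velero uses in the DR cluster, via IAM Roles for Service Accounts (IRSA) or Pod Identity, has permission to read from the backup S3 bucket. With IRSA, the role’s trust policy must also reference the DR cluster’s OIDC provider.
[ ] Load Balancers: Remember that ALBs are region-specific. Your restored Ingress objects will create new ALBs in the DR region, provided the AWS Load Balancer Controller is installed there. You will need to update Route 53 to point to the new DNS names.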
[ ] Image Registry: If you use a private ECR, ensure your DR region has a replica of the images or can reach the primary region’s ECR.
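
For the image registry item, restored workloads still reference the primary region’s registry host. The Python sketch below rewrites a private ECR image reference to the DR region. It assumes the repositories have been replicated to the DR region under the same account, repository names, and tags; the function name to_dr_registry and the sample image are hypothetical.

```python
import re

# Private ECR image references look like:
#   <12-digit account>.dkr.ecr.<region>.amazonaws.com/<repository>[:tag|@digest]
ECR_IMAGE = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/(?P<rest>.+)$"
)


def to_dr_registry(image: str, dr_region: str) -> str:
    """Point a private ECR image at the DR region; leave any other image unchanged."""
    match = ECR_IMAGE.match(image)
    if match is None:
        return image  # public or third-party registry, nothing to rewrite
    return f"{match['account']}.dkr.ecr.{dr_region}.amazonaws.com/{match['rest']}"


if __name__ == "__main__":
    image = "123456789012.dkr.ecr.us-east-1.amazonaws.com/shop/api:1.4.2"
    print(to_dr_registry(image, "us-west-2"))
    # prints "123456789012.dkr.ecr.us-west-2.amazonaws.com/shop/api:1.4.2"
```

Whether to rewrite references at restore time or to keep pulling from the primary registry depends on whether that registry is expected to survive the outage you are planning for.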
