Disaster Recovery

Nadtakan Futhoem
Jun 30, 2021

Recovery Time Objective (RTO) is the maximum acceptable time, after a disruption, to restore a business process to its agreed service level.

For example, if a disaster occurs at 12 PM and the RTO is eight hours, the disaster recovery process should restore the business process to the acceptable service level by 8 PM.

RTO: The ordering system must be back up within 8 hours to prevent loss of widget revenue.
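The RTO arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names `recovery_deadline` and `meets_rto` are hypothetical, and the date is simply the one this post was published on.

```python
from datetime import datetime, timedelta


def recovery_deadline(disaster_time: datetime, rto: timedelta) -> datetime:
    """Latest moment by which service must be restored to meet the RTO."""
    return disaster_time + rto


def meets_rto(disaster_time: datetime, restored_time: datetime, rto: timedelta) -> bool:
    """True if the service came back no later than the RTO deadline."""
    return restored_time <= recovery_deadline(disaster_time, rto)


# Disaster at 12 PM with an 8-hour RTO (illustrative date).
disaster = datetime(2021, 6, 30, 12, 0)
rto = timedelta(hours=8)

deadline = recovery_deadline(disaster, rto)  # 8 PM the same day
restored_in_time = meets_rto(disaster, datetime(2021, 6, 30, 19, 30), rto)
restored_late = meets_rto(disaster, datetime(2021, 6, 30, 21, 0), rto)
```

Here `restored_in_time` is true because 7:30 PM is before the 8 PM deadline, while `restored_late` is false.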

Recovery Point Objective (RPO) is the acceptable amount of data loss, measured in time.

For example, if a disaster occurs at 12 PM and the RPO is one hour, the system should be able to recover all data that was in the system before 11 AM.

RPO: The ordering system must be able to recover data up to one hour before the outage to avoid loss of widget orders.
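As a minimal sketch of the RPO example above, the recovery point is the disaster time minus the RPO, and a recovery meets the RPO when the newest recoverable copy of the data is no older than that point. The function names and timestamps below are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def recovery_point(disaster_time: datetime, rpo: timedelta) -> datetime:
    """Oldest acceptable timestamp for the newest recoverable copy of the data."""
    return disaster_time - rpo


def meets_rpo(disaster_time: datetime, last_recoverable_copy: datetime, rpo: timedelta) -> bool:
    """True if the data lost (disaster time minus last copy) fits within the RPO."""
    return disaster_time - last_recoverable_copy <= rpo


# Disaster at 12 PM with a 1-hour RPO (illustrative date).
disaster = datetime(2021, 6, 30, 12, 0)
rpo = timedelta(hours=1)

point = recovery_point(disaster, rpo)  # 11 AM the same day
copy_at_1130_ok = meets_rpo(disaster, datetime(2021, 6, 30, 11, 30), rpo)
copy_at_1000_ok = meets_rpo(disaster, datetime(2021, 6, 30, 10, 0), rpo)
```

A copy taken at 11:30 AM satisfies the one-hour RPO; a copy taken at 10 AM loses two hours of data and does not.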

Scenarios

Backup & Restore: Using AWS as a virtual tape library

Services used: AWS Storage Gateway, AWS Import/Export, Amazon Glacier, Amazon S3

RTO: High (8–26 hours)

RPO: The time since the last backup. If backups run only daily, data loss could be up to 24 hours, which may be acceptable for some systems.

Cost: Low

Consideration: Recovery time may include getting tapes/media delivered to the site. Disk/tape management is also required.
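To see why a daily backup can mean close to 24 hours of lost data, the sketch below finds the most recent backup before a disaster on a fixed schedule and measures the gap. It assumes backups complete instantly at regular intervals, which real backups do not; the function names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def last_backup_before(disaster_time: datetime, first_backup: datetime,
                       interval: timedelta) -> datetime:
    """Most recent scheduled backup at or before the disaster."""
    if disaster_time < first_backup:
        raise ValueError("no backup exists before the disaster")
    # timedelta // timedelta gives the number of whole intervals elapsed.
    completed_intervals = (disaster_time - first_backup) // interval
    return first_backup + completed_intervals * interval


def data_loss_window(disaster_time: datetime, first_backup: datetime,
                     interval: timedelta) -> timedelta:
    """Amount of data, measured in time, lost since the last backup."""
    return disaster_time - last_backup_before(disaster_time, first_backup, interval)


# Daily backups at midnight; disaster one minute before the next backup.
first_backup = datetime(2021, 6, 29, 0, 0)
daily = timedelta(hours=24)
loss = data_loss_window(datetime(2021, 6, 30, 23, 59), first_backup, daily)
print(loss)  # 23:59:00
```

The worst case approaches the full backup interval, which is why the RPO for this scenario is described as up to 24 hours.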

Pilot Light (data is mirrored): A minimal version of the environment runs on AWS and can be “lit up” and expanded to full size from this pilot light.

Services used: AMIs, bootstrapping, EIPs, ELBs, CloudFormation, Amazon RDS replication

RTO: Lower than backup/restore (e.g., 4–8 hours)

RPO: Since the last snapshot. While core pieces of the system are in place, some installation and preparation may be required.

Cost: Low

Consideration: Keeping all services/libraries/patches up to date adds administrative overhead.

Warm Standby: A scaled-down version of a fully functional environment is always running.

Services used: AMIs, bootstrapping, EIPs, ELBs, CloudFormation, Amazon RDS replication

RTO: Lower than pilot light, as some services are always running (e.g., < 4 hours)

RPO: Since the last data write if using a master-slave Multi-AZ DB. Replication may be asynchronous only, which would increase the RPO.

Cost: Medium

Consideration: The environment can be used for dev/test, offsetting the cost.

Multi-Site: A fully operational copy of the full environment runs off-site or in another region.

Services used: All

RTO: Lowest; could be seconds if using active/active failover

RPO: Depends on the choice of data replication. With synchronous replication of the DB/app stack, the RPO is the last data write.

Cost: High

Consideration: Ideal for regular testing of DR processes
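The four scenarios above can be summarized as a small lookup table. The sketch below picks the lowest-cost scenario whose worst-case RTO fits a target. It rests on several assumptions: the RTO values are the upper ends of the example ranges given above, the one-minute figure for Multi-Site is a stand-in for "could be seconds", the numeric cost ranks are invented for sorting, and the names `Strategy` and `cheapest_strategy` are hypothetical. Real RTOs depend heavily on the specific system.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    worst_case_rto: timedelta  # upper end of the example range (assumption)
    cost_rank: int             # 1 = Low, 2 = Medium, 3 = High (assumed ranking)


STRATEGIES = [
    Strategy("Backup & Restore", timedelta(hours=26), 1),
    Strategy("Pilot Light", timedelta(hours=8), 1),
    Strategy("Warm Standby", timedelta(hours=4), 2),
    Strategy("Multi-Site", timedelta(minutes=1), 3),  # stand-in for "seconds"
]


def cheapest_strategy(target_rto: timedelta) -> Optional[Strategy]:
    """Lowest-cost strategy whose worst-case RTO fits the target, if any."""
    candidates = [s for s in STRATEGIES if s.worst_case_rto <= target_rto]
    if not candidates:
        return None
    # min() keeps the first entry on ties, so list order breaks cost ties.
    return min(candidates, key=lambda s: s.cost_rank)


choice = cheapest_strategy(timedelta(hours=6))
print(choice.name if choice else "no strategy meets this RTO")
```

With a six-hour target, Backup & Restore and Pilot Light are ruled out by their worst-case RTOs, leaving Warm Standby as the cheapest fit. RPO and the considerations listed for each scenario would also need to be weighed in a real decision.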

Nadtakan Futhoem — Sr. Software Engineer
