Disaster Recovery
Recovery Time Objective (RTO) is the maximum acceptable time after a disruption to restore a business process to its agreed service level.
For example, if a disaster occurs at 12 PM and the RTO is eight hours, then the disaster recovery process should restore the business process to the acceptable service level by 8 PM.
Example RTO: the ordering system must be back up within 8 hours to prevent loss of widget revenue.
Recovery Point Objective (RPO) is the acceptable amount of data loss, measured in time.
For example, if a disaster occurs at 12 PM and the RPO is one hour, then the system should recover all data that was in the system before 11 AM.
Example RPO: the ordering system must be able to recover all data up to one hour before the outage to avoid loss of widget orders.
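The two worked examples above reduce to simple date arithmetic. The sketch below is only an illustration: the function name recovery_targets and the calendar date are assumptions made for the example, not part of any AWS tooling.

```python
from datetime import datetime, timedelta


def recovery_targets(disaster_at, rto, rpo):
    """Return (restore_deadline, recovery_point) for a disaster time.

    restore_deadline: latest time by which service should be restored (RTO).
    recovery_point:   data written before this time should be recoverable (RPO).
    """
    restore_deadline = disaster_at + rto
    recovery_point = disaster_at - rpo
    return restore_deadline, recovery_point


# Disaster at 12 PM on an arbitrary (hypothetical) date, RTO 8 hours, RPO 1 hour.
disaster = datetime(2024, 1, 1, 12, 0)
deadline, point = recovery_targets(disaster, timedelta(hours=8), timedelta(hours=1))

print(deadline.strftime("%H:%M"))  # 20:00, i.e. 8 PM
print(point.strftime("%H:%M"))     # 11:00, i.e. 11 AM
```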
Scenarios
Backup & Restore: Using AWS as a virtual tape library
Services used: AWS Storage Gateway, AWS Import/Export, Amazon Glacier, Amazon S3
RTO: High (8–26 hours)
RPO: Since the last backup. If backups are only taken daily, this could be up to 24 hours, which may be acceptable for some systems.
Cost: Low
Consideration: Recovery time may include getting tapes/media delivered to the site. Disk/tape management is also required.
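With backup and restore, the data actually lost is whatever was written after the most recent completed backup. A minimal sketch of that check follows, assuming a hypothetical daily backup at 01:00; the helper names data_loss_window and meets_rpo are made up for the illustration.

```python
from datetime import datetime, timedelta


def data_loss_window(backup_times, disaster_at):
    """Time between the most recent backup before the disaster and the disaster."""
    earlier = [t for t in backup_times if t <= disaster_at]
    if not earlier:
        raise ValueError("no backup completed before the disaster")
    return disaster_at - max(earlier)


def meets_rpo(backup_times, disaster_at, rpo):
    """True if the data lost since the last backup fits within the RPO."""
    return data_loss_window(backup_times, disaster_at) <= rpo


# Hypothetical schedule: one backup per day at 01:00; disaster at 12 PM on day 3.
backups = [datetime(2024, 1, day, 1, 0) for day in (1, 2, 3)]
disaster = datetime(2024, 1, 3, 12, 0)

loss = data_loss_window(backups, disaster)
print(loss)                                              # 11:00:00
print(meets_rpo(backups, disaster, timedelta(hours=1)))  # False
print(meets_rpo(backups, disaster, timedelta(hours=24))) # True
```

A daily schedule can never guarantee an RPO shorter than 24 hours, because the disaster may strike just before the next backup runs.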
Pilot Light (data is mirrored): A minimal version of the environment runs on AWS and can be “lit up” and expanded to full size from this pilot light.
Services used: AMIs, bootstrapping, EIPs, ELBs, CloudFormation, Amazon RDS replication
RTO: Lower than backup/restore (e.g. 4–8 hours)
RPO: Since the last snapshot. While core pieces of the system are in place, some installation and preparation may be required.
Cost: Low
Consideration: Keeping all services/libraries/patches up to date adds administrative overhead.
Warm Standby: A scaled-down version of a fully functional environment is always running.
Services used: AMIs, bootstrapping, EIPs, ELBs, CloudFormation, Amazon RDS replication
RTO: Lower than pilot light, as some services are always running (e.g. < 4 hours)
RPO: Since the last data write if using a master-slave multi-AZ DB. Replication may be asynchronous only, which would increase the RPO.
Cost: Medium
Consideration: The environment can be used for dev/test, which offsets the cost.
Multi-Site: A fully operational version of the fully functional environment runs off-site or in another region.
Services used: All
RTO: Lowest — could be seconds if using active/active failover
RPO: Depends on the choice of data replication; since the last data write if the DB/app stack replicates synchronously.
Cost: High
Consideration: Ideal for regular testing of DR processes
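The trade-off between the four scenarios can be sketched as a small lookup: pick the cheapest scenario whose worst-case RTO still meets the requirement. The RTO and cost values below are taken from the figures listed above, with two assumptions labelled in the code: Multi-Site is given a nominal worst case of 0.1 hours (the text only says "could be seconds"), and ties on cost are broken in favour of the slower scenario. The names Scenario and cheapest_scenario are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    worst_case_rto_hours: float
    cost: str


COST_RANK = {"Low": 0, "Medium": 1, "High": 2}

SCENARIOS = [
    Scenario("Backup & Restore", 26, "Low"),
    Scenario("Pilot Light", 8, "Low"),
    Scenario("Warm Standby", 4, "Medium"),
    Scenario("Multi-Site", 0.1, "High"),  # assumption: nominal value, not from the text
]


def cheapest_scenario(required_rto_hours):
    """Return the lowest-cost scenario that meets the required RTO, or None."""
    candidates = [s for s in SCENARIOS
                  if s.worst_case_rto_hours <= required_rto_hours]
    if not candidates:
        return None
    # Assumption: on equal cost, prefer the slower (less demanding) scenario.
    return min(candidates,
               key=lambda s: (COST_RANK[s.cost], -s.worst_case_rto_hours))


for hours in (30, 8, 4, 1):
    print(hours, cheapest_scenario(hours).name)
```

This only considers RTO; a real decision would also weigh RPO and the administrative overhead noted for each scenario.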