Disaster Recovery Planning: Building a DR Plan That Actually Works

Most businesses have some form of data backup. Fewer have a disaster recovery plan. And of those that do, a surprising number have plans that exist only as documents — documents that have never been tested, are based on infrastructure that has since changed, and reference contacts who no longer work at the company. A DR plan that has not been tested is not a DR plan. It is a liability.

This guide explains how to build a disaster recovery plan that functions when you actually need it — starting with the two metrics that define everything in DR: Recovery Point Objective and Recovery Time Objective. Understanding those two numbers before designing anything else is the difference between a real DR program and a document that creates false confidence.

RPO and RTO: The Two Numbers That Drive Every DR Decision

Recovery Point Objective (RPO) defines how much data loss is acceptable. More precisely, it answers: if a disaster occurs right now, how old can the data we restore be? An RPO of 24 hours means you are willing to lose up to 24 hours of transactions and data changes. An RPO of 1 hour means you can lose at most 1 hour of work. An RPO of near-zero means continuous data replication, so the recovered data is essentially current as of the moment of the disaster.

RPO is directly tied to your backup frequency and replication strategy. If you take a single daily backup at midnight, the best RPO you can actually meet is 24 hours: a disaster at 11:59 PM means losing almost an entire day of work. To achieve a 4-hour RPO, you need backups or replication running at least every 4 hours. Near-zero RPO requires synchronous or near-synchronous replication to a secondary site.
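The arithmetic behind that example can be sketched in a few lines of Python. This is a minimal illustration, not a sizing tool: the function names are made up for this article, and it assumes each backup completes instantly at its scheduled time.

```python
from datetime import datetime, timedelta

def data_loss_at(disaster: datetime, backups: list[datetime]) -> timedelta:
    """Time between the disaster and the most recent backup taken before it."""
    earlier = [b for b in backups if b <= disaster]
    if not earlier:
        raise ValueError("no backup exists before the disaster time")
    return disaster - max(earlier)

def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """A fixed schedule can lose up to one full interval, so the interval must not exceed the RPO."""
    return backup_interval <= rpo

# Daily backup at midnight, disaster at 11:59 PM the same day.
midnight = datetime(2024, 1, 1, 0, 0)
print(data_loss_at(datetime(2024, 1, 1, 23, 59), [midnight]))  # 23:59:00

# A daily schedule cannot meet a 4-hour RPO; a 4-hourly schedule can.
print(meets_rpo(timedelta(hours=24), timedelta(hours=4)))  # False
print(meets_rpo(timedelta(hours=4), timedelta(hours=4)))   # True
```

In practice a backup takes time to finish, so the real exposure is somewhat larger than the schedule interval alone suggests.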

Recovery Time Objective (RTO) defines how long you can be down. It answers: if a disaster occurs, how quickly must we be operational again? An RTO of 72 hours means you have three days to restore service. An RTO of 4 hours means you must be back up within a business half-day. An RTO near-zero means immediate failover — users are redirected to a standby system with no perceptible outage.

RTO is tied to your recovery infrastructure and process maturity. Restoring a server from backup tapes shipped from an offsite vault might take 48–72 hours. Failing over to a warm standby environment that already has your servers provisioned and data replicated might take 2–4 hours. Cutting over to a hot standby with live replication and pre-tested automated failover might take minutes.

The cost curve: Reducing RPO and RTO costs money, and the cost climbs steeply as either target approaches zero. Define the RTO and RPO for each business-critical system based on the actual cost of downtime for that system, then design the minimum DR solution that meets those requirements. Do not over-engineer DR for systems where a 24-hour RTO is genuinely acceptable.

Disaster Categories: What You're Actually Planning For

DR plans fail when they are written around a single scenario and the disaster that actually hits is a different one. A robust plan addresses multiple categories:

Hardware failure. Individual component failures — a failed RAID drive, a dead power supply, a corrupted boot disk — are the most common events. These are not "disasters" in the traditional sense but they require documented recovery procedures. The response is typically local: replace the failed component and restore from the most recent backup or snapshot. Hardware failure response must be in the DR plan because it is statistically the most likely event.

Ransomware and malicious encryption. Ransomware encrypts data files and increasingly targets backup systems to prevent recovery. A ransomware event where backups are also encrypted or deleted is a full disaster scenario. Recovery from ransomware requires air-gapped or immutable backup copies that the ransomware cannot reach — cloud backup with retention locks, tape, or an offline copy. The recovery procedure involves identifying the last clean backup before infection, validating it, rebuilding a clean environment, and restoring.
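The first step of that procedure, picking the last backup taken before infection, can be sketched as follows. This is a simplified illustration: `last_clean_backup` is a hypothetical helper, and the infection time is assumed to be known, which in a real incident is an estimate from forensic analysis. The selected backup is only a candidate and still needs to be validated before restoring.

```python
from datetime import datetime
from typing import Optional

def last_clean_backup(backups: list[datetime], infected_at: datetime) -> Optional[datetime]:
    """Return the newest backup taken strictly before the estimated infection time."""
    candidates = [b for b in backups if b < infected_at]
    return max(candidates, default=None)

# Nightly backups on March 1-4; infection estimated on March 3 at 2:30 PM.
backups = [datetime(2024, 3, d, 0, 0) for d in (1, 2, 3, 4)]
print(last_clean_backup(backups, datetime(2024, 3, 3, 14, 30)))  # 2024-03-03 00:00:00
```

If the function returns `None`, no retained backup predates the infection, which is exactly the situation that longer retention and immutable copies are meant to prevent.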

Site-level disaster. Fire, flood, power grid failure, or physical destruction affecting your facility or data center. This is the scenario that requires an offsite DR capability — backup data at a geographically separate location and the ability to provision replacement infrastructure at a secondary site. For Southern California businesses, earthquake preparedness is a real planning consideration, and site selection for DR facilities should account for whether the primary and secondary sites share common seismic risk.

Extended outage without physical damage. Extended power outages, building inaccessibility, or network provider failures that make your primary facility unusable without destroying data. The response here is often a combination of UPS and generator for extended power, and failover to cloud or secondary site for workloads that cannot tolerate the outage.

DR Tiers: Matching Cost to Recovery Requirements

Cold site. A cold site is a physical location with power, cooling, and network connectivity but no pre-configured servers or data. In a disaster, you procure hardware (or have pre-purchased and stored it), ship it to the cold site, restore from backup media, and build the environment. RTOs are measured in days. Cold sites are low-cost because you are only paying for the space, not running infrastructure continuously. Appropriate for systems where 48–72 hour RTO is acceptable.

Warm standby. A warm standby environment has infrastructure pre-provisioned (servers configured, operating systems installed, application software ready), but it is not running continuously and data is not being replicated in real time. To activate: power on the systems, restore the most recent backup data, update configurations, and redirect traffic. RTOs run 4–24 hours depending on data volume and complexity. Warm standby in the cloud is increasingly common: cloud VMs are pre-configured and kept shut down, so you pay only for storage, then started during a disaster and loaded with data restored from cloud backup.

Hot standby. A hot standby environment runs continuously with live or near-live data replication from the primary environment. The standby servers are powered on and current; when a disaster occurs, failover is a matter of redirecting DNS or load balancer traffic and promoting the standby databases. RTOs of minutes to 2 hours. Hot standby is significantly more expensive because you are running a full duplicate of your production environment at all times. Justified only for systems where the cost of downtime substantially exceeds the cost of the standby infrastructure.
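One way to apply the "minimum DR solution that meets the requirement" rule is to walk the tiers from cheapest to most expensive and stop at the first one whose worst-case RTO fits. The sketch below does that using the upper end of each RTO range quoted above. The `TIERS` table and the function name are illustrative assumptions, and real numbers should come from your own tests.

```python
from typing import Optional

# (tier name, upper end of the RTO range quoted in this article, in hours),
# ordered from cheapest to most expensive.
TIERS = [
    ("cold site", 72.0),
    ("warm standby", 24.0),
    ("hot standby", 2.0),
]

def cheapest_tier(required_rto_hours: float) -> Optional[str]:
    """Return the cheapest tier whose worst-case RTO meets the requirement, if any."""
    for name, worst_case_rto in TIERS:
        if worst_case_rto <= required_rto_hours:
            return name
    return None  # tighter than any tier's worst case; needs a custom design

print(cheapest_tier(72))  # cold site
print(cheapest_tier(24))  # warm standby
print(cheapest_tier(4))   # hot standby
```

Using the worst-case figure is a deliberately conservative choice: a warm standby may well recover in under 24 hours, but the plan should not depend on it.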

Testing Methodology: The Only Part That Actually Matters

A DR plan that has never been tested is a hypothesis, not a plan. DR tests reveal gaps between what the documentation assumes and what the infrastructure actually does. The categories of DR tests, from least to most rigorous:

  • Tabletop exercise. Key stakeholders walk through the DR scenario in a conference room without touching any systems. Valuable for identifying process gaps, communication failures, and unclear decision authority. Should be done annually at minimum.
  • Component test. Test individual components of the DR plan: perform a test restore of specific VMs or databases from backup, verify that backup data is readable and complete, test the failover of a non-critical system to the DR environment. Low risk, can be done frequently.
  • Parallel test. Activate the DR environment in parallel with the production environment. Validate that systems come up correctly, data is current, and applications function. Production continues normally. This tests the DR environment without business disruption.
  • Full failover test. Actually cut over production traffic to the DR environment. The highest-confidence test — it proves the plan works end-to-end — but also the highest risk and disruption. Reserve for systems where the investment justifies it and schedule carefully. Full failover tests should be run at least annually for critical systems.
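For the component test, "readable and complete" can be checked mechanically by comparing a restored file against a checksum recorded when the backup was taken. The following is a minimal sketch that assumes you already store a SHA-256 digest per file. The function names are hypothetical, and many backup products offer their own built-in verification that should be preferred where available.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large restores do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def restore_matches(restored: Path, expected_sha256: str) -> bool:
    """True if the restored file's digest equals the digest recorded at backup time."""
    return sha256_of(restored) == expected_sha256.strip().lower()
```

A matching checksum shows the file came back intact. It does not show that the application using it starts correctly, which is what the parallel and full failover tests are for.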

Documentation Requirements

DR documentation must be detailed enough that a competent administrator who has never run the specific recovery procedure before can execute it successfully under stress. Vague documentation like "restore from backup" is inadequate. The documentation should specify:

  • Which backup system and job to restore from.
  • The exact restoration procedure, including command-line steps or interface navigation.
  • The order in which systems must be recovered. Dependencies matter: Active Directory before file servers, database before application, and so on.
  • Post-recovery validation steps.
  • The contact list for key personnel and vendors.
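Recovery order is easier to keep correct when it is derived from a recorded dependency map rather than remembered. The sketch below uses Python's standard `graphlib` module (Python 3.9 and later) to produce a valid order. The system names are placeholders based on the examples above, not a real inventory.

```python
from graphlib import TopologicalSorter

# Each system maps to the set of systems that must be recovered before it.
# Names are illustrative placeholders.
DEPENDS_ON = {
    "active-directory": set(),
    "file-server": {"active-directory"},
    "database": set(),
    "application": {"database"},
}

def recovery_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return an order in which every system comes after its dependencies.

    Raises graphlib.CycleError if the dependency map contains a cycle.
    """
    return list(TopologicalSorter(depends_on).static_order())

for step, system in enumerate(recovery_order(DEPENDS_ON), start=1):
    print(f"{step}. {system}")
```

Several valid orders may exist, so treat the output as one acceptable sequence to write into the runbook, then confirm it during a component or parallel test.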

Store DR documentation somewhere accessible when your primary systems are down — paper copies in a physically secure location, a cloud document system accessible from any device, or both. A DR plan stored only on the servers you are trying to recover is not useful.

IT Center's backup and recovery services include DR planning documentation, quarterly backup validation testing, and managed failover capabilities. See also our server management services for ongoing infrastructure monitoring that can detect failure conditions before they become disaster events.

Does Your Business Have a Tested DR Plan?

IT Center builds and tests disaster recovery plans for Southern California SMBs — RPO/RTO analysis, backup architecture, DR environment setup, and annual DR tests. Know you can recover before you need to.

Schedule a DR Assessment

Or call us directly: (888) 221-0098
