Thursday, May 17, 2018

A Reasonable DR

I racked my brain over Disaster Recovery concepts for several months before I reached a viable plan. The light bulb finally went on when I realized you must plan for two distinct types of disasters (call them major and catastrophic). You need to plan for both, but quite possibly you can be a little more lenient on yourself for the catastrophes. I'm not going to get too far into the weeds with my specific implementation, as yours will likely be different. How you recover from each disaster really depends on the perceptions of your customers (what is their range of allowable excuses) and your employer's ownership contingencies (what is their insurance like, and what are their long-term objectives). The things you read in the "trades" about recovery time objective and recovery point objective (RTO and RPO) are still relevant, but they differ under each scenario.

A "major" is when the halon failed and a fire in your server room melted your NAS. Maybe some other equipment survived, and all the hardware outside the server room is just fine. Under this scenario both your customers and your management will extend you a fair amount of sympathy but they still expect some kind of reasonable recovery efforts. It widely varies by circumstance, but in my case they would expect that after a couple of days our systems would be up and running again, and at worse we might have to enter a half day's data over.

A "catastrophic" is when Hurricane Jeff rolls in from the gulf, your whole building is flooded, half the state is without power for two weeks, trees are down, roads are impassible, and the national guard is called out to maintain order and provide drinking water. In my case I suspect management could actually be called in to decide whether to walk away from the business by declaring a total insurance loss. If they did decide to continue, they could probably convince our customers that a month long gap in service was totally justified (of course we'd waive a couple months of bills out of courtesy).

You have to plan for recovery from a major disaster with the resources at your (nearly) immediate disposal. It helps to have both onsite and offsite copies of your software and your data; onsite copies are typically faster to access and should be counted on as the primary recovery source. Offsite resources might be compromised by the disaster (say, if the fire took down the switch hardware connecting you to your cloud provider). Offsite resources may, however, provide some resilience should you not end up with a complete set of onsite backups after the disaster.
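For illustration only, here's a minimal Python sketch of keeping an offsite copy alongside the onsite one. The paths, the .bak extension, and the replicate_new_backups helper are all assumptions for this example, not a description of my actual setup:

```python
import shutil
from pathlib import Path

# Hypothetical locations: a fast onsite backup share (the primary recovery
# source) and a slower offsite target (the catastrophic fallback).
ONSITE_DIR = Path(r"\\nas\backups")          # assumption: onsite NAS share
OFFSITE_DIR = Path(r"\\cloudgw\dr-backups")  # assumption: offsite/cloud mount

def replicate_new_backups(onsite: Path, offsite: Path) -> None:
    """Copy any backup file present onsite but not yet offsite."""
    offsite.mkdir(parents=True, exist_ok=True)
    for backup in onsite.glob("*.bak"):
        target = offsite / backup.name
        if not target.exists():
            shutil.copy2(backup, target)  # copy2 preserves file timestamps

if __name__ == "__main__":
    replicate_new_backups(ONSITE_DIR, OFFSITE_DIR)
```

The point isn't the tooling (robocopy, rsync, or your cloud vendor's sync agent all do this better); it's that the offsite copy gets refreshed automatically rather than whenever someone remembers.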

You have to plan for catastrophic recovery from the offsite resources alone. You may need to procure new hardware in a new location, and your old location might be permanently inaccessible.

Once you've got these concepts firmly in mind and know the RTO and RPO for each, you can work backward to determine your backup strategy. It doesn't make sense to back up your log files locally more often than your major-disaster RPO requires, or to ship offsite database log copies more often than your catastrophic RPO requires.
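To make that arithmetic concrete, here's a toy sketch. The RPO figures are made up for illustration (roughly matching the "half day's data" example above), and the safety factor is just one way to leave headroom for a failed backup:

```python
from datetime import timedelta

# Made-up targets for illustration; substitute the RPOs your customers
# and management actually agreed to.
MAJOR_RPO = timedelta(hours=12)         # e.g. re-enter at most a half day's data
CATASTROPHIC_RPO = timedelta(hours=24)  # a looser target for the big one

def max_backup_interval(rpo: timedelta, safety_factor: float = 2.0) -> timedelta:
    """Back up safety_factor times as often as the RPO strictly demands,
    so a single missed or failed backup doesn't blow the objective."""
    return rpo / safety_factor

print("local log backups at least every", max_backup_interval(MAJOR_RPO))
print("offsite log copies at least every", max_backup_interval(CATASTROPHIC_RPO))
```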

Planning to meet the Recovery Time Objective is rather more nuanced, and likely takes actual practice to discover. How long does it take you to restore a copy of each of your databases? How long to restore the subsequent differentials and log backups? Until you try it you won't know for sure. Also consider that your charges are usually blessed with running on the fastest I/O hardware; in a disaster or catastrophe you may well be restoring to considerably slower boxes.
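Practice is easier to learn from if you write the numbers down. Here's a rough sketch of timing a restore drill; the steps run placeholder commands (a short sleep), since the real restore commands depend entirely on your platform:

```python
import subprocess
import sys
import time

# Placeholder steps -- substitute your platform's actual restore commands,
# and run the drill on the slower DR hardware, not your production boxes.
RESTORE_STEPS = [
    ("full backup restore",  [sys.executable, "-c", "import time; time.sleep(1)"]),
    ("differential restore", [sys.executable, "-c", "import time; time.sleep(1)"]),
    ("log backup restores",  [sys.executable, "-c", "import time; time.sleep(1)"]),
]

def time_restore_drill(steps) -> None:
    total = 0.0
    for name, cmd in steps:
        start = time.monotonic()
        subprocess.run(cmd, check=True)   # raises if a step fails
        elapsed = time.monotonic() - start
        total += elapsed
        print(f"{name}: {elapsed:.1f}s")
    print(f"total drill time: {total:.1f}s (compare against your RTO)")

if __name__ == "__main__":
    time_restore_drill(RESTORE_STEPS)
```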

There now, that was fun, wasn't it!