Disaster Recovery, Part 4 - Planning for the Future
Novell Cool Solutions: Feature
By Timothy Leerhoff
Posted: 27 Oct 2004
The Disaster (part 4 of 4)
OK - now it's time to figure out how to plan for a disaster and prepare to recover from it. An important thing to understand is that there's no single answer that works for everyone. There are multiple layers to be considered for every plan. These can range from "how do I handle a server hard drive that goes bad" to "what do we do if the ground opens up and swallows our building?"
Each of these disasters can become very expensive and possibly insurmountable if you are not well prepared to handle the situation. At a minimum, jobs may be on the line, and one of them might be yours. Industry analysts have said that a significant percentage of companies that experience disasters will go out of business in a relatively short time.
There is a strong need to balance the following elements in your requirements for data recovery:
- The time frame in which all or part of the company must be fully functional
- Price of software and/or hardware and maintenance of the solution
To find this balance, you must realize that each area of business requires a plan, and that within each area there can be several sectors that have different recovery plans.
Various business areas may include the following kinds of infrastructure:
Digital infrastructure is anything and everything connected to computers, including the computers themselves. These items include PCs, servers (and any storage connected to the servers), switches, routers, firewalls, PDAs, scanners, printers, etc.
Analog infrastructure can be considered to be the office equipment that is not connected to the computer systems. This includes but is not limited to phone systems, copiers, fax machines, and intercoms.
Physical infrastructure is primarily considered to be the internal and external building space. This would include desks, filing cabinets, lighting, parking, etc. In a disaster you could potentially lose the usage of an entire building or only a part of a building. You will need to be ready to move users to another building or area and re-attain productivity. This does not mean you will achieve 100% productivity, but at minimum you'll get an acceptable level.
The Digital Infrastructure
Note: This article discusses only the core digital infrastructure plan concepts, but these ideas also transfer to the other areas.
The first thing needed to start a plan is a complete inventory. How can you replace or recover functionality without knowing what you need to get running? The inventory needs to include model and serial numbers. For PCs and servers you will also want a list of all software installed on the computers.
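As a rough illustration of such an inventory, here is a minimal Python sketch. The record layout, the field names, and the sample entries are all assumptions made up for this example; the only requirements taken from the text are model and serial numbers for everything, plus a software list for PCs and servers.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InventoryItem:
    """One piece of digital infrastructure (field names are illustrative)."""
    name: str
    kind: str                 # e.g. "server", "pc", "switch"
    model: str
    serial_number: str
    software: List[str] = field(default_factory=list)  # expected for PCs and servers


def missing_details(items):
    """Return the names of items whose inventory entry is incomplete."""
    incomplete = []
    for item in items:
        needs_software = item.kind in ("pc", "server")
        if not item.model or not item.serial_number:
            incomplete.append(item.name)
        elif needs_software and not item.software:
            incomplete.append(item.name)
    return incomplete


# Made-up sample data: PC7 has no serial number recorded yet.
inventory = [
    InventoryItem("FS1", "server", "ExampleModel 100", "SN-0001", ["ExampleOS 1.0"]),
    InventoryItem("SW1", "switch", "ExampleSwitch 24", "SN-0002"),
    InventoryItem("PC7", "pc", "ExampleDesk 5", ""),
]
print(missing_details(inventory))
```

A check like this is only a convenience; the point is that gaps in the inventory are found before the disaster rather than during it.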
The next major item to acquire or generate is an infrastructure map. Again, if you don't know what you have, how can you re-create your infrastructure in a hurry and under pressure, if at all?
Setting Recovery Priorities
In the digital area you must first break down each sector into components or component groupings. Then you should prioritize each area or component by order and speed of recovery. Speed is divided here into wanted and mandatory levels. While the wanted recovery time for a crashed server might be 5 seconds, the mandatory recovery time may be 12 hours. This potentially large difference will help you develop an orderly, timeline-arranged checklist.
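One way to turn those priorities into a checklist is sketched below in Python. The component names, the priority numbers, and the rule of breaking ties by the tightest mandatory deadline are assumptions for illustration; only the 5-second wanted and 12-hour mandatory figures come from the example above.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    priority: int            # 1 = recover first (hypothetical ranking)
    wanted_hours: float      # recovery time we would like
    mandatory_hours: float   # recovery time we must not exceed


def recovery_checklist(components):
    """Order components by priority, then by the tightest mandatory deadline."""
    return sorted(components, key=lambda c: (c.priority, c.mandatory_hours))


components = [
    Component("print server", 3, 4.0, 48.0),
    Component("file server", 1, 5 / 3600, 12.0),   # 5 seconds wanted, 12 hours mandatory
    Component("mail server", 2, 1.0, 24.0),
]

for step, c in enumerate(recovery_checklist(components), start=1):
    print(f"{step}. {c.name}: mandatory within {c.mandatory_hours:.0f} h")
```

Keeping the wanted and mandatory times side by side makes it easier to see where paying for a faster solution would actually matter.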
The next thing on the agenda should be to start calculating the recovery procedures and technologies. To do this we need to understand a few concepts.
Disaster Recovery Concepts
The terminology of Disaster Recovery Planning (DRP) can be quite confusing, even to experienced IT professionals who do not work on DRP every day. With cryptic acronyms, enigmatic technical jargon, and confusingly similar terms, DRP can seem like a technical minefield, but an understanding of the terms can be essential to maximizing your capability while minimizing costs. As explained in a Veritas article (http://www.veritas.com/van/articles/3943.jsp), strange and arcane terms are often used in proposals for protecting business systems, so a common understanding of the language is important.
During disaster planning you need to think in two separate, yet related time references. The first is the time period before the disaster, called Recovery Point Objective (RPO). The second is the time period after a disaster, called Recovery Time Objective (RTO). These references are important as they will help you define the disaster recovery processes and products as well as the costs involved.
These time periods are normally measured in a range of seconds, minutes, hours, days, or weeks. Each of the time period ranges will require different hardware and software, and the recovery system maintenance costs change.
Question 1: How Much Can You Lose?
The first question to contemplate is how much data you can afford to lose, at least temporarily, until it is re-keyed into your files. This data loss point is the RPO: the maximum time that may elapse between the last complete data backup or copy that can be reproduced and the disaster itself. The strategy or process used to maintain a reliable data backup will change as the allotted time frame decreases, and the price can climb steeply. The sheer size of the total data store can compound these issues.
When I visited one client and examined their existing recovery strategy, I found the company was using tape backups as the primary disaster recovery method. When I asked how much data they could afford to lose in a disaster, the knee-jerk reaction from the company executives was "none." My client's jaw dropped, and he sat in wide-eyed horror when I told him that his present data recovery method could easily result in a 24-hour data loss.
A nightly tape backup is the de facto norm at most of the clients I've had the pleasure to visit. The data rate of the backup procedure or process becomes a critical factor when larger and larger data stores need to be archived with shorter and shorter acceptable restore times. There are hefty differences in tape drives when you start looking at speed and capacity. Using multiple tape drives on a backup system concurrently will improve your total throughput, but there are limits to the improvements.
Each client has his own way to do backups. One has a tape drive on every server at every location. While the throughput is relatively high with the tape drive directly connected to the server, making sure all the tapes are changed every day is difficult, considering the multiple locations. Because the IT staff is centralized, a local secretary or administrative assistant changes the tapes, if they remember.
Another client has a centralized backup server that backs up all the data from all of the servers at all of the sites. This eliminates the tape switching issue, but with the slower backup speeds over the WAN links it takes 14-16 hours to complete the incremental backup for one day. This means that there may be open files that are missed, and the WAN links have backup traffic during peak traffic times during the day. This can slow down business processes.
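To see how a slow link stretches the backup window, a back-of-the-envelope calculation helps. The Python sketch below is only an estimate under stated assumptions: the data volume and the link speed are hypothetical figures (not the client's actual numbers), units are decimal, and protocol overhead is ignored.

```python
def backup_hours(data_gb, throughput_mbit_per_s):
    """Hours needed to move data_gb gigabytes at a sustained link rate.

    Uses decimal units (1 GB = 8000 megabits) and ignores protocol overhead.
    """
    megabits = data_gb * 8000
    seconds = megabits / throughput_mbit_per_s
    return seconds / 3600


def fits_window(data_gb, throughput_mbit_per_s, window_hours):
    """True if the backup finishes inside the allowed off-peak window."""
    return backup_hours(data_gb, throughput_mbit_per_s) <= window_hours


# Hypothetical: 10 GB of daily changes over a 1.5 Mbit/s WAN link
print(round(backup_hours(10, 1.5), 1))          # about 14.8 hours
print(fits_window(10, 1.5, window_hours=8))     # False: spills into the workday
```

With these assumed figures the job cannot finish in an 8-hour overnight window, which is the same kind of overrun described above.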
The maximum loss timeframe for a normal nightly tape backup is roughly 24 hours. This can be quite acceptable for a company that can easily re-key one day's worth of information, while at the same time it would be very unacceptable for an Internet store like Amazon.com.
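The arithmetic behind that 24-hour figure can be sketched in a few lines of Python. The function names and the sample timestamps are made up for illustration; the underlying idea, that the worst-case loss equals the interval between good backups, is the one described above.

```python
from datetime import datetime, timedelta


def data_loss(last_good_backup, disaster_time):
    """Work lost is everything entered since the last restorable backup."""
    return disaster_time - last_good_backup


def meets_rpo(backup_interval, rpo):
    """Worst case, the disaster strikes just before the next backup runs."""
    return backup_interval <= rpo


# Hypothetical timeline: backup at 11 PM, disaster at 10:30 PM the next day
last_backup = datetime(2004, 10, 26, 23, 0)
disaster = datetime(2004, 10, 27, 22, 30)
print(data_loss(last_backup, disaster))

nightly = timedelta(hours=24)
print(meets_rpo(nightly, rpo=timedelta(hours=24)))   # acceptable if a day can be re-keyed
print(meets_rpo(nightly, rpo=timedelta(minutes=5)))  # not acceptable for an online store
```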
Question 2: How Long is Too Long?
The second question to contemplate is the maximum length of time your network and its data can be down before the monetary loss from system or data unavailability is unbearable. The shorter the RTO, the pricier the solution may be.
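The trade-off between a shorter RTO and a pricier solution can be roughed out numerically. The Python sketch below is a simplification under loud assumptions: the option names, annual costs, recovery times, hourly loss figures, and the one-outage-per-year rate are all invented for illustration and are not vendor pricing.

```python
def downtime_cost(rto_hours, loss_per_hour):
    """Money lost while systems are down for rto_hours."""
    return rto_hours * loss_per_hour


def cheapest_total(options, loss_per_hour, outages_per_year=1):
    """Pick the option with the lowest annual solution cost plus expected downtime loss.

    Each option is a (name, rto_hours, annual_cost) tuple.
    """
    def total(option):
        _name, rto_hours, annual_cost = option
        return annual_cost + outages_per_year * downtime_cost(rto_hours, loss_per_hour)

    return min(options, key=total)


# Hypothetical options and costs
options = [
    ("nightly tape restore", 24.0, 5_000),
    ("two-node cluster", 0.1, 40_000),
]

print(cheapest_total(options, loss_per_hour=500)[0])     # downtime is cheap: tape wins
print(cheapest_total(options, loss_per_hour=5_000)[0])   # downtime is costly: cluster wins
```

The same comparison with your own figures shows at what hourly loss the more expensive solution starts paying for itself.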
One of the latest waves is clustering, where multiple servers act as one virtual server. If one server has a problem, it can "fail over" the services it offers the users to another server. This also allows upgrading systems without stopping user services. For example, all services are moved off the candidate server for an upgrade or maintenance, then the upgrade is completed and the server is brought back into the cluster. The clustering solution requires multiple servers and, normally, a shared storage solution called a SAN or Storage Area Network.
My clients have clusters from 2 servers to 9 servers. Generally speaking, the more servers in the cluster the more flexible the cluster is. One of my clients uses multiple two-node clusters. These can be easier to implement in a more diverse company, whether that diverseness is physical or political.
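The failover idea described above can be modeled in a few lines of Python. This is a toy sketch only: the `Cluster` class, its method names, and the "least-loaded surviving node" rule are assumptions for illustration and do not represent any vendor's clustering product or API.

```python
class Cluster:
    """Toy model of service failover between cluster nodes."""

    def __init__(self, nodes):
        self.services = {node: [] for node in nodes}   # node -> hosted services
        self.online = set(nodes)

    def start(self, service, node):
        self.services[node].append(service)

    def take_offline(self, node):
        """Move every service off `node` to the least-loaded surviving node."""
        self.online.discard(node)
        if not self.online:
            raise RuntimeError("no surviving node to fail over to")
        for service in self.services[node]:
            # sorted() makes tie-breaking deterministic (alphabetical)
            target = min(sorted(self.online), key=lambda n: len(self.services[n]))
            self.services[target].append(service)
        self.services[node] = []

    def bring_online(self, node):
        """Return a repaired or upgraded node to the cluster."""
        self.online.add(node)


cluster = Cluster(["node1", "node2", "node3"])
cluster.start("file", "node1")
cluster.start("print", "node1")
cluster.start("mail", "node2")

cluster.take_offline("node1")    # planned upgrade or an actual failure
print(cluster.services)
cluster.bring_online("node1")    # upgraded server rejoins the cluster
```

In this model a two-node cluster has exactly one place to send services, while larger clusters can spread the load, which is one reason more nodes tend to mean more flexibility.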
Choosing the correct combination for your situation may be easier, and more acceptable to management, if you bring in an outside consultant who is respected by your company's upper echelon. A respected outside voice helps impress the importance of this topic on the appropriate people in your organization. That makes it easier to get the needed buy-in from management, so your DR project has a good chance of demonstrable success.
Once you have picked the business continuity solution types, you need to find a vendor who can supply your needs. You can get business continuity solutions from many vendors - pricing will rise as capabilities and capacities increase. The rising cost of the more complex recovery options is the reason to fully define RPO and RTO targets that meet your office's requirements, without blowing the entire profit structure for the next decade.
Once you have identified your needs and the business continuity solutions that will work for you, write it all down. Cover all the possible levels of disasters, and work out the recovery processes. Make sure this is all in writing and kept in a secure location - this is very important!
Testing ... Testing ...
Now that you have a plan, it's time to test one small part of it. For example, if a part of your DRP relies on a tape backup of a server, restore that backup to a test machine and evaluate how the test went. Modify the plan to incorporate the knowledge you acquired in testing. Move on and test another part of the plan. Continue testing until you have checked the entire plan. Then test the whole plan again in larger chunks and modify the plan to improve the process and procedures again.
Once you have tested and verified the plan, guess what - you still need to test. Testing is like practice on an athletic team. To summarize: research, research, plan, and test.
Research items to consider:
- Research exactly what your office needs.
- Research the solutions that will provide for your needs.
- Plan out every part in writing.
Processes to consider:
- Emergency hardware procurement (rental or extra on-site)
- Replacement hardware procurement (purchase)
- Replacement software (cache of CD copies with keys or vendor)
Procedures to consider:
- Who does what (including backup personnel for each part of the plan)
- Who to call (vendors and management)
- What to get, and where
And the most important item - test! And test frequently.
I hope this helps at least one person to not go through what I did. Good luck and good planning to all!
Other articles in this series:
- The Disaster (Part 1) - Underwater Data
- The Disaster (Part 2) - Hard and Soft Data Recovery
- The Disaster (Part 3) - If You Rebuild It, They Will Come