You Need a Backup Plan
Novell Cool Solutions: Feature
Posted: 22 Sep 2004
One of our frequent forum contributors shares a recent real-life data nightmare - and how he survived it. The key: doing regular dsrepair -rc backups. The lesson learned: replicas won't always save the day ...
About 7 hours ago, someone inadvertently pulled both redundant power sources from an entire rack of servers. In this rack were three of our servers that hold replicas for our 72,000 students. There were only three replicas left, as replicas #4 and #5 had to be yanked due to memory problems. Those servers were in different racks -- doh! Unfortunately, as I am watching Server 1 come up, the DIB will not open - it's inconsistent. No biggie. Server 2 comes up ... and the same thing. Server 3 comes up, and the same thing - oh $+#@*! At this point, the students are hosed. Dsrepair will not fix the DIBs. I am not sure what was going on when they lost power, but it has been a long time since I have seen a DIB not open after a power outage. And this was three servers! They were running eDirectory 8.7.3.
Fortunately, we run dsrepair -rc every day at different times. One DIB set had been archived at 2 a.m. that morning, another at noon, and the third one was archived five minutes before the power loss ... Rather than restoring all DIB sets, I got Novell to restore the one from five minutes before. We then disabled sync on the other two, restored the old DIB sets, xk2-ed them and removed them from the ring. We're in the process now of re-adding the last replica of the three. With all of that, there was only a little more than 90 minutes of downtime for the students. Now I've got to get sp2/dsloader out so I can bring replicas #4 and #5 up again on different power sources.
So the moral of the story - do your "dsrepair -rc" every day on every server and make sure the servers get backed up. The backups may save you one day.
Staggered Backups with Cron
I stagger the backup times throughout the day. That way, if something very serious spreads throughout the ring, we have multiple points in time to choose from. Basically, the crontab for doing the backups looks something like this:
# dib archive
0 16 * * * load dsrepair -rc
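To illustrate the staggering idea, here is a minimal Python sketch that spreads the archive time evenly across the day for a list of servers and prints one crontab line per server. It is only an illustration, not the procedure described in this article: the server names are hypothetical, the even spacing is an assumption (the times here were simply chosen by hand), and each generated line would still have to be placed in that server's own crontab.

```python
# Sketch only: server names are hypothetical and the even spacing is an assumption.
def staggered_crontab(servers, command="load dsrepair -rc"):
    """Return {server: crontab line}, spreading run times evenly over 24 hours."""
    if not servers:
        return {}
    step_minutes = (24 * 60) // len(servers)
    entries = {}
    for index, server in enumerate(servers):
        # Convert the offset in minutes since midnight into hour and minute fields.
        hour, minute = divmod(index * step_minutes, 60)
        entries[server] = f"{minute} {hour} * * * {command}"
    return entries


if __name__ == "__main__":
    schedule = staggered_crontab(["SERVER1", "SERVER2", "SERVER3"])
    for server, line in schedule.items():
        print(f"{server}: {line}")
```

With three servers the runs land eight hours apart, so the third server ends up with the same 4 p.m. entry shown in the crontab example.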
The crontab entry above creates an archive every day at 4 p.m. Of course, this will also get backed up with the nightly backups. But in addition, we have a separate NetWare server with about 100GB on it. Every day, all DIB archives are copied to this central server under the server name. It overwrites the logs every 7 days, so we have 7 days on this server to go back to in case the tapes are hosed. This central server is also backed up. The structure is something similar to this:
data:
So if we have to get to an RC dump, there are several places to go:
- The server's DSR_DIB directory, but its contents get overwritten every day
- The dib archive server for the past 7 days
- The dib archive server on tape
This is the first time we have lost an entire partition, and I am glad this process was in place. I urge everyone to do something similar to this. I am so paranoid now, I may start burning DVDs once a week ...
Copying the DIBs to the Backup Server
To copy the DIBs to the backup server, I simply use Toolbox. I keep it authenticated on a server and lock down the account for only that server. I also give that ID only RF rights to sys:system\dsr_dib, in case someone gets hold of the account. Even though all the servers are in the data center, I still get a little paranoid. I have 7 NCF files, each one cronned to run once a week and each one responsible for copying to a different day of the week. (I was going to write a Perl script to figure out the date and where to copy, but decided for simplicity just to use the 7 NCF files, one for each day ...)
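The "figure out the date and where to copy" logic can be sketched in a few lines. The example below is in Python rather than Perl, and it is not the Toolbox/NCF setup described above. It assumes the DSR_DIB directory and the archive volume are both reachable as ordinary file paths from wherever the script runs, and the server-name/weekday folder layout and the function names are hypothetical.

```python
import datetime
import shutil
from pathlib import Path

# Hypothetical folder names, one slot per weekday (Monday first).
DAY_NAMES = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


def archive_destination(archive_root, server_name, when):
    """Return archive_root/<server_name>/<weekday> for the given date."""
    return Path(archive_root) / server_name / DAY_NAMES[when.weekday()]


def copy_dib_archive(source_dir, archive_root, server_name, when=None):
    """Copy source_dir into this weekday's slot, replacing last week's copy."""
    when = when or datetime.date.today()
    destination = archive_destination(archive_root, server_name, when)
    if destination.exists():
        # The slot still holds the archive from 7 days ago; overwrite it.
        shutil.rmtree(destination)
    shutil.copytree(source_dir, destination)
    return destination
```

Because each weekday maps to its own folder, a single daily job would keep a rolling 7 days per server, which is what the seven separate NCF files achieve.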
Editor's note: Here's your chance! Send us a good PERL script for automating the backup process (timing, copying locations, etc.). If we publish yours, you'll win a free T-shirt.
Novell Cool Solutions (corporate web communities) are produced by WebWise Solutions. www.webwiseone.com