Nervous about doing a Pool Rebuild
Novell Cool Solutions: Tip
Posted: 16 Mar 2004
Question: I am kind of nervous about doing an nss /poolrebuild. I need to do one, but I always see a little disclaimer attached to this command -- (Have good backup).
I have good backups, but my volumes are massive and my datastore quite large. The thought of doing a rebuild overnight and having it trash a volume makes me nervous. Is this a risky thing to do?
Answer: We checked with some experts and they all feel pretty secure about it. As one said, "I've seen a rebuild trash a volume exactly once in 5 years." Anyone out there have a bad experience with this? Let us know.
- Randy Grein
- Tim Malloy
- Mark Batchelor
- Mark Steffen
- Roger Zan
- Chris Nevener
- Kenneth Fribert
- Erik Wiemann
- Fokke Stastra
- Curtis Hall
- Kyle Wyrick
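For reference, the console commands under discussion look roughly like this (a sketch based on NetWare 6 syntax; POOLNAME stands in for your actual pool name):

```
nss /pools                    (list the NSS pools on this server)
nss /poolverify=POOLNAME      (read-only consistency check of the pool)
nss /poolrebuild=POOLNAME     (rebuild the pool's metadata trees)
```

The verify is non-destructive; the rebuild rewrites pool metadata, which is why the "have good backup" disclaimer travels with it.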
Yes, as a matter of fact I've seen pool rebuilds fail three times: once on a server with a marginal IDE drive (mine!), and again with a two-node Compaq cluster on NetWare 5.1. I later found the administrator had been failing over the cluster by downing one of the nodes, which sometimes crashed both servers. The cumulative damage was more than a pool rebuild could fix, but to be fair, by the time I got there, there were all kinds of problems. The worst part was that they had failed to notify me that they were having backup problems...
I had a bad experience with nss /poolrebuild. It trashed four different volumes after my upgrade to NW6 about a year ago. I ended up with about half the data after the rebuild and had to restore all four volumes from backup. And this was with the NSS2C patch-level NLMs.
I think my problem came from the upgrade from NW5.1 NSS to NW6 NSS. I, of course, was stupid and didn't upgrade the volumes one at a time. I won't make that mistake again. The servers ran for a while before deciding that the volumes were corrupt. I haven't had any problems with NSS since I recreated my volumes from scratch and restored the data onto them, though. NSS has been solid as a rock. (Knock, knock on a piece of wood...)
I can't say that I've actually seen it trash a volume. But I have seen it take a very, very long time to complete: 12 hours or more. In my opinion, that is almost as bad as trashing the volume. When a business has to be down for 12 hours or more to fix a minor file-system problem, there is a problem.
We have had to run plenty of /Poolrebuild operations on NW6 servers and the vast majority have not resulted in problems.
Our worst case involved a SCSI-based clustered storage device. We were running a popular backup vendor's product whose open-file solution had a bug of its own on compressed NSS volumes, and that third-party bug is what corrupted the data. After the pool rebuilds we had many corrupt files: the volumes themselves mounted fine, but the corrupted files were all 0-byte files that could not be opened. We ended up fielding lots of requests for individual file restores from users.
I've recently performed an NSS /POOLREBUILD on the SYS volume and all went well. The key is to make sure you have the latest OS and NSS patches for your version of NetWare. I am running NW6 SP3, so I had the post-SP3 NSS3C patches installed accordingly. You should still have a backup of the SYS volume, and run a DSREPAIR -RC to dump out a dibset in case it ever comes down to restoring DS on it, but I have never run into any problems yet (knock on wood).
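The dibset dump mentioned above is taken at the server console before servicing SYS; a minimal sketch, using the -RC switch named in the tip (which dumps the local DS database so it can be restored later):

```
load dsrepair -rc
```

The resulting dibset gives you something to fall back on if Directory Services ever has to be restored after work on the SYS volume.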
As I understand it, when an NSS pool is identified as corrupt using the VerifyPool option, what happens is a comparison between the Beast Tree and all the other trees: the File Tree, Directory Tree, Free Tree, and so on.
When corruption occurs on a journaled NSS volume, it is usually only a small number of sectors that are affected. Should this, for example, prevent the volume from being mounted, a RebuildPool is in order. The RebuildPool simply throws the File Tree, Free Tree, Directory Tree, etc. away completely and recreates them from the Beast Tree. This means that whatever data is properly cataloged in the Beast Tree, and ONLY the data referenced in the Beast Tree, is accessible after the volume mounts. It is this recreating of all the trees and linking them back together that takes so much time.
We have never used this option on a volume reported as corrupt without hosing some files. We have lost at least "some" data every time.
I've got a server with a bunch of 250 GB discs.
First I used a Maxtor (Promise) controller. It continually disconnected the discs, mostly during heavy writes to the server, and each disconnect meant a rebuild was required.
Then I switched to an Adaptec 2400A controller, using it just as a JBOD controller. It also disconnected discs several times. Adaptec couldn't help me ("it must be NetWare"), and there was a lot of running around testing all sorts of things.
So all in all I've had to run several rebuilds for this issue.
I'm now using a Promise RAID controller (still in a JBOD config), and that has been very stable for a long time. I still get the occasional hard reset on this server because it runs a LOT of products (it's a test server), and each time NSS has cleared it without any problems. It is of course running the latest NSS level on SP3, which I think is important, though not critical (the Maxtor controller was running on the NW6 beta).
A few years back I upgraded our NetWare 4.11 servers to NetWare 6. That was even before SP1 came out.
I had a habit of running VREPAIR on our NetWare 4.11 servers every time I had to reboot them. It always worked fine without any problems.
So now I had this new NetWare 6 server on new and shiny hardware, and one evening I had to do a planned reboot due to an anti-virus software problem, so naturally I did an nss /poolrebuild to check for errors as part of the process.
That was a big mistake. It ran for 8 hours and got to 98.7 percent done. Then it ground to a halt. The server was still working but the poolrebuild never got any further.
I had to remove the volumes on the pool, recreate them, and restore all the data from tape. I worked from 6 p.m. to 12 a.m. When our 550 employees started to show up for work in the morning, I had all the data recreated on a spare server; I just needed to set up the printers (64 of them).
What I learned from this experience:
- Don't run a server in a working environment without at least one support pack. (Up for debate - I know!).
- Don't start with nss /poolrebuild - run nss /poolverify first, and rebuild only if necessary.
- Don't repeat old habits on new software - study the manuals.
- Don't put volumes for different tasks in same pool. (I lost the printers because I used a volume in the pool for spooling.)
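The verify-first lesson above can be sketched as a console sequence (the pool name DATA is a placeholder; this follows the command syntax used elsewhere in this tip):

```
nss /poolverify=DATA          (check only; does not modify the pool)
nss /poolrebuild=DATA         (run only if verify reports corruption,
                               and only with a known-good backup)
```

The verify tells you whether the expensive, destructive step is needed at all; the rebuild itself should be a last resort, not routine maintenance.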
Since applying support pack 1 I have had no problems verifying or rebuilding pools.
I've done a poolrebuild on a couple of volumes with no problems. The smallest volume was 50 GB and the biggest about 300 GB.
I once had a pool that deactivated and dismounted its volumes about once a week with error 20444 (beastTree.c) at block 0 (file block 0).
I ran /poolverify and /poolrebuild several times with various switches, but the pool continued to deactivate. I ended up deleting the pool and restoring all the volumes. One volume hosted home directories, so I had to reassign these for all users.
After all this, the pool still kept deactivating.
Eventually I created an additional pool containing only one volume, which hosted Arcserve. The problem has never returned, and I always run Arcserve in its own pool these days.
Novell Cool Solutions (corporate web communities) are produced by WebWise Solutions. www.webwiseone.com