Troubleshooting System Issues with DS Expert
Novell Cool Solutions: Trench
By Juli Kerr
Posted: 27 Feb 2002
The team at NetPro are constantly creating award-winning tools to help systems administrators manage their network directories. For more information on their product offerings, visit their web site at www.netpro.com.
Most eDirectory administrators start out with the best of intentions. They take responsibility for the company's mission-critical tree and plan in earnest to keep close track of changes and additions, replica rings, and partitions.
Then reality hits. Within a matter of weeks, administrators are in the thick of meetings and firefighting, and documentation takes a back seat. Only when they stare down the mysterious path of a far-reaching directory error do they realize that prioritizing their best intentions could have saved them from a company-wide network crisis.
One company's recent experience is a case in point. It was a typical workday for a busy administrator until the calls started coming in to the help desk. Users were reporting both login and authentication issues, and upper management called because they had got wind of increased call traffic to the help desk. A series of critical eDirectory servers had crashed, and the impact was devastating. In the end, the key to troubleshooting and fixing the problem was determining which servers held real copies of a partition. Lack of documentation made that task nearly impossible, so the administrator turned to two Novell tools -- DSRepair and NDS Manager -- to assess replication status.
But, as this company later learned, there is another way -- NetPro's DS Expert. The industry leader in real-time eDirectory monitoring and alerting, DS Expert provides vital information regarding the status of the environment, including replica rings, partitions and replication status. DS Expert delivers information without time-consuming, manual processes that may be impossible to perform after a serious directory problem occurs.
Getting to the root of the problem.
Various circumstances made it especially difficult to determine which server had real copies of a partition. The company had Read/Write and Master replica holder servers near the top of the tree. The tree hosted approximately six of these servers. A few of the servers held as many as 80 replicas, while other servers in the tree hosted 25 to 30. Even with that many replica holder servers, delivering the information should not have been a problem. But because the replication methodology and the corresponding documentation were poorly executed, it was nearly impossible. In fact, the administrators didn't know for certain that Server 2 held the Master for the southeastern partition, and that Server 1 hosted the corresponding Read/Write. In addition, there was no documentation stating which of the four local servers could possibly hold a Read/Write replica. To make matters worse, one of the servers that crashed held an equal number of Read/Write replicas and Masters.
The system displayed -603 errors (indicating No_Such_Attribute associated with a server name), and the effort that ensued to get to the root of the problem, outlined below, was both manual and painful.
- First, the IT team determined that the problem was either the public key or the remote ID.
- They ran DSRepair to verify remote server IDs.
- The errors appeared, so the team ran DSRepair again to verify the remote server IDs, and again the -603 errors appeared.
- The -603 error likely occurred because public key problems cannot be repaired unless at least one server in the tree is authenticating without problems to the target server. The server authenticating properly to the target server must also have a real copy of the target server object, so it must have a replica (other than a sub ref) of the partition holding the target server object.
- At this point, the administrator called Novell Technical Support, which recommended running NDS Manager.
- The administrator located a server in another region where authentication could occur, mapped a drive and waited for NDS Manager to load.
- The administrator selected the partition and, in a matter of minutes, a -626 error (indicating all_referrals_failed) appeared on the screen.
There are times when administrators must know exactly which replicas, and of which types, are stored on which servers. Within NDS Manager, the server view can take a considerable amount of time to load, depending on the size of the database. When problems are brewing, even those mere minutes can feel like hours. And even after this significant effort, the location of the server hosting the real partition is still not known, and the process must be repeated until it is identified.
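The repair precondition described in the steps above (at least one server must authenticate to the target server and hold a replica, other than a subordinate reference, of the partition containing the target server object) can be written as a small check. This is only an illustrative sketch: the `ServerStatus` record, its field names and the `can_repair_public_key` function are hypothetical and are not part of DSRepair, NDS Manager or any Novell API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class ReplicaType(Enum):
    MASTER = "Master"
    READ_WRITE = "Read/Write"
    READ_ONLY = "Read-Only"
    SUBORDINATE_REFERENCE = "Subordinate Reference"


@dataclass(frozen=True)
class ServerStatus:
    """Hypothetical record of what an administrator knows about one server."""
    name: str
    # True if this server currently authenticates to the target server.
    authenticates_to_target: bool
    # Replica this server holds of the partition containing the target
    # server object, or None if it holds no replica of that partition.
    replica_type: Optional[ReplicaType]


def holds_real_copy(server: ServerStatus) -> bool:
    """A real copy is any replica other than a subordinate reference."""
    return (
        server.replica_type is not None
        and server.replica_type is not ReplicaType.SUBORDINATE_REFERENCE
    )


def can_repair_public_key(servers: Iterable[ServerStatus]) -> bool:
    """True if at least one server both authenticates to the target
    and holds a real copy of the target server object."""
    return any(s.authenticates_to_target and holds_real_copy(s) for s in servers)
```

Both conditions must hold on the same server. A server that authenticates but holds only a subordinate reference does not qualify, and neither does a Master holder that cannot authenticate, which is why knowing the replica types on each server matters so much here.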
Troubleshooting the problem with NetPro's DS Expert.
To find the server holding the real partition using DS Expert, the company completed the following simple steps:
- The administrators installed the DS Expert monitor on a server in the tree.
- They loaded the agents on one or several replica holder servers.
- Using API calls and tree walking, DS Expert gathered partition, replica and server information from the other replica holder servers.
- Within approximately 10 minutes, DS Expert showed the administrators which servers held real copies of the partitions.
- The administrator shared the information Novell required to begin the troubleshooting process in earnest.
- Using DSDump, Novell began the process of rebuilding the disabled section of the tree.
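The core question in the steps above, which servers hold a real copy of a given partition, comes down to a filter over per-server replica lists. The sketch below is not DS Expert's implementation or data format: the `reports` dictionary, the partition label "southeastern" and "Server 3" are assumptions made for illustration, standing in for whatever the agents on each replica holder server actually report.

```python
from typing import Dict, List, Tuple

# Replica types that count as a real copy of a partition.
# Subordinate references are deliberately excluded.
REAL_REPLICA_TYPES = {"Master", "Read/Write", "Read-Only"}

# Hypothetical shape: server name -> list of (partition, replica type).
ReplicaReports = Dict[str, List[Tuple[str, str]]]


def real_copy_holders(reports: ReplicaReports, partition: str) -> Dict[str, str]:
    """Return {server name: replica type} for every server holding a
    real copy of the named partition."""
    holders: Dict[str, str] = {}
    for server, replicas in reports.items():
        for replica_partition, replica_type in replicas:
            if replica_partition == partition and replica_type in REAL_REPLICA_TYPES:
                holders[server] = replica_type
    return holders


# Illustrative data only, loosely modeled on the servers named in the article.
reports: ReplicaReports = {
    "Server 1": [("southeastern", "Read/Write")],
    "Server 2": [("southeastern", "Master")],
    "Server 3": [("southeastern", "Subordinate Reference")],
}

holders = real_copy_holders(reports, "southeastern")
```

With this sample data, `holders` contains Server 1 and Server 2 but not Server 3, because a subordinate reference is not a real copy. The filter itself is trivial; the hard part in the case described here was obtaining reliable per-server data when some servers could not be reached.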
DS Expert derives information from all of the servers in a replica ring, not just the server holding the master. DS Expert also displays individual views of each server, which gives administrators vital information regarding the accurate state of important replica rings. So, not only does DS Expert ensure the health and performance of eDirectory by troubleshooting it 24x7, it can also serve as a critical alternative to NDS Manager when communication difficulties make enterprise-wide systems issues hugely challenging to troubleshoot.