The Art of Troubleshooting
Novell Cool Solutions: Feature
By Nancy Cadjan
Posted: 19 Jul 2000
OK, your morning started out great. You get to work early and enjoy the calming hum of your servers happily working away. It's almost like being the ruler of a peaceful kingdom of bits and bytes. Cool.
Then it happens. All hell breaks loose. People start logging in and they wreak havoc with your beautiful world. Suddenly, you find yourself going to war against the network monsters. Sally can't log in (and she actually remembers her password this time). Bob can't print to the Floor 3 South Printer. Users in the R&D division can't get access to a volume on the network. Your boss is trying to log in from a conference in Boca Raton (poor guy) and he says RADIUS isn't working. You are supposed to be testing a new version of Novell Client, but you can't seem to get the syntax right in the ACU login script you added to user login. By lunchtime, you are completely overwhelmed and you don't have answers to anything.
Sound familiar? Yup. We've been there. However, we have also gleaned some basic principles of troubleshooting that you can use to make sense of what is going on. Even though you might feel like it some days, your network is not out to get you. There is (most of the time) a logical explanation for everything. However, finding that explanation may take more than facts. There is an art to troubleshooting that includes more than just technical knowledge.
A Few Troubleshooting Skills
Troubleshooting and the patience of Job are what make a good network administrator great (and what keeps him out of the insane asylum). There are a few techniques or personal traits that great troubleshooters have in common, including:
Remain Relatively Calm
Even though it seems like the world is coming to an end, step back and be objective. This sounds New Age-ish, but take a deep breath and clear your mind of the static. Close the door on the screaming users and concentrate.
Talk out the Problem
If you can, find someone to talk to about the situation. They might have the answer. If you can't find someone, talk out loud. So people think you are schizo; they will be happy when the network comes back up.
Keep Meticulous Notes
OK, you flunked History 101 in college because your notes said something about George Washington fighting in Vietnam and George McCarther in the Revolutionary War. Time to hone your note-taking skills. Write down everything you know about the situation, what you have tried, what worked and what failed. List all potential causes and solutions. Then, keep taking notes as you troubleshoot and find the answer. If you've kept good notes in a safe place like a notebook, you can use them again when you have a similar situation. Over time, your notes may even help you find little glitches in your network.
Don't Focus Too Soon
Don't jump to conclusions. Keep all options open. Look at the entire environment. List what has changed and what seemingly inconsequential events have recently occurred. Recently, a colleague got a new workstation. Every afternoon around 3:00, it just shut down. He couldn't figure it out. He looked at every "possible" conclusion and was about to return the workstation and tell the manufacturer that he had a defective hard drive. Later, he realized that it was the seemingly inconsequential act of turning his computer on its side that caused the problem, since the vents were located on the side of the case and were now blocked.
If the answer is not forthcoming, let it go for a while and don't concentrate on it. Einstein said that he got the answers to questions in the space between thoughts in his mind. When he stopped thinking about the problem, the answer came. Some may think you are giving up, but you are a bulldog waiting for the opportunity to attack.
Try to Isolate the Problem
Sometimes, if you can isolate the problem to one part of the network, the number of factors is reduced and the answer becomes clear. Here are a few steps to isolating problems:
- Decide if the error is temporary or persistent. This tells you whether the problem is a response to some condition that affects the network or whether it's generated by some inconsistency that will need to be resolved.
- Decide if the problem is localized in one area of the network. Maybe there is one server that is a common denominator.
- Decide if the problem happens at a specific time of day. If so, what activities are happening at the same time that could cause or aggravate the problem?
- If you can, get your network back to a "known good" state or a baseline, such as the default configuration or a standard set of operations. If this eliminates your problem, start applying changes one at a time until the problem reappears; the last change applied is the likely cause.
- Try to understand how the cause and effect are related so that you can prevent this from happening in the future (and document your findings).
Try the most logical possibilities first, but don't underestimate any factors you can think of. And be patient. In the end, you'll feel like Sherlock Holmes: "It's elementary, dear Watson."
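The "known good" step in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the configuration dictionary, the three named changes, and the `is_healthy` check are all hypothetical stand-ins for whatever your baseline and your real test ("can Sally log in?") happen to be.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Change = Tuple[str, Callable[[Dict], Dict]]


def find_breaking_change(
    baseline: Dict,
    changes: Iterable[Change],
    is_healthy: Callable[[Dict], bool],
) -> Optional[str]:
    """Reapply changes one at a time to a known-good baseline.

    Returns the name of the first change after which the health check
    fails, or None if every change can be applied without trouble.
    """
    if not is_healthy(baseline):
        raise ValueError("baseline is not healthy; pick an earlier state")

    config = dict(baseline)
    for name, apply_change in changes:
        config = apply_change(config)
        if not is_healthy(config):
            return name
    return None


# Hypothetical toy configuration and change history.
baseline = {"frame_type": "802.2", "client_version": 1, "printers": 3}
changes = [
    ("add printer", lambda c: {**c, "printers": c["printers"] + 1}),
    ("switch frame type", lambda c: {**c, "frame_type": "802.3"}),
    ("upgrade client", lambda c: {**c, "client_version": 2}),
]


def is_healthy(config: Dict) -> bool:
    # Stand-in for a real test; here only the frame type matters.
    return config["frame_type"] == "802.2"


culprit = find_breaking_change(baseline, changes, is_healthy)
print(culprit)  # prints "switch frame type"
```

The point of the design is that only one thing changes between health checks, so the first failing check points at a single suspect rather than a pile of them.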
Get More Knowledge
Knowledge is power. The more you know, the better you will be at seeing issues before they become problems. Certification and classes can help. Novell's Service and Support class has lots of great information. Novell Technical Support's web site has lots of troubleshooting information. Novell product documentation can give you great insight into how things are supposed to work. Novell Support Forums provide answers and feedback from other network administrators. Others in your company or colleagues in other companies may also be great sources of information.
Common NDS Troubles
In addition to these general troubleshooting guidelines, here are some specific things to look at when considering your NDS issues. While this is in no way meant to be an exhaustive list, these are the issues that Novell Technical Support tells us are the most common.
Replica Ring
Inconsistencies in replica rings can be the source of numerous NDS errors. General steps in resolving these errors include the following:
- Using DS Trace or DS Repair, identify the partition affected by the replica ring inconsistencies.
- With DS Repair, identify all servers that host replicas of this partition and note the replica type on each server.
- Examine the server hosting the Master replica since it functions as the authoritative source for partition information. If the Master replica is the source of the problem then designate one of the Read/Write replicas as a new Master using NDS Manager or DS Repair.
- Once a healthy Master replica exists, perform a "Send All Objects" operation in DS Repair to eliminate any inconsistencies.
- Monitor the replica ring after making repairs to make sure that it is successfully sending updates between all replica-hosting servers.
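If you transcribe what each server reports about the ring into your notes, spotting the odd one out can be automated. The sketch below is a hypothetical illustration, not a Novell tool: the server names and the `ring_views` data are made up, and treating the most common view as the reference is an assumption (you may prefer to trust the Master's view instead).

```python
from collections import Counter

# Hypothetical data: each server's own view of the replica ring for one
# partition, as you might copy it by hand from DS Repair reports.
ring_views = {
    "FS1": {"FS1": "Master", "FS2": "Read/Write", "FS3": "Read/Write"},
    "FS2": {"FS1": "Master", "FS2": "Read/Write", "FS3": "Read/Write"},
    "FS3": {"FS1": "Master", "FS2": "Read/Write"},  # FS3 does not list itself
}


def check_ring(views):
    """Return a list of human-readable problems found in the ring views."""
    problems = []

    # Assumption: the most common view is the reference. Flag any server
    # whose view differs from it.
    frozen = {srv: tuple(sorted(view.items())) for srv, view in views.items()}
    reference, _ = Counter(frozen.values()).most_common(1)[0]
    for server, view in sorted(frozen.items()):
        if view != reference:
            problems.append(f"{server} holds a different view of the ring")

    # A partition should have exactly one Master replica.
    masters = [srv for srv, rtype in reference if rtype == "Master"]
    if len(masters) != 1:
        problems.append(f"expected exactly one Master, found {len(masters)}")

    return problems


for problem in check_ring(ring_views):
    print(problem)  # prints "FS3 holds a different view of the ring"
```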
Network Address Referrals
Problems with network address referrals will prevent NDS from properly traversing the tree from partition to partition in order to find an object that is not locally maintained. To resolve this type of referral problem do the following:
- Identify the actual assigned IP or IPX addresses for each server involved. Each platform (such as NetWare, NT or Solaris) will have a mechanism for reporting the network address that is being used by that server.
- Repair Network Addresses on the servers for which other NDS servers are reporting errors. This can be done from the NDS Manager Partition Continuity view. This operation will make sure that the server is properly transmitting its own network address information.
- More severe problems may require a rebuild of replicas that have received invalid network address information. This can be resolved by using the "Receive All Objects" operation in DS Repair on the server hosting the replica. Use "Send All Objects" if the replica is a Master.
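The first step above boils down to comparing two lists: the address each server actually uses, and the address the other servers think it uses. A minimal sketch of that comparison follows. All server names and addresses are hypothetical, and the data would have to be gathered by hand from each platform's own reporting mechanism.

```python
# Hypothetical data. "actual" is what each server reports about itself;
# "recorded" is what the other servers in the tree believe about it.
actual = {"FS1": "10.1.1.10", "FS2": "10.1.1.20", "FS3": "10.1.2.30"}
recorded = {
    "FS1": {"FS2": "10.1.1.20", "FS3": "10.1.2.30"},
    "FS2": {"FS1": "10.1.1.10", "FS3": "10.1.2.31"},  # stale entry for FS3
    "FS3": {"FS1": "10.1.1.10", "FS2": "10.1.1.20"},
}


def stale_referrals(actual, recorded):
    """Yield (observer, target, recorded_address, actual_address) mismatches."""
    for observer, entries in sorted(recorded.items()):
        for target, address in sorted(entries.items()):
            if actual.get(target) != address:
                yield observer, target, address, actual.get(target)


mismatches = list(stale_referrals(actual, recorded))
for observer, target, seen, real in mismatches:
    print(f"{observer} has {seen} for {target}, but {target} reports {real}")
```

A mismatch tells you which server holds the bad information (the observer) and which server's address needs to be repaired and retransmitted (the target).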
Schema
It is possible that an NDS server, due to communications problems or corruption of synchronization time stamps, will fail to receive schema updates as they are applied to the NDS environment. The resulting schema inconsistencies can be resolved by doing the following:
- Using DS Trace, identify the server that is reporting schema errors. This will be the server that has not received the schema updates properly.
- Once the server has been identified, there are three ways to resolve this problem:
  - Declare a new epoch for the NDS tree. This will reset all time stamps and resolve any invalid entries or corruption.
  - Remove and reinstall NDS on the server that has failed. Make sure that any Master replicas hosted on this server are reassigned if this option is used. This option can cause significant user impact since external references pointing to this server will have to be reset.
  - Contact Novell Technical Services. They have special tools with which they can resolve time stamp inconsistencies so that the affected server can begin to receive schema updates again.
NDS Objects and Attributes
NDS object and attribute inconsistencies involve replicas of the same partition that, for whatever reason, have different information stored about the same NDS object or object attribute. In order to isolate the server(s) that have the faulty information it is necessary to unload NDS on other servers. As such, this type of troubleshooting can only be done in off-hours. In order to troubleshoot this type of problem do the following:
- Identify each server that hosts a replica of the partition having problems.
- Unload NDS on every server in the replica ring except one. That way you know you are getting partition information from that server.
- Use ConsoleOne to query the tree for the faulty objects and/or attributes. If they are correct you know this server's replica is not faulty.
- Repeat the previous two steps, leaving a different server loaded each time, until the faulty server(s) is/are found.
- To attempt to repair the problem, first attempt a Receive All Objects from the faulty server.
- If Receive All Objects fails, attempt to Send All Objects from one of the known good servers. If possible, use the Master for this operation.
- If Send All Objects also fails, the replica will have to be destroyed. At this point you may want to involve Novell Technical Support unless you are very comfortable with the use of advanced DS Repair switches. The replica can be eliminated by loading DS Repair with the -A option. You will then be able to remove the faulty server from the replica ring and then destroy the faulty replica.
- If the database itself is too corrupt to repair, it may also be necessary to use the DS Repair -XK2 and -XK3 switches. These switches will destroy all database objects and eliminate all external references in preparation for restoring a new copy of the database on this server. Warning: this should only be done under the guidance of a Novell Support engineer to avoid irreparable damage to the NDS tree. You don't want the cure to be worse than the disease!
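As you query one server at a time, it helps to record what each one returns so the differences jump out. The sketch below compares each server's copy of an object against a server you trust. It is an illustration only: the server names, attribute names, and values are hypothetical, and the `reads` dictionary stands in for notes you would take by hand while querying with ConsoleOne.

```python
# Hypothetical data: the attributes of one object as read from each
# server in turn, while NDS was unloaded on the others.
reads = {
    "FS1": {"Surname": "Jones", "Telephone Number": "555-0100"},
    "FS2": {"Surname": "Jones", "Telephone Number": "555-0100"},
    "FS3": {"Surname": "Jones", "Telephone Number": "555-0199"},
}


def differing_replicas(reads, trusted):
    """Compare every server's copy against the trusted server's copy.

    Returns {server: {attribute: (trusted_value, server_value)}} for each
    server that disagrees with the trusted one.
    """
    reference = reads[trusted]
    report = {}
    for server, attrs in reads.items():
        if server == trusted:
            continue
        diffs = {}
        # Look at attributes present on either side, so that a missing
        # attribute counts as a difference too.
        for name in sorted(set(reference) | set(attrs)):
            if reference.get(name) != attrs.get(name):
                diffs[name] = (reference.get(name), attrs.get(name))
        if diffs:
            report[server] = diffs
    return report


report = differing_replicas(reads, trusted="FS1")
for server, diffs in report.items():
    for name, (good, bad) in diffs.items():
        print(f"{server}: {name} is {bad!r}, trusted copy has {good!r}")
```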
NDS Timestamps
The best-known NDS timestamp issue is synthetic time. Synthetic time is when an NDS object, or objects, has a modification timestamp ahead of current network time. If the period between current time and the synthetic time is small this problem will correct itself. However, if the period is large then it is possible to resolve the problem manually. To fix the problems manually do the following:
- Review the NDS communications processes to be sure that all replicas are communicating properly.
- Perform a check on the Master replica using DS Repair to be sure that it does not contain any errors and that it is receiving current updates properly.
- Timestamps can be repaired in two ways:
  - Use DS Repair to Repair Time Stamps and Declare a New Epoch.
  - Identify the replica(s) with the synthetic timestamps and rebuild those replicas using Receive All Objects in NDS Manager.
Consider the following when repairing timestamps with DS Repair:
- All non-master replicas will be restarted in a New status when this operation is performed. No partition operations or replica updates will be possible--except through the Master replica--until the replica(s) pass into the On status.
- This operation generates a large amount of NDS-related traffic as timestamps for all replicas are reset.
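The definition of synthetic time given above (a modification timestamp ahead of current network time) is easy to express as a check. The following is a simplified sketch, not how NDS stores or reports timestamps: it assumes plain date-and-time values, and the object names and times are hypothetical.

```python
from datetime import datetime, timedelta


def synthetic_time_report(objects, network_now):
    """Return (name, lead) pairs for objects whose modification timestamp
    is ahead of network time, largest lead first."""
    ahead = [
        (name, stamp - network_now)
        for name, stamp in objects.items()
        if stamp > network_now
    ]
    return sorted(ahead, key=lambda item: item[1], reverse=True)


# Hypothetical example data.
network_now = datetime(2000, 7, 19, 12, 0, 0)
objects = {
    "CN=Sally.OU=Sales.O=Acme": datetime(2000, 7, 19, 11, 58, 0),
    "CN=Bob.OU=Sales.O=Acme": datetime(2000, 7, 19, 12, 5, 0),
    "CN=Floor3South.OU=Printers.O=Acme": datetime(2001, 1, 1, 0, 0, 0),
}

report = synthetic_time_report(objects, network_now)
for name, lead in report:
    print(f"{name} is {lead} ahead of network time")
```

In this made-up data, Bob's object is five minutes ahead and should sort itself out as network time catches up, while the printer object is months ahead and is the kind of gap that calls for a manual repair.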
Prevent Things Early On!
You're not lazy, but you may not have thought about what you can do as a network administrator to keep problems in the network to a minimum. A good program of preventative maintenance will go a long way toward getting you home on time at night. Some common preventative maintenance procedures include the following:
- Verify that Installed Versions of NDS are Current. (Quarterly) Review Novell's support web site regularly for updates to NDS-related files. You may not want to apply all updates immediately, but be aware that the updates exist and what issues they are intended to resolve.
- Verify that Time is Synchronized. (Biweekly) Use DS Repair to check the time sync status for each partition in the tree. Watch for synthetic time that might prevent background processes from completing normally.
- Verify that Replica Synchronization is Occurring Normally. (Biweekly) Use NDS Manager, DS Repair or DS Trace to monitor the replica synchronization process. Any of these utilities can also be used to activate the replica sync process manually so it can be monitored through DS Trace.
- Check Replica Ring Continuity. (Biweekly) Check Replica Ring information using NDS Manager or DS Repair to be sure that each server holds identical information concerning the members of its replica ring(s).
- Check Backlinks and External References. (Weekly) Use DS Repair to check external references. This is accomplished through DS Trace on NT. This procedure will make sure that queries are able to traverse the NDS tree properly.
- Check NDS Obituaries. (Bimonthly) Obituaries are references to deleted objects that are maintained until word of the deletion has been propagated to all servers hosting replicas of the affected partition. DS Repair will note undeleted obituaries during its External Reference check. When no longer needed, Obituaries should be automatically deleted by the Janitor process, which can be monitored through DS Trace.
- Check the NDS Schema. (Monthly or after extending schema) Use DS Trace to force a schema sync (*SS command) to make sure that schema updates are being received by all NDS servers.
- Review Tree for Unknown Objects. (Monthly) Use ConsoleOne to search for Unknown objects. Click the Edit dropdown menu and select Find. Start your search from the Tree level and select Unknown as the object type. Unknown objects can indicate resources that have not been properly installed or removed from the tree. However, it may also indicate ConsoleOne does not have a snap-in capable of recognizing that object type, so don't immediately assume that Unknown objects need to be deleted.
- Back Up Server NDS Database Files. (Weekly) Use your preferred method for backing up NDS database files.
While performing these maintenance tasks regularly will not guarantee that problems will never surface, it will certainly help prevent catastrophic problems by allowing you to catch problems before they become too serious to resolve easily.
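A checklist like the one above is easier to keep up with if something tells you what is due. Here is a small sketch of that idea. The intervals in days are assumptions: "Biweekly" is read as every 14 days, "Bimonthly" as every 60 days, "Monthly" as 30 and "Quarterly" as 90, so adjust them if you read the schedule differently. The dates in the example are hypothetical.

```python
from datetime import date, timedelta

# Task name -> interval in days (see the assumptions in the text above).
SCHEDULE = {
    "Verify installed NDS versions are current": 90,
    "Verify time is synchronized": 14,
    "Verify replica synchronization": 14,
    "Check replica ring continuity": 14,
    "Check backlinks and external references": 7,
    "Check NDS obituaries": 60,
    "Check the NDS schema": 30,
    "Review tree for Unknown objects": 30,
    "Back up server NDS database files": 7,
}


def tasks_due(last_done, today):
    """Return the tasks whose interval has elapsed, in checklist order.

    A task with no recorded completion date is always treated as due.
    """
    due = []
    for task, interval in SCHEDULE.items():
        last = last_done.get(task)
        if last is None or today - last >= timedelta(days=interval):
            due.append(task)
    return due


# Hypothetical history: everything was done on 10 July, except the schema
# check, which was last done on 1 June.
today = date(2000, 7, 19)
last_done = {task: date(2000, 7, 10) for task in SCHEDULE}
last_done["Check the NDS schema"] = date(2000, 6, 1)

for task in tasks_due(last_done, today):
    print(task)
```

With this history, the two weekly tasks and the overdue schema check are listed, and everything else waits its turn.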
Smile, It's All in a Day's Work
The God of networking (was his name Murphy?) is cousin to those infamous Greek gods that love to cause chaos. He will wreak havoc with your life just to see how you react. There will be some days you'd rather forget. But there is also a pretty good feeling associated with tackling a tough problem and coming out on top. And each new bit of experience, no matter how painfully gained, will take you another step up the ladder toward network guru.
Face it, your users are probably not going to know or appreciate all those little things you do to keep the network running. Hopefully, your boss is more cognizant of your efforts. Given today's job market, most organizations are happy to pay handsomely for a bright network admin who has the battle scars to prove that he or she has sacrificed a pound of flesh to the network gods. We hope these tips will make those offerings on the TCP/IP altar a little easier to handle.
Novell Cool Solutions (corporate web communities) are produced by WebWise Solutions. www.webwiseone.com