
Troubleshooting Operating System Software

Novell Cool Solutions: Feature
By Cindy Stap


Posted: 4 Feb 2004
 

Introduction

Troubleshooting is a systematic analysis of symptoms that indicate a malfunction. In simple terms, it is problem solving. Good troubleshooting draws partly on traits such as common sense, clear thinking, and determination, and partly on knowledge, skill, and experience. In other words, troubleshooting combines natural talent with developed skill.

Troubleshooting is a skill that is needed in all areas of problem solving for computer hardware and software. As computer problems occur frequently, some level of troubleshooting skill is needed by all computer users. A particularly difficult area of troubleshooting for computers is the operating system. This is probably because the operating system is like an interface between the hardware and the applications. A problem in the function of an operating system can actually be a problem caused by the hardware, a bug in an application, or a bug in the operating system. Finding the true source of the problem can be a daunting task. This can be further complicated if the operating system is a network operating system (such as NetWare) because there is the additional interaction between the operating system and the network of other computers and devices.

About this Document

This document seeks to help the reader develop the skill of troubleshooting operating system software. There is no replacement for real world experience. However, it is also true that it is better to learn from other people's mistakes. It is in this spirit that this document is offered as a tool to help teach the skills of troubleshooting.

The information in the following sections provides ideas or strategies for you to attempt while troubleshooting a computer problem. This information is not meant to influence a direction or to suggest one strategy is better than another. By using your own troubleshooting experience combined with the following ideas and strategies, you can find the fastest resolution to most of your troubleshooting problems.

Of the many strategies and tactics available for troubleshooting, two are discussed in this document: Trial-and-Error and Systematic Troubleshooting.

Troubleshooting skill can best be described as the ability to deduce a solution quickly from the information available about a situation, and to resolve the problem in a professional and timely manner. The best professionals will tell you that timely resolution depends on four things: staying calm, sound technique, up-to-date environment documentation, and key information contacts.

Before You Begin Troubleshooting

Before a problem occurs, collect as much information about the computer system and its environment as possible, and keep it where everyone who could be working on the system can access it. Troubleshooting is a systematic analysis of symptoms that indicate a malfunction. The symptoms appear as deviations from the normal parameters. In order to troubleshoot properly, you must be able to recognize normal operating conditions and parameters. That is why it is critical to gather this initial information about the computer system and environment.

A good way to start is to complete a general "System/Server Data Sheet" to document the environment. A sample is attached to this paper. This sample data sheet may or may not contain all the information needed for the computer's environment, but it does contain basic information on which to build.

Once documented, keep this information available and up-to-date as you make changes to your system and its environment. When you are asked what is normal for your system and environment, you can answer the question without having to build the information and/or wait a long time to obtain the correct information. If you make a new version of the document every time there is a change to the system you will have a historical report from the time the system was installed to the present time. This will be very helpful in troubleshooting.
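
As a rough illustration of keeping a dated version of the data sheet for every change, the following Python sketch gathers the few fields that the standard library can detect and builds a file name that never overwrites an earlier snapshot. The field names, the file naming scheme, and the `save_snapshot` helper are assumptions made for this example; they are not part of the sample data sheet, and most fields on that sheet still have to be entered by hand.

```python
import json
import os
import platform
import socket
from datetime import datetime, timezone


def collect_data_sheet(system_role="(fill in)"):
    """Gather the data sheet fields that the standard library can detect."""
    return {
        "system_name": socket.gethostname(),
        "operating_system": platform.system(),
        "os_release": platform.release(),
        "os_version": platform.version(),
        "machine": platform.machine(),
        "processor": platform.processor(),
        "system_role": system_role,  # not detectable; entered by hand
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def snapshot_filename(sheet):
    """Build a per-change file name so older versions are never overwritten."""
    # "2004-02-04T10:30:00+00:00" becomes "20040204T103000"
    stamp = sheet["recorded_at"].replace(":", "").replace("-", "")[:15]
    return f"datasheet_{sheet['system_name']}_{stamp}.json"


def save_snapshot(sheet, directory="."):
    """Write the sheet as JSON and return the path that was written."""
    path = os.path.join(directory, snapshot_filename(sheet))
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(sheet, handle, indent=2)
    return path
```

Calling `save_snapshot(collect_data_sheet(system_role="file server"))` after each change would leave one file per change, which together form the historical report described above.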

Next, it is important to build a change management program into your organization. No matter how big or small an organization is, all server and environment modifications and enhancements need to be scheduled. Scheduling helps prevent problems such as multiple changes going in at the same time, and it keeps you and all impacted teams informed of the updates going on in your environment on any given day.
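
One way to catch changes that are scheduled on top of each other is to compare their time windows. The sketch below is only an illustration: the `(name, start, end)` tuple layout and the sample schedule are invented for this example and do not come from any particular change management tool.

```python
from datetime import datetime


def find_overlaps(changes):
    """Return pairs of change names whose scheduled windows overlap.

    Each change is a (name, start, end) tuple; this layout is an
    assumption made for the sketch.
    """
    ordered = sorted(changes, key=lambda change: change[1])
    overlaps = []
    for index, (name, start, end) in enumerate(ordered):
        for other_name, other_start, _ in ordered[index + 1:]:
            if other_start >= end:
                break  # sorted by start, so no later change can overlap
            overlaps.append((name, other_name))
    return overlaps


# Invented sample schedule.
schedule = [
    ("NIC driver update", datetime(2004, 2, 4, 22, 0), datetime(2004, 2, 4, 23, 0)),
    ("Support pack install", datetime(2004, 2, 4, 22, 30), datetime(2004, 2, 5, 0, 30)),
    ("Switch firmware", datetime(2004, 2, 6, 1, 0), datetime(2004, 2, 6, 2, 0)),
]
print(find_overlaps(schedule))
# prints [('NIC driver update', 'Support pack install')]
```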

As You Begin Troubleshooting

As you begin troubleshooting, keep the following points and questions in mind:

  • View error logs and their debug codes.
  • Investigate hardware problems; then investigate software problems.
  • Keep basic troubleshooting strategies in mind.
  • Ask yourself questions. Asking yourself questions is a powerful troubleshooting tool. Use questions, such as the ones following, to move from trial-and-error troubleshooting to thoughtful, systematic troubleshooting (to be discussed in more detail later in this document):
    • What do I know and what do I not know about this problem?
    • What do I need to know about this problem and how can I find out?
    • What is the real problem (as opposed to the perceived problem)?
    • What are the parts of the problem? Of those parts, which parts are solved most easily?
    • What are the characteristics of the problem?
    • Have I seen other problems similar to this problem, and what strategy worked then?
    • Which strategy should I use now?
  • Know the strategy with which you are most comfortable, and stay with that strategy throughout the problem you are troubleshooting. Changing strategies in the middle of solving a problem can cause other problems or issues that are unrelated to the original problem.
  • Isolate yourself from interruptions. These can greatly hinder the troubleshooting task. Interruptions can lead to skipped steps, rushed and/or wrong decisions, and, worst of all, a diagnosis reached when the problem has not truly been found and corrected. One way to address this problem when a team is troubleshooting is to designate one person who handles communication to people outside the troubleshooting team.

The following list provides keys to troubleshooting a problem:

  • Document the process. Keeping good records helps you troubleshoot the problem quickly and efficiently.
  • Investigate log records. Review log records for information about the operation of the computer that might have led to the problem. Some of the records to investigate are those that:
    • Interact directly with other computers
    • Identify persons performing work on the computer
    • State the purpose of the work on the computer
    • State when the work started and finished
    • Label cables
  • Ask specific questions. Ask specific questions that lead you to the problem such as:
    • Exactly what happened?
    • How does it happen?
    • When does it happen?
    • Does it happen all the time or intermittently?
    • What changed right before the problem occurred?
    • Can this be reproduced? Can you show me?
  • Check the obvious. Ensure that you check the obvious computer problems, such as:
    • Network cabling and physical connections. Checking this area first can save you time. Check bend radius of cables. Incorrect bend radius can cause connectivity problems. Also check for loose connections, weakened clips, and so on.
    • Ensure that all adapters are seated in their slots.

Troubleshooting Methods

The following describes two methods of troubleshooting: the "trial-and-error method" and the "systematic method". The method you use depends upon the errors you are troubleshooting, and your level of experience. You may use one or the other troubleshooting method first, or use both methods together. Regardless, remember that skipping steps can lead you to the wrong conclusion and make you think that you have found the root cause of the error when you have not. Improper troubleshooting can lead you to mask the real problem and create another symptom.

Trial-and-Error Troubleshooting

Trial-and-error troubleshooting is based on the idea that you have a problem and do not have enough information to give you a viable starting point. The trial-and-error method is a great fit for attempting to solve intermittent problems. When you use trial-and-error troubleshooting, begin by asking yourself, "What might work? What can I try?" You make assumptions based on the information that you do have.

  • Look at error logs to obtain ideas about where to start. Information you gain from examining error logs gives you insight about where to begin. For example, for NetWare 6.x servers you can check the logger screen or use NetWare Remote Manager.
  • Note any server error messages you received before you noticed the server was having a problem.
  • Note any client error messages.
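
As a starting point for reviewing logs, a short script can pull out the lines that are most likely to matter. The following Python sketch is hedged on several points: the keyword list and the sample log lines are invented for illustration, and real log formats differ from one operating system to another.

```python
def suspicious_lines(log_lines, keywords=("error", "warning", "abend", "fail")):
    """Return (line_number, text) for lines containing any keyword.

    The keyword list is illustrative; adjust it to the messages your
    operating system actually writes.
    """
    hits = []
    for number, line in enumerate(log_lines, start=1):
        lowered = line.lower()
        if any(keyword in lowered for keyword in keywords):
            hits.append((number, line.rstrip()))
    return hits


# Invented sample lines; not taken from a real server log.
sample_log = [
    "10:02:11 Volume SYS mounted",
    "10:15:42 WARNING: cache buffers low",
    "10:16:03 LAN driver reset",
    "10:16:05 Error: connection to client lost",
]
for number, text in suspicious_lines(sample_log):
    print(number, text)
```

With the sample lines above, the script prints lines 2 and 4. Noting the line numbers makes it easier to see which messages appeared before the problem was first noticed.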

As with any approach to problem solving, the end goal is to find the root cause. Trial-and-error troubleshooting is not just guessing and stumbling around hoping to arrive at the correct conclusion. Trial-and-error troubleshooting simply begins by making assumptions. Assumptions are necessary because they set limits, thereby simplifying the problem and making it easier to work with. Setting limits provides a framework to get you to the point where you can evolve to a defined, systematic approach.

Systematic Troubleshooting

Systematic troubleshooting involves breaking the error/problem into the smallest possible pieces. "What are the smallest pieces into which you can divide the error/problem and still have it make sense?" The best way to do this is the KISS method (Keep It Simple Stupid).

When you troubleshoot problems, dividing the problem into small parts helps you keep it simple. It is extremely important to keep it simple. If there are too many unknown parts, it is impossible to determine which part is causing the error. If there are too many possible causes, you will find it difficult to find the root cause of each part of the error.

  • Log updates - Logging updates is as important in a test environment as it is with networks. If you do not know the steps that brought you to the current state, how can you know what is causing the errors?
  • Stop non-critical 3rd party software or components - Stopping non-critical processes helps simplify the situation as much as possible. Then, it might become obvious what is responsible for the error. This step in troubleshooting facilitates reducing the number of variables you must consider.
  • Assess the types of errors you are seeing - Begin assessing the errors by making a list of possible causes. Refer to the list often to keep yourself thinking and brainstorming. When you maintain a list of possible causes, you will begin to see a correlation of events.
  • Enter information into a chart or spreadsheet - Entering information into a chart or spreadsheet is important, especially if you are a visually oriented person. Whatever your choice, chart or spreadsheet, remember to document your findings.
  • Devise a plan of attack - Obtain a complete, accurate symptom description. Reproduce the symptom, and then start asking questions to narrow the possibilities.
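
The logging and spreadsheet steps above can be combined by recording each test as one row of a CSV file, which any spreadsheet program can open. This is a minimal sketch; the column names and the sample rows are assumptions chosen for the example.

```python
import csv
import io

# Column names are illustrative, not a prescribed format.
FIELDS = ["step", "change_made", "symptom_reproduced", "notes"]


def write_test_log(rows, handle):
    """Write one row per troubleshooting test to a CSV file-like object."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for step, row in enumerate(rows, start=1):
        writer.writerow({"step": step, **row})


# Invented sample entries.
tests = [
    {"change_made": "Unloaded backup agent", "symptom_reproduced": "yes", "notes": ""},
    {"change_made": "Unloaded antivirus", "symptom_reproduced": "no", "notes": "retest twice"},
]
buffer = io.StringIO()
write_test_log(tests, buffer)
print(buffer.getvalue())
```

Passing an open file instead of the in-memory buffer, for example `open("tests.csv", "w", newline="")`, would write the same rows to disk.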

Problem Symptoms

When troubleshooting, you will find two types of error/problem symptoms that are opposite and mutually exclusive: reproducible and intermittent symptoms.

A reproducible symptom can be consistently reproduced using a known procedure.

Reproducible symptoms can always be solved. As the troubleshooter, you can reproduce the symptom at any time. If you perform a test that stops the procedure from reproducing a symptom, you have then ruled out part of the troubleshooting search area. After a number of similar tests, you will have narrowed the cause to a single component.

Two requirements for solving reproducible symptoms are:

  • You have sufficient knowledge of the system to devise tests that can narrow the troubleshooting search area, and you have sufficient knowledge to interpret the resulting test correctly (sometimes having the technical documentation available for research is enough).
  • You use a procedure for the tests that guarantees that you do not "go around in circles."

Given these requirements, a reproducible symptom can always be traced to its root cause.
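
One procedure that avoids going around in circles is to halve the search area with each test. The Python sketch below illustrates the idea under strong assumptions: exactly one component is at fault, and the hypothetical `symptom_reproduces` function reliably reports whether the symptom appears when only the listed components are active. The component names are invented.

```python
def isolate_component(components, symptom_reproduces):
    """Narrow a reproducible symptom to one component by halving.

    Assumes exactly one component is at fault and that
    symptom_reproduces(enabled) reliably reports whether the symptom
    appears when only the components in `enabled` are active.
    """
    candidates = list(components)
    if not candidates:
        raise ValueError("at least one component is required")
    tests_run = 0
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        tests_run += 1
        if symptom_reproduces(half):
            candidates = half  # culprit is in the tested half
        else:
            candidates = candidates[len(half):]  # tested half is ruled out
    return candidates[0], tests_run


# Invented component list; the lambda stands in for a real test run.
loaded = ["backup agent", "antivirus", "print service", "NIC teaming", "web server"]
culprit, count = isolate_component(loaded, lambda enabled: "NIC teaming" in enabled)
print(culprit, count)
# prints NIC teaming 3
```

Halving is only one possible test order. When some components are far more likely suspects than others, testing those first can be faster.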

Note that a reproducible symptom might not be easily reproduced. You might discover that what you believe to be a reproducible symptom really is not. For example, you might discover that your server stops responding within an hour of beginning client stress tests. This symptom is not necessarily a reproducible symptom. The word "within" means that sometimes the computer stops responding in an hour; sometimes in 45 minutes. The exact time is governed by chance. It is probable that there are times when the computer takes more than an hour to stop responding; maybe much more. Therefore, it is difficult to be certain that each lockup is due to the same cause.

Intermittent Symptoms

An intermittent symptom cannot be consistently reproduced because there is no known procedure with which to reproduce the problem.

Intermittent symptoms can sometimes require detective work. With intermittent symptoms, you have no mathematical certainty of solving the problem. Sometimes, intermittent problems are never solved since reproducing the symptom is not in your control. You have no way of knowing whether a symptom disappeared because of a test you ran or because of random chance.
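
Although certainty is out of reach, the role of random chance can at least be estimated. The sketch below assumes that each trial is independent and that the symptom appears in a fixed fraction of trials; both are simplifying assumptions that real systems may not satisfy. It computes how likely a run of clean trials is by luck alone, and how many clean trials are needed before luck becomes an unlikely explanation.

```python
import math


def chance_of_clean_runs(failure_rate, runs):
    """Probability that `runs` consecutive trials show no symptom by luck alone."""
    return (1.0 - failure_rate) ** runs


def runs_needed(failure_rate, max_chance=0.05):
    """Smallest number of clean runs that luck alone would produce
    with probability at most `max_chance`."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be between 0 and 1 (exclusive)")
    return math.ceil(math.log(max_chance) / math.log(1.0 - failure_rate))


# If the symptom showed up in roughly 1 of every 5 trials before the change:
print(runs_needed(0.2))
# prints 14
```

In this example, five clean trials after a change prove very little, because `chance_of_clean_runs(0.2, 5)` is about 0.33: roughly one time in three, the symptom would have stayed away on its own.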

When you are troubleshooting an intermittent symptom, you must determine whether you have a hardware or software problem. Generally, hardware problems are intermittent and software problems are more consistent.

Note that an intermittent problem can be reproduced. You cannot, however, cause its reproduction because there are no known procedures to consistently reproduce it. The best you can do is to create an environment to increase the odds of the symptom occurring, and then wait. When the symptom occurs, it reproduces itself.

An intermittent symptom can become reproducible. An intermittent symptom becomes reproducible when you find a procedure to consistently reproduce the symptom. The goal of troubleshooting an intermittent problem is to make it more consistently reproducible, and thus easier to trace to its root cause.

Intermittent Symptom Busting Strategies

The following list describes intermittent symptom busting strategies that you might use in your troubleshooting:

  • Define the problem.
  • Base decisions on data.
  • Measure and reduce variation.
  • Test and document theories.

Summary

Two methods of troubleshooting were described in this document: Trial-and-Error and Systematic. Trial-and-error troubleshooting relies on your past experience and your intuition about where to begin and where to end troubleshooting. This troubleshooting method begins by asking yourself: "What might work?" and "What can I try?" Systematic troubleshooting relies on breaking down a problem into smaller parts and attacking those parts.

As you troubleshoot a computer problem, be aware of the following:

  • Do not rush the problem's resolution. Always take your time, follow the steps, and use your experience and judgment.
  • Beware of accepting help from people not connected to the troubleshooting. These outside influences may apparently resolve the problem without explaining what was done to resolve it. This situation can be unfortunate. The problem may not have really been fixed and may have only been temporarily fixed by some unknown procedure. If this occurs you may be further away from a solution than before. On the other hand, the problem may have truly been resolved, but without knowing what was done to resolve it, the same problem may occur in the future and you won't have the necessary information to resolve it the next time.
  • Test your problem theory. Once you have documented the steps used to replicate the problem, you can work to resolve it.
  • Do not hesitate to call your vendors. The important points here are in the details. For example, when stating that your computer has a NIC (Network Interface Card) problem, provide such details as:
    • Manufacturer (Intel, 3Com, Broadcom, etc.)
    • Model number
    • Topology (Ethernet, Gigabit Ethernet, Token Ring, etc.)
    • Interface (copper or fiber)
    • Teaming/load balancing software, if applicable
    • Number of NICs
    • Operating system software driver name, date, and version

    Provide as much information as possible. There is no such thing as too much information about your computer's environment and the problem your computer is having.

  • Document, document, document! Document what worked, what did not work, and the root cause of the problem. If the problem comes back, it is critical to be able to replicate it. If you have the problem documented, you will know where you have been and which actions have not been tried. Ensure that you keep a log of the components contained in the computer, the computer's role in your environment, operating system information, patches or service packs applied, slots, embedded devices, BIOS information, firmware, and so on. (See the System/Server Data Sheet attached to this document.) It is better to have too much information than too little. Information is the key to troubleshooting. If you have all of the information you can get, you will have a strong starting point for troubleshooting instead of having to begin by collecting the required information. A little work up front goes a very long way.
  • Do not automatically assume the most drastic action. Do not perform a radical action for a problem that might be as simple as reinstalling an application. Take all of the necessary lesser actions before taking a radical one. For example, before re-imaging a server, consult your peers and vendors. If you have not found the root cause of the problem, re-imaging or restoring a server can actually increase the frequency of outages. You may also introduce a new variable into the mix, making the root cause more difficult to find. The System/Server Data Sheet can also be used here.

Supporting Documentation and Web sites

The following web sites can assist you in resolving technical issues and can prepare you with information. The sites provided are Dell and Novell technical services which can expedite resolution of your problem.

Figure 1 shows a screen from the Dell Support Web site.


Figure 1: Dell Support Website

Figure 2: Novell Support Knowledgebase Screen

Figure 3 shows the detail text that appears when you type "Before calling support" in the previous figure.


Figure 3: Novell Knowledgebase Search Results

Important information to have before calling Vendor's Technical Support

When you contact a vendor's product support, you should be at your computer and have the following information available:

  • The version of the operating system.
  • The type of hardware, including network hardware, if applicable.
  • The exact wording of any information messages or error messages that appeared on your screen.
  • A description of what happened and what you were doing when the problem occurred.
  • A description of how you tried to solve the problem.

Summary

The troubleshooting skills outlined in this article give the reader a good foundation to build on. They also serve as a refresher for highly technical readers: remember the basics and keep it simple.

For More Information

The Dell and Novell Web sites contain many documents to assist you with troubleshooting. Vendors keep knowledgebases of known issues and fixes.

References:
Harris, Robert A. "Creative Problem Solving - A Step-by-Step Approach." Pyrczak Publishing, 2002.

Sample System/Server Data Sheet
System Name:  
Computer Manufacturer Name:  
Computer Model:  
Processor Manufacturer, Model, and Speed:  
Memory:  
Hard disk drive(s):  
Optical drive(s) (CD-ROM, DVD-ROM, etc):  
Removable drive(s):  
BIOS Version:  
Firmware Version:  
Server Name:  
IP Address(es):  
System Role:  
System services:  
Operating System (and versions):  
Automated Processes:
  Backup:
  Login scripts:
  Boot.ini:
  Startup/autoexec.NCF:

Version numbers:
  Drivers:
    NICs:
    Storage Controllers:
  Operating system support/service packs (SPs):
  Third party software/applications and versions:

 

Basic Troubleshooting Checklist for Simple Scenarios

Troubleshooting Server not communicating to clients:

  • Check to see if changes were made to network environment, server or application(s)
  • Check OS event logs for network issues, warnings or errors
  • Check TCP/IP stack by pinging 127.0.0.1 (Loopback address)
  • Check server for ability to see other network servers by pinging the address
  • Check server for ability to see gateway for subnet by pinging the address
  • Check server for ability to see other subnets by pinging the address on a different subnet
  • Check server network cards for activity and connectivity lights
  • Check all cable connections to server and network ports
  • Check all TCP/IP settings (DHCP versus Static IP)
    • DNS settings correct
    • Subnet/Gateways correct
  • Check Network Card Management features for errors or events
  • Check Network Card drivers
    • Driver version
    • Driver loaded and working
    • Stop and start Driver
    • Properly configure advanced features
  • If network card advanced features are on, check
    • Feature properly configured
    • Applications or solution supports configuration of cards
  • Check web site for manufacturer of NIC:
    • Known or like issues with this card or driver
    • Newer driver
    • Newer advanced or management features
  • Check the web site for the operating system
    • Issues with TCP/IP stack or drivers
    • Patch, Support Pack, Service Pack, or OS upgrade available
  • Check that your network hub, switch, or router is set up correctly to work with the network card
  • Check news groups about issues you are having for additional ideas
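
The ping steps in this checklist run from the loopback address outward, so the first address that fails to answer shows roughly where to look. The following Python sketch automates that order. Every address except 127.0.0.1 is a placeholder for this example, and the `-n` and `-c` count flags are the usual ones for the Windows and Unix-style ping commands; confirm them on your own operating system.

```python
import platform
import subprocess


def system_ping(address):
    """Send one echo request using the operating system's ping command.

    Relies on ping's exit code, which on some systems can report
    success even when the reply is "destination unreachable".
    """
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    result = subprocess.run(
        ["ping", count_flag, "1", address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def first_failing_check(checks, ping=system_ping):
    """Run checks in order and return the label of the first that fails.

    Returns None when every address answers.
    """
    for label, address in checks:
        if not ping(address):
            return label
    return None


# Addresses other than 127.0.0.1 are placeholders for this sketch.
CHECKS = [
    ("TCP/IP stack (loopback)", "127.0.0.1"),
    ("another server on the same subnet", "192.0.2.10"),
    ("subnet gateway", "192.0.2.1"),
    ("host on a different subnet", "198.51.100.10"),
]
```

Calling `first_failing_check(CHECKS)` with real addresses returns the label of the first step that fails. The `ping` parameter accepts any function that takes an address and returns True or False, which makes it possible to try the logic without touching the network.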

Troubleshooting Server down and/or OS not coming up on reboot:

  • Make sure all power cables are connected to "live" electrical sources
  • Make sure no diskettes or CDs are in the media drives
  • Check to see if changes were made to network environment, server or application(s)
  • Check for any server conditions, warning lights or HDD amber lights, or RAID controller audible alerts
  • Check Power Connections
  • Check memory boards, processors, and peripherals
    • Take out memory DIMMs and cards, and reseat them in the server
    • Take out processors and reseat them back into the server
    • Take out peripherals and reseat them
  • Check hard disk drives and make sure they are spinning up
  • Check RAID configuration and the stability of the "virtual" drive
  • Check the Remote Access card, login, and check logs
  • Perform system manufacturer's hardware diagnostics and check for errors
  • Check hardware vendor site for issues with Server and fixes
    • Check vendor knowledge base for your Server's situation
    • Check for BIOS and firmware updates for the server
    • Check for peripherals firmware updates
    • Check for driver updates for Server components
  • Check software vendor site for issues with Server setup; hardware, OS, and application
  • Call hardware vendor support:
    • Make sure you have the server data sheet
    • Have documentation of the things you tried, including this checklist


