Linux system hangs or is unstable
This document (3301593) is provided subject to the disclaimer at the end of this document.
SUSE Linux Enterprise Server 10
SUSE Linux Enterprise Server 9
SUSE Linux Enterprise Server 8
Novell Open Enterprise Server 11 (OES 11) Linux
Novell Open Enterprise Server 2 (OES 2) Linux
Novell Open Enterprise Server 1 (OES 1) Linux
System is unstable
- Problem characterization
- Hardware layer
- BIOS / firmware layer
- Storage layer
- Software layer
Introduction
Due to the large number of different potential causes, system hangs are among the most difficult problems to troubleshoot and a systematic approach is required for troubleshooting to be effective. This document describes such an approach, in general terms.
Problem characterization
First of all, establish a detailed characterization of the problem which answers at a minimum the following questions:
- What is meant by a hang or instability? Is the system no longer providing a particular service (reliably), has the system as a whole become completely inaccessible (both via network and via console), or is it still responsive to some forms of connection (e.g. SSH, VNC or ping) or commands?
- For a hang, is it a single occurrence or has the hang occurred multiple times?
- For a recurring hang, is there a pattern to the hangs? E.g. can the hang be triggered by a particular sequence of operations, or does it always occur around a particular time of day, after a particular period of system uptime, or when particular cron jobs are executed?
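To look for a time-of-day pattern in recurring hangs, the recorded hang times can be tallied per hour. The following is a minimal Python sketch under stated assumptions: the timestamps have been noted by hand or taken from monitoring in ISO format, and the function name `hangs_per_hour` is hypothetical and not part of any Novell or SUSE tool.

```python
from collections import Counter
from datetime import datetime


def hangs_per_hour(timestamps):
    """Count recorded hangs per hour of day (0-23).

    `timestamps` is a list of strings such as "2012-07-10 03:02:11"
    (an assumed input format, chosen for this illustration).
    """
    hours = Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
    return dict(sorted(hours.items()))


if __name__ == "__main__":
    # Made-up observations, for illustration only.
    observed = [
        "2012-07-10 03:02:11",
        "2012-07-11 03:05:40",
        "2012-07-12 14:00:00",
    ]
    for hour, count in hangs_per_hour(observed).items():
        print(f"{hour:02d}:00  {count} hang(s)")
```

A cluster around one hour suggests comparing that time against the cron schedule and other scheduled activity such as backups.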
Hardware layer
System hangs or instabilities can be caused by hardware that is defective or improperly configured. Unfortunately, this happens more than most people realize, for two main reasons:
- A ground rule with hardware is "Cheap, reliable, fast. Pick any two". Hardware that is cheap and reliable is not fast; hardware that is fast and cheap is not reliable; hardware that is reliable and fast is not cheap.
- Proper hardware configuration is difficult. Most hardware has many settings which can be tweaked, but knowing when and what to tweak can be something of a black art.
Fortunately, reputable hardware vendors offer diagnostics software that can and should be used to detect hardware problems. If hardware problems are incorrectly disregarded as a problem source, much time will be wasted on analysing the software level.
Aside from vendor hardware diagnostics software, for x86 and x86_64 systems there are very thorough diagnostic tools for the memory subsystem: Memtest86 and Memtest86+. These tools are often better at identifying memory subsystem issues than vendor hardware diagnostics software. A version of them is included on the boot CD of Novell's Linux products and these tools can also be obtained from the www.memtest86.org and www.memtest86.com web sites.
Consult vendor configuration guides
As for hardware configuration, some vendors (e.g. IBM) provide detailed configuration guides for Novell SUSE Linux products on specific hardware models on their support sites. When available, this type of guide should be followed, preferably from the initial installation onwards. Even when such a guide has not been followed during initial installation, it should be consulted later on to check the system configuration and bring it in line with the hardware vendor's recommendations.
Consult certification documentation
Additionally, for Novell YES CERTIFIED configurations, consult the certification bulletin. Where applicable, the certification bulletins contain configuration details such as Linux kernel parameters.
Address power supply issues
In some regions or at some locations, power from the regular electrical grid may be too variable in voltage, frequency or current for hardware to operate reliably. In such locations, appropriate electrical hardware like surge protectors, voltage regulators, uninterruptible power supplies and/or generators should be used to provide reliable power for computer systems operation.
In some cases, stability issues and hangs are caused by specific extension cards. Remove all non-essential extension cards and test the system, then put the cards back one by one, testing the system after every added card.
Best practice: "burn in" testing
In light of these considerations, it is considered best practice for hardware that is to be used for production services to undergo thorough "burn in" testing covering diagnostics and stress and load testing prior to being put into production use.
BIOS layer
On PC-based systems, the BIOS (Basic Input/Output System) is responsible for the initial setup of the system and devices up to the point where a boot loader can be started to boot the system. On other architectures, the term "BIOS" is not used, but equivalent embedded software exists, e.g. "Open Firmware" or "Extensible Firmware Interface".
The BIOS and its equivalents on non-PC architectures may also be involved in power management, hardware monitoring and hotplugging of extension cards.
A BIOS, like any other software, may contain general programming defects (bugs) and may not fully follow or support relevant standards such as ACPI. Vendors regularly release updated BIOS versions to correct such defects. Given the central role of the BIOS, it is important to track such version updates and to ensure the most recent non-development version of the BIOS is installed.
Most reputable vendors provide a search interface on their support sites that makes it easy to find the current BIOS revision for a particular hardware model as well as update instructions.
Other Firmware
On modern hardware, many components, for instance NICs, HBAs and storage controllers, include embedded software or firmware of their own. This firmware should be brought up to date as well.
Storage layer
Ensure that your storage is consistent by performing filesystem checks (and recovery) on all storage areas, including the root filesystem. To check the root filesystem, use the rescue environment from the service pack or installation CDs or DVDs.
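To make sure no storage area is overlooked, the filesystems listed in /etc/fstab can be enumerated before running the checks. The sketch below is an illustration only: the function name `fsck_candidates`, the list of skipped pseudo-filesystem types and the sample entries are assumptions made for this example. It relies on the fstab convention that the sixth field is the boot-time check order, where 0 means the filesystem is not checked automatically at boot.

```python
# Filesystem types that are not backed by local on-disk data (assumed list).
SKIPPED_TYPES = {"proc", "sysfs", "tmpfs", "devpts", "debugfs", "usbfs",
                 "swap", "nfs", "nfs4", "cifs"}


def fsck_candidates(fstab_text):
    """Return (device, mount_point, fs_type, pass_no) for on-disk filesystems."""
    entries = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        fields = line.split()
        if len(fields) < 3 or fields[2] in SKIPPED_TYPES:
            continue
        # A missing sixth field is treated as 0 (no automatic check).
        pass_no = int(fields[5]) if len(fields) > 5 else 0
        entries.append((fields[0], fields[1], fields[2], pass_no))
    return entries


if __name__ == "__main__":
    # Hypothetical fstab content, for illustration only.
    sample = """\
/dev/sda2  /         ext3      acl,user_xattr  1 1
/dev/sda1  swap      swap      defaults        0 0
/dev/sdb1  /data     reiserfs  defaults        1 2
proc       /proc     proc      defaults        0 0
/dev/sdc1  /scratch  ext3      defaults        0 0
"""
    for device, mount_point, fs_type, pass_no in fsck_candidates(sample):
        note = "" if pass_no else "  (not checked at boot, check manually)"
        print(f"{device} on {mount_point} [{fs_type}]{note}")
```

On a real system the text would come from reading /etc/fstab. Storage that is not listed there, such as volumes mounted by cluster software, still has to be added to the list by hand.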
Software layer
Check for corrupted data
Even when the filesystems check out cleanly, data contained in them may be corrupted, including code and data vital to proper operation of the operating system. The package management system stores checksums of data under its control. Run
Check the output of this command for signs of changes in files that are not configuration files, like binaries and libraries.
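As a hedged illustration of this check, the sketch below filters package verification output down to files that are not marked as configuration files. It assumes the output comes from `rpm -Va` (rpm's verify mode applied to all installed packages), which is an assumption of this example. The function name `non_config_changes` and the sample paths are hypothetical.

```python
import re

# One line of rpm verify output: a flags field (e.g. "S.5....T" or "missing"),
# an optional one-letter file marker (c = configuration file, d = documentation,
# g = ghost, l = license, r = readme), and the file path.
VERIFY_LINE = re.compile(
    r"^(?P<flags>\S+)\s+(?:(?P<marker>[cdglr])\s+)?(?P<path>/.*)$"
)


def non_config_changes(verify_output):
    """Return (flags, path) for every reported file not marked as config."""
    findings = []
    for line in verify_output.splitlines():
        match = VERIFY_LINE.match(line.strip())
        if match and match.group("marker") != "c":
            findings.append((match.group("flags"), match.group("path")))
    return findings


if __name__ == "__main__":
    # Made-up verify output, for illustration only.
    sample = """\
S.5....T  c /etc/ntp.conf
..5....T    /usr/bin/foo
missing     /usr/lib/libbar.so.1
"""
    for flags, path in non_config_changes(sample):
        print(f"{flags:10} {path}")
```

On a live system the input could be collected with `subprocess.run(["rpm", "-Va"], capture_output=True, text=True).stdout`. In the flags field, a `5` indicates a changed file digest, which is the most relevant sign of corruption for binaries and libraries.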
Keep the software installation up to date
Novell actively maintains released products for long periods of time. This maintenance includes fixes for software defects in particular as well as the addition of drivers for newer hardware models. Use the tools supplied by Novell, in particular the SPident tool, the Novell Customer Center and the online update facilities of your product to check whether your software installation is up to date and to bring it up to date if it isn't.
Check recent updates
Unfortunately, updated packages can occasionally introduce new defects. You can use the package management system of your Novell SUSE Linux product to determine what updates have been installed recently, e.g. through
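One possible way to list recently installed packages is sketched below. It assumes the input was produced by `rpm -qa --queryformat '%{INSTALLTIME} %{NAME}-%{VERSION}-%{RELEASE}\n'`, which prints the installation time as seconds since the epoch followed by the package name. This query is an assumption of the example, and the function name `installed_since` and the sample package names are hypothetical.

```python
import time


def installed_since(query_output, cutoff_epoch):
    """Return (install_time, package) pairs installed at or after the cutoff,
    newest first."""
    recent = []
    for line in query_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip lines that do not match "<epoch> <package>"
        installed = int(parts[0])
        if installed >= cutoff_epoch:
            recent.append((installed, parts[1].strip()))
    return sorted(recent, reverse=True)


if __name__ == "__main__":
    # Made-up query output, for illustration only.
    sample = """\
1341900000 examplepkg-a-1.0-1
1300000000 examplepkg-b-2.3-4
1342000000 examplepkg-c-0.9-2
"""
    for installed, package in installed_since(sample, 1341000000):
        day = time.strftime("%Y-%m-%d", time.gmtime(installed))
        print(f"{day}  {package}")
```

Comparing the resulting list with the date of the first hang helps decide whether a recent update is a plausible cause.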
Support from Novell Technical Services
Basic information
When opening a service request with Novell Technical Services for a server hang or instability issue, the following information may be vital to an efficient resolution process:
- A detailed characterization of the problem (as discussed above)
- A description of changes made to the system and its configuration during troubleshooting prior to the opening of a service request.
- A configuration report for the affected system, created using the tool from TID 10100285 - Config Report For Linux. This tool should be run with the "-v" argument to include additional package management information. Attach this report to your service request as soon as your service request has been opened.
During the handling of your service request, you may be asked to provide a system crash dump for analysis, which may require substantial setup (e.g. of a serial console and/or second server to receive dumps). You can prepare for this by consulting the relevant TIDs for details:
- TID 3374462 - Configure kernel core dump capture for SLE10 products.
- TID 3044267 - HOWTO: Configure lkcd to capture a kernel core dump for SLES9 and OES/Linux.
This Support Knowledgebase provides a valuable tool for NetIQ/Novell/SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 3301593
- Creation Date: 29-OCT-07
- Modified Date: 16-JUL-12