Health Check Overview
Basic Environment Check
Basic Health Check
Basic Problem Solving
When problems arise with a server, the simple is often overlooked. Is the monitor plugged in? Is the install media in the DVD drive? Was the service started? Supportconfig is a tool designed to gather system information in a way that promotes resolving problems as quickly as possible. The goal of this article is to show the administrator how to use supportconfig to check the basic health of the server. A test case of a server with high CPU utilization will illustrate the process. Once you have created a supportconfig tar ball, you should perform a server health check. Checking the basics begins with the basic-environment.txt and basic-health-check.txt files.
Supportconfig has three primary purposes: 1) gather important system information, 2) reduce problem resolution time, and 3) teach useful system commands. Of course, information is critical to any problem solving scenario. If there's a problem, the basic supportconfig philosophy is to gather as much information as possible, so we only have to ask for it once. Once the information is gathered, it should be organized in such a manner that problems can be solved quickly and efficiently. As a result of this objective, several pieces of information are replicated to create a kind of one-stop-shop environment. For example, all services and their current run level states are recorded in the chkconfig.txt file. However, the current state of services specific to Logical Volume Management (LVM) is also recorded in the lvm.txt file, so lvm.txt is one location to review much of the LVM information. All files end with a '.txt' extension so they are easily recognized and opened with default editors across platforms. As a teaching tool, supportconfig first logs each command it uses to gather information to the appropriate file, and then records the output. This way, if supportconfig ever "hangs," you know what command it hung on. You can also quickly regenerate any piece of information you want, because the exact command with its path and options was recorded in the text file.
To get good at reading a supportconfig, you need experience. After you look at 100 supportconfig tar balls, you will learn what is normal. The abnormal will then stand out. How can you cut the learning curve if you don't have 100 supportconfigs or the time to look at them? After each of your systems is running smoothly and tuned to your needs, get a supportconfig. Copy the tar ball off the server so you can compare against it if the server ever experiences a problem that needs troubleshooting. Compare the supportconfig tar ball taken when the problem occurred with the saved good copy for that server. Make sure you submit both tar balls to Novell Technical Services if you need to open a service request.
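The baseline comparison described above can be sketched in a few lines of Python. This is only an illustration, assuming both tar balls have already been extracted and the matching text files read into strings. The function name diff_report and the sample excerpts are hypothetical and are not part of supportconfig.

```python
import difflib


def diff_report(good_text, bad_text, name="basic-health-check.txt"):
    """Return unified-diff lines between a baseline and a problem-time capture."""
    return list(difflib.unified_diff(
        good_text.splitlines(),
        bad_text.splitlines(),
        fromfile="baseline/" + name,
        tofile="problem/" + name,
        lineterm="",
    ))


# Hypothetical excerpts standing in for the contents of two real files.
baseline = "load average: 0.10, 0.12, 0.09\nkernel taint: none\n"
problem = "load average: 5.02, 5.01, 4.98\nkernel taint: none\n"

for line in diff_report(baseline, problem):
    print(line)
```

Expect some noise in a real comparison, since time stamps and process IDs differ between any two captures. The value is in spotting the differences you cannot explain.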
Health Check Overview
Every time I get a supportconfig, I always check the basics for obvious problems. The goal in checking the health of the server is to note red and yellow flags. Red Flags are problems that must be explained before moving forward. They are issues that directly relate to the reported problem or may affect the overall server performance. Red Flags are not necessarily bad; they just need to be explained. Yellow Flags are issues that probably should be addressed or at least monitored, but are not directly related to the problem. A basic server health check in its simplest form confirms that the server is patched and up-to-date, that the CPU utilization, memory usage and disk space are within normal limits, and that the kernel and running services are healthy.
Basic Environment Check
Look at the basic-environment.txt file first. The goal for reviewing this file is to confirm that the server is patched, to see which major packages are installed, and to find out whether there may be any firewall concerns.
- Download and use the latest supportconfig. We want the best diagnostic information possible.
- Verify the script execution date falls within the problem time frame to make sure we aren't looking at obsolete information.
Test Case: As of this writing, version 2.18-11 is the current supportconfig, and the run date matches the time frame of the issue.
Check the hostname to confirm the supportconfig was run on the host with the problem.
Compare the running kernel version with TID3594951: Table of Kernel Versions for SUSE Linux Enterprise Server to ensure it's current. The kernel is the heart of the SUSE distribution. As such, security vulnerabilities and bugs are corrected to keep the distribution safe and effective. If you are running an older kernel, be sure the reasons for doing so outweigh the security and bug fixes you are missing. From a technical support perspective, it's also important to know the type of kernel running (default, smp, bigsmp, etc.) and the system architecture.
Red Flag: The running kernel according to uname is not the latest kernel or is different than the installed kernel RPM package.
Figure 2 - Host name, running kernel and architecture
Test Case: The kernel is not up to date. Since the problem is CPU utilization, updating the kernel may be a valid troubleshooting step. However, we would want to first know how the CPU is being utilized; keep reading.
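The running-versus-installed comparison behind this red flag can be sketched as follows. It assumes that uname -r reports the kernel as version-release-flavor and that the matching RPM is named kernel-flavor-version-release. That naming pattern is an assumption of this sketch, and the version strings are illustrative only, not taken from the test case.

```python
def expected_kernel_rpm(uname_r):
    """Build the RPM name that should correspond to a `uname -r` string.

    Assumes the form <version>-<release>-<flavor>, e.g. "...-default".
    """
    parts = uname_r.rsplit("-", 2)
    if len(parts) != 3:
        raise ValueError("unexpected uname -r format: %r" % uname_r)
    version, release, flavor = parts
    return "kernel-%s-%s-%s" % (flavor, version, release)


def kernel_mismatch(uname_r, installed_rpm):
    """True when the running kernel is not the installed kernel package."""
    return expected_kernel_rpm(uname_r) != installed_rpm


# Illustrative values only.
running = "2.6.16.60-0.21-default"
installed = "kernel-default-2.6.16.60-0.23"
print(kernel_mismatch(running, installed))  # a newer RPM is installed but not booted
```

A mismatch often just means the kernel package was updated and the server has not been rebooted since, which is exactly the kind of explanation a red flag asks for.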
Make sure the kernel version, SuSE-release and SPident all show the same patch level. The kernel, SuSE-release and SPident all come from different RPM packages. So if the server has been installed and patched correctly, all of these packages will agree with one another. If they don't, you need to explain why.
Red Flag: There is a mismatch among the three.
Test Case: SPident and SuSE-release say the server is at SLES10 SP1. The running kernel is 188.8.131.52-0.16, which is newer than the SLES10 SP1 kernel 184.108.40.206-0.12. Since the server has been patched, all three are consistent.
Did SPident pass verification? If not, consider reinstalling the SPident RPM package so you can rely on its output; otherwise, don't trust the output.
Verify that all RPM packages are current according to SPident. SPident compares the version of each RPM package installed on the system with a list of known versions for the shipping distribution and each service pack.
Red Flag: An RPM package related to the problem is outdated, or any other package is very old.
Test Case: The SPident RPM package is fine, and there are no conflicting packages. Updates have been applied, but since the kernel is outdated, there are probably other package updates as well. It may be worth patching the server, rebooting and retesting the CPU utilization issue.
Are there any unsupported RPM package distributions installed that may be related to the issue? Normally, third party packages simply add functionality to the server, but don't replace the packages distributed with SUSE.
Red Flag: A third party package replaces a SUSE Linux distributed package.
Test Case: The only third party package installed is from the "Novell NTS" distribution. It provides supportconfig itself and does not replace a distributed package, like apache or LVM. Sometimes third party packages don't list a distribution, so they show up as "(none)." SUSE also distributes some packages with a "(none)" distribution. You can see these packages by searching for "(none)" in the rpm.txt file.
Will the firewall come on after a reboot, and are there currently active rules? If there are and the problem is related to networking, maybe the firewall is interrupting service.
Test Case: The firewall services are turned off and there are no currently active firewall rules. The firewall won't play a part in the problem.
Basic Health Check
Look at the basic-health-check.txt file next. The goal for reviewing this file is to check CPU utilization, memory, disk utilization, kernel taint status, and the health of running processes.
- Check the load averages. The load average is the average number of processes that are running, waiting for a CPU, or in uninterruptible sleep, measured over the past one, five and fifteen minutes. It is a good indication of how busy the kernel is. A high load average may not be bad, but it should be explainable and should not impact the overall server performance.
Red Flag: Load averages greater than 20.
Test Case: The uptime is 112 days. This is good to know since the first line of vmstat is an average over the uptime of the server. Subsequent vmstat lines are current snapshots in time. This allows us to observe which of the values have changed over time. The load averages show the kernel is busy, but probably nothing to worry about. It is consistent with the reported CPU utilization concern.
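As a rough sketch, the load average red flag can be checked by parsing an uptime line. The limit of 20 comes from the red flag above. The sample line is hypothetical apart from the 112 days, and the function names are mine.

```python
import re

RED_FLAG_LOAD = 20.0  # red flag limit suggested in this article


def parse_load_averages(uptime_line):
    """Extract the 1, 5 and 15 minute load averages from `uptime` output."""
    match = re.search(
        r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", uptime_line)
    if match is None:
        raise ValueError("no load averages found")
    return tuple(float(value) for value in match.groups())


def load_red_flag(uptime_line, limit=RED_FLAG_LOAD):
    """True when any of the three load averages exceeds the limit."""
    return any(value > limit for value in parse_load_averages(uptime_line))


# Hypothetical uptime line for a busy but not alarming server.
sample = " 10:15:01 up 112 days,  3:02,  2 users,  load average: 5.02, 5.01, 4.98"
print(parse_load_averages(sample), load_red_flag(sample))
```

Comparing the three values against each other is also useful: a 1 minute average far above the 15 minute average points to a recent spike rather than a steady load.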
CPU utilization. I am more concerned about how the CPU is being utilized than how much it's being utilized. To better understand how the CPU is being used, the vmstat and mpstat commands are helpful. The mpstat averages are over the mpstat samples, and not the server uptime. If there is high user space CPU activity, then check the "Top 10 CPU Processes" to find the offending binaries. If it's system space, then look at the vmstat "system" columns. High interrupts (in) may indicate misbehaving hardware or an impending hardware failure. Look at procinfo in hardware.txt to track down which interrupt is causing the problem. A high number of context switches (cs) may indicate an application bug.
Red Flag: Values for "in" or "cs" greater than 10,000. NOTE: A high number of context switches is normal for the SLERT kernel.
Test Case: Comparing the first vmstat line with the other lines shows on average the CPU has been idle, but recently it has spiked to 100%. Each CPU is topped at 100% according to mpstat. The load averages are about the same for the past 1, 5 and 15 minutes. Notice that the user space is consuming the CPU. The interrupts (in) and context switches (cs) are not a concern. Since the problem seems isolated to user space, look at the top ten CPU processes. A program called "loop" is the major offender here. There are five of them, and they are all consuming heavy CPU time.
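The vmstat reasoning above can be sketched as a small parser. The sample output is hypothetical and only mimics the shape of the test case (idle on average, then pinned in user space). The exact column set varies between vmstat versions, so the sketch reads column names from the header instead of assuming positions. The 10,000 limit is the red flag value given above.

```python
SAMPLE_VMSTAT = """\
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0  39936  51200 812000    0    0     2     5   45   60  3  1 96  0  0
 5  0      0  39900  51200 812000    0    0     0     0  262  118 100  0  0  0  0
 5  0      0  39880  51200 812000    0    0     0     4  259  121 99  1  0  0  0
"""


def parse_vmstat(text):
    """Return one dict per vmstat sample, keyed by the column names."""
    lines = [line.split() for line in text.strip().splitlines()]
    header_idx = next(
        i for i, cols in enumerate(lines) if "cs" in cols and "us" in cols)
    header = lines[header_idx]
    rows = []
    for cols in lines[header_idx + 1:]:
        if len(cols) == len(header) and all(c.isdigit() for c in cols):
            rows.append(dict(zip(header, map(int, cols))))
    return rows


def system_red_flags(row, limit=10000):
    """Names of the 'system' columns (in, cs) that exceed the limit."""
    return [col for col in ("in", "cs") if row[col] > limit]


def dominant_space(row):
    """Report whether user or system space is consuming more CPU."""
    return "user" if row["us"] >= row["sy"] else "system"


rows = parse_vmstat(SAMPLE_VMSTAT)
print("since boot:", rows[0]["id"], "% idle")
for row in rows[1:]:
    print(dominant_space(row), row["us"], system_red_flags(row))
```

Remember that the first sample is the average since boot, so it is the later rows that describe the current problem.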
Memory utilization. A small amount of free memory does not necessarily mean the server is running out of memory. Linux is efficient with memory usage and caches as much as it can. You can also look at the "Top 10 Memory Processes" to find out which applications are using the most memory.
Red Flag: The server is frequently swapping to disk, and free memory is below 2MB.
Test Case: The server is not currently swapping to disk, there is a lot of cached memory and free memory is 39MB. I wouldn't worry about memory.
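A minimal sketch of the memory red flag follows, assuming /proc/meminfo style text and a list of swap-in/swap-out pairs such as the si and so columns from vmstat. The 2MB limit is from the red flag above. Treating "frequently" as more than half of the samples is purely my assumption, and the sample numbers are hypothetical apart from the roughly 39MB of free memory.

```python
def parse_meminfo(text):
    """Parse 'Key:   value kB' lines into a dict of integers."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            info[key.strip()] = int(fields[0])
    return info


def memory_red_flag(meminfo, swap_samples, free_limit_kb=2048):
    """True when the server swaps in most samples AND free memory is low.

    swap_samples is a list of (si, so) pairs. "Most samples" is an
    assumed reading of "frequently swapping", not a documented rule.
    """
    swapping = sum(1 for si, so in swap_samples if si or so) > len(swap_samples) / 2
    return swapping and meminfo["MemFree"] < free_limit_kb


# Hypothetical values resembling the test case: little free, lots cached.
sample = "MemTotal: 2075000 kB\nMemFree: 39936 kB\nCached: 1200000 kB\n"
meminfo = parse_meminfo(sample)
print(memory_red_flag(meminfo, [(0, 0), (0, 0), (0, 0)]))
```

Both conditions are required on purpose. Low free memory by itself is normal on Linux because of caching, as noted above.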
Disk utilization. It is bad to run out of disk space. Linux uses files as a way to write to memory, disk, and all sorts of other things. If a temporary file or named pipe cannot be created, the system will be unreliable. This is particularly problematic if root, /tmp or /home get full.
Red Flag: Running out of disk space on the root "/", /tmp or /home partitions. Make sure you are not running out of inodes on these file systems either.
Test Case: There is plenty of disk space and free inodes.
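To check space and inodes on a live server rather than from the text file, something like the following sketch works with the standard os.statvfs call. The 90% limit is an assumption for illustration, since the red flag above only says "running out". The percentages can differ slightly from df, which accounts for blocks reserved for root.

```python
import os


def percent_used(total, free):
    """Percentage of a resource in use; 0.0 if the total is not reported."""
    if total == 0:
        return 0.0  # some file systems report no inode counts
    return 100.0 * (total - free) / total


def usage_percent(path):
    """Return (space used %, inodes used %) for the file system at path."""
    st = os.statvfs(path)
    return (percent_used(st.f_blocks, st.f_bfree),
            percent_used(st.f_files, st.f_ffree))


def flagged(path, limit=90.0):
    """True when space or inode usage exceeds the assumed limit."""
    space, inodes = usage_percent(path)
    return space > limit or inodes > limit


# The three locations named in the red flag above.
for path in ("/", "/tmp", "/home"):
    if os.path.isdir(path):
        space, inodes = usage_percent(path)
        print("%-6s space %5.1f%%  inodes %5.1f%%  flag=%s"
              % (path, space, inodes, flagged(path)))
```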
Kernel taint status. Will Novell support a tainted kernel? Yes and No. If the kernel is tainted with third party drivers, then the kernel development teams will have a difficult time providing a patch for the kernel, since the kernel has changed from anything Novell provides. However, the support teams will do their best to help, regardless. If at all possible, reboot your server and duplicate the problem on an untainted kernel.
Red Flag: A tainted kernel.
Test Case: The kernel is tainted, but since the problem is a third party application, "loop", the taint status will not affect supportability. If it is ever determined that one of these drivers is part of the problem, then the taint status would affect supportability.
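On a running system, the taint status is exposed as an integer in /proc/sys/kernel/tainted, where zero means not tainted. The sketch below decodes only the two lowest bits (proprietary module loaded, module force-loaded) and lumps everything else together, so it is deliberately incomplete. Distribution kernels may define additional flags that it does not name.

```python
def describe_taint(value):
    """Give a short description of a kernel taint value."""
    if value == 0:
        return "not tainted"
    reasons = []
    if value & 1:
        reasons.append("proprietary module loaded")
    if value & 2:
        reasons.append("module force-loaded")
    if value & ~3:
        # Any remaining bits are left undecoded in this sketch.
        reasons.append("other taint flags set")
    return "tainted: " + ", ".join(reasons)


def read_taint(path="/proc/sys/kernel/tainted"):
    """Read the taint value from a live system."""
    with open(path) as handle:
        return int(handle.read().strip())


print(describe_taint(0))
print(describe_taint(1))
```

On a live server, describe_taint(read_taint()) gives the current status.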
Health of system processes. Look for processes in a "D" (uninterruptible sleep) or "Z" (zombie/defunct) state. Several processes in a D state usually means they are waiting on disk I/O. Any command that accesses the same disk I/O path may appear to be hung while it waits on that particular disk I/O. This may explain why the server appears to "hang" at times. A process in a Z state has exited, whether normally or from a crash such as a segfault, but its parent process has not yet collected its exit status. This may indicate an unhealthy parent process. A high number of D state processes puts the server performance at risk, whereas Z state processes put running applications at risk.
Test Case: One of the "loop" processes is in a "D" state. Since this is one of the applications causing the utilization, this is a red flag. However, since there is only one and the others are working, I suspect this condition is normal and temporary.
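Counting D and Z state processes can be sketched as below. The code assumes ps style text with a header row that contains a STAT column, and it makes no claim about the exact ps options supportconfig uses. The sample listing is hypothetical.

```python
from collections import Counter

SAMPLE_PS = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0    784   308 ?        S    Jan01   0:02 init [5]
jdoe      4101 99.0  0.1   2520   640 pts/1    R    10:01  55:10 ./loop
jdoe      4102 98.7  0.1   2520   640 pts/1    D    10:01  54:58 ./loop
jdoe      4200  0.0  0.0      0     0 pts/1    Z    10:05   0:00 [helper] <defunct>
"""


def count_states(ps_text):
    """Count processes by the first letter of their STAT field."""
    lines = ps_text.strip().splitlines()
    header = lines[0].split()
    stat_col = header.index("STAT")
    counts = Counter()
    for line in lines[1:]:
        # Limit the split so a command containing spaces stays in one field.
        cols = line.split(None, len(header) - 1)
        if len(cols) > stat_col:
            counts[cols[stat_col][0]] += 1
    return counts


states = count_states(SAMPLE_PS)
print("D:", states["D"], "Z:", states["Z"])
```

Only the first letter of STAT is used, because ps appends modifier characters (for example "s" or "+") after the main state.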
AppArmor reject messages. If you have any AppArmor reject messages, try turning off AppArmor, rebooting the server and retesting the problem. AppArmor is powerful and can even prevent the root account from performing operations.
Test Case: The only thing I'm really concerned about is the presence of a REJECT message, not what kind of message it is. Since this server has one, you might consider turning off AppArmor, rebooting the server and retesting. Maybe the "loop" application will run properly without any AppArmor constraints.
List of running processes. It's good to know what is running on the system. Make sure you are only running necessary applications and daemons.
Test Case Summary: A basic server health check has revealed the root cause of the high CPU utilization issue. The "loop" processes are the cause. However, even if the CPU utilization is high, 100% in this case, I would still ask the question, "Is the server experiencing performance degradation?" If not, then I probably wouldn't worry about the high CPU utilization. Otherwise, consider offloading or splitting up those applications among other servers. The point is, you were able to create an action plan to address the problem from a basic server health check.
Basic Problem Solving
When troubleshooting a specific problem, I like to eliminate the obvious first. A general approach is to verify the associated RPM packages, make sure the service is configured to start at boot time and is currently running, validate the configuration files, and check the service's log file for obvious errors.
The following is a summary of the red and yellow flags with their suggested limits.
- Third Party Packages
- Utilization - %Busy (%Idle): Yellow Flag > 80% (< 20%), Red Flag > 90% (< 10%)
- Context Switches/sec (cs)
- Percent Space Used
- Percent Inodes Used
- Number in D State
- Number in Z State
When a problem occurs on your server, you should first get a supportconfig tar ball. Perform a basic server health check using the basic-environment.txt and basic-health-check.txt files. Some problems can be corrected or minimized by simply checking the basic health of the server. The first basic action item is to ensure the server is patched and up-to-date. Next make sure the server is not overloaded, running out of memory, or out of disk space. Identify any Red Flags you might have, and make sure you can explain why they exist. If you open a service request with Novell Technical Services, include your health check results along with the supportconfig tar ball of the healthy server and a supportconfig taken when the problem occurs.
Disclaimer: As with everything else at Cool Solutions, this content is definitely not supported by Novell (so don't even think of calling Support if you try something and it blows up).
It was contributed by a community member and is published "as is." It seems to have worked for at least one person, and might work for you. But please be sure to test, test, test before you do anything drastic with it.