Overview
Methodology
Health Check Overview
Basic Environment Check
Basic Health Check
Basic Problem Solving
Table Summary
Conclusion
Overview
When problems arise with a server, the simple is often overlooked. Is the monitor plugged in? Is the install media in the DVD drive? Was the service started? Supportconfig is a tool designed to gather system information in a way that promotes resolving problems as quickly as possible. The goal of this article is to show the administrator how to use supportconfig to check the basic health of the server. A test case of a server with high CPU utilization will illustrate the process. Once you have created a supportconfig tar ball, you should perform a server health check. Checking the basics begins with the basic-environment.txt and basic-health-check.txt files.
Methodology
Supportconfig has three primary purposes: 1) gather important system information, 2) reduce problem resolution time, and 3) teach useful system commands. Information is critical to any problem-solving scenario. The basic supportconfig philosophy is to gather as much information as possible when there is a problem, so we only have to ask for it once. Once the information is gathered, it should be organized so that problems can be solved quickly and efficiently. To that end, several pieces of information are replicated to create a kind of one-stop-shop environment. For example, all services and their current run level states are recorded in the chkconfig.txt file. However, the current state of services specific to Logical Volume Management (LVM) is also recorded in the lvm.txt file, which makes lvm.txt one location to review much of the LVM information. All files end with a '.txt' extension so they are easily recognized and opened with default editors across platforms. As a teaching tool, supportconfig first logs each command used to gather information to the appropriate file, and then records the output. This way, if supportconfig ever "hangs," you know what command it hung on. You can also quickly reproduce any piece of information you want, because the exact command with its path and options was recorded in the text file.
To get good at reading a supportconfig, you need experience. After you look at 100 supportconfig tar balls, you will learn what is normal. The abnormal will then stand out. How can you cut the learning curve if you don't have 100 supportconfigs or the time to look at them? After each of your systems is running smoothly and tuned to your needs, get a supportconfig. Copy the tar ball off the server so you can use it for comparison if the server ever experiences a problem that needs troubleshooting. Compare the supportconfig tar ball taken when the problem occurred with the saved good copy for that server. Make sure you submit both tar balls to Novell Technical Services if you need to open a service request.
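The comparison described above can be sketched in a few lines of Python. This is only an illustration, not part of supportconfig: it assumes both tar balls have already been extracted into two directories, and the function name and directory arguments are hypothetical. It produces a unified diff of one text file (for example chkconfig.txt) between the known-good copy and the copy taken during the problem.

```python
import difflib
from pathlib import Path


def diff_supportconfig_file(baseline_dir, problem_dir, name):
    """Return unified-diff lines for one text file found in both directories.

    baseline_dir and problem_dir are assumed to be the extracted contents
    of the "good" and the "problem" supportconfig tar balls.
    """
    old = Path(baseline_dir, name).read_text(errors="replace").splitlines()
    new = Path(problem_dir, name).read_text(errors="replace").splitlines()
    return list(
        difflib.unified_diff(
            old,
            new,
            fromfile=f"baseline/{name}",
            tofile=f"problem/{name}",
            lineterm="",
        )
    )
```

Expect some noise in the result, since values such as dates and uptime will differ between any two runs. The lines worth reading are the ones that changed for no obvious reason, such as a service that is on in one copy and off in the other.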
Health Check Overview
Whenever I get a supportconfig, I check the basics for obvious problems. The goal in checking the health of the server is to note red and yellow flags. Red Flags are problems that must be explained before moving forward. They are issues that directly relate to the reported problem or may affect the overall server performance. Red flags are not necessarily bad, but they do need to be explained. Yellow Flags are issues that probably should be addressed or at least monitored, but are not directly related to the problem. A basic server health check in its simplest form confirms that the server is patched and up-to-date, that the CPU utilization, memory usage and disk space are within normal limits, and that the kernel and running services are healthy.
Basic Environment Check
Look at the basic-environment.txt file first. The goal for reviewing this file is to confirm that the server is patched, to see which major packages are installed, and to identify any firewall concerns.
- Download and use the latest supportconfig. We want the best diagnostic information possible.
- Verify that the script execution date falls within the problem time frame, so we aren't looking at obsolete information.
Test Case: As of this writing, version 2.18-11 is the current supportconfig, and the run date matches the time frame of the issue.
Red Flag: The running kernel according to uname is not the latest kernel or is different than the installed kernel RPM package.
Test Case: The kernel is not up to date. Since the problem is CPU utilization, updating the kernel may be a valid troubleshooting step. However, we would want to first know how the CPU is being utilized; keep reading.
Red Flag: There is a mismatch among the three: SPident, SuSE-release and the running kernel.
Test Case: SPident and SuSE-release say the server is at SLES10 SP1. The running kernel is 2.6.16.53-0.16, which is newer than the SLES10 SP1 kernel 2.6.16.46-0.12. Since the server has been patched, all three are consistent.
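To make the kernel comparison in this test case concrete, here is a minimal Python sketch. It simply compares the numeric fields of two version strings in order; that is a simplification I am assuming for illustration, not RPM's actual version comparison algorithm, and the function names are hypothetical.

```python
import re


def kernel_key(version):
    """Turn a string like '2.6.16.53-0.16' into a tuple of ints.

    Non-numeric parts such as a '-default' flavor suffix are ignored.
    """
    return tuple(int(n) for n in re.findall(r"\d+", version))


def is_newer(running, reference):
    """True if the running kernel sorts after the reference kernel."""
    return kernel_key(running) > kernel_key(reference)


# Versions taken from the test case above.
running_kernel = "2.6.16.53-0.16"
sles10_sp1_kernel = "2.6.16.46-0.12"
patched_past_sp1 = is_newer(running_kernel, sles10_sp1_kernel)  # True
```

A running kernel that is newer than the service pack's base kernel is consistent with a patched SP1 server, which is why the three sources are treated as matching here.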
Red Flag: An RPM package relating to the problem is outdated or any other package that is very old.
Test Case: The SPident RPM package is fine, and there are no conflicting packages. Updates have been applied, but since the kernel is outdated, there are probably other package updates as well. It may be worth patching the server, rebooting and retesting the CPU utilization issue.
Red Flag: A third party package replaces a SUSE Linux distributed package.
Test Case: The only third party package installed is from the "Novell NTS" distribution. It provides supportconfig itself and does not replace a distributed package, like apache or LVM. Sometimes third party packages don't list a distribution, so they show up as "(none)." SUSE also distributes some packages with a "(none)" distribution. You can see these packages by searching for "(none)" in the rpm.txt file.
Test Case: The firewall services are turned off and there are no currently active firewall rules. The firewall won't play a part in the problem.
Basic Health Check
Look at the basic-health-check.txt file next. The goal for reviewing this file is to check CPU utilization, memory, disk utilization, kernel taint status, and the health of running processes.
- Check the load averages. The load average is the average number of processes that were running, waiting to run, or in uninterruptible sleep over the past one, five and fifteen minutes. It is a good indication of how busy the kernel is. A high load average may not be bad, but it should be explainable and should not impact the overall server performance.
Red Flag: Load averages greater than 20.
Test Case: The uptime is 112 days. This is good to know since the first line of vmstat is an average over the uptime of the server. Subsequent vmstat lines are current snapshots in time. This allows us to observe which of the values have changed over time. The load averages show the kernel is busy, but probably nothing to worry about. It is consistent with the reported CPU utilization concern.
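As a rough illustration of the load average check, the following Python sketch pulls the three values out of an uptime-style line and compares the worst one against the limits used in this article (yellow above 5, red above 20). The sample line is made up for the example, apart from the 112 days of uptime, and the function names are hypothetical.

```python
import re

# Hypothetical uptime line; only the "112 days" comes from the test case.
SAMPLE_UPTIME = " 10:15:01 up 112 days,  3:02,  2 users,  load average: 6.10, 5.95, 5.80"


def parse_load_averages(uptime_line):
    """Return the 1, 5 and 15 minute load averages as floats."""
    match = re.search(
        r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load averages found in line")
    return tuple(float(value) for value in match.groups())


def load_flag(averages, yellow=5.0, red=20.0):
    """Classify the highest of the three averages."""
    worst = max(averages)
    if worst > red:
        return "red"
    if worst > yellow:
        return "yellow"
    return "ok"


averages = parse_load_averages(SAMPLE_UPTIME)
flag = load_flag(averages)  # "yellow" for the sample line
```

Three similar values, as in the sample, suggest a steady load rather than a short spike. A one minute average far above the fifteen minute average would point to something that started recently.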
Red Flag: Values for "in" or "cs" greater than 10,000. NOTE: A high number of context switches is normal for the SLERT kernel.
Test Case: Comparing the first vmstat line with the other lines shows on average the CPU has been idle, but recently it has spiked to 100%. Each CPU is topped at 100% according to mpstat. The load averages are about the same for the past 1, 5 and 15 minutes. Notice that the user space is consuming the CPU. The interrupts (in) and context switches (cs) are not a concern. Since the problem seems isolated to user space, look at the top ten CPU processes. A program called "loop" is the major offender here. There are five of them, and they are all consuming heavy CPU time.
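The "first line versus the rest" comparison can be sketched as follows. This is a minimal Python example under stated assumptions: the vmstat output below is invented to resemble the test case (idle on average, 100% busy now), and the parser assumes the usual column-name row containing "in", "cs" and "id".

```python
# Hypothetical vmstat output; the numbers are illustrative only.
SAMPLE_VMSTAT = """\
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0  39000  80000 400000    0    0     5     7  120  300  2  1 97  0  0
 5  0      0  39000  80000 400000    0    0     0     0  260  410 100  0  0  0  0
 5  0      0  39000  80000 400000    0    0     0     0  255  398 99  1  0  0  0
"""


def parse_vmstat(text):
    """Return one dict per data row, keyed by vmstat column name."""
    rows = [line.split() for line in text.strip().splitlines()]
    # The column-name row is the one containing both "in" and "cs".
    header_at = next(
        i for i, cols in enumerate(rows) if "in" in cols and "cs" in cols
    )
    header = rows[header_at]
    return [
        dict(zip(header, (int(col) for col in cols)))
        for cols in rows[header_at + 1:]
        if len(cols) == len(header)
    ]


samples = parse_vmstat(SAMPLE_VMSTAT)
since_boot, recent = samples[0], samples[1:]

busy_since_boot = 100 - since_boot["id"]          # average since boot
busy_recent = [100 - row["id"] for row in recent]  # current snapshots
worst_in = max(row["in"] for row in recent)
worst_cs = max(row["cs"] for row in recent)
```

With the sample data the average since boot is 3% busy while both snapshots are 100% busy, and the interrupt and context switch rates stay far below the 10,000 red flag limit. That is the same pattern as the test case: the CPU time is going to user space, not to the kernel servicing interrupts.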
Red Flag: The server is frequently swapping to disk, and free memory is below 2MB.
Test Case: The server is not currently swapping to disk, there is a lot of cached memory, and free memory is 39MB. I wouldn't worry about memory.
Red Flag: Running out of disk space on the root "/", /tmp or /home partitions. Make sure you are not running out of inodes on these file systems either.
Test Case: There is plenty of disk space and free inodes.
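The space and inode checks follow the same pattern, so one small parser can cover both. The Python sketch below assumes df-style output with one file system per line and a percentage column (Use% or IUse%); the sample data and the function name are hypothetical, and the 80% and 90% limits are the ones suggested in this article.

```python
# Hypothetical df output; the file systems and sizes are illustrative only.
SAMPLE_DF = """\
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/sda2       20000000 8000000  12000000  40% /
/dev/sda3       10000000 8500000   1500000  85% /home
/dev/sda5        5000000 4800000    200000  96% /tmp
"""


def usage_flags(df_output, yellow=80, red=90):
    """Map each mount point to 'ok', 'yellow' or 'red'.

    Works for both space and inode listings, because it only looks for
    the column that ends in '%'. Mount points containing spaces and
    entries wrapped across two lines are not handled.
    """
    flags = {}
    for line in df_output.strip().splitlines()[1:]:
        cols = line.split()
        percent = next((c for c in cols if c.endswith("%")), None)
        if percent is None or not percent[:-1].isdigit():
            continue  # e.g. file systems that report "-" for inodes
        used = int(percent[:-1])
        mount_point = cols[-1]
        if used > red:
            flags[mount_point] = "red"
        elif used > yellow:
            flags[mount_point] = "yellow"
        else:
            flags[mount_point] = "ok"
    return flags


flags = usage_flags(SAMPLE_DF)
```

For the sample data, "/" is fine, /home is a yellow flag and /tmp is a red flag.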
Red Flag: A tainted kernel.
Test Case: The kernel is tainted, but since the problem is a third party application, "loop", the taint status will not affect supportability. If it is ever determined that one of these drivers is part of the problem, then the taint status would affect supportability.
Test Case: One of the "loop" processes is in a "D" state. Since this is one of the applications causing the utilization, this is a red flag. However, since there is only one and the others are working, I suspect this condition is normal and temporary.
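Counting processes by state is easy to sketch. The Python example below assumes a ps-style listing with a header row that includes a STAT column and no spaces in the columns before it; the sample listing is invented to resemble the test case, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical process listing; PIDs and CPU figures are illustrative only.
SAMPLE_PS = """\
  PID STAT %CPU COMMAND
    1 Ss    0.0 init
 4101 R    99.0 loop
 4102 D    98.5 loop
 4103 R    99.1 loop
 4150 Z     0.0 defunct-child
"""


def count_states(ps_output):
    """Count processes by the first letter of their STAT value."""
    lines = ps_output.strip().splitlines()
    stat_col = lines[0].split().index("STAT")
    counts = Counter()
    for line in lines[1:]:
        cols = line.split()
        if len(cols) > stat_col:
            counts[cols[stat_col][0]] += 1
    return counts


def state_flag(count, yellow, red):
    """Apply a yellow/red limit pair to a process count."""
    if count > red:
        return "red"
    if count > yellow:
        return "yellow"
    return "ok"


counts = count_states(SAMPLE_PS)
d_flag = state_flag(counts["D"], yellow=3, red=5)
z_flag = state_flag(counts["Z"], yellow=5, red=10)
```

The counts only tell part of the story. A single D state process stays under the numeric limits, yet it still deserves a second look when it belongs to the application you are investigating, as it does in this test case.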
Test Case: The only thing I'm really concerned about is the presence of a REJECT message, not what kind of message it is. Since this server has one, you might consider turning off AppArmor, rebooting the server and retesting. Maybe the "loop" application will run properly without any AppArmor constraints.
Test Case Summary: A basic server health check has revealed the root cause of the high CPU utilization issue. The "loop" processes are the cause. However, even if the CPU utilization is high, 100% in this case, I would still ask the question, "Is the server experiencing performance degradation?" If not, then I probably wouldn't worry about the high CPU utilization. Otherwise, consider offloading or splitting up those applications among other servers. The point is, you were able to create an action plan to address the problem from a basic server health check.
Basic Problem Solving
When troubleshooting a specific problem, I like to eliminate the obvious first. A general approach is to verify the associated RPM packages, make sure the service is configured to start at boot time and is currently running, validate the configuration files, and check the service's log file for obvious errors.
Table Summary
The following is a summary of the red and yellow flags with their suggested limits.
| Category | Description | Yellow Flag | Red Flag |
|---|---|---|---|
| RPM | Outdated Packages | Unrelated Packages | Related Packages |
| RPM | Third Party Packages | N/A | Replacements |
| Kernel | Running Version | N/A | Old Kernel |
| Kernel | Load Averages | > 5 | > 20 |
| CPU | Utilization - %Busy (%Idle) | > 80% (< 20%) | > 90% (< 10%) |
| CPU | Interrupts/sec (in) | > 8000 | > 10000 |
| CPU | Context Switches/sec (cs) | > 8000 | > 10000 |
| Memory | Free | < 4MB | < 2MB |
| Disk | Percent Space Used | > 80% | > 90% |
| Disk | Percent Inodes Used | > 80% | > 90% |
| Kernel | Taint Status | N/A | tainted |
| Processes | Number in D State | > 3 | > 5 |
| Processes | Number in Z State | > 5 | > 10 |
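If you collect these values regularly, the numeric rows of the table can be written down as data and applied in one pass. The following Python sketch is one possible way to do that; the metric names are hypothetical labels I chose for the example, and the rows without numeric limits (packages, running kernel version, taint status) are left out because they need judgment rather than a threshold.

```python
# (yellow limit, red limit, higher_is_worse), copied from the table above.
LIMITS = {
    "load_average": (5, 20, True),
    "cpu_busy_pct": (80, 90, True),
    "interrupts_per_sec": (8000, 10000, True),
    "context_switches_per_sec": (8000, 10000, True),
    "memory_free_mb": (4, 2, False),  # less free memory is worse
    "disk_space_used_pct": (80, 90, True),
    "disk_inodes_used_pct": (80, 90, True),
    "procs_d_state": (3, 5, True),
    "procs_z_state": (5, 10, True),
}


def classify(metric, value):
    """Return 'ok', 'yellow' or 'red' for one measurement."""
    yellow, red, higher_is_worse = LIMITS[metric]
    if not higher_is_worse:
        # Flip the sign so the same "greater than" tests apply.
        value, yellow, red = -value, -yellow, -red
    if value > red:
        return "red"
    if value > yellow:
        return "yellow"
    return "ok"


def health_report(measurements):
    """Classify every measurement in a {metric: value} dict."""
    return {name: classify(name, value) for name, value in measurements.items()}


# Values loosely based on the test case: 100% busy, 39MB free, one D state.
report = health_report(
    {"cpu_busy_pct": 100, "memory_free_mb": 39, "procs_d_state": 1}
)
```

For these inputs only the CPU utilization comes back red, which matches the conclusion of the test case. Treat the limits as starting points and adjust them to what is normal for each server.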
Conclusion
When a problem occurs on your server, you should first get a supportconfig tar ball. Perform a basic server health check using the basic-environment.txt and basic-health-check.txt files. Some problems can be corrected or minimized by simply checking the basic health of the server. The first basic action item is to ensure the server is patched and up-to-date. Next make sure the server is not overloaded, running out of memory, or out of disk space. Identify any Red Flags you might have, and make sure you can explain why they exist. If you open a service request with Novell Technical Services, include your health check results along with the supportconfig tar ball of the healthy server and a supportconfig taken when the problem occurs.
Disclaimer: As with everything else at Cool Solutions, this content is definitely not supported by Novell (so don't even think of calling Support if you try something and it blows up).
It was contributed by a community member and is published "as is." It seems to have worked for at least one person, and might work for you. But please be sure to test, test, test before you do anything drastic with it.
User Comments
Some suggestions
Submitted by konecnya on 13 September 2010 - 7:58pm.
Do something to make the file names stand out from the text they are in. It all blends together, and my first read of the first one was "Look at the basic environment file first."
I guess that's just part of the fuzzy nature of the old wetware, and I am also dealing with a (hopefully temporary) vision problem that makes precision reading a challenge.
In "Basic Environment Check" item 2, second bullet point on kernel version, I think you mean 'outweigh', vs 'out way'.
I see that I have a newer kernel than the one showing on the TID, but then I see the document was last updated a whole 40 days ago, so you are 'only' updating every couple of months, not with every patch. It might be worth a comment to that effect in that TID and/or this document, as I doubt you'll be able to keep that TID 100% up to date all the time.
A very nice rundown of the basics. Now to get my Linux skills up to the point where I can fix the things I am finding. Thank you.