System crash or unexpected reboot - What information is needed by Customer Support for a root cause analysis?

  • 7010249
  • 05-Mar-2012
  • 13-Sep-2019

Environment

SUSE Linux Enterprise Desktop 11
SUSE Linux Enterprise Desktop 10
SUSE Linux Enterprise Server 15
SUSE Linux Enterprise Server 12
SUSE Linux Enterprise Server 11
SUSE Linux Enterprise High Availability Extension
SUSE Linux Enterprise Real Time Extension
SLES Expanded Support Platform
SUSE Linux Enterprise Desktop 12

Situation

A system based on SLE or the SLES Expanded Support Platform crashed or rebooted unexpectedly, and a request is about to be opened with Customer Care to identify the root cause. This article lists the questions that arise with every system crash. Providing as much detail and system information as possible helps to identify the cause.

Resolution

When opening a request about a system crash, please provide answers to the following questions:
  1. When did the crash occur?
    Please provide the exact time and date.

  2. What is the system's main task?

  3. Was this a one-time crash, or did the system encounter this issue several times?
    If the system crashed several times, please provide all known occurrences.

  4. At the time the system crashed, were any particular log entries noticed?

  5. In case no entries can be found in /var/log/messages, were any entries written to the logs of the hardware management board?

  6. What was the situation on the system before it crashed?
    Please report any observations, for example an increase in CPU or RAM usage, or high I/O wait.
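
To help answer question 4, the following is a minimal sketch of a helper that lists log lines which commonly accompany a kernel crash. The function name scan_crash_log and the pattern list are assumptions made for illustration. The list is not exhaustive, and a match does not by itself prove a crash.

```shell
# Print numbered lines from a log file that often accompany a kernel crash.
# The search patterns are an illustrative assumption, not a complete list.
scan_crash_log() {
    logfile=$1
    grep -n -E 'Kernel panic|kernel BUG|Oops|Call Trace|Out of memory' "$logfile"
}
```

As root, it could be called as scan_crash_log /var/log/messages. Lines shortly before the reported crash time are usually the most relevant ones.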

What kind of system data is needed by Customer Care?

SUSE Customer Support uses a tool called supportutils (https://www.suse.com/c/free_tools/supportconfig-linux/) for troubleshooting. To create a system report, please run the following command as root:

supportconfig -l

This will collect all relevant system data (even older, already rotated messages files) and create a compressed file in /var/log with the following file name:

nts_$HOSTNAME_$DATE_$TIME.tbz

Please always run the most recent version of supportutils for best results, and attach this file to the service request. If outbound FTP traffic is allowed by the corporate firewall, the archive can be uploaded directly to the service request using

    supportconfig -lur <11-digit service request number>
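
Before attaching the archive manually, it can help to confirm which file supportconfig created most recently. The sketch below assumes the nts_*.tbz naming described above. The function name latest_supportconfig is hypothetical, and other supportutils versions may use a different prefix or extension.

```shell
# Print the most recently modified supportconfig archive in a directory.
# Assumes the nts_*.tbz naming scheme; the directory defaults to /var/log.
latest_supportconfig() {
    dir=${1:-/var/log}
    ls -t "$dir"/nts_*.tbz 2>/dev/null | head -n 1
}
```

Calling latest_supportconfig without an argument searches /var/log. It prints nothing if no matching archive exists.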

For SUSE Expanded Support based systems please provide a sosreport.

If the crash happens in a clustered environment (Novell Cluster Services or SLE11 High Availability Extension), please provide a system report for all involved nodes.

Steps to trace system reboots

In certain situations no crash messages can be found in /var/log/messages, especially when the system management board reset the hardware. Please also check whether /var/log/mcelog contains any reports. If it does, a hardware check should be performed first, and all hardware components should be updated to the most recent BIOS / firmware level. If the system crashes repeatedly without leaving evidence, connect a second system via a serial connection as outlined in TID 3456486 - Configuring a Remote Serial Console for SLES.
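
The mcelog check described above can be sketched as a small shell test. The function name has_mce_reports is hypothetical. It only checks whether the file exists and is non-empty, and does not interpret its content.

```shell
# Succeed if the given mcelog file exists and is not empty.
# The path defaults to /var/log/mcelog, as named in this article.
has_mce_reports() {
    [ -s "${1:-/var/log/mcelog}" ]
}
```

For example, has_mce_reports && echo "machine check reports found" prints the message only when /var/log/mcelog contains data.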

Kernel Core Dump capture

If a system crashes, a kernel core dump can be captured using kdump. Its configuration is explained in TID 3374462 - Configure kernel core dump capture. A best practices document about providing kernel core dumps to Customer Care is available at TID 7010056 - Best practice for providing kernel core dumps to support incidents.

For SLES Expanded Support based systems, please consult the corresponding online documentation for RHEL5 or RHEL6 on configuring kdump.

Please note: kernel core dumps must have been written completely to the dump device. To ensure this is the case, set KDUMP_IMMEDIATE_REBOOT to "yes" in /etc/sysconfig/kdump and wait for the system to reboot itself. Note that cores can be very large, so this may take a while. Forcing a reboot manually could interrupt the writing and result in an incomplete core. If the dump is incomplete for whatever reason, an analysis will not be possible.
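
To verify the setting named above, the current value can be read from the configuration file. This is a minimal sketch: the function name kdump_immediate_reboot is hypothetical, and it assumes the file consists of shell-style VARIABLE="value" assignments, so it can be sourced in a subshell.

```shell
# Print the value of KDUMP_IMMEDIATE_REBOOT from a sysconfig-style file.
# The file is sourced in a subshell, so the calling shell is not modified.
kdump_immediate_reboot() {
    conf=${1:-/etc/sysconfig/kdump}
    ( . "$conf" && printf '%s\n' "$KDUMP_IMMEDIATE_REBOOT" )
}
```

If kdump_immediate_reboot prints yes, the system reboots by itself once the dump has been written.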


Additional information

sysstat is a tool that collects system data (e.g. CPU, RAM, and I/O usage) at regular intervals. Its output is a valuable source of information when troubleshooting crash situations. Please consider installing the package sysstat and enabling its service by using

chkconfig boot.sysstat on
/etc/init.d/boot.sysstat start

If this service is activated before the system crashes, supportconfig and sosreport will include its output into the system report.
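
The chkconfig and init script commands above apply to SysV init based releases. On systemd based releases such as SLES 12 and SLES 15, the service is typically managed with systemctl instead. The sketch below only prints the commands for the detected init system and does not run them. The function name sysstat_enable_cmds is hypothetical, and the systemd unit name sysstat is an assumption that should be verified on the target system.

```shell
# Print (do not run) the commands that enable sysstat for an init system.
# "sysv" repeats the commands from this article; "systemd" assumes a unit
# named sysstat.
sysstat_enable_cmds() {
    case "$1" in
        systemd)
            printf '%s\n' "systemctl enable sysstat" "systemctl start sysstat"
            ;;
        sysv)
            printf '%s\n' "chkconfig boot.sysstat on" "/etc/init.d/boot.sysstat start"
            ;;
        *)
            return 1
            ;;
    esac
}

# /run/systemd/system exists only when systemd is the running init system.
if [ -d /run/systemd/system ]; then
    init=systemd
else
    init=sysv
fi
sysstat_enable_cmds "$init"
```

The printed commands need to be run as root.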