System freezes after more than 208 days uptime

This document (7009834) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise 11 Service Pack 1

Situation

The system crashes at an uptime of around 209 days, and the following warning can be found in /var/log/messages:

------------[ cut here ]------------
WARNING: at /usr/src/packages/BUILD/kernel-default-2.6.32.29/linux-2.6.32/kernel/sched.c:3847 update_cpu_power+0x151/0x160()
[...]
Call Trace:
[<ffffffff810061dc>] dump_trace+0x6c/0x2d0
[<ffffffff813974e8>] dump_stack+0x69/0x71
[<ffffffff8104d754>] warn_slowpath_common+0x74/0xd0
[<ffffffff8103d6e1>] update_cpu_power+0x151/0x160
[<ffffffff8103e323>] find_busiest_group+0xa83/0xce0
[<ffffffff8104604d>] load_balance_newidle+0xcd/0x380
[<ffffffff813982db>] thread_return+0x2a7/0x34c
[<ffffffff813992fd>] do_nanosleep+0x8d/0xc0
[<ffffffff81068628>] hrtimer_nanosleep+0xa8/0x140
[<ffffffff81068730>] sys_nanosleep+0x70/0x80
[<ffffffff81002f7b>] system_call_fastpath+0x16/0x1b
[<00007f77d8469da0>] 0x7f77d8469da0
---[ end trace 63f382152a7c7034 ]---

As far as we know, the issue happens only under the following conditions:

CPU vendor is Intel
/proc/cpuinfo contains both of the following CPU flags:
```
constant_tsc
```
```
nonstop_tsc
```
dmesg and/or /var/log/boot.msg does not contain the string
```
Marking TSC unstable
```
kernel flavor is not xen

The system is affected only if all four conditions are met.
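The four conditions can be checked mechanically. The following is a minimal sketch, not an official SUSE tool. The function name `is_affected` and the sample data are hypothetical. It assumes that Intel CPUs report `GenuineIntel` as `vendor_id` in /proc/cpuinfo and that the xen kernel flavor is recognizable by a `-xen` suffix in the output of `uname -r`.

```python
def is_affected(cpuinfo: str, boot_log: str, kernel_release: str) -> bool:
    """Return True only if all four conditions listed above hold.

    cpuinfo        -- contents of /proc/cpuinfo
    boot_log       -- output of dmesg and/or contents of /var/log/boot.msg
    kernel_release -- output of `uname -r`
    """
    vendor_is_intel = False
    flags = set()
    for line in cpuinfo.splitlines():
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "vendor_id" and value.strip() == "GenuineIntel":
            vendor_is_intel = True
        elif key == "flags":
            flags.update(value.split())

    has_tsc_flags = {"constant_tsc", "nonstop_tsc"} <= flags
    tsc_not_marked_unstable = "Marking TSC unstable" not in boot_log
    # Assumption: the xen flavor shows up as a "-xen" suffix in `uname -r`.
    not_xen = not kernel_release.endswith("-xen")
    return vendor_is_intel and has_tsc_flags and tsc_not_marked_unstable and not_xen


# Hypothetical sample data, for illustration only.
sample_cpuinfo = (
    "vendor_id\t: GenuineIntel\n"
    "flags\t\t: fpu tsc constant_tsc nonstop_tsc\n"
)
print(is_affected(sample_cpuinfo, "", "2.6.32.29-0.3-default"))
```

On a real system, the three arguments would come from /proc/cpuinfo, from dmesg or /var/log/boot.msg, and from `uname -r`.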

Be aware that this also affects any virtualisation solution that forwards the CPU flags to the guest (Xen, KVM, VMware).

The freeze or crash can happen at any time after 208 days of uptime. It cannot happen before the system reaches 208 days of uptime.

Resolution

For critical production systems that cannot update their kernel version, we currently recommend a cold reboot before the system reaches 208 days uptime.
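To plan the reboot, it helps to know how many days remain. The sketch below assumes the standard /proc/uptime format, where the first field is the uptime in seconds. The helper names and the sample value are hypothetical.

```python
THRESHOLD_DAYS = 208      # threshold stated in this document
SECONDS_PER_DAY = 86400


def parse_proc_uptime(text: str) -> float:
    """Return the uptime in seconds from the contents of /proc/uptime."""
    return float(text.split()[0])


def days_until_threshold(uptime_seconds: float) -> float:
    """Days left before the threshold; negative once it has been passed."""
    return THRESHOLD_DAYS - uptime_seconds / SECONDS_PER_DAY


# Hypothetical sample: a system that has been up for exactly 200 days.
sample = "17280000.00 34000000.00"
remaining = days_until_threshold(parse_proc_uptime(sample))
print(f"{remaining:.1f} days left before the 208-day threshold")
```

A monitoring check could read /proc/uptime, call `days_until_threshold`, and raise an alert when the result drops below a site-specific margin.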

A fix for this issue is included in kernel version 2.6.32.59-0.7.1 and later. We always recommend installing the latest version of the kernel.

The entry in the changelog for this issue is (rpm -q kernel-default --changelog):

x86: Avoid unnecessary overflow in sched_clock (bnc#725709).
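Both checks, the version and the changelog entry, can be scripted. The following is a rough sketch under stated assumptions: it compares only the numeric components of the version string, which is simpler than rpm's own comparison algorithm, and the function names and sample strings are hypothetical.

```python
import re

FIXED_VERSION = "2.6.32.59-0.7.1"   # first fixed version, from this document
CHANGELOG_MARKER = "bnc#725709"     # bug reference from the changelog entry


def version_key(version: str) -> tuple:
    """Split a version such as '2.6.32.59-0.7.1' into a tuple of integers."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def version_has_fix(installed: str) -> bool:
    """True if the installed version is at or after the first fixed version."""
    return version_key(installed) >= version_key(FIXED_VERSION)


def changelog_has_fix(changelog: str) -> bool:
    """True if the package changelog mentions the bug reference."""
    return CHANGELOG_MARKER in changelog


# Hypothetical sample inputs, for illustration only.
print(version_has_fix("2.6.32.29-0.3.1"))
print(changelog_has_fix("- x86: Avoid unnecessary overflow in sched_clock (bnc#725709)."))
```

The installed version string can be obtained with `rpm -q --qf '%{VERSION}-%{RELEASE}\n' kernel-default`, and the changelog text with the `rpm -q kernel-default --changelog` command shown above.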

Additional Information

The issue is caused by an overflow when converting Time Stamp Counter (TSC) ticks to nanoseconds. As a result, sched_clock() can return arbitrary values to every function that uses it. The overflow happens after 208 days. Whether the system freezes or crashes depends on which function works with the arbitrary sched_clock() value.
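The 208-day figure can be reproduced with a small model of the conversion. This is an illustrative sketch, not the kernel source. It assumes the conversion multiplies the tick count by a fixed-point factor in 64-bit arithmetic and then shifts right by 10 bits; that assumption is chosen because it yields an overflow at 2^54 ns, which is about 208.5 days. The split variant only illustrates one way to avoid the overflow, in the spirit of the changelog entry; all names are hypothetical.

```python
MASK64 = (1 << 64) - 1
SCALE_SHIFT = 10            # assumption: 10-bit fixed-point scale factor
NSEC_PER_MSEC = 1_000_000


def cyc2ns_scale(cpu_khz: int) -> int:
    """Fixed-point factor that converts TSC ticks to nanoseconds."""
    return (NSEC_PER_MSEC << SCALE_SHIFT) // cpu_khz


def cycles_to_ns_overflowing(cycles: int, scale: int) -> int:
    """Model of the conversion with the product truncated to 64 bits."""
    return ((cycles * scale) & MASK64) >> SCALE_SHIFT


def cycles_to_ns_split(cycles: int, scale: int) -> int:
    """Split the tick count so that no intermediate product needs more than 64 bits."""
    quot = cycles >> SCALE_SHIFT
    rem = cycles & ((1 << SCALE_SHIFT) - 1)
    return quot * scale + ((rem * scale) >> SCALE_SHIFT)


# The product overflows once the nanosecond value reaches 2**(64 - SCALE_SHIFT).
overflow_days = (1 << (64 - SCALE_SHIFT)) / 1e9 / 86400
print(f"overflow after about {overflow_days:.1f} days")

# Hypothetical 2 GHz CPU with an uptime of 209 days.
scale = cyc2ns_scale(2_000_000)
cycles = 209 * 86400 * 2_000_000_000
print(cycles_to_ns_overflowing(cycles, scale) / 1e9 / 86400)  # wrapped value, in days
print(cycles_to_ns_split(cycles, scale) / 1e9 / 86400)        # expected value, in days
```

In this model the wrapped result is far smaller than the true value, so code that relies on a monotonically increasing clock suddenly sees time jump backwards.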

Since the issue happens due to an overflow of a CPU register, it is strongly recommended to perform a cold reboot. A warm reboot might not clear CPU/hardware registers.

A similar issue exists on SUSE Linux Enterprise Server on System z, except that it occurs after 417 days. Please refer to TID 000018212, "System freezes after more than 417 days uptime", for details and resolution.

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.