Cat'ting through "/admin/Novell/Cluster/EventLog.xml" crashes the cluster node.

  • 7015908
  • 19-Nov-2014
  • 19-Nov-2014

Environment

Novell Open Enterprise Server 11 (OES 11) Linux Support Pack 2
Novell Cluster Services

Situation

A large 12-Node OES11 SP2 NCS cluster, with a SBD partition off 1 GB in size, containing a large amount of cluster event log entries, customer was searching for a specific series of events, using  the "cat /admin/Novell/Cluster/EventLog.xml" command.

Whilst quickly scrolling through a number of pages in the cluster event log, in order to get to the correct date and time for a particular event, the node crashed, with the following error :
[1811450.978550] CLUSTER-<FATAL>-<6121>: SbdReadRecord: I/O failed, mmStatus = FFFFFFFFFFFFFFFC.
[1811464.869551] ncs_check_fire: woke up on cpu 19/19 (current=4748673728, last=4748670065), gipc_tsc=4748670065, tsc_diff=3663(14652/14000 msecs).
[1811464.869559] Kernel panic - not syncing: ncs_timer is halting the machine.
[1811464.869561]
[1811464.869574] Pid: 0, comm: kworker/0:1 Tainted: GF          N  3.0.101-0.29-default #1
[1811464.869583] Call Trace:
[1811464.869606]  [<ffffffff81004935>] dump_trace+0x75/0x310
[1811464.869624]  [<ffffffff8145e083>] dump_stack+0x69/0x6f
[1811464.869635]  [<ffffffff8145e11c>] panic+0x93/0x201
[1811464.869653]  [<ffffffffa04c30cb>] ncs_check_fire+0xcb/0x150 [ncs_timer]
[1811464.869681]  [<ffffffff8106f4db>] call_timer_fn+0x6b/0x120
[1811464.869694]  [<ffffffff810708f3>] run_timer_softirq+0x173/0x240
[1811464.869710]  [<ffffffff8106770f>] __do_softirq+0x11f/0x260
[1811464.869724]  [<ffffffff81469fdc>] call_softirq+0x1c/0x30
[1811464.869740]  [<ffffffff81004435>] do_softirq+0x65/0xa0
[1811464.869750]  [<ffffffff810674d5>] irq_exit+0xc5/0xe0
[1811464.869761]  [<ffffffff81026588>] smp_apic_timer_interrupt+0x68/0xa0
[1811464.869773]  [<ffffffff81469773>] apic_timer_interrupt+0x13/0x20
[1811464.869789]  [<ffffffff8137a732>] poll_idle+0x32/0x70
[1811464.869801]  [<ffffffff8137a9cb>] cpuidle_idle_call+0x11b/0x280
[1811464.869812]  [<ffffffff81002126>] cpu_idle+0x66/0xb0
The ncs_check_fire message shown here would indicate a CPU clock jump triggering the node to crash, but the actual difference was showing a negligible difference, which (converted to seconds) was set to 0 seconds even, so this is determined to not be the reason for the crash.

Resolution

A new sbdutil switch has been introduced going forward to display event log details, so that we read the event log details directly from the file,

This   'sbdutil -r'  command will print cluster event log information, by default from oldest to newest.

Cause

The problem is a side effect of allowing NCS to create larger SBD partition sizes than it was initially designed for.

Accessing the SBD log through adminfs is a 'costly' operation as each entry moves through the kernel different times.
In the case where there are many log entries, continuously reading through the log may hog the CPU, so that the heartbeat process could be starved and the NCS timer would cast the node out of the cluster, which is what has happened here.

The  SbdReadRecord I/O error appears to have occurred only _after_ the node was already cast out, but was not the root cause for the initial crash.