Novell Cluster Services nodes rebooting randomly after March 2010 updates.

  • 7005916
  • 06-May-2010
  • 08-Nov-2012

Environment

Novell Cluster Services
Novell Open Enterprise Server 2 (OES 2) Linux Support Pack 1
Novell Open Enterprise Server 2 (OES 2) Linux Support Pack 2

Situation

After the January 2010 updates are applied, Novell Cluster Services nodes reboot randomly. The problem seems more prevalent after the March 2010 updates.

Symptoms:
  • Affects both SLES10SP2/OES2SP1 and SLES10SP3/OES2SP2, on both i386 and x86_64.
  • Server load does not cause the reboots.
  • The log files contain no useful information.  They show only that the server rebooted.
  • In /var/log/messages on the node holding the Master_IP_Address_Resource you will see that it issued a poison pill, but only after the affected server had already rebooted.
  • A kernel core cannot be captured to show what is causing the reboot.
  • Not loading Novell Cluster Services prevents the reboot.
  • Occurs on systems with 4 or more processors as seen in /proc/cpuinfo.
  • Seen only on Intel processors.  If you find this on any others please contact Novell Technical Support to let us know.
Happens most often on the following kernels:
  2.6.16.60-0.60.1 on SLES 10 SP3
  2.6.16.60-0.42.9 or 2.6.16.60-0.42.8 on SLES 10 SP2
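
The processor-count and vendor symptoms can be checked from the contents of /proc/cpuinfo. The Python sketch below is an illustration only: the function name matches_cpu_profile and its min_cpus parameter are made up for this example and are not part of any Novell tool. It counts the "processor" entries and checks that every "vendor_id" entry reads GenuineIntel.

```python
def matches_cpu_profile(cpuinfo_text, min_cpus=4):
    """Return True if the text lists at least min_cpus Intel processors.

    cpuinfo_text is expected to be the content of /proc/cpuinfo.
    The function name and the min_cpus default are illustrative.
    """
    cpus = 0
    vendors = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processor entries
        key = key.strip()
        if key == "processor":
            cpus += 1
        elif key == "vendor_id":
            vendors.add(value.strip())
    return cpus >= min_cpus and vendors == {"GenuineIntel"}
```

To use it on a live server, read /proc/cpuinfo into a string and pass it to the function. A True result means only that the server fits the processor profile listed above, not that it is affected.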

Known hardware that shows the problem.  If you see this on other hardware please contact Novell Technical Support.
  Dell PowerEdge R710
  Dell PowerEdge 2950
  Dell PowerEdge 1950
  HP ProLiant DL380 G5
  HP ProLiant DL380 G6
  HP BL460c G1

Resolution

This was a top issue for Novell. 

UPDATE MAY 11, 2010:
A fix is available from Novell Technical Support.  Please open a service request so we can give you the fix.  Once the fix has been confirmed from multiple customers we will publish it in the update channel.

UPDATE JUNE 1, 2010:
A few customers have seen this issue after getting the May 11 fixes from Novell.
A clock issue has been identified that causes the server to reboot in this case.
On 32-bit systems, use the "clock=hpet" boot parameter, which can be added to the grub menu.  This changes the clock to use the High Precision Event Timer (HPET).
Below is a link to HP's documentation:
http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&taskId=110&prodSeriesId=428936&prodTypeId=15351&prodSeriesId=428936&objectID=c00781086
Novell is also working on a solution for this issue.

UPDATE JUNE 9, 2010:
Multiple customers have confirmed that the fixes do resolve the issue.

UPDATE JUNE 16, 2010: FIX IS NOW AVAILABLE
The fix is in the appropriate OES channel, dated June 14, 2010.
The "clock=hpet" parameter may still be necessary on some systems.
To make that change permanent, edit /boot/grub/menu.lst on the affected server so that it contains clock=hpet as an additional boot parameter.
This is also valid for affected 64-bit servers.
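
As an illustration of the menu.lst change, the following Python sketch appends clock=hpet to every line of a menu.lst text that begins with the word "kernel" and does not already carry the parameter. The function name add_boot_param is made up for this example. The sketch assumes the usual GRUB legacy layout in which boot parameters follow the kernel image path on the "kernel" line; entries that load the Linux kernel in some other way (for example through a "module" line) are not handled. Back up menu.lst before changing it.

```python
def add_boot_param(menu_text, param="clock=hpet"):
    """Return menu_text with param appended to each 'kernel' line.

    Lines that already contain the parameter are left untouched,
    so applying the function twice gives the same result.
    """
    out = []
    for line in menu_text.splitlines():
        words = line.split()
        if words and words[0] == "kernel" and param not in words[1:]:
            line = line.rstrip() + " " + param
        out.append(line)
    trailing_newline = "\n" if menu_text.endswith("\n") else ""
    return "\n".join(out) + trailing_newline
```

The function works on text only. Reading /boot/grub/menu.lst and writing the result back is left to the administrator, and the new parameter takes effect at the next boot.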

Additional Information

On a Novell Open Enterprise Server 2 SP2 system running novell-cluster-services-1.8.7.660-0.61 or later, or a Novell Open Enterprise Server 2 SP1 system running novell-cluster-services-1.8.6.647-0.4.1 or later, the reboot happens because the monotonic clock interface used by the NCS drivers reports a backward jump. The NCS driver correctly identifies this as a hardware or software abnormality and initiates a reboot.
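
The idea of detecting a backward jump on a clock that should never go backwards can be sketched in a few lines of Python. This is not the NCS driver's code, which runs in the kernel; the class name MonotonicWatch and its check method are made up for this example. The sketch keeps the previous reading and reports how far the clock moved backwards, if at all.

```python
import time


class MonotonicWatch:
    """Remember the last clock reading and report backward jumps."""

    def __init__(self, clock=time.monotonic):
        # clock is any zero-argument callable returning seconds.
        self._clock = clock
        self._last = clock()

    def check(self):
        """Return the size of a backward jump in seconds, or 0.0."""
        now = self._clock()
        jump = self._last - now if now < self._last else 0.0
        self._last = now
        return jump
```

With the default clock, check() should always return 0.0 on a healthy system. Passing a different callable as clock makes it possible to feed in recorded readings and see which one would have been flagged.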
 
The monotonic and other clocks in the kernel can be driven by different hardware. Formerly this was the PIT (Programmable Interval Timer), which was rather slow.
In more recent distributions the HPET (High Precision Event Timer) and the TSC (Time Stamp Counter) are used to do this job.
The TSC is a bit faster because it is located on the CPU itself, but hardware limitations on specific CPU and hardware versions can make it unsuitable:
  - TSC becomes inaccurate when the CPU frequency changes
  - TSC stops in deeper processor sleep modes (C2/C3)
  - TSC gets out of sync between CPU cores
 
Novell Cluster Services uses the time from the system's time source to check all nodes for abnormalities or interruptions, to ensure cluster stability and reachability.
This is done in two ways: over the network and on the shared storage.
 
In a default Novell Cluster Services configuration, the Master node broadcasts a heartbeat on the network every second, containing the Panning ID of the cluster. All cluster nodes reply to this broadcast. If a node does not respond for 8 seconds, it is deemed offline and receives a Poison Pill, causing an immediate reboot (as if the power cord were pulled and reinserted).
If the Master node itself does not broadcast within the defined threshold, another node becomes the new Master node, increases the epoch of the cluster, and sends the previous Master node a poison pill.
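
The network-side rule (a reply expected every second, a node deemed offline after 8 seconds of silence) can be modelled with a small Python sketch. It is a simplified illustration and not the NCS implementation: the class HeartbeatTracker, its method names, and the choice to treat exactly 8 seconds of silence as overdue are all assumptions made for this example.

```python
TOLERANCE_SECONDS = 8  # default tolerance described in this document


class HeartbeatTracker:
    """Track when each node last replied to the Master's heartbeat."""

    def __init__(self, nodes, now, tolerance=TOLERANCE_SECONDS):
        self.tolerance = tolerance
        self.last_seen = {node: now for node in nodes}

    def record_reply(self, node, now):
        """Note that node replied at time now (in seconds)."""
        self.last_seen[node] = now

    def overdue_nodes(self, now):
        """Return the nodes that have been silent for the tolerance or longer."""
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen >= self.tolerance
        )
```

In this model every value of now comes from one clock. If that clock jumps by 8 seconds or more between two calls, nodes that replied on schedule still appear overdue, which is the false split brain described in this section.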

On the shared storage there is also a mechanism to ensure cluster stability and connectivity. This is the Split Brain Detection (SBD) partition.
Each node places a time stamp on its socket in the SBD partition. The Master node checks all active sockets of the SBD partition and verifies that each time stamp is within the fault tolerance of the cluster, which is 8 seconds by default.
If the time stamp of a node is off by more than the tolerance allows, that node receives a poison pill.
 
Novell Cluster Services relies on a steady, reliable time and therefore on a reliable time source. If the time source used on the cluster nodes is unreliable and makes large jumps (more than 8 seconds is sufficient), the mechanism that maintains cluster stability is heavily crippled and false split brains can occur.
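
How a time jump turns into a false split brain on the SBD side can be shown with a simplified Python model. It follows only the description above: time stamps per node, compared by the Master against a tolerance of 8 seconds. The function name stale_sockets and the use of plain numbers of seconds as time stamps are assumptions for this example and do not reflect the real on-disk SBD format.

```python
def stale_sockets(stamps, master_now, tolerance=8):
    """Return the nodes whose time stamp is off by more than tolerance.

    stamps maps a node name to the time stamp (in seconds) it last wrote.
    master_now is the Master node's current reading of the same clock.
    """
    return sorted(
        node
        for node, stamp in stamps.items()
        if abs(master_now - stamp) > tolerance
    )
```

With stamps of 100.0 and 99.0 and a Master reading of 101.0, no node is reported. If the Master's time source then jumps forward so that it reads 110.0, both nodes are reported even though they wrote their stamps on schedule.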
 
Please note that novell-cluster-services-1.8.7.660-0.61 or later and the clock=hpet boot parameter address only false split brains caused by jumps of the time source.
There are still plenty of scenarios that lead to genuine Split Brains and Poison Pills.
Several, if not most, of them are covered in KB 100583882: "The Gory details of Heartbeats, Split Brains and Poison Pills"
 
 
Be aware that some HP ProLiant servers, such as the HP ProLiant DL380 G6, have been reported to reboot automatically as well, even without Novell Cluster Services installed or active.
More information on this can be found at:
http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&objectID=c01955503&jumpid=reg_R1002_USEN
 
 
SUSE Linux Enterprise Server 10 SP3 without the Novell Open Enterprise Server 2 add-on, and Novell Open Enterprise Server 2 SP2 not running Novell Cluster Services, can also suffer from these time jumps, but the results are not as dramatic as a false split brain.
Be aware, though, that eDirectory and its ndsd also rely on the system time source. Large jumps in time may cause synthetic time and future time stamps in eDirectory.