Consequences of converting a service from automatic to manual start.

  • 7010787
  • 17-Sep-2012
  • 18-Sep-2012

Environment

Novell Open Enterprise Server 2 (OES 2) Linux
Novell Open Enterprise Server 11 (OES 11) Linux
Novell Cluster Services
SUSE Linux Enterprise Server 10
SUSE Linux Enterprise Server 11
SUSE Linux Enterprise Desktop 10
SUSE Linux Enterprise Desktop 11

Situation

A service that is currently configured to start automatically when the system boots needs to be reconfigured to start manually.

Changing a service from starting automatically to starting manually implies that the service also changes from being stopped automatically to being stopped manually.

Resolution

There are several ways to reconfigure a service to start manually:
  1. insserv -r [service name]
  2. chkconfig [service name] off
  3. yast runlevel, then search for [service name] and disable it.
    (Replace [service name] with the actual name of the service.)

As a consequence, however, it is from now on strongly recommended to stop the service manually prior to rebooting or shutting down the server.

For instance, disabling Novell Cluster Services (novell-ncs) on a Novell Open Enterprise Server (OES) may cause the server to suffer from "Split Brain" crashes if the service is not stopped manually prior to initiating a system reboot or shutdown.

Cause

Disabling any service on SUSE Linux Enterprise Server (or Desktop) causes the /etc/init.d/rc*.d symbolic links for that service to be removed. Once these are gone, the system is unaware of the priority or sequence in which the service needs to be stopped. As a result, services that this service relies upon may be stopped while they are still needed.
From this moment on, if the service is still active when the system is being brought down, it is killed in the final phase of the shutdown.

For instance, for novell-ncs:
When this service is configured to start automatically, these symbolic links are in place:
  • /etc/init.d/rc3.d/K01novell-ncs
  • /etc/init.d/rc3.d/S14novell-ncs
  • /etc/init.d/rc5.d/K01novell-ncs
  • /etc/init.d/rc5.d/S14novell-ncs

These links cause Novell Cluster Services to start as one of the last services in runlevels 3 and 5, and to stop as one of the first services when the system is shutting down.
When the novell-ncs service is disabled from starting automatically, these symlinks are removed. From that moment on, services like the network and multipathd are potentially stopped before novell-ncs is halted.
As Novell Cluster Services relies on these services for its split brain mechanisms, this can cause "GIPC link is down" messages and can cause the server to suffer kernel cores during shutdown, caused by a poison pill or by a suicide initiated by Novell Cluster Services.
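
Whether the symbolic links for a service are currently in place can be checked with a small helper. The following is a minimal sketch, not part of the product: the function name list_rc_links and its optional second argument (an alternative init directory, defaulting to /etc/init.d) are assumptions made for illustration.

```shell
# List the runlevel start (S) and kill (K) symlinks that exist for one service.
# Usage: list_rc_links <service> [init_dir]   (init_dir defaults to /etc/init.d)
list_rc_links() {
    lrl_service=$1
    lrl_init_dir=${2:-/etc/init.d}
    for lrl_link in "$lrl_init_dir"/rc*.d/[SK][0-9][0-9]"$lrl_service"; do
        # An unmatched glob stays literal, so confirm the link really exists.
        if [ -L "$lrl_link" ]; then
            echo "$lrl_link"
        fi
    done
    return 0
}
```

Running "list_rc_links novell-ncs" prints one line per existing link. Empty output means that no start or kill links exist for the service, so it is no longer part of the ordered shutdown sequence.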

Therefore it is recommended to manually stop all services that were not started automatically before initiating a system shutdown or reboot.
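
The recommendation can be sketched as a pre-shutdown helper. This is only an illustration under stated assumptions: the function names needs_manual_stop and stop_manual_services are hypothetical, the list of services to check must be supplied by the administrator, and each init script is assumed to implement the usual "status" action (exit code 0 while the service is running) and "stop" action.

```shell
# Return 0 if the service has no runlevel links but its init script reports it running.
needs_manual_stop() {
    nms_service=$1
    nms_init_dir=${2:-/etc/init.d}
    # Start/kill links still exist: the normal shutdown sequence handles this service.
    if ls "$nms_init_dir"/rc*.d/[SK][0-9][0-9]"$nms_service" >/dev/null 2>&1; then
        return 1
    fi
    # Assumption: "status" exits 0 only while the service is running.
    "$nms_init_dir/$nms_service" status >/dev/null 2>&1
}

# Stop every listed service that needs a manual stop.
# Usage: stop_manual_services <init_dir> <service>...
stop_manual_services() {
    sms_init_dir=$1
    shift
    for sms_service in "$@"; do
        if needs_manual_stop "$sms_service" "$sms_init_dir"; then
            echo "Stopping $sms_service"
            "$sms_init_dir/$sms_service" stop || return 1
        fi
    done
}
```

A possible use before a reboot would be "stop_manual_services /etc/init.d novell-ncs && reboot", so that the reboot is only initiated once the listed services have been stopped successfully.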

Additional Information

Because the services that Novell Cluster Services relies upon are stopped underneath novell-ncs, this behavior is considered normal, particularly on the secondary cluster nodes. The Master node generally does not show this behavior, due to the design of Novell Cluster Services.

Novell Cluster Services uses the time from the system's time source to verify whether there are abnormalities or interruptions on any node, in order to ensure cluster stability and reachability.
This is done in two ways: on the network side and on the shared storage side.
 
In a Novell Cluster Services environment with default settings, the Master node broadcasts a heartbeat on the network every second, containing the panning ID of the cluster. All cluster nodes reply to this broadcast. If a node does not respond within the given tolerance (by default 8 heartbeats, hence 8 seconds), it is deemed offline and receives a poison pill, causing an immediate reboot (as if the power cord had been pulled and reinserted). Alternatively, the node itself notices that it has lost its connection to the LAN and commits suicide, with the same end result.
If the Master node itself does not broadcast within the defined threshold (by default 8) and all secondary nodes miss the configured number of broadcasts (by default 8) from the current Master node, another cluster node becomes the new Master node, increases the epoch of the cluster and sends the previous Master node a poison pill.
 
The shared storage also provides a mechanism to ensure cluster stability and connectivity: the Split Brain Detection (SBD) partition.
Each node places a time stamp in its socket in the SBD partition. The Master node checks all active sockets of the SBD partition and verifies whether each time stamp is within the fault tolerance of the cluster; by default this is 8 seconds.
If the time stamp of a node is off by more than the tolerance allows, that node receives a poison pill.
If a secondary node notices that it cannot update the SBD partition within the given fault tolerance (by default 8), it commits suicide and reboots.
If the Master node does not update its slot on the SBD partition and does not send any broadcasts over the LAN within the given fault tolerance (by default 8), the Master node is deemed offline and one of the secondary nodes assumes the role of the new Master node. The new Master node then increases the epoch of the cluster and starts broadcasting the new panning ID.
If the old Master node re-attaches to the SBD or LAN with the old cluster epoch and panning ID, it receives a poison pill from the new, current Master node.
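
The tolerance comparison described above can be pictured with a toy check. This is purely an illustration of the arithmetic and not how Novell Cluster Services implements it: the function name stamp_expired is hypothetical, both times are assumed to be plain seconds, and only the default tolerance of 8 is taken from the text.

```shell
# Return 0 (true) when a time stamp is older than the tolerance allows.
# Usage: stamp_expired <last_stamp> <now> [tolerance]   (tolerance defaults to 8)
stamp_expired() {
    se_last=$1            # when the node last wrote its time stamp
    se_now=$2             # the time at which the check is made
    se_tolerance=${3:-8}
    # "Off by more than the tolerance" is a strict comparison.
    [ $(( se_now - se_last )) -gt "$se_tolerance" ]
}
```

With the default tolerance, a stamp that is exactly 8 seconds old is still accepted, while a stamp that is 9 seconds old counts as expired. In this toy model, that is the point at which the node would receive a poison pill.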

More details on split brains can be found in "The Gory details of Heartbeats, Split Brains and Poison Pills".

Running "chkconfig novell-ncs off" does not automatically shut down novell-ncs, just as running "chkconfig novell-ncs on" does not automatically start novell-ncs.
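
If the service should also be stopped at the moment it is disabled, the stop has to be issued as a separate, explicit step. A minimal sketch, assuming the init script lives in /etc/init.d and supports the "stop" action; the function name disable_and_stop and the overridable CHKCONFIG and INIT_DIR variables are hypothetical and exist only to keep the example self-contained.

```shell
# Disable automatic start, then stop the running service as a separate step.
# Usage: disable_and_stop <service>
disable_and_stop() {
    das_service=$1
    # Step 1: remove the service from the runlevels (changes boot behavior only).
    "${CHKCONFIG:-chkconfig}" "$das_service" off || return 1
    # Step 2: the service is still running at this point; stop it explicitly.
    "${INIT_DIR:-/etc/init.d}/$das_service" stop
}
```

On a cluster node, stopping novell-ncs takes that node out of the cluster, so whether the second step is wanted depends on the situation.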