Troubleshooting GroupWise High Availability in a Linux Cluster Environment

29 May 2007 - 2:02am
Submitted by: Martin Prikryl

Problem

The GroupWise High Availability (GWHA) service works well in general, but it has one "not so nice" feature that shows up in the following situation.

First, the environment. Suppose we have a four-node Open Enterprise Server cluster on the Linux kernel. There are six post offices, one domain, and one Internet Agent, with two cluster resources on GroupWise 7 SP2. GroupWise Monitor runs on a single server with the Availability Service enabled.

And now the "not so nice" feature:

When you migrate a resource from one node to another, the GWHA service may restart an agent that has already been stopped, on the same cluster node the resource is leaving. Whether this happens depends on the GroupWise Monitor polling time and on how long the agents need to stop. GroupWise Monitor detects a failed (in this case, a deliberately stopped) agent and calls the GWHA service to start it again.

The cluster does not notice this, so it goes on to dismount the volumes, unbind the IP address, and call the cluster load script on the next node. The "new" node then starts all agents.

The result is that some of the agents are running on two nodes at the same time, which is a very risky condition. In our environment, GroupWise performance was very poor in this situation, and we needed several hours to diagnose the problem.
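
To check for this condition, it can help to list the agent processes on each node and compare the results. The following is only a minimal sketch and not part of the original procedure. The function name filter_gw_agents is made up for illustration, and the agent process names (gwpoa, gwmta, gwia) are an assumption that you may need to adjust for your installation.

```shell
# Sketch only: reduce a "pid command" listing to GroupWise agent processes.
# The process names below are an assumption; adjust them as needed.
filter_gw_agents() {
    awk '$2 ~ /^(gwpoa|gwmta|gwia)$/ { print $1, $2 }'
}

# Usage on each cluster node:
#   ps -eo pid,comm | filter_gw_agents
```

If the same agent shows up on the node the resource just left and on the node it migrated to, you are probably looking at the situation described above.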

Solution

Here is the really simple solution:

1. At the beginning of the cluster unload script, add this line:

ignore_error /etc/init.d/xinetd stop

Now the GWHA service, which relies on the xinetd daemon, cannot start the agents while the unload script is running.

2. At the end of the cluster unload script, put this line:

ignore_error /etc/init.d/xinetd start

This starts the xinetd daemon again, so the GWHA service on this node keeps working for future migrations.

Example

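#!/bin/bash
. /opt/novell/ncs/lib/ncsfuncs

# stop xinetd first so the GWHA service cannot restart agents, then stop the agents
ignore_error /etc/init.d/xinetd stop
ignore_error /media/nss/GW/._CLUSTER/bin/stop-gw

# unbind the NCP server and release the secondary IP address
ignore_error ncpcon unbind --ncpservername=CNW-VIE-01_GW_SERVER --ipaddress=$IP
ignore_error del_secondary_ipaddress $IP

# deactivate the NSS pool
exit_on_error nss /pooldeact=GW

# start xinetd again so the GWHA service works for future migrations
ignore_error /etc/init.d/xinetd start

exit 0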
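
One thing to keep in mind with a script like this: as far as I can tell, exit_on_error ends the script when its command fails, so a failed pool deactivation would skip the final xinetd start and leave xinetd stopped on that node. A possible guard is sketched below. It is not part of the original solution, and the names run_with_xinetd_paused and XINETD_INIT are hypothetical, not something ncsfuncs provides.

```shell
# Sketch only. XINETD_INIT and run_with_xinetd_paused are hypothetical names,
# not part of ncsfuncs. XINETD_INIT can be overridden for a dry run.
XINETD_INIT=${XINETD_INIT:-/etc/init.d/xinetd}

run_with_xinetd_paused() {
    "$XINETD_INIT" stop

    # Run the given steps in a subshell, so that an "exit" inside them
    # (for example from exit_on_error) ends only the subshell.
    status=0
    ( "$@" ) || status=$?

    # Reached whether the steps succeeded or failed.
    "$XINETD_INIT" start
    return "$status"
}
```

In an unload script, the existing steps between the two xinetd lines would move into a function of their own, and that function name would be passed to run_with_xinetd_paused. The return value carries the status of the steps, so the script can still exit with an error when they fail.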


Disclaimer: As with everything else at Cool Solutions, this content is definitely not supported by Novell (so don't even think of calling Support if you try something and it blows up).

It was contributed by a community member and is published "as is." It seems to have worked for at least one person, and might work for you. But please be sure to test, test, test before you do anything drastic with it.




User Comments

New Syntax for gwha temp disablement in Cluster Unload Scripts

Submitted by ehanley on 29 January 2011 - 2:13am.

Stopping xinetd as described above also stops every other service that runs under xinetd (e.g. VNC), not just the gwha service you are targeting. I propose this syntax instead in the cluster unload script for GroupWise services monitored by the gwha GroupWise Monitor solution.

At the start of the unload script:

ignore_error /sbin/chkconfig -s gwha off
kill -HUP `pidof xinetd`

At the end of the unload script:

ignore_error /sbin/chkconfig -s gwha on
kill -HUP `pidof xinetd`

This way you only stop the gwha service.
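
One small caveat with the kill line: if xinetd happens not to be running, pidof prints nothing and kill is called without a PID, which is an error. A guarded variant might look like the sketch below. The helper name hup_if_running is hypothetical and not part of ncsfuncs.

```shell
# Hypothetical helper, not part of ncsfuncs: send SIGHUP to a daemon
# only if pidof finds at least one running instance.
hup_if_running() {
    pids=$(pidof "$1" 2>/dev/null) || return 0   # not running: nothing to do
    # $pids is left unquoted on purpose: pidof may print several PIDs
    kill -HUP $pids
}

# In the unload script this would replace the plain kill line, e.g.:
#   ignore_error /sbin/chkconfig -s gwha off
#   hup_if_running xinetd
```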

© 2013 Novell