We all run into services that die randomly from time to time. The proper fix can take a while: waiting for a patch, rebuilding the server or holding out for a service window. Being able to quickly put a watchdog on that service makes our life as admins so much better. The following solution is simple, quick and works in most cases. I have it in production use right now with very good results.
The solution I use isn't really my own invention, but I really like its simplicity. It's basically just a shell script called from cron. The script watches the service and restarts it in case of a crash. It saves the users on our network from loads of grief.
This is what a sample script for LUM on SLED10 looks like:
#!/bin/bash
MYPROC=namcd    # The name of the process
INITS=namcd     # The name of the /etc/init.d/ file
# Count the running instances of $MYPROC. If it is not running, this gives 0.
COUNT=$(UNIX95=1 ps -C "$MYPROC" -o pid= -o args= | wc -l)
if [ "$COUNT" -lt 1 ]    # Checks whether the service seems to be running or not.
then
    /etc/init.d/"$INITS" start    # The command to start the service
fi
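To call the script from cron, an entry along the following lines could be used. This is only a sketch: the file name, the script path /usr/local/sbin/watchdog-namcd.sh and the five-minute interval are assumptions chosen for illustration, not values from the original setup. The format shown is the system crontab format (/etc/crontab or a file in /etc/cron.d), which includes a user field.

```
# /etc/cron.d/watchdog-namcd (hypothetical file name)
# min  hour  day  month  weekday  user  command
*/5    *     *    *      *        root  /usr/local/sbin/watchdog-namcd.sh
```

The script has to be executable (chmod +x) for cron to run it this way.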
If we want to check for an open port, we get a script that looks like this:
#!/bin/bash
# The port. The leading : makes it easy to snag only ports and not other numbers in the output.
PORT=:445
INITS=samba     # The name of the service in /etc/init.d/
COUNT=$(netstat -lpn | grep "$PORT" | wc -l)
if [ "$COUNT" -lt 1 ]
then
    /etc/init.d/"$INITS" start
fi
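One caveat with a plain grep for :445 is that it also matches longer port numbers such as :4450. If that matters on your server, a stricter check could look like the sketch below. The function name port_is_listening is made up for this example, and it assumes the usual Linux netstat layout where the local address is the fourth column of the tcp/udp lines.

```shell
# port_is_listening PORT
# Reads `netstat -ln` style output on stdin and succeeds only if a tcp/udp
# line has a local address that ends in exactly :PORT.
port_is_listening() {
    awk -v port="$1" '
        $1 ~ /^(tcp|udp)/ {
            # Split the local address (column 4) on ":" and keep the last piece.
            n = split($4, parts, ":")
            if (parts[n] == port) found = 1
        }
        END { exit !found }
    '
}

# Hypothetical use in the watchdog (not executed here):
#   netstat -ln | port_is_listening 445 || /etc/init.d/samba start
```

Note that the port is passed without the leading colon here, since the function splits the address itself.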
We can also change the actions taken when we find out the service isn't running. For example, with GroupWise we probably want to add a command after "then" that removes the leftover pidfile.pid, the pid file of the service that has crashed. Otherwise the agent won't restart.
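If you would rather not delete the pid file blindly, a slightly more careful cleanup could be sketched as follows. This is not the command from the original tip. The function name remove_stale_pidfile is invented for this example, and it assumes that the pid file contains just the numeric PID and that the watchdog runs as root (as it would from a system cron job), so that kill -0 can probe any process.

```shell
# remove_stale_pidfile FILE
# Deletes FILE only if no live process owns the PID stored in it.
# Returns 0 if the file is gone afterwards, 1 if the process is still alive.
remove_stale_pidfile() {
    stale_file=$1
    [ -f "$stale_file" ] || return 0            # nothing to clean up
    stale_pid=$(cat "$stale_file" 2>/dev/null)
    # kill -0 sends no signal; it only tests whether the PID exists.
    if [ -n "$stale_pid" ] && kill -0 "$stale_pid" 2>/dev/null; then
        return 1                                # still running, leave the file alone
    fi
    rm -f "$stale_file"
}

# Hypothetical use after "then", before the start command (not executed here):
#   remove_stale_pidfile /path/to/pidfile.pid
```

The check guards against removing the pid file of an agent that is actually still running but was missed by the process or port test.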
These scripts should work on any SUSE version, since they only rely on ps, netstat and the service's /etc/init.d/ script.
Disclaimer: As with everything else at Cool Solutions, this content is definitely not supported by Novell (so don't even think of calling Support if you try something and it blows up).
It was contributed by a community member and is published "as is." It seems to have worked for at least one person, and might work for you. But please be sure to test, test, test before you do anything drastic with it.