Preventive Maintenance for System Administrators
Novell Cool Solutions: Feature
By Hanny Kraa
Posted: 7 Aug 2007
At customer sites, I often discover that there are no standardized procedures to prevent incidents or problems from happening in the server environment.
Generally, their server maintenance procedure is a reactive one: for them, incidents and problems are the only trigger to act. The actions they perform on the environment are only to repair and make things work again. Then they lean back, waiting for the next problem to occur.
Changing this to a proactive form of maintenance has two big advantages:
- First, your environment will be healthier. Because you've documented the results of earlier checks and kept your system clean, it's easier to troubleshoot when a problem does occur.
- Second, you'll know your system a lot better, so you'll see the signs of upcoming problems and be able to prevent many of them from happening.
Your users will be a lot happier (and quieter). Gone are the days when you plan to go see a nice movie but instead end up working through the night trying to restore a 150 GB volume from a backup tape with questionable quality, keeping the people wearing ties who ask nasty questions off your back.
I've made a standardized walkthrough to maintain your systems proactively. It's a very simple way to keep the environment healthy and to foresee any oncoming problems. The schedule is divided into daily, weekly and monthly tasks. The intervals depend on the complexity of your system, so you'll have to tweak this using your own insights.
To make predictions about future system behaviour, you'll have to gather data, so it's very important to record the results of your actions in some sort of log. After a couple of months you'll see patterns: errors that keep coming back, a spurt of growth on a certain volume, a sudden drop in backup speed. This data will help you act before the problem surfaces.
A lot of these actions can be automated by using scripts, automatic e-mail reports, or a monitoring system.
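As one way of noting results so that patterns become visible later, here is a minimal Python sketch that appends each check result to a CSV file and reads the history back. The column layout and the names `record_result` and `history` are assumptions made for illustration, not part of any Novell tool.

```python
import csv
from datetime import date
from pathlib import Path

# Hypothetical column layout; adapt it to the checks you actually run.
FIELDS = ["date", "server", "check", "value", "note"]


def record_result(log_path, server, check, value, note="", day=None):
    """Append one check result to a CSV log, writing a header on first use."""
    path = Path(log_path)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "date": (day or date.today()).isoformat(),
            "server": server,
            "check": check,
            "value": value,
            "note": note,
        })


def history(log_path, server, check):
    """Return (date, value) pairs for one check on one server, oldest first."""
    with Path(log_path).open(newline="") as handle:
        rows = [r for r in csv.DictReader(handle)
                if r["server"] == server and r["check"] == check]
    # ISO dates sort correctly as plain strings; values come back as strings.
    return sorted((r["date"], r["value"]) for r in rows)
```

A call such as `record_result("checks.csv", "fs1", "backup_gb", 125)` after each monthly backup check would build up the kind of history the walkthrough relies on. A spreadsheet serves the same purpose if you prefer to keep the log by hand.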
Check the following things each day:
- Server health status of all the servers
- Backup results (should be normal)
- E-mail queue and throughput (should be normal)
- Virus scan results
- Time synchronization on the servers
Check the following things each week:
- Novell web site and other vendors' web sites for critical patches
- Security issues - for example, use the weekly reports from Secunia
- Log files of the scheduled GroupWise maintenance - take action if needed
Do the following things each month:
- Perform an NDS health check using TID 10060600. Note the results.
- Perform a GroupWise health check. I couldn't find a TID like the one for the NDS health check, but this is what I do: 1) Validate the database on the domain objects; 2) Recover the database on the domain objects; 3) Validate the database on the post office objects; and 4) Recover the database on the post office objects. Note the results.
- Clean up old user accounts (don't forget to delete their home directories as well).
- Note the size and duration of your full backup jobs. Compare them with earlier results and see whether the growth is as usual.
- Restore some files from backup and check that they're accessible.
- Note the size of the volumes and the percentage of free space. Take action if the free space is less than 10% (for example, free up space or enlarge the volume). Compare with earlier results and see whether the growth is as usual.
- Check whether your spam filter needs tweaking. Look at the number of false positives and, if possible, the number of false negatives (check with the users).
- Check for updates and patches and decide if they need to be installed. (This, of course, depends on the policy at your site: it will be something between "never change what's working" and "patches aren't released for nothing, so install every new one!")
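The monthly volume check in the list above (free space under 10%, growth compared with earlier results) lends itself to a small script. The Python sketch below assumes the volume is mounted on a machine where Python can reach it. The function names and the `tolerance` factor are hypothetical choices for illustration, and only the 10% threshold comes from the checklist.

```python
import shutil

FREE_SPACE_THRESHOLD = 10.0  # percent, the limit suggested in the checklist


def percent_free(total_bytes, free_bytes):
    """Return free space as a percentage of the total volume size."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * free_bytes / total_bytes


def volume_report(path):
    """Measure one mounted volume and flag it when free space is too low."""
    usage = shutil.disk_usage(path)
    free_pct = percent_free(usage.total, usage.free)
    return {
        "path": path,
        "used_gb": round(usage.used / 1024**3, 2),
        "free_pct": round(free_pct, 1),
        "needs_action": free_pct < FREE_SPACE_THRESHOLD,
    }


def growth_is_unusual(previous_sizes, current_size, tolerance=2.0):
    """Compare the latest growth step with the average of earlier steps.

    previous_sizes: earlier measurements, oldest first.
    tolerance: hypothetical factor; growth above tolerance * average is flagged.
    """
    if len(previous_sizes) < 2:
        return False  # not enough history to judge
    steps = [b - a for a, b in zip(previous_sizes, previous_sizes[1:])]
    average_step = sum(steps) / len(steps)
    latest_step = current_size - previous_sizes[-1]
    return latest_step > tolerance * max(average_step, 0)


if __name__ == "__main__":
    print(volume_report("/"))
```

The same `growth_is_unusual` comparison can be applied to the size of the full backup jobs. What counts as "unusual" depends on your site, so treat the factor of 2 as a starting point to tune rather than a recommendation.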
This is a basic procedure to keep your environment healthy and tidy, to get to know it well, and to predict possible problems before they occur.
I'd also like to make the following recommendations:
- Review your GroupWise scheduled tasks and see if you can customize them for better maintenance on your specific GroupWise system.
- You can label one week a year as "cleanup week." Tidy up the ZENworks objects and other eDirectory objects, delete or archive old large log files, check for old home directories of users that don't exist any more or old user accounts that haven't been used for months without any reason, and so on.
- Use a monitoring system, preferably one that can monitor almost everything in your environment. This will save lots of work, and it's handy because you don't need to maintain five different monitoring systems. If cost is an issue, there are some very nice free monitoring systems available. An example is Nagios, which can monitor almost your entire system.
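For the "cleanup week" suggested above, a script can produce a list of candidate log files for you to review. The following Python sketch only reports files and deletes nothing. The function name, the default age and size limits, and the `*.log` pattern are all assumptions to adjust for your own environment.

```python
import time
from pathlib import Path


def old_large_files(root, min_age_days=180, min_size_mb=50, pattern="*.log"):
    """List files under root matching pattern that are both old and large.

    The defaults are hypothetical. Nothing is deleted; the list is only a
    starting point for a manual decision to delete or archive.
    """
    cutoff = time.time() - min_age_days * 86400
    min_bytes = min_size_mb * 1024 * 1024
    found = []
    for path in Path(root).rglob(pattern):
        if not path.is_file():
            continue
        info = path.stat()
        # Keep files last modified before the cutoff and at least min_bytes big.
        if info.st_mtime < cutoff and info.st_size >= min_bytes:
            found.append((path, info.st_size))
    # Largest files first, since they free the most space.
    return sorted(found, key=lambda item: item[1], reverse=True)
```

Printing the result of `old_large_files("/var/log")` (the path is just an example) gives a review list. Keeping the delete or archive decision manual fits the spirit of a yearly cleanup, where you want to look at what you are removing.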