A.0 Sentinel Troubleshooting Checklist

Use this checklist to help diagnose a problem. Filling it in can resolve common issues outright and reduce the time needed to resolve more complex ones.

Table A-1 Checklist

Checklist Item | Information | Example
Novell Version: | | V6.0
Novell Platform and OS Version: | | SuSE Linux Enterprise Server 10
Database Platform and OS Version: | | Oracle 10.2.0.3 with critical patch #5881721
Sentinel Server Hardware Configuration | |
  • Processor | | 4 CPU @ 3 GHz
  • Memory | | 5 GB RAM
  • Other | |
Database Server Hardware Configuration | |
  • Processor | | 4 CPU @ 3.0 GHz
  • Memory | | 8 GB RAM
  • Other (if separate box) | |
Database Storage Configuration (NAS, SAN, local, and so on) | | Local with offsite backup
Reporting Server OS and Configuration (Crystal Server) | | Crystal XI; SuSE Linux Enterprise Server 10 with MySQL

NOTE: Depending on how your Sentinel system is configured (for example, as a distributed installation), you might need to expand the table above. For instance, additional information might be needed for DAS, Advisor, Sentinel Control Center, Collector Builder, and the communication layer.
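Some of the hardware and OS values for the table can be gathered from the command line. The following is a minimal sketch, assuming a Linux host; the function names are illustrative and the /proc/meminfo path is Linux-specific, so other platforms need their own commands.

```shell
#!/bin/sh
# Sketch: print values for the "Information" column of the checklist table.
# Assumption: Linux host. Function names are hypothetical helpers.

# Print "label: value", substituting "unknown" when the value is empty.
report() {
    printf '%s: %s\n' "$1" "${2:-unknown}"
}

# Read /proc/meminfo-style text on stdin and print MemTotal in GB
# (kB divided by 1024 * 1024).
mem_total_gb() {
    awk '/^MemTotal:/ { printf "%.1f GB\n", $2 / 1024 / 1024 }'
}

checklist_info() {
    report "OS" "$(uname -sr 2>/dev/null)"
    report "Processors" "$(getconf _NPROCESSORS_ONLN 2>/dev/null)"
    if [ -r /proc/meminfo ]; then
        report "Memory" "$(mem_total_gb < /proc/meminfo)"
    else
        report "Memory" ""
    fi
}

checklist_info
```

Run it on each host involved (Sentinel server, database server, reporting server) and copy the output into the corresponding rows.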

  1. Check the Novell Customer Center for your particular issue:

    • Is this a known issue with a workaround?

    • Is this issue fixed in the latest patch release or hot fix?

    • Is this issue currently scheduled to be fixed in a future release?

  2. Determine the nature of the problem.

    • Can it be reproduced? Can the steps to reproduce the problem be enumerated?

    • What user action, if any, will cause the problem?

    • Is the issue periodic in nature?

  3. Determine the severity of this problem.

    • Is the system still usable?

  4. Understand the environment and systems involved.

    • What platforms and product versions are involved?

    • Are there any non-standard or custom components involved?

    • Is it a high event rate environment?

    • What is the rate of events being collected?

    • What is the event rate of insertion into the database?

    • How many concurrent users are there?

    • Is Crystal reporting used? When are reports run?

    • Is correlation used? How many rules are deployed?

    Collect configuration files, log files, and system information from the appropriate subdirectories of $ESEC_HOME (UNIX) or %ESEC_HOME% (Windows). Keep this information together for possible future knowledge transfer.
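On UNIX, the collection step can be scripted so the same set of files is gathered every time. The sketch below is one possible approach, not a Sentinel utility: collect_sentinel_files is a hypothetical helper, and the subdirectory names passed to it (config and log in the usage line) are assumptions to replace with the directories that actually exist under $ESEC_HOME.

```shell
#!/bin/sh
# Sketch: archive selected subdirectories of a Sentinel install tree.
# collect_sentinel_files HOME_DIR OUT_FILE SUBDIR...
# Writes a gzip-compressed tar archive containing every named
# subdirectory that exists; names must not contain whitespace.
collect_sentinel_files() {
    home_dir=$1
    out_file=$2
    shift 2
    found=""
    for sub in "$@"; do
        # Skip subdirectories that are not present on this host.
        if [ -d "$home_dir/$sub" ]; then
            found="$found $sub"
        fi
    done
    if [ -z "$found" ]; then
        echo "nothing to collect under $home_dir" >&2
        return 1
    fi
    # $found is left unquoted on purpose so each name is its own argument.
    ( cd "$home_dir" && tar -cf - $found ) | gzip > "$out_file"
}

# Example usage (subdirectory names are assumptions):
#   collect_sentinel_files "$ESEC_HOME" /tmp/sentinel-diag.tar.gz config log
```

Piping tar into gzip avoids relying on a tar that supports compression flags, which not every UNIX tar does.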

  5. Check the health of the system.

    • Can you log into the Sentinel Control Center?

    • Are events being generated and inserted into the database?

    • Can events be seen on the Sentinel Control Center?

    • Can events be retrieved from the database using quick query?

    • Check the RAM usage, disk space, process activity, CPU usage and network connectivity of the hosts involved.

    • Verify that all expected Sentinel processes are running. In a Windows environment, use Microsoft Task Manager. On UNIX, use the command ps -ef | grep esecadm.

    • Check for core dumps in any of the subdirectories of $ESEC_HOME, and find out which process dumped core. (cd $ESEC_HOME; find . -name core -print)

    • Check that the database can be reached over the network with sqlplus. Check the state of the tablespaces.

    • Make sure the Sonic broker is running. Connectivity can be verified using the Sonic management console. Check that the various connections from Novell processes are active. Make sure that a lock file is not preventing Sonic from starting. Optionally, telnet to that server on the Sonic port (for example, telnet sentinel.company.com 10012).

    • Check whether the wrapper service is running on the server. (ps -ef | grep wrapper)

    • Are any errors visible in the Servers View of the Sentinel Control Center? Are any errors visible in the Event Source Management Live View in the Sentinel Control Center? What is the OS resource consumption on the Collector Managers?
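The process and core dump checks in this step can be wrapped in small helpers for repeated use. This is a sketch only: procs_owned_by and find_core_files are hypothetical names, and esecadm is the account name used in the checklist. Matching on the first column of ps -ef output avoids the grep command listing itself.

```shell
#!/bin/sh
# Sketch of two health checks from step 5.

# Read `ps -ef` output on stdin and print the lines owned by user $1.
procs_owned_by() {
    awk -v user="$1" '$1 == user'
}

# Print the path of every regular file named exactly "core" under $1.
find_core_files() {
    find "$1" -type f -name core -print
}

# Example usage on a UNIX host:
#   ps -ef | procs_owned_by esecadm
#   find_core_files "$ESEC_HOME"
```

If find_core_files prints a path, the file command run on that path usually names the program that produced the dump.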

  6. Is there a problem with the Database?

    • Using sqlplus, can you log into the database?

    • Does the database allow a sqlplus login using the Novell dba account into the ESEC schema?

    • Does querying one of the tables succeed?

    • Does a select statement on a database table succeed?

    • Check the JDBC drivers, their locations and class path settings.

    • If the database is Oracle, is the Partitioning option installed and in use? (Provide the output of “select * from v$version;”.)

    • Is the database being maintained by an administrator? By anyone?

    • Has the database been modified by that administrator?

    • Is SDM being used to maintain the partitions and archive/delete the partitions to make more room in the database?

    • Using SDM what is the current partition? Is it P_MAX?
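For an Oracle database, the login, version, and tablespace checks above can be combined into one read-only sqlplus run. The sketch below makes several assumptions: the helper names are hypothetical, the connect string is a placeholder you supply, and the account must be allowed to read dba_tablespaces. Omitting the password from the connect string makes sqlplus prompt for it.

```shell
#!/bin/sh
# Sketch: a read-only sqlplus probe for the Oracle checks in step 6.

# Print the SQL*Plus script used by the probe.
db_probe_sql() {
    cat <<'EOF'
WHENEVER SQLERROR EXIT FAILURE
SELECT * FROM v$version;
SELECT tablespace_name, status FROM dba_tablespaces;
EXIT
EOF
}

# run_db_probe CONNECT_STRING
# CONNECT_STRING is a placeholder such as user@tns_alias.
run_db_probe() {
    if ! command -v sqlplus >/dev/null 2>&1; then
        echo "sqlplus not found on PATH" >&2
        return 127
    fi
    tmp_dir=$(mktemp -d) || return 1
    db_probe_sql > "$tmp_dir/probe.sql"
    # -S suppresses banners; -L attempts the logon only once.
    sqlplus -S -L "$1" "@$tmp_dir/probe.sql"
    status=$?
    rm -rf "$tmp_dir"
    return "$status"
}

# Example usage (placeholder connect string):
#   run_db_probe dba_user@sentinel_db
```

A non-zero exit status means the login or one of the queries failed, which answers the first bullets of this step in a single run.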

  7. Inspect whether the product environment settings are correct.

    • Verify that user login shell scripts, environment variables, configurations, and Java home settings are sane.

    • Are the environment variables set to run the correct JVM?

    • Verify the proper permissions on the folders for the installed product.

    • Check whether any cron jobs are set up that interfere with the product's functionality.

    • If the product is installed on NFS mounts, check the sanity of the NFS mounts and the NFS/NIS services.
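Several of these environment checks can be run from the shell of the account that starts the product. The following is a minimal sketch; check_dir_var is a hypothetical helper, and JAVA_HOME is assumed to be the variable that selects the JVM on your installation.

```shell
#!/bin/sh
# Sketch: report whether an environment variable names an existing directory.

# check_dir_var NAME prints one status line and returns 0 only when the
# variable NAME is set and points to a directory. NAME must be a trusted
# variable name because it is expanded with eval.
check_dir_var() {
    eval "value=\${$1:-}"
    if [ -z "$value" ]; then
        echo "$1 is not set"
        return 1
    fi
    if [ ! -d "$value" ]; then
        echo "$1=$value is not a directory"
        return 1
    fi
    echo "$1=$value"
}

# Example usage:
#   check_dir_var ESEC_HOME
#   check_dir_var JAVA_HOME
#   "$JAVA_HOME/bin/java" -version   # shows which JVM would run
#   ls -ld "$ESEC_HOME"              # owner and permissions of the install folder
#   crontab -l                       # cron jobs of the current user
```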

  8. Is there a possible memory leak?

    • Obtain the statistics on how fast the memory is being consumed and by which process.

    • Gather the metrics of the events throughput per Collector.

    • Run the prstat command on Solaris. This will give the process runtime statistics.

    • In Windows, you can check the process size and handle count in Task Manager.
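To put a number on how fast memory is being consumed, sample a process's resident size at intervals and compute the growth rate. This is a sketch with hypothetical helper names; it assumes a ps that accepts the rss output keyword (Linux and Solaris do), and the PID in the usage lines is a placeholder.

```shell
#!/bin/sh
# Sketch: estimate memory growth of one process from two samples.

# Print the resident set size of process $1 in kilobytes, as reported by ps.
rss_kb() {
    ps -o rss= -p "$1" | tr -d ' '
}

# growth_kb_per_min FIRST_KB LAST_KB ELAPSED_SECONDS
# Prints the average growth in kilobytes per minute.
# ELAPSED_SECONDS must be greater than zero.
growth_kb_per_min() {
    awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.1f\n", (b - a) * 60 / s }'
}

# Example usage: sample PID 1234 (a placeholder) twice, ten minutes apart.
#   first=$(rss_kb 1234); sleep 600; last=$(rss_kb 1234)
#   growth_kb_per_min "$first" "$last" 600
```

A rate that stays positive across many samples under a steady event load is more suggestive of a leak than a single large reading.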

    If the issue is still unresolved at this point, it is ready for escalation. Possible results of escalation are:

    • Configuration file changes

    • Hot fixes or patches to your system

    • Enhancement request

    • Temporary workaround