How To Troubleshoot GroupWise Abends - Part II
Novell Cool Solutions: Tip
Posted: 2 Jan 2002
Versions: GroupWise 6
GroupWise Abend Troubleshooting Guide
A theoretical explanation of GroupWise Abends and Abend definitions. (TID-10021982)
(See Part I of this guide for a list of the actual steps a GroupWise administrator should go through when troubleshooting a GroupWise Abend.)
1. Abend Definition
Abnormal + End = AbEnd
On NT servers, the equivalent failures are usually called General Protection Faults (GPFs) or the Blue Screen of Death.
"In simplest terms, an Abend is caused either by a hardware failure or by misbehaving NLMs. In either case the result is usually corrupted memory."
"When an Abend message appears on the server console, either NetWare or the server CPU has detected a critical error condition (fault) and jumped into the NetWare's fault handler. This handler idles NetWare and displays the Abend message on the server console for immediate action by the server administrator....." (TID 2917538).
"The primary reason for Abends in NetWare is to ensure the stability and integrity of the internal operating system data. For example, if the operating system detected invalid pointers to cache buffers and yet continued to run, data would soon become unusable or corrupted. Thus an Abend is NetWare's way of protecting itself and users against the unpredictable effects of data corruption." (Resolving Critical Server Issues. Feb. 1995 Application Notes. Page 37.)
2. Abend Types
There are two main types: software-generated abends (i.e., generated by NetWare, GroupWise, or third-party code) and hardware-generated (CPU-generated) abends. The two types often call for different approaches to resolution.
CPU Detected Errors:
The processor detects a problem and interrupts program execution by issuing an exception. Intel defines an exception as a synchronous event which is the response of the processor to a certain condition detected during the execution of an instruction.
Exceptions are classified as faults, traps, or aborts based on how they are reported and whether restart of the failed instruction is possible.
Examples of CPU detected errors:
See the Intel Architecture manual for the full list of exceptions.
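As an illustration of the fault/trap/abort classification, the sketch below lists a handful of IA-32 exceptions with their vector numbers and classes as documented in the Intel architecture manuals. The table is deliberately partial, and the names `EXCEPTIONS` and `saved_eip_meaning` are made up for this example; they are not part of NetWare or GroupWise.

```python
# Partial table of IA-32 exceptions: vector -> (mnemonic, description, class).
# Only a few well-known entries are listed; see the Intel manual for the rest.
EXCEPTIONS = {
    0: ("#DE", "Divide Error", "fault"),
    3: ("#BP", "Breakpoint", "trap"),
    4: ("#OF", "Overflow", "trap"),
    6: ("#UD", "Invalid Opcode", "fault"),
    8: ("#DF", "Double Fault", "abort"),
    13: ("#GP", "General Protection", "fault"),
    14: ("#PF", "Page Fault", "fault"),
    18: ("#MC", "Machine Check", "abort"),
}


def saved_eip_meaning(vector):
    """Describe what the saved instruction pointer refers to for a vector."""
    _mnemonic, _description, kind = EXCEPTIONS[vector]
    if kind == "fault":
        # Reported before the instruction completes, so it can be restarted.
        return "the faulting instruction (restart is possible)"
    if kind == "trap":
        # Reported after the instruction completes.
        return "the instruction after the trapping one"
    # Aborts do not guarantee a precise location.
    return "not reliable (no restart)"
```

For example, `saved_eip_meaning(14)` describes a page fault, which is why the EIP value recorded for a page fault abend is a useful pointer to the offending code.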
Page Fault Exceptions:
Registers are preserved/restored
SET parameters for page fault emulation
Allows choice between continuing execution or ABENDing
Code Detected Errors:
Internal tests are placed throughout the NetWare code. These tests ensure the stability and integrity of internal operating system data. Numerous consistency checks are interlaced throughout NetWare to validate critical disk, memory, and communications processes. The ABEND errors that result from failed consistency checks are code-detected errors, as opposed to CPU-detected errors.
A failed consistency check is always a serious error because it indicates some degree of memory corruption.
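To make the idea of a consistency check concrete, here is a minimal sketch of the kind of test an operating system can run over a chain of buffers: every forward link must be mirrored by a matching back link. The `Buffer` class, `check_chain`, and `ConsistencyCheckFailed` are hypothetical names for this illustration and do not reflect NetWare's actual data structures.

```python
from dataclasses import dataclass
from typing import Optional


class ConsistencyCheckFailed(Exception):
    """Raised where a real OS would halt with a code-detected abend."""


@dataclass(eq=False)
class Buffer:
    ident: int
    next: Optional["Buffer"] = None
    prev: Optional["Buffer"] = None


def check_chain(head):
    """Walk the chain and verify that next/prev pointers agree.

    Returns the number of buffers visited, or raises on corruption.
    """
    seen = set()
    count = 0
    node = head
    while node is not None:
        if id(node) in seen:
            raise ConsistencyCheckFailed(f"buffer {node.ident}: chain loops")
        seen.add(id(node))
        if node.next is not None and node.next.prev is not node:
            raise ConsistencyCheckFailed(
                f"buffer {node.ident}: next.prev does not point back"
            )
        count += 1
        node = node.next
    return count


def build_chain(n):
    """Build a correctly linked chain of n buffers and return them as a list."""
    buffers = [Buffer(i) for i in range(n)]
    for left, right in zip(buffers, buffers[1:]):
        left.next, right.prev = right, left
    return buffers
```

Stopping at the first broken link, rather than continuing with suspect pointers, mirrors the reasoning quoted earlier: halting is safer than operating on corrupted data.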
Possible causes for consistency check ABENDS:
Corrupt operating system files
Corrupt or outdated drivers and NLMs
Bad packets formed at the client
Defective memory chips
Static electricity discharges
Faulty power supplies
Power surges or spikes
3. Intel Architecture
It is important to understand how the Intel CPU architecture relates to abends. A CPU 'runs' code using internal registers, which are small pieces of memory inside the CPU itself. For example, EAX, EBX, ECX, EIP, ESP, and EBP are all listed in abend.log files and represent different CPU registers. EIP, for example, is the instruction pointer: it tracks where in the code the CPU is executing. This is basic, but you need it to read abend.log files effectively, to establish whether or not there is a true pattern to the abends, and to identify the module that was running (the 'running process') when the abend occurred.
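When comparing several abends for a pattern, it can help to pull the register values out of each log programmatically. The following is a minimal sketch only: the `SAMPLE` text is an invented fragment, and its `NAME = value` layout is an assumption about how registers appear, not a copy of a real abend.log. Adjust the regular expression to match your own logs.

```python
import re

# Invented sample; the layout is assumed for illustration only.
SAMPLE = """\
EAX = 00000000 EBX = 03A1F2C0 ECX = 00000010
EIP = 03001234 ESP = 0412FF80 EBP = 0412FFA0
"""

# A 32-bit register name (E + two letters) followed by eight hex digits.
REGISTER_RE = re.compile(r"\b(E[A-Z]{2})\s*=\s*([0-9A-Fa-f]{8})\b")


def parse_registers(text):
    """Return a dict mapping register names to integer values."""
    return {name: int(value, 16) for name, value in REGISTER_RE.findall(text)}


def same_eip(log_texts):
    """True if every log records the same EIP, which hints at a pattern."""
    eips = {parse_registers(text).get("EIP") for text in log_texts}
    return len(eips) == 1 and None not in eips
```

Identical EIP values across separate abends suggest the same code path is failing each time, which is the kind of pattern this section describes.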
For additional information get a copy of the Intel Architecture manual by contacting Intel.
4. The Stack
The stack is a history of where memory had been prior to the server being halted due to an ABEND or user intervention. We concentrate on Returns to trace through and map the stack. Each time a function is called, the return address of the code calling that function is pushed onto the stack and stays there until a 'RET' is issued from the new function. Therefore the stack contains a series of returns traceable back to the beginning of the stack. Between the returns are simply data values that are placed onto the stack to be used by different functions. (Stack Primer Advanced Debugging Training Course, Novell, page 2)
This matters because an abend.log file shows a portion of the stack of the process that was running when the abend occurred. A stack is a process's 'scratch pad': it is where return addresses for function calls, local variables, and function parameters are stored. From the stack information in an abend.log file, we can readily see which functions the process called and in what order it called them. Knowing the code path a process took before it abended is essential to identifying a cause. It also helps, when looking at multiple abends, to tell whether specific abends follow a pattern and therefore whether there is a reason to get a core dump.
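The idea of picking the returns out of a stack dump can be sketched as follows: treat any stack value that falls inside a loaded module's code range as a likely return address, and report it as module plus offset. The module names, address ranges, and the `find_returns` helper are all hypothetical values chosen for this example. A data value that happens to fall inside a code range would be misreported, so treat the output as a hint, not proof.

```python
# Hypothetical load map: (module name, code start, code end).
MODULES = [
    ("SERVER.NLM", 0x00100000, 0x00200000),
    ("GWPOA.NLM", 0x03000000, 0x03400000),
]


def find_returns(stack_words, modules=MODULES):
    """Return (stack index, module, offset) for values inside code ranges.

    stack_words is listed from the top of the stack downward, so the
    first hit is the most recent call.
    """
    hits = []
    for index, word in enumerate(stack_words):
        for name, start, end in modules:
            if start <= word < end:
                hits.append((index, name, word - start))
                break
    return hits


# Invented stack contents: data values mixed with two return addresses.
EXAMPLE_STACK = [0x00000005, 0x03001234, 0x0000FFFF, 0x00150000]
```

Running `find_returns(EXAMPLE_STACK)` skips the two data values and reports one return into the hypothetical GWPOA.NLM range and one into SERVER.NLM, which is the code path read from most recent to oldest.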
For additional information get a copy of the Intel Architecture manual by contacting Intel.
5. GroupWise Abend Examples
Two specific types of GroupWise abends are covered here: page fault abends and CPU hog abends.
A page fault abend is a CPU-generated abend. The memory paging features of the Intel CPU detect situations where a process tries to access a memory address that is either invalid (for example, an address at the 150MB mark on a server that has only 128MB of RAM) or illegal (the process is trying to read from a memory location that doesn't 'belong' to it).
CPU Hog abends (we have seen these with QuickFinder processes) are software-generated abends, generated by the OS. They occur when a process keeps control of the CPU for longer than the time set by the SET CPU HOG TIMEOUT AMOUNT parameter. Sometimes these abend situations can be improved by (carefully) increasing this SET parameter. Because the NetWare OS is still pseudo non-preemptive, this parameter is needed to prevent a process from completely hogging the CPU and keeping other processes from running.
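The two conditions described above can be modelled with a pair of simple checks. This is a toy model under stated assumptions, not how the CPU or NetWare implements them: `Region`, `classify_access`, and `is_cpu_hog` are hypothetical names, memory ownership is reduced to a list of regions, and an address that no one owns is treated as illegal.

```python
from dataclasses import dataclass

MB = 1024 * 1024
RAM_BYTES = 128 * MB  # the 128MB server from the example in the text


@dataclass
class Region:
    start: int
    size: int
    owner: str


def classify_access(address, process, regions, ram_bytes=RAM_BYTES):
    """Return 'invalid', 'illegal', or 'ok' for a memory access."""
    if address < 0 or address >= ram_bytes:
        return "invalid"  # no such memory on this server
    for region in regions:
        if region.start <= address < region.start + region.size:
            return "ok" if region.owner == process else "illegal"
    return "illegal"  # assumption: unowned memory counts as illegal


def is_cpu_hog(seconds_without_yield, timeout_seconds):
    """True if a process held the CPU longer than the configured timeout."""
    return seconds_without_yield > timeout_seconds


# Hypothetical ownership map for the examples.
REGIONS = [Region(0, 16 * MB, "OS"), Region(16 * MB, 32 * MB, "POA")]
```

With this map, an access by the POA at 150MB is classified as invalid, and an access by the POA at 1MB (owned by the OS in this model) is illegal. Raising the timeout passed to `is_cpu_hog` makes the same run of uninterrupted work acceptable, which is the effect of increasing the SET parameter.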
GroupWise Abend Examples fixed in GroupWise 5.5 SP2:
- CPU Hog Timeout was detected. The offending process is the QuickFinder Indexing thread of the POA (a process 'hogs' the CPU and disallows other processes from running)
- LLFree called with a memory block that has a null resource tag
- POA abend during QuickFinder indexing
- Page Fault Processor Exception abend in the POA TCP_Handler process
- Double Fault Processor Exception abend in the GWPOA-Worker Process
- Page Fault Processor Exception abend in the GWPOA-Worker Process
- POA abend while performing an Address Book conversion on addresses with single quotes
- Abend caused by high number of nested attachments
- CPU Hog Detected by Timer abend in GWPOA-Worker Process
- CPU Hog Detected by Timer abend in POA TCP_Handler Process
- CPU Hog Detected by Timer abend in POA MTPListen Process
- Fix to prevent CPU Hog Detected by Timer abend in POA GWPOA-QF Indexer Process
- LLFree abend in ADA
Explanation of some of the Abend Examples listed above:
Abend: CPU Hog Timeout was detected. The offending process is the QuickFinder Indexing thread of the POA
Explanation: A process 'hogs' the CPU and disallows other processes from running
Abend: LLFree called with a memory block that has a null resource tag
Explanation: NW 4.11: When you allocate memory for your process and the OS gives it to you, the OS labels the memory (this label is called a Resource Tag). Every piece of memory for that process has the same resource tag. Part of cleaning up is to null the resource tag. The problem occurs when the process that owns the piece of memory right 'before' it nulls out the resource tag (i.e., steps on it), which is illegal, and hence the server abends. NW 5: has an improved process. NW 5 lets you load a process NLM in debug mode, which lets you load an NLM with a blank page on it. This helps us catch the process that causes the abends.
Abend: Page Fault Processor Exception abend in the POA TCP_Handler process
Explanation: The TCP handler is an element of the POA process. In client/server mode the POA has a TCP handler which handles inbound/outbound TCP processes. The abend occurs when it references a piece of memory that does not make sense (non-existent memory, or memory that does not belong to it).
Abend: Double Fault Processor Exception abend in the GWPOA-Worker Process
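Explanation: A double fault. The CPU raises this exception when a second exception occurs while it is trying to invoke the handler for an earlier one.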
Abend: Page Fault Processor Exception abend in the GWPOA-Worker Process
Explanation: GPF in the GWPOA-Worker Process
Abend: LLFree abend in ADA
Explanation: LLFree is a low-level OS routine. The application calls the CLIB Free function and CLIB calls LLFree, which is the routine that tries to free the memory.
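The null resource tag explanation above can be illustrated with a toy heap in which every block is preceded by a one-cell header holding its tag. Writing one cell past the end of a block lands on the next block's header, and the later free of that neighbour then fails. `TaggedHeap` and `AbendError` are hypothetical names, and the layout is an assumption made for the illustration; NetWare's real allocator is more involved.

```python
class AbendError(Exception):
    """Stands in for the server abend in this toy model."""


class TaggedHeap:
    """Toy heap: each block is [tag header][payload cells...]."""

    def __init__(self, size):
        self.cells = [0] * size
        self.next_free = 0

    def alloc(self, tag, length):
        """Reserve a block and return the address of its payload."""
        if tag == 0:
            raise ValueError("tag 0 is reserved to mean 'null'")
        if self.next_free + 1 + length > len(self.cells):
            raise MemoryError("toy heap exhausted")
        header = self.next_free
        self.cells[header] = tag
        self.next_free = header + 1 + length
        return header + 1

    def write(self, address, values):
        """Raw write with no bounds check, like a stray pointer write."""
        for offset, value in enumerate(values):
            self.cells[address + offset] = value

    def free(self, address):
        """Free a block, refusing one whose tag has been nulled."""
        header = address - 1
        if self.cells[header] == 0:
            raise AbendError(
                "free called with a memory block that has a null resource tag"
            )
        self.cells[header] = 0  # nulling the tag is part of cleanup
```

If one block of four cells is allocated, then a second, and five zeros are written to the first, the fifth write nulls the second block's tag. The abend is then reported when the second block is freed, even though the first block's owner caused the damage, which is why these abends are hard to trace without the debug facilities mentioned for NW 5.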
Note: All of the above abends have been addressed in GroupWise 5.5 SP2 (see the g552en.exe readme.txt).
For details or updates on this tip, see TID-10021982.
Novell Cool Solutions (corporate web communities) are produced by WebWise Solutions. www.webwiseone.com