Obviously, not being an OS engineer, I don’t know it all, but we get enough questions along the lines of “hey, that FTF says it fixed ‘THE’ GWIA abend, but mine is still abending” that I thought I should share with you what we look for in an abend.log.
Yeah, abend.logs are not real easy reading and they normally don’t contain enough information for us to write a code fix (though I have seen it done, and I am in awe of that) but they can be very useful nonetheless. There are a few things that we look for to get us closer to the source of the problem, or to identify if we have seen the problem before:
The first part of the log:
Server AEVANSOES halted Monday, February 27, 2006 3:14:34.434 pm
Abend 1 on P00: Server-5.70.04: Page Fault Processor Exception (Error code 00000000)
Important parts here:
Abend 1 – we only care about Abend 1 as any subsequent abends can be caused by the first abend. So, if yours is Abend 8 it is probably not worth reporting to us.
The time – maybe you can correlate this with something in the logs, or some process that runs at that time
The abend type – Page Fault Processor Exception in this case, which means that it is a hardware-detected abend and is a pointer (though not proof) to a bug.
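If you wanted to script that first check, here is a minimal Python sketch that pulls the abend number and type out of the header line. The regular expression is only an assumption based on the one header shown above, and the function names are made up for this illustration; other server versions may format the line differently.

```python
import re

# Pattern assumed from the single header line shown in this post.
HEADER_RE = re.compile(
    r"Abend (?P<number>\d+) on (?P<cpu>P\d+): "
    r"(?P<server>Server-[\d.]+): "
    r"(?P<kind>.+?) \(Error code (?P<code>[0-9A-Fa-f]+)\)"
)

def parse_abend_header(line):
    """Return the fields of an 'Abend N on Pxx: ...' line, or None."""
    match = HEADER_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["number"] = int(fields["number"])
    return fields

def worth_reporting(fields):
    """Only Abend 1 is interesting; later abends may just be fallout."""
    return fields is not None and fields["number"] == 1

header = parse_abend_header(
    "Abend 1 on P00: Server-5.70.04: Page Fault Processor Exception "
    "(Error code 00000000)"
)
print(header["kind"], worth_reporting(header))
```

For the header above this prints the abend type followed by True; feed it an "Abend 8" line and worth_reporting comes back False.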
CS = 0060 DS = 0068 ES = 0068 FS = 007B GS = 007B SS = 0068
EAX = 4E53D6A0 EBX = 4E530120 ECX = 4E530120 EDX = 00000001
ESI = 00000000 EDI = 00000000 EBP = 4E74B684 ESP = 4E74B670
EIP = 61B11B4B FLAGS = 00010202
61B11B4B 837E6900 CMP [ESI+69]=?, 00000000
EIP in GWIA.NLM at code start +0002EB4Bh
Access Location: 0x00000069
This is what is stored in all the CPU registers at the time of the abend. EIP (the Extended Instruction Pointer) is the point at which we abended. The value of EIP can be different on different servers; however, the actual instruction should be the same, e.g. CMP [ESI+69]=?, 00000000, and the offset from code start should be the same also, if the exact same module is loaded on the servers. What’s a code start? It is the address in memory where that module’s code begins, and the number after it is the hex offset of the instruction that we abended on, counted from the beginning of that module. In the above example it is in GWIA.NLM at +0002EB4B. It’s important to compare the exact same module versions/dates because, as we make changes in the code, the offset for the same line of code can move – if you imagine that we add 10 lines of code somewhere earlier in the code for something else, then the point at which we abend moves 10 lines further down, or old offset + 10.
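The +0002EB4B figure can be reproduced by subtracting the module’s Code Address (listed for GWIA.NLM in the loaded modules section of this same log) from EIP. A small Python sketch of that arithmetic follows; the function name and the second server’s load address are hypothetical, there only to show why EIP differs between servers while the offset does not.

```python
def offset_from_code_start(eip, code_address):
    """Distance of the faulting instruction from the start of the module's code."""
    return eip - code_address

eip = 0x61B11B4B           # EIP from the register dump above
code_address = 0x61AE3000  # "Code Address" for GWIA.NLM in the loaded modules list

offset = offset_from_code_start(eip, code_address)
print(f"+{offset:08X}h")   # the value to search on

# Hypothetical second server: same module build loaded at a different address.
other_code_address = 0x70000000          # made-up load address
other_eip = other_code_address + offset  # EIP changes...
print(f"{other_eip:08X}", offset_from_code_start(other_eip, other_code_address) == offset)
```

The first line printed is +0002EB4Bh, matching the "EIP in GWIA.NLM at code start" line in the log.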
So far, if I was looking for an existing TID or an existing bug, I would be searching on abend, page fault, gwia, and 0002EB4B (sometimes you need to include the leading + and/or the trailing h).
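As a rough illustration, the search terms can be built straight from the "EIP in ..." line. This is just a sketch: the regular expression assumes the line looks exactly like the one in this log, and search_terms is a name invented for the example.

```python
import re

# Assumes the "EIP in MODULE at code start +OFFSETh" wording shown above.
EIP_RE = re.compile(
    r"EIP in (?P<module>\S+) at code start \+(?P<offset>[0-9A-Fa-f]+)h"
)

def search_terms(eip_line, abend_type):
    """Build keywords to try against TIDs or bug lists."""
    match = EIP_RE.search(eip_line)
    if match is None:
        return []
    module = match.group("module").rsplit(".", 1)[0].lower()
    offset = match.group("offset").upper()
    # Try the offset bare, then with the leading + and/or trailing h.
    return ["abend", abend_type.lower(), module,
            offset, f"+{offset}", f"{offset}h", f"+{offset}h"]

terms = search_terms("EIP in GWIA.NLM at code start +0002EB4Bh", "Page Fault")
print(terms)
```

The list starts with the four terms mentioned above and ends with the offset variants, so you can work through them until one gets a hit.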
This is going to be a long post
The violation occurred while processing the following instruction:
61B11B4B 837E6900 CMP [ESI+69], 00000000 (this is where we abended; the instructions that follow should match too if it’s the same abend)
61B11B4F 7429 JZ 61B11B7A
61B11B58 6A0D PUSH 0D
61B11B5A E831700800 CALL GWIA.NLM|MMSSubmitCommand
61B11B5F 59 POP ECX
61B11B60 31FF XOR EDI, EDI
61B11B62 EB0F JMP 61B11B73
61B11B64 837E6D00 CMP [ESI+6D], 00000000
61B11B68 7417 JZ 61B11B81
Next comes the ‘stack’ :
Running process: GWIA-Main Process (this is the name of the thread that abended; it should match if the abend is the same)
Thread Owned by NLM: GWIA.NLM
Stack pointer: 4E74B2BC
OS Stack limit: 4E743A60
Scheduling priority: 67371008
Wait state: 3030070 Yielded CPU
Stack: --4E53DB44 ?
61AE73F9 (GWIA.NLM|GweMainForNLM+1CB) (this bit is complicated to explain – pop to the bottom of the stack for the rest)
--4E530120 ?
--BF4C0750 (THREADS.NLM|(Data Start)+2750)
Everywhere that you see (MODULE.NLM|FunctionName+###) is a place where the value in memory matches a point in code. Let me expand: everything stored in memory is either code or data. If we start at memory address 0 and load a module that is 100Kb then (and I am oversimplifying this) memory addresses 0 through 99 are occupied by this module, and the OS tracks this. This is code space.
As a program executes, it writes the data it needs, and the code addresses of functions (e.g. 0 to 99 as above), onto the stack. This is data space. When we abend we write out the data part of memory as the stack like above, and the abend.log tries to help by telling you when it finds a value that matches an address where it knows code is stored in memory (0 to 99 in my example). The problem is that it’s not always accurate, as the value stored may actually be data that just happens to match a code address.
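To make that concrete, here is a simplified Python sketch of the matching. It is not how the OS actually does it: the real abend.log also resolves function names such as GweMainForNLM from symbol information, which this sketch does not have, so it only reports an offset from the module’s code start. The module range comes from the GWIA.NLM entry in the loaded modules list; the function name is invented for the example.

```python
MODULES = [
    # (name, code address, code length) taken from the loaded modules list
    ("GWIA.NLM", 0x61AE3000, 0x002007EA),
]

def annotate(value, modules=MODULES):
    """Label a stack value if it falls inside a known code range."""
    for name, start, length in modules:
        if start <= value < start + length:
            # Could be a real return address, or data that merely
            # happens to look like one: the range check cannot tell.
            return f"{value:08X} ({name}|code start+{value - start:X})"
    return f"{value:08X} ?"

for stack_value in (0x4E53DB44, 0x61AE73F9, 0x4E530120):
    print(annotate(stack_value))
```

Only 61AE73F9 lands inside GWIA.NLM’s code range, so it is the only value that gets a label; the other two are printed with a question mark, just as in the log.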
At this point, if I was searching for TIDs or bugs I would possibly also be searching on some of the function names above, as they can get you to a relevant hit quicker – though the rest of the abend should match somewhat closely too.
And now the last bits:
The CPU encountered a problem executing code in GWIA.NLM. The problem may be in that module or in data passed to that module by a process owned by GWIA.NLM.
This is the module that abended and what passed the data to that module. This one was definitely a GWIA abend.
GWIA.NLM GroupWise Internet Agent (Beta release version)
Version 7.00.01 February 8, 2006
Code Address: 61AE3000h Length: 002007EAh
Data Address: 5024C000h Length: 00062B03h
The loaded modules section tells us two things – the version and build date of the modules, and the order in which they were loaded, with the most recent at the top of the list and going backwards. On my server the last module loaded was GWIA.NLM and it was abending on startup – I don’t remember the specific abend, but I know it’s on startup because the function names on the stack (NgwThrdCreate, TcoNewSystemThreadEntryPoint and RegisterToIPMgmt) are all things that a module does on startup.
If you are experiencing an abend that you can’t find anything about elsewhere, then what we are going to need is a coredump. Another pointer: if you look in your abend.log and the abends are all over the place, then it is often a sign of a corrupt memory module. And, as you can see, ‘THE’ GWIA abend doesn’t really cut the mustard as a problem description.
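One rough way to judge “all over the place” is to list the module and offset of every Abend 1 in the log and see whether any one location dominates. The Python sketch below does that; the 0.5 cut-off is an arbitrary assumption, the function name is invented, and the sample lists are hypothetical data made up for illustration, not taken from a real server.

```python
from collections import Counter

def looks_scattered(locations, threshold=0.5):
    """True when no single (module, offset) accounts for more than
    `threshold` of the abends. The cut-off is an arbitrary assumption."""
    if not locations:
        return False
    _, top_count = Counter(locations).most_common(1)[0]
    return top_count / len(locations) <= threshold

# Hypothetical data for illustration only.
same_bug = [("GWIA.NLM", 0x2EB4B)] * 4
scattered = [
    ("GWIA.NLM", 0x2EB4B),
    ("SERVER.NLM", 0x1000),  # made-up offsets
    ("NSS.NLM", 0x2000),
    ("TCPIP.NLM", 0x3000),
]

print(looks_scattered(same_bug), looks_scattered(scattered))
```

Four abends at the same offset look like one bug; four abends in four different places are the pattern that points away from a single code defect.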